Grouping data into buckets
Prepare the data
- Create the tvs index
curl PUT ip:port/tvs
{
"mappings": {
"properties": {
"price": {
"type": "long"
},
"color": {
"type": "keyword"
},
"brand": {
"type": "keyword"
},
"sold_date": {
"type": "date"
}
}
}
}
- Index some test data (the _bulk API requires newline-delimited JSON: one action line followed by one document line)
curl POST ip:port/tvs/_bulk
{ "index": {} }
{ "price": 1000, "color": "红色", "brand": "长虹", "sold_date": "2016-10-28" }
{ "index": {} }
{ "price": 2000, "color": "红色", "brand": "长虹", "sold_date": "2016-11-05" }
{ "index": {} }
{ "price": 3000, "color": "绿色", "brand": "小米", "sold_date": "2016-05-18" }
{ "index": {} }
{ "price": 1500, "color": "蓝色", "brand": "TCL", "sold_date": "2016-07-02" }
{ "index": {} }
{ "price": 1200, "color": "绿色", "brand": "TCL", "sold_date": "2016-08-19" }
{ "index": {} }
{ "price": 2000, "color": "红色", "brand": "长虹", "sold_date": "2016-11-05" }
{ "index": {} }
{ "price": 8000, "color": "红色", "brand": "三星", "sold_date": "2017-01-01" }
{ "index": {} }
{ "price": 2500, "color": "蓝色", "brand": "小米", "sold_date": "2017-02-12" }
1. Basic metric functions
A metric is an aggregation operation executed against a bucket, such as count, avg, max, min, or sum.
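As a minimal sketch (the aggregation name overall_avg_price is just an illustrative choice), a metric can also run on its own, without any bucketing; here avg is applied across the whole index:
curl GET ip:port/tvs/_search
{
  "size": 0,
  "aggs": {
    "overall_avg_price": {
      "avg": {
        "field": "price"
      }
    }
  }
}
With the test data above this should return a value of 2650.0, the same figure that the global-bucket example at the end of this section reports for all_brand_avg_price.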
Counting documents per group
Find which TV color sells the most (count the documents per color):
curl GET ip:port/tvs/_search
{
"size" : 0,
"aggs" : {
"popular_colors" : {
"terms" : {
"field" : "color"
}
}
}
}
Request parameters
- size: return only the aggregation result, not the original documents the aggregation ran over;
- aggs: fixed syntax indicating that a bucket aggregation should be executed on the data;
- popular_colors: the name of this aggregation, chosen by the user;
- terms: group documents by the value of a field;
- field: the field to group by.
Response
{
"took" : 7,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"popular_colors" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "红色",
"doc_count" : 4
},
{
"key" : "绿色",
"doc_count" : 2
},
{
"key" : "蓝色",
"doc_count" : 2
}
]
}
}
}
Response fields
- hits.hits: empty because we set size=0; otherwise the original documents that the aggregation ran over would also be returned.
- aggregations: the aggregation results.
- popular_colors: the aggregation name we defined in the request.
- buckets: the buckets created from the field we grouped on.
- key: the field value for this bucket.
- doc_count: the number of documents in this bucket.
Counting documents per group is not really a metric operation; it is the default behavior Elasticsearch attaches to every bucket, implemented here with the terms aggregation. The sketch below shows two optional terms parameters.
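A hedged sketch of those parameters: size limits how many buckets are returned, and order controls how they are sorted (by document count here, which is also the default):
curl GET ip:port/tvs/_search
{
  "size": 0,
  "aggs": {
    "popular_colors": {
      "terms": {
        "field": "color",
        "size": 2,
        "order": { "_count": "desc" }
      }
    }
  }
}
With the test data this would keep only the 红色 bucket (4 documents) and one of the two-document buckets; the remaining documents are reported in sum_other_doc_count.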
Averaging per group
Compute the average price for each TV color:
curl GET ip:port/tvs/_search
{
"size" : 0,
"aggs": {
"colors": {
"terms": {
"field": "color"
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
The nested aggs block sits at the same level as terms and runs the metric once for each bucket.
Response
{
...
"aggregations" : {
"colors" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "红色",
"doc_count" : 4,
"avg_price" : {
"value" : 3250.0
}
},
{
"key" : "绿色",
"doc_count" : 2,
"avg_price" : {
"value" : 2100.0
}
},
{
"key" : "蓝色",
"doc_count" : 2,
"avg_price" : {
"value" : 2000.0
}
}
]
}
}
}
The value under avg_price is the metric result: the average of the price field over all documents in that bucket.
Drill-down analysis
Split each bucket into sub-buckets, then run the metric on each of the smallest groups. For example: group TVs by color, then group each color by brand and compute the average price per brand.
curl GET ip:port/tvs/_search
{
"size": 0,
"aggs": {
"group_by_color": {
"terms": {
"field": "color"
},
"aggs": {
"color_avg_price": {
"avg": {
"field": "price"
}
},
"group_by_brand": {
"terms": {
"field": "brand"
},
"aggs": {
"brand_avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
}
}
The nested group_by_brand splits each color bucket by the brand field and computes the average price per brand.
{
...
"aggregations" : {
"group_by_color" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "红色",
"doc_count" : 4,
"color_avg_price" : {
"value" : 3250.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "长虹",
"doc_count" : 3,
"brand_avg_price" : {
"value" : 1666.6666666666667
}
},
{
"key" : "三星",
"doc_count" : 1,
"brand_avg_price" : {
"value" : 8000.0
}
}
]
}
},
{
"key" : "绿色",
"doc_count" : 2,
"color_avg_price" : {
"value" : 2100.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "TCL",
"doc_count" : 1,
"brand_avg_price" : {
"value" : 1200.0
}
},
{
"key" : "小米",
"doc_count" : 1,
"brand_avg_price" : {
"value" : 3000.0
}
}
]
}
},
{
"key" : "蓝色",
"doc_count" : 2,
"color_avg_price" : {
"value" : 2000.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "TCL",
"doc_count" : 1,
"brand_avg_price" : {
"value" : 1500.0
}
},
{
"key" : "小米",
"doc_count" : 1,
"brand_avg_price" : {
"value" : 2500.0
}
}
]
}
}
]
}
}
}
Min and max
Compute the highest and lowest price for each TV color:
curl GET ip:port/tvs/_search
{
"size" : 0,
"aggs": {
"colors": {
"terms": {
"field": "color"
},
"aggs": {
"min_price" : { "min": { "field": "price"} },
"max_price" : { "max": { "field": "price"} }
}
}
}
}
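The request above returns the extremes per color. As an alternative sketch, the stats aggregation returns count, min, max, avg, and sum for each bucket in a single metric (price_stats is an illustrative name):
curl GET ip:port/tvs/_search
{
  "size": 0,
  "aggs": {
    "colors": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "price_stats": {
          "stats": {
            "field": "price"
          }
        }
      }
    }
  }
}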
2. Interval bucketing (histogram)
The histogram keyword buckets documents by value ranges of a numeric field; if the field to group on is a date, use the date_histogram keyword instead. It takes a field and an interval, and places each document into the bucket whose range covers its field value:
curl GET ip:port/tvs/_search
{
"size" : 0,
"aggs":{
"price":{
"histogram":{
"field": "price",
"interval": 2000
}
}
}
}
The request above buckets the price field with an interval of 2000; the key of each returned bucket is the lower bound of its range:
{
...
"aggregations" : {
"price" : {
"buckets" : [
{
"key" : 0.0,
"doc_count" : 3
},
{
"key" : 2000.0,
"doc_count" : 4
},
{
"key" : 4000.0,
"doc_count" : 0
},
{
"key" : 6000.0,
"doc_count" : 0
},
{
"key" : 8000.0,
"doc_count" : 1
}
]
}
}
}
Once the documents are bucketed by range, we can run a metric on each bucket, for example a sum:
curl GET ip:port/tvs/_search
{
"size" : 0,
"aggs":{
"price":{
"histogram":{
"field": "price",
"interval": 2000
},
"aggs":{
"revenue": {
"sum": {
"field" : "price"
}
}
}
}
}
}
2.1. date_histogram
When the field to bucket on is of type date, use the date_histogram keyword, for example:
curl GET ip:port/tvs/_search
{
"size" : 0,
"aggs": {
"sales": {
"date_histogram": {
"field": "sold_date",
"interval": "month",
"format": "yyyy-MM-dd",
"min_doc_count" : 0,
"extended_bounds" : {
"min" : "2016-01-01",
"max" : "2017-12-31"
}
}
}
}
}
Request parameters
- min_doc_count: a date bucket is only returned if it contains at least this many documents; with 0, empty buckets are kept.
- extended_bounds: forces the buckets to cover exactly this start and end date, even where there is no data.
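Note that the interval parameter used above is the older syntax; since Elasticsearch 7.2 it is deprecated in favor of calendar_interval (calendar-aware units such as month or quarter) and fixed_interval (fixed lengths such as 30d). A hedged equivalent of the previous request on a recent cluster would be:
curl GET ip:port/tvs/_search
{
  "size": 0,
  "aggs": {
    "sales": {
      "date_histogram": {
        "field": "sold_date",
        "calendar_interval": "month",
        "format": "yyyy-MM-dd",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2016-01-01",
          "max": "2017-12-31"
        }
      }
    }
  }
}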
Compute the quarterly TV sales revenue for each brand:
curl GET ip:port/tvs/_search
{
"size": 0,
"aggs": {
"group_by_sold_date": {
"date_histogram": {
"field": "sold_date",
"interval": "quarter",
"format": "yyyy-MM-dd",
"min_doc_count": 0,
"extended_bounds": {
"min": "2016-01-01",
"max": "2017-12-31"
}
},
"aggs": {
"total_sum_price": {
"sum": {
"field": "price"
}
},
"group_by_brand": {
"terms": {
"field": "brand"
},
"aggs": {
"sum_price": {
"sum": {
"field": "price"
}
}
}
}
}
}
}
}
First bucket by date, then drill down into each date bucket and group by brand, and finally run a sum metric on every sub-bucket. The result:
{
...
"aggregations" : {
"group_by_sold_date" : {
"buckets" : [
{
"key_as_string" : "2016-01-01",
"key" : 1451606400000,
"doc_count" : 0,
"total_sum_price" : {
"value" : 0.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
},
{
"key_as_string" : "2016-04-01",
"key" : 1459468800000,
"doc_count" : 1,
"total_sum_price" : {
"value" : 3000.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "小米",
"doc_count" : 1,
"sum_price" : {
"value" : 3000.0
}
}
]
}
},
{
"key_as_string" : "2016-07-01",
"key" : 1467331200000,
"doc_count" : 2,
"total_sum_price" : {
"value" : 2700.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "TCL",
"doc_count" : 2,
"sum_price" : {
"value" : 2700.0
}
}
]
}
},
{
"key_as_string" : "2016-10-01",
"key" : 1475280000000,
"doc_count" : 3,
"total_sum_price" : {
"value" : 5000.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "长虹",
"doc_count" : 3,
"sum_price" : {
"value" : 5000.0
}
}
]
}
},
{
"key_as_string" : "2017-01-01",
"key" : 1483228800000,
"doc_count" : 2,
"total_sum_price" : {
"value" : 10500.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "三星",
"doc_count" : 1,
"sum_price" : {
"value" : 8000.0
}
},
{
"key" : "小米",
"doc_count" : 1,
"sum_price" : {
"value" : 2500.0
}
}
]
}
},
{
"key_as_string" : "2017-04-01",
"key" : 1491004800000,
"doc_count" : 0,
"total_sum_price" : {
"value" : 0.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
},
{
"key_as_string" : "2017-07-01",
"key" : 1498867200000,
"doc_count" : 0,
"total_sum_price" : {
"value" : 0.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
},
{
"key_as_string" : "2017-10-01",
"key" : 1506816000000,
"doc_count" : 0,
"total_sum_price" : {
"value" : 0.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
]
}
}
}
3. Aggregation scope
The aggregation scope limits the set of documents an aggregation runs over; it can be combined with a query or a filter.
Combining aggregations with full-text search
Every aggregation in Elasticsearch executes within a scope; when combined with an ordinary search request, that scope is the set of documents matched by the query.
Count the sales of each color for a given brand:
curl GET ip:port/tvs/_search
{
"size": 0,
"query": {
"term": {
"brand": {
"value": "小米"
}
}
},
"aggs": {
"group_by_color": {
"terms": {
"field": "color"
}
}
}
}
Response
{
"took" : 34,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_color" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "绿色",
"doc_count" : 1
},
{
"key" : "蓝色",
"doc_count" : 1
}
]
}
}
}
Combining aggregations with a filter
Compute the average price of all TVs priced at 1200 or more:
curl GET ip:port/tvs/_search
{
"size": 0,
"query": {
"constant_score": {
"filter": {
"range": {
"price": {
"gte": 1200
}
}
}
}
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
}
}
}
To apply a finer-grained filter to an individual bucket, use a filter aggregation inside aggs. For example, compute the average price of 长虹 TVs sold in the last month, the last three months, and the last six months (note that in Elasticsearch date math M means months while m means minutes):
curl GET ip:port/tvs/_search
{
"size": 0,
"query": {
"term": {
"brand": {
"value": "长虹"
}
}
},
"aggs": {
"recent_1m": {
"filter": {
"range": {
"sold_date": {
"gte": "now-1M"
}
}
},
"aggs": {
"recent_1m_avg_price": {
"avg": {
"field": "price"
}
}
}
},
"recent_3m": {
"filter": {
"range": {
"sold_date": {
"gte": "now-3M"
}
}
},
"aggs": {
"recent_3m_avg_price": {
"avg": {
"field": "price"
}
}
}
},
"recent_6m": {
"filter": {
"range": {
"sold_date": {
"gte": "now-6M"
}
}
},
"aggs": {
"recent_6m_avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
4. Global bucket
Sometimes a single aggregation request needs to return two results; a global bucket is used for this:
- the aggregation result within the specified scope;
- the aggregation result over all documents, unrestricted by the scope.
Compare the average selling price of 长虹 TVs with the average selling price across all brands:
curl GET ip:port/tvs/_search
{
"size": 0,
"query": {
"term": {
"brand": {
"value": "长虹"
}
}
},
"aggs": {
"single_brand_avg_price": {
"avg": {
"field": "price"
}
},
"all": {
"global": {},
"aggs": {
"all_brand_avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
In the request above, the query limits the scope: single_brand_avg_price is computed over the documents in that scope, while the global keyword resets the scope of the all aggregation to every document in the index.
Response
{
"took" : 35,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"all" : {
"doc_count" : 8,
"all_brand_avg_price" : {
"value" : 2650.0
}
},
"single_brand_avg_price" : {
"value" : 1666.6666666666667
}
}
}
In general, some metric operations are easy to execute in parallel across shards, such as max, min, and avg; the coordinating node only needs a simple final computation after collecting the per-shard results:
- the coordinating node broadcasts the request to every shard;
- each shard computes the local maximum of the field and returns it to the coordinating node;
- the coordinating node picks the largest of the values returned by the shards, which is the final maximum.
Algorithms of this kind scale out linearly as machines are added: they need no coordination between nodes (the machines never have to exchange intermediate results) and use almost no memory (a single number is enough to hold the maximum).
Other algorithms are hard to parallelize, for example count(distinct). It is not enough for each shard to return its own distinct values, because the coordinating node would then have to merge and deduplicate all of them in memory, which becomes very slow when the data volume is large.
To keep performance high, Elasticsearch therefore uses approximate algorithms: by accepting a small estimation error, they deliver results that are accurate but not 100% exact in exchange for fast execution and very low memory consumption.
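count(distinct) is exactly what the cardinality aggregation provides: an approximate distinct count based on the HyperLogLog++ algorithm, where precision_threshold controls up to how many distinct values the count stays near-exact, at the cost of more memory. A hedged sketch counting how many different brands were sold (distinct_brands is an illustrative name):
curl GET ip:port/tvs/_search
{
  "size": 0,
  "aggs": {
    "distinct_brands": {
      "cardinality": {
        "field": "brand",
        "precision_threshold": 100
      }
    }
  }
}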