Grouping data into buckets
Prepare the data
- Create the tvs index
curl PUT ip:port/tvs
{
"mappings": {
"properties": {
"price": {
"type": "long"
},
"color": {
"type": "keyword"
},
"brand": {
"type": "keyword"
},
"sold_date": {
"type": "date"
}
}
}
}
- Index some test data (the _bulk API requires newline-delimited JSON: one action line followed by one document line)
curl POST ip:port/tvs/_bulk
{ "index": {} }
{ "price": 1000, "color": "红色", "brand": "长虹", "sold_date": "2016-10-28" }
{ "index": {} }
{ "price": 2000, "color": "红色", "brand": "长虹", "sold_date": "2016-11-05" }
{ "index": {} }
{ "price": 3000, "color": "绿色", "brand": "小米", "sold_date": "2016-05-18" }
{ "index": {} }
{ "price": 1500, "color": "蓝色", "brand": "TCL", "sold_date": "2016-07-02" }
{ "index": {} }
{ "price": 1200, "color": "绿色", "brand": "TCL", "sold_date": "2016-08-19" }
{ "index": {} }
{ "price": 2000, "color": "红色", "brand": "长虹", "sold_date": "2016-11-05" }
{ "index": {} }
{ "price": 8000, "color": "红色", "brand": "三星", "sold_date": "2017-01-01" }
{ "index": {} }
{ "price": 2500, "color": "蓝色", "brand": "小米", "sold_date": "2017-02-12" }
1. Basic metric functions
A metric is an aggregation operation executed against a bucket, such as count, avg, max, min, or sum.
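As a minimal sketch (the aggregation name overall_avg_price is just an illustrative choice), a metric can also run on its own, without any bucketing; here avg is applied across the whole index:
curl GET ip:port/tvs/_search
{
  "size": 0,
  "aggs": {
    "overall_avg_price": {
      "avg": {
        "field": "price"
      }
    }
  }
}
With the test data above this should return a value of 2650.0, the same figure that the global-bucket example at the end of this section reports for all_brand_avg_price.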
Counting documents per group
Find which TV color sells the most (count the documents per color):
curl GET ip:port/tvs/_search
{
"size" : 0,
"aggs" : {
"popular_colors" : {
"terms" : {
"field" : "color"
}
}
}
}
Request parameters
- size: return only the aggregation result, not the original documents the aggregation ran over;
- aggs: fixed syntax indicating that a bucket aggregation should be executed on the data;
- popular_colors: the name of this aggregation, chosen by the user;
- terms: group documents by the value of a field;
- field: the field to group by.
Response
{
"took" : 7,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"popular_colors" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "红色",
"doc_count" : 4
},
{
"key" : "绿色",
"doc_count" : 2
},
{
"key" : "蓝色",
"doc_count" : 2
}
]
}
}
}
Response fields
- hits.hits: empty because we set size=0; otherwise the original documents that the aggregation ran over would also be returned.
- aggregations: the aggregation results.
- popular_colors: the aggregation name we defined in the request.
- buckets: the buckets created from the field we grouped on.
- key: the field value for this bucket.
- doc_count: the number of documents in this bucket.
Counting documents per group is not really a metric operation; it is the default behavior Elasticsearch attaches to every bucket, implemented here with the terms aggregation. The sketch below shows two optional terms parameters.
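A hedged sketch of those parameters: size limits how many buckets are returned, and order controls how they are sorted (by document count here, which is also the default):
curl GET ip:port/tvs/_search
{
  "size": 0,
  "aggs": {
    "popular_colors": {
      "terms": {
        "field": "color",
        "size": 2,
        "order": { "_count": "desc" }
      }
    }
  }
}
With the test data this would keep only the 红色 bucket (4 documents) and one of the two-document buckets; the remaining documents are reported in sum_other_doc_count.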
Averaging per group
Compute the average price for each TV color:
curl GET ip:port/tvs/_search
{
"size" : 0,
"aggs": {
"colors": {
"terms": {
"field": "color"
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
The nested aggs block sits at the same level as terms and runs the metric once for each bucket.
Response
{
...
"aggregations" : {
"colors" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "红色",
"doc_count" : 4,
"avg_price" : {
"value" : 3250.0
}
},
{
"key" : "绿色",
"doc_count" : 2,
"avg_price" : {
"value" : 2100.0
}
},
{
"key" : "蓝色",
"doc_count" : 2,
"avg_price" : {
"value" : 2000.0
}
}
]
}
}
}
The value under avg_price is the metric result: the average of the price field over all documents in that bucket.
Drill-down analysis
Split each bucket into sub-buckets, then run the metric on each of the smallest groups. For example: group TVs by color, then group each color by brand and compute the average price per brand.
curl GET ip:port/tvs/_search
{
"size": 0,
"aggs": {
"group_by_color": {
"terms": {
"field": "color"
},
"aggs": {
"color_avg_price": {
"avg": {
"field": "price"
}
},
"group_by_brand": {
"terms": {
"field": "brand"
},
"aggs": {
"brand_avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
}
}
The nested group_by_brand splits each color bucket by the brand field and computes the average price per brand.
{
...
"aggregations" : {
"group_by_color" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "红色",
"doc_count" : 4,
"color_avg_price" : {
"value" : 3250.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "长虹",
"doc_count" : 3,
"brand_avg_price" : {
"value" : 1666.6666666666667
}
},
{
"key" : "三星",
"doc_count" : 1,
"brand_avg_price" : {
"value" : 8000.0
}
}
]
}
},
{
"key" : "绿色",
"doc_count" : 2,
"color_avg_price" : {
"value" : 2100.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "TCL",
"doc_count" : 1,
"brand_avg_price" : {
"value" : 1200.0
}
},
{
"key" : "小米",
"doc_count" : 1,
"brand_avg_price" : {
"value" : 3000.0
}
}
]
}
},
{
"key" : "蓝色",
"doc_count" : 2,
"color_avg_price" : {
"value" : 2000.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "TCL",
"doc_count" : 1,
"brand_avg_price" : {
"value" : 1500.0
}
},
{
"key" : "小米",
"doc_count" : 1,
"brand_avg_price" : {
"value" : 2500.0
}
}
]
}
}
]
}
}
}
Min and max
Compute the highest and lowest price for each TV color:
curl GET ip:port/tvs/_search
{
"size" : 0,
"aggs": {
"colors": {
"terms": {
"field": "color"
},
"aggs": {
"min_price" : { "min": { "field": "price"} },
"max_price" : { "max": { "field": "price"} }
}
}
}
}
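The request above returns the extremes per color. As an alternative sketch, the stats aggregation returns count, min, max, avg, and sum for each bucket in a single metric (price_stats is an illustrative name):
curl GET ip:port/tvs/_search
{
  "size": 0,
  "aggs": {
    "colors": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "price_stats": {
          "stats": {
            "field": "price"
          }
        }
      }
    }
  }
}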
2. Interval bucketing (histogram)
The histogram keyword buckets documents by value ranges of a numeric field; if the field to group on is a date, use the date_histogram keyword instead. It takes a field and an interval, and places each document into the bucket whose range covers its field value:
curl GET ip:port/tvs/_search
{
"size" : 0,
"aggs":{
"price":{
"histogram":{
"field": "price",
"interval": 2000
}
}
}
}
The request above buckets the price field with an interval of 2000; the key of each returned bucket is the lower bound of its range:
{
...
"aggregations" : {
"price" : {
"buckets" : [
{
"key" : 0.0,
"doc_count" : 3
},
{
"key" : 2000.0,
"doc_count" : 4
},
{
"key" : 4000.0,
"doc_count" : 0
},
{
"key" : 6000.0,
"doc_count" : 0
},
{
"key" : 8000.0,
"doc_count" : 1
}
]
}
}
}
Once the documents are bucketed by range, we can run a metric on each bucket, for example a sum:
curl GET ip:port/tvs/_search
{
"size" : 0,
"aggs":{
"price":{
"histogram":{
"field": "price",
"interval": 2000
},
"aggs":{
"revenue": {
"sum": {
"field" : "price"
}
}
}
}
}
}
2.1. date_histogram
When the field to bucket on is of type date, use the date_histogram keyword, for example:
curl GET ip:port/tvs/_search
{
"size" : 0,
"aggs": {
"sales": {
"date_histogram": {
"field": "sold_date",
"interval": "month",
"format": "yyyy-MM-dd",
"min_doc_count" : 0,
"extended_bounds" : {
"min" : "2016-01-01",
"max" : "2017-12-31"
}
}
}
}
}
Request parameters
- min_doc_count: a date bucket is only returned if it contains at least this many documents; with 0, empty buckets are kept.
- extended_bounds: forces the buckets to cover exactly this start and end date, even where there is no data.
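Note that the interval parameter used above is the older syntax; since Elasticsearch 7.2 it is deprecated in favor of calendar_interval (calendar-aware units such as month or quarter) and fixed_interval (fixed lengths such as 30d). A hedged equivalent of the previous request on a recent cluster would be:
curl GET ip:port/tvs/_search
{
  "size": 0,
  "aggs": {
    "sales": {
      "date_histogram": {
        "field": "sold_date",
        "calendar_interval": "month",
        "format": "yyyy-MM-dd",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2016-01-01",
          "max": "2017-12-31"
        }
      }
    }
  }
}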
Compute the quarterly TV sales revenue for each brand:
curl GET ip:port/tvs/_search
{
"size": 0,
"aggs": {
"group_by_sold_date": {
"date_histogram": {
"field": "sold_date",
"interval": "quarter",
"format": "yyyy-MM-dd",
"min_doc_count": 0,
"extended_bounds": {
"min": "2016-01-01",
"max": "2017-12-31"
}
},
"aggs": {
"total_sum_price": {
"sum": {
"field": "price"
}
},
"group_by_brand": {
"terms": {
"field": "brand"
},
"aggs": {
"sum_price": {
"sum": {
"field": "price"
}
}
}
}
}
}
}
}
First bucket by date, then drill down into each date bucket and group by brand, and finally run a sum metric on every sub-bucket. The result:
{
...
"aggregations" : {
"group_by_sold_date" : {
"buckets" : [
{
"key_as_string" : "2016-01-01",
"key" : 1451606400000,
"doc_count" : 0,
"total_sum_price" : {
"value" : 0.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
},
{
"key_as_string" : "2016-04-01",
"key" : 1459468800000,
"doc_count" : 1,
"total_sum_price" : {
"value" : 3000.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "小米",
"doc_count" : 1,
"sum_price" : {
"value" : 3000.0
}
}
]
}
},
{
"key_as_string" : "2016-07-01",
"key" : 1467331200000,
"doc_count" : 2,
"total_sum_price" : {
"value" : 2700.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "TCL",
"doc_count" : 2,
"sum_price" : {
"value" : 2700.0
}
}
]
}
},
{
"key_as_string" : "2016-10-01",
"key" : 1475280000000,
"doc_count" : 3,
"total_sum_price" : {
"value" : 5000.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "长虹",
"doc_count" : 3,
"sum_price" : {
"value" : 5000.0
}
}
]
}
},
{
"key_as_string" : "2017-01-01",
"key" : 1483228800000,
"doc_count" : 2,
"total_sum_price" : {
"value" : 10500.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "三星",
"doc_count" : 1,
"sum_price" : {
"value" : 8000.0
}
},
{
"key" : "小米",
"doc_count" : 1,
"sum_price" : {
"value" : 2500.0
}
}
]
}
},
{
"key_as_string" : "2017-04-01",
"key" : 1491004800000,
"doc_count" : 0,
"total_sum_price" : {
"value" : 0.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
},
{
"key_as_string" : "2017-07-01",
"key" : 1498867200000,
"doc_count" : 0,
"total_sum_price" : {
"value" : 0.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
},
{
"key_as_string" : "2017-10-01",
"key" : 1506816000000,
"doc_count" : 0,
"total_sum_price" : {
"value" : 0.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
]
}
}
}
3. Aggregation scope
The aggregation scope limits the set of documents an aggregation runs over; it can be combined with a query or a filter.
Combining aggregations with full-text search
Every aggregation in Elasticsearch executes within a scope; when combined with an ordinary search request, that scope is the set of documents matched by the query.
Count the sales of each color for a given brand:
curl GET ip:port/tvs/_search
{
"size": 0,
"query": {
"term": {
"brand": {
"value": "小米"
}
}
},
"aggs": {
"group_by_color": {
"terms": {
"field": "color"
}
}
}
}
Response
{
"took" : 34,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_color" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "绿色",
"doc_count" : 1
},
{
"key" : "蓝色",
"doc_count" : 1
}
]
}
}
}
Combining aggregations with a filter
Compute the average price of all TVs priced at 1200 or more:
curl GET ip:port/tvs/_search
{
"size": 0,
"query": {
"constant_score": {
"filter": {
"range": {
"price": {
"gte": 1200
}
}
}
}
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
}
}
}
To apply a finer-grained filter to an individual bucket, use a filter aggregation inside aggs. For example, compute the average price of 长虹 TVs sold in the last month, the last three months, and the last six months (note that in Elasticsearch date math M means months while m means minutes):
curl GET ip:port/tvs/_search
{
"size": 0,
"query": {
"term": {
"brand": {
"value": "长虹"
}
}
},
"aggs": {
"recent_1m": {
"filter": {
"range": {
"sold_date": {
"gte": "now-1M"
}
}
},
"aggs": {
"recent_1m_avg_price": {
"avg": {
"field": "price"
}
}
}
},
"recent_3m": {
"filter": {
"range": {
"sold_date": {
"gte": "now-3M"
}
}
},
"aggs": {
"recent_3m_avg_price": {
"avg": {
"field": "price"
}
}
}
},
"recent_6m": {
"filter": {
"range": {
"sold_date": {
"gte": "now-6M"
}
}
},
"aggs": {
"recent_6m_avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
4. Global bucket
Sometimes a single aggregation request needs to return two results; a global bucket is used for this:
- the aggregation result within the specified scope;
- the aggregation result over all documents, unrestricted by the scope.
Compare the average selling price of 长虹 TVs with the average selling price across all brands:
curl GET ip:port/tvs/_search
{
"size": 0,
"query": {
"term": {
"brand": {
"value": "长虹"
}
}
},
"aggs": {
"single_brand_avg_price": {
"avg": {
"field": "price"
}
},
"all": {
"global": {},
"aggs": {
"all_brand_avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
In the request above, the query limits the scope: single_brand_avg_price is computed over the documents in that scope, while the global keyword resets the scope of the all aggregation to every document in the index.
Response
{
"took" : 35,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"all" : {
"doc_count" : 8,
"all_brand_avg_price" : {
"value" : 2650.0
}
},
"single_brand_avg_price" : {
"value" : 1666.6666666666667
}
}
}
In general, some metric operations are easy to execute in parallel across shards, such as max, min, and avg; the coordinating node only needs a simple final computation after collecting the per-shard results:
- the coordinating node broadcasts the request to every shard;
- each shard computes the local maximum of the field and returns it to the coordinating node;
- the coordinating node picks the largest of the values returned by the shards, which is the final maximum.
Algorithms of this kind scale out linearly as machines are added: they need no coordination between nodes (the machines never have to exchange intermediate results) and use almost no memory (a single number is enough to hold the maximum).
Other algorithms are hard to parallelize, for example count(distinct). It is not enough for each shard to return its own distinct values, because the coordinating node would then have to merge and deduplicate all of them in memory, which becomes very slow when the data volume is large.
To keep performance high, Elasticsearch therefore uses approximate algorithms: by accepting a small estimation error, they deliver results that are accurate but not 100% exact in exchange for fast execution and very low memory consumption.
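count(distinct) is exactly what the cardinality aggregation provides: an approximate distinct count based on the HyperLogLog++ algorithm, where precision_threshold controls up to how many distinct values the count stays near-exact, at the cost of more memory. A hedged sketch counting how many different brands were sold (distinct_brands is an illustrative name):
curl GET ip:port/tvs/_search
{
  "size": 0,
  "aggs": {
    "distinct_brands": {
      "cardinality": {
        "field": "brand",
        "precision_threshold": 100
      }
    }
  }
}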