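Elasticsearch: Aggregation Analysis (Part 1)

  A bucket is a group of data: a set of documents that fall into the same group.

Preparing the data

  • Create the tvs index
PUT /tvs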
{
  "mappings": {
    "properties": {
      "price": {
      	"type": "long"
      },
      "color": {
      	"type": "keyword"
      },
      "brand": {
      	"type": "keyword"
      },
      "sold_date": {
      	"type": "date"
      }
    }
  }
}
  • Test data (colors: 红色 = red, 绿色 = green, 蓝色 = blue; brands: 长虹 = Changhong, 小米 = Xiaomi, 三星 = Samsung). The _bulk body must be newline-delimited JSON: each action and each document sits on a single line, and the body ends with a newline.
POST /tvs/_bulk
{ "index": {} }
{ "price": 1000, "color": "红色", "brand": "长虹", "sold_date": "2016-10-28" }
{ "index": {} }
{ "price": 2000, "color": "红色", "brand": "长虹", "sold_date": "2016-11-05" }
{ "index": {} }
{ "price": 3000, "color": "绿色", "brand": "小米", "sold_date": "2016-05-18" }
{ "index": {} }
{ "price": 1500, "color": "蓝色", "brand": "TCL", "sold_date": "2016-07-02" }
{ "index": {} }
{ "price": 1200, "color": "绿色", "brand": "TCL", "sold_date": "2016-08-19" }
{ "index": {} }
{ "price": 2000, "color": "红色", "brand": "长虹", "sold_date": "2016-11-05" }
{ "index": {} }
{ "price": 8000, "color": "红色", "brand": "三星", "sold_date": "2017-01-01" }
{ "index": {} }
{ "price": 2500, "color": "蓝色", "brand": "小米", "sold_date": "2017-02-12" }

1. Basic functions

  A metric is an aggregation operation executed on a bucket, such as count, avg, max, min, or sum.

Grouping by count

  Find which TV color sells the most:

GET /tvs/_search
{
    "size" : 0,
    "aggs" : { 
        "popular_colors" : { 
            "terms" : { 
              "field" : "color"
            }
        }
    }
}

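  Request parameters

  • size: set to 0 to return only the aggregation results, without the original documents the aggregation ran over;
  • aggs: fixed syntax, indicating that a grouping and aggregation operation is to run on the data;
  • popular_colors: the name of this aggregation, chosen by the user;
  • terms: group by the values of a field;
  • field: the field to group by.

  Response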

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 8,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "popular_colors" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "红色",
          "doc_count" : 4
        },
        {
          "key" : "绿色",
          "doc_count" : 2
        },
        {
          "key" : "蓝色",
          "doc_count" : 2
        }
      ]
    }
  }
}

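  Response fields

  • hits.hits: empty because the request sets size to 0; otherwise it would contain the original documents the aggregation ran over.
  • aggregations: the aggregation results.
  • popular_colors: the user-defined aggregation name.
  • buckets: the buckets produced by splitting on the specified field.
  • key: the field value that defines the bucket.
  • doc_count: the number of docs in the bucket.

  Grouping by count is not really a metric operation. Counting the docs in each bucket is the default behavior of Elasticsearch aggregations, provided here by the terms aggregation.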
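
  As a rough illustration of what the terms grouping computes, the sketch below reproduces the doc counts in plain Python. It is not how Elasticsearch works internally; the colors list simply mirrors the color field of the eight test documents.

```python
from collections import Counter

# color field of the eight test documents, in insertion order
colors = ["红色", "红色", "绿色", "蓝色", "绿色", "红色", "红色", "蓝色"]

# one bucket per distinct value; doc_count is the number of docs in it
buckets = [
    {"key": key, "doc_count": count}
    for key, count in Counter(colors).most_common()
]
print(buckets)
```

  The largest bucket comes first, which corresponds to the default ordering of terms buckets by doc_count descending.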

Computing an average

  Compute the average price of TVs for each color:

GET /tvs/_search
{
   "size" : 0,
   "aggs": {
      "colors": {
         "terms": {
            "field": "color"
         },
         "aggs": { 
            "avg_price": { 
               "avg": {
                  "field": "price" 
               }
            }
         }
      }
   }
}

  The nested aggs sits at the same level as terms and runs one metric operation on each bucket.

  Response

{
  ...
  "aggregations" : {
    "colors" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "红色",
          "doc_count" : 4,
          "avg_price" : {
            "value" : 3250.0
          }
        },
        {
          "key" : "绿色",
          "doc_count" : 2,
          "avg_price" : {
            "value" : 2100.0
          }
        },
        {
          "key" : "蓝色",
          "doc_count" : 2,
          "avg_price" : {
            "value" : 2000.0
          }
        }
      ]
    }
  }
}

  avg_price.value is the metric result: the average of the price field over all docs in each bucket.
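
  The same bucket-then-metric pattern can be sketched in plain Python to check the numbers. This is only an illustration under the assumption that the (color, price) pairs below match the test data; the variable names are made up for the example.

```python
from collections import defaultdict
from statistics import mean

docs = [  # (color, price) pairs taken from the test data
    ("红色", 1000), ("红色", 2000), ("绿色", 3000), ("蓝色", 1500),
    ("绿色", 1200), ("红色", 2000), ("红色", 8000), ("蓝色", 2500),
]

# bucket step: collect the prices of each color
prices_by_color = defaultdict(list)
for color, price in docs:
    prices_by_color[color].append(price)

# metric step: one avg per bucket
avg_price = {color: mean(prices) for color, prices in prices_by_color.items()}
print(avg_price)
```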

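Drill-down analysis

  Split each bucket into sub-groups, then run an aggregation on each finest-grained group. For example: group the TVs by color, then compute the average price of each brand within each color.

GET /tvs/_search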
{
  "size": 0,
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "color_avg_price": {
          "avg": {
            "field": "price"
          }
        },
        "group_by_brand": {
          "terms": {
            "field": "brand"
          },
          "aggs": {
            "brand_avg_price": {
              "avg": {
                "field": "price"
              }
            }
          }
        }
      }
    }
  }
}

  The nested group_by_brand aggregation groups by the brand field and computes the average price per brand within each color:

{
  ...
  "aggregations" : {
    "group_by_color" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "红色",
          "doc_count" : 4,
          "color_avg_price" : {
            "value" : 3250.0
          },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "长虹",
                "doc_count" : 3,
                "brand_avg_price" : {
                  "value" : 1666.6666666666667
                }
              },
              {
                "key" : "三星",
                "doc_count" : 1,
                "brand_avg_price" : {
                  "value" : 8000.0
                }
              }
            ]
          }
        },
        {
          "key" : "绿色",
          "doc_count" : 2,
          "color_avg_price" : {
            "value" : 2100.0
          },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "TCL",
                "doc_count" : 1,
                "brand_avg_price" : {
                  "value" : 1200.0
                }
              },
              {
                "key" : "小米",
                "doc_count" : 1,
                "brand_avg_price" : {
                  "value" : 3000.0
                }
              }
            ]
          }
        },
        {
          "key" : "蓝色",
          "doc_count" : 2,
          "color_avg_price" : {
            "value" : 2000.0
          },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "TCL",
                "doc_count" : 1,
                "brand_avg_price" : {
                  "value" : 1500.0
                }
              },
              {
                "key" : "小米",
                "doc_count" : 1,
                "brand_avg_price" : {
                  "value" : 2500.0
                }
              }
            ]
          }
        }
      ]
    }
  }
}
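
  A two-level drill-down can be pictured as a nested dictionary: color first, then brand. The sketch below is a plain-Python illustration of that idea, assuming the tuples mirror the test data; tree and avg are names invented for the example.

```python
from collections import defaultdict

docs = [  # (color, brand, price) taken from the test data
    ("红色", "长虹", 1000), ("红色", "长虹", 2000), ("绿色", "小米", 3000),
    ("蓝色", "TCL", 1500), ("绿色", "TCL", 1200), ("红色", "长虹", 2000),
    ("红色", "三星", 8000), ("蓝色", "小米", 2500),
]

# outer bucket: color, inner bucket: brand
tree = defaultdict(lambda: defaultdict(list))
for color, brand, price in docs:
    tree[color][brand].append(price)

def avg(values):
    return sum(values) / len(values)

for color, brands in tree.items():
    all_prices = [p for prices in brands.values() for p in prices]
    print(color, "color_avg_price =", avg(all_prices))
    for brand, prices in brands.items():
        print("  ", brand, "brand_avg_price =", avg(prices))
```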

Computing extremes

  Compute the highest and lowest price of TVs for each color:

GET /tvs/_search
{
   "size" : 0,
   "aggs": {
      "colors": {
         "terms": {
            "field": "color"
         },
         "aggs": {
            "min_price" : { "min": { "field": "price"} }, 
            "max_price" : { "max": { "field": "price"} }
         }
      }
   }
}
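
  Since no response is shown for this request, the following plain-Python sketch indicates the values one would expect min_price and max_price to hold for the test data. It assumes the (color, price) pairs below match the indexed documents.

```python
docs = [  # (color, price) pairs taken from the test data
    ("红色", 1000), ("红色", 2000), ("绿色", 3000), ("蓝色", 1500),
    ("绿色", 1200), ("红色", 2000), ("红色", 8000), ("蓝色", 2500),
]

# color -> (min_price, max_price), updated one doc at a time
extremes = {}
for color, price in docs:
    lo, hi = extremes.get(color, (price, price))
    extremes[color] = (min(lo, price), max(hi, price))

print(extremes)
```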

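2. Interval grouping

  The histogram keyword groups the values of a field into fixed-size intervals. If the field is a date, use the date_histogram keyword instead.

  histogram takes a field and assigns docs to buckets according to the interval each field value falls in:

GET /tvs/_search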
{
   "size" : 0,
   "aggs":{
      "price":{
         "histogram":{ 
            "field": "price",
            "interval": 2000
         }
      }
   }
}

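  The request above groups on the price field with an interval of 2000. Each bucket key is the lower bound of its interval, so key 0 covers prices in [0, 2000). Response: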

{
  ...
  "aggregations" : {
    "price" : {
      "buckets" : [
        {
          "key" : 0.0,
          "doc_count" : 3
        },
        {
          "key" : 2000.0,
          "doc_count" : 4
        },
        {
          "key" : 4000.0,
          "doc_count" : 0
        },
        {
          "key" : 6000.0,
          "doc_count" : 0
        },
        {
          "key" : 8000.0,
          "doc_count" : 1
        }
      ]
    }
  }
}
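
  The bucket keys follow from rounding each value down to a multiple of the interval. The sketch below applies that rule in plain Python, with offset taken as 0, and fills the gaps between the smallest and largest key with empty buckets as the response above does. bucket_key is a helper named for this example only.

```python
import math

prices = [1000, 2000, 3000, 1500, 1200, 2000, 8000, 2500]
interval = 2000

def bucket_key(value, interval):
    # round down to the nearest multiple of the interval
    return math.floor(value / interval) * interval

counts = {}
for p in prices:
    k = bucket_key(p, interval)
    counts[k] = counts.get(k, 0) + 1

# emit every interval between the first and last key, including empty ones
keys = range(min(counts), max(counts) + interval, interval)
buckets = [(k, counts.get(k, 0)) for k in keys]
print(buckets)
```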

  Once the docs are grouped by interval, metric operations can run on each bucket, for example a sum:

GET /tvs/_search
{
   "size" : 0,
   "aggs":{
      "price":{
         "histogram":{ 
            "field": "price",
            "interval": 2000
         },
         "aggs":{
            "revenue": {
               "sum": { 
                 "field" : "price"
               }
             }
         }
      }
   }
}
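
  As a quick check of what revenue should contain for the test data, here is a plain-Python sketch of the same sum per price interval. It only lists intervals that hold docs; the empty intervals at 4000 and 6000 would carry a sum of 0.

```python
prices = [1000, 2000, 3000, 1500, 1200, 2000, 8000, 2500]
interval = 2000

# bucket key -> sum of the prices that fall into that interval
revenue = {}
for p in prices:
    key = p // interval * interval
    revenue[key] = revenue.get(key, 0) + p

print(revenue)
```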

2.1. date_histogram

  When the field to group by interval is of type date, use the date_histogram keyword. For example:

GET /tvs/_search
{
   "size" : 0,
   "aggs": {
      "sales": {
         "date_histogram": {
            "field": "sold_date",
            "calendar_interval": "month",
            "format": "yyyy-MM-dd",
            "min_doc_count" : 0, 
            "extended_bounds" : { 
                "min" : "2016-01-01",
                "max" : "2017-12-31"
            }
         }
      }
   }
}

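  Request parameters

  • min_doc_count: a date interval is returned only if it holds at least this many docs; 0 keeps empty intervals in the response.
  • extended_bounds: forces the buckets to span from min to max even where no docs exist. It only extends the range and does not filter out docs that fall outside it.
  • interval / calendar_interval: the bucket width. Since Elasticsearch 7.2 the interval parameter of date_histogram is deprecated in favor of calendar_interval and fixed_interval, and it was removed in 8.0.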
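
  The effect of min_doc_count and extended_bounds on a monthly grouping can be sketched in plain Python. This is an approximation for illustration: it assumes all docs lie inside the bounds (true for the test data) and uses month_start, a helper invented for the example.

```python
from collections import Counter
from datetime import date

sold_dates = ["2016-10-28", "2016-11-05", "2016-05-18", "2016-07-02",
              "2016-08-19", "2016-11-05", "2017-01-01", "2017-02-12"]

def month_start(d):
    return d.replace(day=1)

counts = Counter(month_start(date.fromisoformat(s)) for s in sold_dates)

# extended_bounds: walk every month from min to max, even months without docs
lo, hi = date(2016, 1, 1), date(2017, 12, 31)
buckets = []
cur = month_start(lo)
while cur <= hi:
    buckets.append((cur.isoformat(), counts.get(cur, 0)))
    cur = date(cur.year + cur.month // 12, cur.month % 12 + 1, 1)  # next month

# min_doc_count: drop buckets with fewer docs than the threshold (0 keeps all)
min_doc_count = 0
buckets = [b for b in buckets if b[1] >= min_doc_count]
print(len(buckets), buckets[:3])
```

  Raising min_doc_count to 1 in the sketch leaves only the seven months that actually contain sales.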

  Compute the sales revenue of each brand per quarter:

GET /tvs/_search
{
  "size": 0,
  "aggs": {
    "group_by_sold_date": {
      "date_histogram": {
        "field": "sold_date",
        "calendar_interval": "quarter",
        "format": "yyyy-MM-dd",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2016-01-01",
          "max": "2017-12-31"
        }
      },
      "aggs": {
        "total_sum_price": {
          "sum": {
            "field": "price"
          }
        },  
        "group_by_brand": {
          "terms": {
            "field": "brand"
          },
          "aggs": {
            "sum_price": {
              "sum": {
                "field": "price"
              }
            }
          }
        }
      }
    }
  }
}

  The docs are first grouped by date, then each group is drilled down and grouped by brand, and finally a sum metric runs on each sub-group. Result:

{
  ...
  "aggregations" : {
    "group_by_sold_date" : {
      "buckets" : [
        {
          "key_as_string" : "2016-01-01",
          "key" : 1451606400000,
          "doc_count" : 0,
          "total_sum_price" : {
            "value" : 0.0
          },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [ ]
          }
        },
        {
          "key_as_string" : "2016-04-01",
          "key" : 1459468800000,
          "doc_count" : 1,
          "total_sum_price" : {
            "value" : 3000.0
          },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "小米",
                "doc_count" : 1,
                "sum_price" : {
                  "value" : 3000.0
                }
              }
            ]
          }
        },
        {
          "key_as_string" : "2016-07-01",
          "key" : 1467331200000,
          "doc_count" : 2,
          "total_sum_price" : {
            "value" : 2700.0
          },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "TCL",
                "doc_count" : 2,
                "sum_price" : {
                  "value" : 2700.0
                }
              }
            ]
          }
        },
        {
          "key_as_string" : "2016-10-01",
          "key" : 1475280000000,
          "doc_count" : 3,
          "total_sum_price" : {
            "value" : 5000.0
          },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "长虹",
                "doc_count" : 3,
                "sum_price" : {
                  "value" : 5000.0
                }
              }
            ]
          }
        },
        {
          "key_as_string" : "2017-01-01",
          "key" : 1483228800000,
          "doc_count" : 2,
          "total_sum_price" : {
            "value" : 10500.0
          },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "三星",
                "doc_count" : 1,
                "sum_price" : {
                  "value" : 8000.0
                }
              },
              {
                "key" : "小米",
                "doc_count" : 1,
                "sum_price" : {
                  "value" : 2500.0
                }
              }
            ]
          }
        },
        {
          "key_as_string" : "2017-04-01",
          "key" : 1491004800000,
          "doc_count" : 0,
          "total_sum_price" : {
            "value" : 0.0
          },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [ ]
          }
        },
        {
          "key_as_string" : "2017-07-01",
          "key" : 1498867200000,
          "doc_count" : 0,
          "total_sum_price" : {
            "value" : 0.0
          },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [ ]
          }
        },
        {
          "key_as_string" : "2017-10-01",
          "key" : 1506816000000,
          "doc_count" : 0,
          "total_sum_price" : {
            "value" : 0.0
          },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [ ]
          }
        }
      ]
    }
  }
}
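
  The quarterly keys in the response (01-01, 04-01, 07-01, 10-01) come from rounding each date down to the first day of its quarter. A plain-Python sketch of that rounding, followed by the per-brand sum, might look as follows; quarter_start is a hypothetical helper and the tuples mirror the test data.

```python
from collections import defaultdict
from datetime import date

docs = [  # (sold_date, brand, price) taken from the test data
    ("2016-10-28", "长虹", 1000), ("2016-11-05", "长虹", 2000),
    ("2016-05-18", "小米", 3000), ("2016-07-02", "TCL", 1500),
    ("2016-08-19", "TCL", 1200), ("2016-11-05", "长虹", 2000),
    ("2017-01-01", "三星", 8000), ("2017-02-12", "小米", 2500),
]

def quarter_start(d):
    # months 1-3 -> 1, 4-6 -> 4, 7-9 -> 7, 10-12 -> 10
    return date(d.year, (d.month - 1) // 3 * 3 + 1, 1)

# quarter -> brand -> summed price
sums = defaultdict(lambda: defaultdict(int))
for sold, brand, price in docs:
    sums[quarter_start(date.fromisoformat(sold))][brand] += price

for quarter in sorted(sums):
    print(quarter, dict(sums[quarter]), "total =", sum(sums[quarter].values()))
```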

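3. Aggregation scope

  The aggregation scope limits the set of docs an aggregation runs over. It can be combined with query and filter.

Combining aggregation with full-text search

  Every aggregation in Elasticsearch runs within a scope. When combined with an ordinary search request, the scope is the set of docs matched by the query.

  Count the sales of each color for a given brand:

GET /tvs/_search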
{
  "size": 0,
  "query": {
    "term": {
      "brand": {
        "value": "小米"
      }
    }
  },
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color"
      }
    }
  }
}

  Response

{
  "took" : 34,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_color" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "绿色",
          "doc_count" : 1
        },
        {
          "key" : "蓝色",
          "doc_count" : 1
        }
      ]
    }
  }
}
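
  In other words, the query narrows the docs first and the aggregation only sees what is left. A plain-Python sketch of that order of operations, assuming the (brand, color) pairs mirror the test data:

```python
from collections import Counter

docs = [  # (brand, color) pairs taken from the test data
    ("长虹", "红色"), ("长虹", "红色"), ("小米", "绿色"), ("TCL", "蓝色"),
    ("TCL", "绿色"), ("长虹", "红色"), ("三星", "红色"), ("小米", "蓝色"),
]

# query step: the scope is the set of docs matching brand == 小米
scope = [color for brand, color in docs if brand == "小米"]

# aggregation step: terms grouping runs on the scope only
buckets = Counter(scope)
print(len(scope), dict(buckets))
```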

Combining aggregation with filter

  Compute the average price of all TVs priced at 1200 or more:

GET /tvs/_search
{
  "size": 0,
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "price": {
            "gte": 1200
          }
        }
      }
    }
  },
  "aggs": {
    "avg_price": {
      "avg": {
        "field": "price"
      }
    }
  }
}
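
  No response is shown for this request, so the sketch below works out the expected avg_price by hand in plain Python. gte is inclusive, so the TV priced at exactly 1200 stays in scope and only the one priced at 1000 is excluded.

```python
prices = [1000, 2000, 3000, 1500, 1200, 2000, 8000, 2500]

# filter step: range with gte 1200 keeps prices >= 1200
scope = [p for p in prices if p >= 1200]

# metric step: avg over the filtered docs
avg_price = sum(scope) / len(scope)
print(len(scope), avg_price)
```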

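  To apply a finer-grained filter to a single bucket, use aggs.filter. For example: compute the average price of 长虹 (Changhong) TVs sold in the last 1 month, the last 3 months, and the last 6 months:

GET /tvs/_search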
{
  "size": 0,
  "query": {
    "term": {
      "brand": {
        "value": "长虹"
      }
    }
  },
  "aggs": {
    "recent_1m": {
      "filter": {
        "range": {
          "sold_date": {
            "gte": "now-1M"
          }
        }
      },
      "aggs": {
        "recent_1m_avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    },
    "recent_3m": {
      "filter": {
        "range": {
          "sold_date": {
            "gte": "now-3M"
          }
        }
      },
      "aggs": {
        "recent_3m_avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    },
    "recent_6m": {
      "filter": {
        "range": {
          "sold_date": {
            "gte": "now-6M"
          }
        }
      },
      "aggs": {
        "recent_6m_avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    },
          }
        }
      },
      "aggs": {
        "recent_6m_avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
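
  In Elasticsearch date math, M stands for months and m for minutes, so a "last N months" window is written now-NM. The sketch below imitates the three filter buckets in plain Python. Because the test data is from 2016 and 2017, it assumes a fixed reference date of 2016-12-01 in place of now; months_before and avg_since are helpers invented for the example, and months_before only handles dates whose day exists in the target month.

```python
from datetime import date

# (sold_date, price) of the three 长虹 docs in the test data
changhong = [("2016-10-28", 1000), ("2016-11-05", 2000), ("2016-11-05", 2000)]

def months_before(d, n):
    # step back n calendar months, keeping the day of month
    total = d.year * 12 + (d.month - 1) - n
    return d.replace(year=total // 12, month=total % 12 + 1)

def avg_since(docs, start):
    prices = [p for sold, p in docs if date.fromisoformat(sold) >= start]
    return sum(prices) / len(prices) if prices else None

now = date(2016, 12, 1)  # assumed reference date standing in for "now"
result = {
    f"recent_{n}m": avg_since(changhong, months_before(now, n))
    for n in (1, 3, 6)
}
print(result)
```

  Each filter bucket is evaluated independently against the docs matched by the query, which is why the three windows can overlap.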

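4. Global grouping

  When a single aggregation request needs to return two results, use the global bucket:

  1. the aggregation result within the specified scope;
  2. the aggregation result over all docs, with no scope restriction.

  Compare the average price of 长虹 (Changhong) TVs with the average price of TVs of all brands:

GET /tvs/_search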
{
  "size": 0, 
  "query": {
    "term": {
      "brand": {
        "value": "长虹"
      }
    }
  },
  "aggs": {
    "single_brand_avg_price": {
      "avg": {
        "field": "price"
      }
    },
    "all": {
      "global": {},
      "aggs": {
        "all_brand_avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

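  In the request above, query limits the scope, and aggregations run on the docs within that scope. The global keyword inside aggs resets the scope of its sub-aggregations to all docs.

  Response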

{
  "took" : 35,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "all" : {
      "doc_count" : 8,
      "all_brand_avg_price" : {
        "value" : 2650.0
      }
    },
    "single_brand_avg_price" : {
      "value" : 1666.6666666666667
    }
  }
}
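
  The two numbers in the response can be reproduced with a plain-Python sketch: one average over the docs matched by the query, one over every doc regardless of the query. The (brand, price) pairs are assumed to mirror the test data.

```python
docs = [  # (brand, price) pairs taken from the test data
    ("长虹", 1000), ("长虹", 2000), ("小米", 3000), ("TCL", 1500),
    ("TCL", 1200), ("长虹", 2000), ("三星", 8000), ("小米", 2500),
]

def avg(values):
    return sum(values) / len(values)

# scoped by the query: only 长虹 docs
single_brand_avg_price = avg([p for brand, p in docs if brand == "长虹"])

# global bucket: the query is ignored, all docs take part
all_brand_avg_price = avg([p for _, p in docs])

print(single_brand_avg_price, all_brand_avg_price)
```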

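  In general, some metric operations are easy to run in parallel across shards, such as max, min, and avg. Once the coordinating node has the result from each shard, a simple calculation yields the final result. For max:

  1. the coordinating node broadcasts the request to all shards;
  2. each shard computes its local maximum of the field and returns it to the coordinating node;
  3. the coordinating node picks the largest of the values returned by the shards, which is the final maximum.

  Algorithms of this kind scale horizontally as machines are added, need no coordination (machines do not exchange intermediate results), and use very little memory (a single integer can represent the maximum).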
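
  The steps above can be sketched in a few lines of Python. The split of the test prices across three shards is hypothetical. The sketch also shows one way avg can be merged from small per-shard partial results: each shard sends a sum and a count, because averaging the per-shard averages would give a different number.

```python
# hypothetical distribution of the eight prices across three shards
shards = [[1000, 2000, 3000], [1500, 1200], [2000, 8000, 2500]]

# max: each shard returns one number, the coordinating node takes the max
local_maxima = [max(s) for s in shards]
global_max = max(local_maxima)

# avg: each shard returns (sum, count); the coordinating node merges them
partials = [(sum(s), len(s)) for s in shards]
global_avg = sum(s for s, _ in partials) / sum(c for _, c in partials)

# averaging the shard averages ignores shard sizes and is not the same value
naive_avg = sum(sum(s) / len(s) for s in shards) / len(shards)

print(global_max, global_avg, naive_avg)
```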

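  Other algorithms are hard to parallelize, for example count(distinct). Computing the distinct values on each shard is not enough: the coordinating node has to collect the values returned by every shard and deduplicate them in memory, which is very slow when the data set is large.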
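
  A small Python sketch makes the problem concrete. The shard layout is hypothetical. Per-shard distinct counts cannot simply be added, because the same value may live on several shards; an exact answer needs the full set of values from every shard to be merged in one place.

```python
# hypothetical distribution of the brand field across three shards
shard_brands = [["长虹", "长虹", "小米"], ["TCL", "TCL"], ["长虹", "三星", "小米"]]

# each shard can count its own distinct values cheaply
per_shard_distinct = [len(set(s)) for s in shard_brands]

# adding those counts double-counts values present on more than one shard
wrong = sum(per_shard_distinct)

# the exact answer requires shipping every distinct value to one node
exact = len(set().union(*shard_brands))

print(per_shard_distinct, wrong, exact)
```

  The merged set grows with the number of distinct values, which is the memory and time cost the paragraph above refers to.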

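  To improve performance, Elasticsearch therefore uses approximation algorithms for such cases. Their results are accurate but not 100% precise: at the cost of a small estimation error, they offer fast execution and very low memory consumption.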
