Elasticsearch for Big Data: the normalizer mapping parameter for the keyword type

The Elasticsearch version used in this article is 7.4.2.

The normalizer parameter of the keyword type is similar to analyzer, except that a normalizer always emits exactly one token.

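The normalizer is applied to a value before it is indexed, and it is also applied to the query input before the keyword field is searched, for example by a match query or by term-level queries such as the term query.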
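To make the single-token point concrete, here is a minimal Python sketch. It does not call Elasticsearch: analyzer_like and normalizer_like are hypothetical helpers, and the whitespace split is only a rough stand-in for a real tokenizer. The sample value "Quick Brown Fox" is likewise made up for illustration.

```python
def analyzer_like(text: str) -> list:
    # Simplified stand-in for a text analyzer: lowercase, then split on whitespace.
    return text.lower().split()


def normalizer_like(text: str) -> list:
    # A normalizer has no tokenizer, so the whole value stays a single token.
    return [text.lower()]


value = "Quick Brown Fox"
analyzed = analyzer_like(value)
normalized = normalizer_like(value)

print(analyzed)    # ['quick', 'brown', 'fox']
print(normalized)  # ['quick brown fox']
```

The analyzer-style function yields three tokens, while the normalizer-style function keeps the value whole, which is what a keyword field needs for exact matching, sorting and aggregations.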

Test data:

PUT index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}

PUT index/_doc/1
{
  "foo": "BÀR"
}

PUT index/_doc/2
{
  "foo": "bar"
}

PUT index/_doc/3
{
  "foo": "baz"
}

POST index/_refresh

Run the queries:

GET index/_search
{
  "query": {
    "term": {
      "foo": "BAR"
    }
  }
}

GET index/_search
{
  "query": {
    "match": {
      "foo": "BAR"
    }
  }
}

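Both queries above return doc1 and doc2. At index time the normalizer turns the values of doc1 and doc2, 'BÀR' and 'bar', into the same token: bar. When the keyword field is queried, the input text goes through the same normalizer, BAR -> bar. Since the inverted index entries for doc1 and doc2 both contain bar, both documents are hits.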
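The index-time and query-time steps can be sketched in plain Python. This is only an approximation under stated assumptions: normalize mimics the lowercase and asciifolding filters with Unicode decomposition (the real asciifolding filter covers more characters), and inverted_index and term_query are hypothetical names, not Elasticsearch APIs.

```python
import unicodedata
from collections import defaultdict


def normalize(value: str) -> str:
    # Approximation of "lowercase" + "asciifolding": lowercase the value,
    # decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The three documents from the test data above.
docs = {"1": "BÀR", "2": "bar", "3": "baz"}

# Index time: each value is normalized before it enters the inverted index.
inverted_index = defaultdict(set)
for doc_id, value in docs.items():
    inverted_index[normalize(value)].add(doc_id)


def term_query(text: str) -> list:
    # Query time: the input goes through the same normalizer before lookup.
    return sorted(inverted_index.get(normalize(text), set()))


print(term_query("BAR"))  # ['1', '2']
```

Because both sides are normalized the same way, the lookup key for BAR is bar, which is the token under which doc1 and doc2 were indexed.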

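Note: normally a term query compares its text as-is against the tokens in the inverted index. Once a keyword field explicitly sets the normalizer parameter, this rule no longer holds: the input text is first processed by the normalizer and only then compared against the tokens in the inverted index.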
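The difference described in this note can be illustrated with a small comparison. Again this is a sketch rather than Elasticsearch itself: fold is a hypothetical approximation of the lowercase and asciifolding filters, and term_query is an illustrative helper that scans the documents instead of using a real inverted index.

```python
import unicodedata


def fold(value: str) -> str:
    # Rough stand-in for "lowercase" + "asciifolding" (assumption, not the real filters).
    decomposed = unicodedata.normalize("NFKD", value.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


docs = {"1": "BÀR", "2": "bar", "3": "baz"}


def term_query(text: str, normalizer=None) -> list:
    # Without a normalizer, both the stored value and the query text are used as-is.
    # With one, the same function is applied to both sides before comparing.
    apply = normalizer if normalizer is not None else (lambda s: s)
    wanted = apply(text)
    return sorted(doc_id for doc_id, value in docs.items() if apply(value) == wanted)


without_normalizer = term_query("BAR")
with_normalizer = term_query("BAR", normalizer=fold)

print(without_normalizer)  # []
print(with_normalizer)     # ['1', '2']
```

On a plain keyword field the query text BAR matches nothing, because no stored value is exactly BAR. With the normalizer in place both sides are reduced to bar before the comparison.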

Response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.47000363,
    "hits" : [
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.47000363,
        "_source" : {
          "foo" : "BÀR"
        }
      },
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.47000363,
        "_source" : {
          "foo" : "bar"
        }
      }
    ]
  }
}

Keep in mind that keyword values are run through the normalizer before they are indexed, so the keys returned by an aggregation are the normalized terms.
Example:

GET index/_search
{
  "size": 0,
  "aggs": {
    "foo_terms": {
      "terms": {
        "field": "foo"
      }
    }
  }
}

Response:

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "foo_terms" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "bar",
          "doc_count" : 2
        },
        {
          "key" : "baz",
          "doc_count" : 1
        }
      ]
    }
  }
}

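The example above shows that when you aggregate on a keyword field that has a normalizer, the bucket keys are the normalized tokens rather than the original values. For example, 'BÀR' and 'bar' both end up in the bar bucket.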
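The bucket counts can be reproduced with a short Python sketch. It assumes the same rough approximation of the lowercase and asciifolding filters as a hypothetical normalize function, and uses a Counter in place of the terms aggregation.

```python
import unicodedata
from collections import Counter


def normalize(value: str) -> str:
    # Approximation of "lowercase" + "asciifolding" (assumption, not the real filters).
    decomposed = unicodedata.normalize("NFKD", value.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Original _source values of the three test documents.
values = ["BÀR", "bar", "baz"]

# The aggregation only sees the indexed (normalized) tokens,
# so the original spellings are not available as bucket keys.
buckets = Counter(normalize(v) for v in values)

for key, doc_count in buckets.most_common():
    print(key, doc_count)
```

If the original spelling is needed in aggregation results, one option is to index the value a second time in a field without a normalizer and aggregate on that field instead.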
