Big Data 184 - Elasticsearch Doc Values in Detail

1. What Are Doc Values

Doc values are an on-disk, column-oriented data structure built at index time. They are optimized for sorting, aggregations, and reading field values in scripts.

  • Compared with fielddata, which is held in JVM heap memory, doc values are more memory efficient
  • They are generated at write time and read from disk at query time
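
To make the "column-oriented" idea concrete, here is a minimal toy model in plain Python. It is not how Lucene stores data on disk; the `docs`, `inverted`, and `column` names are made up for illustration. It only shows why a doc-to-value layout suits sorting and aggregation, while a term-to-docs layout suits search.

```python
from collections import defaultdict

# Toy documents: doc id -> value of a single keyword-like field.
docs = {0: "cherry", 1: "apple", 2: "banana", 3: "apple"}

# Inverted index: term -> doc ids. Answers "which docs contain X?".
inverted = defaultdict(list)
for doc_id, term in docs.items():
    inverted[term].append(doc_id)

# Column-oriented store: position = doc id, entry = field value.
# Answers "what is the value of doc N?", which is what doc values provide.
column = [docs[i] for i in range(len(docs))]

# Search uses the inverted index to find matching docs.
hits = inverted["cherry"] + inverted["apple"]

# Sorting the hits needs one column lookup per doc.
sorted_hits = sorted(hits, key=lambda d: column[d])

# A terms-style aggregation counts the column values of the matching docs.
counts = defaultdict(int)
for d in hits:
    counts[column[d]] += 1

print(sorted_hits, dict(counts))
```

Without the column, sorting would have to walk every term in the inverted index to find out which value each hit holds.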

2. Default Behavior

Types enabled by default

  • numeric
  • date
  • ip
  • keyword
  • boolean, geo_point, and most other non-text field types

Types disabled by default

  • text: does not support doc values, because text is analyzed into tokens and the individual tokens are not suitable for aggregation or sorting

3. How to Aggregate or Sort on a Text Field

3.1 Option 1: keyword subfield

{
  "content": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword"
      }
    }
  }
}

Use content.keyword for aggregations and sorting, and keep content itself for full-text search.
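
As a rough sketch of how the subfield is used at query time, the snippet below builds a search request body that sorts and runs a terms aggregation on content.keyword. The helper name `build_search_body` and the aggregation name `by_value` are made up for illustration; the body would be sent to the index's `_search` endpoint with whatever client you use.

```python
import json

def build_search_body(field: str, bucket_count: int = 10) -> dict:
    """Build a search body that sorts and aggregates on a doc-values-backed field."""
    return {
        "size": 10,
        # Sorting reads the field's doc values for each hit.
        "sort": [{field: {"order": "asc"}}],
        # The terms aggregation buckets hits by the exact, untokenized value.
        "aggs": {"by_value": {"terms": {"field": field, "size": bucket_count}}},
    }

body = build_search_body("content.keyword")
print(json.dumps(body, indent=2))
```

Pointing the same body at content instead of content.keyword would be rejected by default, because the text field has neither doc values nor fielddata.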

3.2 Option 2: Enable fielddata

{
  "content": {
    "type": "text",
    "fielddata": true
  }
}

Note: fielddata is loaded into JVM heap memory and puts serious memory pressure on the node for large or high-cardinality fields. Use it with caution.

4. Disable Doc Values

For fields that will never be used for sorting, aggregations, or scripts, doc values can be disabled to save disk space. The field can still be searched:

{
  "my_field": {
    "type": "keyword",
    "doc_values": false
  }
}
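
The rules from sections 2 to 4 can be summarized in a small checker. This is a simplified sketch, not an Elasticsearch API: `can_sort_or_aggregate` and `usable_fields` are hypothetical helpers, and the code assumes every non-text type supports doc values, which is not true for every type.

```python
def can_sort_or_aggregate(field_def: dict) -> bool:
    """Apply the article's rules to one field definition."""
    if field_def.get("type") == "text":
        # text has no doc values; only fielddata makes it usable.
        return field_def.get("fielddata", False)
    # Assumption: other types have doc values unless explicitly disabled.
    return field_def.get("doc_values", True)

def usable_fields(properties: dict, prefix: str = "") -> list:
    """List field paths (including multi-field subfields) usable for sort/aggs."""
    result = []
    for name, field_def in properties.items():
        path = prefix + name
        if can_sort_or_aggregate(field_def):
            result.append(path)
        # Recurse into multi-fields such as content.keyword.
        result.extend(usable_fields(field_def.get("fields", {}), path + "."))
    return result

properties = {
    "content": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
    "my_field": {"type": "keyword", "doc_values": False},
    "created_at": {"type": "date"},
}
fields = usable_fields(properties)
print(fields)
```

Here my_field is excluded because its doc values are disabled, and content is excluded in favor of its keyword subfield.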

5. Notes

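  • The mapping of an existing field cannot be changed in place. Changing it requires creating a new index with the new mapping and reindexing the data
  • Because a field's type and its doc_values setting are effectively fixed once data is indexed, decide on them before indexing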
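
A mapping change therefore usually becomes a two-step rebuild: create a new index with the desired mapping, then copy the documents across with the `_reindex` API. The sketch below only builds the two requests; the index names `articles_v1` and `articles_v2` and the helper `reindex_plan` are made up for illustration, and sending the requests is left to your client of choice.

```python
import json

def reindex_plan(old_index: str, new_index: str, new_properties: dict) -> list:
    """Return (method, path, body) tuples for rebuilding an index with a new mapping."""
    return [
        # Step 1: create the target index with the corrected mapping.
        ("PUT", f"/{new_index}", {"mappings": {"properties": new_properties}}),
        # Step 2: copy all documents from the old index into the new one.
        ("POST", "/_reindex", {
            "source": {"index": old_index},
            "dest": {"index": new_index},
        }),
    ]

plan = reindex_plan(
    "articles_v1",
    "articles_v2",
    {"my_field": {"type": "keyword", "doc_values": False}},
)
for method, path, body in plan:
    print(method, path, json.dumps(body))
```

Once the copy completes, clients need to be pointed at the new index, for example through an alias.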

6. Summary

  • Doc values are an on-disk, column-oriented store built at index time
  • Most field types have them enabled by default; text is the exception
  • To aggregate or sort on a text field, add a keyword subfield (preferred) or enable fielddata