Big Data 184 - Elasticsearch Doc Values Mechanism in Detail
1. What Are Doc Values
Doc values are an on-disk, column-oriented data structure built at indexing time and optimized for sorting, aggregations, and field access in scripts.
- Compared to fielddata, which is held in heap memory, doc values are far more memory efficient
- They are generated at write time and read from disk (through the operating system's file system cache) at query time
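To make the contrast with the inverted index concrete, here is a toy sketch in plain Python. It is not how Lucene stores data; the `docs` sample and the helper names `terms_count` and `sort_docs` are made up for illustration. It only shows why a per-document column suits sorting and aggregation, while a term-to-documents map suits search.

```python
from collections import defaultdict

# Toy documents: doc ID -> value of a single keyword field (made-up data).
docs = {0: "red", 1: "blue", 2: "red", 3: "green"}

# Inverted index: term -> doc IDs. Answers "which docs contain 'red'?".
inverted = defaultdict(list)
for doc_id, value in docs.items():
    inverted[value].append(doc_id)

# Column layout (the idea behind doc values): position = doc ID, entry = value.
# Answers "what is the value of doc 2?" without scanning every term.
column = [docs[doc_id] for doc_id in sorted(docs)]


def terms_count(matching_doc_ids):
    """Count field values for the matching docs by reading the column."""
    counts = defaultdict(int)
    for doc_id in matching_doc_ids:
        counts[column[doc_id]] += 1
    return dict(counts)


def sort_docs(matching_doc_ids):
    """Order the matching docs by their field value using the column."""
    return sorted(matching_doc_ids, key=lambda doc_id: column[doc_id])


print(inverted["red"])         # search: term -> docs
print(terms_count([0, 1, 2]))  # aggregation: docs -> values
print(sort_docs([0, 1, 3]))    # sort: docs ordered by value
```

Search walks from a term to its documents, while aggregation and sorting walk from documents to their values, which is the access pattern doc values are laid out for.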
2. Default Behavior
Types enabled by default
- numeric
- date
- ip
- keyword
- etc.
Types without doc values
- text: does not provide doc values, because text is analyzed into individual tokens and the original whole value is not what gets indexed, so it is not suitable for aggregation or sorting
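The defaults above can be expressed as a small lookup. This is a simplified sketch, not an Elasticsearch API: the helper `uses_doc_values` and both sets are hypothetical and cover only a handful of common types, so any type outside them returns `None` rather than a guess.

```python
# Common types that have doc values unless the mapping turns them off.
# This list is deliberately partial (an assumption of this sketch).
DOC_VALUES_BY_DEFAULT = {
    "byte", "short", "integer", "long", "float", "double",
    "date", "ip", "keyword", "boolean",
}

# Types that do not provide doc values at all.
NO_DOC_VALUES = {"text"}


def uses_doc_values(field_mapping):
    """Return True/False for known types, None when this sketch cannot tell."""
    field_type = field_mapping.get("type")
    if field_type in NO_DOC_VALUES:
        return False
    if field_type in DOC_VALUES_BY_DEFAULT:
        # An explicit "doc_values" setting overrides the default.
        return field_mapping.get("doc_values", True)
    return None


print(uses_doc_values({"type": "keyword"}))
print(uses_doc_values({"type": "keyword", "doc_values": False}))
print(uses_doc_values({"type": "text"}))
```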
3. How to Aggregate or Sort on a Text Field
3.1 Option 1: keyword subfield
{
  "content": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword"
      }
    }
  }
}
Use content.keyword for aggregations and sorting; content itself remains available for full-text search.
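As a rough illustration of what such requests could look like, the sketch below builds two search bodies as Python dicts and prints them as JSON. The aggregation name `by_content` and the search term are placeholders chosen for this example; the bodies would be sent to the index's `_search` endpoint.

```python
import json

# Terms aggregation on the keyword subfield. "size": 0 skips returning hits
# so that only the aggregation buckets come back.
agg_body = {
    "size": 0,
    "aggs": {
        "by_content": {  # placeholder aggregation name
            "terms": {"field": "content.keyword", "size": 10}
        }
    },
}

# Full-text match on the analyzed field, sorted by the exact (untokenized) value.
sort_body = {
    "query": {"match": {"content": "elasticsearch"}},  # placeholder search term
    "sort": [{"content.keyword": {"order": "asc"}}],
}

print(json.dumps(agg_body, indent=2))
print(json.dumps(sort_body, indent=2))
```

The query targets content (tokenized) while the sort and aggregation target content.keyword (backed by doc values), so one source value serves both purposes.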
3.2 Option 2: Enable fielddata
{
  "content": {
    "type": "text",
    "fielddata": true
  }
}
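Note: fielddata is built at query time and loaded into heap memory. On large or high-cardinality fields it can put serious pressure on the heap, so use it with caution. The keyword subfield from Option 1 is usually the better choice.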
4. Disabling Doc Values
For fields that will never be used for sorting, aggregations, or script access, doc values can be disabled to save disk space. The field can still be searched:
{
  "my_field": {
    "type": "keyword",
    "doc_values": false
  }
}
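The fragment above is a single field definition. A minimal sketch of how it might sit inside a complete create-index body follows; the helper `build_index_body` and the second field `status` are hypothetical and exist only to show the contrast with a field that keeps the default.

```python
import json


def build_index_body(properties):
    """Wrap field definitions in the mappings/properties structure."""
    return {"mappings": {"properties": properties}}


body = build_index_body({
    # Still searchable, but not usable for sorting, aggregations, or scripts.
    "my_field": {"type": "keyword", "doc_values": False},
    # Hypothetical field that keeps the default (doc values enabled).
    "status": {"type": "keyword"},
})

# json.dumps renders Python's False as JSON false.
print(json.dumps(body, indent=2))
```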
5. Notes
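- The mapping of an existing field cannot be changed in place; changing it requires creating a new index and reindexing the data
- Once a field's type is set it is hard to change, so decide on the type and the doc_values setting before indexing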
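One common way to rebuild an index is to create a new index with the corrected mapping and copy the documents over with the `_reindex` API. The sketch below only assembles the two requests as data; the helper `reindex_plan` and the index names are hypothetical, and actually sending the requests (and switching clients to the new index) is left out.

```python
import json


def reindex_plan(old_index, new_index, new_properties):
    """Return the (method, path, body) requests for a basic rebuild."""
    return [
        # 1. Create the new index with the corrected mapping.
        ("PUT", f"/{new_index}", {"mappings": {"properties": new_properties}}),
        # 2. Copy all documents from the old index into the new one.
        ("POST", "/_reindex", {
            "source": {"index": old_index},
            "dest": {"index": new_index},
        }),
    ]


plan = reindex_plan(
    "articles_v1",  # hypothetical existing index
    "articles_v2",  # hypothetical new index
    {"my_field": {"type": "keyword", "doc_values": False}},
)

for method, path, request_body in plan:
    print(method, path)
    print(json.dumps(request_body, indent=2))
```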
6. Summary
- Doc values are an on-disk, column-oriented store built at indexing time
- Most field types have them enabled by default
- text fields need a keyword subfield or fielddata to support aggregation and sorting