Big Data 182 - Elasticsearch Inverted Index Underlying Breakdown
1. Inverted Index Core Concepts
Inverted index consists of two parts:
- Terms Dictionary: Collection of all Terms, sorted in dictionary order
- Posting List: Document list corresponding to each Term
Forward: DocId → Document
Inverted: Term → [DocId1, DocId2, DocId3, ...]
2. Lucene Index File Structure
| File Suffix | Meaning |
|---|---|
| .tim | Terms Dictionary, stores Term and Postings pointers |
| .tip | Terms Index, FST structure, stores Term prefix and Block positions |
| .doc | Stores DocId and term frequency |
3. FST (Finite State Transducer)
FST is implementation structure of Term Index:
- Pros: Can be fully loaded into memory, millisecond-level positioning
- Principle: Uses directed acyclic graph to store Term prefix sharing
4. SkipList (Skip List)
Used to accelerate intersection calculation of multiple Posting Lists:
- Time complexity reduced from O(N) to O(logN)
- Reduces invalid comparisons
5. Query Flow
- Quickly locate Term through Terms Index (FST)
- Get Postings pointer from Terms Dictionary
- Read Postings
- Merge multiple Posting Lists using SkipList
6. Summary
Lucene segment immutable + flush/merge write model determines ES near real-time (NRT) characteristic: fast write, but queries need to merge results from multiple segments.