Big Data 182 - Elasticsearch Inverted Index Underlying Breakdown

1. Inverted Index Core Concepts

Inverted index consists of two parts:

  • Terms Dictionary: Collection of all Terms, sorted in dictionary order
  • Posting List: Document list corresponding to each Term
Forward: DocId → Document
Inverted: Term → [DocId1, DocId2, DocId3, ...]

2. Lucene Index File Structure

File SuffixMeaning
.timTerms Dictionary, stores Term and Postings pointers
.tipTerms Index, FST structure, stores Term prefix and Block positions
.docStores DocId and term frequency

3. FST (Finite State Transducer)

FST is implementation structure of Term Index:

  • Pros: Can be fully loaded into memory, millisecond-level positioning
  • Principle: Uses directed acyclic graph to store Term prefix sharing

4. SkipList (Skip List)

Used to accelerate intersection calculation of multiple Posting Lists:

  • Time complexity reduced from O(N) to O(logN)
  • Reduces invalid comparisons

5. Query Flow

  1. Quickly locate Term through Terms Index (FST)
  2. Get Postings pointer from Terms Dictionary
  3. Read Postings
  4. Merge multiple Posting Lists using SkipList

6. Summary

Lucene segment immutable + flush/merge write model determines ES near real-time (NRT) characteristic: fast write, but queries need to merge results from multiple segments.