Overview
This article deeply analyzes Elasticsearch’s inverted index principle based on Lucene, and document read/write flow.
1. Inverted Index Core Concepts
1.1 What is Inverted Index
Inverted index is a data structure used to quickly find documents containing specific vocabulary. It’s like a book’s index page, but structure is “inverted”, hence the name.
1.2 Forward Index vs Inverted Index
Forward Index
- Index structure based on documents, data organized around documents
- Storage: Create an entry for each document, record all terms contained in that document and their positions
- Typical implementation: DocumentID → [term1:position list, term2:position list,…]
- Application scenarios: Document content retrieval, document similarity calculation, document storage layer of search engine
Inverted Index
- Index structure based on vocabulary, data organized around terms
- Storage: Create an entry for each term, record all documents containing that term and their positions
- Typical implementation: Term → [DocumentID1:position list, DocumentID2:position list,…]
- Core advantages: Quickly locate documents containing specific term, support efficient boolean queries (AND/OR/NOT), facilitate relevance sorting
For example, suppose there are three documents:
- “Elasticsearch 是一个分布式搜索引擎”
- “分布式系统可以提供高可用性”
- “搜索引擎使用倒排索引进行高效搜索”
Forward index records what terms are in each document:
Doc1: ["Elasticsearch", "是", "一个", "分布式", "搜索", "引擎"]
Doc2: ["分布式", "系统", "可以", "提供", "高", "可用性"]
Doc3: ["搜索", "引擎", "使用", "倒排索引", "进行", "高效", "搜索"]
Inverted index records which documents each term appears in:
"分布式": [1, 2]
"搜索": [1, 3]
"引擎": [1, 3]
"倒排索引": [3]
1.3 Inverted Index Construction
In Elasticsearch, documents are first analyzed and processed, then generate inverted index. Process is roughly:
- Document tokenization: Each document goes through analyzer processing, analyzer responsible for decomposing document text into terms
- Normalization and filtering: After tokenization, analyzer usually does further processing, like converting all terms to lowercase, removing stop words, etc.
- Build inverted list: Inverted index associates each term with document IDs it appears in
1.4 Inverted Index Structure
Core parts of inverted index can be divided into:
- Term Dictionary: Stores all indexed terms, usually stored in dictionary form
- Posting List: For each term, posting list records all documents containing that term, can also include:
- Term Frequency (TF): Records number of times term appears in document
- Document Frequency (DF): Records total number of documents term appears in across index
- Position: Term’s position in document, used to support phrase and proximity queries
1.5 Inverted Index Query Principle
- Find term: Elasticsearch first looks up query term in Term Dictionary, finds their posting lists
- Merge posting lists: Merge posting lists of two terms, find documents containing all query terms
- Calculate relevance: Score documents according to TF-IDF or BM25 algorithm
1.6 Inverted Index Advantages
- Fast retrieval: Extremely fast retrieval for a term, especially in massive documents, query performance can stay at millisecond level
- Low memory usage: Through compression algorithms and optimized data structures, maintain low memory usage
- Support complex queries: Not only supports simple keyword queries, but also supports complex search conditions like phrase, prefix, proximity queries
2. Read/Write Flow
2.1 Create Document
When adding documents to Elasticsearch, since ES is a distributed cluster, underlying design has an index composed of many Shards, so need to determine which shard the document belongs to. Rule is:
shard = hash(routing) % number_of_primary_shards
- routing is a variable value, default is document’s ID, can also be set to a custom value
- routing generates a number through hash function, then this number divided by number_of_primary_shards (number of primary shards) to get remainder
2.2 Write Document Flow
Write operations must be completed on primary shard first, then can be copied to other nodes as replica shards. Create, index and delete requests are all write operations.
- Client sends write request to Master, that node acts as coordinating node
- Determine shard based on document’s _id, coordinating node forwards request to primary shard of that node
- Execute request on primary shard,成功后, primary shard based on replica count forwards request in parallel to corresponding replica shard nodes
- Once replica shards complete execution, all report success to primary shard, primary shard reports success to coordinating node, coordinating node reports success to client
2.3 Read Document Flow
A search request must query some replica of all shards in the queried index. One retrieval flow mainly has two phases:
Query Phase
- Client sends Search request to coordinating node
- Coordinating node forwards query request to each primary or replica shard in the index
- Each shard executes query locally, uses local Term/Document Frequency information for scoring, adds results to local priority queue of size from+size
- Each shard returns all document IDs and sorting values from their priority queue to coordinating node, coordinating node merges these values into its own priority queue, produces globally sorted list
Fetch Phase
- Coordinating node sends GET request to relevant nodes
- Nodes where shards are located return data to coordinating node
- Coordinating node waits for all document data to be retrieved, then returns to client
3. Error Quick Reference
| Symptom | Root Cause Location | Fix Keywords |
|---|---|---|
| Can write, but search can’t find or results few | Tokenizer inconsistency: different analyzer used at indexing vs query time | Use _analyze API to see actual terms generated at indexing and query |
| Query only hits part of documents, relevance sorting “very strange” | TF/DF in inverted index and BM25 scoring mechanism not understood | Use explain API to see scoring composition of single hit document |
| Write then immediately search, can’t find new document in short time | Not understanding refresh mechanism: there’s refresh period between indexing to inverted index | Check index’s refresh_interval config |
| Some shards have obviously higher QPS, cluster load uneven | Improper routing design, causing documents concentrated on few shards | Reasonably choose routing field, avoid low cardinality fields |
| Overall search becomes slow, CPU usage high | Inverted index expansion: too many fields, text field abuse | Set index: false for fields only used for display |
| Phrase/proximity query effect poor | Position info not enabled or not correctly used | Use match_phrase/span series queries |
| Cluster occasional write failure or “version conflict” exception | Frequent updates to same document, not understanding primary shard + replica async replication and optimistic locking version mechanism | Reasonably design update strategy, reduce meaningless full overwrite updates |
| Search results across pages deeper and slower | Using high from + size deep pagination, causing inverted result sets on multiple shards to repeatedly merge | Use search_after or scroll/cursor query for business deep pagination |
4. Summary
Inverted index is the core data structure for Elasticsearch to achieve millisecond-level full-text search. By understanding inverted index principle, shard routing mechanism and Query/Fetch two-phase search flow, can better optimize Elasticsearch query performance, design appropriate tokenization strategies and solve search-related issues.