Elasticsearch Inverted Index & Read/Write Process Full An...

Overview

This article deeply analyzes Elasticsearch’s inverted index principle based on Lucene, and document read/write flow.

1. Inverted Index Core Concepts

1.1 What is Inverted Index

Inverted index is a data structure used to quickly find documents containing specific vocabulary. It’s like a book’s index page, but structure is “inverted”, hence the name.

1.2 Forward Index vs Inverted Index

Forward Index

Index structure based on documents, data organized around documents
Storage: Create an entry for each document, record all terms contained in that document and their positions
Typical implementation: DocumentID → [term1:position list, term2:position list,…]
Application scenarios: Document content retrieval, document similarity calculation, document storage layer of search engine

Inverted Index

Index structure based on vocabulary, data organized around terms
Storage: Create an entry for each term, record all documents containing that term and their positions
Typical implementation: Term → [DocumentID1:position list, DocumentID2:position list,…]
Core advantages: Quickly locate documents containing specific term, support efficient boolean queries (AND/OR/NOT), facilitate relevance sorting

For example, suppose there are three documents:

“Elasticsearch 是一个分布式搜索引擎”
“分布式系统可以提供高可用性”
“搜索引擎使用倒排索引进行高效搜索”

Forward index records what terms are in each document:

Doc1: ["Elasticsearch", "是", "一个", "分布式", "搜索", "引擎"]
Doc2: ["分布式", "系统", "可以", "提供", "高", "可用性"]
Doc3: ["搜索", "引擎", "使用", "倒排索引", "进行", "高效", "搜索"]

Inverted index records which documents each term appears in:

"分布式": [1, 2]
"搜索": [1, 3]
"引擎": [1, 3]
"倒排索引": [3]

1.3 Inverted Index Construction

In Elasticsearch, documents are first analyzed and processed, then generate inverted index. Process is roughly:

Document tokenization: Each document goes through analyzer processing, analyzer responsible for decomposing document text into terms
Normalization and filtering: After tokenization, analyzer usually does further processing, like converting all terms to lowercase, removing stop words, etc.
Build inverted list: Inverted index associates each term with document IDs it appears in

1.4 Inverted Index Structure

Core parts of inverted index can be divided into:

Term Dictionary: Stores all indexed terms, usually stored in dictionary form
Posting List: For each term, posting list records all documents containing that term, can also include:
- Term Frequency (TF): Records number of times term appears in document
- Document Frequency (DF): Records total number of documents term appears in across index
- Position: Term’s position in document, used to support phrase and proximity queries

1.5 Inverted Index Query Principle

Find term: Elasticsearch first looks up query term in Term Dictionary, finds their posting lists
Merge posting lists: Merge posting lists of two terms, find documents containing all query terms
Calculate relevance: Score documents according to TF-IDF or BM25 algorithm

1.6 Inverted Index Advantages

Fast retrieval: Extremely fast retrieval for a term, especially in massive documents, query performance can stay at millisecond level
Low memory usage: Through compression algorithms and optimized data structures, maintain low memory usage
Support complex queries: Not only supports simple keyword queries, but also supports complex search conditions like phrase, prefix, proximity queries

2. Read/Write Flow

2.1 Create Document

When adding documents to Elasticsearch, since ES is a distributed cluster, underlying design has an index composed of many Shards, so need to determine which shard the document belongs to. Rule is:

shard = hash(routing) % number_of_primary_shards

routing is a variable value, default is document’s ID, can also be set to a custom value
routing generates a number through hash function, then this number divided by number_of_primary_shards (number of primary shards) to get remainder

2.2 Write Document Flow

Write operations must be completed on primary shard first, then can be copied to other nodes as replica shards. Create, index and delete requests are all write operations.

Client sends write request to Master, that node acts as coordinating node
Determine shard based on document’s _id, coordinating node forwards request to primary shard of that node
Execute request on primary shard,成功后, primary shard based on replica count forwards request in parallel to corresponding replica shard nodes
Once replica shards complete execution, all report success to primary shard, primary shard reports success to coordinating node, coordinating node reports success to client

2.3 Read Document Flow

A search request must query some replica of all shards in the queried index. One retrieval flow mainly has two phases:

Query Phase

Client sends Search request to coordinating node
Coordinating node forwards query request to each primary or replica shard in the index
Each shard executes query locally, uses local Term/Document Frequency information for scoring, adds results to local priority queue of size from+size
Each shard returns all document IDs and sorting values from their priority queue to coordinating node, coordinating node merges these values into its own priority queue, produces globally sorted list

Fetch Phase

Coordinating node sends GET request to relevant nodes
Nodes where shards are located return data to coordinating node
Coordinating node waits for all document data to be retrieved, then returns to client

3. Error Quick Reference

Symptom	Root Cause Location	Fix Keywords
Can write, but search can’t find or results few	Tokenizer inconsistency: different analyzer used at indexing vs query time	Use _analyze API to see actual terms generated at indexing and query
Query only hits part of documents, relevance sorting “very strange”	TF/DF in inverted index and BM25 scoring mechanism not understood	Use explain API to see scoring composition of single hit document
Write then immediately search, can’t find new document in short time	Not understanding refresh mechanism: there’s refresh period between indexing to inverted index	Check index’s refresh_interval config
Some shards have obviously higher QPS, cluster load uneven	Improper routing design, causing documents concentrated on few shards	Reasonably choose routing field, avoid low cardinality fields
Overall search becomes slow, CPU usage high	Inverted index expansion: too many fields, text field abuse	Set index: false for fields only used for display
Phrase/proximity query effect poor	Position info not enabled or not correctly used	Use match_phrase/span series queries
Cluster occasional write failure or “version conflict” exception	Frequent updates to same document, not understanding primary shard + replica async replication and optimistic locking version mechanism	Reasonably design update strategy, reduce meaningless full overwrite updates
Search results across pages deeper and slower	Using high from + size deep pagination, causing inverted result sets on multiple shards to repeatedly merge	Use search_after or scroll/cursor query for business deep pagination

4. Summary

Inverted index is the core data structure for Elasticsearch to achieve millisecond-level full-text search. By understanding inverted index principle, shard routing mechanism and Query/Fetch two-phase search flow, can better optimize Elasticsearch query performance, design appropriate tokenization strategies and solve search-related issues.