大数据-182 Elasticsearch 倒排索引底层拆解：Terms 字典、FST、SkipList 与 Lucene 索引文件

📚 大数据系列

2024/10/21

big-data elasticsearch

大数据-182 Elasticsearch 倒排索引底层拆解

1. 倒排索引核心概念

倒排索引由两部分组成：

Terms Dictionary（词典）：所有 Term 的集合，按字典序排列
Posting List（倒排表）：每个 Term 对应的文档列表

正排：DocId → Document
倒排：Term → [DocId1, DocId2, DocId3, ...]

2. Lucene 索引文件结构

文件后缀	含义
`.tim`	Terms Dictionary，存储 Term 及 Postings 指针
`.tip`	Terms Index，FST 结构，存储 Term 前缀和 Block 位置
`.doc`	存储 DocId 和词频

3. FST（有限状态转换器）

FST 是 Term Index 的实现结构：

优点：可以全部加载到内存，毫秒级定位
原理：用有向无环图存储 Term 前缀共享

4. SkipList（跳表）

用于加速多个 Posting List 的交集计算：

时间复杂度从 O(N) 降至 O(logN)
减少无效比较

5. 查询流程

通过 Terms Index（FST）快速定位 Term
从 Terms Dictionary 获取 Postings 指针
读取 Postings
使用 SkipList 合并多个 Posting List

6. 总结

Lucene 分段不可变 + flush/merge 的写入模型，决定了 ES 近实时（NRT）的特性：写入快，但查询需要 merge 多个段的结果。