TL;DR
- Scenario: Documents can’t be searched immediately after a write, data loss is feared on node restart, and “near real-time” is known only as a label, without understanding how ES guarantees performance and reliability underneath.
- Conclusion: Segments are immutable, Refresh is responsible for “searchable”, Flush + Translog are responsible for “recoverable”; together this pipeline defines Elasticsearch’s performance ceiling and data-safety boundary.
- Output: The full pipeline from index write to NRT search, tuning ideas for refresh_interval, flush, translog.durability and other key parameters, plus a quick-reference table of common failure patterns.
Version Matrix
| Version Range | Verified | Note |
|---|---|---|
| Elasticsearch 7.x | Partial | Write/Refresh/Flush/Translog principles match this article; some APIs differ slightly |
| Elasticsearch 8.x (2025) | Partial | Core mechanisms are the same; security and default configuration should follow the 8.x official docs |
| Elasticsearch 6.x and below | No | Applicable only at the principle level; specific parameter names and defaults may differ |
Index Document Write and Near Real-time Search Principles
Basic Concepts
Segments in Lucene
The basic storage unit in Elasticsearch is the shard: an ES index may be split into multiple shards, and each shard is actually a Lucene index, which in turn consists of multiple segments. A segment is essentially a self-contained collection of inverted indexes. Every newly created document goes into a new segment; existing segments are never modified. Likewise, deleting a document only marks it as deleted in its segment (a tombstone); it is not physically removed until a segment merge. In this sense an ES index is an abstraction over a set of immutable segments.
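The segment model above can be sketched as a small toy in Python. This is purely illustrative (my own simplification, not Lucene’s actual data structures): new documents always land in a new segment, and a delete only records a tombstone.

```python
# Toy model of immutable segments (illustrative only, not Lucene's real structures).
class Segment:
    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> content; never mutated after creation
        self.deleted = set()     # tombstones: ids marked deleted, not removed

class ToyShard:
    def __init__(self):
        self.segments = []

    def index(self, docs):
        # New documents always produce a new segment; old segments stay untouched.
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        # Deletion only marks the doc in whichever segment holds it.
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def search(self, doc_id):
        for seg in self.segments:
            if doc_id in seg.docs and doc_id not in seg.deleted:
                return seg.docs[doc_id]
        return None

shard = ToyShard()
shard.index({1: "hello"})
shard.index({2: "world"})   # a second write -> a second segment
shard.delete(1)             # tombstone only; segment 0 still holds doc 1
assert len(shard.segments) == 2
assert shard.search(1) is None and shard.search(2) == "world"
assert 1 in shard.segments[0].docs   # physically present until a merge
```

A real merge would later rewrite segments and drop tombstoned documents for good; the toy stops before that step.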
Translog: a Write-Ahead Log (WAL), as in HBase
Write-ahead logging: when a new document is indexed, it is first written to both the in-memory buffer and the translog file; each shard has its own translog.
Refresh In Elasticsearch
In Elasticsearch, the _refresh operation runs once per second by default: data in the memory buffer is written to a new segment, at which point the documents become searchable, and the buffer is then cleared.
Flush In Elasticsearch
A flush writes all data in the memory buffer to a new segment, fsyncs all in-memory segments to disk, and truncates the translog.
Near Real-time Search
Basic Flow
Elasticsearch write flow: when a write request reaches Elasticsearch, ES writes the data to the memory buffer and appends it to the transaction log (translog). If every document were written straight to disk after landing in memory, the writes would necessarily be discrete, making them random disk writes; random writes are slow and would seriously degrade ES performance.
Therefore ES places the operating system’s filesystem cache (FileSystemCache) between the memory buffer and the disk to improve write efficiency.
When a write request reaches ES, the data is written to the memory buffer, where it cannot yet be queried. Under default settings, ES refreshes the data from the memory buffer into the Linux filesystem cache every second and clears the buffer; from that point the written data can be queried.
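The write path above can be condensed into a minimal Python sketch (a toy model of my own, not ES internals): every write goes to both the buffer and the translog, and documents only become searchable once a refresh turns the buffer into a segment.

```python
# Toy write path (illustrative): index -> memory buffer + translog; refresh -> searchable segment.
class ToyShard:
    def __init__(self):
        self.buffer = []     # in-memory buffer: not yet searchable
        self.translog = []   # append-only operation log (the WAL)
        self.segments = []   # searchable segments

    def index(self, doc):
        # Every write goes to the buffer AND is appended to the translog.
        self.buffer.append(doc)
        self.translog.append(("index", doc))

    def refresh(self):
        # Buffer contents become a new searchable segment; buffer is cleared.
        # Note: a refresh does NOT clear the translog.
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def search(self, doc):
        return any(doc in seg for seg in self.segments)

shard = ToyShard()
shard.index("doc-1")
assert not shard.search("doc-1")   # written but not yet refreshed -> invisible
shard.refresh()
assert shard.search("doc-1")       # visible after refresh
assert len(shard.translog) == 1    # translog survives the refresh
```

The window between `index` and `refresh` is exactly the “near real-time” gap described above: by default at most one second.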
Refresh API
In Elasticsearch, the lightweight process of writing and opening a new segment is called a refresh; by default each shard refreshes automatically once per second. This is why Elasticsearch is called “near” real-time search: document changes are not visible to search immediately, but become visible within one second.
This behavior can confuse new users: they index a document, search for it right away, and find nothing. The solution is to trigger a manual refresh via the Refresh API:
POST /_refresh
POST /my_blogs/_refresh
POST /my_blogs/_doc/1?refresh
{"xxx": "xxx"}
PUT /test/_doc/2?refresh=true
{"xxx": "xxx"}
- Refresh all indexes
- Refresh only the my_blogs index
- Index a single document and refresh immediately so it is searchable
Not every situation needs a refresh every second. If you are indexing a large number of documents and care more about indexing speed than near-real-time search, you can raise refresh_interval to lower the refresh frequency.
PUT /my_logs
{
"settings": {
"refresh_interval": "30s"
}
}
refresh_interval can be updated dynamically on an existing index. In a production environment, while building a large index you can first turn off automatic refresh and turn it back on when the index goes into service.
PUT /my_logs/_settings
{
"refresh_interval": "-1"
}
PUT /my_logs/_settings
{
"refresh_interval": "1s"
}
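Why a larger refresh_interval helps bulk imports can be shown with a back-of-the-envelope simulation (a toy of my own, not ES code): each refresh that finds buffered documents produces one new segment, so refreshing less often yields far fewer, larger segments and less merge pressure.

```python
# Toy illustration: segments created during a bulk import under different
# refresh intervals. Each refresh that finds buffered docs makes one segment.
def simulate(num_docs, docs_per_second, refresh_interval_s):
    buffer, segments = 0, 0
    seconds = num_docs // docs_per_second
    for t in range(1, seconds + 1):
        buffer += docs_per_second
        if refresh_interval_s > 0 and t % refresh_interval_s == 0:
            if buffer:
                segments += 1   # refresh turns the buffer into one new segment
                buffer = 0
    if buffer:                  # a final refresh when the import finishes
        segments += 1
    return segments

# 60,000 docs arriving at 1,000 docs/s:
assert simulate(60_000, 1_000, 1) == 60    # refresh_interval=1s  -> 60 segments
assert simulate(60_000, 1_000, 30) == 2    # refresh_interval=30s -> 2 segments
assert simulate(60_000, 1_000, -1) == 1    # refresh disabled     -> 1 segment at the end
```

The absolute numbers are made up; the point is the ratio: per-second refresh during an import multiplies the number of small segments that merges must later clean up.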
Persist Changes
Basic Flow
Persisting changes with flush: even though per-second refresh gives us near-real-time search, we still need periodic full commits to guarantee recovery from failures. But what about documents changed between two commits? We don’t want to lose that data.
For this, Elasticsearch added the translog (transaction log), which records every operation. With the translog, the whole flow is:
Step 1: After a document is indexed, it is added to the memory buffer and appended to the translog.
Step 2: A refresh (once per second by default) leaves the shard in the following state:
- Documents in memory buffer are written to a new segment without fsync
- This segment is opened to make it searchable
- Memory buffer is cleared
After a refresh completes, the buffer is cleared but the translog is not.
Step 3: This process repeats: more documents are added to the memory buffer and appended to the translog, which keeps accumulating operations.
Step 4: Periodically, e.g. when the translog grows too large, the index is flushed: a new translog is created and a full commit is performed:
- All documents in the memory buffer are written to a new segment
- The buffer is cleared
- A commit point is written to disk
- The filesystem cache is flushed to disk via fsync
- The old translog is deleted
The translog provides a persistent record of all operations not yet flushed to disk. When Elasticsearch starts, it uses the last commit point to recover the existing segments and then replays every operation in the translog recorded after that commit.
The translog is also used for real-time CRUD by ID. When you retrieve, update, or delete a document by its ID, ES first checks the translog for recent changes before reading from the segments, so it always returns the latest version of the document in real time. After a flush, the segments are fully committed and the translog is truncated.
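The flush-and-recover cycle can be sketched as a toy in Python (again my own simplification, not ES code): a flush commits the buffer to a segment and truncates the translog, and recovery restores committed segments then replays everything left in the translog.

```python
# Toy flush + crash recovery (illustrative, not Elasticsearch internals).
class ToyShard:
    def __init__(self):
        self.segments = []   # "on disk" as of the last commit point
        self.translog = []   # operations since the last commit
        self.buffer = []

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(("index", doc))

    def flush(self):
        # Full commit: buffer -> new segment, commit point written, translog truncated.
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()
        self.translog.clear()

def recover(committed_segments, translog):
    # On startup: restore segments from the last commit point,
    # then replay every translog operation recorded after it.
    shard = ToyShard()
    shard.segments = [list(s) for s in committed_segments]
    for op, doc in translog:
        if op == "index":
            shard.index(doc)
    return shard

shard = ToyShard()
shard.index("a")
shard.flush()             # "a" is committed; translog is now empty
shard.index("b")          # "b" lives only in the buffer + translog
recovered = recover(shard.segments, shard.translog)   # simulate a restart
assert recovered.segments == [["a"]]
assert ("index", "b") in recovered.translog           # "b" was not lost
```

This is also why a long translog slows recovery, as discussed below: every uncommitted operation must be replayed one by one.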
flush API
In Elasticsearch, performing a commit and truncating the translog is called a flush. Shards are flushed automatically every 30 minutes, or when the translog grows too large (512 MB by default).
flush API can be used to execute a manual flush:
POST /blogs/_flush
POST /_flush?wait_if_ongoing
- Flush blogs index
- Flush all indexes, waiting for any ongoing flush to complete before returning
We rarely need to manually execute a flush, usually auto flush is sufficient.
That said, flushing before restarting a node or closing an index is good for it: when Elasticsearch recovers or reopens an index, it must replay all operations in the translog, so the shorter the log, the faster the recovery.
Translog Safety
How safe is the translog?
The purpose of the translog is to ensure operations are not lost, which raises a question: how safe is the translog itself?
Anything written to a file is lost after a restart unless it has been fsync’d to disk. By default, the translog is fsync’d on every write request, on both primary and replica shards, which means the client does not receive a 200 OK until the operation has been fsync’d to the translog of the primary and every replica. Fsyncing after each request costs some performance, although in practice the cost is modest (especially for bulk imports, which amortize one fsync across many documents in a single request).
For some high-volume clusters where losing a few seconds of data is not a serious problem, asynchronous fsync can still be beneficial: for example, writes are buffered in memory and the translog is fsync’d every 5 seconds.
This behavior is enabled by setting the durability parameter to async:
PUT /my_index/_settings
{
"index.translog.durability": "async",
"index.translog.sync_interval": "5s"
}
This option can be set per index and changed dynamically. If you decide to use an async translog, make sure you can tolerate losing up to sync_interval’s worth of data when a crash occurs; understand this trade-off before enabling it.
If you are unsure about the consequences, keep the default "index.translog.durability": "request" to avoid data loss.
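The difference between the two durability modes can be made concrete with a toy crash simulation (my own sketch, not ES code): under request durability every acknowledged operation has been fsync’d; under async durability a crash can lose up to one sync_interval of operations.

```python
# Toy contrast (illustrative) between translog durability modes:
#  - "request": fsync on every operation before acknowledging -> nothing acked is lost
#  - "async":   fsync once per sync_interval -> a crash can lose up to that window
def run(ops, durability, sync_interval=5):
    disk = []      # fsync'd translog entries: these survive a crash
    pending = []   # entries only in memory / OS buffers: lost on crash
    for i, op in enumerate(ops, start=1):
        pending.append(op)
        if durability == "request" or i % sync_interval == 0:
            disk.extend(pending)   # fsync flushes everything pending to disk
            pending.clear()
    return disk    # crash happens now: only fsync'd entries remain

ops = [f"op-{i}" for i in range(1, 8)]   # 7 operations, then a crash
assert len(run(ops, "request")) == 7     # durability=request: nothing lost
assert len(run(ops, "async")) == 5       # durability=async: ops 6-7 lost
```

Here `sync_interval=5` counts operations rather than seconds for simplicity; in ES, index.translog.sync_interval is a time value (e.g. "5s"), but the loss window works the same way.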
Error Quick Reference
| Symptom | Root Cause | Where to Look | Fix |
|---|---|---|---|
| Write returns 200/201, but the document can’t be searched immediately | Near-real-time semantics: refresh_interval is 1s and the refresh hasn’t fired yet | Check refresh_interval in the index settings; monitor refresh frequency | On the critical path, call _refresh (or use ?refresh=wait_for) or shorten refresh_interval; during batch writes, do the opposite: increase or disable it |
| Writes from the last few seconds lost after a node restart | translog in async mode: data within sync_interval was not fsync’d | Check index.translog.durability and index.translog.sync_interval | For critical business use durability=request or shorten sync_interval; run _flush manually before important changes |
| Cluster recovery or shard restart takes unusually long | No flush for a long time, translog too large, recovery must replay many operations | Check the index’s translog size and time spent in RECOVERING | Adjust the flush strategy; run _flush during low-traffic periods; if necessary, use rolling indexes to cap single-index size |
| Write throughput low, disk IO continuously high | refresh_interval too small, or frequent manual _refresh/_flush calls | Monitor refreshes/flushes per second and segment creation rate | Set refresh_interval=-1 during the import phase and restore it afterwards; remove unnecessary manual _refresh/_flush calls |
| Disk usage grows fast, segment count explodes | Frequent deletes/updates, heavy merge pressure, poor flush/rollover strategy | Inspect segment count and sizes via _cat/segments and store size | Use rolling indexes to manage cold data, periodically force-merge read-only indexes, optimize delete/update patterns |