TL;DR

  • Scenario: documents can’t be searched immediately after a write, data loss is feared on node restart, and “near real-time” is known only as a slogan, without understanding how ES balances performance and reliability underneath.
  • Conclusion: segments are immutable; Refresh makes data “searchable”; Flush plus the Translog makes data “recoverable”. Together this pipeline defines Elasticsearch’s performance ceiling and its data-safety boundary.
  • Output: the full pipeline from index write to NRT search, tuning ideas for refresh_interval, flush, index.translog.durability and other key parameters, and a quick-reference table of common failure patterns.

Version Matrix

| Version Range | Verified | Note |
| --- | --- | --- |
| Elasticsearch 7.x | Partial | Write/Refresh/Flush/Translog principles match this article; APIs differ slightly |
| Elasticsearch 8.x (2025) | Partial | Core mechanisms are the same; follow the 8.x official docs for security and default config |
| Elasticsearch 6.x and below | No | Applicable only at the level of principles; specific parameter names and defaults may differ |

Index Document Write and Near Real-time Search Principles

Basic Concepts

Segments in Lucene

As background: the basic unit of Elasticsearch storage is the shard. An ES index may be split into multiple shards, and each shard is in fact a Lucene index. Each Lucene index consists of multiple segments, and each segment is essentially a self-contained collection of inverted-index files. Every newly created document goes into a new segment; existing segments are never modified. Likewise, deleting a document only marks it as deleted in its segment; it is not physically removed until a later segment merge. An ES index is therefore best understood as an abstraction over these shards and segments.
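The append-only segment model can be sketched as a toy in pure Python (this is an illustration only, not real Lucene code; `Segment` and `Shard` are invented names):

```python
# Toy model (not real Lucene): a shard as a list of immutable segments.
# New documents always go into a new segment; deletes only mark a doc as
# dead (a tombstone), they never rewrite an existing segment.

class Segment:
    def __init__(self, docs):
        self._docs = dict(docs)   # doc_id -> content, frozen after creation
        self.deleted = set()      # tombstones: ids marked as deleted

    def get(self, doc_id):
        if doc_id in self.deleted:
            return None
        return self._docs.get(doc_id)

class Shard:
    def __init__(self):
        self.segments = []

    def write_segment(self, docs):
        # indexing never touches old segments -- it appends a new one
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        for seg in self.segments:
            if doc_id in seg._docs:
                seg.deleted.add(doc_id)   # mark only; space reclaimed at merge

    def get(self, doc_id):
        # newest segment wins, mirroring how later writes shadow earlier ones
        for seg in reversed(self.segments):
            if doc_id in seg._docs:
                return seg.get(doc_id)
        return None

shard = Shard()
shard.write_segment({1: "v1"})
shard.write_segment({2: "hello"})
shard.delete(1)
print(shard.get(1))   # None: tombstoned, not physically removed
print(shard.get(2))   # hello
```

Note that after the delete, both segments still exist unchanged; only the tombstone set grew. Real disk space is reclaimed when segments are merged.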

Translog: a WAL (Write-Ahead Log), Like HBase’s

Write-ahead logging: when a new document is indexed, it is first written to the in-memory buffer and appended to the translog file; each shard has its own translog.

Refresh In Elasticsearch

In Elasticsearch, the _refresh operation runs once per second by default. It writes the data in the memory buffer into a new segment, at which point that data becomes searchable, and then clears the memory buffer.

Flush In Elasticsearch

A flush writes everything remaining in the memory buffer into a new segment, fsyncs all in-memory segments to disk together with a commit point, and then clears the translog.


Basic Flow

Elasticsearch write flow: when a write request reaches Elasticsearch, ES writes the data into the memory buffer and appends it to the transaction log (translog). If every document were written straight to disk as soon as it entered memory, the writes would be scattered and therefore random; random disk writes are slow and would seriously degrade ES performance.

ES therefore places a high-speed cache, the filesystem cache, between the memory buffer and the disk to improve write efficiency.

When a write request arrives, ES writes the data into the memory buffer, where it cannot yet be queried. Under the default settings, every 1 second ES refreshes the data from the memory buffer into the Linux filesystem cache (as a new segment) and clears the memory buffer; from that point on the data can be queried.
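The buffer-then-refresh flow above can be sketched as a toy model (plain Python, invented names, not ES internals): writes land in a buffer that search cannot see, and only a refresh turns the buffer into a searchable segment.

```python
# Toy model of the write path: documents accumulate in an in-memory buffer
# and are invisible to search until a refresh opens them as a new segment.

class ToyShard:
    def __init__(self):
        self.memory_buffer = []   # writes accumulate here, not yet searchable
        self.segments = []        # searchable, immutable segments

    def index(self, doc):
        self.memory_buffer.append(doc)

    def refresh(self):
        # refresh: open the buffered docs as a new segment, clear the buffer
        if self.memory_buffer:
            self.segments.append(tuple(self.memory_buffer))
            self.memory_buffer.clear()

    def search(self, term):
        # only segments are visible to search; the buffer is not
        return [d for seg in self.segments for d in seg if term in d]

shard = ToyShard()
shard.index("elasticsearch refresh")
print(shard.search("refresh"))   # [] -- written, but not yet refreshed
shard.refresh()                  # in real ES this runs every 1s by default
print(shard.search("refresh"))   # ['elasticsearch refresh']
```

The gap between the two `search` calls is exactly the “near real-time” window: at most one refresh interval.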

Refresh API

In Elasticsearch, the lightweight process of writing and opening a new segment is called a refresh; by default each shard refreshes automatically once per second. This is why Elasticsearch is called a “near” real-time search engine: document changes are not visible to search immediately, but become visible within one second.

This behavior can confuse new users: they index a document, search for it, and can’t find it. The workaround is to trigger a manual refresh via the Refresh API:

POST /_refresh

POST /my_blogs/_refresh

POST /my_blogs/_doc/1?refresh
{"xxx": "xxx"}

PUT /test/_doc/2?refresh=true
{"xxx": "xxx"}
  • Refresh all indices
  • Refresh only the my_blogs index
  • Index a single document and refresh the affected shards immediately, so that document is searchable right away

Not every use case needs a refresh every second. If you are bulk-indexing a large number of documents with Elasticsearch, you may care more about indexing speed than near real-time search; in that case, lower the refresh frequency by raising refresh_interval:

PUT /my_logs
{
  "settings": {
    "refresh_interval": "30s"
  }
}

refresh_interval can be updated dynamically on an existing index. In production, when building a large index, you can turn off automatic refresh first and turn it back on once the index is ready for use:

PUT /my_logs/_settings
{
  "refresh_interval": -1
}

PUT /my_logs/_settings
{
  "refresh_interval": "1s"
}

Persist Changes

Basic Flow

Persisting changes with flush: even though per-second refresh gives us near real-time search, we still need periodic full commits so that we can recover from failures. But what about the documents changed between two commits? We do not want to lose that data.

Elasticsearch therefore keeps a translog, or transaction log, which records every operation. With the translog, the full flow is:

Step 1: after a document is indexed, it is added to the memory buffer and appended to the translog, so the new document lives in both the memory buffer and the transaction log.

Step 2: a refresh (once per second by default) leaves the shard in the following state:

  • Documents in memory buffer are written to a new segment without fsync
  • This segment is opened to make it searchable
  • Memory buffer is cleared

After a refresh completes, the buffer is cleared but the transaction log is not.

Step 3: the process continues: more documents are added to the memory buffer and appended to the transaction log, which keeps growing.

Step 4: periodically, e.g. when the translog grows too large, the index is flushed: a new translog is created and a full commit is executed.

  • All documents in memory buffer are written to a new Segment
  • Buffer cleared
  • A commit point written to disk
  • File system cache flushed via fsync
  • Old translog deleted

The translog provides a persistent record of all operations that have not yet been flushed to disk. When Elasticsearch starts, it uses the last commit point to recover the existing segments, then replays every operation in the translog that happened after that commit.

The translog also enables real-time CRUD. When you get, update, or delete a document by ID, Elasticsearch first checks the translog for recent changes before looking in the segments, so it always sees the latest version of the document in real time. After a flush, the segments are fully committed and the transaction log is truncated.
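The recovery behavior described above can be sketched as a toy model (plain Python with invented names, not real ES code): operations are logged before they are acknowledged, a flush commits segments and truncates the log, and a restart replays whatever the log still holds.

```python
# Toy model of translog-based recovery: every operation is appended to a log;
# a flush persists the buffered docs, records a commit, and truncates the log.
# On restart, only operations after the last commit need to be replayed.

class ToyNode:
    def __init__(self):
        self.disk_segments = []   # "fsynced" data: survives a crash
        self.memory_buffer = []   # lost on crash
        self.translog = []        # durable log of ops since the last flush

    def index(self, doc):
        self.memory_buffer.append(doc)
        self.translog.append(("index", doc))   # WAL: log before acking

    def flush(self):
        # full commit: persist buffered docs, then the old translog can go
        self.disk_segments.extend(self.memory_buffer)
        self.memory_buffer.clear()
        self.translog.clear()

    def crash_and_recover(self):
        # memory is lost; disk segments and the translog survive
        self.memory_buffer = []
        for op, doc in self.translog:   # replay everything after last commit
            if op == "index":
                self.memory_buffer.append(doc)

node = ToyNode()
node.index("doc-1")
node.flush()            # doc-1 committed, translog truncated
node.index("doc-2")     # doc-2 exists only in memory + translog
node.crash_and_recover()
print(node.disk_segments)   # ['doc-1']
print(node.memory_buffer)   # ['doc-2'] -- recovered by replaying the translog
```

The replay loop is also why a long-unflushed translog makes recovery slow: everything after the last commit must be re-applied one operation at a time.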

flush API

The act of executing a commit and truncating the translog is called a flush in Elasticsearch. Shards are flushed automatically every 30 minutes, or when the translog exceeds a size threshold (512 MB by default, index.translog.flush_threshold_size).

flush API can be used to execute a manual flush:

POST /blogs/_flush

POST /_flush?wait_if_ongoing
  • Flush the blogs index
  • Flush all indices, waiting for any in-progress flush to complete rather than failing

You rarely need to flush manually; the automatic flush is usually sufficient.

That said, flushing before restarting a node or closing an index is beneficial: when Elasticsearch recovers or reopens an index it must replay the operations in the translog, so the shorter the log, the faster the recovery.

Translog Security Issues

How safe is Translog?

The purpose of the translog is to ensure operations are never lost, which raises the question:

Anything written but not yet fsynced to disk is lost on restart. This applies to both primary and replica shards. By default, therefore, the client does not get a 200 OK response until the request has been fsynced to the translog of the primary and every replica shard. Fsyncing after each request costs some performance, although in practice the cost is modest (especially for bulk requests, which amortize one fsync across many documents).

For some high-volume clusters, however, losing a few seconds of data is acceptable, and asynchronous fsync can pay off: writes are buffered in memory and fsynced, say, every 5 seconds.

This behavior is enabled by setting the durability parameter to async:

PUT /my_index/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "5s"
}

This option can be set per index and changed dynamically. If you opt into an async translog, be sure you can tolerate losing sync_interval’s worth of data when a crash occurs; understand this trade-off before enabling it.

If you are unsure about the consequences, keep the default, "index.translog.durability": "request", to avoid data loss.
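The trade-off between the two durability modes can be sketched as a toy model (plain Python, invented names, not ES internals): with request durability the fsync happens before the acknowledgment, with async durability the acknowledgment comes first and a crash can eat acknowledged writes.

```python
# Toy contrast of translog durability modes: with "request", the operation is
# fsynced before the client gets its 200 OK; with "async", the ack comes first
# and the periodic fsync may not have run yet when the node crashes.

class ToyTranslog:
    def __init__(self, durability="request"):
        self.durability = durability
        self.in_memory = []   # appended but not yet fsynced
        self.on_disk = []     # survives a crash

    def fsync(self):
        self.on_disk.extend(self.in_memory)
        self.in_memory.clear()

    def write(self, op):
        self.in_memory.append(op)
        if self.durability == "request":
            self.fsync()      # fsync before acking: acked writes can't be lost
        # async mode: a background timer would call fsync() every sync_interval

    def crash(self):
        self.in_memory.clear()   # only fsynced operations survive

safe = ToyTranslog("request")
safe.write("doc-1")
safe.crash()
print(safe.on_disk)    # ['doc-1'] -- the acknowledged write survived

fast = ToyTranslog("async")
fast.write("doc-2")    # acknowledged, but the periodic fsync hasn't run yet
fast.crash()
print(fast.on_disk)    # [] -- acknowledged write lost: the async trade-off
```

This is why async mode only makes sense when losing up to sync_interval of writes is an accepted risk.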


Error Quick Reference

| Symptom | Root Cause | Where to Look | Fix |
| --- | --- | --- | --- |
| Write returns 201/200, but the document can’t be searched immediately | Near real-time semantics: refresh_interval is 1s and no refresh has run yet | Check refresh_interval in the index settings; monitor refresh frequency | On the critical path, call _refresh temporarily or shorten refresh_interval; during bulk writes do the opposite: raise it or turn it off |
| Writes from the last few seconds lost after a node restart | Translog in async mode; data within sync_interval was not fsynced | Check index.translog.durability and index.translog.sync_interval | For critical data use durability=request or shorten sync_interval; run _flush manually before important changes |
| Cluster recovery or shard restart takes unusually long | No flush for a long time, so a large translog must be replayed during recovery | Check translog size per index and time spent in RECOVERING | Tune the flush strategy; run _flush during low-traffic periods; if necessary use rolling indices to cap single-index size |
| Low write throughput, disk IO continuously high | refresh_interval too small, or frequent manual _refresh/_flush calls | Monitor refresh/flush rates and segment-creation frequency | Set refresh_interval=-1 during imports and restore it afterwards; drop unnecessary manual _refresh/_flush calls |
| Disk usage grows fast, segment count explodes | Frequent deletes/updates, heavy merge pressure, poor flush/rollover strategy | Inspect segment count and sizes via _cat/segments and store size | Manage cold data with rolling indices, periodically force-merge read-only indices, optimize delete/update patterns |