TL;DR

  • Scenario: documents can’t be searched immediately after a write, data loss is feared on node restart, and “near real-time” is known only as a slogan, without understanding how ES balances performance and reliability underneath.
  • Conclusion: segments are immutable; Refresh makes data “searchable”; Flush plus the Translog makes data “recoverable”. Together this pipeline defines Elasticsearch’s performance ceiling and its data-safety boundary.
  • Output: the full pipeline from index write to NRT search, tuning ideas for refresh_interval, flush, index.translog.durability and other key parameters, and a quick-reference table of common failure patterns.

Version Matrix

| Version Range | Verified | Note |
| --- | --- | --- |
| Elasticsearch 7.x | Partial | Write/Refresh/Flush/Translog principles match this article; APIs differ slightly |
| Elasticsearch 8.x (2025) | Partial | Core mechanisms are the same; follow the 8.x official docs for security and default config |
| Elasticsearch 6.x and below | No | Applicable only at the level of principles; specific parameter names and defaults may differ |

Index Document Write and Near Real-time Search Principles

Basic Concepts

Segments in Lucene

As background: the basic unit of Elasticsearch storage is the shard. An ES index may be split into multiple shards, and each shard is in fact a Lucene index. Each Lucene index consists of multiple segments, and each segment is essentially a self-contained collection of inverted-index files. Every newly created document goes into a new segment; existing segments are never modified. Likewise, deleting a document only marks it as deleted in its segment; it is not physically removed until a later segment merge. An ES index is therefore best understood as an abstraction over these shards and segments.
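The append-only segment model can be sketched as a toy in pure Python (this is an illustration only, not real Lucene code; `Segment` and `Shard` are invented names):

```python
# Toy model (not real Lucene): a shard as a list of immutable segments.
# New documents always go into a new segment; deletes only mark a doc as
# dead (a tombstone), they never rewrite an existing segment.

class Segment:
    def __init__(self, docs):
        self._docs = dict(docs)   # doc_id -> content, frozen after creation
        self.deleted = set()      # tombstones: ids marked as deleted

    def get(self, doc_id):
        if doc_id in self.deleted:
            return None
        return self._docs.get(doc_id)

class Shard:
    def __init__(self):
        self.segments = []

    def write_segment(self, docs):
        # indexing never touches old segments -- it appends a new one
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        for seg in self.segments:
            if doc_id in seg._docs:
                seg.deleted.add(doc_id)   # mark only; space reclaimed at merge

    def get(self, doc_id):
        # newest segment wins, mirroring how later writes shadow earlier ones
        for seg in reversed(self.segments):
            if doc_id in seg._docs:
                return seg.get(doc_id)
        return None

shard = Shard()
shard.write_segment({1: "v1"})
shard.write_segment({2: "hello"})
shard.delete(1)
print(shard.get(1))   # None: tombstoned, not physically removed
print(shard.get(2))   # hello
```

Note that after the delete, both segments still exist unchanged; only the tombstone set grew. Real disk space is reclaimed when segments are merged.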

Translog: a WAL (Write-Ahead Log), Like HBase’s

Write-ahead logging: when a new document is indexed, it is first written to the in-memory buffer and appended to the translog file; each shard has its own translog.

Refresh In Elasticsearch

In Elasticsearch, the _refresh operation runs once per second by default. It writes the data in the memory buffer into a new segment, at which point that data becomes searchable, and then clears the memory buffer.

Flush In Elasticsearch

A flush writes everything remaining in the memory buffer into a new segment, fsyncs all in-memory segments to disk together with a commit point, and then clears the translog.


Basic Flow

Elasticsearch write flow: when a write request reaches Elasticsearch, ES writes the data into the memory buffer and appends it to the transaction log (translog). If every document were written straight to disk as soon as it entered memory, the writes would be scattered and therefore random; random disk writes are slow and would seriously degrade ES performance.

ES therefore places a high-speed cache, the filesystem cache, between the memory buffer and the disk to improve write efficiency.

When a write request arrives, ES writes the data into the memory buffer, where it cannot yet be queried. Under the default settings, every 1 second ES refreshes the data from the memory buffer into the Linux filesystem cache (as a new segment) and clears the memory buffer; from that point on the data can be queried.
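The buffer-then-refresh flow above can be sketched as a toy model (plain Python, invented names, not ES internals): writes land in a buffer that search cannot see, and only a refresh turns the buffer into a searchable segment.

```python
# Toy model of the write path: documents accumulate in an in-memory buffer
# and are invisible to search until a refresh opens them as a new segment.

class ToyShard:
    def __init__(self):
        self.memory_buffer = []   # writes accumulate here, not yet searchable
        self.segments = []        # searchable, immutable segments

    def index(self, doc):
        self.memory_buffer.append(doc)

    def refresh(self):
        # refresh: open the buffered docs as a new segment, clear the buffer
        if self.memory_buffer:
            self.segments.append(tuple(self.memory_buffer))
            self.memory_buffer.clear()

    def search(self, term):
        # only segments are visible to search; the buffer is not
        return [d for seg in self.segments for d in seg if term in d]

shard = ToyShard()
shard.index("elasticsearch refresh")
print(shard.search("refresh"))   # [] -- written, but not yet refreshed
shard.refresh()                  # in real ES this runs every 1s by default
print(shard.search("refresh"))   # ['elasticsearch refresh']
```

The gap between the two `search` calls is exactly the “near real-time” window: at most one refresh interval.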

Refresh API

In Elasticsearch, the lightweight process of writing and opening a new segment is called a refresh; by default each shard refreshes automatically once per second. This is why Elasticsearch is called a “near” real-time search engine: document changes are not visible to search immediately, but become visible within one second.

This behavior can confuse new users: they index a document, search for it, and can’t find it. The workaround is to trigger a manual refresh via the Refresh API:

POST /_refresh

POST /my_blogs/_refresh

POST /my_blogs/_doc/1?refresh
{"xxx": "xxx"}

PUT /test/_doc/2?refresh=true
{"xxx": "xxx"}
  • Refresh all indices
  • Refresh only the my_blogs index
  • Index a single document and refresh the affected shards immediately, so that document is searchable right away

Not every use case needs a refresh every second. If you are bulk-indexing a large number of documents with Elasticsearch, you may care more about indexing speed than near real-time search; in that case, lower the refresh frequency by raising refresh_interval:

PUT /my_logs
{
  "settings": {
    "refresh_interval": "30s"
  }
}

refresh_interval can be updated dynamically on an existing index. In production, when building a large index, you can turn off automatic refresh first and turn it back on once the index is ready for use:

PUT /my_logs/_settings
{
  "refresh_interval": -1
}

PUT /my_logs/_settings
{
  "refresh_interval": "1s"
}

Persist Changes

Basic Flow

Persisting changes with flush: even though per-second refresh gives us near real-time search, we still need periodic full commits so that we can recover from failures. But what about the documents changed between two commits? We do not want to lose that data.

Elasticsearch therefore keeps a translog, or transaction log, which records every operation. With the translog, the full flow is:

Step 1: after a document is indexed, it is added to the memory buffer and appended to the translog, so the new document lives in both the memory buffer and the transaction log.

Step 2: a refresh (once per second by default) leaves the shard in the following state:

  • Documents in memory buffer are written to a new segment without fsync
  • This segment is opened to make it searchable
  • Memory buffer is cleared

After a refresh completes, the buffer is cleared but the transaction log is not.

Step 3: the process continues: more documents are added to the memory buffer and appended to the transaction log, which keeps growing.

Step 4: periodically, e.g. when the translog grows too large, the index is flushed: a new translog is created and a full commit is executed.

  • All documents in memory buffer are written to a new Segment
  • Buffer cleared
  • A commit point written to disk
  • File system cache flushed via fsync
  • Old translog deleted

The translog provides a persistent record of all operations that have not yet been flushed to disk. When Elasticsearch starts, it uses the last commit point to recover the existing segments, then replays every operation in the translog that happened after that commit.

The translog also enables real-time CRUD. When you get, update, or delete a document by ID, Elasticsearch first checks the translog for recent changes before looking in the segments, so it always sees the latest version of the document in real time. After a flush, the segments are fully committed and the transaction log is truncated.
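The recovery behavior described above can be sketched as a toy model (plain Python with invented names, not real ES code): operations are logged before they are acknowledged, a flush commits segments and truncates the log, and a restart replays whatever the log still holds.

```python
# Toy model of translog-based recovery: every operation is appended to a log;
# a flush persists the buffered docs, records a commit, and truncates the log.
# On restart, only operations after the last commit need to be replayed.

class ToyNode:
    def __init__(self):
        self.disk_segments = []   # "fsynced" data: survives a crash
        self.memory_buffer = []   # lost on crash
        self.translog = []        # durable log of ops since the last flush

    def index(self, doc):
        self.memory_buffer.append(doc)
        self.translog.append(("index", doc))   # WAL: log before acking

    def flush(self):
        # full commit: persist buffered docs, then the old translog can go
        self.disk_segments.extend(self.memory_buffer)
        self.memory_buffer.clear()
        self.translog.clear()

    def crash_and_recover(self):
        # memory is lost; disk segments and the translog survive
        self.memory_buffer = []
        for op, doc in self.translog:   # replay everything after last commit
            if op == "index":
                self.memory_buffer.append(doc)

node = ToyNode()
node.index("doc-1")
node.flush()            # doc-1 committed, translog truncated
node.index("doc-2")     # doc-2 exists only in memory + translog
node.crash_and_recover()
print(node.disk_segments)   # ['doc-1']
print(node.memory_buffer)   # ['doc-2'] -- recovered by replaying the translog
```

The replay loop is also why a long-unflushed translog makes recovery slow: everything after the last commit must be re-applied one operation at a time.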

flush API

The act of executing a commit and truncating the translog is called a flush in Elasticsearch. Shards are flushed automatically every 30 minutes, or when the translog exceeds a size threshold (512 MB by default, index.translog.flush_threshold_size).

flush API can be used to execute a manual flush:

POST /blogs/_flush

POST /_flush?wait_if_ongoing
  • Flush the blogs index
  • Flush all indices, waiting for any in-progress flush to complete rather than failing

You rarely need to flush manually; the automatic flush is usually sufficient.

That said, flushing before restarting a node or closing an index is beneficial: when Elasticsearch recovers or reopens an index it must replay the operations in the translog, so the shorter the log, the faster the recovery.

Translog Security Issues

How safe is Translog?

The purpose of the translog is to ensure operations are never lost, which raises the question:

Anything written but not yet fsynced to disk is lost on restart. This applies to both primary and replica shards. By default, therefore, the client does not get a 200 OK response until the request has been fsynced to the translog of the primary and every replica shard. Fsyncing after each request costs some performance, although in practice the cost is modest (especially for bulk requests, which amortize one fsync across many documents).

For some high-volume clusters, however, losing a few seconds of data is acceptable, and asynchronous fsync can pay off: writes are buffered in memory and fsynced, say, every 5 seconds.

This behavior is enabled by setting the durability parameter to async:

PUT /my_index/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "5s"
}

This option can be set per index and changed dynamically. If you opt into an async translog, be sure you can tolerate losing sync_interval’s worth of data when a crash occurs; understand this trade-off before enabling it.

If you are unsure about the consequences, keep the default, "index.translog.durability": "request", to avoid data loss.
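The trade-off between the two durability modes can be sketched as a toy model (plain Python, invented names, not ES internals): with request durability the fsync happens before the acknowledgment, with async durability the acknowledgment comes first and a crash can eat acknowledged writes.

```python
# Toy contrast of translog durability modes: with "request", the operation is
# fsynced before the client gets its 200 OK; with "async", the ack comes first
# and the periodic fsync may not have run yet when the node crashes.

class ToyTranslog:
    def __init__(self, durability="request"):
        self.durability = durability
        self.in_memory = []   # appended but not yet fsynced
        self.on_disk = []     # survives a crash

    def fsync(self):
        self.on_disk.extend(self.in_memory)
        self.in_memory.clear()

    def write(self, op):
        self.in_memory.append(op)
        if self.durability == "request":
            self.fsync()      # fsync before acking: acked writes can't be lost
        # async mode: a background timer would call fsync() every sync_interval

    def crash(self):
        self.in_memory.clear()   # only fsynced operations survive

safe = ToyTranslog("request")
safe.write("doc-1")
safe.crash()
print(safe.on_disk)    # ['doc-1'] -- the acknowledged write survived

fast = ToyTranslog("async")
fast.write("doc-2")    # acknowledged, but the periodic fsync hasn't run yet
fast.crash()
print(fast.on_disk)    # [] -- acknowledged write lost: the async trade-off
```

This is why async mode only makes sense when losing up to sync_interval of writes is an accepted risk.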


Error Quick Reference

| Symptom | Root Cause | Where to Look | Fix |
| --- | --- | --- | --- |
| Write returns 201/200, but the document can’t be searched immediately | Near real-time semantics: refresh_interval is 1s and no refresh has run yet | Check refresh_interval in the index settings; monitor refresh frequency | On the critical path, call _refresh temporarily or shorten refresh_interval; during bulk writes do the opposite: raise it or turn it off |
| Writes from the last few seconds lost after a node restart | Translog in async mode; data within sync_interval was not fsynced | Check index.translog.durability and index.translog.sync_interval | For critical data use durability=request or shorten sync_interval; run _flush manually before important changes |
| Cluster recovery or shard restart takes unusually long | No flush for a long time, so a large translog must be replayed during recovery | Check translog size per index and time spent in RECOVERING | Tune the flush strategy; run _flush during low-traffic periods; if necessary use rolling indices to cap single-index size |
| Low write throughput, disk IO continuously high | refresh_interval too small, or frequent manual _refresh/_flush calls | Monitor refresh/flush rates and segment-creation frequency | Set refresh_interval=-1 during imports and restore it afterwards; drop unnecessary manual _refresh/_flush calls |
| Disk usage grows fast, segment count explodes | Frequent deletes/updates, heavy merge pressure, poor flush/rollover strategy | Inspect segment count and sizes via _cat/segments and store size | Manage cold data with rolling indices, periodically force-merge read-only indices, optimize delete/update patterns |