When building voice interaction systems, latency is the core experience metric. If the system doesn’t respond within 2 seconds of the user finishing speaking, it feels sluggish; beyond 3 seconds it’s essentially unusable.

But the three-stage pipeline of ASR (speech recognition) → LLM (response generation) → TTS (speech synthesis) accumulates latency at every stage; without optimization the total easily exceeds 5 seconds.

This article documents my thinking and actual results when optimizing this pipeline.

First, Figure Out Where the Latency Is

The most intuitive approach is to measure each stage separately:

import time

# asr_recognize, llm_stream, tts_synthesize, and first_sentence are
# placeholders for your own client calls.
t0 = time.time()
# ASR: from the user finishing speaking to recognized text
asr_result = asr_recognize(audio_chunk)
t1 = time.time()

# LLM: From text input to first token output
first_token = next(llm_stream(asr_result))
t2 = time.time()

# TTS: From first sentence synthesis to audio playable
audio = tts_synthesize(first_sentence)
t3 = time.time()

print(f"ASR: {(t1-t0)*1000:.0f}ms")
print(f"LLM TTFT: {(t2-t1)*1000:.0f}ms")  # Time To First Token
print(f"TTS: {(t3-t2)*1000:.0f}ms")

My test environment (local GPU, calling cloud LLM API):

Stage                            Latency
ASR (FunASR, local)              300-500 ms
LLM TTFT (GPT-4o)                500-1200 ms
TTS first sentence (CosyVoice)   400-800 ms
Serial total                     1200-2500 ms

With the three stages run serially, the best case is about 1.2 seconds. That’s the ideal; in practice, network jitter and high LLM load make it slower.


Core Optimization: Pipeline Concurrency

Running the stages serially is the biggest waste. The core optimization: don’t wait for a stage to fully complete — as soon as it has produced enough output, pass that output downstream immediately.

ASR Streaming Recognition

Most ASR services support a streaming mode: recognition runs while the user is speaking, and each sentence is emitted as soon as it’s recognized instead of waiting for the full utterance.

async def stream_asr(audio_stream):
    async for chunk in audio_stream:
        result = await asr_client.recognize_streaming(chunk)
        if result.is_final:  # Recognized complete sentence
            yield result.text

This way, the first sentence is already being processed while the user is still speaking.

LLM Streaming Output + Sentence Splitting

LLM streaming output is token-level, but TTS needs complete sentences, so we split on sentence boundaries: as soon as a full sentence has accumulated, send it to TTS rather than waiting for the whole response to be generated.

import re

async def stream_llm_sentences(prompt: str):
    buffer = ""
    async for token in llm_client.stream(prompt):
        buffer += token
        # Detect sentence boundary: punctuation followed by space or newline
        sentences = re.split(r'(?<=[。!?.!?])\s*', buffer)
        if len(sentences) > 1:
            # At least one complete sentence
            for sentence in sentences[:-1]:
                if sentence.strip():
                    yield sentence.strip()
            buffer = sentences[-1]  # Keep unfinished sentence
    if buffer.strip():
        yield buffer.strip()
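The splitting logic is easy to check in isolation. Here’s a synchronous sketch of the same regex applied to an incoming token stream (split_sentences is an illustrative test helper, not part of the pipeline):

```python
import re

def split_sentences(tokens):
    """Accumulate tokens and emit complete sentences as they form."""
    buffer = ""
    out = []
    for token in tokens:
        buffer += token
        parts = re.split(r'(?<=[。!?.!?])\s*', buffer)
        if len(parts) > 1:
            # Everything but the last part is a complete sentence
            out.extend(p.strip() for p in parts[:-1] if p.strip())
            buffer = parts[-1]  # Keep the unfinished tail
    if buffer.strip():
        out.append(buffer.strip())
    return out

print(split_sentences(["Hel", "lo. How", " are you? I", "'m fine"]))
# → ['Hello.', 'How are you?', "I'm fine"]
```

Note that the lookbehind keeps the punctuation attached to its sentence, which matters for TTS prosody.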

TTS Async Synthesis

As each sentence arrives, kick off synthesis asynchronously and push the result onto a playback queue when it completes:

import asyncio

async def pipeline(user_audio):
    play_queue = asyncio.Queue()

    # ASR → LLM → TTS full pipeline.
    # Enqueue each sentence's synthesis *task* in sentence order, so
    # playback order is preserved even if a later sentence finishes first.
    async for asr_text in stream_asr(user_audio):
        async for sentence in stream_llm_sentences(asr_text):
            task = asyncio.create_task(tts_client.synthesize(sentence))
            await play_queue.put(task)

    await play_queue.put(None)  # Sentinel: no more sentences
    return play_queue

Result: with the stages pipelined, the time from the user finishing speaking to hearing the first response dropped from 2-5 seconds to 800 ms-1.5 seconds.
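One subtlety worth calling out: if every synthesis pushed its own finished audio onto the queue, a fast sentence could overtake a slow one and play out of order. Queuing the synthesis tasks themselves and awaiting them sequentially on the playback side preserves order. A self-contained sketch (all names here are illustrative; fake_tts simulates variable synthesis latency):

```python
import asyncio
import random

async def fake_tts(sentence: str) -> str:
    # Stand-in for real TTS: random latency, returns a fake audio label
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return f"<audio:{sentence}>"

async def producer(sentences, queue):
    for s in sentences:
        # Enqueue the task itself, in sentence order
        await queue.put(asyncio.create_task(fake_tts(s)))
    await queue.put(None)  # Sentinel: stream finished

async def consumer(queue):
    played = []
    while (task := await queue.get()) is not None:
        played.append(await task)  # Awaiting in order → playing in order
    return played

async def run_pipeline(sentences):
    queue = asyncio.Queue()
    prod = asyncio.create_task(producer(sentences, queue))
    played = await consumer(queue)
    await prod
    return played

print(asyncio.run(run_pipeline(["One.", "Two.", "Three."])))
# → ['<audio:One.>', '<audio:Two.>', '<audio:Three.>'] despite random latency
```

The queue also naturally provides backpressure: the consumer can fall behind without losing audio, and the sentinel cleanly signals end of stream.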


VAD Endpoint Detection Pitfalls

VAD (Voice Activity Detection) decides when the user has finished speaking, and that judgment directly affects both response speed and experience.

The problem: the VAD silence threshold is hard to tune.

  • Too short (200 ms): the user pauses mid-sentence and gets cut off, leaving the sentence incomplete
  • Too long (800 ms): the user has finished speaking but waits noticeably before the response starts, which feels laggy

My solution: Dynamic threshold

class AdaptiveVAD:
    def __init__(self):
        self.silence_threshold = 400  # Initial threshold, in ms

    def on_speech_end(self, duration_ms):
        # Longer speech allows longer pauses
        if duration_ms > 5000:
            self.silence_threshold = 600
        elif duration_ms > 2000:
            self.silence_threshold = 500
        else:
            self.silence_threshold = 350

Short questions (“what time is it”) get cut off after a short pause, while long narrations are allowed slightly longer pauses.
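To make the threshold concrete, here’s a hedged sketch of how it plugs into a frame-by-frame endpointing loop. It assumes 30 ms frames and a per-frame is_speech flag from your VAD model; Endpointer and the frame size are my own illustrative names and values:

```python
FRAME_MS = 30  # assumed frame size

class Endpointer:
    """Counts speech/silence frames and fires an endpoint once
    accumulated silence exceeds the (adaptive) threshold."""

    def __init__(self, silence_threshold_ms=400):
        self.silence_threshold_ms = silence_threshold_ms
        self.speech_ms = 0
        self.silence_ms = 0
        self.in_speech = False

    def process_frame(self, is_speech: bool) -> bool:
        """Returns True when an utterance endpoint is detected."""
        if is_speech:
            self.in_speech = True
            self.speech_ms += FRAME_MS
            self.silence_ms = 0
        elif self.in_speech:
            self.silence_ms += FRAME_MS
            if self.silence_ms >= self.silence_threshold_ms:
                # Endpoint: reset for the next utterance
                self.in_speech = False
                self.speech_ms = 0
                self.silence_ms = 0
                return True
        return False

ep = Endpointer(silence_threshold_ms=400)
frames = [True] * 10 + [False] * 20  # 300 ms of speech, then silence
fired_at = next(i for i, f in enumerate(frames) if ep.process_frame(f))
print(fired_at)  # → 23: fires on the first frame where silence reaches 420 ms ≥ 400 ms
```

AdaptiveVAD’s on_speech_end would simply update silence_threshold_ms between utterances based on self.speech_ms.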

Another pitfall: background noise. In noisy environments, simple energy-based detection produces many false positives; Silero VAD performs much better, and I recommend using it directly.


Practical Component Selection

Component      My Choice           Notes
ASR            FunASR (local)      Good Chinese accuracy, supports streaming, needs a GPU
ASR (backup)   Alibaba Cloud ASR   For when you don’t want to maintain a local service
LLM            GPT-4o / Claude     Switched per scenario through a unified routing layer
TTS            CosyVoice           Natural voice, supports cloning; slow, needs optimization
TTS (fast)     Edge TTS            Microsoft, free, low latency, average voice quality
VAD            Silero VAD          High accuracy, runs fine on CPU

Summary

Reducing voice interaction latency comes down to two things:

  1. Pipeline concurrency: send each sentence to the LLM as soon as ASR emits it, and to TTS as soon as the LLM emits it, so the three stages run in parallel rather than serially
  2. Sentence-level processing: don’t wait for the full text; the sentence is the smallest useful unit

After optimization, first-byte audio latency (from the user finishing speaking to hearing the first word) stays under 800 ms, and overall fluency went from “obviously laggy” to “basically natural”.

Next, I plan to test on-device ASR (a quantized whisper.cpp build) and faster TTS options to compress the pipeline further.