When building voice interaction systems, latency is the core experience metric. If nothing comes back within 2 seconds of the user finishing speaking, the system feels laggy; beyond 3 seconds it is basically unusable.
But the three-stage pipeline of ASR (speech recognition) + LLM (response generation) + TTS (speech synthesis) accumulates latency at every stage, and without optimization it easily exceeds 5 seconds.
This article documents my thinking and the actual results from optimizing this pipeline.
First, Figure Out Where Latency Is
The most intuitive approach is to measure each segment separately:
import time
t0 = time.time()
# ASR: From user finishing speaking to text recognition
asr_result = asr_recognize(audio_chunk)
t1 = time.time()
# LLM: From text input to first token output
first_token = next(llm_stream(asr_result))
t2 = time.time()
# TTS: From first sentence synthesis to audio playable
audio = tts_synthesize(first_sentence)
t3 = time.time()
print(f"ASR: {(t1-t0)*1000:.0f}ms")
print(f"LLM TTFT: {(t2-t1)*1000:.0f}ms") # Time To First Token
print(f"TTS: {(t3-t2)*1000:.0f}ms")
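The ad-hoc `time.time()` calls above can be factored into a small reusable timer; a minimal sketch (the `time.sleep` stands in for a real stage call):

```python
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(name, results):
    """Record the wall-clock duration of one pipeline stage, in ms."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        results[name] = (time.perf_counter() - t0) * 1000

timings = {}
with stage_timer("asr", timings):
    time.sleep(0.02)  # stand-in for asr_recognize(audio_chunk)
print(f"ASR: {timings['asr']:.0f}ms")
```

Wrapping each stage this way keeps per-request timings in one dict, which is handy once you start logging percentiles instead of single runs.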
My test environment (local GPU, calling cloud LLM API):
| Stage | Latency |
|---|---|
| ASR (FunASR local) | 300-500ms |
| LLM TTFT (GPT-4o) | 500-1200ms |
| TTS first sentence (CosyVoice) | 400-800ms |
| Serial Total | 1200-2500ms |
With the three stages run serially, the best case is 1.2 seconds. That is the ideal; in practice, network jitter and high LLM load make it slower.
Core Optimization: Pipeline Concurrency
Serial execution is the biggest waste. The core optimization: don't wait for the previous stage to fully complete; pass data to the next stage as soon as there is enough output.
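A rough back-of-envelope model using the best-case measurements above shows where the win comes from (this ignores the time for the LLM to emit a full first sentence, and network jitter):

```python
# Best-case stage latencies in ms, taken from the measurements above
asr_final = 300   # ASR finalizes the text after the user stops speaking
llm_ttft = 500    # LLM time to first token
tts_first = 400   # TTS synthesis of the first sentence

# Serial: the stages run back to back after the end of speech
serial = asr_final + llm_ttft + tts_first

# Pipelined: streaming ASR runs while the user is still talking, so it
# largely drops off the critical path; TTS starts on the first complete
# sentence instead of the full reply
pipelined = llm_ttft + tts_first

print(serial, pipelined)  # 1200 900
```

The gap widens further in the serial case once you account for waiting on the LLM's full reply before TTS can start.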
ASR Streaming Recognition
Most ASR services support a streaming mode: recognition runs while the user is speaking, and each sentence is emitted as soon as it is recognized, instead of waiting for the whole utterance.
async def stream_asr(audio_stream):
    async for chunk in audio_stream:
        result = await asr_client.recognize_streaming(chunk)
        if result.is_final:  # a complete sentence was recognized
            yield result.text
This way, the first sentence is already being processed while the user is still speaking.
LLM Streaming Output + Sentence Splitting
LLM streaming output is token-level, but TTS needs complete sentences. So sentence splitting is needed: accumulate tokens into a sentence and send it to TTS, instead of waiting for the full reply to be generated.
import re

async def stream_llm_sentences(prompt: str):
    buffer = ""
    async for token in llm_client.stream(prompt):
        buffer += token
        # Split after sentence-ending punctuation (Chinese or Latin),
        # consuming any trailing whitespace
        sentences = re.split(r'(?<=[。!?.!?])\s*', buffer)
        if len(sentences) > 1:
            # At least one complete sentence is available
            for sentence in sentences[:-1]:
                if sentence.strip():
                    yield sentence.strip()
            buffer = sentences[-1]  # keep the unfinished tail
    if buffer.strip():
        yield buffer.strip()
TTS Async Synthesis
As soon as a sentence arrives, kick off synthesis asynchronously and push the result onto a playback queue when it completes:
import asyncio

async def pipeline(user_audio):
    play_queue = asyncio.Queue()
    tasks = []  # hold references so the tasks aren't garbage-collected mid-flight

    async def synthesize_and_enqueue(sentence):
        audio = await tts_client.synthesize(sentence)
        await play_queue.put(audio)

    # ASR → LLM → TTS full pipeline
    async for asr_text in stream_asr(user_audio):
        async for sentence in stream_llm_sentences(asr_text):
            tasks.append(asyncio.create_task(synthesize_and_enqueue(sentence)))
    return play_queue
The effect: with the pipeline running concurrently, the time from the user finishing speaking to hearing the first response dropped from 2-5 seconds to 800 ms-1.5 seconds.
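The pipeline only fills the queue; a playback side has to drain it. A minimal consumer sketch, using a `None` sentinel to signal completion (the fake sink here stands in for a real audio backend):

```python
import asyncio

async def playback_loop(play_queue, play_audio):
    """Drain synthesized audio from the queue and play it in arrival order."""
    while True:
        audio = await play_queue.get()
        if audio is None:          # sentinel: the pipeline is finished
            break
        await play_audio(audio)

async def demo():
    q = asyncio.Queue()
    played = []

    async def fake_play(chunk):   # stand-in for a real audio sink
        played.append(chunk)

    for chunk in ("sentence1.wav", "sentence2.wav", None):
        await q.put(chunk)
    await playback_loop(q, fake_play)
    return played

print(asyncio.run(demo()))  # ['sentence1.wav', 'sentence2.wav']
```

One caveat: because the TTS tasks complete in whatever order synthesis finishes, arrival order in the queue may not match sentence order; awaiting the tasks in sentence order, or tagging chunks with an index, preserves playback order.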
VAD Endpoint Detection Pitfalls
VAD (Voice Activity Detection) decides when the user has finished speaking, and this judgment directly affects both response speed and experience.
The problem: the VAD silence threshold is hard to tune.
- Too short (200 ms): the user pauses mid-sentence and gets cut off, leaving the sentence incomplete
- Too long (800 ms): the user finishes speaking but waits a long time before the response starts, which feels laggy
My solution: a dynamic threshold.
class AdaptiveVAD:
    def __init__(self):
        self.silence_threshold = 400  # initial threshold, in ms

    def on_speech_end(self, duration_ms):
        # Longer speech earns a longer pause allowance
        if duration_ms > 5000:
            self.silence_threshold = 600
        elif duration_ms > 2000:
            self.silence_threshold = 500
        else:
            self.silence_threshold = 350
Short questions ("what time is it") get cut quickly after a short pause, while long narratives are allowed slightly longer pauses.
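A quick check of the adaptation behavior (the class is repeated here so the snippet runs standalone):

```python
class AdaptiveVAD:
    # Same logic as the class above, repeated so this snippet is standalone
    def __init__(self):
        self.silence_threshold = 400  # ms

    def on_speech_end(self, duration_ms):
        if duration_ms > 5000:
            self.silence_threshold = 600
        elif duration_ms > 2000:
            self.silence_threshold = 500
        else:
            self.silence_threshold = 350

vad = AdaptiveVAD()
vad.on_speech_end(1200)       # a short question
print(vad.silence_threshold)  # 350
vad.on_speech_end(6000)       # a long narrative
print(vad.silence_threshold)  # 600
```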
Another pitfall: background noise. In noisy environments VAD produces many false positives. Silero VAD performs far better than simple energy detection; I recommend using it directly.
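For reference, the "simple energy detection" that Silero outperforms is typically just an RMS threshold; a hypothetical minimal version shows why it breaks down:

```python
import math

def is_speech(frame, threshold=0.02):
    """Naive energy VAD: compare the frame's RMS to a fixed threshold.

    frame is a list of float samples in [-1, 1]. Steady background noise
    keeps RMS above the threshold permanently, which is exactly why this
    baseline fails in noisy rooms.
    """
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > threshold

print(is_speech([0.0] * 160))        # False: pure silence
print(is_speech([0.1, -0.1] * 80))   # True: clearly audible signal
```

A trained model like Silero classifies speech by spectral shape rather than raw energy, so constant noise doesn't fool it.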
Practical Component Selection
| Component | My Choice | Notes |
|---|---|---|
| ASR | FunASR (local) | Strong Chinese accuracy, supports streaming, needs a GPU |
| ASR (backup) | Alibaba Cloud ASR | For when you don't want to maintain a local service |
| LLM | GPT-4o / Claude | Switched per scenario via a unified routing layer |
| TTS | CosyVoice | Natural voice, supports cloning; slow, needs optimization |
| TTS (fast) | Edge TTS | From Microsoft; free, low latency, average voice quality |
| VAD | Silero VAD | High accuracy, CPU usable |
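The "unified routing layer" mentioned in the table can start as a simple scenario-to-backend mapping; a hypothetical sketch (scenario keys and model names are illustrative, not the actual config):

```python
# Hypothetical scenario → (provider, model) table; all names illustrative
ROUTES = {
    "casual_chat": ("openai", "gpt-4o"),
    "long_reasoning": ("anthropic", "claude"),
}
DEFAULT_ROUTE = ("openai", "gpt-4o")

def route(scenario):
    """Resolve which backend handles a request, with a safe fallback."""
    return ROUTES.get(scenario, DEFAULT_ROUTE)

print(route("long_reasoning"))  # ('anthropic', 'claude')
print(route("unknown"))         # ('openai', 'gpt-4o')
```

Keeping the mapping in one place means swapping providers never touches the pipeline code.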
Summary
Reducing voice interaction latency comes down to two things:
- Pipeline concurrency: send each sentence to the LLM as ASR produces it, and each sentence to TTS as the LLM produces it; the three stages run in parallel, not serially
- Sentence-level processing: don't wait for the full text; the sentence is the minimum useful unit
After optimization, first-byte audio latency (from the user finishing speaking to hearing the first word) is stably under 800 ms, and overall fluency went from "obviously laggy" to "basically natural".
The next step is to test on-device ASR (a quantized Whisper.cpp build) and faster TTS options to compress the pipeline latency further.