Many people get a Demo working with LangChain or direct API calls and assume it’s ready for production. The gap between a Demo and production is far wider than it looks.
This article documents the core problems and solutions encountered while pushing several LLM applications to production. No model selection or prompt basics here, just engineering-level pitfalls.
Context Management: The Most Easily Overlooked Problem
An LLM’s context window is limited, and more tokens mean slower inference and higher cost. Most Demos don’t handle this; production has to.
Several common strategies:
Sliding Window
Keep the most recent N conversation turns and discard earlier history. It’s the simplest to implement, but the model “forgets” anything that scrolls out, so it suits scenarios with weak dependence on long-range context.
```python
def trim_messages(messages: list, max_tokens: int = 4000) -> list:
    # Always keep system message
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Keep from newest going backwards
    kept = []
    total = count_tokens(system)
    for msg in reversed(rest):
        t = count_tokens([msg])
        if total + t > max_tokens:
            break
        kept.insert(0, msg)
        total += t
    return system + kept
```
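The `count_tokens` helper above is assumed rather than shown. A minimal version might look like this, assuming a recent tiktoken release that ships the `o200k_base` encoding (the one gpt-4o uses); pick the encoding that matches your model:

```python
import tiktoken

# Assumed helper for trim_messages above: a rough token count for a list of chat messages.
_encoding = tiktoken.get_encoding("o200k_base")

def count_tokens(messages: list) -> int:
    total = 0
    for m in messages:
        # Content tokens plus a small per-message overhead for role/formatting
        total += len(_encoding.encode(m.get("content") or "")) + 4
    return total
```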
Summary Compression
Once the conversation history exceeds a threshold, use the model itself to compress the early history into a summary, then continue the conversation from there. This preserves key information but adds an extra API call.
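A minimal sketch of the idea, assuming the same `client` as the other snippets in this article; the threshold and the number of recent turns kept verbatim are arbitrary choices:

```python
SUMMARY_THRESHOLD = 20  # assumed: compress once the history exceeds this many messages

def compress_history(messages: list) -> list:
    """Replace older turns with a model-written summary; keep recent turns verbatim."""
    if len(messages) <= SUMMARY_THRESHOLD:
        return messages
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    old, recent = rest[:-6], rest[-6:]  # keep the last 3 rounds as-is
    summary = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize the conversation below. Preserve facts, decisions, and open questions."},
            {"role": "user", "content": "\n".join(f'{m["role"]}: {m["content"]}' for m in old)},
        ],
    ).choices[0].message.content
    return system + [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent
```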
RAG Instead of Long Context
Vectorize the knowledge base, and at query time inject only the chunks relevant to the question instead of stuffing entire documents into the context. This is currently the most common approach.
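A minimal sketch of retrieve-then-inject, assuming the same `client`, an in-memory index of `(chunk_text, embedding)` pairs built offline, and OpenAI’s embeddings endpoint; a real system would use a proper vector store:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def retrieve(query: str, index: list[tuple[str, np.ndarray]], top_k: int = 3) -> list[str]:
    """Return the top_k chunks most similar to the query by cosine similarity."""
    q = embed(query)
    scored = [
        (float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))), text)
        for text, v in index
    ]
    return [text for _, text in sorted(scored, reverse=True)[:top_k]]

def answer_with_rag(query: str, index: list[tuple[str, np.ndarray]]) -> str:
    context = "\n\n".join(retrieve(query, index))
    messages = [
        {"role": "system", "content": f"Answer using only the context below.\n\n{context}"},
        {"role": "user", "content": query},
    ]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```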
Error Handling: Don’t Trust the API to Always Be Available
The LLM API will time out, hit rate limits, and return empty results. Production code must handle all of these.
Must-handle error types:
| Error | Cause | Handling |
|---|---|---|
| RateLimitError | Request frequency too high | Exponential backoff and retry |
| Timeout | Generation takes too long | Set a timeout, return a degraded response |
| InvalidRequestError | Context too long | Truncate and retry |
| Empty/truncated response | Model output incomplete | Detect and regenerate |
```python
import time
from openai import OpenAI, RateLimitError, APITimeoutError

client = OpenAI()

def call_llm_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                timeout=30,
            )
            return response.choices[0].message.content
        except RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s
        except APITimeoutError:
            if attempt == max_retries - 1:
                return "Service temporarily unavailable, please try again later."
    return None
```
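The table also lists truncate-and-retry for an over-long context. In the v1 openai SDK that case typically surfaces as `BadRequestError` rather than the older `InvalidRequestError`; a sketch that reuses `trim_messages` from earlier (the token budget is an assumption):

```python
from openai import BadRequestError

def call_llm_safe(messages, token_budget: int = 4000):
    try:
        return call_llm_with_retry(messages)
    except BadRequestError:
        # Likely a context-length overflow; a more robust version would
        # inspect the error code/message before deciding to truncate.
        trimmed = trim_messages(messages, max_tokens=token_budget)
        return call_llm_with_retry(trimmed)
```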
Cost Control: Token Usage Must Be Visible
If you don’t know how many tokens each request costs, you can’t optimize spend. Log token usage before going live.
The simplest approach: log usage for every call.
```python
import logging

logger = logging.getLogger(__name__)

def log_usage(response, endpoint: str):
    usage = response.usage
    logger.info(
        "llm_usage",
        extra={
            "endpoint": endpoint,
            "prompt_tokens": usage.prompt_tokens,
            "completion_tokens": usage.completion_tokens,
            "total_tokens": usage.total_tokens,
            "model": response.model,
        },
    )
```
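Wired into a call site it might look like this (the endpoint name is just an example):

```python
response = client.chat.completions.create(model="gpt-4o", messages=messages)
log_usage(response, endpoint="/api/chat")
```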
With these logs you can break usage down by endpoint, user, or time period and find the high-cost paths worth optimizing.
Common optimization points:
- Keep the system prompt as short as possible and remove redundant description
- Route simple questions to a smaller model (like gpt-4o-mini), and reserve the larger model for complex ones
- Add a semantic cache for highly repetitive requests (reuse a previous answer when a new question is similar enough); see the sketch below
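A minimal sketch of the semantic-cache idea, assuming an in-memory list and the same embedding helper as in the RAG sketch; the similarity threshold is an assumption to tune on real traffic, and a production version would use a vector store with expiry:

```python
import numpy as np

_semantic_cache: list[tuple[np.ndarray, str]] = []  # (question embedding, cached answer)
SIMILARITY_THRESHOLD = 0.95  # assumed cutoff for "same question"

def cached_answer(question: str) -> str:
    q = embed(question)  # embed() as defined in the RAG sketch above
    for vec, answer in _semantic_cache:
        sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if sim >= SIMILARITY_THRESHOLD:
            return answer  # reuse the earlier answer for a near-duplicate question
    answer = call_llm_with_retry([{"role": "user", "content": question}])
    _semantic_cache.append((q, answer))
    return answer
```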
Observability: Be Able to Locate Issues When They Occur
Bugs in LLM apps are usually not code errors but “the model’s response doesn’t match expectations”. These are hard to reproduce, so you must log complete inputs and outputs.
Minimum observability requirements:
- Log complete request: messages list, model parameters, temperature
- Log complete response: raw output, finish_reason, token usage
- Attach a trace_id: one user request may trigger multiple LLM calls, and you need to be able to chain them together
```python
import uuid

def traced_llm_call(messages, **kwargs):
    # Reuses the `client` and `logger` defined in the earlier snippets
    trace_id = str(uuid.uuid4())
    logger.info("llm_request", extra={
        "trace_id": trace_id,
        "messages": messages,
        **kwargs,
    })
    response = client.chat.completions.create(messages=messages, **kwargs)
    logger.info("llm_response", extra={
        "trace_id": trace_id,
        "content": response.choices[0].message.content,
        "finish_reason": response.choices[0].finish_reason,
        "usage": response.usage.model_dump(),
    })
    return response
```
Summary
The core differences between a Demo and production:
- Context management: you can’t pile up tokens indefinitely; you need a truncation or compression strategy
- Error handling: API instability is normal; retries and degradation are a must
- Cost visibility: you can’t optimize what you don’t log
- Observability: when something goes wrong you must be able to trace it, so log completely
None of this is showing off; it’s the most basic engineering work. LLMs themselves are powerful, but without proper handling at the application layer it’s easy to fail in production.