Many people get a demo working with LangChain or direct API calls, then think it’s ready for production. The distance between a demo and production is far greater than imagined.

This article documents core problems and solutions encountered while pushing several LLM applications to production. No model selection or prompt basics, just engineering-level pitfalls.

Context Management: The Most Easily Overlooked Problem

The LLM’s context window is limited, and more tokens mean slower inference and higher cost. Most demos don’t handle this; production must.

Several common strategies:

Sliding Window

Keep the most recent N conversation turns and discard earlier history. It’s the simplest to implement, but causes model “memory loss”, so it suits scenarios with weak context dependency.

import tiktoken  # assumption: using tiktoken for counting (pip install tiktoken)

_enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(messages: list) -> int:
    # Rough estimate; exact counts depend on the model's message framing
    return sum(len(_enc.encode(m.get("content", ""))) for m in messages)

def trim_messages(messages: list, max_tokens: int = 4000) -> list:
    # Always keep the system message
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    # Keep messages from newest going backwards until the budget is spent
    kept = []
    total = count_tokens(system)
    for msg in reversed(rest):
        t = count_tokens([msg])
        if total + t > max_tokens:
            break
        kept.insert(0, msg)
        total += t

    return system + kept

Summary Compression

When the conversation history exceeds a threshold, use the model itself to compress the early history into a summary, then continue the conversation from there. This preserves key information but adds an extra API call per compression.
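
The compression step can be sketched as follows. This is a minimal sketch: the `summarize` callable stands in for the actual LLM summarization call, and the `threshold`/`keep_recent` values are illustrative, not recommendations.

```python
def compress_history(messages: list, summarize, keep_recent: int = 4,
                     threshold: int = 12) -> list:
    """Compress early history into a summary message once it grows too long.

    `summarize` is a callable (an LLM call in practice) that turns a list
    of messages into a short text summary.
    """
    if len(messages) <= threshold:
        return messages

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    # Summarize everything except the most recent turns
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    summary_text = summarize(old)

    # Inject the summary as a system-style note before the recent turns
    summary_msg = {"role": "system",
                   "content": f"Summary of earlier conversation: {summary_text}"}
    return system + [summary_msg] + recent
```

In practice the `summarize` call would itself be a cheap model (e.g. a small model with a “summarize this conversation” prompt), which keeps the compression cost low.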

RAG Instead of Long Context

Vectorize the knowledge base, then at query time inject only the chunks most relevant to the question, instead of stuffing entire documents into the context. This is currently the most mainstream approach.
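
A minimal sketch of the retrieval side, assuming chunk embeddings have already been computed by an embedding model. The `retrieve` and `build_prompt` helpers and the in-memory index are illustrative, not a real vector store:

```python
import math

def cosine(a: list, b: list) -> float:
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list, index: list, top_k: int = 2) -> list:
    """index: list of (chunk_text, embedding) pairs; returns top-k chunks."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in scored[:top_k]]

def build_prompt(question: str, query_vec: list, index: list) -> list:
    # Inject only the relevant chunks, not the whole knowledge base
    context = "\n---\n".join(retrieve(query_vec, index))
    return [
        {"role": "system",
         "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": question},
    ]
```

A production system would replace the linear scan with a vector database, but the shape of the prompt it builds is the same.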


Error Handling: Don’t Trust the API to Always Be Available

The LLM API will time out, hit rate limits, and return empty results. Production code must handle all of these.

Must-handle error types:

Error                      Cause                        Handling
RateLimitError             Request frequency too high   Exponential backoff and retry
Timeout                    Generation takes too long    Set a timeout, return a degraded response
InvalidRequestError        Context too long             Truncate and retry
Empty/truncated response   Model output incomplete      Detect and regenerate

import time
from openai import OpenAI, RateLimitError, APITimeoutError

client = OpenAI()

def call_llm_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                timeout=30,
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s
        except APITimeoutError:
            if attempt == max_retries - 1:
                # Degraded response instead of an unhandled failure
                return "Service temporarily unavailable, please try again later."

Cost Control: Token Usage Must Be Visible

Without knowing how many tokens each request consumes, you can’t optimize costs. Log token usage for every call before going live.

Simplest approach: Log usage for each call

import logging

logger = logging.getLogger(__name__)

def log_usage(response, endpoint: str):
    usage = response.usage
    logger.info(
        "llm_usage",
        extra={
            "endpoint": endpoint,
            "prompt_tokens": usage.prompt_tokens,
            "completion_tokens": usage.completion_tokens,
            "total_tokens": usage.total_tokens,
            "model": response.model,
        }
    )

With logs, you can analyze by endpoint, user, time period, find high-cost paths to optimize.

Common optimization points:

  • Keep system prompt as short as possible, remove redundant descriptions
  • Use smaller models for simple questions (like gpt-4o-mini), only use larger models for complex ones
  • Semantic caching for high-repeat requests (reuse previous answers for similar questions)
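
The semantic caching idea can be sketched like this. Assumptions: `embed` stands in for a real embedding-model call and is injected here to keep the sketch self-contained, and the similarity `threshold` is illustrative; a production version would also need eviction and invalidation.

```python
import math

class SemanticCache:
    """Reuse answers for semantically similar questions."""

    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed          # callable: question -> embedding vector
        self.threshold = threshold  # minimum similarity to count as a hit
        self.entries = []           # list of (embedding, answer)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, question: str):
        # Return a cached answer if a stored question is similar enough
        vec = self.embed(question)
        best = max(self.entries, key=lambda e: self._cosine(vec, e[0]),
                   default=None)
        if best and self._cosine(vec, best[0]) >= self.threshold:
            return best[1]  # cache hit: reuse the earlier answer
        return None

    def put(self, question: str, answer: str):
        self.entries.append((self.embed(question), answer))
```

On a hit you skip the LLM call entirely, so even a modest hit rate on high-traffic endpoints translates directly into cost savings.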

Observability: Must Be Able to Locate Issues When Problems Occur

Bugs in LLM apps are usually not code errors but “the model’s response doesn’t match expectations”. Such issues are hard to reproduce, so you must log the complete input and output.

Minimum observability requirements:

  1. Log complete request: messages list, model parameters, temperature
  2. Log complete response: raw output, finish_reason, token usage
  3. Associate a trace_id: one user request may trigger multiple LLM calls, and you must be able to chain them together

import uuid

def traced_llm_call(messages, trace_id=None, **kwargs):
    # Accept the caller's trace_id so every call in one request shares it
    trace_id = trace_id or str(uuid.uuid4())

    logger.info("llm_request", extra={
        "trace_id": trace_id,
        "messages": messages,
        **kwargs
    })

    response = client.chat.completions.create(messages=messages, **kwargs)

    logger.info("llm_response", extra={
        "trace_id": trace_id,
        "content": response.choices[0].message.content,
        "finish_reason": response.choices[0].finish_reason,
        "usage": response.usage.model_dump(),
    })

    return response

Summary

Core differences between a demo and production:

  • Context management: Can’t pile up unlimited tokens, need truncation or compression strategy
  • Error handling: API instability is normal, retry and degradation are a must
  • Cost visibility: Can’t optimize without logging usage
  • Observability: Must be able to trace when problems occur, log completely

None of this is showing off; it’s the most basic engineering work. The models themselves are powerful, but without proper handling at the application layer, it’s easy to fail in production.