Many people get a Demo working with LangChain or direct API calls and assume it’s ready for production. The gap between a Demo and production is far wider than it looks.
This article documents the core problems and solutions encountered while pushing several LLM applications to production. No model selection or prompt basics here, just engineering-level pitfalls.
Context Management: The Most Easily Overlooked Problem
An LLM’s context window is limited, and more tokens mean slower inference and higher cost. Most Demos don’t handle this; production has to.
Several common strategies:
Sliding Window
Keep the most recent N conversation turns and discard earlier history. It’s the simplest to implement, but the model “forgets” anything that scrolls out, so it suits scenarios with weak dependence on long-range context.
```python
def trim_messages(messages: list, max_tokens: int = 4000) -> list:
    # Always keep system message
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Keep from newest going backwards
    kept = []
    total = count_tokens(system)
    for msg in reversed(rest):
        t = count_tokens([msg])
        if total + t > max_tokens:
            break
        kept.insert(0, msg)
        total += t
    return system + kept
```
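The `count_tokens` helper above is assumed rather than shown. A minimal version might look like this, assuming a recent tiktoken release that ships the `o200k_base` encoding (the one gpt-4o uses); pick the encoding that matches your model:

```python
import tiktoken

# Assumed helper for trim_messages above: a rough token count for a list of chat messages.
_encoding = tiktoken.get_encoding("o200k_base")

def count_tokens(messages: list) -> int:
    total = 0
    for m in messages:
        # Content tokens plus a small per-message overhead for role/formatting
        total += len(_encoding.encode(m.get("content") or "")) + 4
    return total
```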
Summary Compression
Once the conversation history exceeds a threshold, use the model itself to compress the early history into a summary, then continue the conversation from there. This preserves key information but adds an extra API call.
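A minimal sketch of the idea, assuming the same `client` as the other snippets in this article; the threshold and the number of recent turns kept verbatim are arbitrary choices:

```python
SUMMARY_THRESHOLD = 20  # assumed: compress once the history exceeds this many messages

def compress_history(messages: list) -> list:
    """Replace older turns with a model-written summary; keep recent turns verbatim."""
    if len(messages) <= SUMMARY_THRESHOLD:
        return messages
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    old, recent = rest[:-6], rest[-6:]  # keep the last 3 rounds as-is
    summary = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize the conversation below. Preserve facts, decisions, and open questions."},
            {"role": "user", "content": "\n".join(f'{m["role"]}: {m["content"]}' for m in old)},
        ],
    ).choices[0].message.content
    return system + [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent
```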
RAG Instead of Long Context
Vectorize the knowledge base, and at query time inject only the chunks relevant to the question instead of stuffing entire documents into the context. This is currently the most common approach.
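A minimal sketch of retrieve-then-inject, assuming the same `client`, an in-memory index of `(chunk_text, embedding)` pairs built offline, and OpenAI’s embeddings endpoint; a real system would use a proper vector store:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def retrieve(query: str, index: list[tuple[str, np.ndarray]], top_k: int = 3) -> list[str]:
    """Return the top_k chunks most similar to the query by cosine similarity."""
    q = embed(query)
    scored = [
        (float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))), text)
        for text, v in index
    ]
    return [text for _, text in sorted(scored, reverse=True)[:top_k]]

def answer_with_rag(query: str, index: list[tuple[str, np.ndarray]]) -> str:
    context = "\n\n".join(retrieve(query, index))
    messages = [
        {"role": "system", "content": f"Answer using only the context below.\n\n{context}"},
        {"role": "user", "content": query},
    ]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```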
Error Handling: Don’t Trust the API to Always Be Available
The LLM API will time out, hit rate limits, and return empty results. Production code must handle all of these.
Must-handle error types:
| Error | Cause | Handling |
|---|---|---|
| RateLimitError | Request frequency too high | Exponential backoff and retry |
| Timeout | Generation takes too long | Set a timeout, return a degraded response |
| InvalidRequestError | Context too long | Truncate and retry |
| Empty/truncated response | Model output incomplete | Detect and regenerate |
```python
import time
from openai import OpenAI, RateLimitError, APITimeoutError

client = OpenAI()

def call_llm_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                timeout=30,
            )
            return response.choices[0].message.content
        except RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s
        except APITimeoutError:
            if attempt == max_retries - 1:
                return "Service temporarily unavailable, please try again later."
    return None
```
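The table also lists truncate-and-retry for an over-long context. In the v1 openai SDK that case typically surfaces as `BadRequestError` rather than the older `InvalidRequestError`; a sketch that reuses `trim_messages` from earlier (the token budget is an assumption):

```python
from openai import BadRequestError

def call_llm_safe(messages, token_budget: int = 4000):
    try:
        return call_llm_with_retry(messages)
    except BadRequestError:
        # Likely a context-length overflow; a more robust version would
        # inspect the error code/message before deciding to truncate.
        trimmed = trim_messages(messages, max_tokens=token_budget)
        return call_llm_with_retry(trimmed)
```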
Cost Control: Token Usage Must Be Visible
If you don’t know how many tokens each request costs, you can’t optimize spend. Log token usage before going live.
The simplest approach: log usage for every call.
```python
import logging

logger = logging.getLogger(__name__)

def log_usage(response, endpoint: str):
    usage = response.usage
    logger.info(
        "llm_usage",
        extra={
            "endpoint": endpoint,
            "prompt_tokens": usage.prompt_tokens,
            "completion_tokens": usage.completion_tokens,
            "total_tokens": usage.total_tokens,
            "model": response.model,
        },
    )
```
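Wired into a call site it might look like this (the endpoint name is just an example):

```python
response = client.chat.completions.create(model="gpt-4o", messages=messages)
log_usage(response, endpoint="/api/chat")
```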
With these logs you can break usage down by endpoint, user, or time period and find the high-cost paths worth optimizing.
Common optimization points:
- Keep the system prompt as short as possible and remove redundant description
- Route simple questions to a smaller model (like gpt-4o-mini), and reserve the larger model for complex ones
- Add a semantic cache for highly repetitive requests (reuse a previous answer when a new question is similar enough); see the sketch below
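A minimal sketch of the semantic-cache idea, assuming an in-memory list and the same embedding helper as in the RAG sketch; the similarity threshold is an assumption to tune on real traffic, and a production version would use a vector store with expiry:

```python
import numpy as np

_semantic_cache: list[tuple[np.ndarray, str]] = []  # (question embedding, cached answer)
SIMILARITY_THRESHOLD = 0.95  # assumed cutoff for "same question"

def cached_answer(question: str) -> str:
    q = embed(question)  # embed() as defined in the RAG sketch above
    for vec, answer in _semantic_cache:
        sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if sim >= SIMILARITY_THRESHOLD:
            return answer  # reuse the earlier answer for a near-duplicate question
    answer = call_llm_with_retry([{"role": "user", "content": question}])
    _semantic_cache.append((q, answer))
    return answer
```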
Observability: Be Able to Locate Issues When They Occur
Bugs in LLM apps are usually not code errors but “the model’s response doesn’t match expectations”. These are hard to reproduce, so you must log complete inputs and outputs.
Minimum observability requirements:
- Log complete request: messages list, model parameters, temperature
- Log complete response: raw output, finish_reason, token usage
- Attach a trace_id: one user request may trigger multiple LLM calls, and you need to be able to chain them together
```python
import uuid

def traced_llm_call(messages, **kwargs):
    # Reuses the `client` and `logger` defined in the earlier snippets
    trace_id = str(uuid.uuid4())
    logger.info("llm_request", extra={
        "trace_id": trace_id,
        "messages": messages,
        **kwargs,
    })
    response = client.chat.completions.create(messages=messages, **kwargs)
    logger.info("llm_response", extra={
        "trace_id": trace_id,
        "content": response.choices[0].message.content,
        "finish_reason": response.choices[0].finish_reason,
        "usage": response.usage.model_dump(),
    })
    return response
```
Summary
The core differences between a Demo and production:
- Context management: you can’t pile up tokens indefinitely; you need a truncation or compression strategy
- Error handling: API instability is normal; retries and degradation are a must
- Cost visibility: you can’t optimize what you don’t log
- Observability: when something goes wrong you must be able to trace it, so log completely
None of this is showing off; it’s the most basic engineering work. LLMs themselves are powerful, but without proper handling at the application layer it’s easy to fail in production.