Background

You can cache question-and-answer content: when the same question arrives again, the cached answer is returned directly, which saves both API costs and computation.


Install Dependencies

pip install -qU langchain-core langchain-community langchain-openai

Cache Types and How They Work

1. In-Memory Cache

  • Default caching method, implemented as a simple dictionary held in process memory
  • Cache is stored in memory during program runtime
  • Cache is automatically cleared after the process ends
  • Suitable for rapid prototyping of short-term, small-scale applications

2. Persistent Cache

  • Supports multiple storage backends: SQLite, Redis, the local file system, etc.
  • Cache can persist across sessions and processes
  • Suitable for production environments and large-scale applications
  • Example: SQLiteCache creates a local database file to store cache records
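The idea behind both cache types can be sketched in plain Python. The sketch below is a hypothetical illustration, not LangChain's actual implementation: the cache key combines the prompt with the model's generation parameters, so the same question asked with the same settings produces a cache hit, while changed settings do not.

```python
import hashlib
import json

class SimplePromptCache:
    """Hypothetical sketch of an LLM response cache (not LangChain's API)."""

    def __init__(self):
        self._store = {}  # key -> cached answer

    def _key(self, prompt, llm_params):
        # Key on the prompt plus the model settings, so changing e.g.
        # temperature or model name produces a different cache entry.
        raw = json.dumps({"prompt": prompt, "params": llm_params}, sort_keys=True)
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def lookup(self, prompt, llm_params):
        # Returns the cached answer, or None on a cache miss
        return self._store.get(self._key(prompt, llm_params))

    def update(self, prompt, llm_params, answer):
        self._store[self._key(prompt, llm_params)] = answer

cache = SimplePromptCache()
params = {"model": "gpt-3.5-turbo", "temperature": 0}
cache.update("Tell me a joke", params, "Why did the chicken cross the road?")
print(cache.lookup("Tell me a joke", params))              # cache hit
print(cache.lookup("Tell me a joke", {"temperature": 1}))  # None: different params
```

A persistent cache works the same way; it just writes the key-value pairs to SQLite, Redis, or disk instead of an in-process dictionary.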

Practical Application Scenarios for Caching

  1. Development and Debugging Phase: Reduce the number of repeated API calls
  2. Production Environment: Reduce LLM service call costs
  3. Queries with Unchanging Content: Such as FAQ answers, fixed knowledge base queries
  4. Batch Processing: When processing many similar requests
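For the batch-processing case, deduplicating prompts before calling the model makes the saving explicit. This is a plain-Python sketch; `call_llm` is a hypothetical stand-in for any LLM client:

```python
def answer_batch(prompts, call_llm):
    """Answer a batch of prompts, calling the LLM once per unique prompt."""
    cache = {}
    results = []
    for p in prompts:
        if p not in cache:
            cache[p] = call_llm(p)  # only unique prompts reach the LLM
        results.append(cache[p])
    return results

calls = []
def fake_llm(prompt):
    # Stand-in for a real model call; records how often it is invoked
    calls.append(prompt)
    return f"answer to: {prompt}"

batch = ["What is LangChain?", "Tell me a joke", "What is LangChain?"]
answers = answer_batch(batch, fake_llm)
print(len(calls))  # 2 -> only the two unique prompts were sent to the model
```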

Configuration and Usage Examples

# Note: in older LangChain versions these classes lived in langchain.cache
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache
from langchain_community.cache import SQLiteCache

# Use in-memory cache
set_llm_cache(InMemoryCache())

# Or use SQLite persistent cache
set_llm_cache(SQLiteCache(database_path=".langchain.db"))

Detailed Code Implementation

import time

from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache
from langchain_community.cache import SQLiteCache
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-3.5-turbo",
)
# Store in memory
set_llm_cache(InMemoryCache())
# Can also persist to a database instead:
# set_llm_cache(SQLiteCache(database_path=".langchain.db"))

# First call: not yet cached, so it hits the API and takes longer
start = time.perf_counter()
message1 = llm.invoke("Tell me a joke")
print(f"message1 ({time.perf_counter() - start:.2f}s): {message1.content}")

# Second call: served from the cache, so it returns much faster
start = time.perf_counter()
message2 = llm.invoke("Tell me a joke")
print(f"message2 ({time.perf_counter() - start:.2f}s): {message2.content}")

Running Results Explanation

  • First request: a full OpenAI API call over the network, typically a few seconds
  • Second request: the cached result is returned directly, typically in milliseconds, since no network round-trip is made
  • A large gap between the two timings confirms the caching mechanism is working
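The pattern behind those numbers can be reproduced without an API key by simulating the slow model call. In this pure-Python sketch, `slow_model` stands in for the network round-trip; the second lookup skips it entirely:

```python
import time

cache = {}
call_count = 0

def slow_model(prompt):
    # Stand-in for a network round-trip to the LLM provider
    global call_count
    call_count += 1
    time.sleep(0.2)
    return f"response to: {prompt}"

def cached_call(prompt):
    if prompt not in cache:
        cache[prompt] = slow_model(prompt)
    return cache[prompt]

t0 = time.perf_counter()
cached_call("Tell me a joke")
first = time.perf_counter() - t0

t0 = time.perf_counter()
cached_call("Tell me a joke")
second = time.perf_counter() - t0

print(call_count)      # 1: the model was only called once
print(second < first)  # True: the cache hit avoids the slow call
```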

Cache Invalidation and Updates

  1. Automatic Invalidation: The cache key includes the prompt and the model's parameters, so changing either results in a fresh call rather than a stale hit
  2. Manual Clearing: All entries can be cleared via the cache's clear() method
  3. Fine-grained Control: Caching can be disabled for a specific model instance by constructing it with cache=False (e.g. ChatOpenAI(cache=False))
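The three mechanisms above can be illustrated with a small pure-Python sketch. `TinyCache` and its `use_cache` argument are hypothetical names for illustration, not LangChain's API:

```python
class TinyCache:
    """Hypothetical sketch of manual clearing and per-call cache bypass."""

    def __init__(self):
        self._store = {}

    def clear(self):
        # Manual clearing: drop every cached entry
        self._store.clear()

    def call(self, prompt, model_fn, use_cache=True):
        # Fine-grained control: use_cache=False skips both lookup and storage
        if use_cache and prompt in self._store:
            return self._store[prompt]
        answer = model_fn(prompt)
        if use_cache:
            self._store[prompt] = answer
        return answer

calls = []
def fake_model(prompt):
    calls.append(prompt)
    return prompt.upper()

c = TinyCache()
c.call("hi", fake_model)                   # miss: model is called
c.call("hi", fake_model)                   # hit: served from the cache
c.call("hi", fake_model, use_cache=False)  # bypass: model is called again
c.clear()
c.call("hi", fake_model)                   # miss again after clearing
print(len(calls))  # 3: one miss, one bypass, one post-clear miss
```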

Performance Considerations

  1. Response Time: A cache hit skips the network round-trip entirely, so responses can be one to two orders of magnitude faster
  2. Cost Savings: Every cache hit is an API call you do not pay for, so savings scale directly with the hit rate
  3. Throughput: With a high hit rate, overall system throughput can improve substantially, since most requests never wait on the LLM

Best Practice Recommendations

  1. Use in-memory cache in development environments first
  2. Use high-performance persistent caches like Redis in production environments
  3. For frequently changing content, appropriately reduce cache time or disable caching
  4. Regularly monitor cache hit rate and effectiveness
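Monitoring the hit rate (recommendation 4) is straightforward to add around any cache. A minimal illustrative wrapper, with hypothetical names:

```python
class MonitoredCache:
    """Cache wrapper that tracks its hit rate (illustrative sketch)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute()  # only misses pay the full cost
        return self._store[key]

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

c = MonitoredCache()
for q in ["a", "b", "a", "a"]:
    c.get_or_compute(q, lambda: q.upper())
print(c.hit_rate)  # 0.5 -> 2 hits out of 4 lookups
```

If the hit rate stays low in production, caching may be adding storage cost without saving many API calls, which is the signal to shorten cache lifetimes or disable it for that workload.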

Cache Technical Details

  • InMemoryCache has no expiry time or size limit by default; entries live until the process exits or clear() is called
  • SQLiteCache likewise persists entries until they are explicitly cleared; automatic expiry (TTL) is a feature of backends such as Redis that support it natively
  • When a size cap is applied to an in-memory cache, the usual eviction policy is LRU (Least Recently Used)
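An LRU (Least Recently Used) eviction policy is easy to sketch with `collections.OrderedDict`, which is also how Python's own `functools.lru_cache` behaves conceptually. This is a generic illustration, not LangChain internals:

```python
from collections import OrderedDict

class LRUCache:
    """Bounded cache that evicts the least recently used entry (sketch)."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.maxsize:
            self._store.popitem(last=False)  # evict least recently used

c = LRUCache(maxsize=2)
c.put("q1", "a1")
c.put("q2", "a2")
c.get("q1")         # touch q1, so q2 becomes the least recently used
c.put("q3", "a3")   # capacity exceeded: q2 is evicted
print(c.get("q2"))  # None: evicted
print(c.get("q1"))  # a1: kept because it was recently used
```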