Background
LangChain can cache LLM responses. When the same prompt is sent again with the same model configuration, the stored answer is returned directly instead of calling the model, which saves both cost and computation.
Install Dependencies
pip install -qU langchain-core langchain-openai langchain-community
Cache Types and How They Work
1. In-Memory Cache
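- Simplest caching method; InMemoryCache keeps entries in a Python dictionary (it does not use functools.lru_cache), and no cache is active until you call set_llm_cache
- Cache is stored in memory during program runtime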
- Cache is automatically cleared after the process ends
- Suitable for rapid prototyping of short-term, small-scale applications
2. Persistent Cache
- Supports multiple storage backends: SQLite, Redis, the local file system, etc.
- Cache can persist across sessions and processes
- Suitable for production environments and large-scale applications
- Example: SQLiteCache creates a local database file to store cache records
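The difference between the two cache types can be illustrated with a small standard-library sketch. This is not LangChain's implementation: the class names DictCache and SqliteFileCache, the table name llm_cache, and the "model=demo" configuration string are all hypothetical, chosen only to show that a dictionary is lost on restart while a SQLite file survives it.

```python
import os
import sqlite3
import tempfile


class DictCache:
    """Hypothetical in-memory cache: entries live in a dict and vanish with the process."""

    def __init__(self):
        self._store = {}

    def lookup(self, prompt, llm_string):
        return self._store.get((prompt, llm_string))

    def update(self, prompt, llm_string, answer):
        self._store[(prompt, llm_string)] = answer


class SqliteFileCache:
    """Hypothetical persistent cache: entries are rows in a SQLite file on disk."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS llm_cache ("
            "prompt TEXT, llm_string TEXT, answer TEXT, "
            "PRIMARY KEY (prompt, llm_string))"
        )
        self._conn.commit()

    def lookup(self, prompt, llm_string):
        row = self._conn.execute(
            "SELECT answer FROM llm_cache WHERE prompt = ? AND llm_string = ?",
            (prompt, llm_string),
        ).fetchone()
        return row[0] if row else None

    def update(self, prompt, llm_string, answer):
        self._conn.execute(
            "INSERT OR REPLACE INTO llm_cache VALUES (?, ?, ?)",
            (prompt, llm_string, answer),
        )
        self._conn.commit()

    def close(self):
        self._conn.close()


def demo_persistence():
    """Return what each cache still holds after a simulated restart."""
    db_path = os.path.join(tempfile.mkdtemp(), "demo_cache.db")

    memory_cache = DictCache()
    memory_cache.update("Tell me a joke", "model=demo", "A cached joke")
    disk_cache = SqliteFileCache(db_path)
    disk_cache.update("Tell me a joke", "model=demo", "A cached joke")
    disk_cache.close()

    # Simulate a restart: fresh objects, but the same database file.
    memory_cache = DictCache()
    disk_cache = SqliteFileCache(db_path)
    result = (
        memory_cache.lookup("Tell me a joke", "model=demo"),
        disk_cache.lookup("Tell me a joke", "model=demo"),
    )
    disk_cache.close()
    return result


print(demo_persistence())
```

After the simulated restart, the in-memory lookup finds nothing while the SQLite lookup still returns the stored answer.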
Practical Application Scenarios for Caching
- Development and Debugging Phase: Reduce the number of repeated API calls
- Production Environment: Reduce LLM service call costs
- Queries with Unchanging Content: Such as FAQ answers, fixed knowledge base queries
- Batch Processing: When processing many similar requests
Configuration and Usage Examples
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache
# SQLiteCache lives in the separate langchain-community package
from langchain_community.cache import SQLiteCache

# Use in-memory cache
set_llm_cache(InMemoryCache())

# Or use SQLite persistent cache
set_llm_cache(SQLiteCache(database_path=".langchain.db"))
Detailed Code Implementation
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache
from langchain_openai import ChatOpenAI

# Requires the OPENAI_API_KEY environment variable to be set
llm = ChatOpenAI(
    model="gpt-3.5-turbo",
)

# Store in memory
set_llm_cache(InMemoryCache())

# Can also persist to a database (requires the langchain-community package)
# from langchain_community.cache import SQLiteCache
# set_llm_cache(SQLiteCache(database_path=".langchain.db"))

# The first time, the answer is not yet in the cache, so the call takes longer
message1 = llm.invoke("Tell me a joke").content
print(f"message1: {message1}")

# The second time, the answer comes from the cache, so the call is faster
message2 = llm.invoke("Tell me a joke").content
print(f"message2: {message2}")
Running Results Explanation
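- First request: Full OpenAI API call, response time 3-5 seconds (varies with network conditions and model load)
- Second request: Returns the cached result directly without calling the API, response time approximately 500ms in this example
- A significant difference in response time, together with identical text in message1 and message2, indicates the caching mechanism is working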
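To check the timing difference yourself without spending API credits, the measurement can be sketched with a stand-in for the model. Everything here is an assumption for illustration: slow_model is a hypothetical function that sleeps for 50 ms instead of calling OpenAI, and the cache is a plain dictionary rather than LangChain's cache.

```python
import time


def slow_model(prompt):
    """Hypothetical stand-in for a remote LLM call; sleeps instead of using the network."""
    time.sleep(0.05)
    return f"answer to: {prompt}"


_cache = {}


def cached_model(prompt):
    """Call slow_model only when the prompt has not been seen before."""
    if prompt not in _cache:
        _cache[prompt] = slow_model(prompt)
    return _cache[prompt]


def timed(fn, prompt):
    """Run fn(prompt) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(prompt)
    return result, time.perf_counter() - start


first_answer, first_seconds = timed(cached_model, "Tell me a joke")
second_answer, second_seconds = timed(cached_model, "Tell me a joke")
print(f"first: {first_seconds:.4f}s, second: {second_seconds:.4f}s")
```

The same timed helper can wrap a real model call in place of cached_model to compare the first and second requests.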
Cache Invalidation and Updates
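- Automatic Invalidation: The cache key combines the prompt with the model configuration, so changing either one results in a cache miss and a new API call
- Manual Clearing: Clear all cached entries by calling the clear() method on the cache object
- Fine-grained Control: Disable caching for a specific model by passing cache=False when constructing it, e.g. ChatOpenAI(cache=False)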
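How a cache decides between a hit and a miss based on the inputs can be sketched as follows. This is a simplified illustration, not LangChain's code: KeyedCache and its make_key helper are hypothetical, and the key is assumed to be the prompt plus a JSON dump of the model parameters.

```python
import json


class KeyedCache:
    """Hypothetical cache keyed by the prompt and the model parameters."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def make_key(prompt, params):
        # sort_keys makes the key independent of the order of the parameters
        return (prompt, json.dumps(params, sort_keys=True))

    def lookup(self, prompt, params):
        return self._store.get(self.make_key(prompt, params))

    def update(self, prompt, params, answer):
        self._store[self.make_key(prompt, params)] = answer

    def clear(self):
        # Manual clearing: drop every entry at once
        self._store.clear()


cache = KeyedCache()
cache.update("Tell me a joke", {"model": "gpt-3.5-turbo", "temperature": 0.7}, "joke A")

# Same prompt and same parameters (in a different order): hit
same = cache.lookup("Tell me a joke", {"temperature": 0.7, "model": "gpt-3.5-turbo"})

# Same prompt but a different temperature: miss
changed = cache.lookup("Tell me a joke", {"model": "gpt-3.5-turbo", "temperature": 0.0})

cache.clear()
after_clear = cache.lookup("Tell me a joke", {"model": "gpt-3.5-turbo", "temperature": 0.7})
```

Because the parameters are part of the key, a changed setting never returns an answer produced under the old setting.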
Performance Considerations
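- Response Time: A cache hit skips the network round trip and model inference, so it can respond 10-100 times faster than a live call
- Cost Savings: Every cache hit is an API call that is not billed, which can significantly reduce usage costs
- Throughput: Overall system throughput can improve by 3-5 times when a large share of requests are repeats; the actual gain depends on the cache hit rate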
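The relationship between hit rate, latency, and cost can be worked through with a little arithmetic. The helper names and all numbers below (an 80% hit rate, 0.02 s per hit, 2.0 s per miss, 0.002 per call) are hypothetical inputs for illustration, not measurements.

```python
def average_latency(hit_rate, hit_seconds, miss_seconds):
    """Expected latency per request for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_seconds + (1.0 - hit_rate) * miss_seconds


def overall_speedup(hit_rate, hit_seconds, miss_seconds):
    """How many times faster the average request is compared with no cache."""
    return miss_seconds / average_latency(hit_rate, hit_seconds, miss_seconds)


def api_cost(requests, hit_rate, cost_per_call):
    """Only cache misses reach the paid API."""
    return requests * (1.0 - hit_rate) * cost_per_call


# Hypothetical workload: 1000 requests, 80% of them repeats
print(f"average latency: {average_latency(0.8, 0.02, 2.0):.3f}s")
print(f"overall speedup: {overall_speedup(0.8, 0.02, 2.0):.2f}x")
print(f"API cost: {api_cost(1000, 0.8, 0.002):.2f}")
```

With these assumed inputs, a single hit is 100 times faster than a miss, yet the overall speedup stays under 5 times because the remaining misses dominate the average.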
Best Practice Recommendations
- Use in-memory cache in development environments first
- Use high-performance persistent caches like Redis in production environments
- For frequently changing content, shorten the cache lifetime (TTL) or disable caching
- Regularly monitor cache hit rate and effectiveness
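Monitoring the hit rate can be as simple as counting hits and misses around the cache lookup. The sketch below is a standalone illustration: CountingCache and fake_model are hypothetical names, and a real deployment would more likely export these counters to its metrics system.

```python
class CountingCache:
    """Hypothetical cache wrapper that records hits and misses."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        """Return the cached value for key, computing and storing it on a miss."""
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute(key)
        return self._store[key]

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_model(prompt):
    """Stand-in for an LLM call."""
    return prompt.upper()


cache = CountingCache()
for question in ["hours?", "refunds?", "hours?", "hours?"]:
    cache.get_or_compute(question, fake_model)

print(f"hits={cache.hits} misses={cache.misses} hit_rate={cache.hit_rate:.0%}")
```

A persistently low hit rate suggests that the prompts vary too much for exact-match caching to pay off.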
Cache Technical Details
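- Default cache time: LangChain's InMemoryCache and SQLiteCache do not expire entries on their own; a lifetime such as 30 minutes has to be configured on a backend that supports it (for example, the ttl argument of RedisCache)
- Cache size limit: Not enforced by default; a limit such as 100MB has to be set on the backend or in your own code
- Eviction policy: LRU (Least Recently Used) is a common choice, but it is a property of the backend you configure rather than a LangChain default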
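A cache with an expiry time and LRU eviction can be sketched with the standard library. This is an illustration of the policy only, not LangChain's implementation: TtlLruCache is a hypothetical class, the size limit is counted in entries rather than megabytes, and the clock is injectable so the expiry can be demonstrated without waiting.

```python
import time
from collections import OrderedDict


class TtlLruCache:
    """Hypothetical cache with a per-entry lifetime and LRU eviction."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            # Entry is too old: drop it and report a miss
            del self._store[key]
            return None
        # Mark the entry as most recently used
        self._store.move_to_end(key)
        return value

    def put(self, key, value):
        self._store[key] = (self._clock() + self._ttl, value)
        self._store.move_to_end(key)
        # Evict least recently used entries until the cache fits the limit
        while len(self._store) > self._max_entries:
            self._store.popitem(last=False)


# Usage with a fake clock: 30-minute lifetime, room for two entries
now = [0.0]
demo = TtlLruCache(max_entries=2, ttl_seconds=1800, clock=lambda: now[0])
demo.put("a", "answer a")
demo.put("b", "answer b")
demo.get("a")              # "a" becomes the most recently used entry
demo.put("c", "answer c")  # cache is full, so "b" is evicted
```

Advancing the fake clock to 1800 seconds makes the remaining entries expire on their next lookup.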