Background

You can cache question-and-answer content: when the same question arrives again, the cached answer is returned directly, which saves both API costs and computation.


Install Dependencies

pip install -qU langchain-core langchain-community langchain-openai

Cache Types and How They Work

1. In-Memory Cache

  • Default caching method, implemented as a simple dictionary held in process memory
  • Cache is stored in memory during program runtime
  • Cache is automatically cleared after the process ends
  • Suitable for rapid prototyping of short-term, small-scale applications

2. Persistent Cache

  • Supports multiple storage backends: SQLite, Redis, the local file system, etc.
  • Cache can persist across sessions and processes
  • Suitable for production environments and large-scale applications
  • Example: SQLiteCache creates a local database file to store cache records
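The idea behind both cache types can be sketched in plain Python. The sketch below is a hypothetical illustration, not LangChain's actual implementation: the cache key combines the prompt with the model's generation parameters, so the same question asked with the same settings produces a cache hit, while changed settings do not.

```python
import hashlib
import json

class SimplePromptCache:
    """Hypothetical sketch of an LLM response cache (not LangChain's API)."""

    def __init__(self):
        self._store = {}  # key -> cached answer

    def _key(self, prompt, llm_params):
        # Key on the prompt plus the model settings, so changing e.g.
        # temperature or model name produces a different cache entry.
        raw = json.dumps({"prompt": prompt, "params": llm_params}, sort_keys=True)
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def lookup(self, prompt, llm_params):
        # Returns the cached answer, or None on a cache miss
        return self._store.get(self._key(prompt, llm_params))

    def update(self, prompt, llm_params, answer):
        self._store[self._key(prompt, llm_params)] = answer

cache = SimplePromptCache()
params = {"model": "gpt-3.5-turbo", "temperature": 0}
cache.update("Tell me a joke", params, "Why did the chicken cross the road?")
print(cache.lookup("Tell me a joke", params))              # cache hit
print(cache.lookup("Tell me a joke", {"temperature": 1}))  # None: different params
```

A persistent cache works the same way; it just writes the key-value pairs to SQLite, Redis, or disk instead of an in-process dictionary.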

Practical Application Scenarios for Caching

  1. Development and Debugging Phase: Reduce the number of repeated API calls
  2. Production Environment: Reduce LLM service call costs
  3. Queries with Unchanging Content: Such as FAQ answers, fixed knowledge base queries
  4. Batch Processing: When processing many similar requests
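For the batch-processing case, deduplicating prompts before calling the model makes the saving explicit. This is a plain-Python sketch; `call_llm` is a hypothetical stand-in for any LLM client:

```python
def answer_batch(prompts, call_llm):
    """Answer a batch of prompts, calling the LLM once per unique prompt."""
    cache = {}
    results = []
    for p in prompts:
        if p not in cache:
            cache[p] = call_llm(p)  # only unique prompts reach the LLM
        results.append(cache[p])
    return results

calls = []
def fake_llm(prompt):
    # Stand-in for a real model call; records how often it is invoked
    calls.append(prompt)
    return f"answer to: {prompt}"

batch = ["What is LangChain?", "Tell me a joke", "What is LangChain?"]
answers = answer_batch(batch, fake_llm)
print(len(calls))  # 2 -> only the two unique prompts were sent to the model
```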

Configuration and Usage Examples

# Note: in older LangChain versions these classes lived in langchain.cache
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache
from langchain_community.cache import SQLiteCache

# Use in-memory cache
set_llm_cache(InMemoryCache())

# Or use SQLite persistent cache
set_llm_cache(SQLiteCache(database_path=".langchain.db"))

Detailed Code Implementation

import time

from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache
from langchain_community.cache import SQLiteCache
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-3.5-turbo",
)
# Store in memory
set_llm_cache(InMemoryCache())
# Can also persist to a database instead:
# set_llm_cache(SQLiteCache(database_path=".langchain.db"))

# First call: not yet cached, so it hits the API and takes longer
start = time.perf_counter()
message1 = llm.invoke("Tell me a joke")
print(f"message1 ({time.perf_counter() - start:.2f}s): {message1.content}")

# Second call: served from the cache, so it returns much faster
start = time.perf_counter()
message2 = llm.invoke("Tell me a joke")
print(f"message2 ({time.perf_counter() - start:.2f}s): {message2.content}")

Running Results Explanation

  • First request: a full OpenAI API call over the network, typically a few seconds
  • Second request: the cached result is returned directly, typically in milliseconds, since no network round-trip is made
  • A large gap between the two timings confirms the caching mechanism is working
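The pattern behind those numbers can be reproduced without an API key by simulating the slow model call. In this pure-Python sketch, `slow_model` stands in for the network round-trip; the second lookup skips it entirely:

```python
import time

cache = {}
call_count = 0

def slow_model(prompt):
    # Stand-in for a network round-trip to the LLM provider
    global call_count
    call_count += 1
    time.sleep(0.2)
    return f"response to: {prompt}"

def cached_call(prompt):
    if prompt not in cache:
        cache[prompt] = slow_model(prompt)
    return cache[prompt]

t0 = time.perf_counter()
cached_call("Tell me a joke")
first = time.perf_counter() - t0

t0 = time.perf_counter()
cached_call("Tell me a joke")
second = time.perf_counter() - t0

print(call_count)      # 1: the model was only called once
print(second < first)  # True: the cache hit avoids the slow call
```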

Cache Invalidation and Updates

  1. Automatic Invalidation: The cache key includes the prompt and the model's parameters, so changing either results in a fresh call rather than a stale hit
  2. Manual Clearing: All entries can be cleared via the cache's clear() method
  3. Fine-grained Control: Caching can be disabled for a specific model instance by constructing it with cache=False (e.g. ChatOpenAI(cache=False))
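The three mechanisms above can be illustrated with a small pure-Python sketch. `TinyCache` and its `use_cache` argument are hypothetical names for illustration, not LangChain's API:

```python
class TinyCache:
    """Hypothetical sketch of manual clearing and per-call cache bypass."""

    def __init__(self):
        self._store = {}

    def clear(self):
        # Manual clearing: drop every cached entry
        self._store.clear()

    def call(self, prompt, model_fn, use_cache=True):
        # Fine-grained control: use_cache=False skips both lookup and storage
        if use_cache and prompt in self._store:
            return self._store[prompt]
        answer = model_fn(prompt)
        if use_cache:
            self._store[prompt] = answer
        return answer

calls = []
def fake_model(prompt):
    calls.append(prompt)
    return prompt.upper()

c = TinyCache()
c.call("hi", fake_model)                   # miss: model is called
c.call("hi", fake_model)                   # hit: served from the cache
c.call("hi", fake_model, use_cache=False)  # bypass: model is called again
c.clear()
c.call("hi", fake_model)                   # miss again after clearing
print(len(calls))  # 3: one miss, one bypass, one post-clear miss
```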

Performance Considerations

  1. Response Time: A cache hit skips the network round-trip entirely, so responses can be one to two orders of magnitude faster
  2. Cost Savings: Every cache hit is an API call you do not pay for, so savings scale directly with the hit rate
  3. Throughput: With a high hit rate, overall system throughput can improve substantially, since most requests never wait on the LLM

Best Practice Recommendations

  1. Use in-memory cache in development environments first
  2. Use high-performance persistent caches like Redis in production environments
  3. For frequently changing content, appropriately reduce cache time or disable caching
  4. Regularly monitor cache hit rate and effectiveness
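Monitoring the hit rate (recommendation 4) is straightforward to add around any cache. A minimal illustrative wrapper, with hypothetical names:

```python
class MonitoredCache:
    """Cache wrapper that tracks its hit rate (illustrative sketch)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute()  # only misses pay the full cost
        return self._store[key]

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

c = MonitoredCache()
for q in ["a", "b", "a", "a"]:
    c.get_or_compute(q, lambda: q.upper())
print(c.hit_rate)  # 0.5 -> 2 hits out of 4 lookups
```

If the hit rate stays low in production, caching may be adding storage cost without saving many API calls, which is the signal to shorten cache lifetimes or disable it for that workload.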

Cache Technical Details

  • InMemoryCache has no expiry time or size limit by default; entries live until the process exits or clear() is called
  • SQLiteCache likewise persists entries until they are explicitly cleared; automatic expiry (TTL) is a feature of backends such as Redis that support it natively
  • When a size cap is applied to an in-memory cache, the usual eviction policy is LRU (Least Recently Used)
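An LRU (Least Recently Used) eviction policy is easy to sketch with `collections.OrderedDict`, which is also how Python's own `functools.lru_cache` behaves conceptually. This is a generic illustration, not LangChain internals:

```python
from collections import OrderedDict

class LRUCache:
    """Bounded cache that evicts the least recently used entry (sketch)."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.maxsize:
            self._store.popitem(last=False)  # evict least recently used

c = LRUCache(maxsize=2)
c.put("q1", "a1")
c.put("q2", "a2")
c.get("q1")         # touch q1, so q2 becomes the least recently used
c.put("q3", "a3")   # capacity exceeded: q2 is evicted
print(c.get("q2"))  # None: evicted
print(c.get("q1"))  # a1: kept because it was recently used
```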