LangChain 06 - RAG with Source Document
Concept Analysis
Retrieval-Augmented Generation (RAG) is an AI framework that combines information retrieval with text generation. Unlike purely generative models, a RAG system first retrieves relevant information from one or more knowledge bases, then generates a more accurate response grounded in the retrieved content.
Core Components
- Document Preprocessing System
- Document parsing: Parse PDF, Word, Excel, and other formats
- Text cleaning: Remove special characters and normalize formatting
- Chunking: Split large documents into retrieval-sized chunks (typically 256-512 tokens)
- Metadata extraction: Automatically identify key information such as document title, author, and date
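The chunking step can be sketched as a naive sliding window over whitespace-separated words (a toy illustration; real pipelines use a proper tokenizer and a splitter such as LangChain's RecursiveCharacterTextSplitter, and the 256-512 figure counts model tokens, not words):

```python
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size words.

    Overlap preserves context across chunk boundaries, so a fact that
    straddles two chunks can still be retrieved from either one.
    """
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the final window already covers the tail of the text
    return chunks
```

For example, `chunk_text("a b c d e f g h", chunk_size=4, overlap=2)` yields three chunks, each sharing two words with its neighbor.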
- Vector Database
- Embedding model selection: Pre-trained models like BERT, GPT, etc.
- Vectorization: Convert text to high-dimensional vector representations
- Index building: Build efficient similarity search structures (like FAISS, Annoy)
- Storage optimization: Support incremental updates and real-time retrieval
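As a toy illustration of what the vector database does, a brute-force in-memory index with cosine-similarity search might look like this (the class and its API are illustrative assumptions; FAISS and Annoy replace the linear scan with approximate-nearest-neighbor index structures):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorIndex:
    """Brute-force vector index: stores (vector, document) pairs,
    scores every entry against the query, and returns the top k."""

    def __init__(self):
        self._entries = []  # list of (vector, document) pairs

    def add(self, vector: list[float], document: str) -> None:
        self._entries.append((vector, document))

    def search(self, query_vector: list[float], k: int = 5):
        scored = [
            (cosine_similarity(query_vector, v), doc) for v, doc in self._entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

Brute-force search is O(n) per query; at scale, FAISS and Annoy trade exactness for sub-linear approximate search.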
- Retrieval Module
- Query understanding: Analyze user question intent and key information
- Similarity calculation: Various algorithms like cosine similarity, Euclidean distance
- Multi-level retrieval: A two-stage strategy of coarse screening followed by fine re-ranking
- Result fusion: Combine evidence from multiple document fragments
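The two-stage idea can be sketched in a few lines (the `terms`/`vector` document schema and the keyword screen are illustrative assumptions, not a LangChain API):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_terms: set, query_vector: list[float], documents: list[dict], k: int = 3):
    """Stage 1: cheap keyword screen drops documents sharing no terms with
    the query. Stage 2: the survivors are re-ranked by cosine similarity."""
    candidates = [d for d in documents if query_terms & set(d["terms"])]
    candidates.sort(key=lambda d: cosine(query_vector, d["vector"]), reverse=True)
    return candidates[:k]
```

The coarse stage keeps the expensive vector scoring off documents that are obviously irrelevant, which is the point of the coarse-then-fine strategy.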
- Generation Module
- Prompt engineering: Design context templates including retrieval results
- Generation control: Tune parameters such as creativity and formality
- Citation annotation: Automatically mark information sources in answers
- Quality verification: Check accuracy and consistency of generated content
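A hypothetical prompt-assembly helper shows how retrieved snippets and citation tags can be baked into the context template (the `source`/`text` fields are assumptions for illustration):

```python
def build_prompt(question: str, retrieved: list[dict]) -> str:
    """Assemble a context-grounded prompt; each snippet gets a numbered
    citation tag so the model can annotate its sources in the answer."""
    context_lines = [
        f"[{i + 1}] ({doc['source']}) {doc['text']}"
        for i, doc in enumerate(retrieved)
    ]
    return (
        "Answer the question using only the context below. "
        "Cite sources as [n] after each claim.\n\n"
        "Context:\n" + "\n".join(context_lines)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
```

Instructing the model to answer only from the numbered context is what enables the citation annotation and traceability described above.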
Typical Workflow
- User Question: “What are the latest applications of quantum computing in drug development?”
- Retrieval Phase:
- System vectorizes the question
- Search the vector database for the most relevant document fragments
- Return the top 5 matches (including references to the original documents)
- Generation Phase:
- Combine retrieval results with question into prompt
- Language model generates answer based on this information
- Automatically cite specific document paragraphs in the answer
- Output Example: “According to research from Nature journal 2023 (Document A, page 12), quantum computing has been successfully applied… Another paper from MIT (Document B) points out…”
Application Scenarios
- Enterprise Knowledge Management:
- Quickly query policy documents, technical manuals
- Automatically generate report drafts conforming to company standards
- Example: Lawyers use RAG system to quickly search similar cases
- Academic Research:
- Literature review assistance tools
- Cross-paper knowledge correlation discovery
- Example: Graduate students use RAG system to analyze hundreds of related papers
- Customer Support:
- Intelligent Q&A based on product documents
- Troubleshooting guidance generation
- Example: E-commerce customer service system automatically cites return policy terms
- Medical Diagnostic Support:
- Decision support combined with medical literature
- Patient education material generation
- Example: Doctors get suggestions with citations when querying latest treatment guidelines
Technical Advantages
- Factual Accuracy: Reduces hallucinations compared with purely generative models
- Traceability: Every answer can be traced back to specific source documents
- Knowledge Update: Updating the document library requires no model retraining
- Domain Adaptation: Swap the document library to adapt quickly to a new professional domain
Implementation Challenges
- Document Quality Dependency: Garbage in, garbage out (GIGO) problem
- Retrieval Efficiency: Response speed bottleneck with massive documents
- Context Limitations: Generation models have limited context length
- Multi-document Fusion: Reconciling conflicting information across documents
Best Practices
- Preserve complete metadata and location information during document preprocessing
- Design targeted chunking strategies for different document types
- Implement result re-ranking mechanism to improve relevance
- Add a fact-checking step to the generation stage
- Establish user feedback mechanism for continuous system optimization
Future Developments
- Multi-modal RAG: Combine text, image, table and other information
- Active Retrieval: System automatically identifies knowledge gaps needing supplementation
- Dynamic Knowledge Update: Real-time capture and integration of latest information
- Personalized Adaptation: Adjust retrieval and generation strategies based on user profiles
Install Dependencies
```shell
pip install --upgrade --quiet langchain-core langchain-community langchain-openai
```
Code Implementation
The original article walks through a complete Python implementation, covering vector store creation, retriever configuration, prompt template definition, and conversation history management.
Running Result
```
result1: {'answer': AIMessage(content='Sam worked at home.', response_metadata={'finish_reason': 'stop', 'logprobs': None}), 'docs': [Document(page_content='sam worked at home'), Document(page_content='wuzikang worked at earth'), Document(page_content='harrison worked at kensho')]}
result2: {'answer': AIMessage(content='Sam actually worked at home.', response_metadata={'finish_reason': 'stop', 'logprobs': None}), 'docs': [Document(page_content='sam worked at home'), Document(page_content='wuzikang worked at earth'), Document(page_content='harrison worked at kensho')]}
```