LangChain 06 - RAG with Source Document
Concept Analysis
Retrieval-Augmented Generation (RAG) is an AI framework that combines information retrieval with text generation. Unlike purely generative models, a RAG system first retrieves relevant information from one or more knowledge bases, then generates a more accurate response grounded in the retrieved content.
Core Components
- Document Preprocessing System
- Document parsing: Parse PDF, Word, Excel, and other formats
- Text cleaning: Remove special characters and normalize formatting
- Chunking: Split large documents into retrieval-sized chunks (typically 256-512 tokens)
- Metadata extraction: Automatically identify key information such as document title, author, and date
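The chunking step can be sketched as a naive sliding window over whitespace-separated words (a toy illustration; real pipelines use a proper tokenizer and a splitter such as LangChain's RecursiveCharacterTextSplitter, and the 256-512 figure counts model tokens, not words):

```python
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size words.

    Overlap preserves context across chunk boundaries, so a fact that
    straddles two chunks can still be retrieved from either one.
    """
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the final window already covers the tail of the text
    return chunks
```

For example, `chunk_text("a b c d e f g h", chunk_size=4, overlap=2)` yields three chunks, each sharing two words with its neighbor.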
- Vector Database
- Embedding model selection: Pre-trained models like BERT, GPT, etc.
- Vectorization: Convert text to high-dimensional vector representations
- Index building: Build efficient similarity search structures (like FAISS, Annoy)
- Storage optimization: Support incremental updates and real-time retrieval
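As a toy illustration of what the vector database does, a brute-force in-memory index with cosine-similarity search might look like this (the class and its API are illustrative assumptions; FAISS and Annoy replace the linear scan with approximate-nearest-neighbor index structures):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorIndex:
    """Brute-force vector index: stores (vector, document) pairs,
    scores every entry against the query, and returns the top k."""

    def __init__(self):
        self._entries = []  # list of (vector, document) pairs

    def add(self, vector: list[float], document: str) -> None:
        self._entries.append((vector, document))

    def search(self, query_vector: list[float], k: int = 5):
        scored = [
            (cosine_similarity(query_vector, v), doc) for v, doc in self._entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

Brute-force search is O(n) per query; at scale, FAISS and Annoy trade exactness for sub-linear approximate search.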
- Retrieval Module
- Query understanding: Analyze user question intent and key information
- Similarity calculation: Various algorithms like cosine similarity, Euclidean distance
- Multi-level retrieval: A two-stage strategy of coarse screening followed by fine re-ranking
- Result fusion: Combine evidence from multiple document fragments
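The two-stage idea can be sketched in a few lines (the `terms`/`vector` document schema and the keyword screen are illustrative assumptions, not a LangChain API):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_terms: set, query_vector: list[float], documents: list[dict], k: int = 3):
    """Stage 1: cheap keyword screen drops documents sharing no terms with
    the query. Stage 2: the survivors are re-ranked by cosine similarity."""
    candidates = [d for d in documents if query_terms & set(d["terms"])]
    candidates.sort(key=lambda d: cosine(query_vector, d["vector"]), reverse=True)
    return candidates[:k]
```

The coarse stage keeps the expensive vector scoring off documents that are obviously irrelevant, which is the point of the coarse-then-fine strategy.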
- Generation Module
- Prompt engineering: Design context templates including retrieval results
- Generation control: Tune parameters such as creativity and formality
- Citation annotation: Automatically mark information sources in answers
- Quality verification: Check accuracy and consistency of generated content
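A hypothetical prompt-assembly helper shows how retrieved snippets and citation tags can be baked into the context template (the `source`/`text` fields are assumptions for illustration):

```python
def build_prompt(question: str, retrieved: list[dict]) -> str:
    """Assemble a context-grounded prompt; each snippet gets a numbered
    citation tag so the model can annotate its sources in the answer."""
    context_lines = [
        f"[{i + 1}] ({doc['source']}) {doc['text']}"
        for i, doc in enumerate(retrieved)
    ]
    return (
        "Answer the question using only the context below. "
        "Cite sources as [n] after each claim.\n\n"
        "Context:\n" + "\n".join(context_lines)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
```

Instructing the model to answer only from the numbered context is what enables the citation annotation and traceability described above.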
Typical Workflow
- User Question: “What are the latest applications of quantum computing in drug development?”
- Retrieval Phase:
- System vectorizes the question
- Search the vector database for the most relevant document fragments
- Return the top 5 matches (including references to the original documents)
- Generation Phase:
- Combine retrieval results with question into prompt
- Language model generates answer based on this information
- Automatically cite specific document paragraphs in the answer
- Output Example: “According to research from Nature journal 2023 (Document A, page 12), quantum computing has been successfully applied… Another paper from MIT (Document B) points out…”
Application Scenarios
- Enterprise Knowledge Management:
- Quickly query policy documents, technical manuals
- Automatically generate report drafts conforming to company standards
- Example: Lawyers use RAG system to quickly search similar cases
- Academic Research:
- Literature review assistance tools
- Cross-paper knowledge correlation discovery
- Example: Graduate students use RAG system to analyze hundreds of related papers
- Customer Support:
- Intelligent Q&A based on product documents
- Troubleshooting guidance generation
- Example: E-commerce customer service system automatically cites return policy terms
- Medical Diagnostic Support:
- Decision support combined with medical literature
- Patient education material generation
- Example: Doctors get suggestions with citations when querying latest treatment guidelines
Technical Advantages
- Factual Accuracy: Reduces hallucinations compared with purely generative models
- Traceability: Every answer can be traced back to specific source documents
- Knowledge Update: Updating the document library requires no model retraining
- Domain Adaptation: Swap the document library to adapt quickly to a new professional domain
Implementation Challenges
- Document Quality Dependency: Garbage in, garbage out (GIGO) problem
- Retrieval Efficiency: Response speed bottleneck with massive documents
- Context Limitations: Generation models have limited context length
- Multi-document Fusion: Reconciling conflicting information across documents
Best Practices
- Preserve complete metadata and location information during document preprocessing
- Design targeted chunking strategies for different document types
- Implement result re-ranking mechanism to improve relevance
- Add a fact-checking step to the generation stage
- Establish user feedback mechanism for continuous system optimization
Future Developments
- Multi-modal RAG: Combine text, image, table and other information
- Active Retrieval: System automatically identifies knowledge gaps needing supplementation
- Dynamic Knowledge Update: Real-time capture and integration of latest information
- Personalized Adaptation: Adjust retrieval and generation strategies based on user profiles
Install Dependencies
```shell
pip install --upgrade --quiet langchain-core langchain-community langchain-openai
```
Code Implementation
The original article walks through a complete Python implementation, covering vector store creation, retriever configuration, prompt template definition, and conversation history management.
Running Result
```
result1: {'answer': AIMessage(content='Sam worked at home.', response_metadata={'finish_reason': 'stop', 'logprobs': None}), 'docs': [Document(page_content='sam worked at home'), Document(page_content='wuzikang worked at earth'), Document(page_content='harrison worked at kensho')]}
result2: {'answer': AIMessage(content='Sam actually worked at home.', response_metadata={'finish_reason': 'stop', 'logprobs': None}), 'docs': [Document(page_content='sam worked at home'), Document(page_content='wuzikang worked at earth'), Document(page_content='harrison worked at kensho')]}
```