Background
LangChain provides multiple types of Text Splitters to meet different text processing needs:
1. RecursiveCharacterTextSplitter
- How it works: Tries an ordered list of separators (by default paragraph breaks, then line breaks, then spaces, then individual characters). It splits on the coarsest separator first and recurses with the next one only for pieces that still exceed chunk_size
- Features and advantages: Configurable chunk_size and chunk_overlap; the coarse-to-fine strategy keeps paragraphs and sentences intact where possible
- Typical applications: Processing mixed-format documents (e.g., PDF to text)
2. CharacterTextSplitter
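- Core function: Splits on a single separator string, "\n\n" by default, then merges the pieces into chunks; the separator list ["\n\n", "\n", " ", ""] is the default of RecursiveCharacterTextSplitter, not of this class
- Configuration options: The separator parameter sets the string to split on; keep_separator controls whether the separator is kept in the output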
3. SpacyTextSplitter
- Tokenization mechanism: Uses the spaCy NLP library to detect sentence boundaries instead of splitting on fixed characters
- Significant advantage: Chunks are built from whole sentences, so words and sentences are not cut in the middle
4. TokenTextSplitter
- Technical foundation: Uses OpenAI’s tiktoken library for precise token counting
- Unique value: Measures chunk length in tokens rather than characters, so chunks can be kept within a model's input limit, provided the chosen encoding matches the target model
- Key application: Text preprocessing before calling large model APIs like GPT
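The coarse-to-fine strategy described for RecursiveCharacterTextSplitter above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name recursive_split is hypothetical, chunk_overlap is left out for brevity, and adjacent short chunks are not re-merged across recursion levels.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Simplified illustration: try the coarsest separator first and recurse
    with finer ones only for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Intro paragraph.\n\n"
    "Second paragraph that is somewhat longer than the limit.\n\n"
    "End."
)
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

Short paragraphs survive as single chunks, and only the long middle paragraph is broken down further, at word boundaries. A single word longer than chunk_size would fall through to the hard cut.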
Install Dependencies
pip install -qU langchain-text-splitters
Main Splitter Types and Code Examples
- HTML Splitter - Splits by HTML tags (h1, h2, h3, etc.)
- WebHTML Splitter - Fetches HTML from URL and splits
- Character Splitter - Splits text by characters
- Code Splitter - Supports multiple programming languages like Python, JavaScript, TypeScript
- Markdown Splitter - Splits by Markdown syntax structure
- Markdown Header Splitter - Splits by Markdown heading levels
- JSON Splitter - Recursively splits JSON data
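As one example of structure-aware splitting from the list above, the idea behind splitting by Markdown heading levels can be sketched with the standard library alone. The function split_by_headers and its output format are assumptions made for illustration, not the library's API, and the sketch does not skip "#" lines inside fenced code blocks.

```python
import re


def split_by_headers(markdown, max_level=3):
    """Group Markdown text into sections keyed by the headings above them."""
    sections = []
    path = {}   # heading level -> heading title currently in effect
    body = []   # lines collected since the last heading

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append({"headers": dict(path), "content": content})
        body.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new heading replaces any heading at the same or a deeper level.
            for stale in [k for k in path if k >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


doc = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it."
for section in split_by_headers(doc):
    print(section["headers"], "->", section["content"])
```

Keeping the heading path with each section means a chunk retrieved later still carries the context of where it sat in the document.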
Selection Recommendations
When choosing, consider:
- Text characteristics (structured/unstructured)
- Downstream task requirements (whether semantic integrity is needed)
- Processing efficiency requirements
- Target model’s token limit
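To make the token-limit consideration concrete, here is a minimal sketch of token-window splitting with overlap. It assumes a toy tokenizer (toy_encode and toy_decode are hypothetical stand-ins, not tiktoken), so the counts will differ from any real model's; only the windowing logic is the point.

```python
import re


def toy_encode(text):
    # Stand-in tokenizer (an assumption, not tiktoken): words, punctuation
    # marks, and whitespace runs each count as one token.
    return re.findall(r"\w+|[^\w\s]|\s+", text)


def toy_decode(tokens):
    return "".join(tokens)


def split_by_tokens(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of at most chunk_size tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    tokens = toy_encode(text)
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(toy_decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


for chunk in split_by_tokens("one two three four five", chunk_size=4, chunk_overlap=1):
    print(repr(chunk))
```

Each window starts chunk_size minus chunk_overlap tokens after the previous one, so neighbouring chunks share chunk_overlap tokens. Because the cut positions ignore sentence structure, this approach guarantees the size limit but not semantic integrity.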
Splitters can also be combined. For example, first use CharacterTextSplitter to split a document by chapter, then apply TokenTextSplitter so that every resulting chunk stays within the token limit.
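The two-stage idea can be sketched without the library as follows. The assumptions are loud ones: chapters are taken to be separated by two blank lines, whitespace-separated words stand in for model tokens, and split_chapters_then_limit is a hypothetical helper rather than a LangChain function.

```python
def split_chapters_then_limit(text, chapter_separator="\n\n\n", max_tokens=50):
    """Two-stage split: by chapter first, then by a token budget.

    Assumptions for illustration: chapters are separated by two blank
    lines, and whitespace-separated words stand in for model tokens.
    """
    chunks = []
    for chapter in text.split(chapter_separator):
        words = chapter.split()  # note: this normalizes whitespace
        # A chapter within budget yields one chunk; a longer one is windowed.
        for start in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks


book = "a b c\n\n\nd e f g h"
print(split_chapters_then_limit(book, max_tokens=3))
```

No chunk spans two chapters, because the budget is applied inside each chapter separately. With the real classes, the second stage would be run on each chunk returned by the first.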