LangChain Text Splitter: Character, Word, HTML and Code-b...

Background

LangChain provides multiple types of Text Splitters to meet different text processing needs:

1. RecursiveCharacterTextSplitter

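How it works: Tries a prioritized list of separators (by default "\n\n", "\n", " ", and finally the empty string) and recursively re-splits any piece that is still longer than chunk_size using the next, finer separator in the list
Features and advantages: Customizable chunk_size and chunk_overlap parameters and a customizable separators list; the coarse-to-fine strategy keeps paragraphs and sentences together where possible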
Typical applications: Processing mixed-format documents (e.g., PDF to text)
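
The coarse-to-fine idea can be illustrated with a small pure-Python sketch. This is a simplified illustration under stated assumptions, not LangChain's implementation: the function name recursive_split is hypothetical, chunk_overlap is left out, and chunk length is measured in characters.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Separators are tried in priority order; a finer separator is only
    used for pieces that are still too long after the coarser split.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the next separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph is quite a bit longer and will not fit into one chunk.\n\n"
    "Third."
)
for chunk in recursive_split(sample, chunk_size=40):
    print(repr(chunk))
```

The short first paragraph survives as a single chunk, while the long second paragraph is broken at spaces only because no coarser separator could make it fit.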

2. CharacterTextSplitter

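Core function: Splits on a single, user-specified separator string (default "\n\n") and then merges the pieces into chunks of up to chunk_size characters. The prioritized list ["\n\n", "\n", " ", ""] is the default of RecursiveCharacterTextSplitter, not of this class
Configuration options: The separator parameter sets the delimiter to split on; keep_separator controls whether delimiters are kept in the output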
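
To make the effect of keep_separator concrete, here is a minimal pure-Python sketch of single-separator splitting. The helper split_on_separator is hypothetical, and attaching the kept delimiter to the start of the following piece is an assumption made for illustration; the library's exact placement may differ by version.

```python
import re


def split_on_separator(text, separator="\n\n", keep_separator=False):
    """Split text on one separator, optionally keeping the delimiter."""
    if not keep_separator:
        return [p for p in text.split(separator) if p]
    # A capturing group makes re.split return the delimiters as well:
    # [text, sep, text, sep, text, ...]
    parts = re.split(f"({re.escape(separator)})", text)
    pieces = [parts[0]]
    for i in range(1, len(parts), 2):
        # Assumption: the delimiter is glued to the piece that follows it.
        pieces.append(parts[i] + parts[i + 1])
    return [p for p in pieces if p]


doc = "Intro\n\nChapter one\n\nChapter two"
print(split_on_separator(doc))
print(split_on_separator(doc, keep_separator=True))
```

Keeping the delimiter matters when it carries meaning, for example a heading marker or a closing brace in source code.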

3. SpacyTextSplitter

Tokenization mechanism: Uses the spaCy NLP library to detect sentence boundaries, rather than simple character splitting (NLTKTextSplitter offers the same approach based on NLTK)
Significant advantage: Chunks end at sentence boundaries, which avoids cutting in the middle of words or sentences

4. TokenTextSplitter

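Technical foundation: Uses OpenAI’s tiktoken library to encode the text and measures chunk_size in tokens instead of characters
Unique value: Token counts match those of models that use the same tiktoken encoding, which helps keep inputs within length limits; models with a different tokenizer will count differently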
Key application: Text preprocessing before calling large model APIs like GPT
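
Token-based splitting amounts to sliding a fixed-size window over the token sequence. The sketch below shows that windowing logic only. It is not the library's code: split_by_tokens is a hypothetical helper, and whitespace-separated words stand in for tiktoken tokens so the example runs without extra dependencies.

```python
def split_by_tokens(tokens, chunk_size, chunk_overlap=0):
    """Return windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return windows


# Assumption: words approximate tokens; a real tokenizer would be used instead.
words = "the quick brown fox jumps over the lazy dog again".split()
for window in split_by_tokens(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(window))
```

With a real tokenizer, the same loop would run over token ids, and each window would be decoded back to text before being sent to the model.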

Install Dependencies

pip install -qU langchain-text-splitters

Main Splitter Types and Code Examples

HTML Splitter - Splits by HTML tags (h1, h2, h3, etc.)
WebHTML Splitter - Fetches HTML from URL and splits
Character Splitter - Splits text on a separator string and measures chunk size in characters
Code Splitter - Supports multiple programming languages like Python, JavaScript, TypeScript
Markdown Splitter - Splits by Markdown syntax structure
Markdown Header Splitter - Splits by Markdown heading levels
JSON Splitter - Recursively splits JSON data
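
As an illustration of the structure-aware splitters in this list, the following sketch splits Markdown by heading level and records the heading path of each section. It is a simplified stand-in written with the standard library only: the function split_by_headings and its output format are assumptions, not the Markdown Header Splitter's actual API, and headings inside fenced code blocks are not handled.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown):
    """Group Markdown lines into sections keyed by their heading path."""
    sections = []
    headers = {}   # current heading path, e.g. {1: "Guide", 2: "Install"}
    body = []

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append({"headers": dict(headers), "content": content})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading closes every heading at the same or deeper level.
            for deeper in [k for k in headers if k >= level]:
                del headers[deeper]
            headers[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


doc = "# Guide\nIntro text.\n## Install\nRun pip.\n## Usage\nCall split.\n"
for section in split_by_headings(doc):
    print(section)
```

Carrying the heading path along with each section lets a retrieval step show where a chunk came from in the original document.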

Selection Recommendations

When choosing, consider:

Text characteristics (structured/unstructured)
Downstream task requirements (whether semantic integrity is needed)
Processing efficiency requirements
Target model’s token limit

Splitters can also be combined: for example, first use CharacterTextSplitter with a chapter delimiter as the separator to split by chapters, then run TokenTextSplitter on each chapter so that every resulting chunk stays within the token limit.
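
The two-stage idea can be sketched without the library as follows. This is a minimal illustration under assumptions: split_chapters_then_limit is a hypothetical helper, the chapter delimiter is chosen for the example, and whitespace-separated words stand in for model tokens.

```python
def split_chapters_then_limit(text, chapter_separator, max_tokens):
    """Stage 1: split by chapter. Stage 2: cap each chunk at max_tokens."""
    chunks = []
    for chapter in text.split(chapter_separator):
        # Assumption: words approximate tokens in this sketch.
        tokens = chapter.split()
        for start in range(0, len(tokens), max_tokens):
            chunks.append(" ".join(tokens[start:start + max_tokens]))
    return chunks


book = "one two three\n\n\nfour five six seven eight"
for chunk in split_chapters_then_limit(book, "\n\n\n", max_tokens=3):
    print(chunk)
```

Because the second stage runs per chapter, no chunk ever spans a chapter boundary, and short chapters pass through unchanged.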