Background

LangChain provides multiple types of Text Splitters to meet different text processing needs:

1. RecursiveCharacterTextSplitter

  • How it works: Tries a prioritized list of separators (by default "\n\n", then "\n", then " ", then the empty string) and recursively falls back to the next separator whenever a piece is still larger than chunk_size
  • Features and advantages: Customizable chunk_size and chunk_overlap parameters; keeps paragraphs, then lines, then words together for as long as they fit
  • Typical applications: Processing mixed-format documents (e.g., PDF to text)
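
The recursive fallback described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name recursive_split is hypothetical, chunk_overlap is left out, and sizes are measured in characters.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Separators are tried in priority order; a piece that is still too
    large is split again with the next separator in the list.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits into the running chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too large on its own: fall back to finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "alpha beta\n\ngamma delta epsilon\n\nzeta"
    for chunk in recursive_split(sample, chunk_size=12):
        print(repr(chunk))
```

In the sample, the middle paragraph is too long for one chunk, so only that paragraph is broken at spaces; the other paragraphs stay whole.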

2. CharacterTextSplitter

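  • Core function: Splits on a single separator (default "\n\n") and merges the resulting pieces into chunks of up to chunk_size; the separator list ["\n\n", "\n", " ", ""] is the default of RecursiveCharacterTextSplitter, not of this class
  • Configuration options: The separator parameter sets the string to split on, and keep_separator controls whether the delimiter is kept in the output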
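
For contrast with the recursive approach, here is a minimal sketch of single-separator splitting. It is an illustration only: split_on_separator is a hypothetical name, and the sketch assumes that a piece longer than chunk_size is emitted as-is instead of being split further, which is how the single-separator strategy is understood to behave.

```python
def split_on_separator(text, separator="\n\n", chunk_size=40):
    """Split on one separator, then greedily merge neighbouring pieces."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Assumption: an oversized piece is kept whole, not re-split.
            current = piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "one\n\ntwo\n\n" + "x" * 50 + "\n\nthree"
    for chunk in split_on_separator(sample, chunk_size=10):
        print(len(chunk), repr(chunk[:20]))
```

The 50-character paragraph comes back as a single oversized chunk, which is the main practical difference from a recursive splitter.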

3. SpacyTextSplitter

  • Tokenization mechanism: Uses spaCy's NLP pipeline to segment the text into sentences and then groups sentences into chunks, rather than splitting on raw characters
  • Significant advantage: Chunks break at sentence boundaries, so words and sentences are not cut in the middle

4. TokenTextSplitter

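  • Technical foundation: Uses OpenAI’s tiktoken library to measure and split text in tokens rather than characters
  • Unique value: Chunk sizes are counted in the same units the model uses, provided the chosen encoding matches the target model, which makes it easier to stay within input length limits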
  • Key application: Text preprocessing before calling large model APIs like GPT
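
The windowing logic behind token-based splitting can be sketched as follows. To stay dependency-free, the example uses whitespace-separated words as stand-in tokens; this is an assumption for illustration only. With tiktoken installed, the token list would instead come from something like tiktoken.get_encoding("cl100k_base").encode(text). The function name split_by_tokens is hypothetical.

```python
def split_by_tokens(tokens, chunk_size, chunk_overlap=0):
    """Return overlapping windows of at most chunk_size tokens each."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return windows


if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog again"
    tokens = text.split()  # stand-in tokenizer, not a real LLM tokenizer
    for window in split_by_tokens(tokens, chunk_size=4, chunk_overlap=1):
        print(" ".join(window))
```

Each window shares chunk_overlap tokens with the previous one, so context at the boundary is not lost between consecutive chunks.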

Install Dependencies

pip install -qU langchain-text-splitters

Main Splitter Types and Code Examples

  1. HTML Splitter - Splits by HTML tags (h1, h2, h3, etc.)
  2. WebHTML Splitter - Fetches HTML from URL and splits
  3. Character Splitter - Splits text by characters
  4. Code Splitter - Supports multiple programming languages like Python, JavaScript, TypeScript
  5. Markdown Splitter - Splits by Markdown syntax structure
  6. Markdown Header Splitter - Splits by Markdown heading levels
  7. JSON Splitter - Recursively splits JSON data
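
As an illustration of structure-aware splitting, the sketch below splits Markdown by heading level and records the heading path for each section. It is a simplified stand-in written for this article, not the library's Markdown header splitter: split_by_headings is a hypothetical name, only "#"-style headings are recognized, and fenced code blocks containing "#" lines are not handled.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown):
    """Split Markdown into sections, each tagged with its heading path."""
    sections = []
    path = {}  # heading level -> title of the currently open heading
    body = []

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append({
                "headings": [path[level] for level in sorted(path)],
                "content": content,
            })
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the section that belongs to the previous heading
            level = len(match.group(1))
            # A new heading closes every open heading at the same or deeper level.
            for open_level in [k for k in path if k >= level]:
                del path[open_level]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


if __name__ == "__main__":
    doc = "# Guide\nIntro text.\n## Install\nRun pip.\n## Usage\nCall it.\n# FAQ\nNone yet."
    for section in split_by_headings(doc):
        print(" > ".join(section["headings"]), "|", section["content"])
```

Keeping the heading path next to each chunk lets a downstream retrieval step show where a passage came from.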

Selection Recommendations

When choosing, consider:

  • Text characteristics (structured/unstructured)
  • Downstream task requirements (whether semantic integrity is needed)
  • Processing efficiency requirements
  • Target model’s token limit

Advanced usage also includes combining multiple splitters, such as first using CharacterTextSplitter to split a document into chapters, then using TokenTextSplitter so that every resulting chunk stays within the token limit.
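
The two-stage idea can be sketched without any library calls. Everything here is an assumption made for illustration: the chapter delimiter "\n---\n", the function name split_chapters_then_tokens, and the use of whitespace-separated words in place of real model tokens.

```python
def split_chapters_then_tokens(text, chapter_sep="\n---\n", max_tokens=8):
    """Stage 1: split into chapters. Stage 2: cap each chapter at max_tokens."""
    results = []
    for index, chapter in enumerate(text.split(chapter_sep)):
        tokens = chapter.split()  # stand-in tokenizer (whitespace words)
        for start in range(0, len(tokens), max_tokens):
            chunk = " ".join(tokens[start:start + max_tokens])
            results.append((index, chunk))  # keep the chapter index with the chunk
    return results


if __name__ == "__main__":
    book = "a b c\n---\nd e f g h i j k l m"
    for chapter_index, chunk in split_chapters_then_tokens(book, max_tokens=4):
        print(chapter_index, chunk)
```

Short chapters pass through as a single chunk, while long chapters are cut into several chunks that all carry the same chapter index.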