Background

LangChain provides a rich ecosystem of document loaders, each optimized for a particular data source, covering a wide range of document processing needs.

Main Loader Types

  1. TextLoader - The most basic loader, reading plain text from the local file system. Handles .txt, .md, and other plain-text formats, and supports custom encoding settings.

  2. CSVLoader - Specifically for structured tabular data; produces one document per row. Delimiters and quoting can be configured via csv_args, and specific columns can be selected for loading.

  3. UnstructuredFileLoader - General-purpose file loader backed by the Unstructured open-source library; automatically detects and processes a wide range of file formats, including Word (.docx), Excel (.xlsx), PowerPoint (.pptx), and more.

  4. DirectoryLoader - Batch file loader, can recursively scan all files in a specified directory, supports file filtering through glob patterns.

  5. UnstructuredHTMLLoader - HTML loader backed by Unstructured; extracts the main text content from web pages while discarding boilerplate markup.

  6. JSONLoader - Supports deep parsing of JSON data structures; a jq expression (jq_schema) selects which fields to extract.

  7. PyPDFLoader - PDF document-specific loader, based on the pypdf library; extracts text content and metadata, one document per page.

  8. ArxivLoader - Academic paper-specific loader, can directly fetch papers from arXiv by paper ID or search keywords.
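
Whatever the source, every loader above returns a list of Document objects: extracted text in page_content, provenance in metadata. A minimal sketch of that shape (a simplified stand-in for illustration, not LangChain's actual class):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    # Simplified stand-in for langchain_core.documents.Document
    page_content: str                              # the extracted text
    metadata: dict = field(default_factory=dict)   # e.g. source path, page number

doc = Document(page_content="Hello, world!", metadata={"source": "./index.md"})
print(doc.page_content)          # the text a splitter or embedder would consume
print(doc.metadata["source"])    # where the text came from
```

Because every loader emits this same shape, downstream steps (splitting, embedding, retrieval) do not care which loader produced the documents.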


Install Dependencies

pip install -qU langchain-core langchain-community langchain-openai

Individual loaders may need extra packages, e.g. unstructured for the Unstructured loaders, jq for JSONLoader, and beautifulsoup4 for BSHTMLLoader.

Load Text

from langchain_community.document_loaders import TextLoader

loader = TextLoader("./index.md")
data = loader.load()
print(data)

Load CSV

from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv')
data = loader.load()
print(data)
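CSVLoader emits one document per row, with the column headers folded into the text. A rough stdlib sketch of that behavior (the sample data here is made up, and the formatting only approximates CSVLoader's actual output):

```python
import csv
import io

# A small in-memory CSV standing in for a file like mlb_teams_2012.csv
raw = "Team,Payroll,Wins\nNationals,81.34,98\nReds,82.20,97\n"

rows = list(csv.DictReader(io.StringIO(raw)))
# One "document" per row: header/value pairs joined line by line,
# roughly how CSVLoader builds each page_content
docs = ["\n".join(f"{k}: {v}" for k, v in row.items()) for row in rows]
print(docs[0])
```

This is why CSV files work well for retrieval: each row becomes an independently searchable chunk.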

Load Directory

from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('../', glob="**/*.md")
docs = loader.load()
print(docs)

# Show progress bar (requires the tqdm package)
loader = DirectoryLoader('../', glob="**/*.md", show_progress=True)

# Multi-threaded loading
loader = DirectoryLoader('../', glob="**/*.md", use_multithreading=True)

# Auto-detect encoding (TextLoader imported in the Load Text section above)
text_loader_kwargs = {'autodetect_encoding': True}
loader = DirectoryLoader('../', glob="**/*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
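
The glob patterns above are standard pathlib-style globs, so it is easy to preview which files a DirectoryLoader call would pick up before actually loading them. A small sketch using a temporary directory (the file names are illustrative):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes").mkdir()
    (root / "a.md").write_text("# a")
    (root / "notes" / "b.md").write_text("# b")
    (root / "data.csv").write_text("x,y")

    # "**/*.md" matches markdown files at any depth, the same
    # semantics DirectoryLoader's glob parameter uses
    matched = sorted(p.name for p in root.glob("**/*.md"))

print(matched)
```

Previewing the match list first is a cheap way to catch an overly broad pattern before a slow bulk load.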

Load HTML

from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_community.document_loaders import BSHTMLLoader

loader = UnstructuredHTMLLoader("example_data/fake-content.html")
data = loader.load()
print(data)

# Use BeautifulSoup4 for parsing
loader = BSHTMLLoader("example_data/fake-content.html")
data = loader.load()
print(data)

Load JSON

from langchain_community.document_loaders import JSONLoader
import json
from pathlib import Path
from pprint import pprint

# Use JSONLoader
loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
pprint(data)
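The jq_schema '.messages[].content' walks into the messages array and pulls out each content field. Without the jq dependency, the same extraction can be sketched with the standard json module (the sample data below is made up, not the real facebook_chat.json):

```python
import json

raw = json.dumps({
    "participants": [{"name": "A"}, {"name": "B"}],
    "messages": [
        {"sender_name": "A", "content": "Hi there"},
        {"sender_name": "B", "content": "Hello!"},
    ],
})

data = json.loads(raw)
# Equivalent of jq_schema='.messages[].content':
# descend into "messages", take each element's "content" field
contents = [m["content"] for m in data["messages"]]
print(contents)
```

JSONLoader then wraps each extracted value in its own Document, so every message becomes a separate chunk.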

Load JSON Lines

from langchain_community.document_loaders import JSONLoader
from pathlib import Path
from pprint import pprint

file_path = './example_data/facebook_chat_messages.jsonl'
pprint(Path(file_path).read_text())

loader = JSONLoader(
    file_path='./example_data/facebook_chat_messages.jsonl',
    jq_schema='.content',
    text_content=False,
    json_lines=True)

data = loader.load()
pprint(data)

Load Markdown

from langchain_community.document_loaders import UnstructuredMarkdownLoader

markdown_path = "../../../../../README.md"
loader = UnstructuredMarkdownLoader(markdown_path)
data = loader.load()
print(data)

Load PDF

Install Dependencies

pip install pypdf
pip install rapidocr-onnxruntime

Write Code

from langchain_community.document_loaders import PyPDFLoader

# Load PDF
loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split()
print(pages[0])

# Convert images to text
loader = PyPDFLoader("https://arxiv.org/pdf/2103.15348.pdf", extract_images=True)
pages = loader.load()
print(pages[4].page_content)

Vectorize Data (Simple Example)

# Requires: pip install faiss-cpu
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())
docs = faiss_index.similarity_search("How will the community be engaged?", k=2)
for doc in docs:
    print(str(doc.metadata["page"]) + ":", doc.page_content[:300])
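
Under the hood, similarity_search embeds the query and ranks the stored vectors by distance. A toy cosine-similarity ranking over hand-made vectors shows the idea (the vectors here are fabricated stand-ins, not real OpenAI embeddings):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Fabricated "embeddings" for three documents and one query
docs = {
    "community": [0.9, 0.1, 0.0],
    "layout":    [0.1, 0.9, 0.2],
    "license":   [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

# Rank documents by similarity to the query, highest first, keep k=2
top2 = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)[:2]
print(top2)
```

A real vector store like FAISS does the same ranking, just over thousands of high-dimensional vectors with an index that avoids comparing against every document.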