Background
Text Embedding in Large Models
Text Embedding maps high-dimensional, discrete data (such as text) into a lower-dimensional continuous vector space. This not only reduces the complexity of downstream processing but also captures and expresses the semantic information in the data, improving model performance and efficiency.
Concretely, text embedding converts words, sentences, or paragraphs into fixed-length vector representations using deep neural networks (such as BERT or GPT). These vectors encode not just the literal meaning of words but also contextual information, semantic relationships, and other deep features.
Application Scenarios
- Text Classification: By converting text to vectors, classifiers can be trained for sentiment analysis, topic classification, etc.
- Information Retrieval: Calculate similarity between text vectors to achieve efficient semantic search
- Machine Translation: Establish cross-language vector space mapping to improve translation quality
- Question Answering Systems: Map questions and answers to the same space for matching
- Recommendation Systems: Analyze user and content vector representations to achieve personalized recommendations
Main Methods
- Word2Vec: A pioneering word vector model that learns embeddings by predicting a word from its context (CBOW) or the context from a word (skip-gram)
- GloVe: Word vector model based on global word frequency statistics
- BERT: Bidirectional pre-training model based on Transformer, can generate context-aware word vectors
- Sentence-BERT: Improved model specifically for sentence-level embedding
How It Works
The core idea of Text Embedding is to convert words, phrases, or sentences into real-valued vectors (also called embedding vectors). These vectors live in a continuous vector space: classic word-vector models typically use 50-300 dimensions, while modern LLM-based embedding models produce larger vectors (OpenAI's embedding model returns 1536 dimensions, as the example below shows). The geometric structure of this space effectively captures the semantic features and grammatical relationships of language.
Key Features
- Semantic Relevance: semantic similarity is reflected through geometric relationships in the vector space
  - Semantically similar words (like “cat” and “dog”) are closer together in the space
  - Antonym pairs (like “good” and “bad”) may exhibit roughly mirrored relationships along certain directions
  - Analogy relationships (like “king : queen” ≈ “man : woman”) can be recovered through vector arithmetic
- Mathematical Operability:
  - Supports vector addition and subtraction (e.g., vector(“Paris”) - vector(“France”) + vector(“Japan”) ≈ vector(“Tokyo”))
  - Cosine similarity and other metrics can quantify semantic relevance
  - Vectors are directly consumable by machine learning models
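The two properties above can be sketched with toy vectors. The 3-dimensional values here are hand-picked purely for illustration; real embeddings are learned and have hundreds or thousands of dimensions:

```python
import math

# Toy 3-d "embeddings", hand-picked for illustration only --
# real models learn these values from data.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.2, 0.1],
    "man":   [0.5, 0.8, 0.3],
    "woman": [0.5, 0.2, 0.3],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Analogy arithmetic: king - man + woman should land near "queen".
analogy = [k - m + w for k, m, w in
           zip(vectors["king"], vectors["man"], vectors["woman"])]

best = max(vectors, key=lambda w: cosine_similarity(analogy, vectors[w]))
print(best)  # queen
```

With these particular toy values the arithmetic lands exactly on “queen”; with real embeddings the result is only approximately nearest.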
Install Dependencies
```shell
pip install -qU langchain-core langchain-openai
```
Write Code
```python
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

# Embed a batch of documents; returns one vector per input string
embeddings = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!",
    ]
)
print(len(embeddings))     # number of documents embedded
print(len(embeddings[0]))  # dimensionality of each embedding vector

# Embed a single query string for later similarity comparison
embedded_query = embeddings_model.embed_query(
    "What was the name mentioned in the conversation?"
)
print(embedded_query[:5])  # first five components of the query vector
```
Running Results
```shell
➜ python3 test22.py
5
1536
[0.005339288459123527, -0.0004900397315547535, 0.03888638540715689, -0.0029435385310610336, -0.00899561173676785]
```
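In practice, the query vector printed above would be compared against the stored document vectors to find the closest match. A minimal sketch of that ranking step, using made-up 4-dimensional stand-in vectors (real OpenAI embeddings have 1536 dimensions, and these similarity values are illustrative only):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

documents = [
    "Hi there!",
    "Oh, hello!",
    "What's your name?",
    "My friends call me World",
    "Hello World!",
]
# Stand-ins for the embed_documents(...) output (made-up 4-d vectors)
doc_vectors = [
    [0.1, 0.9, 0.0, 0.2],
    [0.2, 0.8, 0.1, 0.1],
    [0.9, 0.1, 0.4, 0.0],
    [0.8, 0.2, 0.5, 0.1],
    [0.3, 0.6, 0.2, 0.3],
]
# Stand-in for the embed_query(...) output
query_vector = [0.8, 0.2, 0.5, 0.1]

# Rank documents by similarity to the query -- the core of semantic search
ranked = sorted(zip(documents, doc_vectors),
                key=lambda dv: cosine_similarity(query_vector, dv[1]),
                reverse=True)
print(ranked[0][0])  # My friends call me World
```

Doing this ranking by brute force works for a handful of vectors; for large collections a dedicated index such as FAISS (below) is needed.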
Storing to FAISS
Basic Concepts
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and dense vector clustering, developed by the Facebook AI Research team. It handles nearest-neighbor search over large-scale vector datasets quickly, making it well suited to the vector-similarity workloads common in machine learning applications.
Core Features
- High Performance: FAISS is highly optimized for large-scale vector search
- Flexible and Scalable: Supports vector scales from millions to billions
- Cross-platform Support: provides a Python interface on top of an efficient C++ core
Typical Application Scenarios
- Recommendation Systems: Finding similar users or items
- Natural Language Processing: Semantic search, document retrieval, question answering systems
- Computer Vision: Image retrieval, face recognition
- Anomaly Detection: Identifying anomalous samples through vector distances
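At its simplest, FAISS's flat index performs exact k-nearest-neighbor search by L2 distance. The NumPy sketch below shows the same semantics by brute force (FAISS computes the identical result, but heavily optimized and with approximate indexes for very large datasets); the data here is random and purely illustrative:

```python
import numpy as np

# 1000 database vectors and 3 query vectors of 64 dimensions each
rng = np.random.default_rng(42)
database = rng.random((1000, 64), dtype=np.float32)
queries = rng.random((3, 64), dtype=np.float32)

def knn_l2(database, queries, k):
    # Squared L2 distance between every query and every database vector
    dists = ((queries[:, None, :] - database[None, :, :]) ** 2).sum(axis=-1)
    # Indices of the k closest database vectors per query
    indices = np.argsort(dists, axis=1)[:, :k]
    return np.take_along_axis(dists, indices, axis=1), indices

distances, indices = knn_l2(database, queries, k=4)
print(indices.shape)  # (3, 4): the 4 nearest database ids per query
```

This is what a flat FAISS index does under the hood; the library's value is doing it orders of magnitude faster and at scales where brute force is infeasible.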
Install FAISS
```shell
pip install --upgrade --quiet langchain-openai faiss-cpu
```
Write Code
```python
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

underlying_embeddings = OpenAIEmbeddings()

# Cache computed embeddings on disk so repeated runs don't re-call the API
store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, store, namespace=underlying_embeddings.model
)
print(list(store.yield_keys()))  # empty on the first run

# Load the source document and split it into ~1000-character chunks
raw_documents = TextLoader("./state_of_the_union.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)

# Embed each chunk (through the cache) and index the vectors in FAISS
db = FAISS.from_documents(documents, cached_embedder)
print(list(store.yield_keys())[:5])  # cache keys created for the chunks
```