Background
Text Embedding in Large Models
Text Embedding maps high-dimensional, discrete data (such as text) into a lower-dimensional continuous vector space. This not only reduces the complexity of downstream processing but also captures and expresses the semantic information in the data, improving model performance and efficiency.
Concretely, text embedding converts words, sentences, or paragraphs into fixed-length vector representations using deep neural networks (such as BERT or GPT). These vectors encode not just the literal meaning of words but also contextual information, semantic relationships, and other deep features.
Application Scenarios
- Text Classification: By converting text to vectors, classifiers can be trained for sentiment analysis, topic classification, etc.
- Information Retrieval: Calculate similarity between text vectors to achieve efficient semantic search
- Machine Translation: Establish cross-language vector space mapping to improve translation quality
- Question Answering Systems: Map questions and answers to the same space for matching
- Recommendation Systems: Analyze user and content vector representations to achieve personalized recommendations
Main Methods
- Word2Vec: A pioneering word vector model that learns embeddings by predicting a word from its context (CBOW) or the context from a word (skip-gram)
- GloVe: Word vector model based on global word frequency statistics
- BERT: Bidirectional pre-training model based on Transformer, can generate context-aware word vectors
- Sentence-BERT: Improved model specifically for sentence-level embedding
How It Works
The core idea of Text Embedding is to convert words, phrases, or sentences into real-valued vectors (also called embedding vectors). These vectors live in a continuous vector space: classic word-vector models typically use 50-300 dimensions, while modern LLM-based embedding models produce larger vectors (OpenAI's embedding model returns 1536 dimensions, as the example below shows). The geometric structure of this space effectively captures the semantic features and grammatical relationships of language.
Key Features
- Semantic Relevance: semantic similarity is reflected through geometric relationships in the vector space
  - Semantically similar words (like “cat” and “dog”) are closer together in the space
  - Antonym pairs (like “good” and “bad”) may exhibit roughly mirrored relationships along certain directions
  - Analogy relationships (like “king : queen” ≈ “man : woman”) can be recovered through vector arithmetic
- Mathematical Operability:
  - Supports vector addition and subtraction (e.g., vector(“Paris”) - vector(“France”) + vector(“Japan”) ≈ vector(“Tokyo”))
  - Cosine similarity and other metrics can quantify semantic relevance
  - Vectors are directly consumable by machine learning models
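The two properties above can be sketched with toy vectors. The 3-dimensional values here are hand-picked purely for illustration; real embeddings are learned and have hundreds or thousands of dimensions:

```python
import math

# Toy 3-d "embeddings", hand-picked for illustration only --
# real models learn these values from data.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.2, 0.1],
    "man":   [0.5, 0.8, 0.3],
    "woman": [0.5, 0.2, 0.3],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Analogy arithmetic: king - man + woman should land near "queen".
analogy = [k - m + w for k, m, w in
           zip(vectors["king"], vectors["man"], vectors["woman"])]

best = max(vectors, key=lambda w: cosine_similarity(analogy, vectors[w]))
print(best)  # queen
```

With these particular toy values the arithmetic lands exactly on “queen”; with real embeddings the result is only approximately nearest.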
Install Dependencies
```shell
pip install -qU langchain-core langchain-openai
```
Write Code
```python
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

# Embed a batch of documents; returns one vector per input string
embeddings = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!",
    ]
)
print(len(embeddings))     # number of documents embedded
print(len(embeddings[0]))  # dimensionality of each embedding vector

# Embed a single query string for later similarity comparison
embedded_query = embeddings_model.embed_query(
    "What was the name mentioned in the conversation?"
)
print(embedded_query[:5])  # first five components of the query vector
```
Running Results
```shell
➜ python3 test22.py
5
1536
[0.005339288459123527, -0.0004900397315547535, 0.03888638540715689, -0.0029435385310610336, -0.00899561173676785]
```
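In practice, the query vector printed above would be compared against the stored document vectors to find the closest match. A minimal sketch of that ranking step, using made-up 4-dimensional stand-in vectors (real OpenAI embeddings have 1536 dimensions, and these similarity values are illustrative only):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

documents = [
    "Hi there!",
    "Oh, hello!",
    "What's your name?",
    "My friends call me World",
    "Hello World!",
]
# Stand-ins for the embed_documents(...) output (made-up 4-d vectors)
doc_vectors = [
    [0.1, 0.9, 0.0, 0.2],
    [0.2, 0.8, 0.1, 0.1],
    [0.9, 0.1, 0.4, 0.0],
    [0.8, 0.2, 0.5, 0.1],
    [0.3, 0.6, 0.2, 0.3],
]
# Stand-in for the embed_query(...) output
query_vector = [0.8, 0.2, 0.5, 0.1]

# Rank documents by similarity to the query -- the core of semantic search
ranked = sorted(zip(documents, doc_vectors),
                key=lambda dv: cosine_similarity(query_vector, dv[1]),
                reverse=True)
print(ranked[0][0])  # My friends call me World
```

Doing this ranking by brute force works for a handful of vectors; for large collections a dedicated index such as FAISS (below) is needed.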
Storing to FAISS
Basic Concepts
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and dense vector clustering, developed by the Facebook AI Research team. It handles nearest-neighbor search over large-scale vector datasets quickly, making it well suited to the vector-similarity workloads common in machine learning applications.
Core Features
- High Performance: FAISS is highly optimized for large-scale vector search
- Flexible and Scalable: Supports vector scales from millions to billions
- Cross-platform Support: provides a Python interface on top of an efficient C++ core
Typical Application Scenarios
- Recommendation Systems: Finding similar users or items
- Natural Language Processing: Semantic search, document retrieval, question answering systems
- Computer Vision: Image retrieval, face recognition
- Anomaly Detection: Identifying anomalous samples through vector distances
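At its simplest, FAISS's flat index performs exact k-nearest-neighbor search by L2 distance. The NumPy sketch below shows the same semantics by brute force (FAISS computes the identical result, but heavily optimized and with approximate indexes for very large datasets); the data here is random and purely illustrative:

```python
import numpy as np

# 1000 database vectors and 3 query vectors of 64 dimensions each
rng = np.random.default_rng(42)
database = rng.random((1000, 64), dtype=np.float32)
queries = rng.random((3, 64), dtype=np.float32)

def knn_l2(database, queries, k):
    # Squared L2 distance between every query and every database vector
    dists = ((queries[:, None, :] - database[None, :, :]) ** 2).sum(axis=-1)
    # Indices of the k closest database vectors per query
    indices = np.argsort(dists, axis=1)[:, :k]
    return np.take_along_axis(dists, indices, axis=1), indices

distances, indices = knn_l2(database, queries, k=4)
print(indices.shape)  # (3, 4): the 4 nearest database ids per query
```

This is what a flat FAISS index does under the hood; the library's value is doing it orders of magnitude faster and at scales where brute force is infeasible.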
Install FAISS
```shell
pip install --upgrade --quiet langchain-openai faiss-cpu
```
Write Code
```python
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

underlying_embeddings = OpenAIEmbeddings()

# Cache computed embeddings on disk so repeated runs don't re-call the API
store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, store, namespace=underlying_embeddings.model
)
print(list(store.yield_keys()))  # empty on the first run

# Load the source document and split it into ~1000-character chunks
raw_documents = TextLoader("./state_of_the_union.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)

# Embed each chunk (through the cache) and index the vectors in FAISS
db = FAISS.from_documents(documents, cached_embedder)
print(list(store.yield_keys())[:5])  # cache keys created for the chunks
```