Version Matrix

Ensuring version compatibility is critical for successful DeepSeek-OCR deployment. The following version matrix has been tested and validated:

  • Python: 3.12
  • PyTorch: 2.6.0
  • Transformers: 4.46.3
  • FlashAttention: 2.7.3

Deviating from these versions may cause compatibility issues or degraded performance; pin them for production deployments.
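A small startup helper can catch version drift before it surfaces as a runtime failure. This is an illustrative sketch, not part of DeepSeek-OCR itself; the `check_versions` name and the warning format are our own:

```python
import sys
from importlib.metadata import version, PackageNotFoundError

# Pinned versions from the matrix above.
PINNED = {
    "torch": "2.6.0",
    "transformers": "4.46.3",
    "flash-attn": "2.7.3",
}

def check_versions(pinned=PINNED):
    """Return human-readable mismatch warnings (empty list if all match)."""
    warnings = []
    if sys.version_info[:2] != (3, 12):
        warnings.append(f"Python {sys.version.split()[0]} != 3.12")
    for pkg, want in pinned.items():
        try:
            have = version(pkg)
        except PackageNotFoundError:
            warnings.append(f"{pkg} is not installed (want {want})")
            continue
        if have != want:
            warnings.append(f"{pkg} {have} != {want}")
    return warnings

for w in check_versions():
    print("WARNING:", w)
```

Running this once at service startup turns a silent version mismatch into an explicit log line.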

Environment Configuration

System Requirements

  • CUDA 12.1+ for GPU acceleration
  • Minimum 8GB GPU memory for inference
  • 16GB+ system RAM recommended
  • SSD storage for model files
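The requirements above can be checked programmatically before loading the model. The sketch below is illustrative and assumes Linux (`SC_PHYS_PAGES` is not available on all platforms); the GPU check is skipped gracefully if PyTorch is not yet installed:

```python
import os
import shutil

def system_report(model_dir="."):
    """Collect the resources relevant to the requirements above (Linux)."""
    report = {}
    # System RAM (16GB+ recommended).
    page_size = os.sysconf("SC_PAGE_SIZE")
    phys_pages = os.sysconf("SC_PHYS_PAGES")
    report["ram_gb"] = page_size * phys_pages / 1024**3
    # Free disk space where the model files will live.
    report["disk_free_gb"] = shutil.disk_usage(model_dir).free / 1024**3
    # GPU memory (8GB minimum) -- only checkable once PyTorch is installed.
    try:
        import torch
        if torch.cuda.is_available():
            props = torch.cuda.get_device_properties(0)
            report["gpu_gb"] = props.total_memory / 1024**3
        else:
            report["gpu_gb"] = 0.0
    except ImportError:
        report["gpu_gb"] = None  # torch not installed yet
    return report

print(system_report())
```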

Installation Steps

# Create virtual environment
python -m venv deepseek-ocr
source deepseek-ocr/bin/activate

# Install PyTorch with CUDA support
pip install torch==2.6.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install transformers and dependencies
pip install transformers==4.46.3

# Install FlashAttention
pip install flash-attn==2.7.3 --no-build-isolation

# Install DeepSeek-OCR
pip install deepseek-ocr

Model Loading

Basic Inference Example

from deepseek_ocr import DeepSeekOCR

# Initialize OCR engine
ocr = DeepSeekOCR(
    model_path="deepseek/ocr-3b",
    device="cuda",
    precision="bf16"
)

# Process image
result = ocr.process("document.png")

# Output structured results
for item in result:
    print(f"Text: {item['text']}")
    print(f"Confidence: {item['confidence']}")
    print(f"Bounding Box: {item['bbox']}")
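The structured results above are plain dictionaries, so they serialize directly to JSON for downstream processing. A minimal sketch, using sample records shaped like the `text`/`confidence`/`bbox` items in the example (real values would come from `ocr.process(...)`):

```python
import json

# Sample records shaped like the inference example's items.
result = [
    {"text": "Invoice #1042", "confidence": 0.98, "bbox": [12, 20, 310, 58]},
    {"text": "Total: $86.00", "confidence": 0.95, "bbox": [12, 420, 240, 452]},
]

def save_results(items, path):
    """Write OCR items to a JSON file for downstream processing."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)

save_results(result, "document.json")
```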

Supported Data Formats

DeepSeek-OCR accepts the following input formats and processing modes:

  • Image Formats: PNG, JPG, JPEG, BMP, TIFF, WebP
  • Document Formats: PDF (single/multi-page), DjVu
  • Input Modes: Single image, batch processing, directory scan

Training and Fine-tuning

For custom OCR scenarios, fine-tuning the base model may improve accuracy:

  • Base Model: DeepSeek-OCR-3B
  • Training Framework: PyTorch Lightning
  • Recommended GPU: A100 (40GB) or H100
  • Fine-tuning Approaches:
    • LoRA adapter training (recommended for limited resources)
    • Full parameter fine-tuning (requires significant GPU memory)
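A back-of-envelope calculation shows why LoRA suits limited resources: each adapted weight matrix gains only two low-rank factors instead of being trained in full. The layer count, hidden size, and number of adapted matrices per layer below are illustrative assumptions, not DeepSeek-OCR-3B's actual configuration:

```python
def lora_trainable_params(d_model, n_layers, rank, adapted_per_layer=4):
    """Trainable parameters for LoRA: each adapted (d x d) weight gets two
    low-rank factors A (d x r) and B (r x d), i.e. 2 * d * r parameters."""
    return n_layers * adapted_per_layer * 2 * d_model * rank

# Illustrative numbers for a ~3B decoder.
total = 3_000_000_000
trainable = lora_trainable_params(d_model=2560, n_layers=32, rank=16)
print(f"LoRA trainable: {trainable:,} ({100 * trainable / total:.2f}% of ~3B)")
```

Training well under 1% of the parameters is what makes adapter training feasible on a single 40GB A100, where full fine-tuning of all 3B parameters (weights, gradients, and optimizer state) would not fit.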

Model Specifications

  • Parameters: Approximately 3 billion
  • Model Format: safetensors
  • Model Size: Approximately 6.6GB (BF16 precision)
  • Precision Options: BF16, FP16, INT8, INT4
  Precision   VRAM Required   Quality
  BF16        ~7GB            Best
  FP16        ~7GB            Best
  INT8        ~4GB            Good
  INT4        ~2GB            Acceptable
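The VRAM figures above follow directly from bytes-per-parameter arithmetic (actual checkpoints and VRAM usage run somewhat higher because of embeddings, activations, and runtime overhead):

```python
# Approximate weight size for a ~3B-parameter model at each precision.
PARAMS = 3.0e9
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "int8": 1, "int4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    gb = PARAMS * nbytes / 1024**3
    print(f"{precision}: ~{gb:.1f} GB of weights")
```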

Deployment Options

Local Inference Service

Deploy as a local API service for production environments:

from deepseek_ocr import DeepSeekOCRServer

server = DeepSeekOCRServer(
    model_path="deepseek/ocr-3b",
    host="0.0.0.0",
    port=8000
)

server.start()

HuggingFace Spaces

Quick deployment option using HuggingFace infrastructure:

  1. Visit HuggingFace Spaces
  2. Select DeepSeek-OCR template
  3. Configure hardware (CPU/GPU)
  4. Deploy with one click

vLLM Integration

For high-throughput production scenarios:

from vllm import LLM, SamplingParams

# Load model with vLLM
llm = LLM(model="deepseek/ocr-3b")

# Deterministic decoding suits OCR transcription
sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)

# Process OCR tasks (prompts prepared per vLLM's multimodal input format)
prompts = ["<image>Extract all text from this document."]
outputs = llm.generate(prompts, sampling_params)

Error Quick Reference

  Error                        Cause                   Solution
  CUDA out of memory           Insufficient GPU VRAM   Use INT8/INT4 quantization
  FlashAttention build failed  Missing CUDA toolkit    Install CUDA 12.1+
  Model not found              Incorrect path          Verify model_path parameter
  Low accuracy                 Domain mismatch         Fine-tune with domain data

Performance Optimization

Batch Processing

For high-volume OCR workloads, batch processing significantly improves throughput:

results = ocr.process_batch(
    image_paths=["doc1.png", "doc2.png", "doc3.png"],
    batch_size=8
)
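When the corpus is larger than one call should handle, the path list can be split into fixed-size chunks and fed to `process_batch` one chunk at a time. The `batched` helper below is illustrative, not a DeepSeek-OCR API:

```python
def batched(paths, batch_size):
    """Split a list of image paths into fixed-size batches."""
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]

# Usage sketch: process a large corpus chunk by chunk.
# for chunk in batched(all_image_paths, 8):
#     results.extend(ocr.process_batch(image_paths=chunk, batch_size=8))
```

Chunking keeps peak memory bounded regardless of corpus size, since only one batch of images is resident at a time.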

Quantization

Reduce resource requirements with minimal accuracy loss:

ocr = DeepSeekOCR(
    model_path="deepseek/ocr-3b",
    precision="int8"  # or "int4"
)

Production Considerations

  1. Monitoring: Implement Prometheus metrics for inference latency
  2. Caching: Enable result caching for repeated documents
  3. Load Balancing: Use multiple GPU instances for horizontal scaling
  4. Health Checks: Implement regular model health verification