1. BLIP-2

Institution: Salesforce Research

Core Architectural Innovations

  1. Dual freezing strategy: Freeze both the pre-trained visual encoder and the language model; train only a lightweight Querying Transformer (Q-Former) between them (see the sketch below)
  2. Parameter-efficient design: The Q-Former is a 12-layer transformer with roughly 188 million parameters, a small fraction of the full model
  3. Two-stage training process: The first stage aligns vision and language representations against the frozen image encoder (contrastive, matching, and captioning objectives); the second stage trains the Q-Former's output to serve as a soft prompt for the frozen LLM on generative tasks
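
Below is a minimal PyTorch sketch of the dual-freezing idea. The modules are toy stand-ins (dimensions, layer counts, and class names are assumptions for illustration), not the real BLIP-2 components: both backbones are frozen, and only the learnable queries, cross-attention, and projection — the Q-Former's role — receive gradients.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the pre-trained backbones (not the real BLIP-2 weights).
vision_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
language_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)

class QFormerSketch(nn.Module):
    """Learnable query tokens that attend to frozen image features."""
    def __init__(self, dim=768, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)
        self.proj = nn.Linear(dim, dim)  # maps queries into the LM's input space

    def forward(self, image_feats):
        q = self.queries.expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, image_feats, image_feats)
        return self.proj(out)

# Dual freezing: no gradients flow into either backbone.
for module in (vision_encoder, language_model):
    for p in module.parameters():
        p.requires_grad = False

qformer = QFormerSketch()
optimizer = torch.optim.AdamW(qformer.parameters(), lr=1e-4)  # Q-Former params only

image_feats = vision_encoder(torch.randn(2, 197, 768))  # frozen ViT-like features
soft_prompt = qformer(image_feats)                      # 32 query embeddings
lm_out = language_model(soft_prompt)                    # consumed by the frozen LM
```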

Performance

  • Zero-shot VQAv2 benchmark: BLIP-2 (based on Flan-T5 XXL) achieves 65.0% accuracy, outperforming the far larger Flamingo-80B by 8.7 points
  • COCO Caption task: CIDEr score reaches 136.7

Model Configuration Options

Configuration | Language Model | Parameters
Large | Flan-T5 XXL | ~11B
Medium | Flan-T5 XL | ~3B
Small | OPT-2.7B | ~2.7B
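
These configurations are published as Hugging Face checkpoints, so a minimal inference example with the transformers library looks like the following (checkpoint name, image URL, and prompt are chosen for illustration):

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load one BLIP-2 configuration; swap in "Salesforce/blip2-flan-t5-xxl"
# or "Salesforce/blip2-opt-2.7b" for the other rows of the table above.
name = "Salesforce/blip2-flan-t5-xl"
processor = Blip2Processor.from_pretrained(name)
model = Blip2ForConditionalGeneration.from_pretrained(name)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # sample image
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image,
                   text="Question: how many cats are there? Answer:",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```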

2. MiniGPT-4

Institution: KAUST (King Abdullah University of Science and Technology)

Architectural Design

  • Visual encoding front-end: Reuses BLIP-2's pre-trained visual pipeline (EVA-CLIP ViT-g/14 plus Q-Former) to extract image features
  • Projection layer: Maps the visual features into the language model's embedding space through a single linear projection layer
  • Language model back-end: Connects to the frozen, pre-trained Vicuna-13B large language model

Two-Stage Training Scheme

  • First stage: Align the vision and language modalities on approximately 5 million general image-text pairs
  • Second stage: Fine-tune on a small, curated set of high-quality conversational image descriptions

Parameter Scale

  • 13B version: Built on Vicuna-13B; only the linear projection layer is trained, amounting to just a few million parameters (see the check below)
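
A back-of-the-envelope check of that figure, assuming a 768-dimensional Q-Former output and the 5120-dimensional hidden size of a 13B LLaMA-family model (a sketch, not the released code):

```python
import torch.nn as nn

# MiniGPT-4 trains essentially one layer: a linear map from the frozen
# Q-Former's output space into the LLM's embedding space.
proj = nn.Linear(768, 5120)  # 5120 = assumed Vicuna-13B hidden size

n_params = sum(p.numel() for p in proj.parameters())
print(n_params)  # 768 * 5120 + 5120 = 3,937,280 -> "a few million" parameters
```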

3. Flamingo

Institution: DeepMind

Technical Characteristics

  • Architecture: A large frozen language model (the biggest variant totals 80B parameters) fused with a vision encoder through interleaved gated cross-attention layers (see the sketch below)
  • Core capability: Few-shot learning over interleaved image-text sequences
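
A minimal sketch of the gated cross-attention idea follows (dimensions and names are assumptions, not DeepMind's code): text tokens attend to visual tokens, and a tanh gate initialized at zero means the frozen LM's behavior is unchanged at the start of training.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Flamingo-style layer: text attends to vision; a zero-initialized
    tanh gate leaves the pre-trained LM untouched at initialization."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0

    def forward(self, text_tokens, visual_tokens):
        attended, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        return text_tokens + torch.tanh(self.gate) * attended

layer = GatedCrossAttention()
text = torch.randn(2, 16, 512)    # token embeddings from the frozen LM
vision = torch.randn(2, 64, 512)  # e.g. resampled visual tokens
print(layer(text, vision).shape)  # torch.Size([2, 16, 512])
```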

Limitations

  • The 80B parameter count makes the model expensive to train and serve, and training relied on massive private data
  • The model weights were never fully open-sourced

4. LLaVA

Institution: University of Wisconsin–Madison, Microsoft Research, and Columbia University (developed fully in the open)

Technical Characteristics

  • Typical configuration: LLaVA-13B, approximately 13 billion parameters
  • Visual part: CLIP ViT-L/14
  • Language part: Vicuna, based on the LLaMA architecture
  • Training method: Instruction fine-tuning on GPT-4-generated image-text conversations (see the sample record below)
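
A sample record in the LLaVA-style instruction-tuning format (field names follow the released llava_instruct JSON files; the values here are invented):

```python
# One training record: an image path plus a multi-turn conversation whose
# "gpt" turns were generated by GPT-4 from image annotations.
example = {
    "id": "000000000001",
    "image": "coco/train2017/000000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this photo?"},
        {"from": "gpt", "value": "A man is ironing clothes on a board attached "
                                 "to the roof of a moving taxi."},
    ],
}
```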

Deployment Advantages

  • The 13B scale can run on consumer GPUs (e.g., an RTX 3090, using quantization to fit within 24 GB of VRAM)
  • Community provides complete fine-tuning toolchain

5. Qwen2.5-VL

Institution: Alibaba

Model Scale Options

Version | Parameter Scale | Applicable Scenarios
Base | 3B | Mobile devices and edge computing
Standard | 7B | Small and mid-sized businesses; balances cost and quality
Flagship | 72B | Complex visual reasoning tasks

Technical Characteristics

  1. Cross-modal understanding: Processes images, text, and video together in a single input (see the usage example below)
  2. OCR capability: Over 92% accuracy on text recognition in complex scenes
  3. Long-context processing: Supports a context window of up to 32k tokens
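
The models are released on Hugging Face; the snippet below follows the pattern of the official Qwen2.5-VL model card (it needs a recent transformers release plus the qwen_vl_utils package, and the image path is a placeholder):

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

name = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    name, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(name)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "file:///path/to/receipt.jpg"},  # placeholder
    {"type": "text", "text": "Read all the text in this image."},
]}]

# Build the chat prompt, gather image/video inputs, and generate.
text = processor.apply_chat_template(messages, tokenize=False,
                                     add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])
```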

6. Summary Comparison

Model | Core Innovation | Parameter Scale
BLIP-2 | Lightweight Q-Former design | ~3B-11B
MiniGPT-4 | Single linear projection layer | 7B-13B
Flamingo | Gated cross-attention, few-shot learning | 80B+
LLaVA | Instruction-tuned visual conversation | 13B
Qwen2.5-VL | Industrial-grade long context | 3B-72B

Deployment Trends

  1. Mainstream choice: 7B-13B models have become the mainstream because they balance quality against deployment cost
  2. Quantization focus: INT8/INT4 quantization is a major community focus (see the loading sketch below)
  3. Reported results:
    • BLIP-2 with INT8 quantization reportedly reduces memory use by 37%
    • An INT4 build of MiniGPT-4 can run on an RTX 3060
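
As an illustration of the quantized-deployment path, here is a hedged sketch of loading BLIP-2 in 8-bit via bitsandbytes through transformers; actual memory savings vary by model and hardware, and the figures above are this document's reported numbers, not reproduced here.

```python
from transformers import Blip2ForConditionalGeneration, BitsAndBytesConfig

# Load weights in INT8 (or set load_in_4bit=True for INT4); requires the
# bitsandbytes package and a CUDA GPU.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    quantization_config=quant_config,
    device_map="auto",
)
print(model.get_memory_footprint() / 1e9, "GB")  # rough footprint check
```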