1. BLIP-2
Institution: Salesforce Research
Core Architectural Innovations
- Dual-freezing strategy: freeze both the pre-trained visual encoder and the language model; train only a lightweight Querying Transformer (Q-Former) that bridges them
- Parameter-efficient design: the Q-Former is a 12-layer transformer, small relative to the frozen backbones (the BLIP-2 paper reports 188M parameters)
- Two-stage training process: stage 1 aligns Q-Former outputs with text via contrastive and matching objectives; stage 2 is generative pre-training that connects the Q-Former to the frozen LLM
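The dual-freezing pattern can be sketched in a few lines of PyTorch. The modules below are toy stand-ins (the real encoder is a ViT and the real LLM is OPT or Flan-T5), but the freeze-the-ends, train-the-bridge pattern is the same:

```python
import torch.nn as nn

# Toy stand-ins for BLIP-2's three components (dims are illustrative)
vision_encoder = nn.Linear(32, 64)                               # frozen
q_former = nn.TransformerEncoderLayer(64, 4, batch_first=True)   # trainable bridge
llm = nn.Linear(64, 128)                                         # frozen

# Freeze both ends; only the bridge receives gradients
for module in (vision_encoder, llm):
    for p in module.parameters():
        p.requires_grad_(False)

trainable = sum(p.numel() for p in q_former.parameters() if p.requires_grad)
frozen = sum(p.numel() for m in (vision_encoder, llm) for p in m.parameters())
print(f"trainable (bridge): {trainable}, frozen (backbones): {frozen}")
```

An optimizer built from `filter(lambda p: p.requires_grad, model.parameters())` then updates only the bridge, which is what makes training cheap.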
Performance
- Zero-shot VQAv2 benchmark: BLIP-2 (based on Flan-T5 XXL) achieves 82.4% accuracy
- COCO Caption task: CIDEr score reaches 136.7
Model Configuration Options
| Configuration | Language Model | Language-Model Parameters |
|---|---|---|
| Large | Flan-T5 XXL | ~11B |
| Medium | Flan-T5 XL | ~3B |
| Small | OPT-2.7B | ~2.7B |
2. MiniGPT-4
Institution: KAUST (King Abdullah University of Science and Technology)
Architectural Design
- Visual front-end: a pre-trained ViT-g visual encoder (reusing BLIP-2's frozen vision module) extracts image features
- Projection layer: a single linear layer maps the visual features into the language model's embedding space
- Language back-end: a frozen, pre-trained Vicuna-13B large language model
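The projection layer really is just one `nn.Linear`. A minimal sketch, where the dimensions are assumptions (ViT-g features taken as 1408-dim, Vicuna-13B hidden size as 5120; MiniGPT-4 actually projects the Q-Former's output, but the single-layer idea is identical):

```python
import torch
import torch.nn as nn

VIS_DIM, LLM_DIM = 1408, 5120       # assumed feature / hidden sizes
proj = nn.Linear(VIS_DIM, LLM_DIM)  # the only trainable component

image_feats = torch.randn(2, 32, VIS_DIM)  # [batch, visual tokens, dim]
soft_prompt = proj(image_feats)            # tokens the LLM can consume
print(soft_prompt.shape, sum(p.numel() for p in proj.parameters()))
```

The projected visual tokens are simply prepended to the text embeddings, so the frozen LLM treats the image as a soft prompt.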
Two-Stage Training Scheme
- First stage: alignment pre-training on roughly 5 million general image-text description pairs
- Second stage: fine-tuning on a small, curated set of high-quality conversational image descriptions
Parameter Scale
- 13B version: built on Vicuna-13B, but only the linear projection layer is trained, on the order of a few million parameters
3. Flamingo
Institution: DeepMind
Technical Characteristics
- Architecture: a large frozen language model (the 80B Flamingo builds on the 70B Chinchilla) fused with a frozen visual encoder through interleaved gated cross-attention layers
- Core capability: in-context few-shot learning over arbitrarily interleaved image-text sequences
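The fusion mechanism can be sketched as a gated cross-attention block (dimensions illustrative, not Flamingo's real sizes). The key trick is a tanh gate initialized at zero: the layer starts as an identity, so the frozen LLM's behavior is preserved at the beginning of training and the visual pathway is blended in gradually:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Flamingo-style gated cross-attention sketch: text attends to visual
    tokens; the residual is scaled by tanh(gate), with gate initialized to 0."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) == 0

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(text, visual, visual)  # queries from text
        return text + torch.tanh(self.gate) * attended

layer = GatedCrossAttention(64)
text = torch.randn(1, 8, 64)      # 8 text tokens
visual = torch.randn(1, 16, 64)   # 16 visual tokens
out = layer(text, visual)
print(torch.allclose(out, text))  # identity at init: the gate is zero
```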
Limitations
- At 80B parameters the model is costly to train and serve, and training relied on large-scale private data
- The model weights were never publicly released
4. LLaVA
Institution: University of Wisconsin-Madison and Microsoft Research (fully open-sourced, with broad community adoption)
Technical Characteristics
- Typical configuration: LLaVA-13B, roughly 13 billion parameters
- Visual encoder: CLIP ViT-L/14
- Language model: LLaMA-family (Vicuna)
- Training method: visual instruction tuning on image-grounded conversation data generated by (text-only) GPT-4
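Schematically, one record of LLaVA-style instruction data looks like the following. The `conversations` structure mirrors the released `llava_instruct` format; the id, filename, and text here are invented placeholders:

```python
import json

# One schematic LLaVA instruction-tuning record (values are placeholders)
record = {
    "id": "000000001",
    "image": "coco/train2017/000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this scene?"},
        {"from": "gpt", "value": "The scene shows ..."},
    ],
}
print(json.dumps(record, indent=2))
```

The `<image>` token marks where the projected visual tokens are spliced into the text sequence during training.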
Deployment Advantages
- The 13B model runs on a single consumer GPU (e.g., a 24 GB RTX 3090), especially with 8-bit or 4-bit quantization
- The community provides a complete fine-tuning toolchain
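A back-of-envelope check of why a 24 GB card is the relevant threshold: the weights alone of a 13B model nearly fill it at FP16, which is why quantized deployment matters (this counts only weights, ignoring activations and the KV cache):

```python
# Weight memory for an assumed 13e9-parameter model at different precisions
PARAMS = 13e9
for fmt, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    gib = PARAMS * bytes_per_param / 2**30
    print(f"{fmt}: ~{gib:.1f} GiB")   # FP16 lands just over 24 GiB
```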
5. Qwen2.5-VL
Institution: Alibaba
Model Scale Options
| Version | Parameter Scale | Applicable Scenarios |
|---|---|---|
| Base | 3B | Mobile devices and edge computing |
| Standard | 7B | Small and medium businesses; balances cost and quality |
| Flagship | 72B | Complex visual reasoning tasks |
Technical Characteristics
- Cross-modal understanding: processes images, text, and video within a single model
- OCR capability: 92%+ reported accuracy on text recognition in complex scenes
- Long context: context window of up to 32k tokens
6. Summary Comparison
| Model | Core Innovation | Parameter Scale |
|---|---|---|
| BLIP-2 | Lightweight Q-Former bridge | ~3B-11B |
| MiniGPT-4 | Single linear projection layer | 7B-13B |
| Flamingo | Gated cross-attention, few-shot in-context learning | 80B |
| LLaVA | GPT-4-generated visual instruction tuning | 13B |
| Qwen2.5-VL | Industrial-grade multimodality with long context | 3B-72B |
Community Development Trends
- Mainstream choice: 7B-13B models dominate because they balance quality against deployment cost
- Quantization focus: INT8/INT4 quantization has become a major community effort
- Success cases:
- INT8 quantization of BLIP-2 reduces memory use by 37%
- An INT4 build of MiniGPT-4 runs on an RTX 3060
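To make the quantization trend concrete, here is a minimal symmetric per-tensor INT8 scheme in NumPy. This is an illustrative sketch only; the quantizers actually used for these models (e.g., in bitsandbytes or GPTQ) typically work per-channel and handle outliers specially:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization (illustrative sketch)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

print(f"FP32 bytes: {w.nbytes}, INT8 bytes: {q.nbytes}")  # 4x smaller storage
print(f"max abs error: {np.abs(dequantize_int8(q, scale) - w).max():.4f}")
```

The raw storage saving from FP32 is 4x (2x from FP16), at the cost of a rounding error bounded by half the scale; in practice, runtime memory savings are smaller than the storage ratio because activations stay in higher precision, consistent with the ~37% figure reported above.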