1. BLIP-2
Institution: Salesforce Research
Core Architectural Innovations
- Dual-freezing strategy: freeze both the pre-trained visual encoder and the language model; train only a lightweight Querying Transformer (Q-Former) that bridges them
- Parameter-efficient design: the Q-Former is a 12-layer transformer, small relative to the frozen backbones (the BLIP-2 paper reports 188M parameters)
- Two-stage training process: stage 1 aligns Q-Former outputs with text via contrastive and matching objectives; stage 2 is generative pre-training that connects the Q-Former to the frozen LLM
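The dual-freezing pattern can be sketched in a few lines of PyTorch. The modules below are toy stand-ins (the real encoder is a ViT and the real LLM is OPT or Flan-T5), but the freeze-the-ends, train-the-bridge pattern is the same:

```python
import torch.nn as nn

# Toy stand-ins for BLIP-2's three components (dims are illustrative)
vision_encoder = nn.Linear(32, 64)                               # frozen
q_former = nn.TransformerEncoderLayer(64, 4, batch_first=True)   # trainable bridge
llm = nn.Linear(64, 128)                                         # frozen

# Freeze both ends; only the bridge receives gradients
for module in (vision_encoder, llm):
    for p in module.parameters():
        p.requires_grad_(False)

trainable = sum(p.numel() for p in q_former.parameters() if p.requires_grad)
frozen = sum(p.numel() for m in (vision_encoder, llm) for p in m.parameters())
print(f"trainable (bridge): {trainable}, frozen (backbones): {frozen}")
```

An optimizer built from `filter(lambda p: p.requires_grad, model.parameters())` then updates only the bridge, which is what makes training cheap.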
Performance
- Zero-shot VQAv2 benchmark: BLIP-2 (based on Flan-T5 XXL) achieves 82.4% accuracy
- COCO Caption task: CIDEr score reaches 136.7
Model Configuration Options
| Configuration | Language Model | Language-Model Parameters |
|---|---|---|
| Large | Flan-T5 XXL | ~11B |
| Medium | Flan-T5 XL | ~3B |
| Small | OPT-2.7B | ~2.7B |
2. MiniGPT-4
Institution: KAUST (King Abdullah University of Science and Technology)
Architectural Design
- Visual front-end: a pre-trained ViT-g visual encoder (reusing BLIP-2's frozen vision module) extracts image features
- Projection layer: a single linear layer maps the visual features into the language model's embedding space
- Language back-end: a frozen, pre-trained Vicuna-13B large language model
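The projection layer really is just one `nn.Linear`. A minimal sketch, where the dimensions are assumptions (ViT-g features taken as 1408-dim, Vicuna-13B hidden size as 5120; MiniGPT-4 actually projects the Q-Former's output, but the single-layer idea is identical):

```python
import torch
import torch.nn as nn

VIS_DIM, LLM_DIM = 1408, 5120       # assumed feature / hidden sizes
proj = nn.Linear(VIS_DIM, LLM_DIM)  # the only trainable component

image_feats = torch.randn(2, 32, VIS_DIM)  # [batch, visual tokens, dim]
soft_prompt = proj(image_feats)            # tokens the LLM can consume
print(soft_prompt.shape, sum(p.numel() for p in proj.parameters()))
```

The projected visual tokens are simply prepended to the text embeddings, so the frozen LLM treats the image as a soft prompt.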
Two-Stage Training Scheme
- First stage: alignment pre-training on roughly 5 million general image-text description pairs
- Second stage: fine-tuning on a small, curated set of high-quality conversational image descriptions
Parameter Scale
- 13B version: built on Vicuna-13B, but only the linear projection layer is trained, on the order of a few million parameters
3. Flamingo
Institution: DeepMind
Technical Characteristics
- Architecture: a large frozen language model (the 80B Flamingo builds on the 70B Chinchilla) fused with a frozen visual encoder through interleaved gated cross-attention layers
- Core capability: in-context few-shot learning over arbitrarily interleaved image-text sequences
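The fusion mechanism can be sketched as a gated cross-attention block (dimensions illustrative, not Flamingo's real sizes). The key trick is a tanh gate initialized at zero: the layer starts as an identity, so the frozen LLM's behavior is preserved at the beginning of training and the visual pathway is blended in gradually:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Flamingo-style gated cross-attention sketch: text attends to visual
    tokens; the residual is scaled by tanh(gate), with gate initialized to 0."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) == 0

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(text, visual, visual)  # queries from text
        return text + torch.tanh(self.gate) * attended

layer = GatedCrossAttention(64)
text = torch.randn(1, 8, 64)      # 8 text tokens
visual = torch.randn(1, 16, 64)   # 16 visual tokens
out = layer(text, visual)
print(torch.allclose(out, text))  # identity at init: the gate is zero
```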
Limitations
- At 80B parameters the model is costly to train and serve, and training relied on large-scale private data
- The model weights were never publicly released
4. LLaVA
Institution: University of Wisconsin-Madison and Microsoft Research (fully open-sourced, with broad community adoption)
Technical Characteristics
- Typical configuration: LLaVA-13B, roughly 13 billion parameters
- Visual encoder: CLIP ViT-L/14
- Language model: LLaMA-family (Vicuna)
- Training method: visual instruction tuning on image-grounded conversation data generated by (text-only) GPT-4
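Schematically, one record of LLaVA-style instruction data looks like the following. The `conversations` structure mirrors the released `llava_instruct` format; the id, filename, and text here are invented placeholders:

```python
import json

# One schematic LLaVA instruction-tuning record (values are placeholders)
record = {
    "id": "000000001",
    "image": "coco/train2017/000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this scene?"},
        {"from": "gpt", "value": "The scene shows ..."},
    ],
}
print(json.dumps(record, indent=2))
```

The `<image>` token marks where the projected visual tokens are spliced into the text sequence during training.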
Deployment Advantages
- The 13B model runs on a single consumer GPU (e.g., a 24 GB RTX 3090), especially with 8-bit or 4-bit quantization
- The community provides a complete fine-tuning toolchain
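A back-of-envelope check of why a 24 GB card is the relevant threshold: the weights alone of a 13B model nearly fill it at FP16, which is why quantized deployment matters (this counts only weights, ignoring activations and the KV cache):

```python
# Weight memory for an assumed 13e9-parameter model at different precisions
PARAMS = 13e9
for fmt, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    gib = PARAMS * bytes_per_param / 2**30
    print(f"{fmt}: ~{gib:.1f} GiB")   # FP16 lands just over 24 GiB
```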
5. Qwen2.5-VL
Institution: Alibaba
Model Scale Options
| Version | Parameter Scale | Applicable Scenarios |
|---|---|---|
| Base | 3B | Mobile devices and edge computing |
| Standard | 7B | Small and medium businesses; balances cost and quality |
| Flagship | 72B | Complex visual reasoning tasks |
Technical Characteristics
- Cross-modal understanding: processes images, text, and video within a single model
- OCR capability: 92%+ reported accuracy on text recognition in complex scenes
- Long context: context window of up to 32k tokens
6. Summary Comparison
| Model | Core Innovation | Parameter Scale |
|---|---|---|
| BLIP-2 | Lightweight Q-Former bridge | ~3B-11B |
| MiniGPT-4 | Single linear projection layer | 7B-13B |
| Flamingo | Gated cross-attention, few-shot in-context learning | 80B |
| LLaVA | GPT-4-generated visual instruction tuning | 13B |
| Qwen2.5-VL | Industrial-grade multimodality with long context | 3B-72B |
Community Development Trends
- Mainstream choice: 7B-13B models dominate because they balance quality against deployment cost
- Quantization focus: INT8/INT4 quantization has become a major community effort
- Success cases:
- INT8 quantization of BLIP-2 reduces memory use by 37%
- An INT4 build of MiniGPT-4 runs on an RTX 3060
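To make the quantization trend concrete, here is a minimal symmetric per-tensor INT8 scheme in NumPy. This is an illustrative sketch only; the quantizers actually used for these models (e.g., in bitsandbytes or GPTQ) typically work per-channel and handle outliers specially:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization (illustrative sketch)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

print(f"FP32 bytes: {w.nbytes}, INT8 bytes: {q.nbytes}")  # 4x smaller storage
print(f"max abs error: {np.abs(dequantize_int8(q, scale) - w).max():.4f}")
```

The raw storage saving from FP32 is 4x (2x from FP16), at the cost of a rounding error bounded by half the scale; in practice, runtime memory savings are smaller than the storage ratio because activations stay in higher precision, consistent with the ~37% figure reported above.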