Common Datasets and Evaluation Metrics
1. Flickr30k
- Scale: 31,783 images, 5 descriptions each
- Tasks: Image captioning, cross-modal retrieval
- Metrics: BLEU, METEOR, ROUGE-L, CIDEr for captioning; Recall@K for retrieval (see the sketch below)
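For cross-modal retrieval, Recall@K is the fraction of queries whose ground-truth match appears among the top-K retrieved items. A minimal sketch for image-to-text retrieval, assuming a precomputed similarity matrix with ground-truth pairs on the diagonal (function name and toy scores are illustrative):

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Recall@K: similarity is (num_images, num_texts), with the
    ground-truth caption of image i assumed to sit at column i."""
    # Rank caption indices for each image by descending similarity.
    ranked = np.argsort(-similarity, axis=1)
    # Hit if the ground-truth column index appears in the top-K.
    hits = (ranked[:, :k] == np.arange(len(similarity))[:, None]).any(axis=1)
    return float(hits.mean())

# Toy 3x3 example; only image 0 ranks its own caption first.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.7, 0.6]])
print(recall_at_k(sim, k=1))  # 0.333... (only image 0 hits)
print(recall_at_k(sim, k=2))  # 0.666... (images 0 and 2 hit)
```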
2. MS COCO
- Scale: 120,000+ captioned images, 5 descriptions each (600,000+ captions)
- Tasks: Image captioning, visual question answering, image-text retrieval
- Metrics: BLEU-1 to BLEU-4, METEOR, ROUGE-L, CIDEr (a BLEU-4 example follows below)
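As an illustration of the n-gram metrics, BLEU-4 can be computed with NLTK's sentence_bleu; a minimal sketch with made-up captions (the official COCO evaluation instead uses the pycocoevalcap toolkit, with corpus-level statistics and extra preprocessing):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Five reference captions (tokenized), as in COCO-style annotation.
references = [
    "a dog runs across the grassy field".split(),
    "a brown dog is running on grass".split(),
    "the dog sprints through a green field".split(),
    "a dog running in an open field".split(),
    "brown dog running across the lawn".split(),
]
candidate = "a brown dog runs across the field".split()

# BLEU-4: geometric mean of 1- to 4-gram precisions with a brevity
# penalty; smoothing avoids a zero score when some 4-grams never match.
bleu4 = sentence_bleu(
    references, candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {bleu4:.3f}")
```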
3. VQA (Visual Question Answering)
- Scale: 200,000 images, 250,000 Q&A pairs
- Scoring mechanism: Acc(ans) = min(#annotators who gave ans / 3, 1); an answer counts as fully correct if at least 3 of the 10 human annotators provided it (see the sketch below)
- Question types: yes/no, counting, open-ended
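A minimal sketch of that consensus scoring, assuming 10 human answers per question as in the VQA annotation protocol (the official evaluator additionally normalizes answers, e.g. lowercasing and punctuation stripping, and averages over annotator subsets; that is omitted here):

```python
from collections import Counter

def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """VQA accuracy: min(#annotators agreeing with the prediction / 3, 1)."""
    matches = Counter(human_answers)[predicted]
    return min(matches / 3.0, 1.0)

# 10 human answers for one question.
answers = ["2", "2", "two", "2", "3", "2", "2", "two", "2", "2"]
print(vqa_accuracy("2", answers))    # 7 matches -> capped at 1.0
print(vqa_accuracy("two", answers))  # 2 matches -> 0.667
```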
4. ActivityNet
- Scale: 20,000 videos, 100,000 descriptions
- Tasks: Video captioning, temporal localization, Q&A
- Metrics: BLEU, METEOR, CIDEr for captioning; temporal IoU at 0.5 for localization (see the sketch below)
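Temporal localization is scored by the overlap between predicted and ground-truth time segments; a minimal temporal-IoU sketch (interval endpoints in seconds, names illustrative):

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU between two time intervals (start, end), in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a hit at tIoU >= 0.5:
print(temporal_iou((12.0, 30.0), (10.0, 28.0)) >= 0.5)  # True (IoU = 0.8)
```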
5. Other Datasets
| Dataset | Usage | Characteristics |
|---|---|---|
| MSR-VTT | Video captioning | 10,000 video clips |
| GQA | Visual reasoning | 22 million Q&A pairs |
| OK-VQA | Common sense Q&A | Requires external knowledge |
| Hateful Memes | Hate detection | Image-text implicit semantics |
Quantization Evaluation Framework
Capability evaluation along three dimensions:
- Visual perception: object detection mAP, image classification Top-1/Top-5 accuracy (see the sketch after this list)
- Language generation: BLEU-4, METEOR, ROUGE, CIDEr
- Cross-modal reasoning: VQA accuracy, reasoning task accuracy
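For the visual-perception dimension, Top-K classification accuracy falls out directly from model logits; a minimal sketch (shapes and scores are illustrative):

```python
import numpy as np

def top_k_accuracy(logits: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of samples whose true label is among the k highest logits."""
    # Indices of the top-k classes per sample, highest score first.
    top_k = np.argsort(-logits, axis=1)[:, :k]
    return float((top_k == labels[:, None]).any(axis=1).mean())

logits = np.array([[2.0, 0.5, 1.0],   # predicts class 0
                   [0.1, 0.3, 0.2],   # predicts class 1
                   [0.4, 0.9, 0.7]])  # predicts class 1, true class is 2
labels = np.array([0, 1, 2])
print(top_k_accuracy(logits, labels, k=1))  # 0.667
print(top_k_accuracy(logits, labels, k=2))  # 1.0
```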
Quantization trade-off suggestions (an acceptance-check sketch follows this list):
- Mobile: up to roughly 5% accuracy loss is acceptable in exchange for about 3x inference speedup
- Critical scenarios (e.g., medical): keep accuracy loss within 1%
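These suggestions can be turned into an explicit acceptance check when selecting a quantized model; a minimal sketch, with field names and thresholds that are illustrative rather than from any specific toolkit:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    accuracy: float    # task accuracy in [0, 1], e.g. VQA accuracy
    latency_ms: float  # average inference latency

def accept_quantized(fp: EvalResult, quant: EvalResult,
                     max_acc_drop: float, min_speedup: float) -> bool:
    """Accept the quantized model only if both the accuracy drop and
    the speedup meet the deployment target."""
    acc_drop = fp.accuracy - quant.accuracy
    speedup = fp.latency_ms / quant.latency_ms
    return acc_drop <= max_acc_drop and speedup >= min_speedup

fp32 = EvalResult(accuracy=0.70, latency_ms=120.0)
int8 = EvalResult(accuracy=0.67, latency_ms=38.0)

# Mobile target: <= 5% absolute accuracy loss for >= 3x speedup.
print(accept_quantized(fp32, int8, max_acc_drop=0.05, min_speedup=3.0))  # True
# Medical target: <= 1% accuracy loss.
print(accept_quantized(fp32, int8, max_acc_drop=0.01, min_speedup=1.0))  # False
```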