Common Datasets and Evaluation Metrics

1. Flickr30k

  • Scale: 31,783 images, 5 descriptions each
  • Tasks: Image captioning, cross-modal retrieval
  • Metrics: BLEU, METEOR, ROUGE-L, CIDEr, Recall@K (retrieval recall sketched below)
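
Recall@K for cross-modal retrieval counts a query as a hit when its ground-truth counterpart appears among the top-K retrieved items. A minimal NumPy sketch, assuming a square similarity matrix whose diagonal holds the ground-truth pairs:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int = 5) -> float:
    """Fraction of queries whose ground-truth item (assumed at sim[i, i])
    ranks among the k most similar gallery items."""
    order = np.argsort(-sim, axis=1)          # gallery indices, best first
    gt = np.arange(sim.shape[0])[:, None]     # ground-truth index per query
    return float((order[:, :k] == gt).any(axis=1).mean())
```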

2. MS COCO

  • Scale: 330,000+ images overall; the captioning split covers 123,000+ images with 5 descriptions each
  • Tasks: Image captioning, visual question answering, image-text retrieval
  • Metrics: BLEU-1 through BLEU-4, METEOR, ROUGE-L, CIDEr (BLEU sketched below)
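
BLEU-n scores a candidate caption by clipped n-gram precision against the references, with a brevity penalty for short outputs. A minimal, unsmoothed sentence-level sketch (real evaluations typically use a smoothed corpus-level implementation from a toolkit such as pycocoevalcap):

```python
import math
from collections import Counter

def bleu(candidate: list[str], references: list[list[str]], max_n: int = 4) -> float:
    """Sentence-level BLEU with uniform weights over 1..max_n-grams."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_p = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # clip each n-gram count by its maximum count across the references
        ceiling = Counter()
        for ref in references:
            ceiling |= ngrams(ref, n)         # Counter union keeps max counts
        clipped = sum(min(c, ceiling[g]) for g, c in cand.items())
        if clipped == 0:
            return 0.0                        # unsmoothed: an empty level zeroes BLEU
        log_p += math.log(clipped / sum(cand.values())) / max_n

    # brevity penalty against the reference length closest to the candidate's
    ref_len = min((len(r) for r in references), key=lambda L: abs(L - len(candidate)))
    bp = 1.0 if len(candidate) >= ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(log_p)
```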

3. VQA (Visual Question Answering)

  • Scale: 200,000 images, 250,000 Q&A pairs
  • Scoring mechanism: accuracy = min(n/3, 1), where n is the number of the 10 human annotators who gave the predicted answer (see the sketch below)
  • Question types: yes/no, counting, open-ended
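
A minimal sketch of that per-answer rule; the official evaluation script additionally normalizes answers and averages the score over all 10-choose-9 subsets of annotators:

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """VQA soft accuracy: fully correct once >=3 of the (typically 10)
    annotators gave the same answer, partial credit below that."""
    matches = sum(ans == predicted for ans in human_answers)
    return min(matches / 3.0, 1.0)

# 2 of 10 annotators said "red": partial credit of 2/3
assert abs(vqa_accuracy("red", ["red"] * 2 + ["blue"] * 8) - 2 / 3) < 1e-9
```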

4. ActivityNet

  • Scale: 20,000 videos, 100,000 descriptions
  • Tasks: Video captioning, temporal localization, Q&A
  • Metrics: BLEU, METEOR, CIDEr; IoU@0.5 for temporal localization (temporal IoU sketched below)
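
IoU@0.5 marks a predicted segment as correct when its overlap with the ground-truth interval, measured as intersection over union on the time axis, is at least 0.5. A minimal sketch:

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# overlap 13 s, union 20 s -> 0.65, i.e. correct under IoU@0.5
assert abs(temporal_iou((10.0, 25.0), (12.0, 30.0)) - 0.65) < 1e-9
```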

5. Other Datasets

| Dataset | Usage | Characteristics |
| --- | --- | --- |
| MSR-VTT | Video captioning | 10,000 video clips |
| GQA | Visual reasoning | 11 million Q&A pairs |
| OK-VQA | Commonsense Q&A | Requires external knowledge |
| Hateful Memes | Hateful content detection | Implicit image-text semantics |

Quantization Evaluation Framework

Three-dimensional capability evaluation:

  1. Visual perception: Object detection mAP, image classification Top-1/Top-5 accuracy (Top-k sketched after this list)
  2. Language generation: BLEU-4, METEOR, ROUGE, CIDEr
  3. Cross-modal reasoning: VQA accuracy, reasoning task accuracy
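
For the visual-perception axis, Top-1/Top-5 accuracy can be computed directly from the classifier logits. A minimal NumPy sketch:

```python
import numpy as np

def topk_accuracy(logits: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Fraction of samples whose true label is among the k highest logits."""
    topk = np.argsort(-logits, axis=1)[:, :k]   # (n_samples, k) class indices
    return float((topk == labels[:, None]).any(axis=1).mean())
```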

Quantization trade-off suggestions:

  • Mobile: Accept up to ~5% relative accuracy loss in exchange for a ~3x inference speedup
  • Critical scenarios (e.g., medical): Keep accuracy loss within 1% (an acceptance-gate sketch follows)
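
These budgets translate into a simple acceptance gate. A hypothetical helper (the function name and default thresholds are illustrative, not from any library) that encodes both rules:

```python
def quantization_acceptable(fp32_acc: float, quant_acc: float, speedup: float,
                            max_rel_drop: float = 0.05,
                            min_speedup: float = 3.0) -> bool:
    """Accept a quantized model only if the relative accuracy drop stays
    within budget and the measured speedup meets the target.
    Defaults mirror the mobile profile above; for critical scenarios
    such as medical, tighten max_rel_drop to 0.01."""
    rel_drop = (fp32_acc - quant_acc) / fp32_acc
    return rel_drop <= max_rel_drop and speedup >= min_speedup

# mobile profile: ~4.2% relative drop, 3.4x speedup -> acceptable
print(quantization_acceptable(fp32_acc=0.72, quant_acc=0.69, speedup=3.4))  # True
```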