1. Post-Training Quantization (PTQ)
Working principle: after training completes, a small calibration set (typically 100-500 samples) is run through the model to estimate activation distributions; from these statistics the quantization parameters (scale factor, zero point, clipping range) are derived, and FP32 weights are converted to INT8/INT4.
Advantages: efficient and fast (completed in hours), no training required, plug-and-play
Limitations: quantization below 4-bit incurs 5-15% accuracy degradation
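The calibration step can be sketched in plain NumPy: min/max statistics from a small calibration batch yield a scale and zero point, which then map FP32 values into unsigned INT8. All function names here are illustrative (not from any particular library), and the random data stands in for real held-out calibration samples.

```python
import numpy as np

def calibrate(calib_activations, num_bits=8):
    """Derive asymmetric quantization parameters from calibration data."""
    qmin, qmax = 0, 2 ** num_bits - 1
    a_min = float(calib_activations.min())
    a_max = float(calib_activations.max())
    scale = (a_max - a_min) / (qmax - qmin)
    zero_point = int(round(qmin - a_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, num_bits=8):
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 2 ** num_bits - 1).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# 200 stand-in "calibration samples" (real PTQ would use held-out data)
rng = np.random.default_rng(0)
calib = rng.normal(0.0, 1.0, size=(200, 64)).astype(np.float32)

scale, zp = calibrate(calib)
round_trip = dequantize(quantize(calib, scale, zp), scale, zp)
max_err = float(np.abs(round_trip - calib).max())  # bounded by ~scale
```

Because the quantization grid is derived purely from observed statistics, the round-trip error per value is bounded by roughly one scale step, which is why PTQ is cheap but degrades at very low bit widths.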
2. Quantization-Aware Training (QAT)
Core mechanisms:
- Forward propagation simulates quantization noise
- Backward propagation uses Straight-Through Estimator (STE) to pass gradients
- Progressive quantization, special handling for sensitive layers
Performance: INT8 accuracy loss <1%, 5-15% accuracy improvement compared to PTQ
Resource requirements:
- 7B model requires approximately 80GB GPU memory
- 13B model requires approximately 160GB GPU memory
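The forward/backward mechanism above can be illustrated with a toy example: the forward pass uses fake-quantized weights, while the backward pass applies the Straight-Through Estimator, passing the loss gradient through the rounding operation unchanged so it updates the underlying FP32 weight. The fixed scale is a simplification; real QAT learns or recalibrates it.

```python
import numpy as np

def fake_quantize(w, scale, num_bits=4):
    """Forward pass: simulate INT4 quantization noise (round + clip)."""
    qmax = 2 ** (num_bits - 1) - 1            # +7 for signed 4-bit
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

# Toy QAT loop for a one-weight model y = w * x, loss = (y - y_true)^2
w = 0.9                       # FP32 "shadow" weight that optimization updates
x, y_true, lr, scale = 2.0, 1.0, 0.05, 0.1

for _ in range(50):
    w_q = fake_quantize(w, scale)             # forward uses quantized weight
    grad_wq = 2.0 * (w_q * x - y_true) * x    # dL/dw_q
    w -= lr * grad_wq                         # STE: dL/dw is taken as dL/dw_q

y = fake_quantize(w, scale) * x               # settles on the exact target here
```

The rounding step has zero gradient almost everywhere, so without STE the FP32 weight would never move; with it, training steers the weight onto a grid point that minimizes the quantized loss.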
3. Low-Bit Quantization Technology (INT4/INT2)
Technical Challenges and Solutions
| Challenge | Solution |
|---|---|
| Information loss | Group scaling (independent scale factor per group) |
| Outliers | Store separately as FP16, clustering quantization, non-linear quantization |
| Accuracy degradation | QAT+STE, learnable scale factors |
Performance Data
- ResNet50: INT4 (GPTQ method) 74.6% top-1 (≈98% of FP32 accuracy retained)
- LLaMA-13B: INT4 (AWQ method) perplexity 10.31
Practical Suggestions
- Models larger than 13B: prioritize INT4, using GPTQ/AWQ
- Models of 7B and below: INT8 recommended; if INT4 is needed, combine it with QAT
4. Mixed Precision Quantization
Mainstream Solutions
- W4A16 (weights 4-bit/activations 16-bit): ~75% weight-memory reduction, <1% accuracy loss
- INT8: 2-4x throughput improvement
- FP8 (H100): Reduces type conversion overhead
- AWQ: Identify outlier-related weights to retain higher bit width
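A minimal sketch of the W4A16 pattern: weights live in memory as INT4 codes plus per-column FP16 scales, and are dequantized to FP16 just before the matmul, so weight storage shrinks by 75% while compute stays in FP16. Names and layout are illustrative; production kernels fuse the dequantize into the matmul.

```python
import numpy as np

def w4a16_matmul(x_fp16, w_q, w_scales):
    """W4A16: INT4 weight codes + per-column scales, FP16 activations.
    Dequantize the weights on the fly, then compute in FP16."""
    w_fp16 = w_q.astype(np.float16) * w_scales
    return x_fp16 @ w_fp16

rng = np.random.default_rng(2)
w_fp32 = rng.normal(0.0, 0.02, size=(64, 32)).astype(np.float32)

# Offline: per-column symmetric INT4 quantization of the weight matrix
scales = (np.abs(w_fp32).max(axis=0) / 7).astype(np.float16)
w_q = np.clip(np.round(w_fp32 / scales), -8, 7).astype(np.int8)

x = rng.normal(0.0, 1.0, size=(4, 64)).astype(np.float16)
y = w4a16_matmul(x, w_q, scales)

# Weight storage: 4 bits vs 16 bits per element -> 75% reduction
weight_saving = 1 - 4 / 16
```

Keeping activations at 16 bits sidesteps the hardest part of quantization (activation outliers), which is why W4A16 holds accuracy loss under 1% while still cutting most of the memory traffic.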
5. Combining LoRA with Quantization
QLoRA Scheme
- Quantization phase: base weights converted to 4-bit NF4 (NormalFloat) format (13GB→3.5GB)
- Fine-tuning phase: Freeze quantized weights, only train LoRA parameters (0.2% of original parameters)
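The two phases combine in the forward pass: the 4-bit base weight is dequantized but never updated, while the low-rank adapters A and B carry all the trainable parameters. A simplified sketch, using a symmetric INT4 stand-in for NF4 and the usual LoRA alpha/r scaling convention (all shapes and names are illustrative):

```python
import numpy as np

d_in, d_out, r = 256, 256, 8

rng = np.random.default_rng(3)
w_fp32 = rng.normal(0.0, 0.02, size=(d_in, d_out)).astype(np.float32)

# Phase 1: quantize the base weight to 4-bit; it is frozen afterwards
scale = np.abs(w_fp32).max() / 7
w_q = np.clip(np.round(w_fp32 / scale), -8, 7).astype(np.int8)

# Phase 2: trainable LoRA adapters; B starts at zero so the adapter
# initially contributes nothing and training starts from the base model
A = rng.normal(0.0, 0.01, size=(d_in, r)).astype(np.float32)
B = np.zeros((r, d_out), dtype=np.float32)
lora_alpha = 16

def qlora_forward(x):
    base = x @ (w_q.astype(np.float32) * scale)       # frozen 4-bit base
    delta = (lora_alpha / r) * (x @ A) @ B            # trainable update
    return base + delta

x = rng.normal(0.0, 1.0, size=(2, d_in)).astype(np.float32)
y0 = qlora_forward(x)              # equals the base output while B is zero

trainable = A.size + B.size        # 2 * d * r adapter parameters
ratio = trainable / w_fp32.size    # 6.25% at this toy width; well under 1%
                                   # at real model widths (d in the thousands)
```

Because gradients only flow into A and B, optimizer state is needed for a tiny fraction of the parameters, which is where the bulk of QLoRA's memory savings beyond the 4-bit storage comes from.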
Advanced Scheme Comparison
| Method | Core Innovation | Quantization Accuracy | Effect |
|---|---|---|---|
| QA-LoRA | Constrain LoRA to match quantization groups | Full INT4 | 98% accuracy |
| L4Q | Online merging + joint optimization | Full INT4 | 2.5% improvement over QLoRA |
Resource Comparison
| Scheme | Memory Usage | Inference Latency | Accuracy |
|---|---|---|---|
| Full parameters FP16 | 100% | 1.0x | 100% |
| QLoRA | 15% | 1.2x | 95% |
| L4Q | 12% | 1.1x | 97% |
Summary
| Technology | Applicable Scenario | Advantages | Cost |
|---|---|---|---|
| PTQ | Quick prototyping, temporary deployment | Zero training cost | Higher accuracy loss |
| QAT | High accuracy requirement scenarios | Maintains 95%+ accuracy | Requires retraining |
| Mixed precision | Real-time systems | Intelligent resource allocation | Complex configuration |
| QLoRA | Edge devices + fine-tuning | 75% memory savings | Depends on fine-tuning data |
Future directions: Adaptive dynamic quantization, conditional computation quantization, hardware-aware design