1. Post-Training Quantization (PTQ)

Working principle: after training completes, a small calibration set (typically 100-500 samples) is run through the model to estimate activation distributions; from these, the quantization parameters (scale factor, zero point, clipping range) are derived, and the FP32 weights are converted to INT8/INT4.

Advantages: Efficient and fast (completed in hours), no retraining required, plug-and-play.

Limitations: Quantization at bit widths below 4 incurs a 5-15% accuracy degradation.
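
As a concrete illustration of the calibration step, here is a minimal sketch of per-tensor affine quantization in PyTorch (helper names are illustrative, not a production toolkit):

```python
import torch

def calibrate_affine_params(samples: torch.Tensor, num_bits: int = 8):
    """Derive scale and zero point from the observed calibration range."""
    qmin, qmax = 0, 2 ** num_bits - 1              # unsigned range, e.g. [0, 255]
    x_min = min(samples.min().item(), 0.0)         # range must include zero
    x_max = max(samples.max().item(), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, num_bits=8):
    qmax = 2 ** num_bits - 1
    return torch.clamp(torch.round(x / scale) + zero_point, 0, qmax).to(torch.uint8)

def dequantize(q, scale, zero_point):
    return (q.float() - zero_point) * scale

# Calibrate on a few hundred samples, then convert tensors to INT8.
calib = torch.randn(500, 128)                      # stand-in for real calibration data
scale, zp = calibrate_affine_params(calib)
q = quantize(calib, scale, zp)
print("max abs reconstruction error:", (dequantize(q, scale, zp) - calib).abs().max().item())
```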

2. Quantization-Aware Training (QAT)

Core mechanisms:

  • Forward propagation simulates quantization noise ("fake quantization")
  • Backward propagation uses the Straight-Through Estimator (STE) to pass gradients through the non-differentiable rounding step (see the sketch after this list)
  • Progressive quantization, with special handling for sensitive layers
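
A minimal sketch of these mechanics in PyTorch, assuming symmetric per-tensor INT8 fake quantization (names are illustrative):

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        # symmetric signed INT8: quantize, clamp, then dequantize,
        # so the forward pass carries real rounding noise
        q = torch.clamp(torch.round(x / scale), -128, 127)
        return q * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pass the gradient straight through the rounding step
        return grad_output, None

w = torch.randn(4, 4, requires_grad=True)
scale = w.detach().abs().max() / 127               # simple symmetric per-tensor scale
FakeQuantSTE.apply(w, scale).sum().backward()
print(torch.allclose(w.grad, torch.ones_like(w)))  # True: gradient ignored the rounding
```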

Performance: INT8 accuracy loss <1%, 5-15% accuracy improvement compared to PTQ

Resource requirements:

  • 7B model requires approximately 80GB GPU memory
  • 13B model requires approximately 160GB GPU memory

3. Low-Bit Quantization Technology (INT4/INT2)

Technical Challenges and Solutions

| Challenge | Solution |
| --- | --- |
| Information loss | Group-wise scaling (independent scale factor per group) |
| Outliers | Store separately as FP16; clustering quantization; non-linear quantization |
| Accuracy degradation | QAT + STE; learnable scale factors |
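
As an illustration of group scaling, here is a minimal sketch of group-wise INT4 quantization (illustrative helper names; real GPTQ/AWQ implementations layer error compensation and activation awareness on top of this):

```python
import torch

def quantize_groupwise_int4(w: torch.Tensor, group_size: int = 128):
    # each group of `group_size` weights along a row gets its own scale,
    # so one large value can only distort its own group
    rows, cols = w.shape
    assert cols % group_size == 0
    groups = w.reshape(rows, cols // group_size, group_size)
    scales = groups.abs().amax(dim=-1, keepdim=True) / 7   # symmetric INT4 range [-8, 7]
    q = torch.clamp(torch.round(groups / scales), -8, 7)
    return q.to(torch.int8), scales                        # int8 tensor holding 4-bit codes

def dequantize_groupwise(q: torch.Tensor, scales: torch.Tensor):
    return (q.float() * scales).reshape(q.shape[0], -1)

w = torch.randn(16, 512)
q, s = quantize_groupwise_int4(w)
print("mean abs error:", (dequantize_groupwise(q, s) - w).abs().mean().item())
```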

Performance Data

  • ResNet50: INT4 (GPTQ) reaches 74.6% accuracy (98% of the FP32 baseline retained)
  • LLaMA-13B: INT4 (AWQ) reaches a perplexity of 10.31

Practical Suggestions

  • Models larger than 13B: prioritize INT4, using GPTQ/AWQ
  • Models below 7B: INT8 is recommended; if INT4 is needed, combine it with QAT

4. Mixed Precision Quantization

Mainstream Solutions

  1. W4A16 (weights 4-bit/activations 16-bit): 75% memory reduction, <1% accuracy loss
  2. INT8: 2-4x throughput improvement
  3. FP8 (H100): Reduces type conversion overhead
  4. AWQ: Identifies the weights tied to activation outliers and keeps them at higher bit width
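
A minimal sketch of the W4A16 pattern, assuming per-row symmetric scales (illustrative; real kernels fuse the dequantization into the GEMM, and this demo uses float32 activations for CPU portability where a GPU deployment would use float16):

```python
import torch

class W4A16Linear(torch.nn.Module):
    """INT4 weight codes (stored in an int8 tensor) dequantized just-in-time;
    activations stay at higher precision throughout."""
    def __init__(self, weight: torch.Tensor, dtype: torch.dtype = torch.float32):
        super().__init__()
        self.dtype = dtype                                   # float16 on GPU in practice
        scale = weight.abs().amax(dim=1, keepdim=True) / 7   # per-output-row scale
        q = torch.clamp(torch.round(weight / scale), -8, 7)  # INT4 range [-8, 7]
        self.register_buffer("q_weight", q.to(torch.int8))
        self.register_buffer("scale", scale.to(dtype))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.q_weight.to(self.dtype) * self.scale        # just-in-time dequant
        return x @ w.t()                                     # matmul at activation precision

layer = W4A16Linear(torch.randn(256, 512))
print(layer(torch.randn(4, 512)).shape)                      # torch.Size([4, 256])
```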

5. Combining LoRA with Quantization

QLoRA Scheme

  1. Quantization phase: the base model's weights are quantized to 4-bit (NF4 in the original QLoRA; GPTQ is a common alternative), shrinking them from 13GB to 3.5GB
  2. Fine-tuning phase: freeze the quantized weights and train only the LoRA parameters (about 0.2% of the original parameter count), as sketched below
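
A minimal sketch of this two-phase idea, assuming per-row symmetric INT4 for the frozen base (illustrative names, not the bitsandbytes/peft API):

```python
import torch

class QuantizedLoRALinear(torch.nn.Module):
    def __init__(self, weight: torch.Tensor, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        out_f, in_f = weight.shape
        # frozen 4-bit base weight (symmetric per-row quantization)
        scale = weight.abs().amax(dim=1, keepdim=True) / 7
        q = torch.clamp(torch.round(weight / scale), -8, 7)
        self.register_buffer("q_weight", q.to(torch.int8))
        self.register_buffer("scale", scale)
        # trainable LoRA adapters: the only parameters with gradients
        self.lora_A = torch.nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_B = torch.nn.Parameter(torch.zeros(out_f, rank))  # zero init: adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x):
        w = self.q_weight.float() * self.scale               # dequantize the frozen base
        base = x @ w.t()
        update = (x @ self.lora_A.t()) @ self.lora_B.t()     # low-rank correction
        return base + self.scaling * update

layer = QuantizedLoRALinear(torch.randn(64, 128))
layer(torch.randn(2, 128)).pow(2).mean().backward()
print(layer.lora_A.grad is not None, layer.q_weight.requires_grad)  # True False
```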

Advanced Scheme Comparison

| Method | Core Innovation | Quantization Precision | Effect |
| --- | --- | --- | --- |
| QA-LoRA | Constrains LoRA updates to match quantization groups | Full INT4 | 98% accuracy |
| L4Q | Online merging + joint optimization | Full INT4 | 2.5% improvement over QLoRA |

Resource Comparison

| Scheme | Memory Usage | Inference Latency | Accuracy |
| --- | --- | --- | --- |
| Full-parameter FP16 | 100% | 1.0x | 100% |
| QLoRA | 15% | 1.2x | 95% |
| L4Q | 12% | 1.1x | 97% |

Summary

| Technology | Applicable Scenario | Advantages | Cost |
| --- | --- | --- | --- |
| PTQ | Quick prototyping, temporary deployment | Zero training cost | Higher accuracy loss |
| QAT | High-accuracy-requirement scenarios | Maintains 95%+ accuracy | Requires retraining |
| Mixed precision | Real-time systems | Intelligent resource allocation | Complex configuration |
| QLoRA | Edge devices + fine-tuning | 75% memory savings | Depends on fine-tuning data |

Future directions: Adaptive dynamic quantization, conditional computation quantization, hardware-aware design