Pre-training
Pre-training is a critical first step in robot learning: the model acquires general foundational capabilities from large-scale data before being adapted to specific tasks.
Supervised Pre-training (Imitation Learning)
Large amounts of expert demonstration data (e.g., videos of human operation) are used to train the model to learn basic motion patterns and task understanding.
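A minimal sketch of the supervised (behavior-cloning) objective, using NumPy only. The dataset, dimensions, and linear policy are illustrative assumptions, not a real robot setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demonstration dataset: states paired with expert actions.
states = rng.normal(size=(256, 8))   # 256 demos, 8-dim state
W_true = rng.normal(size=(8, 2))     # unknown expert mapping (for the toy data)
actions = states @ W_true            # 2-dim expert actions

# Linear policy trained with an MSE loss (the usual BC objective for
# continuous actions) via plain gradient descent.
W = np.zeros((8, 2))
lr = 0.01
for _ in range(1000):
    pred = states @ W
    grad = states.T @ (pred - actions) / len(states)  # d(MSE)/dW
    W -= lr * grad

mse = np.mean((states @ W - actions) ** 2)
print(round(mse, 6))  # training loss after fitting the demonstrations
```

The same pattern scales up directly: replace the linear map with a neural network and the toy dataset with logged demonstrations.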
Self-supervised Pre-training
Learn general representations from unlabeled data; common methods include contrastive learning and masked prediction.
- Typical Case: OpenAI’s VPT (Video PreTraining) learns a behavioral prior from roughly 70,000 hours of unlabeled Minecraft gameplay video
- Google DeepMind’s RT-2: builds on vision-language models pre-trained on billions of web image-text pairs
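To make the contrastive option concrete, here is a hedged NumPy sketch of the InfoNCE loss: each sample's embedding should be most similar to its own augmented view. The embeddings and noise level are synthetic stand-ins:

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE contrastive loss: row i of z1 should match row i of z2."""
    # L2-normalise the embeddings of the two augmented views.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # pairwise cosine similarities
    # Cross-entropy with the matching index as the implicit "label".
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
anchor = rng.normal(size=(16, 32))                     # 16 samples, 32-dim
positive = anchor + 0.05 * rng.normal(size=(16, 32))   # lightly augmented view
loss = info_nce(anchor, positive)
print(round(loss, 3))  # low loss: matching pairs are the most similar
```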
Transfer Learning Strategies
Use an existing pre-trained model (e.g., ViT or CLIP) as a frozen feature extractor, so that only the downstream control module needs to be trained.
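A sketch of this split, with a fixed random projection standing in for the frozen backbone (a real system would load ViT/CLIP weights instead); only the downstream head is fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pre-trained backbone (e.g., ViT/CLIP):
# a fixed nonlinear projection whose weights are never updated.
W_backbone = rng.normal(size=(64, 16))
def extract_features(x):
    return np.tanh(x @ W_backbone)

# Toy downstream control task: regress a 4-dim action from the features.
x = rng.normal(size=(200, 64))
target_map = rng.normal(size=(16, 4))
y = extract_features(x) @ target_map   # synthetic "correct" actions

# Only the downstream head is trained (least squares here for brevity).
feats = extract_features(x)
W_head, *_ = np.linalg.lstsq(feats, y, rcond=None)

mse = np.mean((feats @ W_head - y) ** 2)
print(round(mse, 6))  # near zero: the head fits the frozen features
```

Because the backbone is frozen, only the small head's parameters are updated, which is what makes this strategy cheap.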
Pre-training Outputs
- Initial policy model
- Feature extractor
- World model
Recommended Data Volume
- Vision tasks: 100,000+ images/videos
- Control tasks: 1,000+ hours of operation records
Fine-tuning
Full Parameter Updates
Update all model parameters. This requires more compute and training data, and is suitable when the target task differs significantly from the pre-training task.
Parameter-Efficient Fine-tuning (PEFT)
Update only a subset of parameters, or add a small number of trainable parameters. Common methods include:
- LoRA
- Adapter
- Prefix-tuning
A well-chosen fine-tuning strategy can cut training costs by 50-90% while maintaining model performance.
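LoRA illustrates where those savings come from: the frozen weight W is augmented with a low-rank update (alpha/r)·B·A, and only A and B are trained. A NumPy sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 512, 512, 8           # layer size and LoRA rank (illustrative)
alpha = 16.0                           # LoRA scaling hyperparameter

W = rng.normal(size=(d_out, d_in))     # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, initialised small
B = np.zeros((d_out, r))               # trainable, initialised to zero

def lora_forward(x):
    # Effective weight is W + (alpha/r) * B @ A; W itself never changes.
    return x @ (W + (alpha / r) * B @ A).T

x = rng.normal(size=(1, d_in))
# With B = 0 the adapter is a no-op: output matches the frozen layer.
assert np.allclose(lora_forward(x), x @ W.T)

full = W.size
lora = A.size + B.size
print(f"trainable params: {lora} vs {full} ({100 * lora / full:.1f}%)")
# trainable params: 8192 vs 262144 (3.1%)
```

For this layer, LoRA trains about 3% of the parameters, which is the mechanism behind the cost reduction cited above.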
Reinforcement Learning Fine-tuning / Online Training
Follow the “IL→RL” paradigm:
Imitation Learning Stage
Train a basic policy on expert demonstration data, ensuring the robot first masters the core skill framework.
Reinforcement Learning Stage
- Design fine-grained reward functions
- Let the robot autonomously explore the policy space
- Improve the policy through trial-and-error learning
Typical Methods: policy gradient methods, Q-learning, residual learning, etc.
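A minimal policy-gradient (REINFORCE) sketch on a toy two-armed bandit; the environment and hyperparameters are illustrative assumptions, not a robotics benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-armed bandit: arm 1 pays more on average.
def reward(arm):
    return rng.normal(loc=[0.0, 1.0][arm], scale=0.1)

theta = np.zeros(2)                    # policy logits
lr = 0.1
for _ in range(2000):
    probs = np.exp(theta) / np.exp(theta).sum()  # softmax policy
    arm = rng.choice(2, p=probs)                 # explore by sampling
    r = reward(arm)
    # REINFORCE: grad of log pi(arm) = one_hot(arm) - probs
    grad_logp = -probs
    grad_logp[arm] += 1.0
    theta += lr * r * grad_logp        # ascend expected reward

probs = np.exp(theta) / np.exp(theta).sum()
print(round(probs[1], 2))  # the policy comes to prefer the better arm
```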
Reward Modeling and Human Feedback (RLHF)
When an explicit numerical reward is difficult to design, human feedback is used to shape the model’s behavior.
RLHF Pipeline
- Data Collection: gather large amounts of human preference comparisons
- Reward Model Training: fit a reward model R to the comparison data
- Policy Optimization: run reinforcement learning (e.g., PPO) against the trained reward model
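The reward-model step can be sketched with the Bradley-Terry model commonly used in RLHF: P(A preferred over B) = sigmoid(R(A) - R(B)). Below, a linear reward is fit to synthetic preference pairs; the features and "true" reward are assumptions for the toy setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical preference data: feature vectors for two options per pair,
# where a hidden "true" reward decides which option the human prefers.
w_true = rng.normal(size=6)
a = rng.normal(size=(500, 6))          # features of option A
b = rng.normal(size=(500, 6))          # features of option B
prefer_a = (a @ w_true > b @ w_true).astype(float)

# Bradley-Terry: P(A > B) = sigmoid(R(A) - R(B)), with a linear R here.
w = np.zeros(6)
lr = 0.5
for _ in range(300):
    z = np.clip((a - b) @ w, -30, 30)          # clip for numerical safety
    p = 1.0 / (1.0 + np.exp(-z))
    grad = (a - b).T @ (p - prefer_a) / len(a)  # logistic-loss gradient
    w -= lr * grad

# The learned reward ranks the pairs consistently with the preferences.
agree = np.mean((a @ w > b @ w) == (prefer_a == 1))
print(round(agree, 2))
```

The fitted R then serves as the reward signal for the policy-optimization stage.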
Training Optimization Techniques
- Use causal masking in imitation learning to prevent future-information leakage
- Use experience replay and reward normalization in RL to stabilize training
- Use FlashAttention in Transformer models to accelerate training
- BC commonly uses an MSE or cross-entropy loss; RL uses a policy gradient loss
- An imitation loss term can be added to fuse IL and RL
- For continuous control, a smoothness term can be added to the loss
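The first technique above can be sketched directly: a causal mask sets all "future" attention scores to minus infinity before the softmax, so position i can only attend to positions up to i. The 4x4 scores here are random stand-ins:

```python
import numpy as np

def causal_softmax(scores):
    """Apply a causal mask so position i only attends to positions <= i."""
    T = scores.shape[-1]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # strictly-future entries
    masked = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over each row.
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
attn = causal_softmax(rng.normal(size=(4, 4)))
# The upper triangle is exactly zero: no future information leaks.
print(np.triu(attn, k=1).sum() == 0.0)
```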
Iterative Training
Robot learning is a continuously optimized closed loop. A typical iteration cycle:
- Initial model training
- Deployment testing
- Data augmentation
- Model improvement
- Difficulty progression (Curriculum Learning)
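The cycle above can be sketched as a loop with a simple curriculum rule: advance to harder tasks once the measured success rate clears a threshold. The success model below is a made-up stand-in for deployment testing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for deployment testing: an assumed success model where the
# policy improves with training and succeeds less on harder tasks.
def evaluate(skill, difficulty, episodes=100):
    p = min(1.0, max(0.0, skill - 0.2 * difficulty))
    return rng.binomial(episodes, p) / episodes

skill, difficulty = 0.3, 0
history = []
for round_ in range(10):
    success = evaluate(skill, difficulty)   # deployment testing
    history.append((round_, difficulty, success))
    skill += 0.15                           # data augmentation + retraining
    if success > 0.8:                       # curriculum rule: raise difficulty
        difficulty += 1

print(history[-1])  # final (round, difficulty, success_rate)
```

The threshold and skill-growth numbers are arbitrary; the point is the loop structure: evaluate, improve, and only then increase task difficulty.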
Training Path in Practical Applications
- Simulation pre-training stage
- Real-robot fine-tuning stage
- Continuous optimization