Pre-training

Pre-training is a critical step in robot learning: the model acquires general foundational capabilities by training on large-scale data.

Supervised Pre-training (Imitation Learning)

Train the model on large amounts of expert demonstration data (e.g., videos of human teleoperation) so it learns basic motion patterns and task structure.

Self-supervised Pre-training

Learn general representations from unlabeled data; common methods include contrastive learning and masked prediction.

  • Typical case: OpenAI’s VPT (Video PreTraining) pre-trains an agent on large volumes of unlabeled gameplay video
  • Google DeepMind’s RT-2 builds on vision-language models pre-trained on web-scale image-text data
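As a concrete instance of contrastive pre-training, the sketch below computes an InfoNCE-style loss over a batch of paired embeddings. It is a minimal NumPy illustration, not a real training setup; the array names and batch are invented for the example.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE loss: row i of z_a should match row i of z_b.

    z_a, z_b: (batch, dim) L2-normalized embeddings of two views.
    """
    logits = z_a @ z_b.T / temperature           # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal; loss is their mean negative log-prob.
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)
# A slightly perturbed copy plays the role of an augmented second view.
aug = z + 0.01 * rng.normal(size=z.shape)
aug /= np.linalg.norm(aug, axis=1, keepdims=True)
# Matched views yield a far lower loss than mismatched pairings.
print(info_nce_loss(z, aug), info_nce_loss(z, z[::-1]))
```

Minimizing this loss pulls matched views together and pushes all other batch items apart, which is the core mechanism behind contrastive representation learning.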

Transfer Learning Strategies

Use existing pre-trained models (e.g., ViT, CLIP) as frozen feature extractors, so that only the downstream control modules need to be trained.
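A minimal sketch of this strategy, with a fixed random projection standing in for a real pre-trained backbone such as ViT or CLIP (the observations, actions, and matrix shapes here are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "feature extractor": a fixed random projection stands in for a
# pre-trained backbone. Its weights are never updated.
W_frozen = rng.normal(size=(64, 128))

def extract_features(obs):
    return np.tanh(obs @ W_frozen)

# Fake data: 200 observations mapped to 7-DoF target actions.
obs = rng.normal(size=(200, 64))
actions = rng.normal(size=(200, 7))

# The downstream control head is the ONLY trainable component; here it is a
# linear map fitted in closed form by least squares for simplicity.
feats = extract_features(obs)
W_head, *_ = np.linalg.lstsq(feats, actions, rcond=None)

pred = feats @ W_head
print("train MSE:", np.mean((pred - actions) ** 2))
```

In practice the head would be a small network trained by gradient descent, but the division of labor is the same: the backbone stays frozen, and only the lightweight control module is fitted to the task.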

Pre-training Outputs

  • Initial policy model
  • Feature extractor
  • World model

Typical data scale:

  • Vision tasks: 100,000+ images/videos
  • Control tasks: 1,000+ hours of operation records

Fine-tuning

Full Parameter Updates

Update all model parameters. This requires more computational resources and training data, and suits cases where the target task differs significantly from the pre-training task.

Parameter-Efficient Fine-tuning (PEFT)

Update only a subset of parameters, or add a small number of trainable parameters; common methods include:

  • LoRA
  • Adapter
  • Prefix-tuning

A well-chosen fine-tuning strategy can cut training cost by roughly 50–90% while largely preserving model performance.

Reinforcement Learning Fine-tuning / Online Training

Follow the “IL→RL” paradigm:

Imitation Learning Stage

Train a base policy on expert demonstration data so the robot first masters the basic skill repertoire.

Reinforcement Learning Stage

  • Design fine-grained reward functions
  • Let the robot autonomously explore the optimization space
  • Improve the policy through trial-and-error learning

Typical methods: policy-gradient methods, Q-learning, residual learning, etc.
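To make the policy-gradient idea concrete, here is a tiny REINFORCE loop on an invented 3-armed bandit (all rewards and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed bandit: arm 2 pays the highest expected reward.
true_rewards = np.array([0.1, 0.3, 0.9])
logits = np.zeros(3)                 # policy parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lr = 0.1
for _ in range(2000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)                 # sample an action
    r = true_rewards[a] + 0.1 * rng.normal()   # noisy reward
    # REINFORCE: grad of log pi(a) is one_hot(a) - probs; scale by reward.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    logits += lr * r * grad_log_pi

print("final policy:", np.round(softmax(logits), 2))
```

Through trial and error alone, probability mass concentrates on the best arm; real robot RL replaces the bandit with a simulator or real environment and the logits with a neural policy.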

Reward Modeling and Human Feedback (RLHF)

When an explicit numerical reward is hard to design, use human feedback to shape model behavior.

RLHF Pipeline

  1. Data Collection Phase: Collect large numbers of human preference comparisons
  2. Reward Model Training: Fit a reward model R(s) to the comparison data
  3. Policy Optimization Phase: Optimize the policy with reinforcement learning (e.g., PPO) against the learned reward model
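Steps 1–2 of the pipeline can be sketched with a linear reward model and the standard Bradley–Terry preference likelihood, P(i preferred over j) = sigmoid(R(s_i) − R(s_j)). Everything here (the hidden annotator reward, the state dimensionality, the learning rate) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden "true" reward the annotators implicitly use (illustrative only).
w_true = rng.normal(size=5)
states = rng.normal(size=(500, 5))

# Step 1 - data collection: pairs (i, j) with the preferred state first.
pairs = rng.integers(0, 500, size=(1000, 2))
prefs = (states[pairs[:, 0]] @ w_true) > (states[pairs[:, 1]] @ w_true)
pairs = np.where(prefs[:, None], pairs, pairs[:, ::-1])

# Step 2 - reward model training: fit R(s) = w . s by maximizing the
# Bradley-Terry log-likelihood with plain gradient ascent.
w = np.zeros(5)
for _ in range(200):
    diff = states[pairs[:, 0]] - states[pairs[:, 1]]
    p = 1.0 / (1.0 + np.exp(-(diff @ w)))      # predicted preference prob
    w += 0.5 * diff.T @ (1.0 - p) / len(pairs) # gradient of log-likelihood

# The learned reward ranks states almost exactly like the hidden one.
print("corr:", np.corrcoef(states @ w_true, states @ w)[0, 1])
```

Step 3 would then run PPO (or another RL algorithm) using the learned R(s) in place of an environment reward.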

Training Optimization Techniques

  • Use causal masking in imitation learning to prevent leakage of future information
  • Use experience replay and reward normalization in RL to stabilize training
  • Use FlashAttention in Transformer models to accelerate training
  • Behavior cloning (BC) commonly uses an MSE or cross-entropy loss; RL uses a policy-gradient loss
  • An auxiliary imitation loss can be added to fuse IL and RL
  • For continuous control, smoothing terms can be added to the loss
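The first technique, causal masking, can be sketched in a few lines of NumPy: future positions are masked with −inf before the softmax, so step t can only attend to steps ≤ t (the 4×4 toy scores are illustrative):

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over attention scores with future positions masked out."""
    T = scores.shape[-1]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above diagonal
    scores = np.where(future, -np.inf, scores)          # block future steps
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

w = causal_attention_weights(np.zeros((4, 4)))
# Each row is a valid distribution over past-and-present positions only;
# all weight above the diagonal is exactly zero.
print(w)
```

Without this mask, a sequence model trained on demonstrations could condition on future actions and would fail at deployment time, when those actions do not yet exist.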

Iterative Training

Robot learning is a continuously optimized closed-loop process. Typical iteration cycle:

  1. Initial model training
  2. Deployment testing
  3. Data augmentation
  4. Model improvement
  5. Difficulty progression (Curriculum Learning)
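The closed loop above, with curriculum-style difficulty progression, can be sketched as a toy simulation: a level is passed once the recent success rate clears a threshold, at which point the task gets harder. All quantities here are invented stand-ins for a real train/deploy/evaluate cycle:

```python
import numpy as np

rng = np.random.default_rng(0)

skill, level, history = 0.0, 0, []
while level < 3:
    difficulty = 0.3 * (level + 1)
    # Success probability improves as the (simulated) skill grows.
    success = rng.random() < min(1.0, skill - difficulty + 1.0)
    skill += 0.01                       # model improvement per iteration
    history.append((level, success))
    # Advance the curriculum once 20 recent attempts at this level
    # succeed more than 80% of the time.
    recent = [s for lv, s in history[-20:] if lv == level]
    if len(recent) >= 20 and np.mean(recent) > 0.8:
        level += 1
print("iterations to finish curriculum:", len(history))
```

The same structure governs real pipelines: deploy, measure success, collect more data, improve the model, and only then raise the task difficulty.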

Training Path in Practical Applications

  • Pre-training in simulation
  • Fine-tuning on the real robot
  • Continuous optimization during deployment