Pre-training

Pre-training is a critical step in robot learning: the model acquires general foundational capabilities by training on large-scale data.

Supervised Pre-training (Imitation Learning)

Train the model on large amounts of expert demonstration data (e.g., videos of human teleoperation) so it learns basic motion patterns and task structure.

Self-supervised Pre-training

Learn general representations from unlabeled data; common methods include contrastive learning and masked prediction.

  • Typical case: OpenAI’s VPT (Video PreTraining) pre-trains an agent on large volumes of unlabeled gameplay video
  • Google DeepMind’s RT-2 builds on vision-language models pre-trained on web-scale image-text data
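As a concrete instance of contrastive pre-training, the sketch below computes an InfoNCE-style loss over a batch of paired embeddings. It is a minimal NumPy illustration, not a real training setup; the array names and batch are invented for the example.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE loss: row i of z_a should match row i of z_b.

    z_a, z_b: (batch, dim) L2-normalized embeddings of two views.
    """
    logits = z_a @ z_b.T / temperature           # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal; loss is their mean negative log-prob.
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)
# A slightly perturbed copy plays the role of an augmented second view.
aug = z + 0.01 * rng.normal(size=z.shape)
aug /= np.linalg.norm(aug, axis=1, keepdims=True)
# Matched views yield a far lower loss than mismatched pairings.
print(info_nce_loss(z, aug), info_nce_loss(z, z[::-1]))
```

Minimizing this loss pulls matched views together and pushes all other batch items apart, which is the core mechanism behind contrastive representation learning.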

Transfer Learning Strategies

Use existing pre-trained models (e.g., ViT, CLIP) as frozen feature extractors, so that only the downstream control modules need to be trained.
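A minimal sketch of this strategy, with a fixed random projection standing in for a real pre-trained backbone such as ViT or CLIP (the observations, actions, and matrix shapes here are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "feature extractor": a fixed random projection stands in for a
# pre-trained backbone. Its weights are never updated.
W_frozen = rng.normal(size=(64, 128))

def extract_features(obs):
    return np.tanh(obs @ W_frozen)

# Fake data: 200 observations mapped to 7-DoF target actions.
obs = rng.normal(size=(200, 64))
actions = rng.normal(size=(200, 7))

# The downstream control head is the ONLY trainable component; here it is a
# linear map fitted in closed form by least squares for simplicity.
feats = extract_features(obs)
W_head, *_ = np.linalg.lstsq(feats, actions, rcond=None)

pred = feats @ W_head
print("train MSE:", np.mean((pred - actions) ** 2))
```

In practice the head would be a small network trained by gradient descent, but the division of labor is the same: the backbone stays frozen, and only the lightweight control module is fitted to the task.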

Pre-training Outputs

  • Initial policy model
  • Feature extractor
  • World model

Typical data scale:

  • Vision tasks: 100,000+ images/videos
  • Control tasks: 1,000+ hours of operation records

Fine-tuning

Full Parameter Updates

Update all model parameters. This requires more computational resources and training data, and suits cases where the target task differs significantly from the pre-training task.

Parameter-Efficient Fine-tuning (PEFT)

Update only a subset of parameters, or add a small number of trainable parameters; common methods include:

  • LoRA
  • Adapter
  • Prefix-tuning

A well-chosen fine-tuning strategy can cut training cost by roughly 50–90% while largely preserving model performance.

Reinforcement Learning Fine-tuning / Online Training

Follow the “IL→RL” paradigm:

Imitation Learning Stage

Train a base policy on expert demonstration data so the robot first masters the basic skill repertoire.

Reinforcement Learning Stage

  • Design fine-grained reward functions
  • Let the robot autonomously explore the optimization space
  • Improve the policy through trial-and-error learning

Typical methods: policy-gradient methods, Q-learning, residual learning, etc.
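To make the policy-gradient idea concrete, here is a tiny REINFORCE loop on an invented 3-armed bandit (all rewards and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed bandit: arm 2 pays the highest expected reward.
true_rewards = np.array([0.1, 0.3, 0.9])
logits = np.zeros(3)                 # policy parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lr = 0.1
for _ in range(2000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)                 # sample an action
    r = true_rewards[a] + 0.1 * rng.normal()   # noisy reward
    # REINFORCE: grad of log pi(a) is one_hot(a) - probs; scale by reward.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    logits += lr * r * grad_log_pi

print("final policy:", np.round(softmax(logits), 2))
```

Through trial and error alone, probability mass concentrates on the best arm; real robot RL replaces the bandit with a simulator or real environment and the logits with a neural policy.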

Reward Modeling and Human Feedback (RLHF)

When an explicit numerical reward is hard to design, use human feedback to shape model behavior.

RLHF Pipeline

  1. Data Collection Phase: Collect large numbers of human preference comparisons
  2. Reward Model Training: Fit a reward model R(s) to the comparison data
  3. Policy Optimization Phase: Optimize the policy with reinforcement learning (e.g., PPO) against the learned reward model
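Steps 1–2 of the pipeline can be sketched with a linear reward model and the standard Bradley–Terry preference likelihood, P(i preferred over j) = sigmoid(R(s_i) − R(s_j)). Everything here (the hidden annotator reward, the state dimensionality, the learning rate) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden "true" reward the annotators implicitly use (illustrative only).
w_true = rng.normal(size=5)
states = rng.normal(size=(500, 5))

# Step 1 - data collection: pairs (i, j) with the preferred state first.
pairs = rng.integers(0, 500, size=(1000, 2))
prefs = (states[pairs[:, 0]] @ w_true) > (states[pairs[:, 1]] @ w_true)
pairs = np.where(prefs[:, None], pairs, pairs[:, ::-1])

# Step 2 - reward model training: fit R(s) = w . s by maximizing the
# Bradley-Terry log-likelihood with plain gradient ascent.
w = np.zeros(5)
for _ in range(200):
    diff = states[pairs[:, 0]] - states[pairs[:, 1]]
    p = 1.0 / (1.0 + np.exp(-(diff @ w)))      # predicted preference prob
    w += 0.5 * diff.T @ (1.0 - p) / len(pairs) # gradient of log-likelihood

# The learned reward ranks states almost exactly like the hidden one.
print("corr:", np.corrcoef(states @ w_true, states @ w)[0, 1])
```

Step 3 would then run PPO (or another RL algorithm) using the learned R(s) in place of an environment reward.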

Training Optimization Techniques

  • Use causal masking in imitation learning to prevent leakage of future information
  • Use experience replay and reward normalization in RL to stabilize training
  • Use FlashAttention in Transformer models to accelerate training
  • Behavior cloning (BC) commonly uses an MSE or cross-entropy loss; RL uses a policy-gradient loss
  • An auxiliary imitation loss can be added to fuse IL and RL
  • For continuous control, smoothing terms can be added to the loss
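The first technique, causal masking, can be sketched in a few lines of NumPy: future positions are masked with −inf before the softmax, so step t can only attend to steps ≤ t (the 4×4 toy scores are illustrative):

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over attention scores with future positions masked out."""
    T = scores.shape[-1]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above diagonal
    scores = np.where(future, -np.inf, scores)          # block future steps
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

w = causal_attention_weights(np.zeros((4, 4)))
# Each row is a valid distribution over past-and-present positions only;
# all weight above the diagonal is exactly zero.
print(w)
```

Without this mask, a sequence model trained on demonstrations could condition on future actions and would fail at deployment time, when those actions do not yet exist.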

Iterative Training

Robot learning is a continuously optimized closed-loop process. Typical iteration cycle:

  1. Initial model training
  2. Deployment testing
  3. Data augmentation
  4. Model improvement
  5. Difficulty progression (Curriculum Learning)
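The closed loop above, with curriculum-style difficulty progression, can be sketched as a toy simulation: a level is passed once the recent success rate clears a threshold, at which point the task gets harder. All quantities here are invented stand-ins for a real train/deploy/evaluate cycle:

```python
import numpy as np

rng = np.random.default_rng(0)

skill, level, history = 0.0, 0, []
while level < 3:
    difficulty = 0.3 * (level + 1)
    # Success probability improves as the (simulated) skill grows.
    success = rng.random() < min(1.0, skill - difficulty + 1.0)
    skill += 0.01                       # model improvement per iteration
    history.append((level, success))
    # Advance the curriculum once 20 recent attempts at this level
    # succeed more than 80% of the time.
    recent = [s for lv, s in history[-20:] if lv == level]
    if len(recent) >= 20 and np.mean(recent) > 0.8:
        level += 1
print("iterations to finish curriculum:", len(history))
```

The same structure governs real pipelines: deploy, measure success, collect more data, improve the model, and only then raise the task difficulty.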

Training Path in Practical Applications

  • Pre-training in simulation
  • Fine-tuning on the real robot
  • Continuous optimization during deployment