Training Strategy: Multi-stage Optimization and Alignment
Initialization
- Thinker Main LLM: Initialized from Qwen2.5 base model
- Vision Encoder: Initialized from the vision encoder of Qwen2.5-VL
- Audio Encoder: Initialized from Whisper-large-v3
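As a hedged sketch of how these components could be loaded with Hugging Face transformers (the model IDs are the public checkpoints; the `.visual`/`.encoder` attribute access follows the transformers implementations but may vary by version, and composing the pieces into one model is left abstract):

```python
# Illustrative only: loading the three pretrained components that seed the
# model. Composing them into one Omni wrapper is out of scope here; the
# .visual and .encoder attributes follow the public transformers code.
from transformers import (AutoModelForCausalLM, WhisperModel,
                          Qwen2_5_VLForConditionalGeneration)

llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")            # Thinker main LLM
vision = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct").visual                                # ViT vision encoder
audio = WhisperModel.from_pretrained("openai/whisper-large-v3").encoder  # audio encoder
```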
Stage 1: Encoder Alignment Pre-training
Freeze the LLM parameters and train only the vision and audio encoders (including the adapter layers that connect them to the LLM).
Goal: Align image and speech encoder outputs with the LLM's semantic space without disrupting the pretrained language model's knowledge.
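A minimal PyTorch sketch of this freezing scheme; the module names (`llm`, `vision_encoder`, the adapters, etc.) are hypothetical stand-ins, not the actual Qwen codebase:

```python
import torch
from torch import nn

# Toy stand-in modules; the real encoder/adapter/LLM classes are far larger.
class Omni(nn.Module):
    def __init__(self):
        super().__init__()
        self.llm = nn.Linear(8, 8)             # placeholder for the Thinker LLM
        self.vision_encoder = nn.Linear(8, 8)  # placeholder for the ViT encoder
        self.vision_adapter = nn.Linear(8, 8)  # projects vision features into LLM space
        self.audio_encoder = nn.Linear(8, 8)   # placeholder for the Whisper encoder
        self.audio_adapter = nn.Linear(8, 8)   # projects audio features into LLM space

model = Omni()

# Stage 1: freeze the LLM; encoders and adapters stay trainable.
for p in model.llm.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # lr is a placeholder value
```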
Stage 2: Full Multimodal Joint Training
Unfreeze the LLM so that all model parameters participate in training.
Training data: Image-text pairs, audio-video pairs, audio-text pairs, multimodal mixtures, etc. (see the sampling sketch below).
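One common way to feed such a heterogeneous mixture is weighted sampling across sources; a hedged sketch with made-up source names and weights (the real mixture ratios are not given here):

```python
import random

# Hypothetical in-memory stand-ins for the data sources; the mixture
# weights are placeholders, not the published recipe.
sources = {
    "image_text":  (["<image + caption example>"] * 100, 0.4),
    "audio_text":  (["<audio + transcript example>"] * 100, 0.2),
    "video_audio": (["<video + audio example>"] * 100, 0.2),
    "mixed":       (["<interleaved multimodal example>"] * 100, 0.2),
}

names = list(sources)
weights = [w for _, w in sources.values()]

def sample_batch(batch_size: int):
    """Pick a source per example according to the mixture weights."""
    picks = random.choices(names, weights=weights, k=batch_size)
    return [random.choice(sources[name][0]) for name in picks]

print(sample_batch(4))
```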
Stage 3: Long Context Enhancement
Expand the context window from 8192 to 32768 tokens.
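For the Qwen2.5 family, this kind of extension is usually expressed as a YaRN `rope_scaling` entry (a factor of 4 takes 8192 to 32768); a sketch using the transformers config convention, with the caveat that the actual Stage 3 training recipe may differ:

```python
from transformers import AutoConfig

# Sketch: extend the context window 8192 -> 32768 via YaRN (factor 4).
# Key names follow the rope_scaling convention used for Qwen2.5 models;
# treating this as the internal Stage 3 setup is an assumption.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B")
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,                             # 8192 * 4 = 32768
    "original_max_position_embeddings": 8192,
}
config.max_position_embeddings = 32768
```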
SFT and RLHF
Supervised Fine-tuning (SFT)
Uses a multi-turn dialogue format (as in ChatGPT-style assistants), rendered with the ChatML template; see the example below.
Fine-tuning data scale: Over 1 million dialogue examples.
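For reference, a single SFT example rendered in ChatML looks like this (`<|im_start|>`/`<|im_end|>` are the standard ChatML role markers used by the Qwen family; the dialogue content is invented):

```python
# One SFT training string in ChatML format.
example = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "What animal is in this picture?<|im_end|>\n"
    "<|im_start|>assistant\n"
    "The picture shows a red panda resting on a branch.<|im_end|>\n"
)
print(example)
```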
Reinforcement Learning Tuning
- DPO (Direct Preference Optimization)
- GRPO (Group Relative Policy Optimization)
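For context, the standard DPO objective pushes the policy's log-probability margin between a chosen and a rejected response above the frozen reference model's margin; a minimal PyTorch sketch, with per-response log-probabilities assumed precomputed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - ref margin)).

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model, shape [batch].
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy call with fabricated log-probabilities.
t = torch.tensor
print(dpo_loss(t([-12.0]), t([-15.0]), t([-13.0]), t([-14.5])))
```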
Multimodal Capabilities
Supported Modalities
- Input: Text, Image, Audio, Video
- Output: Text + Speech
Visual Understanding
- Image captioning
- Image Q&A
- OCR (optical character recognition)
Speech and Audio
- ASR (automatic speech recognition)
- Speech translation
- Non-speech audio understanding
Video Understanding
Processes video frames and the corresponding audio track simultaneously, keeping the two streams time-aligned.
Real-time Interaction
The Thinker-Talker architecture and chunked processing let the model keep receiving input while it is already generating output (see the sketch below).
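A toy sketch of that chunked streaming loop; `Thinker` and `Talker` here are trivial placeholders, and the only point is that input encoding and output emission interleave:

```python
# Toy sketch of Thinker-Talker streaming: input is consumed in chunks while
# text tokens and speech codes are emitted, so output starts before the
# input ends. Both classes are stand-ins, not the real architecture.
class Thinker:
    def encode_chunk(self, chunk):
        return chunk.upper()                 # stand-in for encoding one chunk
    def step(self, context):
        yield context[-1]                    # emit a "token" for the newest chunk

class Talker:
    def step(self, token):
        return f"<speech-code:{token}>"      # stand-in for speech-code generation

def stream_interaction(chunks, thinker, talker):
    context = []
    for chunk in chunks:                     # input still arriving...
        context.append(thinker.encode_chunk(chunk))
        for token in thinker.step(context):  # ...while output is produced
            yield token, talker.step(token)

for text, speech in stream_interaction(["hel", "lo "], Thinker(), Talker()):
    print(text, speech)
```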
Long Context Technologies
- Dual Chunk Attention (DCA)
- YaRN (RoPE-based context-window extension)
- TMRoPE (Time-aligned Multimodal RoPE)
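To make TMRoPE concrete: position IDs decompose into (temporal, height, width) components, and audio tokens share the temporal axis with video frames by absolute time, so simultaneous audio and video land on the same temporal position. A simplified sketch; the 40 ms-per-position granularity follows the Qwen2.5-Omni report, while everything else is illustrative rather than the real implementation:

```python
# Simplified TMRoPE sketch: tokens get (temporal, height, width) position
# IDs, and audio tokens share the temporal axis with video frames based on
# absolute time. 40 ms granularity per the Qwen2.5-Omni report; the rest
# of this toy is illustrative only.
MS_PER_POSITION = 40

def temporal_id(timestamp_ms: int) -> int:
    return timestamp_ms // MS_PER_POSITION

def video_frame_ids(timestamp_ms: int, rows: int, cols: int):
    """All patches of one frame share a temporal ID; height/width vary."""
    t = temporal_id(timestamp_ms)
    return [(t, h, w) for h in range(rows) for w in range(cols)]

def audio_token_ids(timestamp_ms: int):
    """1-D audio tokens: all three components collapse to the temporal ID."""
    t = temporal_id(timestamp_ms)
    return [(t, t, t)]

# A video frame at t=80 ms and audio at t=80 ms share temporal ID 2,
# which is what keeps the two streams time-aligned.
print(video_frame_ids(80, 2, 2)[0], audio_token_ids(80)[0])
```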