Training Strategy: Multi-stage Optimization and Alignment

Initialization

  • Thinker Main LLM: Initialized from Qwen2.5 base model
  • Vision Encoder: Initialized from Qwen2.5-VL model
  • Audio Encoder: Initialized from Whisper-large-v3

Stage 1: Encoder Alignment Pre-training

Freeze the LLM parameters and train only the vision and audio encoders (including the adapter layers that connect them to the LLM).

Goal: Align the image and speech encoder outputs with the LLM's semantic space without disrupting the pretrained language model's knowledge.
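A minimal, framework-free sketch of the Stage 1 freezing scheme. Each dict entry stands in for a parameter tensor's `requires_grad` flag in a real training framework; the module names are hypothetical illustrations, not the model's actual ones.

```python
# Hypothetical parameter groups; "trainable" stands in for requires_grad.
params = {
    "llm.block0.weight":     {"trainable": True},
    "vision_encoder.weight": {"trainable": True},
    "vision_adapter.weight": {"trainable": True},
    "audio_encoder.weight":  {"trainable": True},
    "audio_adapter.weight":  {"trainable": True},
}

def apply_stage1_freeze(params):
    """Stage 1: freeze the LLM; keep encoders and their adapters trainable."""
    for name, p in params.items():
        p["trainable"] = not name.startswith("llm.")
    return params

stage1 = apply_stage1_freeze(params)
trainable = sorted(n for n, p in stage1.items() if p["trainable"])
```

Stage 2 is then simply the inverse step: set every flag back to trainable before joint training.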

Stage 2: Full Multimodal Joint Training

Unfreeze the LLM; all model parameters now participate in training.

Training data: Image-text pairs, audio-video pairs, audio+text, multimodal mixtures, etc.

Stage 3: Long Context Enhancement

Expand context window from 8192 to 32768 tokens.

SFT and RLHF

Supervised Fine-tuning (SFT)

Uses a multi-turn dialogue format similar to ChatGPT's, rendered with the ChatML template.
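A small sketch of ChatML rendering. The `<|im_start|>`/`<|im_end|>` markers are the standard ChatML special tokens; the helper function itself is illustrative, not the model's actual tokenizer code.

```python
def to_chatml(messages):
    """Render a multi-turn dialogue in ChatML: each turn is wrapped as
    <|im_start|>{role}\n{content}<|im_end|>."""
    return "\n".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    )

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Describe this image."},
])
```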

Fine-tuning data scale: Over 1 million dialogue examples.

Reinforcement Learning Tuning

  • DPO (Direct Preference Optimization)
  • GRPO (Group Relative Policy Optimization)
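For concreteness, the DPO objective for a single preference pair can be sketched as follows. Inputs are total sequence log-probabilities under the trained policy and the frozen reference model; `beta` controls how far the policy may drift from the reference.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss -log(sigmoid(beta * margin)) for one preference pair,
    where margin is the policy-vs-reference log-prob gap between the
    chosen and rejected responses."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) == log(1 + exp(-x)); log1p keeps it stable
    return math.log1p(math.exp(-beta * margin))
```

When the policy already prefers the chosen response more than the reference does (positive margin), the loss falls below log 2; a zero margin gives exactly log 2.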

Multimodal Capabilities

Supported Modalities

  • Input: Text, Image, Audio, Video
  • Output: Text + Speech

Visual Understanding

  • Image captioning
  • Image Q&A
  • OCR (optical character recognition)

Speech and Audio

  • ASR (automatic speech recognition)
  • Speech translation
  • Non-speech audio understanding

Video Understanding

Simultaneously processes video frames and corresponding audio.
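One way to picture synchronized video+audio input is merging the two time-stamped streams into a single time-ordered sequence. This is only an illustrative sketch of the alignment idea (the real model aligns modalities via its position embeddings, not a literal merge function).

```python
import heapq

def interleave_by_time(frames, audio_chunks):
    """Merge two pre-sorted time-stamped streams (video frames, audio
    chunks) into one time-ordered sequence, so downstream processing
    sees synchronized multimodal input. Items are (timestamp, payload)."""
    return list(heapq.merge(frames, audio_chunks, key=lambda item: item[0]))

merged = interleave_by_time(
    [(0.0, "frame0"), (0.5, "frame1")],
    [(0.25, "audio0"), (0.75, "audio1")],
)
```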

Real-time Interaction

The Thinker-Talker architecture and chunked processing let the model keep receiving input while it is generating output (streaming interaction).
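The chunked-streaming idea can be sketched as interleaving input consumption with output emission, rather than waiting for the full input. `model_step` here is a hypothetical callable mapping the context seen so far to new output tokens.

```python
def streaming_interaction(input_chunks, model_step):
    """After each input chunk arrives, the model may already emit output
    tokens; input reception and generation are interleaved."""
    context, outputs = [], []
    for chunk in input_chunks:
        context.append(chunk)                # keep receiving input...
        outputs.extend(model_step(context))  # ...while emitting output
    return outputs

# Dummy model: emits one token per step reporting how much context it has seen.
out = streaming_interaction(["c1", "c2", "c3"], lambda ctx: [f"tok{len(ctx)}"])
```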

Long Context Technologies

  • Dual Chunk Attention (DCA)
  • YaRN (RoPE scaling for context extension)
  • TMRoPE (Time-aligned Multimodal RoPE)
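A rough sketch of the YaRN idea on RoPE frequencies: interpolate positions (divide by the context-extension factor) for the low-frequency bands while leaving high-frequency bands untouched. Real YaRN uses a smooth per-band ramp and an attention-temperature correction; the hard midpoint split below is a hypothetical simplification.

```python
def rope_inverse_frequencies(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies for one attention head; larger
    indices correspond to lower frequencies (longer wavelengths)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def yarn_like_scaling(freqs, factor):
    """Simplified YaRN-style scaling: leave the high-frequency half
    unchanged, interpolate (divide) the low-frequency half by the
    context-extension factor."""
    cutoff = len(freqs) // 2  # hypothetical split; real YaRN ramps smoothly
    return [f if i < cutoff else f / factor for i, f in enumerate(freqs)]

scaled = yarn_like_scaling(rope_inverse_frequencies(8), factor=4.0)
```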