Architecture Highlights
Overall Architecture
- Thinker-Talker dual-core architecture
- Unified Transformer decoder that fuses text, image, video, and audio
- TMRoPE (Time-aligned Multimodal RoPE) positional embeddings to synchronize modalities on a shared timeline
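The core idea of TMRoPE can be illustrated with a toy position-assignment function: tokens from different modalities that occur at the same moment receive the same temporal position id. This is only a sketch; the 40 ms bucket size is an illustrative assumption, not the model's actual granularity:

```python
def temporal_position_ids(tokens, resolution=0.04):
    """Assign time-aligned position ids: tokens whose timestamps fall into
    the same time bucket share a position id, regardless of modality.
    tokens: list of (modality, timestamp_sec) pairs."""
    return [(mod, int(ts / resolution)) for mod, ts in tokens]

mixed = [("audio", 0.00), ("video", 0.00), ("audio", 0.04), ("video", 0.04)]
# audio and video tokens at the same timestamp share a position id
print(temporal_position_ids(mixed))
# [('audio', 0), ('video', 0), ('audio', 1), ('video', 1)]
```

Because positions encode real time rather than token order, co-occurring audio samples and video frames stay aligned even when the two streams produce different numbers of tokens per second.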
Thinker Module
- The model's “brain,” built on a Transformer decoder architecture
- Performs deep understanding and reasoning over multimodal inputs and generates the text response
- Audio features extracted via a Whisper-derived encoder
- Images and video processed by a Vision Transformer (ViT) encoder
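A minimal sketch of how the encoder outputs above might be merged into the decoder's input sequence, assuming simple linear projections into a shared embedding dimension (all shapes and projection matrices here are illustrative, not the actual model internals):

```python
import numpy as np

def fuse_inputs(text_emb, audio_feat, image_feat, w_audio, w_image):
    """Project audio/vision encoder features into the decoder embedding
    space and concatenate them with the text embeddings along the
    sequence axis, forming one multimodal token sequence."""
    audio_emb = audio_feat @ w_audio   # (t_audio, d_model)
    image_emb = image_feat @ w_image   # (t_image, d_model)
    return np.concatenate([audio_emb, image_emb, text_emb], axis=0)

rng = np.random.default_rng(0)
seq = fuse_inputs(
    text_emb=rng.normal(size=(5, 16)),    # 5 text tokens, d_model=16
    audio_feat=rng.normal(size=(7, 32)),  # 7 audio frames, encoder dim 32
    image_feat=rng.normal(size=(4, 64)),  # 4 image patches, encoder dim 64
    w_audio=rng.normal(size=(32, 16)),
    w_image=rng.normal(size=(64, 16)),
)
print(seq.shape)  # (16, 16): 7 + 4 + 5 tokens, all projected to d_model=16
```

Once every modality lives in the same embedding space, the decoder attends over the combined sequence like ordinary text tokens.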
Talker Module
- The model's “mouth,” specialized in converting semantic vectors and text into speech output
- Uses a dual-track autoregressive Transformer structure
- Outputs discrete speech units via the qwen-tts-tokenizer
- Supports multi-speaker voice decoupling
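A toy autoregressive loop in the spirit of the Talker, with random weights standing in for the dual-track Transformer. The codebook size, shapes, the additive fusion of the two conditioning tracks, and argmax decoding are all illustrative assumptions:

```python
import numpy as np

def talker_decode(thinker_state, n_units, vocab=1024, seed=0):
    """Sketch of autoregressive speech-unit decoding: each step conditions
    on the Thinker's hidden state plus the previous unit's embedding,
    then emits the most likely discrete speech unit."""
    rng = np.random.default_rng(seed)
    d = thinker_state.shape[-1]
    unit_emb = rng.normal(size=(vocab, d))  # codebook embeddings (assumed)
    w_out = rng.normal(size=(d, vocab))     # output projection (assumed)
    prev = np.zeros(d)                      # start-of-speech embedding
    units = []
    for _ in range(n_units):
        h = thinker_state + prev            # crude fusion of the two tracks
        logits = h @ w_out
        u = int(np.argmax(logits))
        units.append(u)
        prev = unit_emb[u]
    return units

units = talker_decode(np.ones(16), n_units=5)
print(units)  # five discrete unit ids in [0, 1024)
```

In the real system these discrete units would then be converted to a waveform by a codec decoder; the point here is only the unit-by-unit autoregressive structure.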
Training Data
Pre-training Corpus
- Scale: 18 trillion tokens (vs 7 trillion for previous generation)
- Covers 29+ languages
Multimodal Alignment Data
- Image/video tokens: 800 billion
- Audio tokens: 300 billion
- Video-audio mixed tokens: 100 billion
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Audio-video desync, lip sync issues | TMRoPE timestamp inconsistency | Unify sample rate/frame rate |
| High first-packet latency | Streaming chunk too large | Reduce the first chunk size; enable KV cache |
| OOM | Long sequences not chunked | Enable chunking and sliding window; reduce resolution/frame rate |
| Chinese homophone misreading | Insufficient text reference tokens | Increase reference window |
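For the first-packet-latency row, the fix can be sketched as a chunk-size schedule: emit a small first chunk so playback starts early, then switch to larger chunks for throughput, reusing the KV cache across chunks so earlier context is not recomputed. The sizes below are illustrative assumptions:

```python
def chunk_schedule(total_units, first_chunk=4, chunk=32):
    """Split total_units speech units into streaming chunks: a small first
    chunk minimizes first-packet latency; later chunks are larger to
    amortize per-chunk overhead."""
    sizes = [min(first_chunk, total_units)]
    remaining = total_units - sizes[0]
    while remaining > 0:
        sizes.append(min(chunk, remaining))
        remaining -= sizes[-1]
    return sizes

print(chunk_schedule(100))  # [4, 32, 32, 32]
```

Tuning `first_chunk` trades first-packet latency against per-chunk overhead; the KV cache makes the later, larger chunks cheap because only the new units are processed.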