Architecture Highlights

Overall Architecture

  • Thinker-Talker dual-core architecture
  • A unified Transformer decoder fuses text, image, video, and audio
  • TMRoPE (Time-aligned Multimodal RoPE) positional embeddings
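The core idea behind TMRoPE can be illustrated with a small sketch: the temporal component of each token's rotary position is derived from its wall-clock timestamp, so audio tokens and video-frame tokens that cover the same moment share a temporal position. The function name and the 40 ms granularity below are illustrative assumptions, not the model's actual implementation.

```python
# Minimal sketch of the time-alignment idea in TMRoPE.
# Assumption: one temporal position id per 40 ms of media time.

TIME_STEP_S = 0.04  # assumed temporal granularity (40 ms per position id)

def temporal_pos(t_seconds: float) -> int:
    """Map a media timestamp to a discrete temporal position id."""
    return int(round(t_seconds / TIME_STEP_S))

# An audio token at t = 1.0 s and a video-frame token at t = 1.0 s
# land on the same temporal position, which is what keeps lips and
# speech aligned on one shared time axis:
audio_pos = temporal_pos(1.0)
video_pos = temporal_pos(1.0)
assert audio_pos == video_pos
```

If the audio sample rate and video frame rate are converted to timestamps inconsistently, the two modalities drift apart on this axis, which is exactly the desync symptom listed in the error table below.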

Thinker Module

  • The model’s “brain,” based on a Transformer decoder architecture
  • Performs deep understanding and reasoning over multimodal inputs and generates text
  • Audio features extracted by a Whisper-derived encoder
  • Images/video processed by a Vision Transformer (ViT) encoder
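The data flow described above can be sketched as follows: modality-specific encoders project their inputs into the decoder's embedding space, and a single decoder then reasons over the fused sequence. Every name here is illustrative (embeddings are plain lists, the decoder is a callable), not the real API.

```python
# Hedged sketch of the Thinker forward pass: fuse per-modality features
# into one sequence, then let a single decoder generate text from it.
from typing import Callable, List, Optional

def thinker_forward(
    text_emb: List[float],
    audio_emb: Optional[List[float]],   # from a Whisper-style audio encoder
    video_emb: Optional[List[float]],   # from a ViT image/video encoder
    decoder: Callable[[List[float]], str],
) -> str:
    fused = list(text_emb)              # start from text token embeddings
    if audio_emb is not None:
        fused += audio_emb              # append audio features if present
    if video_emb is not None:
        fused += video_emb              # append vision features if present
    return decoder(fused)               # one decoder attends over everything
```

The point of the sketch is the single fusion sequence: there is no separate cross-modal module, just one decoder attending over all modalities at once.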

Talker Module

  • The model’s “mouth,” specialized in converting semantic vectors and text into speech
  • Uses a dual-track autoregressive Transformer structure
  • Emits discrete speech units via qwen-tts-tokenizer
  • Supports multi-speaker voice decoupling
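The dual-track idea can be sketched as two aligned input streams: at each step the Talker conditions on both the Thinker's semantic vector and the corresponding text token, plus the speech units it has already emitted, and produces one more discrete unit. The toy `step_fn` and the alignment of the two tracks are assumptions for illustration; this is not qwen-tts-tokenizer's interface.

```python
# Illustrative sketch of dual-track autoregressive speech-unit generation.
from typing import Callable, List

def talker_generate(
    semantic_vecs: List[int],   # track 1: Thinker's semantic representations
    text_tokens: List[int],     # track 2: the text being spoken
    step_fn: Callable[[int, int, List[int]], int],
    eos_unit: int = -1,         # assumed end-of-speech sentinel
) -> List[int]:
    units: List[int] = []
    for sem, tok in zip(semantic_vecs, text_tokens):  # two aligned tracks
        unit = step_fn(sem, tok, units)  # autoregressive: sees past units
        if unit == eos_unit:
            break
        units.append(unit)
    return units  # discrete speech units, later vocoded to a waveform
```

Because the output is a stream of discrete units rather than a waveform, generation can begin before the full text is available, which is what enables streaming speech output.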

Training Data

Pre-training Corpus

  • Scale: 18 trillion tokens (vs 7 trillion for previous generation)
  • Covers 29+ languages

Multimodal Alignment Data

  • Image/video tokens: 800 billion
  • Audio tokens: 300 billion
  • Video-audio mixed tokens: 100 billion

Error Quick Reference

  Symptom                              | Root Cause                         | Fix
  -------------------------------------|------------------------------------|------------------------------------------------------------------
  Audio-video desync, lip-sync issues  | TMRoPE timestamp inconsistency     | Unify sample rate and frame rate
  High first-packet latency            | Streaming chunk too large          | Reduce first-chunk size; enable KV cache
  OOM (out of memory)                  | Long sequences not chunked         | Enable chunking and sliding window; reduce resolution/frame rate
  Chinese homophone misreading         | Insufficient text reference tokens | Increase the reference window
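The chunking and sliding-window mitigation for OOM can be sketched as follows: process a long token sequence in fixed-size chunks, retaining only the most recent `window` tokens as context, so peak memory stays bounded regardless of total sequence length. The parameter values and `step_fn` interface are illustrative assumptions.

```python
# Hedged sketch of chunked processing with a sliding context window.
from typing import Callable, List

def chunked_process(
    tokens: List[int],
    step_fn: Callable[[List[int], List[int]], int],
    chunk: int = 8,    # assumed chunk size per step
    window: int = 16,  # assumed sliding-window context length
) -> List[int]:
    context: List[int] = []
    outputs: List[int] = []
    for i in range(0, len(tokens), chunk):
        piece = tokens[i:i + chunk]
        outputs.append(step_fn(context, piece))     # see bounded context only
        context = (context + piece)[-window:]       # sliding window caps memory
    return outputs
```

Since each step touches at most `window + chunk` tokens, memory use no longer grows with the input; the trade-off is that dependencies older than the window are dropped.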