VRAM and Deployment

  • FP16 precision: ~14GB VRAM
  • INT8/INT4 quantization: <4GB
  • Can run on regular PCs or even high-end phones
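The VRAM figures above follow directly from bytes-per-parameter arithmetic. A minimal sketch, assuming a 7B parameter count and counting weights only (KV cache and activations add overhead on top):

```python
def model_weight_bytes(n_params: float, bits_per_param: int) -> float:
    """Bytes needed to hold the weights alone (no KV cache / activations)."""
    return n_params * bits_per_param / 8

GIB = 1024 ** 3
n_params = 7e9  # assumed: 7 billion parameters

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: ~{model_weight_bytes(n_params, bits) / GIB:.1f} GiB")
```

At 16 bits per parameter this lands near the ~14GB figure quoted above, and at 4 bits it drops close to the <4GB quantized footprint.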

Context Window

  • Default: 8192 tokens
  • Long-sequence enhancement: 32k tokens
  • Experimental Turbo: ~1 million tokens

Inference Optimization

  • FlashAttention 2 acceleration (requires Ampere+ GPU)
  • Supports bfloat16 or int8/int4 quantization
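To make the int8 option concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, the simplest scheme behind such memory savings (the model's actual quantizer is not specified in this document and likely uses a more sophisticated per-channel or group-wise variant):

```python
def quantize_int8(weights):
    """Symmetric quantization: w ~ scale * q, with q an integer in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid zero scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [scale * v for v in q]

weights = [0.5, -1.27, 0.0, 0.8]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight now costs 1 byte instead of 2 (bfloat16) or 4 (fp32), at the price of a small rounding error bounded by half the scale.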

Concurrency Capability

A single machine can host multiple instances of the 7B model, delivering high QPS at low per-request cost, which makes it well suited for enterprise deployment.
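The instance count per machine is a capacity calculation. A minimal sketch, where the GPU size, per-instance footprint, and headroom are illustrative assumptions rather than figures from this document:

```python
def max_instances(gpu_gib: float, per_instance_gib: float,
                  headroom_gib: float = 2.0) -> int:
    """How many model replicas fit on one GPU, reserving headroom
    for KV cache, activations, and framework overhead."""
    usable = gpu_gib - headroom_gib
    return max(int(usable // per_instance_gib), 0)

# Hypothetical: an 80 GiB GPU hosting ~7 GiB int8-quantized 7B instances
print(max_instances(80, 7))
```

In practice a serving framework with continuous batching often beats many independent replicas, but the arithmetic above is the starting point for sizing.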

Open Source License

Released under the Apache-2.0 license; free for commercial use.

Performance

  • OmniBench (overall): 56.13%
  • MMLU: 71.0%
  • GSM8K: 88.7%
  • HumanEval: 78.7%

Cost

Two options: self-hosting the Apache-2.0 open-source release, or a managed cloud API.