VRAM and Deployment
- FP16 precision: ~14GB VRAM
- INT8 quantization: ~7GB; INT4 quantization: <4GB
- Runs on consumer PCs, and when quantized even on high-end phones
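The VRAM figures above follow directly from the parameter count. A minimal sketch of the arithmetic, assuming 7B parameters and counting weight memory only (KV cache and activations add more):

```python
# Rough VRAM estimate for model weights alone (excludes KV cache
# and activation memory). 7e9 parameters is an assumption matching
# the 7B model size discussed in this document.
def weight_vram_gb(num_params: float, bytes_per_param: float) -> float:
    """Return approximate weight memory in gigabytes."""
    return num_params * bytes_per_param / 1e9

params = 7e9
print(f"FP16: {weight_vram_gb(params, 2):.1f} GB")   # 2 bytes/param -> ~14 GB
print(f"INT8: {weight_vram_gb(params, 1):.1f} GB")   # 1 byte/param  -> ~7 GB
print(f"INT4: {weight_vram_gb(params, 0.5):.1f} GB") # 0.5 byte/param -> ~3.5 GB
```

This is why only the INT4 variant fits under 4GB and becomes feasible on phone-class hardware.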
Context Window
- Default: 8192 tokens
- Long-sequence enhancement: 32k tokens
- Experimental Turbo: ~1 million tokens
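Longer context windows cost memory primarily through the KV cache, which grows linearly with sequence length. A back-of-envelope sketch, where the layer count and hidden size are assumptions typical of 7B architectures, not figures from this document:

```python
# Approximate per-sequence KV-cache size in FP16. The default
# n_layers and hidden values are typical of 7B models and are
# assumptions, not specifics from this document.
def kv_cache_gb(seq_len: int, n_layers: int = 32,
                hidden: int = 4096, bytes_per_val: int = 2) -> float:
    # factor of 2: separate key and value tensors at every layer
    return 2 * n_layers * hidden * seq_len * bytes_per_val / 1e9

for ctx in (8_192, 32_768):
    print(f"{ctx:>6} tokens: {kv_cache_gb(ctx):.1f} GB")
```

Under these assumptions a single 32k-token sequence needs roughly 4x the KV-cache memory of an 8k one, which is why million-token contexts remain experimental and rely on further optimizations.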
Inference Optimization
- FlashAttention 2 acceleration (requires Ampere+ GPU)
- Supports bfloat16 or int8/int4 quantization
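To make the int8 option concrete, here is a minimal sketch of symmetric INT8 quantization, the idea behind that mode. Real inference runtimes use fused, per-channel kernels; this illustrative version only shows the round-trip and its error bound:

```python
# Symmetric INT8 quantization sketch: map floats into [-127, 127]
# with a single scale factor, then reconstruct approximately.
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

w = [0.12, -0.57, 0.33, 0.99, -0.04]
q, scale = quantize_int8(w)
restored = dequantize(q, scale)
# Rounding error per weight is bounded by scale / 2
print(max(abs(a - b) for a, b in zip(w, restored)))
```

Each weight shrinks from 2 bytes (FP16) to 1 byte, at the cost of a small, bounded reconstruction error; INT4 pushes the same trade-off further.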
Concurrency Capability
A 7B model can run as multiple instances on a single machine, delivering high QPS at low unit cost, which makes it well suited to enterprise deployment.
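The unit-cost claim can be sketched with simple arithmetic. All numbers below (GPU hourly price, instance count, per-instance throughput) are illustrative assumptions, not figures from this document:

```python
# Back-of-envelope serving cost per 1K generated tokens.
# Inputs are hypothetical: adjust to your own hardware and pricing.
def cost_per_1k_tokens(gpu_hourly_usd: float, instances: int,
                       tokens_per_sec_each: float) -> float:
    tokens_per_hour = instances * tokens_per_sec_each * 3600
    return gpu_hourly_usd / tokens_per_hour * 1000

# e.g. one GPU hosting 3 quantized 7B instances at 40 tok/s each
print(f"${cost_per_1k_tokens(1.2, 3, 40):.4f} per 1K tokens")
```

Packing several quantized instances onto one GPU multiplies aggregate throughput while the fixed hardware cost stays flat, which is the mechanism behind the low unit cost.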
Open Source License
Apache-2.0 license, free for commercial use.
Performance
| Benchmark | Score |
|---|---|
| OmniBench Overall | 56.13% |
| MMLU | 71.0% |
| GSM8K | 88.7% |
| HumanEval | 78.7% |
Cost
Two paths are available: self-hosting the Apache-2.0 open-source weights on your own hardware, or paying per-use for a hosted cloud API.