DeepSeek-V3: Scaling Inference-Time Compute
DeepSeek-AI
Core Insight
A lightweight "lightning indexer" can learn which tokens matter for attention, cutting complexity from O(L²) to O(L·k) while preserving quality. Combined with allocating >10% of pre-training compute to post-training RL, this unlocks frontier-level reasoning in open models.
Why Previous Approaches Failed
Three structural problems held open-source models back from frontier performance:
1. Quadratic Attention Bottleneck
Standard attention computes all L² pairwise interactions between tokens. At 128K context length:
- Dense attention: 128K × 128K ≈ 16 billion pairwise score computations per layer
- This is prohibitively expensive for both inference and RL training
- RL requires many rollouts, and every rollout at 128K context multiplies that cost
2. Underinvestment in Post-Training
Open models typically allocate <1% of pre-training compute to RLHF/post-training. This caps the reasoning ceiling—the model learns facts during pre-training but not how to think through hard problems. DeepSeek found that the reasoning gap to frontier models closed dramatically when they 10x'd post-training investment.
3. Reasoning-Tool Disconnect
Models could either reason or use tools, but combining the two was broken:
- Each tool call discarded the accumulated reasoning context
- The model had to re-reason from scratch after every tool response
- This produced redundant computation and broke reasoning chains
The Method
DeepSeek-V3 addresses each failure mode with specific mechanisms:
1. DeepSeek Sparse Attention (DSA)
Instead of attending to all previous tokens, a small "lightning indexer" network learns to score token relevance in real-time:
The indexer is a tiny attention mechanism (a few heads, FP8 precision) that runs before the main attention. It scores every previous token, then selects the top-k tokens (k = 2048) for the full attention computation.
Complexity reduction:
- Dense: O(L²) → 128K × 128K = 16B ops
- Sparse (DSA): O(L × k) → 128K × 2K = 256M ops
- ~60x reduction in attention compute
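Below is a minimal, illustrative sketch of the DSA selection step for a single decoded token. The shapes, the single-vector indexer, and the score function are simplifications and assumptions; the actual indexer has its own trained parameters, runs in FP8, and is integrated with the model's main attention.

```python
# Minimal sketch of DSA-style top-k sparse attention for one query position.
# Shapes, head counts, and the scoring function are assumptions, not the
# paper's implementation.
import torch
import torch.nn.functional as F

def sparse_attention_step(q, K, V, idx_q, idx_K, k=2048):
    """
    q:     (H, d)      query heads for the current token
    K, V:  (L, H, d)   cached keys/values for all previous tokens
    idx_q: (d_idx,)    indexer query for the current token
    idx_K: (L, d_idx)  indexer keys for all previous tokens
    """
    L = K.shape[0]
    # 1) Lightning indexer: a cheap relevance score for every previous token.
    scores = idx_K @ idx_q                                # (L,)
    # 2) Keep only the top-k most relevant tokens (or all, if L <= k).
    topk = torch.topk(scores, k=min(k, L)).indices        # (k,)
    K_sel, V_sel = K[topk], V[topk]                       # (k, H, d)
    # 3) Full attention, but only over the selected tokens: O(k) per query
    #    instead of O(L).
    out = F.scaled_dot_product_attention(
        q.unsqueeze(1),                                   # (H, 1, d)
        K_sel.transpose(0, 1),                            # (H, k, d)
        V_sel.transpose(0, 1),                            # (H, k, d)
    )                                                     # (H, 1, d)
    return out.squeeze(1), topk
```

Note that the indexer itself still scans every cached token, but its per-token cost (a few low-precision heads) is small next to full multi-head attention over 128K keys, which is where the quoted ~60x saving on the main attention comes from.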
2. Scaled RL Post-Training
They allocate more than 10% of pre-training compute to reinforcement learning—dramatically more than typical open models. Key stabilization techniques:
- Unbiased KL estimates: The standard KL estimator becomes biased when the policy diverges far from the reference. They apply an importance-sampling correction.
- Off-policy masking: For negative-advantage sequences with high policy divergence, mask them from gradient updates to prevent instability.
- Keep Routing: In MoE models, preserve the expert routing paths chosen at inference time when computing training updates (see the sketch after this list); a routing mismatch between rollout and training causes instability.
- Keep Sampling Mask: Maintain identical action spaces between old and new policies.
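To make "Keep Routing" concrete, here is an illustrative sketch (not DeepSeek's code) of an MoE layer that records the expert assignments chosen during rollout and reuses them in the training forward pass, so gradients flow through the same experts the sampled tokens actually used. The module structure, cache format, and gating are assumptions.

```python
# Sketch of "Keep Routing": reuse rollout-time expert assignments during the
# RL training pass instead of re-running top-k routing, so the training-time
# computation matches the policy that generated the sample. Names and the
# cache format are assumptions for illustration.
import torch
import torch.nn as nn

class KeepRoutingMoE(nn.Module):
    def __init__(self, d_model, n_experts, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x, cached_routes=None):
        # x: (T, d_model). cached_routes: (T, top_k) expert ids saved at rollout.
        logits = self.router(x)                               # (T, n_experts)
        if cached_routes is None:
            routes = logits.topk(self.top_k, dim=-1).indices  # rollout: route now
        else:
            routes = cached_routes                            # training: keep rollout's choices
        gates = torch.softmax(logits.gather(-1, routes), dim=-1)  # (T, top_k)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = routes[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot, None] * expert(x[mask])
        return out, routes  # return routes so the rollout loop can cache them
```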
3. Thinking-in-Tool-Use
Context management that retains reasoning traces across tool calls:
- Only discard reasoning when a new user message arrives
- Tool outputs don't trigger context reset
- Full tool call history is always preserved
This lets the model build on previous reasoning after each tool call instead of starting fresh.
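A minimal sketch of this retention rule, assuming a chat-style message list where reasoning is stored in a `reasoning` field on assistant messages (the schema is an assumption, not the paper's API):

```python
# Sketch of the context-retention rule: reasoning traces survive tool calls
# and tool results, and are dropped only when a new user turn starts.
# The message schema (role / reasoning fields) is assumed for illustration.
from typing import Dict, List

def append_message(history: List[Dict], message: Dict) -> List[Dict]:
    if message["role"] == "user":
        # New user message: drop prior reasoning traces, keep everything else
        # (user turns, assistant answers, tool calls, tool outputs).
        history = [
            {k: v for k, v in m.items() if k != "reasoning"} for m in history
        ]
    # Tool outputs ("role": "tool") and assistant turns never trigger a reset,
    # so reasoning accumulated across a chain of tool calls stays visible.
    return history + [message]
```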
Key Equations
The mask $M_{i,t}$ zeros out negative-advantage sequences with high policy divergence. Without this, off-policy samples with large likelihood ratios create unstable gradients that can crash training.
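One plausible form of the mask, assuming a GRPO-style objective with group-normalized advantage $\hat{A}_i$, rollout policy $\pi_{old}$, and a likelihood-ratio threshold $\delta$ (the threshold and the notation beyond $M_{i,t}$ are assumptions):

$$
M_{i,t} =
\begin{cases}
0, & \hat{A}_i < 0 \ \text{and}\ \dfrac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{old}(o_{i,t}\mid q, o_{i,<t})} > \delta,\\[6pt]
1, & \text{otherwise.}
\end{cases}
$$

Each per-token policy-gradient term is multiplied by $M_{i,t}$, so high-ratio, negative-advantage tokens contribute nothing to the update.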
The standard $k_3$ KL estimator becomes biased when $\pi_\theta \ll \pi_{ref}$ (the policy has moved far from the reference). An importance-weighted correction eliminates the bias, enabling stable training even with aggressive policy updates.
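A plausible reconstruction of the estimator and its correction, assuming rollout samples $o$ are drawn from a behavior policy $\pi_{old}$ (the paper's exact form may differ): the common $k_3$ estimator and its importance-weighted counterpart are

$$
k_3(o) = \frac{\pi_{ref}(o)}{\pi_\theta(o)} - 1 - \log\frac{\pi_{ref}(o)}{\pi_\theta(o)},
\qquad
\widehat{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{ref}\right)
= \mathbb{E}_{o \sim \pi_{old}}\!\left[\frac{\pi_\theta(o)}{\pi_{old}(o)}\, k_3(o)\right].
$$

The importance weight $\pi_\theta / \pi_{old}$ restores an expectation under the current policy, which is what keeps the estimate well-behaved when updates are aggressive.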
Results
| Benchmark | DeepSeek-V3 | GPT-5-High | Gemini-3.0-Pro |
|---|---|---|---|
| AIME 2025 | 93.1% | 94.6% | 95.0% |
| HMMT Feb 2025 | 92.5% | 88.3% | 97.5% |
| HLE (text-only) | 25.1% | 26.3% | 37.7% |
| SWE-Verified | 73.1% | 74.9% | 76.2% |
| BrowseComp | 67.6% | 54.9% | 60.2% |
Extended thinking variant (V3-Speciale):
- AIME: 96.0%
- HMMT: 99.2%
- Gold medals in IMO 2025 and IOI 2025
Trade-off: Speciale requires ~2x more tokens than Gemini-3.0-Pro for equivalent performance. The intelligence density per token is lower, but total capability is frontier-level.
What Actually Drives the Gains?
1. RL compute scaling is the key differentiator. Performance correlates strongly with RL budget. Allocating >10% of pre-training cost to post-training is what separates frontier from near-frontier. Most open models do <1%.
2. Synthetic agentic tasks enable transfer. RL on 1,827 synthesized environments + 85K prompts transfers to unseen benchmarks. Critical finding: RL on only code/search environments does NOT transfer well. Diversity matters.
3. Off-policy masking is crucial. Without masking negative-advantage sequences that have high policy divergence, training destabilizes in certain scenarios. The mask prevents unstable gradient updates from corrupting the policy.
4. Keep Routing for MoE. Preserving expert routing paths between inference and training eliminates a major source of instability. If you let routing change during training, experts learn conflicting behaviors.
5. Lightning indexer quality. The indexer must be trained carefully: if it drops important tokens, attention quality degrades. They find FP8 precision with a few heads is the sweet spot: fast enough to preserve the speedup, accurate enough to select the right tokens.
Assumptions & Limitations
Token efficiency gap. V3-Speciale needs significantly more tokens than Gemini-3.0-Pro for equivalent performance on hard problems. The intelligence density per token is lower—it thinks longer to reach the same conclusions.
Knowledge breadth. Fewer total training FLOPs means narrower world knowledge compared to frontier proprietary models. On obscure trivia and specialized domains, the gap is noticeable.
Context management fragility. Agent frameworks that simulate tool calls via user messages (as some code assistants do) break the reasoning-persistence mechanism: the reasoning context gets discarded at the wrong boundaries.
Self-verification loops. The model frequently over-verifies, generating long trajectories that exceed 128K context. This is particularly problematic on MCP benchmarks where it keeps checking its work.
Bottom Line
Open-source models can match GPT-5 on reasoning if you (1) invest heavily in post-training RL (>10% of pre-train compute) and (2) solve the attention efficiency bottleneck. The gap to Gemini-3.0-Pro is token efficiency, not capability ceiling.