ToolOrchestra: Small Orchestrators Beat Giant Models
Hongjin Su, Shizhe Diao, Ximing Lu, NVIDIA
Core Insight
An 8B parameter model trained with multi-objective reinforcement learning (correctness + efficiency + user preference) can orchestrate stronger models and tools to outperform GPT-5 at 30% of the cost. The key insight: the "brain" coordinating the system doesn't need to be the biggest component—it just needs to learn WHEN to call expensive resources.
Why Previous Approaches Failed
Prompting LLMs to orchestrate tools fails in four systematic ways that destroy efficiency:
1. Self-Enhancement Bias
When GPT-5 acts as orchestrator, it delegates to GPT-5-mini 66% of the time, ignoring cheaper alternatives even when explicitly instructed to minimize cost. The model over-trusts its own "family" of models.
2. Other-Enhancement Bias
When Qwen3-8B acts as orchestrator, it defaults to GPT-5 73% of the time—always deferring to the strongest available option regardless of task difficulty. Smaller models assume they can't handle anything.
3. No Cost Signal in Training
Standard tool-use training optimizes only for correctness. The model has no gradient signal to minimize compute, latency, or monetary cost. It learns WHAT tools do, not WHEN to use expensive ones.
4. Preference Blindness
Users have heterogeneous needs:
- Privacy-conscious users prefer local search over web APIs
- Budget-constrained users want cheaper models when possible
- Latency-sensitive users need fast responses over perfect ones
But prompted models can't adapt their orchestration strategy to these preferences—they have no mechanism to weigh user constraints against accuracy.
The Method
ToolOrchestra trains a small language model to be the "brain" of a heterogeneous tool-use system. Five key components:
1. Unified Tool Interface
All tools, from simple utilities to powerful LLMs, are exposed through the same JSON schema (a sketch follows the list below):
- Basic tools: Web search (Tavily API), local search (Faiss + Qwen3-Embedding), code interpreter (Python sandbox)
- Specialized LLMs: Qwen2.5-Math-72B, Qwen2.5-Coder-32B for domain tasks
- Generalist LLMs: GPT-5, Claude Opus 4.1, Llama-70B for complex reasoning
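A minimal sketch of what this unified interface could look like, written as Python dicts in an OpenAI-style function-calling format; the field names (`cost_per_call`, the exact descriptions and prices) are illustrative assumptions, not the paper's exact schema:

```python
# Illustrative only: a cheap utility and an expensive LLM look identical to the
# orchestrator, differing only in their description and price.
web_search = {
    "name": "web_search",
    "description": "Search the web via the Tavily API and return top snippets.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
    "cost_per_call": 0.001,  # hypothetical price-schedule entry
}

gpt5 = {
    "name": "gpt-5",
    "description": "Generalist LLM for complex multi-step reasoning.",
    "parameters": {
        "type": "object",
        "properties": {"prompt": {"type": "string"}},
        "required": ["prompt"],
    },
    "cost_per_call": 0.05,  # hypothetical; prices are varied across training examples
}
```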
Model descriptions are generated automatically:
- Sample 10 training tasks
- Run each model on those tasks
- Ask another LLM to write capability descriptions from the trajectories
This bootstrapping creates accurate tool descriptions without manual annotation.
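A sketch of that bootstrapping loop, assuming hypothetical helpers `run_model` (executes a candidate model on a task and returns its trajectory) and `describer_llm` (writes the capability description):

```python
import random

def bootstrap_description(model, training_tasks, describer_llm, n_tasks=10):
    """Sample tasks, collect the model's trajectories, and ask another LLM
    to write a capability description from what it observes."""
    sampled = random.sample(training_tasks, n_tasks)
    trajectories = [run_model(model, task) for task in sampled]  # hypothetical helper
    prompt = (
        "Based on these solution trajectories, describe what this model "
        f"is good and bad at:\n{trajectories}"
    )
    return describer_llm(prompt)  # hypothetical describer LLM call
```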
2. Multi-Objective Reward Design
The reward function balances four objectives (tool usage, outcome, cost, and latency) by taking the dot product of the user preference vector with a batch-normalized measurement vector:

$$R(\tau) = P \cdot \hat{M}_\tau$$

where the unnormalized measurement vector is:
M_τ = [tool_1_count, ..., tool_n_count, outcome, -cost, -latency]
And P is the user preference vector (0-1 weights for each dimension).
Critical detail: each dimension is normalized within the rollout batch,

$$\hat{M}_\tau[i] = \frac{M_\tau[i] - \mu_{\text{batch}}[i]}{\sigma_{\text{batch}}[i]}$$

so every objective is measured relative to its own spread in the batch. This prevents any single objective from dominating the gradient: a trajectory that is slightly cheaper than its peers earns a reward boost comparable to one that is markedly more accurate, rather than raw dollar costs and 0/1 outcomes competing on incomparable scales.
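A minimal sketch of this reward computation, assuming the batch normalization above is a per-dimension z-score (the paper's exact scheme may differ); `M` stacks the measurement vectors of one rollout batch:

```python
import numpy as np

def batch_rewards(M, P, eps=1e-8):
    """M: (batch, d) measurement vectors [tool counts, outcome, -cost, -latency].
    P: (d,) user preference weights in [0, 1].
    Returns one scalar reward per trajectory in the batch."""
    M = np.asarray(M, dtype=float)
    M_hat = (M - M.mean(axis=0)) / (M.std(axis=0) + eps)  # normalize each dimension within the batch
    return M_hat @ P                                        # preference-weighted dot product
```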
3. Diverse Training Configurations
To prevent overfitting to specific tool setups, each training instance randomizes three things (a sampling sketch follows this list):
- Random tool subsets per training instance—model can't rely on any specific tool always being available
- Varying price schedules across examples—forces adaptation to different cost structures
- Heterogeneous preference vectors for each task—learns to read and follow user constraints
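A sketch of how one such per-instance configuration might be sampled; the subset sizes, price band, and preference ranges are made-up illustrations, not the paper's values:

```python
import random

def sample_training_config(all_tools):
    """One (tools, prices, preference) configuration per training instance."""
    tools = random.sample(all_tools, k=random.randint(3, len(all_tools)))      # random tool subset
    prices = {t["name"]: round(random.uniform(0.001, 0.1), 4) for t in tools}  # varying price schedule
    n_dims = len(tools) + 3                                                    # per-tool counts + outcome, cost, latency
    preference = [round(random.random(), 2) for _ in range(n_dims)]            # heterogeneous preference vector
    return tools, prices, preference
```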
4. ToolScale Dataset
They synthesize 10 domains of verifiable multi-turn tool-use tasks (a sketch of the filtering step follows below):
- LLM generates database schemas and tool APIs for a domain (e.g., flight booking)
- LLM proposes diverse user intents for that domain
- Convert intents to specific tasks with golden action sequences
- Filter: remove tasks where (a) golden actions error, (b) LLMs can't solve in pass@8, (c) solvable without any tools
Evaluation criteria: execution correctness + process fidelity + operation completeness.
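A sketch of the filtering stage, with hypothetical predicates standing in for the three drop conditions:

```python
def filter_tasks(candidate_tasks):
    """Keep only tasks that are verifiable, solvable, and actually require tools."""
    kept = []
    for task in candidate_tasks:
        if golden_actions_error(task):        # (a) golden action sequence fails to execute
            continue
        if not solvable_pass_at_8(task):      # (b) no LLM solves it within 8 attempts
            continue
        if solvable_without_tools(task):      # (c) trivially answerable without any tool
            continue
        kept.append(task)
    return kept
```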
5. Training Stabilization
Three filters applied during training keep RL stable (a sketch follows the list):
- Homogeneity filtering: Skip when batch reward std < 0.1 (weak signal)
- Format consistency: Skip malformed tool calls
- Invalid output filtering: Skip when no valid answer produced
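A sketch of how these filters might be applied to one task's group of rollouts; the 0.1 threshold matches the reward-std cutoff above, while the trajectory field names are illustrative assumptions:

```python
import numpy as np

MIN_REWARD_STD = 0.1

def filter_rollout_group(trajectories, rewards):
    """Return the subset of a task's rollouts that contribute to the policy update."""
    if np.std(rewards) < MIN_REWARD_STD:        # homogeneity: all rewards alike -> weak gradient signal
        return []
    return [
        t for t in trajectories
        if not t.get("malformed_tool_call")     # format consistency: drop broken tool-call syntax
        and t.get("final_answer") is not None   # invalid output: drop rollouts with no answer
    ]
```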
Architecture
*(Interactive diagram: the 8B orchestrator receives a task and user preference vector, issues JSON tool calls to basic tools, specialized LLMs, and generalist LLMs, and composes their outputs into the final answer.)*
Key Equations
$$A(\tau) = \frac{R(\tau) - \operatorname{mean}_{\tau' \in \mathcal{B}} R(\tau')}{\operatorname{std}_{\tau' \in \mathcal{B}} R(\tau')}$$

Each trajectory's advantage is computed against its batch $\mathcal{B}$, not an absolute baseline. This stabilizes training when reward distributions shift across different tool configurations and task domains.
$$J(\theta) = \mathbb{E}_\tau\left[\min\left(\text{ratio}_\theta(\tau)\, A(\tau),\ \operatorname{clip}\left(\text{ratio}_\theta(\tau),\, 1-\epsilon,\, 1+\epsilon\right) A(\tau)\right)\right]$$

Standard PPO clipping prevents catastrophic policy updates. The ratio $\text{ratio}_\theta(\tau) = \pi_\theta(\tau) / \pi_{old}(\tau)$ measures how much the policy has changed.
The dot product between normalized measurements and user preferences creates a single scalar reward. If user sets $P[\text{cost}] = 0$ and $P[\text{accuracy}] = 1$, the model ignores cost entirely. If $P[\text{local\_search}] = 1$, the model is rewarded for using local search (privacy-preserving).
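A tiny worked example with made-up numbers showing how the preference vector reshapes the reward; the `batch_rewards` helper repeats the earlier reward sketch so the snippet runs standalone:

```python
import numpy as np

def batch_rewards(M, P, eps=1e-8):
    """Preference-weighted reward with per-dimension batch normalization."""
    M_hat = (M - M.mean(axis=0)) / (M.std(axis=0) + eps)
    return M_hat @ P

# Measurements for 3 rollouts: [local_search_count, gpt5_count, outcome, -cost, -latency]
M = np.array([
    [2.0, 0.0, 1.0, -0.02, -3.0],   # correct, cheap, uses local search
    [0.0, 3.0, 1.0, -0.40, -9.0],   # correct, but leans on GPT-5
    [1.0, 0.0, 0.0, -0.01, -2.0],   # cheap but wrong
])
accuracy_only  = np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # P[accuracy]=1, cost ignored
privacy_budget = np.array([1.0, 0.0, 1.0, 1.0, 0.0])  # also rewards local search and low cost

print(batch_rewards(M, accuracy_only))   # the two correct rollouts score equally
print(batch_rewards(M, privacy_budget))  # the cheap local-search rollout scores highest
```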
Results
| Benchmark | Orchestrator-8B | GPT-5 (tools) | Claude Opus 4.1 | Qwen3-235B |
|---|---|---|---|---|
| HLE (accuracy) | 37.1% | 35.1% | 34.6% | 32.8% |
| FRAMES | 76.3% | 74.0% | 72.8% | 74.2% |
| τ²-Bench | 80.2% | 77.7% | 76.8% | 75.6% |
| Avg Cost ($) | 0.09 | 0.30 | 0.53 | 0.30 |
| Avg Latency (min) | 8.2 | 19.8 | 25.6 | 21.2 |
The shocking result: an 8B model orchestrating tools beats GPT-5 (a much larger model) using the same tools directly. The 8B orchestrator calls GPT-5 in only ~20% of steps, yet outperforms GPT-5-as-orchestrator, which delegates to GPT-5-mini 66% of the time.
Generalization Results
When tested with completely unseen tools (Claude Opus replacing GPT-5, Codestral replacing Qwen-Coder, etc.):
| Benchmark | Orchestrator-8B | GPT-5 | Qwen3-235B |
|---|---|---|---|
| HLE | 22.0% | 16.4% | 14.7% |
| FRAMES | 73.8% | 54.8% | 63.5% |
| τ²-Bench | 48.8% | 44.8% | 38.7% |
The model generalizes to unseen tools by reading their text descriptions—no retraining needed.
What Actually Drives the Gains?
1. Synthetic data is essential. Training on real tool-use data alone doesn't generalize. The ToolScale dataset (10 domains, LLM-verified trajectories) enables transfer to unseen benchmarks. Without synthetic data, HLE performance drops from 37.1% to ~20%.
2. Multi-objective reward is required. Optimizing only for correctness produces models that always call the strongest tool. Adding cost/latency penalties forces strategic delegation. Without efficiency reward: 37.1% accuracy at $0.30 cost → with efficiency reward: 37.1% accuracy at $0.09 cost (same accuracy, 3x cheaper).
3. Tool configuration diversity matters. Models trained with fixed tool subsets overfit to those tools. Random tool subsets + varying prices during training produces robust generalization to unseen configurations.
4. Preference training transfers. Models trained with preference vectors can follow user instructions about tool choice at test time. Preference score: 46.7 (Orchestrator) vs 34.6 (GPT-5)—users get what they ask for.
5. Training stabilization filters are critical. Without homogeneity filtering (skip when batch reward std < 0.1), training becomes unstable. Without format consistency filtering, the model learns bad tool-calling syntax.
Assumptions & Limitations
Tool descriptions are everything. The orchestrator relies entirely on text descriptions to understand tool capabilities. Poor descriptions = poor tool selection. The auto-generated descriptions work well, but novel tools with unusual capabilities might be misjudged.
Can't handle truly novel tools. Generalization works for tools similar to training distribution. A tool that operates on completely new modalities or requires new reasoning patterns would likely require retraining.
Reward hacking risk. With multi-objective rewards, the model might find degenerate strategies that satisfy metrics without solving tasks. They observe this occasionally and filter it, but the risk scales with task complexity.
Single orchestrator bottleneck. All decisions flow through one 8B model. For extremely complex tasks requiring parallel tool coordination, a hierarchy of orchestrators might be needed.
Evaluation on narrow benchmarks. HLE, FRAMES, and τ²-Bench are specific task types. Performance on open-ended creative tasks or multi-step planning over days/weeks is unknown.
Bottom Line
The orchestration paradigm—small brain coordinating specialized tools—beats monolithic giants. An 8B model outperforms GPT-5 at 30% cost by learning WHEN to call expensive models, not just HOW to use tools. The insight: delegation is a learnable skill, and small models can master it.