ToolOrchestra: Small Orchestrators Beat Giant Models
Hongjin Su, Shizhe Diao, Ximing Lu, NVIDIA
Core Insight
An 8B parameter model trained with multi-objective reinforcement learning (correctness + efficiency + user preference) can orchestrate stronger models and tools to outperform GPT-5 at 30% of the cost. The key insight: the "brain" coordinating the system doesn't need to be the biggest component—it just needs to learn WHEN to call expensive resources.
Why Previous Approaches Failed
Prompting LLMs to orchestrate tools fails in four systematic ways that destroy efficiency:
1. Self-Enhancement Bias
When GPT-5 acts as orchestrator, it delegates to GPT-5-mini 66% of the time, ignoring cheaper alternatives even when explicitly instructed to minimize cost. The model over-trusts its own "family" of models.
2. Other-Enhancement Bias
When Qwen3-8B acts as orchestrator, it defaults to GPT-5 73% of the time—always deferring to the strongest available option regardless of task difficulty. Smaller models assume they can't handle anything.
3. No Cost Signal in Training
Standard tool-use training optimizes only for correctness. The model has no gradient signal to minimize compute, latency, or monetary cost. It learns WHAT tools do, not WHEN to use expensive ones.
4. Preference Blindness
Users have heterogeneous needs:
- Privacy-conscious users prefer local search over web APIs
- Budget-constrained users want cheaper models when possible
- Latency-sensitive users need fast responses over perfect ones
But prompted models can't adapt their orchestration strategy to these preferences—they have no mechanism to weigh user constraints against accuracy.
The Method
ToolOrchestra trains a small language model to be the "brain" of a heterogeneous tool-use system. Five key components:
1. Unified Tool Interface
All tools, from simple utilities to powerful LLMs, are exposed through the same JSON schema (a sketch follows the list below):
- Basic tools: Web search (Tavily API), local search (Faiss + Qwen3-Embedding), code interpreter (Python sandbox)
- Specialized LLMs: Qwen2.5-Math-72B, Qwen2.5-Coder-32B for domain tasks
- Generalist LLMs: GPT-5, Claude Opus 4.1, Llama-70B for complex reasoning
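A minimal sketch of what this unified interface could look like, written as Python dicts in an OpenAI-style function-calling format; the field names (`cost_per_call`, the exact descriptions and prices) are illustrative assumptions, not the paper's exact schema:

```python
# Illustrative only: a cheap utility and an expensive LLM look identical to the
# orchestrator, differing only in their description and price.
web_search = {
    "name": "web_search",
    "description": "Search the web via the Tavily API and return top snippets.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
    "cost_per_call": 0.001,  # hypothetical price-schedule entry
}

gpt5 = {
    "name": "gpt-5",
    "description": "Generalist LLM for complex multi-step reasoning.",
    "parameters": {
        "type": "object",
        "properties": {"prompt": {"type": "string"}},
        "required": ["prompt"],
    },
    "cost_per_call": 0.05,  # hypothetical; prices are varied across training examples
}
```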
Model descriptions are generated automatically:
- Sample 10 training tasks
- Run each model on those tasks
- Ask another LLM to write capability descriptions from the trajectories
This bootstrapping creates accurate tool descriptions without manual annotation.
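A sketch of that bootstrapping loop, assuming hypothetical helpers `run_model` (executes a candidate model on a task and returns its trajectory) and `describer_llm` (writes the capability description):

```python
import random

def bootstrap_description(model, training_tasks, describer_llm, n_tasks=10):
    """Sample tasks, collect the model's trajectories, and ask another LLM
    to write a capability description from what it observes."""
    sampled = random.sample(training_tasks, n_tasks)
    trajectories = [run_model(model, task) for task in sampled]  # hypothetical helper
    prompt = (
        "Based on these solution trajectories, describe what this model "
        f"is good and bad at:\n{trajectories}"
    )
    return describer_llm(prompt)  # hypothetical describer LLM call
```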
2. Multi-Objective Reward Design
The reward function balances four objectives (tool usage, outcome, cost, and latency) by taking the dot product of the user preference vector with a batch-normalized measurement vector:

$$R(\tau) = P \cdot \hat{M}_\tau$$

where the unnormalized measurement vector is:
M_τ = [tool_1_count, ..., tool_n_count, outcome, -cost, -latency]
And P is the user preference vector (0-1 weights for each dimension).
Critical detail: each dimension is normalized within the rollout batch,

$$\hat{M}_\tau[i] = \frac{M_\tau[i] - \mu_{\text{batch}}[i]}{\sigma_{\text{batch}}[i]}$$

so every objective is measured relative to its own spread in the batch. This prevents any single objective from dominating the gradient: a trajectory that is slightly cheaper than its peers earns a reward boost comparable to one that is markedly more accurate, rather than raw dollar costs and 0/1 outcomes competing on incomparable scales.
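A minimal sketch of this reward computation, assuming the batch normalization above is a per-dimension z-score (the paper's exact scheme may differ); `M` stacks the measurement vectors of one rollout batch:

```python
import numpy as np

def batch_rewards(M, P, eps=1e-8):
    """M: (batch, d) measurement vectors [tool counts, outcome, -cost, -latency].
    P: (d,) user preference weights in [0, 1].
    Returns one scalar reward per trajectory in the batch."""
    M = np.asarray(M, dtype=float)
    M_hat = (M - M.mean(axis=0)) / (M.std(axis=0) + eps)  # normalize each dimension within the batch
    return M_hat @ P                                        # preference-weighted dot product
```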
3. Diverse Training Configurations
To prevent overfitting to specific tool setups, each training instance randomizes three things (a sampling sketch follows this list):
- Random tool subsets per training instance—model can't rely on any specific tool always being available
- Varying price schedules across examples—forces adaptation to different cost structures
- Heterogeneous preference vectors for each task—learns to read and follow user constraints
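A sketch of how one such per-instance configuration might be sampled; the subset sizes, price band, and preference ranges are made-up illustrations, not the paper's values:

```python
import random

def sample_training_config(all_tools):
    """One (tools, prices, preference) configuration per training instance."""
    tools = random.sample(all_tools, k=random.randint(3, len(all_tools)))      # random tool subset
    prices = {t["name"]: round(random.uniform(0.001, 0.1), 4) for t in tools}  # varying price schedule
    n_dims = len(tools) + 3                                                    # per-tool counts + outcome, cost, latency
    preference = [round(random.random(), 2) for _ in range(n_dims)]            # heterogeneous preference vector
    return tools, prices, preference
```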
4. ToolScale Dataset
They synthesize 10 domains of verifiable multi-turn tool-use tasks (a sketch of the filtering step follows below):
- LLM generates database schemas and tool APIs for a domain (e.g., flight booking)
- LLM proposes diverse user intents for that domain
- Convert intents to specific tasks with golden action sequences
- Filter: remove tasks where (a) golden actions error, (b) LLMs can't solve in pass@8, (c) solvable without any tools
Evaluation criteria: execution correctness + process fidelity + operation completeness.
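A sketch of the filtering stage, with hypothetical predicates standing in for the three drop conditions:

```python
def filter_tasks(candidate_tasks):
    """Keep only tasks that are verifiable, solvable, and actually require tools."""
    kept = []
    for task in candidate_tasks:
        if golden_actions_error(task):        # (a) golden action sequence fails to execute
            continue
        if not solvable_pass_at_8(task):      # (b) no LLM solves it within 8 attempts
            continue
        if solvable_without_tools(task):      # (c) trivially answerable without any tool
            continue
        kept.append(task)
    return kept
```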
5. Training Stabilization
Three filters applied during training keep RL stable (a sketch follows the list):
- Homogeneity filtering: Skip when batch reward std < 0.1 (weak signal)
- Format consistency: Skip malformed tool calls
- Invalid output filtering: Skip when no valid answer produced
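A sketch of how these filters might be applied to one task's group of rollouts; the 0.1 threshold matches the reward-std cutoff above, while the trajectory field names are illustrative assumptions:

```python
import numpy as np

MIN_REWARD_STD = 0.1

def filter_rollout_group(trajectories, rewards):
    """Return the subset of a task's rollouts that contribute to the policy update."""
    if np.std(rewards) < MIN_REWARD_STD:        # homogeneity: all rewards alike -> weak gradient signal
        return []
    return [
        t for t in trajectories
        if not t.get("malformed_tool_call")     # format consistency: drop broken tool-call syntax
        and t.get("final_answer") is not None   # invalid output: drop rollouts with no answer
    ]
```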
Architecture
*(Interactive diagram: the 8B orchestrator receives a task and user preference vector, issues JSON tool calls to basic tools, specialized LLMs, and generalist LLMs, and composes their outputs into the final answer.)*
Key Equations
$$A(\tau) = \frac{R(\tau) - \operatorname{mean}_{\tau' \in \mathcal{B}} R(\tau')}{\operatorname{std}_{\tau' \in \mathcal{B}} R(\tau')}$$

Each trajectory's advantage is computed against its batch $\mathcal{B}$, not an absolute baseline. This stabilizes training when reward distributions shift across different tool configurations and task domains.
$$J(\theta) = \mathbb{E}_\tau\left[\min\left(\text{ratio}_\theta(\tau)\, A(\tau),\ \operatorname{clip}\left(\text{ratio}_\theta(\tau),\, 1-\epsilon,\, 1+\epsilon\right) A(\tau)\right)\right]$$

Standard PPO clipping prevents catastrophic policy updates. The ratio $\text{ratio}_\theta(\tau) = \pi_\theta(\tau) / \pi_{old}(\tau)$ measures how much the policy has changed.
The dot product between normalized measurements and user preferences creates a single scalar reward. If user sets $P[\text{cost}] = 0$ and $P[\text{accuracy}] = 1$, the model ignores cost entirely. If $P[\text{local\_search}] = 1$, the model is rewarded for using local search (privacy-preserving).
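A tiny worked example with made-up numbers showing how the preference vector reshapes the reward; the `batch_rewards` helper repeats the earlier reward sketch so the snippet runs standalone:

```python
import numpy as np

def batch_rewards(M, P, eps=1e-8):
    """Preference-weighted reward with per-dimension batch normalization."""
    M_hat = (M - M.mean(axis=0)) / (M.std(axis=0) + eps)
    return M_hat @ P

# Measurements for 3 rollouts: [local_search_count, gpt5_count, outcome, -cost, -latency]
M = np.array([
    [2.0, 0.0, 1.0, -0.02, -3.0],   # correct, cheap, uses local search
    [0.0, 3.0, 1.0, -0.40, -9.0],   # correct, but leans on GPT-5
    [1.0, 0.0, 0.0, -0.01, -2.0],   # cheap but wrong
])
accuracy_only  = np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # P[accuracy]=1, cost ignored
privacy_budget = np.array([1.0, 0.0, 1.0, 1.0, 0.0])  # also rewards local search and low cost

print(batch_rewards(M, accuracy_only))   # the two correct rollouts score equally
print(batch_rewards(M, privacy_budget))  # the cheap local-search rollout scores highest
```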
Results
| Benchmark | Orchestrator-8B | GPT-5 (tools) | Claude Opus 4.1 | Qwen3-235B |
|---|---|---|---|---|
| HLE (accuracy) | 37.1% | 35.1% | 34.6% | 32.8% |
| FRAMES | 76.3% | 74.0% | 72.8% | 74.2% |
| τ²-Bench | 80.2% | 77.7% | 76.8% | 75.6% |
| Avg Cost ($) | 0.09 | 0.30 | 0.53 | 0.30 |
| Avg Latency (min) | 8.2 | 19.8 | 25.6 | 21.2 |
The shocking result: an 8B model orchestrating tools beats GPT-5 (a much larger model) using the same tools directly. The 8B orchestrator calls GPT-5 in only ~20% of steps, yet outperforms GPT-5-as-orchestrator, which delegates to GPT-5-mini 66% of the time.
Generalization Results
When tested with completely unseen tools (Claude Opus replacing GPT-5, Codestral replacing Qwen-Coder, etc.):
| Benchmark | Orchestrator-8B | GPT-5 | Qwen3-235B |
|---|---|---|---|
| HLE | 22.0% | 16.4% | 14.7% |
| FRAMES | 73.8% | 54.8% | 63.5% |
| τ²-Bench | 48.8% | 44.8% | 38.7% |
The model generalizes to unseen tools by reading their text descriptions—no retraining needed.
What Actually Drives the Gains?
1. Synthetic data is essential. Training on real tool-use data alone doesn't generalize. The ToolScale dataset (10 domains, LLM-verified trajectories) enables transfer to unseen benchmarks. Without synthetic data, HLE performance drops from 37.1% to ~20%.
2. Multi-objective reward is required. Optimizing only for correctness produces models that always call the strongest tool. Adding cost/latency penalties forces strategic delegation. Without efficiency reward: 37.1% accuracy at $0.30 cost → with efficiency reward: 37.1% accuracy at $0.09 cost (same accuracy, 3x cheaper).
3. Tool configuration diversity matters. Models trained with fixed tool subsets overfit to those tools. Random tool subsets + varying prices during training produces robust generalization to unseen configurations.
4. Preference training transfers. Models trained with preference vectors can follow user instructions about tool choice at test time. Preference score: 46.7 (Orchestrator) vs 34.6 (GPT-5)—users get what they ask for.
5. Training stabilization filters are critical. Without homogeneity filtering (skip when batch reward std < 0.1), training becomes unstable. Without format consistency filtering, the model learns bad tool-calling syntax.
Assumptions & Limitations
Tool descriptions are everything. The orchestrator relies entirely on text descriptions to understand tool capabilities. Poor descriptions = poor tool selection. The auto-generated descriptions work well, but novel tools with unusual capabilities might be misjudged.
Can't handle truly novel tools. Generalization works for tools similar to training distribution. A tool that operates on completely new modalities or requires new reasoning patterns would likely require retraining.
Reward hacking risk. With multi-objective rewards, the model might find degenerate strategies that satisfy metrics without solving tasks. They observe this occasionally and filter it, but the risk scales with task complexity.
Single orchestrator bottleneck. All decisions flow through one 8B model. For extremely complex tasks requiring parallel tool coordination, a hierarchy of orchestrators might be needed.
Evaluation on narrow benchmarks. HLE, FRAMES, and τ²-Bench are specific task types. Performance on open-ended creative tasks or multi-step planning over days/weeks is unknown.
Bottom Line
The orchestration paradigm—small brain coordinating specialized tools—beats monolithic giants. An 8B model outperforms GPT-5 at 30% cost by learning WHEN to call expensive models, not just HOW to use tools. The insight: delegation is a learnable skill, and small models can master it.