2512.02038 / Survey

Deep Research: A Systematic Survey of RAG-to-Agency Evolution

Zhengliang Shi, Zhaochun Ren

Core Insight

Deep Research is RAG's evolution into autonomous agency: the LLM doesn't just retrieve-then-answer—it plans query decomposition, decides WHEN to retrieve based on confidence, manages working memory across long horizons, and synthesizes verifiable reports with citations.

Why Previous Approaches Failed

Standard RAG architectures have fundamental structural limitations:

1. Fixed Retrieval Pipeline

Retrieve → Generate is a one-shot pipeline. If the first retrieval misses relevant documents, you're stuck. No mechanism for:

  • Iterative refinement of queries
  • Following up on partial information
  • Recognizing retrieval failure and trying different approaches

2. No Query Planning

Complex questions require strategic decomposition. "What were the causes and consequences of the 2008 financial crisis?" needs multiple sub-queries, but RAG systems treat it as a single retrieval.

3. Always-Retrieve Fallacy

Standard RAG retrieves on every query, even when the model already knows the answer. Counterintuitively, irrelevant documents actively mislead LLMs. If you retrieve documents that mention similar terms but describe different concepts, the model gets confused. More retrieval ≠ better answers.

4. Context Window as Memory

Long research tasks (hours/days of work) exceed context limits. No mechanism for:

  • Consolidation: Compressing findings into summaries
  • Indexing: Organizing by topic for fast lookup
  • Forgetting: Discarding outdated/irrelevant information

5. No Verification

Generated answers lack citations or evidence chains. Users can't audit whether claims are supported by sources or hallucinated.

The Method

The paper formalizes Deep Research as a four-component framework:

1. Query Planning

Decompose complex questions into sub-queries using one of three strategies:

  • Parallel: Independent sub-questions executed simultaneously (e.g., "Who are the main characters?" and "What is the setting?")
  • Sequential: Each query depends on previous answers (e.g., "Who is the CEO?" → "What did [CEO name] say about...?")
  • Tree-based: Hierarchical decomposition with branching based on intermediate results

The planning strategy should match question structure—wrong decomposition causes cascading errors.
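
To make the three strategies concrete, here is a minimal Python sketch of their execution semantics. The `run_*` helpers, `TreeNode`, and the `answer` callable are illustrative assumptions, not an API from the paper:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TreeNode:
    """One node in a hierarchical (tree-based) decomposition."""
    query: str
    children: list["TreeNode"] = field(default_factory=list)

def run_parallel(sub_queries: list[str], answer: Callable[[str], str]) -> list[str]:
    # Independent sub-questions: order does not matter, so they could
    # be executed concurrently.
    return [answer(q) for q in sub_queries]

def run_sequential(templates: list[str], answer: Callable[[str], str]) -> list[str]:
    # Each template is filled with the previous answer before execution,
    # e.g. "Who is the CEO?" -> "What did {prev} say about ...?"
    answers, prev = [], ""
    for template in templates:
        prev = answer(template.format(prev=prev))
        answers.append(prev)
    return answers

def run_tree(node: TreeNode, answer: Callable[[str], str]) -> dict:
    # Hierarchical decomposition: answer a node, then recurse into the
    # branches spawned from its intermediate result.
    return {
        "query": node.query,
        "answer": answer(node.query),
        "children": [run_tree(child, answer) for child in node.children],
    }
```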

2. Information Acquisition

Three sub-components:

  • Retrieval tools: Search engines, databases, APIs, code execution
  • Retrieval timing: Adaptive—retrieve based on confidence, not every step
  • Information filtering: Score retrieved documents for relevance, discard noise

Retrieval Timing Decision
$$\text{retrieve}(q) = \begin{cases} \text{True} & \text{if } \text{confidence}(q) < \tau \\ \text{False} & \text{otherwise} \end{cases}$$

The threshold τ is learned or tuned per domain. High-stakes domains (medical, legal) should retrieve more aggressively; general knowledge can rely on the model.
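
A minimal sketch of this gating rule, assuming a hypothetical `llm` callable that returns a draft answer with token log-probabilities and a `search` tool that returns relevance-scored documents; the confidence estimate is a toy proxy, not a method from the paper:

```python
import math

# Per-domain threshold: higher for high-stakes domains (medical, legal),
# which should retrieve more aggressively; lower for general knowledge.
TAU = 0.75

def confidence(token_logprobs: list[float]) -> float:
    # Toy proxy: geometric-mean token probability of the draft answer.
    return math.exp(sum(token_logprobs) / max(len(token_logprobs), 1))

def answer_with_adaptive_retrieval(query: str, llm, search) -> str:
    draft, logprobs = llm(query)                   # draft answer + token log-probs
    if confidence(logprobs) >= TAU:
        return draft                               # confident: skip retrieval
    docs = search(query)                           # uncertain: fetch evidence
    relevant = [d for d in docs if d.score > 0.5]  # filter noise before augmenting
    return llm(query, context=relevant)[0]
```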

3. Memory Management

Handle long-horizon tasks that exceed context limits (a minimal sketch follows the list):

  • Consolidation: Compress retrieved info into summaries. "These 5 documents all say X" → store "X" once.
  • Indexing: Organize by topic/relevance for fast lookup. Don't linear-scan all memory on every query.
  • Updating: Revise stored info when new evidence arrives. Don't keep stale facts.
  • Forgetting: Actively discard outdated/irrelevant information. Memory has a cost.
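
A minimal sketch of these four operations, assuming a flat fact store with a topic index; the class and method names are illustrative, not the paper's:

```python
import time
from collections import defaultdict

class ResearchMemory:
    def __init__(self, max_items: int = 1000):
        self.items = {}                    # fact_id -> (text, timestamp)
        self.index = defaultdict(set)      # topic -> fact_ids, for fast lookup
        self.max_items = max_items         # memory has a cost: enforce a budget

    def consolidate(self, fact_id: str, texts: list[str], topic: str) -> None:
        # "These 5 documents all say X" -> store X once, indexed by topic.
        summary = " / ".join(sorted(set(texts)))
        self.items[fact_id] = (summary, time.time())
        self.index[topic].add(fact_id)

    def lookup(self, topic: str) -> list[str]:
        # Indexed access: no linear scan over all of memory on every query.
        return [self.items[i][0] for i in self.index[topic] if i in self.items]

    def update(self, fact_id: str, new_text: str) -> None:
        # Revise a stored fact when new evidence arrives; refresh its timestamp.
        self.items[fact_id] = (new_text, time.time())

    def forget(self) -> None:
        # Evict the stalest facts once the budget is exceeded; lookup()
        # tolerates the dangling index entries this leaves behind.
        while len(self.items) > self.max_items:
            stalest = min(self.items, key=lambda i: self.items[i][1])
            del self.items[stalest]
```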

4. Answer Generation

Synthesize findings into verifiable outputs (sketched below):

  • Explicit citations pointing to sources
  • Evidence chains showing reasoning steps
  • Confidence scores on claims
  • Acknowledgment of gaps/uncertainties
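
As one way to represent such an output, here is a hypothetical record structure; all field names are assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    source_ids: list[str]                   # explicit citations to sources
    confidence: float                       # 0.0-1.0 score on this claim
    evidence: list[str] = field(default_factory=list)  # reasoning steps

@dataclass
class Report:
    claims: list[Claim]
    open_gaps: list[str]                    # acknowledged uncertainties

    def unsupported(self) -> list[Claim]:
        # Uncited claims get flagged for review rather than silently emitted.
        return [c for c in self.claims if not c.source_ids]
```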

Architecture

[Diagram: Deep Research capability evolution]

Phase 1, Agentic Search: simple fact-finding via search + single-hop QA ("Who wrote Hamlet?", "What is the capital of France?")
Phase 2, Integrated Research: coherent report generation via multi-source synthesis ("Write a survey of X", "Compare approaches to Y")
Phase 3, AI Scientist: hypothesis generation and autonomous validation ("Discover new patterns in X", "Design experiment for Y")

Four-Component Framework:

  • Query Planning: parallel decomposition, sequential chains, tree-based branching. Break a complex question into parts.
  • Info Acquisition: retrieval tools, adaptive timing, noise filtering. WHEN and WHAT to fetch.
  • Memory Mgmt: consolidation, indexing, update/forget. Long-horizon state.
  • Answer Gen: synthesis, citations, verification. Verifiable output.

RAG vs Deep Research:

  • Standard RAG: Query → Retrieve → Generate. Fixed pipeline, no iteration; always retrieves (even when harmful); no memory across sessions.
  • Deep Research: Plan → Acquire → Remember → Generate. Flexible, iterative refinement; retrieves when confidence is low; consolidates, indexes, and forgets.
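
Putting the two pipelines side by side in code, a schematic sketch that composes the hypothetical pieces above (treating `llm` here as a simple text-in/text-out callable):

```python
def standard_rag(question: str, llm, search) -> str:
    # One shot: retrieve, then generate. No iteration, no gating, no memory.
    return llm(question, context=search(question))

def deep_research(question: str, llm, search, memory, max_steps: int = 8) -> str:
    # Plan -> Acquire -> Remember -> Generate, with iterative refinement.
    sub_queries = llm(f"Decompose into sub-queries: {question}").splitlines()
    for sub_q in sub_queries[:max_steps]:
        if memory.lookup(sub_q):          # remember: skip what is already known
            continue
        docs = search(sub_q)              # acquire only what memory lacks
        memory.consolidate(sub_q, [d.text for d in docs], topic=sub_q)
    findings = sum((memory.lookup(q) for q in sub_queries), [])
    return llm(f"Answer with citations: {question}", context=findings)
```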

Key Equations

This is a survey paper—it systematizes existing work rather than proposing new equations. The key conceptual contribution is the four-component framework and capability taxonomy.

Adaptive Retrieval Decision
The retrieval-timing rule given under Information Acquisition, retrieve only when confidence(q) < τ, is the survey's central formalism: fetch external information only when the model is uncertain. Key insight: unnecessary retrieval actively hurts performance by introducing irrelevant context.

Results

This is a survey, not an empirical paper. Key findings synthesized from reviewed literature:

  • Adaptive retrieval timing improves accuracy by 5-15% over always-retrieve baselines
  • Memory consolidation enables tasks 10× longer than context window
  • Tree-based query planning best for complex multi-hop questions
  • Citation generation reduces hallucination rates by 30-50%
  • Forgetting mechanisms prevent memory bloat and conflicting information

Capability Progression

  1. Agentic Search: Fact-finding with tools (current state for most systems)
  2. Integrated Research: Multi-source synthesis with coherent reports (emerging)
  3. AI Scientist: Hypothesis generation and validation (research frontier)

What Actually Matters

Key insights from surveyed papers:

Retrieval noise hurts more than no retrieval. Adding irrelevant documents degrades performance more than not retrieving at all. The model gets confused by plausible-sounding but incorrect information.

Forgetting is necessary. Without selective memory deletion, old information creates conflicts with new evidence. The model can't distinguish between "I learned X yesterday" and "X is still true."

Planning strategy must match question structure. Using parallel decomposition on sequential questions causes errors. Using sequential on parallel wastes time. Automatic strategy selection is an open problem.

Citation quality varies widely. Some systems cite sources without verifying claims actually appear there. Faithful citation requires explicit extraction and verification steps.
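
One way to make that verification step concrete: a crude lexical faithfulness check (a hypothetical sketch; real systems would use an entailment model, but the principle of checking the claim against the cited text is the same):

```python
def citation_is_faithful(claim: str, cited_source: str, min_overlap: float = 0.6) -> bool:
    # Crude check: most of the claim's content words should appear in the
    # cited source; otherwise the citation is flagged for verification.
    claim_words = {w for w in claim.lower().split() if len(w) > 3}
    if not claim_words:
        return True
    source_words = set(cited_source.lower().split())
    return len(claim_words & source_words) / len(claim_words) >= min_overlap
```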

Assumptions & Limitations

Evaluation is immature. No standard benchmarks for full Deep Research systems—most papers evaluate components in isolation. Hard to compare end-to-end system performance.

Compute costs unclear. Long-horizon research tasks with many retrieval/planning steps have unknown scaling properties. Could be prohibitively expensive for complex queries.

Hallucination not solved. Even with citations, models can fabricate sources or misattribute claims. Verification mechanisms help but don't eliminate the problem.

Human-in-loop assumed. Current systems require human verification of final outputs. True autonomy (where you trust the output without checking) remains distant.

Bottom Line

Deep Research extends RAG from retrieve-then-generate to autonomous research agency: query planning, adaptive retrieval, long-horizon memory, and verifiable synthesis. The field is nascent but the four-component architecture is emerging as standard.