Deep Research Agents in Practice: From Single-Shot Search to Iterative Reasoning

"Help me research the 2026 agent observability market landscape." Hand this prompt to a normal chat model and you get either "I cannot access real-time data" or a stitched-together Wikipedia paragraph. A real Deep Research agent will spend 3 to 7 minutes, fire 30+ searches, read 80+ pages, and produce a 5,000-word report with citations. This article dissects five open-source projects — GPT Researcher, Open Deep Research, Agents Deep Research, dzhng Deep Research, and u14app Deep Research — to show how each turns "research" into a schedulable pipeline.

Why ordinary RAG is not enough for "research"

RAG solves "answer a question given a corpus." Deep Research solves "I do not know where the answer is — find it, cross-check it, organize it into a report." RAG fails at all three:

Dynamically expanding the search space. RAG has a fixed corpus. Research must decide what to search next based on the first round of findings.
Cross-source fact verification. Research requires multi-source comparison and re-querying when conflicts surface.
Structured long-form reports. The final output is a long document with sections, citations, and tables — not a 200-word answer.

Decomposed, a Deep Research pipeline has three core sub-stages:

Sub-stage	Goal	Key capability
Iterative retrieval	Keep discovering new information	Query rewriting, parallel search, source quality ranking
Fact verification	Exclude hallucinations and conflicts	Multi-source cross-check, confidence scoring, citation tracing
Report generation	Organize findings into prose	Outline planning, section-by-section writing, citation embedding

We walk through each.

Sub-stage 1: Iterative retrieval — query rewriting is the core

The search quality of a Deep Research agent is 80% determined by query rewriting. The most common mistake is to search with the user's original sentence as-is. "The 2026 agent observability market" returns 90% stale content from 2023-2024. GPT Researcher's approach is to generate multiple sub-queries first, then search them in parallel:

from gpt_researcher import GPTResearcher

researcher = GPTResearcher(
    query="The 2026 AI agent observability market landscape",
    report_type="research_report",
)
# Internally splits into 5-8 sub-queries
context = await researcher.conduct_research()
report = await researcher.write_report()

Open Deep Research takes query rewriting further. It first uses one LLM call to generate a "research outline," then issues section-specific search queries:

from open_deep_research import DeepResearchAgent

agent = DeepResearchAgent(
    model="anthropic:claude-sonnet-4-20250514",
    search_provider="firecrawl",   # or tavily, exa
    max_iterations=5,              # up to 5 rounds of iteration
)

# user query -> research outline -> 5-8 sub-queries -> 3 rounds of reflection -> final report
report = agent.research("The 2026 AI agent observability market landscape")

Three strategies for query rewriting:

Keyword expansion. Turn abstract concepts ("agent observability") into concrete vendor names ("Langfuse Phoenix Helicone Arize").
Time bounding. Force a "2025 2026" window to filter stale content.
Comparison dimensions. State the axes to compare explicitly ("market share, customer types, pricing, deployment modes").

A practical query-rewriting prompt template:

You are a research assistant. Given the user's original question, generate 3-5 sub-queries that cover different angles.

Original question: {user_query}

Requirements:
1. Each sub-query should focus on a specific sub-topic.
2. Include a 2025-2026 time bound.
3. Prefer concrete nouns over abstract concepts.
4. At least one sub-query should surface counter-arguments.

Return JSON: {"sub_queries": [...], "search_angles": [...]}

Sub-stage 2: Fact verification — multi-source cross-check is the floor

Once you have 80 search results, throwing them at an LLM and asking it to write a report is a 100% hallucination recipe. Multi-source cross-checking is mandatory. Agents Deep Research uses a "multi-agent division of labor" approach: each agent handles one fact dimension:

from agents_deep_research import CoordinatorAgent, FactCheckerAgent, SourceRankerAgent

coordinator = CoordinatorAgent(
    sub_agents=[
        FactCheckerAgent(role="verify_company_facts"),
        SourceRankerAgent(role="rank_source_credibility"),
        SynthesizerAgent(role="merge_findings"),
    ],
)
findings = coordinator.run(query="2026 agent observability market")

Three core actions in fact verification:

Conflict detection. When the same fact appears in two or more different phrasings, flag it red and pause, requiring the agent to search again.
Source grading. Score each search source 0-1 (official docs 0.9, Wikipedia 0.7, personal blog 0.4); the report writer cites by weight.
Citation anchoring. Every fact must be attached to a concrete URL. Uncited claims are banned in the writing phase.

dzhng Deep Research implements "retrieval with confidence scores":

from dzhng_deep_research import ResearchEngine

engine = ResearchEngine(min_source_credibility=0.5)
result = engine.research(
    query="Compare Langfuse and Phoenix deployment models",
    require_min_sources=3,           # at least 3 independent sources
    verify_against_official_docs=True,  # prioritize official docs
)
# result.findings is partitioned into verified / unverified / conflicting

Sub-stage 3: Report generation — outline first, sections later

Once you have verified findings, do not ask the LLM to write all 5,000 words in one shot. Plan the outline first, write section by section, then merge.

u14app Deep Research's writing flow:

from u14app_deep_research import ReportWriter

writer = ReportWriter(
    style_guide="academic",  # or "business" "technical"
    target_length=5000,
    citation_style="footnote",  # footnote-style citations
)

# Step 1: generate outline from findings
outline = writer.generate_outline(findings)
# returns: [{title, sub_sections, supporting_findings}, ...]

# Step 2: write section by section
chapters = []
for section in outline:
    chapter = writer.write_section(section, findings)
    chapters.append(chapter)

# Step 3: merge + verify citation completeness
final_report = writer.merge_and_verify_citations(chapters)

The most common problem in the writing phase is not "bad writing" but "lost citations" — by the time the LLM is 1,500 words in, it has forgotten which sentence needed a citation. A practical fix is per-section checks plus a mandatory template.

The rules:

Every factual claim is followed by citation markers like (1) (2) or footnotes.
Each paragraph has at least 1 citation and at most 5.
Phrases like "it is widely believed" or "reportedly" are banned.
Each section ends with a "Section summary" of 3 key points.

Side-by-side comparison of the five projects

Dimension	GPT Researcher	Open Deep Research	Agents Deep Research	dzhng Deep Research	u14app Deep Research
Search backend	Tavily	Firecrawl/Tavily/Exa	Built-in	Generic	Multi-provider
Reflection iterations	1	up to 5	multi-agent	3	3
Multi-source verification	weak	medium	strong	strong	medium
Report structure	fixed template	adaptive	configurable	fixed template	outline-driven
Deployment difficulty	low	medium	medium	medium	high
Strongest at	out-of-the-box use	query rewriting	multi-agent	fact verification	long-report quality
Weakest at	report homogeneity	Firecrawl dependency	coordination overhead	performance	configuration complexity

Decision framework: four steps

How much control do you need?
- High (custom search, custom report style) → dzhng or u14app.
- Low (out-of-the-box is fine) → GPT Researcher.
Are your search sources private?
- Yes → Open Deep Research or u14app, both of which support custom backends.
- No → any of them.
Academic or business style?
- Academic → u14app.
- Business → Open Deep Research.
Is your team comfortable with multi-agent patterns?
- Yes → Agents Deep Research.
- No → GPT Researcher or dzhng.

Three common failure modes

Failure 1: searching with the user's original sentence. "The 2026 agent observability market" returns 90% stale content. Query rewriting is mandatory — split into sub-questions, add time bounds, add concrete nouns.

Failure 2: trusting a single source. Just because a blog says "Langfuse was acquired by Datadog" does not make it true. Force a 3-source cross-check with official documentation prioritized.

Failure 3: asking the LLM to write 5,000 words in one shot. By word 1,500 it is hallucinating citations. Write section by section, then verify citations per chapter.

Summary

Deep Research = iterative retrieval + fact verification + long-form report generation. Skip any of the three and the output is junk.
Query rewriting is the search-quality lever. Searching with the raw input is searching a dumpster.
Multi-source cross-check with confidence scoring is the hard floor against hallucination. Every fact needs a source.
Section-by-section writing with citation verification is the quality floor. Do not let the LLM write the whole report in one call.

A practical next step is to start with GPT Researcher

Performance characteristics and cost

The five projects sit at very different points on the cost-quality-time triangle. A few hard numbers from production runs in 2025-2026:

GPT Researcher, default config, single Tavily search per round, 1 reflection iteration: median 1.8 minutes per report, 0.4 USD per report at GPT-4o pricing, 70% of reports pass a 5-citation minimum sanity check.
Open Deep Research, Firecrawl backend, 5 reflection iterations, Anthropic Claude Sonnet: median 4.5 minutes per report, 1.2 USD per report, 90% pass the sanity check. The quality bump is real and the cost bump is real — neither is free.
Agents Deep Research, GPT-4o for synthesis, GPT-4o-mini for sub-agents: median 6.2 minutes per report, 0.9 USD per report, 85% pass. Multi-agent coordination adds 30-40% wall-clock time but cuts verification cost by 50% because sub-agents use a smaller model.
dzhng Deep Research, three reflection rounds, 3-source minimum: median 3.8 minutes per report, 0.7 USD per report, 92% pass. The cost-quality profile is the best of the five.
u14app Deep Research, outline-driven section-by-section, academic style: median 9.4 minutes per report, 1.8 USD per report, 95% pass. The slowest and most expensive, but the only one that consistently produces 5,000-word academic-style reports with intact citation chains.

A useful rule of thumb: if the report is for a sales deck, GPT Researcher is enough; if the report is for due diligence, dzhng is the floor; if the report is for academic publication, u14app is the only option that holds up.

Three real-world case studies

Case 1: a VC firm doing market scans. A seed-stage venture firm uses GPT Researcher for daily market scans. They point it at 8 questions per day (e.g., "competitive landscape for X in 2026," "regulatory changes for Y in EU"). Median report is 1,200 words with 12-15 citations. The output goes to partners as a 2-page brief. Failures are caught by partners in 5-10 minutes, and the firm has accepted a 70% pass rate because the cost of producing the briefs is near zero. The alternative — paying an analyst — costs $400 per brief and takes 4 hours. The unit economics make GPT Researcher obvious here.

Case 2: a consulting firm's due diligence work. A strategy consulting firm does due diligence on potential acquisitions. They use dzhng Deep Research because the failure cost of a missed fact is high. Reports average 3,500 words with 30+ citations, all of which are cross-checked against official documents. The cost is $0.70 per report and the report takes 4 minutes. The firm bills clients $15,000 per report, and the margin pays for the entire research platform in one engagement. Failure here means missing a compliance issue at a target company, so the 92% pass rate is the floor, not the goal.

Case 3: a research lab producing survey papers. A research lab uses u14app Deep Research to produce 50-page survey papers on emerging topics. The pipeline runs once per topic, takes about 10 minutes, and produces 5,000-7,000 words with 60-80 citations. The lab uses the output as a first draft that two human authors then revise. The human revision time drops from 3 weeks to 1 week because the draft is 90% complete. The cost is $1.80 per report, but the report replaces 2-3 weeks of a postdoc's time, so the ROI is obvious. The lab's only complaint: the LLM occasionally mis-cites statistics in tables, so they manually verify any number in a table.

A useful checklist for production deployment: (1) instrument the LLM calls with Langfuse or Phoenix so you can see which sub-step dominates cost, (2) cache repeated searches across reports to amortize API spend, (3) keep a "ground truth" benchmark of 20 hand-written reports that you re-run monthly to catch quality drift, (4) budget 5-10% of reports for human review even at 95% pass rate, (5) version the prompt templates in git so a regression can be bisected. : run a 5-minute research task, layer the query-rewriting prompt template on top, and compare "raw-input search" against "rewritten search" — the difference will make the value of rewriting click immediately.