A self-improving agent pipeline for general-purpose reasoning tasks
AI Office GAIA is a multi-agent reasoning pipeline designed around a specific hypothesis: that a smaller number of well-structured agents with a functioning self-improvement loop will outperform a larger chain of agents that cannot learn from its own failures. The pipeline runs Claude Sonnet as the primary reasoning agent, with Claude Haiku handling planning and Overseer functions. Every attempt produces a structured trace. The Overseer — split into Diagnosis and Abstractor roles — reads those traces, identifies failure patterns, and generates gap records that inform subsequent attempts. The system gets demonstrably better with each run cycle.
The core distinction
Most agent pipelines are static: given a task, run a sequence of steps, return an answer. If the answer is wrong, nothing changes for the next attempt. AI Office is built on a different premise.
A fixed pipeline runs each task independently. Failures are logged at best. No mechanism exists to identify why a class of question fails, or to adjust the pipeline accordingly. Each run starts from the same baseline.
In AI Office, every attempt produces a structured trace. The Overseer reads traces across attempts, identifies failure patterns, and generates gap records. Those gaps inform the PM agent's planning on subsequent runs. The pipeline's effective capability grows over time.
This distinction shifts the design objective from maximising single-attempt performance — tuning prompts until a benchmark run looks good — to building a system that compounds: one that is measurably better after 100 attempts than after 10.
Three roles in the pipeline
PM agent (Claude Haiku): Receives the GAIA question and a brief assembled from relevant gap records. Produces a structured plan: question type classification, tool sequence, reasoning approach, known failure modes to avoid. Sets the frame for the main agent before it begins.
Main agent (Claude Sonnet): Executes the plan with full tool access: web search, code execution, file handling, multi-hop retrieval. Produces the answer and a resolution trace. This is the only agent with tool access; all reasoning and retrieval happens here.
Overseer (two sequential Claude Haiku calls): Diagnosis reads the attempt trace and identifies specifically what went wrong and why. Abstractor takes that diagnosis and writes a gap record in a form that is reusable across future attempts, not tied to the specific question that produced it.
Gap records — the system's working memory
Gap records are the mechanism by which failed attempts make future attempts better. They are not logs. They are structured, reusable diagnoses: what type of reasoning failed, why, and what the pipeline should do differently when it encounters a similar question.
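In code, a gap record can be pictured as a small structured object. The sketch below is illustrative only; the field names are assumptions, not the production Supabase schema:

```python
from dataclasses import dataclass

@dataclass
class GapRecord:
    """A reusable failure pattern abstracted from one diagnosed attempt.

    Field names are illustrative, not the production schema.
    """
    question_type: str       # e.g. "named-entity lookup", "multi-hop retrieval"
    failure_pattern: str     # which class of reasoning tends to break, and how
    mitigation: str          # what the PM should plan differently next time
    source_attempt_id: str   # provenance: the attempt whose diagnosis produced it

record = GapRecord(
    question_type="named-entity lookup",
    failure_pattern="Entities with common names get attributes extracted from the wrong referent.",
    mitigation="Verify entity identity before extracting attributes.",
    source_attempt_id="attempt-042",
)
```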
The first Overseer call reads the full attempt trace — the PM plan, the main agent's tool calls, the raw answer, the correct answer, and the resolution type. It produces a specific diagnosis: not "answer was wrong" but "agent retrieved the right source but extracted the wrong field" or "multi-hop chain broke at the second lookup because the intermediate result wasn't verified."
The second Overseer call takes the diagnosis and abstracts it. The gap record it produces is not tied to the specific question — it describes a failure pattern applicable to a class of questions. "When a question involves a named entity with a common name, verify identity before extracting attributes" is useful. "Question 42 about John Smith failed" is not.
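As a sketch of the mechanics, the two sequential calls might look like this with the Anthropic Python SDK; the model alias, prompts, and function name here are assumptions, not the production values:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-haiku-latest"  # illustrative alias; use whichever Haiku you run

def run_overseer(trace: str) -> tuple[str, str]:
    # Call 1, Diagnosis: close reading of one specific attempt trace.
    diagnosis = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=("Identify exactly where and why this attempt's reasoning "
                "chain failed. Do not propose a fix."),
        messages=[{"role": "user", "content": trace}],
    ).content[0].text

    # Call 2, Abstractor: strip question-specific detail from the diagnosis.
    gap_record = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=("Rewrite this diagnosis as a failure pattern that applies to a "
                "class of questions, with no question-specific detail."),
        messages=[{"role": "user", "content": diagnosis}],
    ).content[0].text

    return diagnosis, gap_record
```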
Before the PM agent plans each new attempt, relevant gap records are retrieved from Supabase and included in its brief. The PM uses them to anticipate failure modes specific to this question type. Gap records accumulate across attempts — the PM brief gets progressively richer, and planning gets more precise.
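A minimal sketch of that retrieval step with the supabase-py client; the table name, column names, and ranking logic are assumptions based on the fields described in this document:

```python
import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

def assemble_pm_brief(question: str, question_type: str, limit: int = 5) -> str:
    # Pull recent gap records for this question type. Content-similarity
    # retrieval would need an embedding column; it is omitted here.
    rows = (
        supabase.table("gap_records")
        .select("failure_pattern, mitigation")
        .eq("question_type", question_type)
        .order("created_at", desc=True)
        .limit(limit)
        .execute()
        .data
    )
    gaps = "\n".join(
        f"- {r['failure_pattern']} Mitigation: {r['mitigation']}" for r in rows
    )
    return (
        f"Question:\n{question}\n\n"
        f"Known failure modes for this question type:\n{gaps or 'None recorded yet.'}"
    )
```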
The Overseer diagnoses only. It never modifies the pipeline directly. Human review is required for any change to agent prompts or pipeline logic — the gap record is a proposal, not an instruction.
The pipeline
Each GAIA attempt runs through four sequential stages. The pipeline is deliberately lean: one planning agent, one execution agent, one two-call Overseer. Complexity is concentrated where it matters — in the main agent's reasoning — not distributed across a long chain of specialists.
Fewer, better-oriented agents outperform long chains. Each additional agent in a chain is a potential failure point and a source of context dilution. The PM exists to set the frame; the main agent exists to reason; the Overseer exists to learn.
PM agent (Haiku): Receives the GAIA question and a brief assembled from relevant gap records retrieved from Supabase. Classifies the question type, identifies likely tool requirements, flags known failure modes from prior attempts. Produces a structured JSON plan consumed by the main agent. The PM never touches tools; its only job is to set the reasoning frame correctly before execution begins.
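The exact plan schema is not shown in this document; a plausible shape, with illustrative field names, might be:

```python
# Illustrative PM plan; field names are assumptions, not the production schema.
pm_plan = {
    "question_type": "multi-hop retrieval",
    "expected_answer_format": "single integer",
    "tool_sequence": ["web_search", "code_execution"],
    "reasoning_approach": "Resolve each hop separately and verify intermediate "
                          "results before chaining.",
    "known_failure_modes": [
        "Intermediate result not verified before the second lookup.",
        "Answer returned with units when a bare number is expected.",
    ],
}
```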
Main agent (Sonnet): Receives the question, the PM plan, and full tool access: web search, code execution, file I/O, and multi-hop retrieval chains. Executes the plan, adapting as intermediate results require. Produces the final answer and a resolution trace, the raw record of every tool call, intermediate result, and reasoning step. Format detection runs before scoring to ensure the answer matches the expected output format.
Overseer, Diagnosis call (Haiku): Runs only on incorrect attempts. Reads the full trace: PM plan, tool call sequence, intermediate results, final answer, correct answer, and resolution type. Produces a specific diagnosis of what failed and why, identifying the exact point in the reasoning chain where the attempt went wrong. Does not propose a fix.
Overseer, Abstractor call (Haiku): Takes the Diagnosis output and abstracts it into a reusable gap record. Strips the question-specific detail to produce a failure pattern applicable to a class of questions. The gap record is written to Supabase and becomes available to the PM on all subsequent attempts of the same question type. The Abstractor call is parallelised with the database insert for speed.
The self-improvement loop
The loop is what separates this pipeline from a static benchmark runner. Every incorrect attempt produces a gap record. Gap records accumulate. The PM brief gets richer. Performance improves not because the model gets better — but because the pipeline learns what the model tends to get wrong, and plans around it.
1. A GAIA question arrives. Relevant gap records are retrieved from Supabase based on question type and content similarity. The PM receives the question plus a brief assembled from those records.
2. The PM (Haiku) classifies the question, identifies tool requirements, and flags failure modes from prior gap records relevant to this question type. It outputs structured JSON consumed by Sonnet.
3. The main agent (Sonnet) executes with full tool access: web search, code execution, file I/O, multi-hop chains. It produces an answer and a resolution trace. Format detection runs before scoring to match the expected output format.
4. The attempt is scored against the GAIA ground truth. Correct attempts are recorded. Incorrect attempts trigger the Overseer. The resolution type is logged: format error, retrieval failure, reasoning gap, tool limit, etc.
5. The Diagnosis call (Haiku) reads the full trace for the incorrect attempt and produces a specific diagnosis of where and why the reasoning chain failed.
6. The Abstractor call takes the diagnosis and produces a reusable gap record, stripping question-specific detail to describe a failure pattern applicable to a class of questions. The record is written to Supabase alongside the diagnosis.
7. The gap record enters the retrieval pool. On the next attempt with a similar question type, the PM brief will include it. The system does not need to encounter the identical question to benefit; pattern similarity is sufficient.
8. The loop closes. Future PM briefs are informed by the accumulated gap record pool. The pipeline's effective capability grows with each cycle, not because any model changed, but because planning gets progressively better informed. One full cycle is sketched below.
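Under those assumptions, one cycle reduces to a short orchestration function. pm_agent, main_agent, record_attempt, and classify_failure are hypothetical wrappers; assemble_pm_brief and run_overseer refer to the earlier sketches:

```python
def run_cycle(question: str, question_type: str, ground_truth: str) -> None:
    # Steps 1-3: retrieve context, plan, execute.
    brief = assemble_pm_brief(question, question_type)  # earlier sketch
    plan = pm_agent(question, brief)                    # hypothetical Haiku wrapper
    answer, trace = main_agent(question, plan)          # hypothetical Sonnet wrapper;
                                                        # answer is format-normalised
    # Step 4: score against ground truth.
    if answer == ground_truth:
        record_attempt(trace, resolution_type="correct")  # hypothetical DB helper
        return

    # Steps 5-6: the Overseer runs only on incorrect attempts.
    diagnosis, gap_record = run_overseer(trace)           # earlier sketch

    # Steps 7-8: the gap record enters the retrieval pool for future briefs.
    record_attempt(trace, resolution_type=classify_failure(trace),  # hypothetical
                   diagnosis=diagnosis, gap_record=gap_record)
```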
Gap records inform planning automatically. Changes to agent prompts or pipeline logic require human review. The Overseer proposes; it does not implement. This constraint is permanent.
The Overseer
The Overseer is two sequential Haiku calls, not a single model. The split matters: Diagnosis requires close reading of a specific trace; Abstraction requires stepping back from that trace to describe a general pattern. Combining them in one call produces diagnoses that are either too specific to be reusable or too generic to be actionable.
The two calls run in sequence. The Abstractor call is parallelised with the Supabase insert for speed, so the gap record is available to the PM on the very next attempt.
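One way to get that overlap, assuming the attempt row (trace plus diagnosis) is written while the Abstractor call is in flight; abstractor_call and insert_attempt_row are hypothetical helpers:

```python
from concurrent.futures import ThreadPoolExecutor

def finish_overseer(trace: str, diagnosis: str) -> str:
    # Overlap the Abstractor call with the attempt-row insert so neither
    # blocks the other. abstractor_call and insert_attempt_row are
    # hypothetical wrappers around the Haiku call and the Supabase write.
    with ThreadPoolExecutor(max_workers=2) as pool:
        gap_future = pool.submit(abstractor_call, diagnosis)
        insert_future = pool.submit(insert_attempt_row, trace, diagnosis)
        gap_record = gap_future.result()
        insert_future.result()  # propagate any insert error
    return gap_record
```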
Reads the attempt trace, identifies the specific failure point, abstracts it into a reusable gap record. Operates only on incorrect attempts. Produces structured output consumed by the planning layer — it does not communicate with the main agent directly.
The Overseer cannot change agent prompts, pipeline logic, or its own instructions. Gap records inform planning automatically. Everything else requires human review. The constraint is permanent — the value of the loop is the quality of diagnosis, not the autonomy of the system to act on it.
Trace infrastructure
Every attempt is recorded in the gaia_attempts table in Supabase. The trace is not a log — it is the Overseer's primary input and the evidence base for the self-improvement loop. A trace that cannot explain why an attempt failed cannot produce a useful gap record.
A trace is only complete if it can answer: exactly what did the PM plan, exactly what did the main agent do, and exactly where did the reasoning chain break?
| Field | Purpose |
|---|---|
| pm_brief | The full brief passed to the PM, including retrieved gap records. Allows comparison of planning quality across attempts on similar questions. |
| pm_plan | The PM's structured JSON output — question classification, tool sequence, flagged failure modes. The Overseer reads this to assess whether the plan was appropriate for the question. |
| tool_calls | Raw tool use blocks from the main agent's execution. Includes search queries, code inputs/outputs, and file operations. The Overseer uses this to identify where retrieval or reasoning diverged from the plan. |
| raw_answer | The main agent's answer before format normalisation. Distinguishes format errors from reasoning errors — a correct answer in the wrong format is a different failure mode than a wrong answer. |
| resolution_type | Classification of how the attempt resolved: correct, format_error, retrieval_failure, reasoning_gap, tool_limit, or server_tool_use. Enables pattern analysis across failure categories. |
| overseer_diagnosis | The Diagnosis Haiku's specific assessment of what failed. Stored verbatim — not summarised — so the Abstractor has full fidelity input. |
| gap_record | The Abstractor's reusable failure pattern. What enters the retrieval pool for future PM briefs. |
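Writing one row with these fields might look like the following sketch, reusing the supabase client from the earlier retrieval example; all values are placeholders:

```python
# Illustrative row; column names follow the table above, values are placeholders.
attempt_row = {
    "pm_brief": "Question: ...\nKnown failure modes for this question type: ...",
    "pm_plan": {"question_type": "multi-hop retrieval", "tool_sequence": ["web_search"]},
    "tool_calls": [{"type": "web_search", "query": "..."}],
    "raw_answer": "approximately 1,234 km",
    "resolution_type": "format_error",
    "overseer_diagnosis": "Extraction kept units; the expected format was a bare number.",
    "gap_record": "When the expected format is a bare number, strip units before answering.",
}
supabase.table("gaia_attempts").insert(attempt_row).execute()
```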
Results and component status
Current performance on the GAIA benchmark as of April 2026. Score improvements correspond directly to pipeline changes — each major architectural addition is traceable to a measurable score delta.
| Component | Status | Impact |
|---|---|---|
| Main agent (Sonnet 4.6) with tool access | Live | Baseline execution |
| Format detection + pre-score containment | Live | 60.4% → 65.5% |
| PM agent (Haiku) with gap record brief | Live | 65.5% → 73% |
| Overseer — two-call Diagnosis + Abstractor | Live | Gap record quality improved |
| Parallelised Abstractor + DB insert | Live | Latency reduced |
| Auto-recovery on transient failures | Live | Reliability improved |
| SHR (structured hint retrieval) | Live | Active |
| Raw search query logging for Overseer | Backlog | Improves retrieval diagnosis |
| Vision tool for image questions | Backlog | Targets image-heavy gap category |
| SHR download support | Backlog | Extends file handling |
Current score: 73% — 43 correct from 59 attempted. Known gaps: proprietary API access, Wayback Machine retrieval, image-only questions, and deep multi-hop chains where intermediate verification is required.
Design principles
These decisions were made deliberately and are traceable to measurable outcomes. Each represents a choice that could have gone differently — and in earlier versions, did.
Each agent in a chain is a failure point and a source of context dilution. The pipeline has three roles: plan, execute, diagnose. Adding agents without a clear quality justification makes the system harder to debug and slower to improve.
The Overseer produces gap records, not prompt changes. Human review is required for any modification to agent instructions or pipeline logic. The constraint is permanent — trust in the loop is built by keeping this boundary clear.
A single Overseer call conflates close reading of a specific trace with generalisation across a class of questions. Splitting them produces gap records that are both accurate and reusable — the combination that makes the retrieval pool valuable over time.
A correct answer in the wrong format and a wrong answer are different failure modes and require different responses. The resolution_type field enforces this distinction in every trace. Format detection runs pre-scoring to eliminate the former before the answer is evaluated.
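A minimal sketch of what such a pre-scoring step could look like; the expected_format values and the normalisation rules are illustrative, not the production detector:

```python
import re

def normalise_answer(raw: str, expected_format: str) -> str:
    """Coerce a raw answer toward the expected GAIA format before scoring.

    A deliberately small sketch; the production format detection is richer.
    """
    answer = raw.strip()
    if expected_format == "number":
        # "about 1,234 km" -> "1234": drop prose, units, thousands separators
        match = re.search(r"-?\d[\d,]*\.?\d*", answer)
        if match:
            answer = match.group().replace(",", "")
    else:
        # String answers: drop a trailing full stop and normalise case
        answer = answer.rstrip(".").lower()
    return answer

# A substantively correct answer in the wrong format is rescued before scoring:
assert normalise_answer("about 1,234 km", "number") == "1234"
```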
Every improvement claim must be traceable to a score delta. Score improvements are measured against the full attempted set, not a curated subset. A change that improves performance on one question category while degrading another is a regression, not a win.
The value of the loop is not visible in a single attempt. It becomes visible across 50. The PM brief gets richer with each cycle, planning gets more precise, and the failure modes that remain are genuinely hard — not the same mistakes repeating.