A self-improving agent pipeline for general-purpose reasoning tasks
AI Office GAIA is a multi-agent reasoning pipeline designed around a specific hypothesis: that a smaller number of well-structured agents with a functioning self-improvement loop will outperform a larger chain of agents that cannot learn from its own failures. The pipeline runs Claude Sonnet as the primary reasoning agent, with Claude Haiku handling planning and Overseer functions. Every attempt produces a structured trace. The Overseer — split into Diagnosis and Abstractor roles — reads those traces, identifies failure patterns, and generates gap records that inform subsequent attempts. The system gets demonstrably better with each run cycle.
The core distinction
Most agent pipelines are static: given a task, run a sequence of steps, return an answer. If the answer is wrong, nothing changes for the next attempt. AI Office is built on a different premise.
A fixed pipeline runs each task independently. Failures are logged at best. No mechanism exists to identify why a class of question fails, or to adjust the pipeline accordingly. Each run starts from the same baseline.
In AI Office, every attempt produces a structured trace. The Overseer reads traces across attempts, identifies failure patterns, and generates gap records. Those gaps inform the PM agent's planning on subsequent runs. The pipeline's effective capability grows over time.
This distinction shifts the design objective from maximising single-attempt performance — tuning prompts until a benchmark run looks good — to building a system that compounds: one that is measurably better after 100 attempts than after 10.
Three roles in the pipeline
PM agent (Claude Haiku): Receives the GAIA question and a brief assembled from relevant gap records. Produces a structured plan: question type classification, tool sequence, reasoning approach, known failure modes to avoid. Sets the frame for the main agent before it begins.
Main agent (Claude Sonnet): Executes the plan with full tool access: web search, code execution, file handling, multi-hop retrieval. Produces the answer and a resolution trace. This is the only agent with tool access; all reasoning and retrieval happens here.
Overseer (two sequential Claude Haiku calls): Diagnosis reads the attempt trace and identifies specifically what went wrong and why. Abstractor takes that diagnosis and writes a gap record in a form that is reusable across future attempts, not tied to the specific question that produced it.
Gap records — the system's working memory
Gap records are the mechanism by which failed attempts make future attempts better. They are not logs. They are structured, reusable diagnoses: what type of reasoning failed, why, and what the pipeline should do differently when it encounters a similar question.
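In code, a gap record can be pictured as a small structured object. The sketch below is illustrative only; the field names are assumptions, not the production Supabase schema:

```python
from dataclasses import dataclass

@dataclass
class GapRecord:
    """A reusable failure pattern abstracted from one diagnosed attempt.

    Field names are illustrative, not the production schema.
    """
    question_type: str       # e.g. "named-entity lookup", "multi-hop retrieval"
    failure_pattern: str     # which class of reasoning tends to break, and how
    mitigation: str          # what the PM should plan differently next time
    source_attempt_id: str   # provenance: the attempt whose diagnosis produced it

record = GapRecord(
    question_type="named-entity lookup",
    failure_pattern="Entities with common names get attributes extracted from the wrong referent.",
    mitigation="Verify entity identity before extracting attributes.",
    source_attempt_id="attempt-042",
)
```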
The first Overseer call reads the full attempt trace — the PM plan, the main agent's tool calls, the raw answer, the correct answer, and the resolution type. It produces a specific diagnosis: not "answer was wrong" but "agent retrieved the right source but extracted the wrong field" or "multi-hop chain broke at the second lookup because the intermediate result wasn't verified."
The second Overseer call takes the diagnosis and abstracts it. The gap record it produces is not tied to the specific question — it describes a failure pattern applicable to a class of questions. "When a question involves a named entity with a common name, verify identity before extracting attributes" is useful. "Question 42 about John Smith failed" is not.
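As a sketch of the mechanics, the two sequential calls might look like this with the Anthropic Python SDK; the model alias, prompts, and function name here are assumptions, not the production values:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-haiku-latest"  # illustrative alias; use whichever Haiku you run

def run_overseer(trace: str) -> tuple[str, str]:
    # Call 1, Diagnosis: close reading of one specific attempt trace.
    diagnosis = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=("Identify exactly where and why this attempt's reasoning "
                "chain failed. Do not propose a fix."),
        messages=[{"role": "user", "content": trace}],
    ).content[0].text

    # Call 2, Abstractor: strip question-specific detail from the diagnosis.
    gap_record = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=("Rewrite this diagnosis as a failure pattern that applies to a "
                "class of questions, with no question-specific detail."),
        messages=[{"role": "user", "content": diagnosis}],
    ).content[0].text

    return diagnosis, gap_record
```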
Before the PM agent plans each new attempt, relevant gap records are retrieved from Supabase and included in its brief. The PM uses them to anticipate failure modes specific to this question type. Gap records accumulate across attempts — the PM brief gets progressively richer, and planning gets more precise.
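A minimal sketch of that retrieval step with the supabase-py client; the table name, column names, and ranking logic are assumptions based on the fields described in this document:

```python
import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

def assemble_pm_brief(question: str, question_type: str, limit: int = 5) -> str:
    # Pull recent gap records for this question type. Content-similarity
    # retrieval would need an embedding column; it is omitted here.
    rows = (
        supabase.table("gap_records")
        .select("failure_pattern, mitigation")
        .eq("question_type", question_type)
        .order("created_at", desc=True)
        .limit(limit)
        .execute()
        .data
    )
    gaps = "\n".join(
        f"- {r['failure_pattern']} Mitigation: {r['mitigation']}" for r in rows
    )
    return (
        f"Question:\n{question}\n\n"
        f"Known failure modes for this question type:\n{gaps or 'None recorded yet.'}"
    )
```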
The Overseer diagnoses only. It never modifies the pipeline directly. Human review is required for any change to agent prompts or pipeline logic — the gap record is a proposal, not an instruction.
The pipeline
Each GAIA attempt runs through four sequential stages. The pipeline is deliberately lean: one planning agent, one execution agent, one two-call Overseer. Complexity is concentrated where it matters — in the main agent's reasoning — not distributed across a long chain of specialists.
Fewer, better-oriented agents outperform long chains. Each additional agent in a chain is a potential failure point and a source of context dilution. The PM exists to set the frame; the main agent exists to reason; the Overseer exists to learn.
PM agent (Haiku): Receives the GAIA question and a brief assembled from relevant gap records retrieved from Supabase. Classifies the question type, identifies likely tool requirements, flags known failure modes from prior attempts. Produces a structured JSON plan consumed by the main agent. The PM never touches tools; its only job is to set the reasoning frame correctly before execution begins.
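The exact plan schema is not shown in this document; a plausible shape, with illustrative field names, might be:

```python
# Illustrative PM plan; field names are assumptions, not the production schema.
pm_plan = {
    "question_type": "multi-hop retrieval",
    "expected_answer_format": "single integer",
    "tool_sequence": ["web_search", "code_execution"],
    "reasoning_approach": "Resolve each hop separately and verify intermediate "
                          "results before chaining.",
    "known_failure_modes": [
        "Intermediate result not verified before the second lookup.",
        "Answer returned with units when a bare number is expected.",
    ],
}
```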
Main agent (Sonnet): Receives the question, the PM plan, and full tool access: web search, code execution, file I/O, and multi-hop retrieval chains. Executes the plan, adapting as intermediate results require. Produces the final answer and a resolution trace, the raw record of every tool call, intermediate result, and reasoning step. Format detection runs before scoring to ensure the answer matches the expected output format.
Overseer, Diagnosis call (Haiku): Runs only on incorrect attempts. Reads the full trace: PM plan, tool call sequence, intermediate results, final answer, correct answer, and resolution type. Produces a specific diagnosis of what failed and why, identifying the exact point in the reasoning chain where the attempt went wrong. Does not propose a fix.
Overseer, Abstractor call (Haiku): Takes the Diagnosis output and abstracts it into a reusable gap record. Strips the question-specific detail to produce a failure pattern applicable to a class of questions. The gap record is written to Supabase and becomes available to the PM on all subsequent attempts of the same question type. The Abstractor call is parallelised with the database insert for speed.
The self-improvement loop
The loop is what separates this pipeline from a static benchmark runner. Every incorrect attempt produces a gap record. Gap records accumulate. The PM brief gets richer. Performance improves not because the model gets better — but because the pipeline learns what the model tends to get wrong, and plans around it.
1. A GAIA question arrives. Relevant gap records are retrieved from Supabase based on question type and content similarity. The PM receives the question plus a brief assembled from those records.
2. The PM (Haiku) classifies the question, identifies tool requirements, and flags failure modes from prior gap records relevant to this question type. It outputs structured JSON consumed by Sonnet.
3. The main agent (Sonnet) executes with full tool access: web search, code execution, file I/O, multi-hop chains. It produces an answer and a resolution trace. Format detection runs before scoring to match the expected output format.
4. The attempt is scored against the GAIA ground truth. Correct attempts are recorded. Incorrect attempts trigger the Overseer. The resolution type is logged: format error, retrieval failure, reasoning gap, tool limit, etc.
5. The Diagnosis call (Haiku) reads the full trace for the incorrect attempt and produces a specific diagnosis of where and why the reasoning chain failed.
6. The Abstractor call takes the diagnosis and produces a reusable gap record, stripping question-specific detail to describe a failure pattern applicable to a class of questions. The record is written to Supabase alongside the diagnosis.
7. The gap record enters the retrieval pool. On the next attempt with a similar question type, the PM brief will include it. The system does not need to encounter the identical question to benefit; pattern similarity is sufficient.
8. The loop closes. Future PM briefs are informed by the accumulated gap record pool. The pipeline's effective capability grows with each cycle, not because any model changed, but because planning gets progressively better informed. One full cycle is sketched below.
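Under those assumptions, one cycle reduces to a short orchestration function. pm_agent, main_agent, record_attempt, and classify_failure are hypothetical wrappers; assemble_pm_brief and run_overseer refer to the earlier sketches:

```python
def run_cycle(question: str, question_type: str, ground_truth: str) -> None:
    # Steps 1-3: retrieve context, plan, execute.
    brief = assemble_pm_brief(question, question_type)  # earlier sketch
    plan = pm_agent(question, brief)                    # hypothetical Haiku wrapper
    answer, trace = main_agent(question, plan)          # hypothetical Sonnet wrapper;
                                                        # answer is format-normalised
    # Step 4: score against ground truth.
    if answer == ground_truth:
        record_attempt(trace, resolution_type="correct")  # hypothetical DB helper
        return

    # Steps 5-6: the Overseer runs only on incorrect attempts.
    diagnosis, gap_record = run_overseer(trace)           # earlier sketch

    # Steps 7-8: the gap record enters the retrieval pool for future briefs.
    record_attempt(trace, resolution_type=classify_failure(trace),  # hypothetical
                   diagnosis=diagnosis, gap_record=gap_record)
```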
Gap records inform planning automatically. Changes to agent prompts or pipeline logic require human review. The Overseer proposes; it does not implement. This constraint is permanent.
The Overseer
The Overseer is two sequential Haiku calls, not a single model. The split matters: Diagnosis requires close reading of a specific trace; Abstraction requires stepping back from that trace to describe a general pattern. Combining them in one call produces diagnoses that are either too specific to be reusable or too generic to be actionable.
The two calls run in sequence. The Abstractor call is parallelised with the Supabase insert for speed, so the gap record is available to the PM on the very next attempt.
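One way to get that overlap, assuming the attempt row (trace plus diagnosis) is written while the Abstractor call is in flight; abstractor_call and insert_attempt_row are hypothetical helpers:

```python
from concurrent.futures import ThreadPoolExecutor

def finish_overseer(trace: str, diagnosis: str) -> str:
    # Overlap the Abstractor call with the attempt-row insert so neither
    # blocks the other. abstractor_call and insert_attempt_row are
    # hypothetical wrappers around the Haiku call and the Supabase write.
    with ThreadPoolExecutor(max_workers=2) as pool:
        gap_future = pool.submit(abstractor_call, diagnosis)
        insert_future = pool.submit(insert_attempt_row, trace, diagnosis)
        gap_record = gap_future.result()
        insert_future.result()  # propagate any insert error
    return gap_record
```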
Reads the attempt trace, identifies the specific failure point, abstracts it into a reusable gap record. Operates only on incorrect attempts. Produces structured output consumed by the planning layer — it does not communicate with the main agent directly.
The Overseer cannot change agent prompts, pipeline logic, or its own instructions. Gap records inform planning automatically. Everything else requires human review. The constraint is permanent — the value of the loop is the quality of diagnosis, not the autonomy of the system to act on it.
Trace infrastructure
Every attempt is recorded in the gaia_attempts table in Supabase. The trace is not a log — it is the Overseer's primary input and the evidence base for the self-improvement loop. A trace that cannot explain why an attempt failed cannot produce a useful gap record.
A trace is only complete if it can answer: exactly what did the PM plan, exactly what did the main agent do, and exactly where did the reasoning chain break?
| Field | Purpose |
|---|---|
| pm_brief | The full brief passed to the PM, including retrieved gap records. Allows comparison of planning quality across attempts on similar questions. |
| pm_plan | The PM's structured JSON output — question classification, tool sequence, flagged failure modes. The Overseer reads this to assess whether the plan was appropriate for the question. |
| tool_calls | Raw tool use blocks from the main agent's execution. Includes search queries, code inputs/outputs, and file operations. The Overseer uses this to identify where retrieval or reasoning diverged from the plan. |
| raw_answer | The main agent's answer before format normalisation. Distinguishes format errors from reasoning errors — a correct answer in the wrong format is a different failure mode than a wrong answer. |
| resolution_type | Classification of how the attempt resolved: correct, format_error, retrieval_failure, reasoning_gap, tool_limit, or server_tool_use. Enables pattern analysis across failure categories. |
| overseer_diagnosis | The Diagnosis Haiku's specific assessment of what failed. Stored verbatim — not summarised — so the Abstractor has full fidelity input. |
| gap_record | The Abstractor's reusable failure pattern. What enters the retrieval pool for future PM briefs. |
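Writing one row with these fields might look like the following sketch, reusing the supabase client from the earlier retrieval example; all values are placeholders:

```python
# Illustrative row; column names follow the table above, values are placeholders.
attempt_row = {
    "pm_brief": "Question: ...\nKnown failure modes for this question type: ...",
    "pm_plan": {"question_type": "multi-hop retrieval", "tool_sequence": ["web_search"]},
    "tool_calls": [{"type": "web_search", "query": "..."}],
    "raw_answer": "approximately 1,234 km",
    "resolution_type": "format_error",
    "overseer_diagnosis": "Extraction kept units; the expected format was a bare number.",
    "gap_record": "When the expected format is a bare number, strip units before answering.",
}
supabase.table("gaia_attempts").insert(attempt_row).execute()
```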
Results and component status
Current performance on the GAIA benchmark as of April 2026. Score improvements correspond directly to pipeline changes — each major architectural addition is traceable to a measurable score delta.
| Component | Status | Impact |
|---|---|---|
| Main agent (Sonnet 4.6) with tool access | Live | Baseline execution |
| Format detection + pre-score containment | Live | 60.4% → 65.5% |
| PM agent (Haiku) with gap record brief | Live | 65.5% → 73% |
| Overseer — two-call Diagnosis + Abstractor | Live | Gap record quality improved |
| Parallelised Abstractor + DB insert | Live | Latency reduced |
| Auto-recovery on transient failures | Live | Reliability improved |
| SHR (structured hint retrieval) | Live | Active |
| Raw search query logging for Overseer | Backlog | Improves retrieval diagnosis |
| Vision tool for image questions | Backlog | Targets image-heavy gap category |
| SHR download support | Backlog | Extends file handling |
Current score: 73% — 43 correct from 59 attempted. Known gaps: proprietary API access, Wayback Machine retrieval, image-only questions, and deep multi-hop chains where intermediate verification is required.
Design principles
These decisions were made deliberately and are traceable to measurable outcomes. Each represents a choice that could have gone differently — and in earlier versions, did.
Each agent in a chain is a failure point and a source of context dilution. The pipeline has three roles: plan, execute, diagnose. Adding agents without a clear quality justification makes the system harder to debug and slower to improve.
The Overseer produces gap records, not prompt changes. Human review is required for any modification to agent instructions or pipeline logic. The constraint is permanent — trust in the loop is built by keeping this boundary clear.
A single Overseer call conflates close reading of a specific trace with generalisation across a class of questions. Splitting them produces gap records that are both accurate and reusable — the combination that makes the retrieval pool valuable over time.
A correct answer in the wrong format and a wrong answer are different failure modes and require different responses. The resolution_type field enforces this distinction in every trace. Format detection runs pre-scoring to eliminate the former before the answer is evaluated.
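A minimal sketch of what such a pre-scoring step could look like; the expected_format values and the normalisation rules are illustrative, not the production detector:

```python
import re

def normalise_answer(raw: str, expected_format: str) -> str:
    """Coerce a raw answer toward the expected GAIA format before scoring.

    A deliberately small sketch; the production format detection is richer.
    """
    answer = raw.strip()
    if expected_format == "number":
        # "about 1,234 km" -> "1234": drop prose, units, thousands separators
        match = re.search(r"-?\d[\d,]*\.?\d*", answer)
        if match:
            answer = match.group().replace(",", "")
    else:
        # String answers: drop a trailing full stop and normalise case
        answer = answer.rstrip(".").lower()
    return answer

# A substantively correct answer in the wrong format is rescued before scoring:
assert normalise_answer("about 1,234 km", "number") == "1234"
```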
Every improvement claim must be traceable to a score delta. Score improvements are measured against the full attempted set, not a curated subset. A change that improves performance on one question category while degrading another is a regression, not a win.
The value of the loop is not visible in a single attempt. It becomes visible across 50. The PM brief gets richer with each cycle, planning gets more precise, and the failure modes that remain are genuinely hard — not the same mistakes repeating.