Architecture Deep Dive

The Architecture of AI Research Engineer

How Archimedes turns a single sentence into a literature review, an experimental plan, working code, real metrics, and a finished paper — with no human in the loop.

June 16, 2026 · 14 min read · Systems

TL;DR

AI Research Engineer is an open-source multi-agent system that automates ML research end to end. Google's Agent Development Kit (ADK) orchestrates planning, reflection, and review; Claude (via the Claude Code CLI) handles surgical code implementation; a FAISS-backed evolutionary loop can optimize a metric across generations; and every run produces an inspectable, replayable trace — literature map, plan, code, experiments, metrics, failures, and a compiled paper.

What Is AI Research Engineer?

AI Research Engineer is a framework, not a single model. It is a graph of specialized LLM agents — each with a narrow job, a strict prompt, and a defined handoff — wired together into a workflow that mirrors how a real research lab operates: someone proposes an idea, someone checks it against the literature, someone plans the experiments, someone writes the code, someone watches the metrics, and someone writes the paper.

The project ships under the internal codename Archimedes. Give it a hypothesis, a paper to replicate, a dataset, or a benchmark to beat, and it runs the entire research lifecycle autonomously, leaving behind a full audit trail of how it got there.

The Core Loop: From Input to Reproducible Trace

Every invocation accepts one of four input types and produces the same eight-part output:

Input: a hypothesis, a paper, a dataset, or a benchmark.
Output: a literature map, a research plan, working code, executed experiments, measured metrics, recorded failures, a final paper, and a replayable session log.

This input/output contract is what makes the system composable: the same engine backs the CLI (ai-research-engineer "…" --mode orchestrated), a plain Python API (AIEngineer), and any HTTP service built on top of it.
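As a rough illustration of that contract, the four inputs and eight outputs can be written down as plain types. This is a sketch only: InputKind, ResearchRequest, ResearchTrace, and missing_outputs are hypothetical names chosen for this example, and modeling each output as a filesystem path is an assumption, not the project's actual API.

```python
from dataclasses import dataclass, fields
from enum import Enum
from pathlib import Path


class InputKind(Enum):
    """The four input types named in the text."""
    HYPOTHESIS = "hypothesis"
    PAPER = "paper"
    DATASET = "dataset"
    BENCHMARK = "benchmark"


@dataclass(frozen=True)
class ResearchRequest:
    """Hypothetical request object: what kind of input, plus its content."""
    kind: InputKind
    content: str  # the sentence, paper reference, dataset path, or benchmark name


@dataclass(frozen=True)
class ResearchTrace:
    """The eight outputs named in the text, modeled here (by assumption) as paths."""
    literature_map: Path
    research_plan: Path
    code: Path
    experiments: Path
    metrics: Path
    failures: Path
    paper: Path
    session_log: Path


def missing_outputs(trace: ResearchTrace) -> list[str]:
    """Return the names of outputs whose paths do not exist yet."""
    return [f.name for f in fields(trace) if not getattr(trace, f.name).exists()]
```

A caller built on such a contract could use a check like missing_outputs to decide whether a run finished completely, regardless of whether it was started from the CLI, the Python API, or an HTTP wrapper.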

The Agent Graph in Orchestrated Mode

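In --mode orchestrated, the root workflow (ai_research_engineer_workflow) runs five phases in sequence: ideation, planning, stage-by-stage implementation, reflection, and paper writing. Each is implemented as its own ADK sub-agent or loop agent; implementation and reflection are described together under a single heading below.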

1. Ideation Loop

idea_generator_agent proposes a hypothesis; novelty_scorer_agent checks it against ArXiv and Semantic Scholar and scores it for novelty and feasibility. The two run in a loop — ideation_loop — with a review-confirmation agent that can send the idea back for another pass before the loop exits.

2. High-Level Planning Loop

plan_maker_agent turns the accepted hypothesis into a milestone-based experimental design — baselines, ablations, and concrete success criteria. plan_reviewer_agent critiques it inside high_level_planning_loop until the plan is sound, and high_level_plan_parser converts the approved plan into discrete, machine-readable stages.

3. Stage Orchestrator + Implementation Loop

stage_orchestrator feeds one stage at a time to the implementation_loop, where the coding agent (Claude Code) writes and iteratively refines the code for that stage. After each stage, success_criteria_checker verifies the stage's empirical criteria against the actual run output, and stage_reflector — acting as an adaptive Principal Investigator — rewrites the remaining stages in light of what was just learned. Plans are not static; they evolve with the evidence.
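The control flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: Stage, StageResult, and run_stages are hypothetical names, the three callables stand in for the coding agent, success_criteria_checker, and stage_reflector, and the max_stages guard is an addition for safety in the sketch rather than something the source describes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    name: str
    success_criteria: str


@dataclass
class StageResult:
    stage: Stage
    output: str
    passed: bool


def run_stages(
    stages: list[Stage],
    implement: Callable[[Stage], str],
    check: Callable[[Stage, str], bool],
    reflect: Callable[[StageResult, list[Stage]], list[Stage]],
    max_stages: int = 50,  # assumption: a guard so reflection cannot loop forever
) -> list[StageResult]:
    """Run one stage at a time; after each, let `reflect` rewrite the remaining plan."""
    remaining = list(stages)
    history: list[StageResult] = []
    while remaining and len(history) < max_stages:
        stage = remaining.pop(0)
        output = implement(stage)                      # stand-in for the coding agent
        result = StageResult(stage, output, check(stage, output))
        history.append(result)
        remaining = reflect(result, remaining)         # the plan may change here
    return history
```

The key design point is that `remaining` is reassigned on every iteration, so the plan that finishes the run need not be the plan that started it.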

4. Paper Writing Loop

Once every stage is complete, paper_writer_agent synthesizes the knowledge base, the experiment logs, and the metrics into a manuscript; paper_reviewer_agent reviews it inside paper_writing_loop for rigor and clarity before the LaTeX is compiled to PDF.

Three Execution Engines: Orchestrated, Simple, Evolve

The CLI's --mode flag does not just change a setting — it routes execution through one of three structurally different engines.

orchestrated — the full ADK agent graph described above. Best for open-ended research questions.
simple — bypasses planning entirely and hands the prompt directly to the Claude Code agent. Faster and cheaper for narrowly-scoped coding tasks.
evolve — replaces the implementation loop with EvolutionLoopAgent, an autonomous Darwinian optimization loop.

How the evolve loop actually works

Each generation, EvolutionLoopAgent samples a parent node from a FAISS vector database of previously-tried code variants, weighted toward higher-scoring nodes. It hands the parent's code and motivation to the coding agent with an explicit mutation task, runs the resulting script, and reads the new empirical score back out of results.json. An analyzer agent then reflects on whether the mutation helped or hurt, and the new node — code, score, and analysis — is committed back to the database. A BestSnapshotManager tracks the all-time best generation so the state-of-the-art result is never lost, even if a later mutation regresses.
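The generation loop can be sketched without any of the real infrastructure. In the example below, a plain Python list stands in for the FAISS database, the mutate and evaluate callables stand in for the coding agent and the results.json readout, and a one-line verdict stands in for the analyzer agent. Node, sample_parent, and evolve are hypothetical names, and the sketch assumes that a higher score is better.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    code: str
    score: float
    analysis: str = ""


def sample_parent(nodes: list[Node], rng: random.Random) -> Node:
    """Pick a parent at random, weighted toward higher-scoring nodes."""
    lowest = min(n.score for n in nodes)
    # Shift scores so every weight is strictly positive, even for negative or tied scores.
    weights = [n.score - lowest + 1e-6 for n in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]


def evolve(
    seed: Node,
    mutate: Callable[[str, random.Random], str],
    evaluate: Callable[[str], float],
    generations: int,
    rng: random.Random,
) -> tuple[Node, list[Node]]:
    """Run the loop and return (best node ever seen, full population)."""
    population = [seed]
    best = seed
    for _ in range(generations):
        parent = sample_parent(population, rng)
        child_code = mutate(parent.code, rng)
        child_score = evaluate(child_code)
        verdict = "improved on parent" if child_score > parent.score else "did not improve"
        child = Node(child_code, child_score, verdict)
        population.append(child)          # every variant is kept, good or bad
        if child.score > best.score:      # the best snapshot only ever moves upward
            best = child
    return best, population
```

Keeping `best` separate from the population is what protects the top result: regressing children are still recorded for later sampling, but they never replace the best snapshot.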

Structural Code Intelligence

Reading entire files to understand a codebase wastes context window and invites mistakes. The review_agent instead queries a codebase knowledge graph built with Graphify, performing AST-level inspection of function signatures, call chains, and blast radius. This is reported to cut token usage by a factor of 71.5 compared with reading raw source, and it is how the system verifies mathematical correctness in a 10,000-line implementation without ever loading all 10,000 lines into a prompt.
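To give a feel for what AST-level inspection means in practice, here is a small sketch using Python's standard ast module. It does not use Graphify and makes no claim about how Graphify works; summarize_module and callers_of are hypothetical helpers that extract function signatures and a crude "who calls this" view from source text.

```python
import ast


def summarize_module(source: str) -> dict[str, dict[str, list[str]]]:
    """Map each top-level function to its positional parameters and the plain names it calls."""
    tree = ast.parse(source)
    summary: dict[str, dict[str, list[str]]] = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Only direct calls to bare names are captured; method calls are ignored here.
            calls = sorted({
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            })
            summary[node.name] = {
                "params": [a.arg for a in node.args.args],
                "calls": calls,
            }
    return summary


def callers_of(summary: dict[str, dict[str, list[str]]], target: str) -> list[str]:
    """A one-hop blast radius: which functions call `target` directly."""
    return sorted(name for name, info in summary.items() if target in info["calls"])
```

A reviewer agent working from a summary like this can ask "what breaks if this function changes" by sending a few dozen tokens of structure instead of the full file.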

Context Window Management

Long-running research sessions can generate thousands of tool calls. Once a session crosses 40 events, the framework triggers LLM-based event compression: the history is summarized and collapsed into a single context event, keeping multi-day sessions comfortably under a 1M-token window without losing the decisions that mattered.
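The threshold rule is simple enough to sketch directly. The 40-event threshold comes from the text; everything else here is an assumption for illustration, including the maybe_compress name, the dict-based event shape, and the summarize callable that stands in for the LLM summarization call.

```python
from typing import Callable

COMPRESSION_THRESHOLD = 40  # from the text: compression triggers once a session crosses 40 events


def maybe_compress(
    events: list[dict],
    summarize: Callable[[list[dict]], str],
    threshold: int = COMPRESSION_THRESHOLD,
) -> list[dict]:
    """Collapse the history into a single context event once it exceeds the threshold."""
    if len(events) <= threshold:
        return events
    return [{
        "type": "context",
        "content": summarize(events),   # stand-in for the LLM-based summary
        "replaces": len(events),        # hypothetical field: how many events were folded in
    }]
```

New events are then appended after the single context event, so the history grows again from a length of one until the threshold is crossed once more.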

The Research Vault: Workspace Layout

Every run writes into a predictable, inspectable directory structure:

knowledge_base/ — synthesized literature notes and architecture blueprints.
literature/ — raw full-text sources pulled from ArXiv and Semantic Scholar.
workflow/ — implementation code, training loops, and model modules.
results/ — metric logs, checkpoints, and comparison plots.
manuscript/ — the final, compiled LaTeX/PDF paper.

This structure exists specifically to defeat context amnesia: an agent that gets compacted or restarted mid-session can re-orient itself by reading the vault instead of re-deriving everything from scratch.
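A sketch of how such a vault could be created and re-read follows. The five directory names come from the list above; init_vault and reorient are hypothetical helpers written for this example, not functions from the project.

```python
from pathlib import Path

# Directory names taken from the workspace layout described in the text.
VAULT_DIRS = ("knowledge_base", "literature", "workflow", "results", "manuscript")


def init_vault(root: Path) -> dict[str, Path]:
    """Create the vault directories under `root` (idempotent) and return their paths."""
    paths = {name: root / name for name in VAULT_DIRS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


def reorient(root: Path) -> dict[str, list[str]]:
    """List what each vault directory already holds, e.g. for an agent restarted mid-session."""
    inventory: dict[str, list[str]] = {}
    for name in VAULT_DIRS:
        directory = root / name
        inventory[name] = sorted(p.name for p in directory.iterdir()) if directory.is_dir() else []
    return inventory
```

An agent that has lost its conversational context could feed the result of reorient into its first prompt and pick up from the files on disk.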

Fully Observable: The Streaming Event Model

Internally, every agent action is normalized into one of eight typed events — message, function_call, function_response, file_created, usage, keepalive, error, and completed — and emitted as an async generator. That same stream can be consumed directly in Python, piped to a CLI, or forwarded over Server-Sent Events to a browser, which is what makes the entire research session replayable after the fact rather than a black box.
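The shape of such a stream can be sketched with a typed event and an async generator. The eight event type names are taken from the text; the Event dataclass, its payload field, the fake_session generator, and the to_sse helper are assumptions made for this example. The SSE framing follows the standard "event:" and "data:" line format.

```python
import asyncio
import json
from dataclasses import dataclass, field
from typing import Any, AsyncIterator, Literal

# The eight event types named in the text.
EventType = Literal[
    "message", "function_call", "function_response", "file_created",
    "usage", "keepalive", "error", "completed",
]


@dataclass
class Event:
    type: EventType
    payload: dict[str, Any] = field(default_factory=dict)  # hypothetical payload shape


async def fake_session() -> AsyncIterator[Event]:
    """Stand-in for a real run: yields a short, fixed sequence of events."""
    yield Event("message", {"text": "planning"})
    await asyncio.sleep(0)  # give control back to the event loop, as a real agent would
    yield Event("function_call", {"name": "search_papers"})
    yield Event("function_response", {"name": "search_papers", "hits": 3})
    yield Event("completed")


def to_sse(event: Event) -> str:
    """Frame one event as a Server-Sent Events message."""
    return f"event: {event.type}\ndata: {json.dumps(event.payload)}\n\n"


async def collect(stream: AsyncIterator[Event]) -> list[Event]:
    """Drain a stream into a list, e.g. to store a session for later replay."""
    return [event async for event in stream]
```

Because the consumer only sees an async iterator of events, the same loop can print to a terminal, write a replay log, or push to_sse(event) down an HTTP response.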

Domain-Aware Prompting

The --domain flag injects domain-specific planning and review heuristics into every agent in the graph. Five domain packs ship today: AI/ML, finance, bioinformatics, algorithms, and physics — each tuning what counts as a rigorous baseline, an acceptable ablation, and a publishable result in that field.

The Toolbelt

Agents act on the world through a sandboxed set of tools, all scoped to the run's working directory:

research_ops — Semantic Scholar impact-filtering, multi-source paper search, and ArXiv full-text ingestion, with built-in rate limiting.
code_graph_ops — Graphify-backed codebase graph queries for AST-level structural review.
data_ops — DuckDB-powered SQL over Parquet files without loading datasets into memory.
latex_ops — compiles .tex manuscripts to PDF and surfaces syntax errors.
file_ops / web_ops — sandboxed file I/O and HTTP fetch tools, read-only and path-validated against the working directory.
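Path validation against a working directory is the kind of check these tools depend on. The sketch below shows one common way to do it with pathlib; resolve_in_workdir is a hypothetical helper, and the project's own check may differ.

```python
from pathlib import Path


def resolve_in_workdir(workdir: Path, requested: str) -> Path:
    """Resolve `requested` relative to `workdir`, refusing anything that escapes it."""
    root = workdir.resolve()
    # resolve() collapses ".." segments and follows symlinks before the containment check.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise PermissionError(f"{requested!r} is outside the working directory")
    return candidate
```

Resolving before comparing matters: a naive string-prefix check would accept a path such as results/../../secrets, while the resolved form makes the escape visible.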

Getting Started

The entire pipeline above is one CLI invocation away, and the same engine is available as a plain async Python class for embedding into other systems.

uv run ai-research-engineer \
  "Investigate sparse mixture-of-experts routing in low-resource settings" \
  --mode orchestrated

Full installation steps, CLI flags, the Python API, and the evolve-mode walkthrough live in the docs.


Frequently Asked Questions

What is AI Research Engineer?

AI Research Engineer (codename Archimedes) is an open-source, multi-agent framework that automates the full lifecycle of machine learning research: hypothesis generation, literature review, experiment planning, code implementation, empirical validation, and final manuscript writing.

What is the difference between orchestrated, simple, and evolve mode?

Orchestrated mode runs the full pipeline: ideation, planning, stage-by-stage implementation, reflection, and paper writing. Simple mode skips planning and runs the Claude Code agent directly for narrow coding tasks. Evolve mode runs a Darwinian optimization loop that samples a parent from a FAISS database of past code variants (weighted toward higher scores), mutates it, records the new variant with its empirical score, and tracks the best-scoring generation so a later regression never overwrites it.

Which AI models does AI Research Engineer use?

Orchestration and planning run on Gemini models via Google ADK and LiteLLM/OpenRouter, while code implementation is delegated to Claude (Claude Code CLI, default claude-sonnet-4-5) for surgical, AST-aware code edits.

Can it replicate an existing paper instead of inventing something new?

Yes. The --research-mode flag toggles between "novelty" (inventing new architectures and validating them against the literature) and "replication" (strict reproduction of a target paper or benchmark).

Is the research output reproducible?

Every run produces a structured workspace (the Research Vault) containing the literature reviewed, the plan, the code, experiment logs, metrics, and the final manuscript, plus a full event-by-event session trace that can be replayed.

Is AI Research Engineer open source?

Yes, it is released under the MIT license and available on GitHub at github.com/archimedes-run/ai-research-engineer.

Read the source. It is all open.
