Introduction
In a groundbreaking engineering experiment, Anthropic researcher Nicholas Carlini pushed the boundaries of what’s possible with large language models by orchestrating 16 parallel Claude instances to build a fully functional C compiler from scratch. The result? A 100,000-line Rust-based compiler capable of compiling the Linux 6.9 kernel across multiple architectures—all without human intervention beyond the initial setup.
This isn’t just an impressive technical achievement; it represents a fundamental shift in how we can leverage AI agents for complex, long-running software development projects.
Key Takeaways
- Autonomous Agent Teams: Multiple Claude instances working in parallel on a shared codebase without active human supervision
- Massive Scale: 2,000 Claude Code sessions, 2 billion input tokens, 140 million output tokens, ~$20,000 in API costs
- Real-World Capability: The compiler successfully builds Linux 6.9, QEMU, FFmpeg, SQLite, Redis, and can even compile Doom
- Parallel Specialization: Different agents took on specialized roles—debugging, documentation, code quality, performance optimization
- Critical Lessons: Success required exceptional test design, environmental structure, and careful consideration of LLM limitations
The Problem: Moving Beyond Interactive AI Assistance
Traditional AI coding assistants like Claude Code operate in an interactive paradigm: a user defines a task, the model works for a few minutes, returns results, and waits for follow-up guidance. This human-in-the-loop approach limits the scope of achievable projects.
Carlini asked: What if we could remove the human bottleneck and let agents work autonomously for days or weeks?
The Solution: Agent Teams Architecture
The Core Loop
The foundation is deceptively simple—a bash script that runs Claude in an infinite loop:
```bash
#!/bin/bash
while true; do
  # Name each session's log after the commit it started from.
  COMMIT=$(git rev-parse --short=6 HEAD)
  LOGFILE="agent_logs/agent_${COMMIT}.log"

  # Run one full Claude Code session against the standing prompt, then loop.
  claude --dangerously-skip-permissions \
    -p "$(cat AGENT_PROMPT.md)" \
    --model claude-opus-X-Y &> "$LOGFILE"
done
```
The agent prompt instructs Claude to:
- Break problems into small pieces
- Track current work
- Identify the next task
- Keep going until everything works perfectly
Critical insight: The loop never stops. Claude must continuously identify and solve problems without waiting for human intervention.
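The post doesn't reproduce AGENT_PROMPT.md itself; the sketch below shows what a prompt carrying those four instructions could look like (the wording is illustrative, not Carlini's):

```
You are one of several agents building a C compiler in Rust in this repository.

1. Read README.md and current_tasks/ to see what is already in progress.
2. Pick ONE small, concrete piece of the problem that nobody has claimed.
3. Claim it, implement it, and run the test suite.
4. If anything fails, keep debugging; do not stop until everything passes.
5. Commit, merge with upstream, push, and release your claim.
```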
Parallelization Strategy
Running multiple agents in parallel addresses two fundamental limitations:
1. Concurrency
A single Claude session can only work on one task at a time. With 16 agents running in parallel, multiple bugs can be debugged and features implemented simultaneously.
2. Specialization
Different agents can take on different roles:
- Core development agents fixing bugs and implementing features
- Documentation agent maintaining READMEs and progress files
- Code quality agent identifying and merging duplicate code
- Performance optimization agent improving compiler efficiency
- Design critique agent restructuring code from a Rust expert’s perspective
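Mechanically, fanning the loop out to a fleet (whatever mix of roles it runs) can be as simple as launching N copies, each with its own working copy. A minimal sketch; the repository URL, the agent.sh name, and the AGENT_ID variable are illustrative, and the real setup runs each agent in a fresh container:

```bash
#!/bin/bash
# Fan-out sketch: launch 16 independent copies of the agent loop, each with
# its own clone of the shared repository.
NUM_AGENTS=16

for i in $(seq 1 "$NUM_AGENTS"); do
  git clone --quiet git@example.com:ccc.git "agent_$i"   # placeholder repo URL
  (cd "agent_$i" && AGENT_ID="$i" ./agent.sh) &          # agent.sh = the loop shown earlier
done

wait   # the loops never exit on their own; wait keeps the launcher attached
```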
Synchronization Mechanism
The synchronization algorithm is elegantly simple:
1. Agent takes a "lock" on a task by writing current_tasks/parse_if_statement.txt
2. Agent works on the task in isolation
3. Agent pulls from upstream, merges changes, pushes its work
4. Agent removes the lock
5. Fresh Claude session spawns in new container
6. Repeat
Git handles the coordination. If two agents try to claim the same task, Git's synchronization forces the second to pick a different one. Merge conflicts are frequent, but Claude handles them intelligently.
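A rough sketch of how such a Git-based lock can work; the exact claim-and-release flow below is inferred from the description above rather than taken from the project:

```bash
#!/bin/bash
# Claim a task by committing a lock file. If another agent already pushed the
# same lock, our push (or rebase) fails and we go pick a different task.
TASK="parse_if_statement"
LOCK="current_tasks/${TASK}.txt"

git pull --rebase
[ -e "$LOCK" ] && { echo "task already claimed"; exit 1; }

echo "claimed by agent $AGENT_ID at $(date -u)" > "$LOCK"
git add "$LOCK"
git commit -m "claim: $TASK"
if ! git push; then
  # Someone pushed first. If the rebase conflicts on our lock file, we lost
  # the race for this task; otherwise just retry the push.
  git pull --rebase || { git rebase --abort; echo "lost the race"; exit 1; }
  git push
fi

# ... work on the task, run the tests, commit the fix ...

git rm --quiet "$LOCK"
git commit -m "release: $TASK"
git pull --rebase && git push
```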
Deep Dive: Lessons from Programming with Agent Teams
Lesson 1: Write Exceptionally High-Quality Tests
Claude will autonomously solve whatever problem you define. If your test harness has gaps, Claude will “solve” the wrong problem and introduce subtle bugs.
Carlini’s approach:
- Sourced high-quality compiler test suites (GCC torture tests, etc.)
- Built verifiers for open-source packages
- Implemented continuous integration to prevent regressions
- Designed new tests whenever Claude made systematic mistakes
Example problem: Late in the project, new features frequently broke existing functionality. The solution? Stricter CI enforcement that prevented commits from breaking previously passing tests.
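One way to implement that enforcement is to keep the list of currently passing tests in the repository and reject any commit that shrinks it. A minimal sketch; the run_tests.sh interface and the passing_tests.txt baseline file are assumptions, not the project's actual CI:

```bash
#!/bin/bash
# CI gate: a commit may add newly passing tests but must not break any test
# that was passing before. passing_tests.txt is the committed baseline.
set -e

./run_tests.sh --list-passing | sort > /tmp/now_passing.txt
sort passing_tests.txt > /tmp/was_passing.txt

# Lines only in the old baseline are tests that used to pass and now fail.
regressions=$(comm -23 /tmp/was_passing.txt /tmp/now_passing.txt)
if [ -n "$regressions" ]; then
  echo "CI FAIL: previously passing tests now fail:"
  echo "$regressions"
  exit 1
fi

echo "CI OK: no regressions"
cp /tmp/now_passing.txt passing_tests.txt   # lock in any newly fixed tests
```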
Lesson 2: Design for Claude, Not Humans
The test harness must account for LLM-specific limitations:
Context Window Pollution
Problem: Thousands of lines of test output pollute the context window.
Solutions:
- Tests print minimal output (a few lines max)
- Detailed logs go to files Claude can read when needed
- Log files use grep-friendly formats (e.g., ERROR: reason on a single line)
- Pre-compute aggregate statistics so Claude doesn't have to
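For example, a test runner in this spirit prints a single summary line to stdout and writes one greppable ERROR: line per failure to a log file. The mycc binary name and tests/ layout below are hypothetical:

```bash
#!/bin/bash
# Run the suite but keep stdout tiny: one summary line for Claude, one
# grep-friendly line per failure in the log file.
LOG="test_results.log"
: > "$LOG"
pass=0; fail=0

for t in tests/*.c; do
  if ./mycc "$t" -o /tmp/a.out 2>> "$LOG" && /tmp/a.out > /dev/null 2>&1; then
    pass=$((pass + 1))
  else
    fail=$((fail + 1))
    echo "ERROR: $t: compile or run failure" >> "$LOG"   # greppable format
  fi
done

# The only thing printed into the context window:
echo "PASS=$pass FAIL=$fail (details: grep ERROR $LOG)"
```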
Time Blindness
Problem: Claude can’t tell time and will happily spend hours on low-value work.
Solutions:
- Print incremental progress infrequently
- Implement a --fastmode flag that runs a 1-10% random sample of tests
- Make subsamples deterministic per agent but random across VMs
- This ensures all tests get coverage while each agent can quickly identify regressions
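One way to get subsamples that are deterministic per agent yet different across VMs is to hash the agent's ID together with each test name and keep a fixed fraction. A sketch; AGENT_ID, the FASTMODE flag, and the 5% rate are illustrative choices:

```bash
#!/bin/bash
# --fastmode sketch: run a small slice of the suite, chosen deterministically
# for this AGENT_ID (reruns hit the same tests) but differently on other VMs,
# so the fleet as a whole still covers everything.
SAMPLE_ONE_IN=20   # keep roughly 1 test in 20 (~5%)

should_run() {
  local test_name="$1"
  # Hash (agent, test) into a number; the same agent always gets the same answer.
  local h
  h=$(printf '%s:%s' "$AGENT_ID" "$test_name" | md5sum | cut -c1-8)
  [ $(( 0x$h % SAMPLE_ONE_IN )) -eq 0 ]
}

for t in tests/*.c; do
  if [ "$FASTMODE" = "1" ] && ! should_run "$t"; then
    continue
  fi
  # ... compile and run $t exactly as in the full suite ...
done
```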
Lesson 3: Enable Effective Parallelism
Early in the project, parallelization was trivial—hundreds of failing tests meant each agent could pick a different one. But at 99% pass rate, the situation changed dramatically.
The Linux Kernel Challenge
Compiling the Linux kernel is one giant task, not hundreds of independent tests. All 16 agents hit the same bug, fixed it, and overwrote each other’s changes. Parallelism provided zero benefit.
The breakthrough solution: Use GCC as an “online known-good compiler oracle”
1. Randomly compile 90% of kernel files with GCC
2. Compile remaining 10% with Claude's compiler
3. If kernel works → problem isn't in Claude's subset
4. If kernel breaks → binary search to identify problematic files
5. Each agent works on different files simultaneously
This enabled true parallelism. Each agent debugged different files until Claude’s compiler could compile everything.
Advanced technique: Delta debugging to find pairs of files that failed together but worked independently.
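In code, the oracle plus binary search might look like the sketch below. The use_mycc.list mechanism, the mycc compiler name, and build_and_boot_kernel.sh are assumptions standing in for whatever the real harness did, and a plain bisection like this only isolates single-file culprits; pair-wise failures are what the delta-debugging step addresses:

```bash
#!/bin/bash
# Oracle sketch: build most of the kernel with gcc and a random 10% slice
# with Claude's compiler ("mycc"); if the kernel breaks, bisect the slice.
ALL_FILES=$(find kernel_src -name '*.c' | sort)
N=$(echo "$ALL_FILES" | wc -l)
SUSPECTS=$(echo "$ALL_FILES" | shuf -n $(( N / 10 )))

build_with() {                      # $1 = files to compile with mycc
  echo "$1" > use_mycc.list         # an assumed build wrapper consults this list
  ./build_and_boot_kernel.sh        # nonzero exit = kernel fails to build or boot
}

if build_with "$SUSPECTS"; then
  echo "bug is not in this 10% slice"
  exit 0
fi

# Binary search: keep whichever half of the suspects still breaks the kernel.
while [ "$(echo "$SUSPECTS" | wc -l)" -gt 1 ]; do
  half=$(( $(echo "$SUSPECTS" | wc -l) / 2 ))
  top=$(echo "$SUSPECTS" | head -n "$half")
  if ! build_with "$top"; then
    SUSPECTS="$top"
  else
    SUSPECTS=$(echo "$SUSPECTS" | tail -n +$(( half + 1 )))
  fi
done

echo "kernel breaks when compiled with mycc for: $SUSPECTS"
```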
Lesson 4: Maintain Context Through Documentation
Fresh agents spawn in clean containers with zero context. To help Claude orient itself quickly:
- Extensive READMEs: Updated frequently with current project status
- Progress files: Track what’s been tried, what failed, what’s next
- Failure logs: Document dead-end approaches so other agents don’t retry them
- Task descriptions: Clear explanations of what each lock file represents
You can see this in action in the GitHub repository—read through the commit history and watch agents systematically work through problems.
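Concretely, the orientation material can live at the top of the repository. The layout below is illustrative rather than the project's actual structure; only current_tasks/ and its lock files are described directly in the write-up:

```
repo/
├── README.md                # current status, how to build, how to run the tests
├── PROGRESS.md              # what's done, what's in flight, what to try next
├── FAILED_APPROACHES.md     # dead ends, so fresh agents don't retry them
├── current_tasks/
│   └── parse_if_statement.txt   # one lock file per claimed task
└── agent_logs/              # per-session logs named after the starting commit
```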
The Results: What Claude Built
Capabilities
The 100,000-line compiler:
- Compiles Linux 6.9 on x86, ARM, and RISC-V architectures
- Builds major projects: QEMU, FFmpeg, SQLite, PostgreSQL, Redis
- Achieves a 99% pass rate on the GCC torture test suite
- Runs Doom (the ultimate litmus test)
- Is a clean-room implementation: no internet access, depending only on the Rust standard library
Limitations
Even at this scale, Opus 4.6 reached its limits:
- 16-bit x86 mode: Can’t build the 16-bit real mode bootstrapper needed to boot Linux (calls out to GCC for this phase)
- Assembler and linker: Claude's own assembler and linker are still somewhat buggy, so the demo used GCC's instead
- Not universal: Compiles many projects but isn’t a drop-in GCC replacement yet
- Inefficient code generation: Even with all optimizations enabled, outputs less efficient code than GCC with optimizations disabled
- Code quality: Reasonable Rust code, but not expert-level
Most critically: New features frequently broke existing functionality. The project pushed right up against the capabilities of current models.
The Implications: A New Paradigm for AI-Assisted Development
What This Enables
Agent teams fundamentally expand the scope of autonomous AI development:
- Early models: Tab-completion in IDEs
- GPT-3 era: Complete function bodies from docstrings
- Claude Code: Interactive pair programming
- Agent Teams: Implement entire complex projects autonomously
The Risks
Carlini expresses both excitement and unease:
“I used to work in penetration testing, exploiting vulnerabilities in products produced by large companies, and the thought of programmers deploying software they’ve never personally verified is a real concern.”
Key dangers:
- Tests might pass while hiding subtle bugs
- No real-time human oversight to catch errors
- Easy to assume “passing tests = done” when this is rarely true
- Autonomous code generation at scale could introduce systemic vulnerabilities
Looking Forward
This experiment demonstrates that autonomous, long-running agent teams are possible today with current models. As models improve:
- Projects of even greater complexity become feasible
- Cost-per-project decreases
- Speed of autonomous development accelerates
- Need for new safety strategies becomes critical
Carlini concludes:
“I did not expect this to be anywhere near possible so early in 2026. The rapid progress in both language models and the scaffolds we use to interact with them opens the door to writing an enormous amount of new code.”
Practical Applications and Takeaways
For developers and organizations considering agent teams:
When to Use Agent Teams
✅ Good fit:
- Long-running projects with clear success metrics (tests)
- Problems that can be decomposed into parallel sub-tasks
- Well-defined technical requirements
- Sufficient budget for extended API usage
❌ Poor fit:
- Projects requiring frequent architectural decisions
- Vague or evolving requirements
- Tasks where test quality is uncertain
- Critical systems where unverified code is unacceptable
Cost-Benefit Analysis
This project cost ~$20,000 in API fees. Compare that to:
- Hiring a team of developers for weeks
- The opportunity cost of not completing the project
- The learning value of pushing model capabilities to their limits
For research, exploration, and certain commercial applications, this trade-off can make sense.
Key Success Factors
- Test quality is paramount: Invest heavily in test design upfront
- Design for LLMs, not humans: Account for context limits, time blindness, etc.
- Enable parallelism: Structure work so multiple agents can make independent progress
- Maintain context: Extensive documentation helps agents orient quickly
- Specialize agents: Different roles prevent conflicts and improve efficiency
Conclusion
Anthropic’s agent teams experiment represents a watershed moment in AI-assisted development. By orchestrating 16 parallel Claude instances working autonomously for two weeks, Nicholas Carlini demonstrated that current models can implement projects of remarkable complexity—if the harness is designed correctly.
The resulting C compiler—capable of building Linux, QEMU, and Doom—is an impressive artifact. But the real insight is the methodology: how to structure tests, manage parallelism, handle LLM limitations, and enable sustained autonomous progress.
As models continue to improve, agent teams will become increasingly practical for production use. The challenge ahead is developing them responsibly, with appropriate safeguards and human oversight, while harnessing their transformative potential.
The future of software development is arriving faster than expected. Are we ready?
References
- Original article: Building a C compiler with a team of parallel Claudes by Nicholas Carlini
- Source code: Claude’s C Compiler on GitHub
- GCC Torture Tests: Official documentation
This post is a comprehensive walkthrough and analysis of Anthropic’s groundbreaking work. For the latest updates, follow the GitHub repository where Claude continues pushing improvements.