Introduction
In a groundbreaking engineering experiment, Anthropic researcher Nicholas Carlini pushed the boundaries of what’s possible with large language models by orchestrating 16 parallel Claude instances to build a fully functional C compiler from scratch. The result? A 100,000-line Rust-based compiler capable of compiling the Linux 6.9 kernel across multiple architectures—all without human intervention beyond the initial setup.
This isn’t just an impressive technical achievement; it represents a fundamental shift in how we can leverage AI agents for complex, long-running software development projects.
Key Takeaways
- Autonomous Agent Teams: Multiple Claude instances working in parallel on a shared codebase without active human supervision
- Massive Scale: 2,000 Claude Code sessions, 2 billion input tokens, 140 million output tokens, ~$20,000 in API costs
- Real-World Capability: The compiler successfully builds Linux 6.9, QEMU, FFmpeg, SQLite, Redis, and can even compile Doom
- Parallel Specialization: Different agents took on specialized roles—debugging, documentation, code quality, performance optimization
- Critical Lessons: Success required exceptional test design, environmental structure, and careful consideration of LLM limitations
The Problem: Moving Beyond Interactive AI Assistance
Traditional AI coding assistants like Claude Code operate in an interactive paradigm: a user defines a task, the model works for a few minutes, returns results, and waits for follow-up guidance. This human-in-the-loop approach limits the scope of achievable projects.
Carlini asked: What if we could remove the human bottleneck and let agents work autonomously for days or weeks?
The Solution: Agent Teams Architecture
The Core Loop
The foundation is deceptively simple—a bash script that runs Claude in an infinite loop:
```bash
#!/bin/bash
while true; do
  # Name each session's log after the commit it started from.
  COMMIT=$(git rev-parse --short=6 HEAD)
  LOGFILE="agent_logs/agent_${COMMIT}.log"

  # Run one full Claude Code session against the standing prompt, then loop.
  claude --dangerously-skip-permissions \
    -p "$(cat AGENT_PROMPT.md)" \
    --model claude-opus-X-Y &> "$LOGFILE"
done
```
The agent prompt instructs Claude to:
- Break problems into small pieces
- Track current work
- Identify the next task
- Keep going until everything works perfectly
Critical insight: The loop never stops. Claude must continuously identify and solve problems without waiting for human intervention.
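The post doesn't reproduce AGENT_PROMPT.md itself; the sketch below shows what a prompt carrying those four instructions could look like (the wording is illustrative, not Carlini's):

```
You are one of several agents building a C compiler in Rust in this repository.

1. Read README.md and current_tasks/ to see what is already in progress.
2. Pick ONE small, concrete piece of the problem that nobody has claimed.
3. Claim it, implement it, and run the test suite.
4. If anything fails, keep debugging; do not stop until everything passes.
5. Commit, merge with upstream, push, and release your claim.
```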
Parallelization Strategy
Running multiple agents in parallel addresses two fundamental limitations:
1. Concurrency
A single Claude session can only work on one task at a time. With 16 agents running in parallel, multiple bugs can be debugged and features implemented simultaneously.
2. Specialization
Different agents can take on different roles:
- Core development agents fixing bugs and implementing features
- Documentation agent maintaining READMEs and progress files
- Code quality agent identifying and merging duplicate code
- Performance optimization agent improving compiler efficiency
- Design critique agent restructuring code from a Rust expert’s perspective
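Mechanically, fanning the loop out to a fleet (whatever mix of roles it runs) can be as simple as launching N copies, each with its own working copy. A minimal sketch; the repository URL, the agent.sh name, and the AGENT_ID variable are illustrative, and the real setup runs each agent in a fresh container:

```bash
#!/bin/bash
# Fan-out sketch: launch 16 independent copies of the agent loop, each with
# its own clone of the shared repository.
NUM_AGENTS=16

for i in $(seq 1 "$NUM_AGENTS"); do
  git clone --quiet git@example.com:ccc.git "agent_$i"   # placeholder repo URL
  (cd "agent_$i" && AGENT_ID="$i" ./agent.sh) &          # agent.sh = the loop shown earlier
done

wait   # the loops never exit on their own; wait keeps the launcher attached
```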
Synchronization Mechanism
The synchronization algorithm is elegantly simple:
1. Agent takes a "lock" on a task by writing current_tasks/parse_if_statement.txt
2. Agent works on the task in isolation
3. Agent pulls from upstream, merges changes, pushes its work
4. Agent removes the lock
5. Fresh Claude session spawns in new container
6. Repeat
Git handles the coordination. If two agents try to claim the same task, Git's synchronization forces the second to pick a different one. Merge conflicts are frequent, but Claude handles them intelligently.
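A rough sketch of how such a Git-based lock can work; the exact claim-and-release flow below is inferred from the description above rather than taken from the project:

```bash
#!/bin/bash
# Claim a task by committing a lock file. If another agent already pushed the
# same lock, our push (or rebase) fails and we go pick a different task.
TASK="parse_if_statement"
LOCK="current_tasks/${TASK}.txt"

git pull --rebase
[ -e "$LOCK" ] && { echo "task already claimed"; exit 1; }

echo "claimed by agent $AGENT_ID at $(date -u)" > "$LOCK"
git add "$LOCK"
git commit -m "claim: $TASK"
if ! git push; then
  # Someone pushed first. If the rebase conflicts on our lock file, we lost
  # the race for this task; otherwise just retry the push.
  git pull --rebase || { git rebase --abort; echo "lost the race"; exit 1; }
  git push
fi

# ... work on the task, run the tests, commit the fix ...

git rm --quiet "$LOCK"
git commit -m "release: $TASK"
git pull --rebase && git push
```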
Deep Dive: Lessons from Programming with Agent Teams
Lesson 1: Write Exceptionally High-Quality Tests
Claude will autonomously solve whatever problem you define. If your test harness has gaps, Claude will “solve” the wrong problem and introduce subtle bugs.
Carlini’s approach:
- Sourced high-quality compiler test suites (GCC torture tests, etc.)
- Built verifiers for open-source packages
- Implemented continuous integration to prevent regressions
- Designed new tests whenever Claude made systematic mistakes
Example problem: Late in the project, new features frequently broke existing functionality. The solution? Stricter CI enforcement that prevented commits from breaking previously passing tests.
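One way to implement that enforcement is to keep the list of currently passing tests in the repository and reject any commit that shrinks it. A minimal sketch; the run_tests.sh interface and the passing_tests.txt baseline file are assumptions, not the project's actual CI:

```bash
#!/bin/bash
# CI gate: a commit may add newly passing tests but must not break any test
# that was passing before. passing_tests.txt is the committed baseline.
set -e

./run_tests.sh --list-passing | sort > /tmp/now_passing.txt
sort passing_tests.txt > /tmp/was_passing.txt

# Lines only in the old baseline are tests that used to pass and now fail.
regressions=$(comm -23 /tmp/was_passing.txt /tmp/now_passing.txt)
if [ -n "$regressions" ]; then
  echo "CI FAIL: previously passing tests now fail:"
  echo "$regressions"
  exit 1
fi

echo "CI OK: no regressions"
cp /tmp/now_passing.txt passing_tests.txt   # lock in any newly fixed tests
```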
Lesson 2: Design for Claude, Not Humans
The test harness must account for LLM-specific limitations:
Context Window Pollution
Problem: Thousands of lines of test output pollute the context window.
Solutions:
- Tests print minimal output (a few lines max)
- Detailed logs go to files Claude can read when needed
- Log files use grep-friendly formats (e.g., ERROR: reason on a single line)
- Pre-compute aggregate statistics so Claude doesn't have to
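For example, a test runner in this spirit prints a single summary line to stdout and writes one greppable ERROR: line per failure to a log file. The mycc binary name and tests/ layout below are hypothetical:

```bash
#!/bin/bash
# Run the suite but keep stdout tiny: one summary line for Claude, one
# grep-friendly line per failure in the log file.
LOG="test_results.log"
: > "$LOG"
pass=0; fail=0

for t in tests/*.c; do
  if ./mycc "$t" -o /tmp/a.out 2>> "$LOG" && /tmp/a.out > /dev/null 2>&1; then
    pass=$((pass + 1))
  else
    fail=$((fail + 1))
    echo "ERROR: $t: compile or run failure" >> "$LOG"   # greppable format
  fi
done

# The only thing printed into the context window:
echo "PASS=$pass FAIL=$fail (details: grep ERROR $LOG)"
```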
Time Blindness
Problem: Claude can’t tell time and will happily spend hours on low-value work.
Solutions:
- Print incremental progress infrequently
- Implement a --fastmode flag that runs a 1-10% random sample of tests
- Make subsamples deterministic per agent but random across VMs
- This ensures all tests get coverage while each agent can quickly identify regressions
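One way to get subsamples that are deterministic per agent yet different across VMs is to hash the agent's ID together with each test name and keep a fixed fraction. A sketch; AGENT_ID, the FASTMODE flag, and the 5% rate are illustrative choices:

```bash
#!/bin/bash
# --fastmode sketch: run a small slice of the suite, chosen deterministically
# for this AGENT_ID (reruns hit the same tests) but differently on other VMs,
# so the fleet as a whole still covers everything.
SAMPLE_ONE_IN=20   # keep roughly 1 test in 20 (~5%)

should_run() {
  local test_name="$1"
  # Hash (agent, test) into a number; the same agent always gets the same answer.
  local h
  h=$(printf '%s:%s' "$AGENT_ID" "$test_name" | md5sum | cut -c1-8)
  [ $(( 0x$h % SAMPLE_ONE_IN )) -eq 0 ]
}

for t in tests/*.c; do
  if [ "$FASTMODE" = "1" ] && ! should_run "$t"; then
    continue
  fi
  # ... compile and run $t exactly as in the full suite ...
done
```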
Lesson 3: Enable Effective Parallelism
Early in the project, parallelization was trivial—hundreds of failing tests meant each agent could pick a different one. But at 99% pass rate, the situation changed dramatically.
The Linux Kernel Challenge
Compiling the Linux kernel is one giant task, not hundreds of independent tests. All 16 agents hit the same bug, fixed it, and overwrote each other’s changes. Parallelism provided zero benefit.
The breakthrough solution: Use GCC as an “online known-good compiler oracle”
1. Randomly compile 90% of kernel files with GCC
2. Compile remaining 10% with Claude's compiler
3. If kernel works → problem isn't in Claude's subset
4. If kernel breaks → binary search to identify problematic files
5. Each agent works on different files simultaneously
This enabled true parallelism. Each agent debugged different files until Claude’s compiler could compile everything.
Advanced technique: Delta debugging to find pairs of files that failed together but worked independently.
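In code, the oracle plus binary search might look like the sketch below. The use_mycc.list mechanism, the mycc compiler name, and build_and_boot_kernel.sh are assumptions standing in for whatever the real harness did, and a plain bisection like this only isolates single-file culprits; pair-wise failures are what the delta-debugging step addresses:

```bash
#!/bin/bash
# Oracle sketch: build most of the kernel with gcc and a random 10% slice
# with Claude's compiler ("mycc"); if the kernel breaks, bisect the slice.
ALL_FILES=$(find kernel_src -name '*.c' | sort)
N=$(echo "$ALL_FILES" | wc -l)
SUSPECTS=$(echo "$ALL_FILES" | shuf -n $(( N / 10 )))

build_with() {                      # $1 = files to compile with mycc
  echo "$1" > use_mycc.list         # an assumed build wrapper consults this list
  ./build_and_boot_kernel.sh        # nonzero exit = kernel fails to build or boot
}

if build_with "$SUSPECTS"; then
  echo "bug is not in this 10% slice"
  exit 0
fi

# Binary search: keep whichever half of the suspects still breaks the kernel.
while [ "$(echo "$SUSPECTS" | wc -l)" -gt 1 ]; do
  half=$(( $(echo "$SUSPECTS" | wc -l) / 2 ))
  top=$(echo "$SUSPECTS" | head -n "$half")
  if ! build_with "$top"; then
    SUSPECTS="$top"
  else
    SUSPECTS=$(echo "$SUSPECTS" | tail -n +$(( half + 1 )))
  fi
done

echo "kernel breaks when compiled with mycc for: $SUSPECTS"
```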
Lesson 4: Maintain Context Through Documentation
Fresh agents spawn in clean containers with zero context. To help Claude orient itself quickly:
- Extensive READMEs: Updated frequently with current project status
- Progress files: Track what’s been tried, what failed, what’s next
- Failure logs: Document dead-end approaches so other agents don’t retry them
- Task descriptions: Clear explanations of what each lock file represents
You can see this in action in the GitHub repository—read through the commit history and watch agents systematically work through problems.
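Concretely, the orientation material can live at the top of the repository. The layout below is illustrative rather than the project's actual structure; only current_tasks/ and its lock files are described directly in the write-up:

```
repo/
├── README.md                # current status, how to build, how to run the tests
├── PROGRESS.md              # what's done, what's in flight, what to try next
├── FAILED_APPROACHES.md     # dead ends, so fresh agents don't retry them
├── current_tasks/
│   └── parse_if_statement.txt   # one lock file per claimed task
└── agent_logs/              # per-session logs named after the starting commit
```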
The Results: What Claude Built
Capabilities
The 100,000-line compiler:
- Compiles Linux 6.9 on x86, ARM, and RISC-V architectures
- Builds major projects: QEMU, FFmpeg, SQLite, PostgreSQL, Redis
- Achieves a 99% pass rate on the GCC torture test suite
- Runs Doom (the ultimate litmus test)
- Is a clean-room implementation: no internet access, depending only on the Rust standard library
Limitations
Even at this scale, Opus 4.6 reached its limits:
- 16-bit x86 mode: Can’t build the 16-bit real mode bootstrapper needed to boot Linux (calls out to GCC for this phase)
- Assembler and linker: Claude's own assembler and linker are still somewhat buggy, so the demo used GCC's instead
- Not universal: Compiles many projects but isn’t a drop-in GCC replacement yet
- Inefficient code generation: Even with all optimizations enabled, outputs less efficient code than GCC with optimizations disabled
- Code quality: Reasonable Rust code, but not expert-level
Most critically: New features frequently broke existing functionality. The project pushed right up against the capabilities of current models.
The Implications: A New Paradigm for AI-Assisted Development
What This Enables
Agent teams fundamentally expand the scope of autonomous AI development:
- Early models: Tab-completion in IDEs
- GPT-3 era: Complete function bodies from docstrings
- Claude Code: Interactive pair programming
- Agent Teams: Implement entire complex projects autonomously
The Risks
Carlini expresses both excitement and unease:
“I used to work in penetration testing, exploiting vulnerabilities in products produced by large companies, and the thought of programmers deploying software they’ve never personally verified is a real concern.”
Key dangers:
- Tests might pass while hiding subtle bugs
- No real-time human oversight to catch errors
- Easy to assume “passing tests = done” when this is rarely true
- Autonomous code generation at scale could introduce systemic vulnerabilities
Looking Forward
This experiment demonstrates that autonomous, long-running agent teams are possible today with current models. As models improve:
- Projects of even greater complexity become feasible
- Cost-per-project decreases
- Speed of autonomous development accelerates
- Need for new safety strategies becomes critical
Carlini concludes:
“I did not expect this to be anywhere near possible so early in 2026. The rapid progress in both language models and the scaffolds we use to interact with them opens the door to writing an enormous amount of new code.”
Practical Applications and Takeaways
For developers and organizations considering agent teams:
When to Use Agent Teams
✅ Good fit:
- Long-running projects with clear success metrics (tests)
- Problems that can be decomposed into parallel sub-tasks
- Well-defined technical requirements
- Sufficient budget for extended API usage
❌ Poor fit:
- Projects requiring frequent architectural decisions
- Vague or evolving requirements
- Tasks where test quality is uncertain
- Critical systems where unverified code is unacceptable
Cost-Benefit Analysis
This project cost ~$20,000 in API fees. Compare that to:
- Hiring a team of developers for weeks
- The opportunity cost of not completing the project
- The learning value of pushing model capabilities to their limits
For research, exploration, and certain commercial applications, this trade-off can make sense.
Key Success Factors
- Test quality is paramount: Invest heavily in test design upfront
- Design for LLMs, not humans: Account for context limits, time blindness, etc.
- Enable parallelism: Structure work so multiple agents can make independent progress
- Maintain context: Extensive documentation helps agents orient quickly
- Specialize agents: Different roles prevent conflicts and improve efficiency
Conclusion
Anthropic’s agent teams experiment represents a watershed moment in AI-assisted development. By orchestrating 16 parallel Claude instances working autonomously for two weeks, Nicholas Carlini demonstrated that current models can implement projects of remarkable complexity—if the harness is designed correctly.
The resulting C compiler—capable of building Linux, QEMU, and Doom—is an impressive artifact. But the real insight is the methodology: how to structure tests, manage parallelism, handle LLM limitations, and enable sustained autonomous progress.
As models continue to improve, agent teams will become increasingly practical for production use. The challenge ahead is developing them responsibly, with appropriate safeguards and human oversight, while harnessing their transformative potential.
The future of software development is arriving faster than expected. Are we ready?
References
- Original article: Building a C compiler with a team of parallel Claudes by Nicholas Carlini
- Source code: Claude’s C Compiler on GitHub
- GCC Torture Tests: Official documentation
This post is a comprehensive walkthrough and analysis of Anthropic’s groundbreaking work. For the latest updates, follow the GitHub repository where Claude continues pushing improvements.