ARC-AGI Living Survey
A continuously updated map of methods, results, and lessons from ARC-AGI
Survey methodology (and how the “living” part works)
This page summarizes the most decision-relevant findings from our ARC-AGI survey. The main product is a living survey that we
update as new papers, reports, and competition results appear—so readers can track what changed, what transferred, and what still fails as the benchmark evolves.
Continuous dataset collection. We maintain a structured catalog (a database) of ARC-AGI-related works. Early in the project, we fixed a stable set of fields so entries remain comparable over time: benchmark version and split, evaluation setting, method ingredients, reported performance, compute/cost when available, and concise notes on claims and limitations. The sources from which related works are collected are: x,y,z…
New items are continuously discovered via keyword monitoring, citation trails, competition writeups, and community links. Three reviewers then screen the material, extract the same fields for each work, reconcile differences, and assign taxonomy labels. This keeps updates fast without sacrificing consistency.
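The fixed field set described above can be sketched as a small data model. This is an illustrative sketch only: the class and field names below are our own choices for this example, not the actual database schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CatalogEntry:
    """One ARC-AGI-related work, recorded with a fixed field set
    so entries stay comparable over time (illustrative schema)."""
    title: str
    benchmark_version: str            # e.g. "ARC-AGI-1"
    split: str                        # e.g. "public eval"
    evaluation_setting: str           # e.g. "pass@2"
    method_ingredients: list[str] = field(default_factory=list)
    reported_performance: Optional[float] = None  # accuracy, percent
    compute_cost: Optional[str] = None            # when available
    notes: str = ""                   # concise claims / limitations

# Example entry with invented values, for illustration only:
entry = CatalogEntry(
    title="Example solver paper",
    benchmark_version="ARC-AGI-1",
    split="public eval",
    evaluation_setting="pass@2",
    method_ingredients=["induction", "test-time adaptation"],
    reported_performance=42.0,
    notes="Strong on symmetry tasks; brittle on counting tasks.",
)
print(entry.benchmark_version, entry.reported_performance)
```

Fixing such a schema early is what lets reviewers extract the same fields for every work and reconcile differences mechanically rather than ad hoc.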
Update policy. The living survey is refreshed regularly; we treat it as a maintained reference rather than a one-off snapshot.
Table. ARC-AGI benchmark evolution (why comparisons across versions are hard)
| Feature | ARC-AGI-1 | ARC-AGI-2 | ARC-AGI-3 |
|---|---|---|---|
| Release date | Nov 2019 | Mar 2025 | Jul 2025 |
| Format | Static | Static | Interactive |
| Number of tasks | 1,000 | 600 | Variable |
| Grid size | ≤ 30×30 | ≤ 30×30 | 64×64 |
| Color palette | 10 colors | 10 colors | 16 colors |
| Best AI performance | 90.5% | 54.2%† / 24.03%‡ | 12.58% |
| Human performance | 100% | 100% | 100% |
| Performance gap | 9.5% | 46%–76% | 87.42% |
†Commercial system. ‡Competition winner (ARC Prize 2025).
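The "Performance gap" row is simply the distance between the best AI score and the flat 100% human baseline. A minimal sketch of that arithmetic, using the best reported scores from the table (for ARC-AGI-2 we take the competition-winner score; the commercial system's 54.2% gives the lower end of the 46%–76% range):

```python
# Reproduce the "Performance gap" column: gap = human baseline (100%) - best AI score.
best_ai = {
    "ARC-AGI-1": 90.5,
    "ARC-AGI-2": 24.03,  # competition winner; the commercial system scored 54.2
    "ARC-AGI-3": 12.58,
}
HUMAN_BASELINE = 100.0

gaps = {version: round(HUMAN_BASELINE - score, 2) for version, score in best_ai.items()}
print(gaps)  # {'ARC-AGI-1': 9.5, 'ARC-AGI-2': 75.97, 'ARC-AGI-3': 87.42}
```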
Interpretation: each generation keeps the “few examples → infer the rule” premise, but adds difficulty (and in ARC-AGI-3, interaction). The human baseline stays flat at 100%; the gap widens sharply.
The headline result: a performance cliff across generations
The survey’s central empirical pattern is the performance cliff: improvements on ARC-AGI-1 do not yield comparable gains on ARC-AGI-2, and the gap becomes dramatic in ARC-AGI-3.
How to read the figure. The left panel shows one representative task from each generation, illustrating how requirements escalate—from static pattern induction to deeper compositional structure, and finally (ARC-AGI-3) interactive exploration. The right panel uses stacked bars: the darker segment is best AI performance; the lighter segment is the remaining gap to a 100% human baseline.
Why we feature this on the website. It’s the quickest “state of the field” visual: it communicates that ARC-AGI progress is not just about pushing a single score upward, but about building methods that transfer across benchmark generations without collapsing under new types of novelty.
Figure. Example tasks + performance cliff
What tends to work (and how we classify methods)
To make hundreds of results legible, we label solver approaches with a compact taxonomy—capturing both the data regime (what the solver learns from) and the solver ingredients (what it does at inference time). We emphasize three recurring ingredients that show up in many stronger systems:
• Induction: inferring an explicit rule/program from the examples.
• Transduction: direct mapping from input to output using learned representations.
• Test-time adaptation: generate hypotheses, check against constraints, and refine.
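These three ingredients can be illustrated on a toy task. The sketch below shows the induction flavor of the generate-hypotheses-then-check loop: enumerate candidate rules, keep the one consistent with all training pairs, and apply it to a test input. The candidate functions and the toy grids are invented for this example; real ARC solvers search far richer program spaces.

```python
# Toy generate -> check loop over a tiny hypothesis space (illustrative only).
train_pairs = [  # few-shot examples: input grid -> output grid
    ([[1, 0], [0, 1]], [[0, 1], [1, 0]]),
    ([[0, 2], [2, 2]], [[2, 0], [2, 2]]),
]

def identity(grid):
    return [row[:] for row in grid]

def transpose(grid):
    return [list(col) for col in zip(*grid)]

def flip_horizontal(grid):
    return [list(reversed(row)) for row in grid]

candidates = [identity, transpose, flip_horizontal]  # hypothesis space

def consistent(program, pairs):
    """Check a candidate rule against every training constraint."""
    return all(program(x) == y for x, y in pairs)

# Induction: keep the first candidate that explains all examples.
rule = next(p for p in candidates if consistent(p, train_pairs))
print(rule.__name__)            # flip_horizontal
print(rule([[3, 0], [0, 3]]))   # apply the inferred rule to a test input
```

Transduction would instead map inputs to outputs directly with a learned model, and test-time adaptation would loop further: when no candidate passes the check, refine or regenerate hypotheses rather than stopping.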
The diagram summarizes the landscape: flows start from dataset type (left), pass through approach families (middle), and end in performance bands (right). A consistent lesson is that high performance often comes from combinations, especially when adaptation/verification is used to correct candidate solutions.
Why this is useful to readers. Instead of reading dozens of papers, the taxonomy lets you quickly answer: “What family is this method in?”, “What ingredients does it rely on?”, and “Is this the kind of approach that historically reaches the higher performance bins?”
- Duration:
Project team members: Sahar Vahdati, Andrei Aioanei, Haridhra Suresh, Jens Lehmann
Project’s website

Project Partners



