Running Karpathy's Autoresearch on a Mac

What I learned from adapting Karpathy's autoresearch idea to Apple Silicon so agent-driven LLM experiments can run locally instead of only on large GPU servers.

Apr 23, 2026 · 3 min read · Autoresearch, LLMs, Apple Silicon, Research Engineering

One of the projects I spent time on recently was adapting Karpathy's autoresearch idea so it could run on a MacBook instead of assuming a large NVIDIA setup.

The original idea is compelling because it changes the role of the researcher. Instead of manually tuning one experiment at a time, you give an agent a compact training setup, a fixed time budget, and a set of instructions. The agent edits the training code, runs a short experiment, checks the metric, keeps what works, discards what does not, and repeats.

That sounds simple in theory. In practice, making that workflow usable on Apple Silicon requires a lot of engineering discipline.

The core idea in plain English

Autoresearch is not "AI does science by magic."

It is a very structured loop:

keep the codebase small
restrict what the agent is allowed to edit
give every run the same time budget
compare experiments using one stable metric
log everything so the next iteration has context

The benefit is not just speed. Because every run follows the same rules, the research process also becomes more reproducible.
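The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: `propose` and `run_experiment` are hypothetical callables standing in for the agent and for one time-boxed training run, and the sketch assumes a metric where lower is better.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trial:
    idea: str      # short description of the change the agent tried
    metric: float  # the single comparison metric (assumed: lower is better)
    kept: bool     # whether the change became the new baseline


def research_loop(
    propose: Callable[[List[Trial]], str],
    run_experiment: Callable[[str], float],
    baseline_metric: float,
    num_trials: int,
) -> List[Trial]:
    """Run num_trials keep-or-discard iterations and return the full log."""
    history: List[Trial] = []
    best = baseline_metric
    for _ in range(num_trials):
        idea = propose(history)        # the agent sees every earlier result
        metric = run_experiment(idea)  # one run under the fixed time budget
        kept = metric < best           # keep only strict improvements
        if kept:
            best = metric
        history.append(Trial(idea, metric, kept))  # discarded runs are logged too
    return history
```

The important detail is that `history` records discarded trials as well as kept ones, so the next proposal can take failed ideas into account.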

What I changed for macOS

The original workflow carries CUDA-specific assumptions and is easiest to picture on a GPU server. A MacBook is a different environment:

attention kernels differ
compile paths behave differently
memory limits are tighter
batch size choices matter much earlier

So the work became less about inventing a new model and more about making the research loop survive on smaller hardware.
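One common way to cope with tighter memory is gradient accumulation: keep the effective batch fixed and split it into smaller micro-batches that fit on the device. The sketch below shows only the arithmetic. It is an assumption about how one might handle this, not a description of the project's code, and the function and parameter names are hypothetical.

```python
def accumulation_steps(target_batch_tokens: int, micro_batch_size: int, seq_len: int) -> int:
    """Number of micro-batches to accumulate before each optimizer step."""
    tokens_per_micro_batch = micro_batch_size * seq_len
    if target_batch_tokens % tokens_per_micro_batch != 0:
        raise ValueError(
            "target_batch_tokens must be a multiple of micro_batch_size * seq_len"
        )
    return target_batch_tokens // tokens_per_micro_batch


# Illustrative numbers only: a 65,536-token batch with 1,024-token sequences.
# Shrinking the micro-batch from 8 to 2 sequences quadruples the step count.
print(accumulation_steps(65_536, micro_batch_size=8, seq_len=1_024))  # 8
print(accumulation_steps(65_536, micro_batch_size=2, seq_len=1_024))  # 32
```

The trade-off is time: more accumulation steps mean fewer optimizer updates inside a fixed wall-clock budget, which is why batch size choices bite early on small hardware.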

The most important changes were:

removing hard assumptions around FlashAttention-style kernels
falling back to native PyTorch attention paths when needed
tuning memory usage for Apple Silicon constraints
keeping the training setup small enough that short autonomous runs still finish reliably

That matters because if the loop is fragile, agents do not "research" effectively. They just spend their time crashing.
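A minimal sketch of the fallback idea, assuming PyTorch 2.x. The helper names are hypothetical and the "flash" branch is left abstract on purpose; the only real library calls used are `torch.cuda.is_available`, `torch.backends.mps.is_available`, and `torch.nn.functional.scaled_dot_product_attention`.

```python
def pick_attention_backend(device_type: str, flash_available: bool) -> str:
    """Return "flash" only where a fused CUDA kernel can run, else "sdpa"."""
    if device_type == "cuda" and flash_available:
        return "flash"
    return "sdpa"  # native PyTorch attention; works on MPS and CPU


def pick_device_type() -> str:
    """Prefer CUDA, then Apple's Metal backend (MPS), then CPU."""
    import torch  # imported lazily so the selection logic above needs no PyTorch

    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def causal_attention_sdpa(q, k, v):
    """Native causal attention; q, k, v have shape (batch, heads, seq, head_dim)."""
    import torch.nn.functional as F

    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

On a Mac, `pick_attention_backend(pick_device_type(), flash_available=False)` resolves to `"sdpa"`, so the training script never reaches a CUDA-only kernel.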

Why the fixed 5-minute budget matters

One design choice I like a lot in autoresearch is the fixed wall-clock budget.

Every run gets the same amount of time. That means the question becomes:

what is the best model or training change I can discover under this exact budget?

That is much more useful than comparing experiments where one secretly trained much longer than another.
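A fixed wall-clock budget is simple to enforce. The sketch below is illustrative only: `train_for_budget` and `step` are hypothetical names, and the 300-second default reflects the 5-minute budget mentioned above. Whether setup and evaluation count against the budget is a design choice this sketch does not settle.

```python
import time
from typing import Callable


def train_for_budget(
    step: Callable[[], None],
    budget_seconds: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call step() repeatedly until the budget is spent; return the step count."""
    deadline = clock() + budget_seconds
    steps = 0
    while clock() < deadline:  # checked before each step, so the last one may overshoot
        step()
        steps += 1
    return steps
```

`time.monotonic` is used rather than `time.time` because it cannot jump backwards if the system clock is adjusted mid-run. Passing the clock as a parameter also makes the loop easy to test with a fake clock.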

For local research, this matters even more. On a Mac, you are not trying to imitate a giant cluster. You are trying to create a loop that is:

stable
cheap
comparable
easy to run overnight

What this taught me about research tooling

The most interesting part of the project was not a single model trick. It was learning that good research tooling is mostly about constraints.

If you want agents to improve a model autonomously, you need:

a tiny surface area to edit
clear "do not touch this" boundaries
one metric that decides whether an idea stays
logs that make bad experiments useful instead of wasted

That is why the project stays intentionally small. One preparation file. One training file. One instruction file. The simplicity is part of the method.
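Two of these constraints are easy to enforce mechanically. The sketch below assumes the training file is called `train.py` and the others `prepare.py` and `program.md`; those names, the helper functions, and the log fields are all illustrative assumptions rather than the project's actual layout.

```python
import json
from typing import Iterable, List, Optional

# Assumed file name: the only file the agent may modify in this sketch.
EDITABLE = frozenset({"train.py"})


def forbidden_edits(changed_paths: Iterable[str], editable=EDITABLE) -> List[str]:
    """Return the changed paths that fall outside the editable set."""
    return sorted(p for p in changed_paths if p not in editable)


def format_log_record(idea: str, metric: Optional[float], kept: bool, note: str = "") -> str:
    """Serialize one experiment as a JSON line; metric is None for crashed runs."""
    return json.dumps(
        {"idea": idea, "metric": metric, "kept": kept, "note": note},
        sort_keys=True,
    )


# A run that touched the preparation file is rejected before it is scored.
print(forbidden_edits(["train.py", "prepare.py"]))  # ['prepare.py']
print(format_log_record("wider MLP", None, False, note="out of memory"))
```

Logging crashed runs with a note, instead of dropping them, is what lets a bad experiment inform the next proposal.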

Why I think this matters

Much of the discussion around agentic research focuses on intelligence. I think a lot of the real leverage is in the design of the research environment.

If the environment is clean, constrained, and measurable, even modest autonomous systems become much more useful.

That is why I found the macOS version worth building. It lowers the cost of experimenting with the idea and makes autonomous research feel less like a demo and more like an actual workflow.

The practical takeaway

My main takeaway from working on this is simple:

autonomous research becomes much more believable when it runs on ordinary hardware with clear rules.

That does not make the problem easy. But it does make it testable.

And once something is testable, you can improve it.