Running Karpathy's Autoresearch on a Mac

What I learned from adapting Karpathy's autoresearch idea to Apple Silicon so agent-driven LLM experiments can run locally instead of only on large GPU servers.

One of the projects I spent time on recently was adapting Karpathy's autoresearch idea so it could run on a MacBook instead of assuming a large NVIDIA setup.

The original idea is compelling because it changes the role of the researcher. Instead of manually tuning one experiment at a time, you give an agent a compact training setup, a fixed time budget, and a set of instructions. The agent edits the training code, runs a short experiment, checks the metric, keeps what works, discards what does not, and repeats.

That sounds simple in theory. In practice, making that workflow usable on Apple Silicon requires a lot of engineering discipline.

The core idea in plain English

Autoresearch is not "AI does science by magic."

It is a very structured loop:

  • keep the codebase small
  • restrict what the agent is allowed to edit
  • give every run the same time budget
  • compare experiments using one stable metric
  • log everything so the next iteration has context

The benefit is not just speed. It is that the research process becomes more reproducible.
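
To make the loop concrete, here is a minimal sketch of it in Python. It is an illustration under stated assumptions, not the project's actual code: `propose` stands in for the agent, `run_experiment` stands in for one fixed-budget training run, and the metric is assumed to be a validation loss where lower is better. All of these names are hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Result:
    idea: str
    metric: float  # assumed: validation loss, lower is better
    kept: bool


def research_loop(propose, run_experiment, baseline, n_iters, log_path=None):
    """Keep an idea only if it beats the best metric seen so far."""
    best = baseline
    history = []
    for _ in range(n_iters):
        idea = propose(history)        # the agent sees every earlier result
        metric = run_experiment(idea)  # one run under the shared time budget
        kept = metric < best
        if kept:
            best = metric
        result = Result(idea, metric, kept)
        history.append(result)
        if log_path is not None:
            # Append-only JSON lines, so discarded ideas stay on record too.
            with open(log_path, "a") as f:
                f.write(json.dumps(asdict(result)) + "\n")
    return best, history


# Stand-ins for the agent and the training run (made-up numbers).
fake_metrics = {"wider-mlp": 1.9, "higher-lr": 2.3, "warmup": 1.7}
ideas = iter(fake_metrics)
best, history = research_loop(
    propose=lambda history: next(ideas),
    run_experiment=lambda idea: fake_metrics[idea],
    baseline=2.0,
    n_iters=3,
)
print(best, [r.kept for r in history])
```

The point of the sketch is that every decision reduces to one comparison against one number, and that discarded runs are logged alongside the ones that were kept.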

What I changed for macOS

The original workflow assumes a GPU server and bakes in CUDA-specific choices. A MacBook is a different environment:

  • attention kernels differ
  • compile paths behave differently
  • memory limits are tighter
  • batch size choices matter much earlier

So the work became less about inventing a new model and more about making the research loop survive on smaller hardware.

The most important changes were:

  • removing hard assumptions around FlashAttention-style kernels
  • falling back to native PyTorch attention paths when needed
  • tuning memory usage for Apple Silicon constraints
  • keeping the training setup small enough that short autonomous runs still finish reliably

That matters because if the loop is fragile, agents do not "research" effectively. They just spend their time crashing.
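
The fallback logic can be sketched roughly like this. It is an assumption about how such a check might look, not the code from the project: the helper names are made up, and the only library calls used are `torch.cuda.is_available()`, `torch.backends.mps.is_available()`, and `torch.nn.functional.scaled_dot_product_attention`. PyTorch is imported lazily so the selection helpers can be exercised on a machine without it.

```python
import importlib.util


def module_available(name: str) -> bool:
    """True if a top-level module called `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


def choose_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then plain CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def choose_attention(device: str, flash_installed: bool) -> str:
    """Use FlashAttention only on CUDA; otherwise use PyTorch's built-in path."""
    # Assumption: the flash_attn package has no MPS backend, so on a Mac
    # the native scaled_dot_product_attention is the path to take.
    if device == "cuda" and flash_installed:
        return "flash_attn"
    return "sdpa"


def detect_backend():
    """Inspect the current machine; needs PyTorch installed."""
    import torch

    device = choose_device(
        torch.cuda.is_available(), torch.backends.mps.is_available()
    )
    return device, choose_attention(device, module_available("flash_attn"))


def causal_attention(q, k, v):
    """Native PyTorch attention; q, k, v are (batch, heads, seq, head_dim)."""
    import torch.nn.functional as F

    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

On an Apple Silicon machine with a recent PyTorch build, `detect_backend()` would be expected to return `("mps", "sdpa")`. Deciding once at startup keeps the agent's edits from ever hitting an import error halfway through a run.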

Why the fixed 5-minute budget matters

One design choice I like a lot in autoresearch is the fixed wall-clock budget.

Every run gets the same amount of time. That means the question becomes:

what is the best model or training change I can discover under this exact budget?

That is much more useful than comparing experiments where one secretly trained much longer than another.
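
A wall-clock budget is easy to enforce in code. The sketch below is one possible shape for it, with hypothetical names: `step_fn` stands in for a single training step that returns a loss, and the clock is injectable so the loop can be tested without waiting.

```python
import time


def train_for_budget(step_fn, budget_seconds, clock=time.monotonic):
    """Call step_fn until the wall-clock budget is spent.

    Returns the number of completed steps and the last loss value.
    """
    deadline = clock() + budget_seconds
    steps = 0
    last_loss = None
    while clock() < deadline:
        last_loss = step_fn(steps)  # one optimizer step, returns a loss
        steps += 1
    return steps, last_loss


if __name__ == "__main__":
    # Toy step function with a made-up loss curve; a real run would
    # pass 5 * 60 as the budget.
    steps, loss = train_for_budget(lambda i: 1.0 / (i + 1), budget_seconds=0.05)
    print(steps, loss)
```

Because the deadline is checked before each step, a step that starts just before the deadline can overshoot it slightly. `time.monotonic` is used instead of `time.time` so that system clock adjustments cannot stretch or shrink a run.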

For local research, this matters even more. On a Mac, you are not trying to imitate a giant cluster. You are trying to create a loop that is:

  • stable
  • cheap
  • comparable
  • easy to run overnight

What this taught me about research tooling

The most interesting part of the project was not a single model trick. It was learning that good research tooling is mostly about constraints.

If you want agents to improve a model autonomously, you need:

  • a tiny surface area to edit
  • clear "do not touch this" boundaries
  • one metric that decides whether an idea stays
  • logs that make bad experiments useful instead of wasted

That is why the project stays intentionally small. One preparation file. One training file. One instruction file. The simplicity is part of the method.
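
Those boundaries can be checked mechanically. Here is a small sketch of a gate that looks at which files a change touched before it looks at the metric. The file name `train.py` and the function `review_change` are hypothetical, chosen only for illustration, and the metric is again assumed to be lower-is-better.

```python
def review_change(changed_files, new_metric, best_metric,
                  editable=frozenset({"train.py"})):
    """Decide what to do with one agent edit.

    Returns ("reject", offending_files), ("keep", []) or ("discard", []).
    """
    # Anything outside the editable set breaks the "do not touch" rule,
    # no matter how good the metric looks.
    violations = sorted(set(changed_files) - set(editable))
    if violations:
        return "reject", violations
    if new_metric < best_metric:
        return "keep", []
    return "discard", []


print(review_change(["train.py"], 1.8, 2.0))
print(review_change(["train.py", "prepare.py"], 1.0, 2.0))
```

The list of changed files could come from something like `git diff --name-only`. Checking the edit surface first means an experiment that "wins" by altering the data preparation or the instructions never reaches the metric comparison.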

Why I think this matters

A lot of discussion around agentic research focuses on intelligence. I think much of the real leverage is in the design of the research environment.

If the environment is clean, constrained, and measurable, even modest autonomous systems become much more useful.

That is why I found the macOS version worth building. It lowers the cost of experimenting with the idea and makes autonomous research feel less like a demo and more like an actual workflow.

The practical takeaway

My main takeaway from working on this is simple:

autonomous research becomes much more believable when it runs on ordinary hardware with clear rules.

That does not make the problem easy. But it does make it testable.

And once something is testable, you can improve it.