Jun 8, 2026

i gave my coding agents a shared memory

a weekend thing: when an agent finishes a task it writes down what it learned, and the next agent reads those notes before it starts.

every coding agent i run starts the day with amnesia. it'll solve something gnarly at 2pm, figure out the exact wrong assumption that cost it twenty minutes, and then the session ends and all of that just evaporates. the next morning a fresh agent walks into the same repo and relearns the same lesson from scratch. i do this too, honestly, but at least i keep notes.

so i spent an evening giving them notes.

where this came from

two threads from the past few months lodged in my head and wouldn't leave.

boris cherny, who built claude code, said in a talk earlier this year: "i don't prompt claude anymore. my job is to write loops." he described how agent use has evolved in stages — first you're prompting, then you're running loops, then you're spinning up thousands of overnight subagents while you sleep. each step removes the human one more level from execution. i found that framing clarifying — not because the automation is the interesting part, but because of what it implies. at that scale, every agent is doing real work, hitting real dead ends, figuring things out. and none of it goes anywhere.

then andrej karpathy released autoresearch in march, watched 3,300 people fork it in a week, and pointed at the exact problem: "git tracks what was done, not what was learned." he called for something like SETI@home for agents — a distributed system where they collaborate like a research community rather than run as isolated individuals. the shape of the solution was right there in the framing. the findings need somewhere to go.

anthropic's dynamic workflows feature, which shipped in late may 2026, made this feel even more urgent. it lets claude spin up and coordinate up to a thousand parallel subagents in a single session — plan, execute, verify, done. it's genuinely impressive. but when the workflow ends, the script dies and everything it learned dies with it. the execution layer is solved. the learning layer doesn't exist yet.

then last week peter steinberger, who built openclaw, put it more bluntly: "you shouldn't be prompting coding agents anymore. you should be designing loops that prompt your agents." two builders, independently, arriving at the same framing within months of each other. that's usually a sign something real is happening.

agent-findings is my small attempt at closing that gap: a lightweight protocol that sits below the loop layer and aims to turn every agent run, across every developer and every repo, into a compounding knowledge base that the next run inherits.

the idea

it's two hooks and a folder of json. nothing clever.

after a task finishes, a hook reads back over what just happened and distills it into one small record: what worked, what didn't, what the better stopping signal would have been, and how confident it is that the lesson actually generalizes. that record gets written to a shared folder on disk.
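here's a minimal sketch of the writing half in python, assuming the record has already been distilled into a dict. `write_finding` and the `findings/<task_type>/<id>.json` layout are illustrative guesses modeled on the example path in this post, not agent-findings' actual code. it writes to a temp file first and renames it into place, so an agent reading the shared folder mid-write never sees half a record.

```python
import json
import os
import secrets
import tempfile
from pathlib import Path


def write_finding(record: dict, root: Path) -> Path:
    """Write one finding to root/<task_type>/<random id>.json and return the path."""
    folder = root / record["task_type"]
    folder.mkdir(parents=True, exist_ok=True)
    # 8 random hex characters, like the example filename; no coordination needed
    path = folder / f"{secrets.token_hex(4)}.json"
    # write to a temp file in the same folder, then rename it into place;
    # readers that only glob *.json never see a partial record
    fd, tmp = tempfile.mkstemp(dir=folder, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(record, f, indent=2)
        os.replace(tmp, path)
    except Exception:
        os.unlink(tmp)  # don't leave stray temp files behind on failure
        raise
    return path
```

one caveat with this sketch: `os.replace` silently overwrites, so if two records ever drew the same id, the older one would be lost.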

before the next task starts, a second hook looks at what i'm asking for, finds the few most relevant records, and pastes them in as a little "here's what past agents figured out" note. the agent reads it the way you'd skim a coworker's comment before touching their code.
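the reading half ends by turning a handful of records into that note. a sketch, assuming the records arrive already ranked (ranking is a separate step) and that whatever the hook prints ends up in the agent's context; `render_note`, the header wording, and `limit=3` are my own placeholders.

```python
def render_note(records: list[dict], limit: int = 3) -> str:
    """Format the top-ranked findings as a short note for the agent's context."""
    if not records:
        return ""  # nothing relevant: add no note at all
    lines = ["here's what past agents figured out:"]
    for r in records[:limit]:
        lines.append(
            f"- [{r['task_type']}, {r.get('language', 'any')}] "
            f"confidence {r['confidence']:.2f}"
        )
        # lead with the dead end, since skipping it saves the most time
        lines.append(f"  didn't work: {r['what_failed']}")
        lines.append(f"  worked: {r['what_worked']}")
        lines.append(f"  stop when: {r['better_exit_condition']}")
    return "\n".join(lines)
```

returning an empty string when nothing matches keeps the prompt untouched, which seems better than a note that says "no findings".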

run it for a week and that folder turns into a growing pile of hard-won lessons that any agent can add to or pull from.

the record is the whole point

i tried hard to make the moving parts boring, so that the one thing that matters could be the artifact. every lesson is a single json file, and it looks like this:

findings/debugging/7c2a1e90.json

{
  "task_type": "debugging",
  "language": "python",
  "what_worked": "the flaky test was a clock dependency, not a network race. injecting a clock into the constructor made it deterministic.",
  "what_failed": "spent 20 min adding retries assuming a race. re-running 'until green' just hid the real non-determinism.",
  "better_exit_condition": "a test that passes on retry but fails in isolation means the code is non-deterministic, not the harness. don't stop until it's 20/20 in isolation.",
  "tags": ["flaky-test", "pytest", "non-determinism"],
  "confidence": 0.86
}

that's the protocol. everything else is just plumbing around this shape. the what_failed field turned out to be the most valuable one, because skipping a known dead end is worth more than reading about another success.
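since everything else hangs off this shape, it's worth pinning down in code. one way to do it in python, sketched: a frozen dataclass (matching the idea that a record never changes once written) that parses the json, requires every field from the example, and ignores extra keys. the class name `Finding` and the range check on `confidence` are my assumptions, not the repo's schema.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Finding:
    task_type: str
    language: str
    what_worked: str
    what_failed: str
    better_exit_condition: str
    tags: tuple[str, ...]
    confidence: float

    @classmethod
    def from_json(cls, text: str) -> "Finding":
        data = json.loads(text)
        known = {f.name for f in fields(cls)}
        missing = known - set(data)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        # keep only known keys, so older readers tolerate newer records
        kwargs = {k: data[k] for k in known}
        kwargs["tags"] = tuple(kwargs["tags"])
        finding = cls(**kwargs)
        if not 0.0 <= finding.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {finding.confidence}")
        return finding
```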

confidence has a floor. anything under 0.5 never gets written down. a weak, overfit lesson is worse than no lesson at all, because later it shows up wearing the exact same authority as a good one.
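the floor is simple enough to show in full. a sketch of the gate a writer might run before saving; `should_write` is a hypothetical name, and treating a missing or non-numeric confidence as a rejection is my assumption, in the spirit of "no lesson beats a weak one".

```python
CONFIDENCE_FLOOR = 0.5  # anything under this never gets written down


def should_write(record: dict, floor: float = CONFIDENCE_FLOOR) -> bool:
    """Return True only if the record's confidence clears the floor."""
    confidence = record.get("confidence")
    # bool is a subclass of int in python, so rule it out explicitly
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return False  # no usable confidence: treat it as too weak to keep
    return confidence >= floor


candidates = [{"confidence": 0.86}, {"confidence": 0.42}, {"tags": ["x"]}]
kept = [r for r in candidates if should_write(r)]  # only the 0.86 record survives
```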

the two bugs that taught me something

here's the part i liked: while i was building it, the thing started catching its own bugs, because it was already in the habit of writing them down.

the first one. claude code's "stop" hook fires at the end of every single turn, not once when a task is actually "done". and there's no clean done signal to wait for. so my writer cheerfully logged a near-identical record on every reply, and after a ten-message conversation the store was ten copies of the same lesson. the fix was to stamp each record with the session it came from and keep only the latest one per session.
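a sketch of that fix applied on the reading side, so the files on disk never have to change: group by session and keep the newest. `session_id` and `written_at` are hypothetical field names (neither appears in the example record above), and `written_at` is assumed to sort chronologically, like a unix timestamp or an iso-8601 utc string. records with no session stamp pass through untouched.

```python
def latest_per_session(records: list[dict]) -> list[dict]:
    """Collapse the per-turn duplicates: keep the newest record for each session."""
    latest: dict[str, dict] = {}
    unstamped: list[dict] = []
    for r in records:
        sid = r.get("session_id")
        if sid is None:
            unstamped.append(r)  # records without a stamp are kept as-is
        elif sid not in latest or r["written_at"] > latest[sid]["written_at"]:
            latest[sid] = r  # a later turn of the same session supersedes earlier ones
    return unstamped + list(latest.values())
```

doing this at read time keeps every file write-once, which is what lets the merge story later in the post stay so simple.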

the second one came from the reader being too loose. i was tokenizing the prompt and scoring findings by substring match, so short, common words in a prompt like "haiku about the sea" were matching inside longer field values ("the" sits inside "authentication", for instance), and completely unrelated findings kept surfacing. the fix was to match only on whole hyphen-delimited parts of a field, so "python" still matches "python-async" but "the" no longer matches a word it merely sits inside.
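the difference between the two matchers, as a python sketch. `tokenize`, `substring_hit`, `part_hit`, and `score` are illustrative names, and scoring against `task_type`, `language`, and `tags` is my guess at which fields the reader looks at. the loose version lets short words like "the" or "a" hit inside longer values; the strict one only counts a token that equals a whole hyphen-delimited part.

```python
import re
from typing import Callable


def tokenize(text: str) -> set[str]:
    """Lowercased alphanumeric words from the prompt."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def substring_hit(token: str, value: str) -> bool:
    # the loose version: the token can appear anywhere inside the value
    return token in value.lower()


def part_hit(token: str, value: str) -> bool:
    # the fix: the token must equal one whole hyphen-delimited part
    return token in value.lower().split("-")


def score(prompt: str, record: dict, hit: Callable[[str, str], bool] = part_hit) -> int:
    """Count (token, field value) pairs that match."""
    tokens = tokenize(prompt)
    values = [record.get("task_type", ""), record.get("language", ""), *record.get("tags", [])]
    return sum(1 for v in values if v for t in tokens if hit(t, v))
```

passing `hit=substring_hit` reproduces the loose behavior; the default is the stricter one.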

both of those are findings in the store now. so the first real thing the system ever learned was how it got built.

where this goes

right now it's all local. the obvious next move is sharing, and the data model is already built for it.

because every record is immutable and named by a random id, merging two people's stores is genuinely just putting the files in the same folder. no conflicts, as long as the ids are long enough that two machines never draw the same one. so the sync layer can start as the dumbest possible version of itself: a git repo full of json. push your lessons up, pull everyone else's down, rebuild the index.
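a minimal sketch of that dumbest-possible merge, assuming both stores use the `findings/<task_type>/<id>.json` layout from the example path in this post. `merge_stores` is a hypothetical name. it copies any file the destination doesn't have, skips byte-identical ones, and refuses to overwrite a same-named file with different contents, reporting it instead.

```python
import filecmp
import shutil
from pathlib import Path


def merge_stores(src: Path, dst: Path) -> list[Path]:
    """Copy every finding from src into dst; return same-name conflicts it skipped."""
    collisions: list[Path] = []
    for path in src.rglob("*.json"):
        target = dst / path.relative_to(src)
        if target.exists():
            if not filecmp.cmp(path, target, shallow=False):
                collisions.append(target)  # same id, different lesson: never overwrite
            continue  # identical file already present, nothing to do
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return collisions
```

that reported case is the only way a conflict can sneak in: two machines drawing the same random id. if ids are as short as the 8 hex characters in the example filename (32 bits), the birthday bound puts even odds of at least one collision at roughly 77,000 records, so a longer id is cheap insurance once many people share a store.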

that's karpathy's SETI@home vision made embarrassingly concrete. one agent learning from its own past is mildly useful. a hundred agents, scattered across a hundred repos and a dozen companies, quietly leaving each other notes — that starts to look like a research community.

anyway. it's a folder of json and two shell scripts, maybe two hundred lines total. sometimes the small hacks are the ones you end up using every day.

the code is at github.com/krish-shahh/agent-findings if you want to install it or poke around.