
my gt research: catching agents that burn money in loops (week one)

first weekly update on STRATA, my cs 8903 project at georgia tech: detecting malicious mcp servers that trick agents into wasting 100x the compute while still giving you the right answer.

the project in two sentences: by editing nothing but its tool descriptions, a malicious mcp server can steer an ai agent into wasteful tool-call loops that inflate cost 100x or more while still returning the correct answer, so nobody notices. STRATA is a monitor that catches these attacks from the shape of the agent's tool-call behavior, since the content looks benign and the tokens look normal.

i'm doing this as cs 8903 research at georgia tech this summer, advised by prof. vijay madisetti, and i'm writing a short update every week. mostly so future me can see how the thing actually unfolded instead of the cleaned-up version that ends up in the paper. this one covers everything up to my first real progress meeting, which was today. since the paper is still in progress, these posts will stay light on the how and heavier on the what.

the problem, slightly longer

the attack class is called resource amplification, and the paper that names it best is overthinking-loops (feb 2026): up to 142x token amplification, 971x on single problems, induced purely through tool-facing text. descriptions, validation messages, the stuff the agent reads to decide what to call. the agent gets nudged into cyclic patterns: call the tool again to "verify", refine the result one more time, take a quick detour through a side chain. each call looks reasonable. the answer comes back correct. the bill is 100x.
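to make "cyclic patterns" concrete, here is a minimal sketch of one structural signal you could read off a tool-call trace: the share of adjacent call pairs that repeat. this is an illustration only, not how STRATA works and not a method from any of the papers. the function name, the tool names, and both traces are made up for the example.

```python
from collections import Counter


def repeated_ngram_share(calls, n=2):
    """Fraction of length-n windows in a tool-call sequence that occur more than once.

    Hypothetical helper for illustration; not part of STRATA or any cited paper.
    """
    windows = [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]
    if not windows:
        return 0.0
    counts = Counter(windows)
    # Count every window whose pattern appears at least twice in the trace.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(windows)


# Made-up traces: a straight-line benign run and a verify/refine loop.
benign = ["search", "fetch", "summarize"]
looping = ["search", "verify", "refine", "verify", "refine",
           "verify", "refine", "summarize"]

print(repeated_ngram_share(benign))
print(repeated_ngram_share(looping))
```

the point of the toy: every individual call in the looping trace is plausible on its own, and only the sequence as a whole gives the loop away.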

beyond-max-tokens reports 658x with a different trick (per-turn token inflation instead of loops), and toolflood gets a 95% attack success rate by flooding retrieval. tool poisoning has a CVE and an owasp agentic top-10 entry now, so this isn't a purely academic threat.

the research gap: the overthinking-loops authors tested the obvious token-level defense, showed it fails, and wrote that defenses should reason about tool-call structure instead. then they didn't build one. nobody has. that's the gap STRATA goes after.

i also have a personal stake here. at clearwater i built a production mcp server running a distributed engine for 35+ llm agents serving 100+ researchers. that's exactly the integration layer this attack targets, and an operator at that scale has no visibility into whether a connected server is silently inflating cost per query. if i were still expanding that fleet, i'd want this solved first.

progress this week

my advisor's rule is replicate before you build, so that's most of what this week was.

the replications all ran from the authors' shipped artifacts, so none of them needed a gpu:

  • mcptox: recovered the reported o1-mini attack success rate of 72.8% exactly, from their 1,348 labeled cases.
  • agentdojo: the tool-filter defense gives 6.8% attack success on gpt-4o, matching the cited baseline, from their committed run transcripts.
  • agentlab: characterized their 200-attack set. 87% of attacks succeed while still completing the original task, and that "still completes the task" property is the whole reason this attack class is stealthy.
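the numbers in the list above are all simple rates over labeled cases. a minimal sketch of that arithmetic, assuming each case is a dict with two boolean labels. the field names `attack_succeeded` and `task_completed` are hypothetical, not the actual schema of mcptox, agentdojo, or agentlab, and the four cases are toy data, not results.

```python
def attack_success_rate(cases):
    """Percentage of labeled cases where the attack succeeded."""
    if not cases:
        raise ValueError("need at least one labeled case")
    hits = sum(1 for case in cases if case["attack_succeeded"])
    return 100.0 * hits / len(cases)


def succeeded_and_completed_rate(cases):
    """Percentage of cases where the attack succeeded AND the original task still completed."""
    if not cases:
        raise ValueError("need at least one labeled case")
    hits = sum(
        1 for case in cases
        if case["attack_succeeded"] and case["task_completed"]
    )
    return 100.0 * hits / len(cases)


# Toy data with hypothetical field names; not taken from any benchmark.
toy_cases = [
    {"attack_succeeded": True, "task_completed": True},
    {"attack_succeeded": True, "task_completed": True},
    {"attack_succeeded": True, "task_completed": False},
    {"attack_succeeded": False, "task_completed": True},
]

print(attack_success_rate(toy_cases))
print(succeeded_and_completed_rate(toy_cases))
```

the second rate is the one that matters for stealth: an attack that succeeds but breaks the task gets noticed, one that succeeds and completes the task does not.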

then the re-implementations of the two papers most central to my project (more on why in the roadblocks section):

  • overthinking-loops: rebuilt the paper's three cyclic tool families as protocol-compatible mcp tools and verified them against a live served qwen3-8b. the poisoned tools inflate output cost 2.6x to 4.2x over benign runs. total spend for that validation: about 9 cents.
  • beyond-max-tokens: rebuilt the attack per the paper. it reproduces amplification above 400x in its capped test.
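for reference, an amplification factor like the ones above is just a ratio of cost under attack to cost on matched benign runs. a minimal sketch, assuming cost is measured as output tokens per run and the factor is the ratio of means. the papers may define it differently (per problem, per turn, or on total tokens), and the token counts below are toy values, not measurements.

```python
from statistics import mean


def amplification_factor(poisoned_tokens, benign_tokens):
    """Ratio of mean output tokens under attack to mean output tokens on benign runs.

    Assumes ratio-of-means; this is one possible definition, chosen for illustration.
    """
    if not poisoned_tokens or not benign_tokens:
        raise ValueError("need at least one run on each side")
    baseline = mean(benign_tokens)
    if baseline <= 0:
        raise ValueError("benign baseline must be positive")
    return mean(poisoned_tokens) / baseline


# Toy token counts, made up for the example.
benign_runs = [500, 700, 600]
poisoned_runs = [1500, 2100, 1800]

print(amplification_factor(poisoned_runs, benign_runs))
```

one design note: a ratio of means is dominated by the longest runs, so a single runaway problem can pull the headline number up a lot, which is worth keeping in mind when comparing aggregate figures with single-problem ones.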

on the writing side, every reference system plus our proposed monitor is now diagrammed, all in one scroll, and the paper skeleton is set up with the intro and related work drafted.

roadblocks (and how they got handled)

the biggest one: the two most important papers shipped no code. overthinking-loops and beyond-max-tokens are the attacks STRATA exists to detect, and neither released an implementation. i couldn't replicate from artifacts like the others, so i re-implemented both from the papers. the risk with a re-implementation is that you build something that doesn't actually behave like the published attack, so each one got its own verification: overthinking-loops against a live model (the 2.6x to 4.2x inflation above), beyond-max-tokens against the paper's own capped amplification test (above 400x). not the same as the authors' code, but verified enough to build on.

second: live numbers don't match paper numbers (yet). the 2.6x to 4.2x i measured on qwen3-8b is a long way from the 142x in the overthinking-loops paper. that's expected, since their headline numbers come from bigger reasoning models on harder tasks, but it's worth being honest that the cheap validation run only confirms the tools induce the behavior, not the full magnitude. confirming amplification at scale is what the upcoming runs are for.

and one that could have been a problem but wasn't: compute. everything this week ran either on cpu from shipped artifacts or on a tiny live check. total spend so far is under $5 of a $200 runpod budget. a cup, not the ocean.

next week

scale up trace generation on a single fixed model, start evaluating the monitor against the standard baselines, and broaden the attack coverage past one family. also keep writing the paper.

more next week.