DEV Community: pueding

Google Releases Gemma 4 12B: Encoder-Free Multimodal Projection

pueding — Tue, 16 Jun 2026 11:17:41 +0000

What: Google released Gemma 4 12B, an open multimodal model whose headline trick is encoder-free multimodal projection — it turns images and audio into tokens by projecting them straight into the token space, instead of running them through a dedicated encoder network.

Why: The separate vision and audio encoders most multimodal models carry are extra parameters, compute, and latency that run before the language model sees anything; dropping them is a big reason a 12B model can field pictures and sound inside 16 GB of memory.

vs prior: Versus the standard recipe — a frozen vision transformer (ViT) plus a projector bolted onto a text model — Gemma 4 12B has no separate encoder at all: each image patch becomes a token through one matrix multiply directly into the backbone.

Think of it as

A meeting where guests either go through a translator or speak the language.

              IMAGE / AUDIO ARRIVES
                       │
        ┌──────────────┴──────────────┐
        │                             │
 ┌──────▼───────┐             ┌───────▼──────┐
 │  THE OLD WAY │             │ ENCODER-FREE │
 │via translator│             │ speak direct │
 └──────┬───────┘             └───────┬──────┘
        │                             │
  a whole vision/audio          one matrix-multiply
  encoder runs first            projects to a token
        │                             │
        ▼                             ▼
 ✗ extra params + latency      ✓ same token space,
   before the LLM looks          ~16 GB, lower latency

text token = a guest who already speaks the room's language
vision/audio encoder = a separate translator the old way routes pictures and sound through
encoder-free projection = one matrix-multiply that puts vision and audio into the room's language directly
shared token space = the single language every guest speaks once inside

Quick glossary

Encoder-free (VLM) — A multimodal model with no separate encoder for non-text inputs — rather than run an image through a vision network first, it projects the raw input straight into the model's token space. The lineage runs through research models like Fuyu and EVE.

Vision encoder / ViT — A Vision Transformer — a stack of attention-and-MLP layers that turns an image into feature vectors. In the usual recipe it sits in front of the language model as a second network; encoder-free designs delete it.

Patch — An image is cut into a grid of small squares (e.g. 16×16 pixels). Each patch is flattened into a list of raw numbers and treated as one unit of input — the visual equivalent of a text token.

Projection — A single matrix multiply that maps a vector of one size onto a vector of another. Here it maps a flattened image patch onto a vector the same width as a word's embedding — so the result is a token; audio is folded into that same space.

Token / embedding space — A transformer doesn't read words or pixels; it reads dense vectors. The "embedding space" is the shared vector format every input must arrive in — putting images and audio there is what lets one backbone read all three.

Native audio — Audio handled inside the model as tokens, rather than transcribed to text by a separate speech model first. Gemma 4 12B is the first mid-sized Gemma to take audio in natively.

The news. On June 3, 2026, Google released Gemma 4 12B, an Apache-2.0 model that drops the separate vision and audio encoders most multimodal models bolt on. Instead it projects both kinds of input straight into the language backbone: vision through a lightweight module (reportedly a single matrix multiply plus positional and normalization terms), and audio into the same embedding space as text tokens. It is the first mid-sized Gemma to take native audio input, runs on 16 GB of VRAM or unified memory, and reportedly scores near Google's larger 26B mixture-of-experts model.

Picture the meeting. A text prompt is a guest who already speaks the room's language — it walks in and starts talking. A picture and a sound clip don't: the usual fix hires a separate translator for each, a whole second staffer who listens, re-voices everything, and only then lets the guest join. Those translators are the model's vision and audio encoders — extra networks that run before the language model sees a thing. Gemma 4 12B fires the translators. It teaches pictures and sound to speak the room's language directly, in one quick step, so every guest — text, image, audio — sits at the same table as an ordinary token.

Underneath the metaphor, "speaking the room's language" means landing in the model's embedding space — the dense vectors a transformer actually consumes. A token ID becomes a vector by a lookup; an image patch becomes one by a projection. As a toy example, cut a 256×256 image into 16×16 patches and you get 256 patches, each a flat list of 16·16·3 = 768 raw numbers. The old way pushes patches like these through a vision transformer — tens of attention-and-MLP layers — before the LLM gets a single feature. Gemma's encoder-free path instead, by Google's description, applies a single matrix multiply (plus a positional term and normalization) that turns each patch straight into a token, the same shape as a word's embedding. Audio is projected into that same space too. The whole pre-LLM encoder stack collapses to that one projection — and the backbone itself takes over the visual and acoustic processing.
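The toy example above can be run end to end. The sketch below is a minimal pure-Python illustration, not Gemma's actual module: the embedding width `D_MODEL = 8`, the random weights, and the helper names `patchify` and `project` are assumptions made for the example, and the positional and normalization terms are left out.

```python
import random

PATCH = 16      # patch side in pixels, as in the toy example
D_MODEL = 8     # hypothetical embedding width, kept tiny so this runs fast


def patchify(image, patch):
    """Cut an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                value
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for value in pixel
            ]
            patches.append(flat)
    return patches


def project(vector, weight):
    """One matrix multiply: (patch_dim,) times (patch_dim, d_model)."""
    d_model = len(weight[0])
    return [
        sum(v * weight[i][j] for i, v in enumerate(vector))
        for j in range(d_model)
    ]


rng = random.Random(0)
# A random 256 x 256 RGB "image" standing in for real pixels.
image = [[[rng.random() for _ in range(3)] for _ in range(256)] for _ in range(256)]
patches = patchify(image, PATCH)

patch_dim = len(patches[0])
# Hypothetical projection weights; a trained model would learn these.
weight = [[rng.gauss(0.0, 0.02) for _ in range(D_MODEL)] for _ in range(patch_dim)]
tokens = [project(p, weight) for p in patches]

print(len(patches), patch_dim, len(tokens[0]))  # prints: 256 768 8
```

Each of the 256 patches comes out as one vector of width `D_MODEL`, which is all "becoming a token" means at this level; a real backbone would use a far wider embedding.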

| Approach | How an image enters | Separate encoder? | Cost profile |
| --- | --- | --- | --- |
| Encoder-based (ViT + projector) | image → vision transformer (tens of layers) → projector → tokens | yes: a full vision network runs first | more parameters and latency before the first output token |
| Encoder-free (Gemma 4 12B) | patches → one matrix multiply (+ position/norm) → tokens | no separate encoder | ~16 GB, lower pre-decode latency (Google, reported) |

Removing the encoder stack has consequences, but the wins are concrete. A separate vision tower is parameters you store, compute you run, and latency you pay before the first output token; deleting it is a big reason a 12B model can field images and audio inside 16 GB rather than needing a datacenter card, and part of why Google can claim quality near its 26B mixture-of-experts model despite the smaller, simpler stack. The catch is that the backbone now has to learn visual and acoustic structure itself, with no pretrained encoder doing that work for it — which is plausibly why this ships as a 12B model trained for it from the start rather than a vision adapter glued onto an existing text model. The architectural specifics beyond the single-matmul description are not yet fully documented.
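To get a feel for the "parameters you store" part, a rough count helps. The sketch below uses the common approximation of about 12·d² weights per transformer layer (4·d² for attention, 8·d² for an MLP with 4× expansion). The sizes are chosen purely for illustration: a 24-layer, 1,024-wide vision tower and a 4,096-wide backbone. None of them are published Gemma 4 figures.

```python
def transformer_layer_params(width):
    """Approximate weights in one transformer layer: 4*d^2 attention + 8*d^2 MLP."""
    return 12 * width * width


def encoder_params(width, layers):
    """Approximate weights in a ViT-style encoder stack (embeddings ignored)."""
    return layers * transformer_layer_params(width)


def projection_params(patch_dim, d_model):
    """Weights in a single patch-to-token projection matrix (bias ignored)."""
    return patch_dim * d_model


# Hypothetical sizes for illustration only; not Gemma 4's real dimensions.
vit = encoder_params(width=1024, layers=24)
proj = projection_params(patch_dim=16 * 16 * 3, d_model=4096)

print(f"encoder stack : {vit / 1e6:.0f}M weights")   # prints: encoder stack : 302M weights
print(f"one projection: {proj / 1e6:.1f}M weights")  # prints: one projection: 3.1M weights
print(f"ratio         : {vit / proj:.0f}x")          # prints: ratio         : 96x
```

The count covers stored weights only. As the paragraph above notes, the work the tower used to do does not vanish; the backbone has to absorb it.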

The payoff is a cleaner idea of what "multimodal" even requires. You don't strictly need a bespoke eye and ear bolted onto a language model; if every input can be projected into the same token space, one backbone can read all of them. Gemma 4 12B is a bet that for a small, open model meant to run on modest hardware, fewer moving parts beats a heavier, more specialized stack.

Related explainers

GLM-5V — native multimodal vs vision-bolted — the neighboring question: training a model multimodal from the start versus adapting a text model, a different axis than removing the encoder
Gemini Omni — modality unification in a shared token space — the same "one token space for every modality" idea, taken to full any-to-any generation
Gemma 4 QAT — quantization-aware training — the other route to running a real model on modest hardware: shrink the bits, instead of removing the encoder

FAQ

What is encoder-free multimodal projection?

It is a way to make a language model multimodal without a separate vision or audio encoder. Instead of running an image through a dedicated network first, the model cuts it into patches and turns each patch into a token with a single matrix multiply — projecting it directly into the same embedding space as text tokens. Audio is handled the same way. One backbone then reads text, image, and audio tokens as one stream.

Why does removing the vision encoder matter?

A separate vision encoder is extra parameters to store, extra compute to run, and extra latency before the language model produces its first token. Dropping it is a big part of why Gemma 4 12B can handle images and native audio inside about 16 GB of memory and still report quality near Google's larger 26B mixture-of-experts model. The trade-off is that the backbone has to learn visual and acoustic structure itself, which is why the design ships as a model trained for it rather than a bolt-on.

How does it relate to native multimodal models like GLM-5V?

They answer different questions. "Native vs vision-bolted" is about training: was the model multimodal from the start, or was a vision module added to a finished text model? "Encoder-free" is about architecture: is there a separate encoder network at all, or does the input get projected straight into the token space? A model can be natively trained and still use a vision encoder; Gemma 4 12B is unusual in being both natively multimodal and encoder-free.

Originally posted on Learn AI Visually.

NVIDIA Blackwell Leads AgentPerf, the First Agentic-AI Infra Benchmark: Trajectory-Replay Benchmarking

pueding — Mon, 15 Jun 2026 11:20:31 +0000

What: The AgentPerf benchmark from Artificial Analysis is the first test built for agentic-AI infrastructure: instead of timing one chat completion, it replays recorded multi-step agent trajectories to see how a serving system holds up under real agent load.

Why: Agents don't send one prompt — they run long chains of model calls and tool executions, so a serving system's real job is sustaining many such runs at once. AgentPerf measures exactly that: concurrent agents held above a speed limit, normalized by power.

vs prior: A single-shot completion benchmark sends one prompt and reports tokens per second — and misses the bursty, stateful, KV-cache-heavy load a real agent creates. Trajectory replay reproduces that load, so the score reflects real production agent load, not a sprint time.

Think of it as

An EPA mileage test that replays a real drive cycle, not a top-speed sprint.

                  MEASURING A SERVING SYSTEM
                             │
             ┌───────────────┴───────────────┐
             │                               │
     ┌───────▼───────┐               ┌───────▼───────┐
     │  SPRINT TEST  │               │  DRIVE CYCLE  │
     │ one completion│               │ one agent run │
     └───────┬───────┘               └───────┬───────┘
             │                               │
    peak tokens/sec on             replay the stop-go,
    a single prompt                multi-step agent load
             │                               │
             ▼                               ▼
   ✗ flatters the rack,           ✓ agents per megawatt:
     ignores real load              agents held over an SLO

single chat completion = a one-shot top-speed sprint down a straight
agent trajectory = a recorded drive cycle — stop, go, idle, accelerate
AgentPerf = the dyno that replays the real drive cycle, not the sprint
per-token SLO = a minimum speed the car must hold the whole way
agents per megawatt = miles per gallon for the whole fleet

Quick glossary

AgentPerf — Artificial Analysis's benchmark for agentic-AI infrastructure. It drives a serving system with recorded coding-agent trajectories across 12+ programming languages and scores how many concurrent agents the system sustains under a per-token speed limit, normalized by power.

Agent trajectory — The full recorded run of an agent: chained LLM calls interleaved with tool executions — read a file, run code, see the error, try again — many steps to finish one task. See AI Agents → The Agent Loop.

Per-token SLO — A service-level objective on output speed — a floor on tokens per second the system must hold for each agent. AgentPerf measures at both 20 and 60 tok/s. See LLM Serving → Serving Metrics.

Goodput — Only the work that actually meets the SLO — here, the concurrent agents staying above the token-rate floor — as opposed to raw throughput, which counts everything regardless of latency. See Throughput vs Goodput.

Agents per megawatt — AgentPerf's headline metric: concurrent agents meeting the SLO, divided by the power the system draws. An efficiency number — useful work per unit of energy — like miles per gallon for an inference fleet.

GB300 NVL72 / HGX H200 — The two NVIDIA systems compared: the rack-scale Blackwell GB300 NVL72 versus the prior-generation HGX H200. Both run DeepSeek V4 Pro in the reported result.

The news. On June 12, 2026, Artificial Analysis released AgentPerf, billed as the industry's first benchmark for agentic-AI infrastructure. Rather than timing single chat completions, it replays real coding-agent trajectories (file reads, code execution, iteration) across 12+ programming languages, and scores how many concurrent agents a system sustains under a per-token SLO, normalized by power. NVIDIA reports that its GB300 NVL72 serves up to 20× more agents per megawatt than an HGX H200 system, running DeepSeek V4 Pro and measured at both 20 and 60 tokens/sec.

Picture the fuel-economy sticker on a new car. The number that ends up on the window isn't a quarter-mile drag time — a single sprint down an empty straight tells you almost nothing about the commute you'll actually drive. The figure drivers care about, miles per gallon, comes from a dynamometer replaying a recorded city drive cycle: stop, go, idle, accelerate, the messy real pattern. A single sprint measures the wrong thing; the recorded drive cycle measures the thing you live with. A single chat completion is that sprint. An agent's run is the drive cycle. AgentPerf is the dyno.

The reason the distinction matters is that an agent run looks nothing like one prompt-and-reply. It is a long loop of model calls interleaved with tool executions — read a file, run the code, look at the failure, edit, try again — many steps to finish a single task. That load is bursty and stateful: the context grows with every step, leaning hard on KV-cache reuse, decode comes in stop-go spurts, and many such runs land on the system at once. A benchmark that sends one prompt and reports peak tokens per second is timing the sprint, not the commute.
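A small calculation shows why a growing context leans so hard on KV-cache reuse. The sketch below counts the prompt tokens a server must process across one agent run, with and without reusing the cached prefix. The step sizes are hypothetical, and the model is deliberately simplified: it ignores generated tokens, cache eviction, and batching.

```python
def prefill_tokens(new_tokens_per_step, reuse_kv):
    """Total prompt tokens processed over one agent run.

    Without reuse, every step re-processes the whole context so far.
    With reuse, only the tokens added since the previous step are processed.
    """
    context = 0
    total = 0
    for new in new_tokens_per_step:
        context += new
        total += new if reuse_kv else context
    return total


# Hypothetical run: a 2,000-token system prompt plus tool definitions,
# then four steps that each append 300 tokens of tool output.
steps = [2000, 300, 300, 300, 300]

cold = prefill_tokens(steps, reuse_kv=False)
warm = prefill_tokens(steps, reuse_kv=True)
print(cold, warm)  # prints: 13000 3200
```

Even in this five-step toy run the server does about four times the prefill work without reuse, and the gap widens with every additional step.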

So AgentPerf replays recorded coding-agent trajectories and asks a different question: how many agents can the system keep above a per-token speed limit at the same time? That is a goodput measurement — count only the agents actually holding the SLO, not raw token throughput — and then divide by the power the rack draws. The unit that falls out, agents per megawatt, is miles-per-gallon for an inference fleet: useful work per unit of energy.
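The arithmetic of the metric is simple enough to write down. The sketch below is only an illustration of "count the agents holding the SLO, then divide by power"; the function names, the per-agent rates, and the 10 kW power draw are all hypothetical, and how AgentPerf itself finds the sustained concurrency level is not described in the announcement.

```python
def agents_meeting_slo(token_rates, slo_tok_s):
    """Goodput: count only the agents at or above the per-token speed floor."""
    return sum(1 for rate in token_rates if rate >= slo_tok_s)


def agents_per_megawatt(token_rates, slo_tok_s, power_kw):
    """Agents meeting the SLO, scaled to one megawatt (1,000 kW) of power."""
    return agents_meeting_slo(token_rates, slo_tok_s) * 1000 / power_kw


# Hypothetical measured decode rates (tokens/s) for six concurrent agents
# on a system drawing a hypothetical 10 kW.
rates = [75, 62, 60, 58, 41, 90]

print(agents_per_megawatt(rates, slo_tok_s=60, power_kw=10))  # prints: 400.0
print(agents_per_megawatt(rates, slo_tok_s=20, power_kw=10))  # prints: 600.0
```

The same six agents score differently at the two floors: the looser 20 tok/s SLO counts all of them, while the 60 tok/s SLO drops the two that fall behind.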

| How you benchmark | What it sends | What it misses |
| --- | --- | --- |
| Single chat completion | one prompt → one response | the bursty, multi-step load a real agent creates |
| Peak-throughput LLM bench | many independent prompts | KV reuse and sustained concurrency within one long run |
| AgentPerf (trajectory replay) | recorded multi-step agent runs | n/a (scores concurrent agents under an SLO, then agents per megawatt) |

What "20× per megawatt" means

Hold two things fixed: the power, at one megawatt, and the SLO, at 60 tokens per second. Suppose an HGX H200 rack sustains 60 concurrent agents that stay above that floor on its megawatt (illustrative). The only figure NVIDIA actually reports is the ratio: the GB300 NVL72 sustains up to 20× as many on the same megawatt, which under this assumed baseline works out to as many as 60 × 20 = 1,200 agents. The lever isn't only more FLOPs. Successive calls in an agent trajectory share a huge common prefix (the system prompt, the tool definitions, the conversation so far), so KV-cache reuse and continuous batching are what turn raw compute into sustained agents, and a single-completion benchmark never exercises that reuse. Same megawatt, up to 20× the agents, because the test finally rewards sustained, KV-reuse-heavy agent load instead of a one-shot sprint. (Only the 20× ratio, the 20/60 tok/s SLOs, and the GB300-vs-H200 comparison come from NVIDIA; the 60-agent baseline is illustrative.)

Related explainers

NVIDIA AI Factories — tokens per megawatt — the metric cousin: AgentPerf's agents per megawatt is the agent-level version of tokens per megawatt
WeaveBench — trajectory-aware grading — also replays the whole agent run, but to grade correctness; AgentPerf replays it to measure infrastructure
FutureSim — harness-level agent eval — the broader shift to evaluating agents at the harness level, not single-shot QA

FAQ

What is AgentPerf?

AgentPerf is a benchmark from Artificial Analysis, billed as the first test for agentic-AI infrastructure. Instead of timing single chat completions, it replays recorded multi-step coding-agent trajectories — file reads, code execution, and iteration across 12+ programming languages — and scores how many concurrent agents a serving system sustains under a per-token SLO, normalized by power (agents per megawatt). In NVIDIA's reported result, a GB300 NVL72 system serves up to 20× more agents per megawatt than an HGX H200 system on DeepSeek V4 Pro.

How is trajectory-replay benchmarking different from a normal LLM benchmark?

A normal LLM benchmark sends one prompt and measures the response — tokens per second, time to first token. An agent, though, runs a long trajectory: chained model calls interleaved with tool executions, with a growing context and bursty decode. Trajectory replay drives the system with those recorded multi-step runs instead of single prompts, so it stresses the scheduler, KV-cache reuse, and sustained decode under concurrency — the load that real agents actually create.

What does agents per megawatt measure?

Agents per megawatt is AgentPerf's headline metric: the number of concurrent agents a system keeps above the per-token SLO, divided by the power it draws. It is a goodput-style efficiency number — useful work per unit of energy — analogous to miles per gallon for an inference fleet. It rewards systems that sustain many real agent runs at once on the same power budget, not just peak token throughput on a single prompt.

Originally posted on Learn AI Visually.

NVIDIA RTX Spark Superchip: Unified CPU–GPU Memory

pueding — Sun, 14 Jun 2026 11:18:51 +0000

What: NVIDIA's RTX Spark "superchip" (unveiled around Computex / Build 2026) pairs a 20-core Grace CPU with a Blackwell RTX GPU that together address one 128GB unified memory pool over NVLink-C2C — the idea this page explains is unified coherent CPU–GPU memory.

Why: On an ordinary discrete GPU, any data the GPU touches must first be copied from CPU system RAM into GPU VRAM across the PCIe bus — a copy that dominates the moment a model is too big to fit in VRAM. A shared pool lets the GPU read the bytes where they already sit, deleting that copy.

vs prior: A discrete GPU walls its VRAM off behind PCIe and shuttles data both ways with explicit host↔device copies (cudaMemcpy); RTX Spark's coherent unified pool removes the wall, so CPU and GPU see the same physical addresses — no staging copy, no PCIe round-trip.

Think of it as

Two chefs sharing one counter instead of passing plates through a hatch.

                      THE DATA TO COOK
                              │
              ┌───────────────┴───────────────┐
              │                               │
      ┌───────▼───────┐               ┌───────▼───────┐
      │ DISCRETE GPU  │               │   RTX SPARK   │
      │  (the hatch)  │               │ (one counter) │
      └───────┬───────┘               └───────┬───────┘
              │                               │
     slide each plate                reach across to the
     through one hatch                same shared counter
     (a PCIe copy)                    (NVLink-C2C, in place)
              │                               │
              ▼                               ▼
   ✗ chefs wait on the              ✓ no copy: grab it
     hatch, not cooking               where it already sits

CPU = the prep chef who gathers and stages the ingredients
GPU = the line chef who does the fast cooking
PCIe copy = sliding every plate through one narrow serving hatch
unified memory pool = one shared counter both chefs reach across
NVLink-C2C = the wide-open pass-through that replaces the hatch

Quick glossary

Unified (coherent) memory — A single physical memory pool that both the CPU and GPU address directly. "Coherent" means a write by one processor is visible to the other without an explicit transfer — so there is no host→device copy step.

PCIe — Peripheral Component Interconnect Express — the bus a discrete GPU sits on. A PCIe 5.0 ×16 link tops out near ~64 GB/s, glacial next to a GPU's on-package bandwidth of roughly several TB/s (order-of-magnitude figure). See GPU & CUDA → Memory Hierarchy → NVLink.

VRAM — The GPU's own high-bandwidth memory (GDDR or HBM), physically separate from CPU system RAM on a discrete card. Once a model's working set exceeds VRAM, data must be streamed in from elsewhere.

NVLink-C2C — NVIDIA's chip-to-chip coherent interconnect that bonds the Grace CPU and the GPU into one memory domain — far wider than PCIe and cache-coherent, which is what makes the shared pool possible.

Grace CPU — NVIDIA's Arm-based server/desktop CPU, designed to sit next to a GPU over NVLink-C2C and share a memory pool rather than talk across PCIe.

Host & Device — CUDA's names for the two sides of the copy: the host is the CPU (and its RAM), the device is the GPU (and its VRAM). The classic pattern allocates device memory, copies host→device, launches the kernel, then copies device→host.

FP4 Tensor Core — A 4-bit floating-point matrix unit (fifth-generation on Blackwell). RTX Spark leans on FP4 to fit large models on-device — quantization shrinks the bytes; unified memory removes the copy of those bytes.

The news. On June 2, 2026, around Computex / Build 2026, NVIDIA unveiled RTX Spark, a consumer "superchip" aimed at on-device AI agents. It combines a Blackwell RTX GPU (6,144 CUDA cores, fifth-generation FP4 Tensor Cores) with a 20-core Grace CPU over NVLink-C2C, delivering up to 1 petaflop of AI compute and 128GB of unified memory. RTX Spark laptops and compact desktops ship this fall from ASUS, Dell, HP, Lenovo, Microsoft Surface, and MSI.

Picture the two chefs for a second. The prep chef chops and stages every ingredient on his bench; the line chef does the fast searing under the heat. On a normal setup they work in separate rooms joined by one narrow serving hatch — every tray of mise en place has to be slid through that slot before the line chef can touch it, and finished plates slid back. For a two-cover lunch the hatch is fine. For a 200-cover banquet, the hatch is the bottleneck: both chefs spend more time shoving trays through the slot than actually cooking. RTX Spark knocks out the wall. Now both chefs work at one long shared counter — the line chef reaches over and grabs the mise en place exactly where the prep chef left it. No hatch, no sliding, no copy.

In CUDA terms, the hatch is the PCIe bus and the trays are cudaMemcpy calls. A discrete GPU keeps its fast VRAM physically separate from the CPU's system RAM, so running a kernel follows the classic four-step dance: allocate device memory, copy the input host→device across PCIe, launch the kernel, then copy the result device→host.

A PCIe 5.0 ×16 link tops out around 64 GB/s in each direction: quick in isolation, but glacial next to a GPU's on-package bandwidth of several TB/s (an order-of-magnitude figure). For a model that fits in VRAM you pay the copy once and amortize it. For a model bigger than VRAM, you stream weights across PCIe layer by layer, every forward pass, and the copy, not the matmul, sets your token rate. In that regime decode is bound by the transfer link, and the GPU's compute cores sit idle waiting for bytes.

RTX Spark deletes the staging copy outright. A Grace CPU and a Blackwell GPU are bonded over NVLink-C2C into a single 128GB coherent pool. Coherent is the load-bearing word: both processors see the same bytes at the same addresses, and a write by one is visible to the other with no explicit transfer. The GPU stops being a walled-off device you ship data to and becomes a peer that reads the data in place — the same shift that the memory ladder work frames as moving the bottleneck back toward on-package bandwidth, where it belongs.

This is why NVIDIA pitches RTX Spark as an on-device agent machine. Local agents juggle big context windows, KV caches, and sometimes several models at once — state that is awkward to shuttle across PCIe but trivial to share in a unified pool. A 70B-class model at 4-bit weights needs ~35GB; it won't fit in a typical discrete laptop GPU's 8–16GB of VRAM, so today it either spills to system RAM over PCIe (slow) or simply won't run. With 128GB of unified memory, the same model just lives in the pool and the GPU addresses all of it. (NVIDIA has not published the consumer part's exact NVLink-C2C bandwidth, so treat the on-package figures below as illustrative.)
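The ~35GB figure follows directly from parameter count times bits per weight. The sketch below checks it and the fit against the memory sizes mentioned above. It counts weights only and treats 1 GB as 10^9 bytes; the helper names are made up for the example, and a real deployment also needs room for the KV cache, activations, and the operating system.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights alone (no KV cache, no activations)."""
    return num_params * bits_per_weight / 8


def fits(num_params, bits_per_weight, memory_gb):
    """True if the weights fit in the given memory, using 1 GB = 1e9 bytes."""
    return weight_bytes(num_params, bits_per_weight) <= memory_gb * 1e9


params = 70e9  # a 70B-class model

print(weight_bytes(params, 4) / 1e9)   # prints: 35.0
print(weight_bytes(params, 16) / 1e9)  # prints: 140.0
print(fits(params, 4, 16))             # prints: False
print(fits(params, 4, 128))            # prints: True
```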

Where the copy time actually goes

A back-of-envelope walk-through (illustrative numbers; substitute your own workload). Take a 34GB 4-bit model that does not fit in a 16GB discrete GPU. On the discrete path, running one forward pass means streaming the weights across PCIe 5.0 at ~64 GB/s; in the worst case, where all 34GB cross the link, that is about 0.53 s of pure copy per pass. During decode that's roughly one pass per token, so the copy alone caps you near 1.9 tokens/s before a single multiply happens, and the GPU cores idle the whole time. On the unified path, the GPU addresses all 34GB in the shared pool directly at on-package bandwidth. Call it ~0.5 TB/s for a consumer part (illustrative): reading the same 34GB takes about 0.07 s, roughly 8× less waiting. Decode is still limited by how fast the weights can be read, but the limit is now the pool's own bandwidth rather than the PCIe link. The model size didn't change; the copy disappeared.
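The walk-through above is a single division done twice, so it is easy to rerun with your own numbers. In the sketch below the 500 GB/s unified figure is the same illustrative guess used in the text, not a published specification, and the "ceiling" assumes one full read of the weights per generated token.

```python
def seconds_per_pass(model_gb, bandwidth_gb_s):
    """Time to move every weight once at the given bandwidth."""
    return model_gb / bandwidth_gb_s


MODEL_GB = 34          # 4-bit model from the walk-through
PCIE_GB_S = 64         # PCIe 5.0 x16, approximate
UNIFIED_GB_S = 500     # illustrative on-package figure, not a published spec

discrete = seconds_per_pass(MODEL_GB, PCIE_GB_S)
unified = seconds_per_pass(MODEL_GB, UNIFIED_GB_S)

print(f"discrete: {discrete:.2f} s/pass, {1 / discrete:.1f} tokens/s ceiling")
print(f"unified : {unified:.2f} s/pass, {1 / unified:.1f} tokens/s ceiling")
print(f"speedup : {discrete / unified:.1f}x")
```

With these inputs the script reports 0.53 s and 1.9 tokens/s for the discrete path, 0.07 s and 14.7 tokens/s for the unified path, and a 7.8× gap, matching the rounded figures in the text.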

How systems connect CPU and GPU memory

| System | CPU ↔ GPU memory | Interconnect | Host→device copy? |
| --- | --- | --- | --- |
| Discrete GPU (PCIe card) | separate VRAM + system RAM | PCIe 5.0 ~64 GB/s (setup-dependent) | Yes, both ways |
| Integrated GPU (iGPU) | shared system RAM | on-die | No, but low bandwidth |
| Apple Silicon (UMA) | unified system memory | on-package fabric | No |
| NVIDIA Grace Hopper (GH200) | unified, coherent | NVLink-C2C ~900 GB/s (GH200 figure) | No |
| NVIDIA RTX Spark (2026) | unified 128GB, coherent | NVLink-C2C (consumer bandwidth undisclosed) | No (zero-copy) |

A caveat worth attaching to the headline: unified memory removes the copy, not the bandwidth wall. The pool is still finite-bandwidth memory, so a model that's memory-bandwidth-bound on a discrete card is still bandwidth-bound on RTX Spark — it just stops paying the PCIe tax on top. And NVIDIA quotes "1 petaflop" as a low-precision (FP4) peak, not a sustained number. The structural win is real and narrow: the host↔device copy goes away, which is exactly the tax that makes over-VRAM models painful on today's laptops.

Related explainers

Jetson Thor — Edge Blackwell vs datacenter Blackwell — the robotics cousin: the same Blackwell silicon, also on a unified-memory SoC
Vera Rubin NVL72 — rack-scale NVLink domain — NVLink at the other extreme: 72 GPUs as one fabric (GPU↔GPU), versus RTX Spark's CPU↔GPU NVLink-C2C
MobileMoE — DRAM-aware MoE scaling — the algorithmic side of fitting big models in tight on-device memory

FAQ

What is unified CPU–GPU memory, in one paragraph?

Unified memory is a single physical memory pool that both the CPU and the GPU address directly. On a discrete GPU, the CPU's system RAM and the GPU's VRAM are separate, so data must be copied across the PCIe bus before the GPU can use it (host→device) and copied back afterward. A unified, coherent pool — like the 128GB pool RTX Spark shares over NVLink-C2C — lets the GPU read the bytes exactly where they sit. No staging copy, no PCIe round-trip.

Why does eliminating the PCIe copy matter for on-device AI?

Because the copy, not the math, is often the bottleneck. A PCIe 5.0 ×16 link moves data at roughly 64 GB/s. When a model is larger than the GPU's VRAM, the weights must stream across PCIe on every forward pass, and the GPU's compute cores idle while they wait. For a 34GB 4-bit model on a 16GB discrete GPU, that copy alone can cap throughput near 1.9 tokens/s (illustrative). Sharing one 128GB pool lets the model live in memory and the GPU read it in place, moving the bottleneck back to compute and on-package bandwidth.

How is RTX Spark's unified memory different from a discrete GPU or from Apple Silicon?

A discrete GPU has separate VRAM behind PCIe and needs explicit host↔device copies. Integrated GPUs and Apple Silicon already share one memory pool: integrated GPUs typically at ordinary system-memory bandwidth, Apple Silicon over a wider on-package memory bus. RTX Spark's approach bonds a Grace CPU and a Blackwell GPU over NVLink-C2C, a wide, cache-coherent chip-to-chip link, into a 128GB coherent pool, so it gets the no-copy benefit of unified memory while keeping a discrete-class GPU on the other end of the link. NVIDIA's Grace Hopper (GH200) datacenter parts use the same NVLink-C2C idea.

Originally posted on Learn AI Visually.

Google Ships Gemma 4 QAT Checkpoints: Quantization-Aware Training

pueding — Sat, 13 Jun 2026 11:18:17 +0000

What: Google shipped quantization-aware-trained (QAT) checkpoints for the Gemma 4 family — open weights that were trained to survive being squeezed down to 4-bit (and 2-bit on the decode layers).

Why: Low-bit weights are how a real model fits on a phone: Google reports the compact E2B size lands at about a 1 GB memory footprint, small enough to run on consumer hardware instead of a datacenter GPU.

vs prior: Versus post-training quantization (PTQ) — which rounds the weights to the low-bit grid after training and falls off an accuracy cliff at very low bit-widths — QAT simulates that rounding during training, so the weights learn to sit on the grid in the first place.

Think of it as

A singer rehearsing on a cheap keyboard with only a few keys.

                   THE NOTE TO HIT
                          │
              ┌───────────┴───────────┐
              │                       │
      ┌───────▼───────┐       ┌───────▼───────┐
      │  PTQ          │       │  QAT          │
      │ (round after) │       │ (train on it) │
      └───────┬───────┘       └───────┬───────┘
              │                       │
     sing free, then          rehearse on the
     auto-tune onto           few keys all along
     the nearest key          so notes land there
              │                       │
              ▼                       ▼
       ✗ far note snaps        ✓ note already sits
         hard — sour             on a key — clean
        (accuracy cliff)        (no cliff)

model weight = a note the singer wants to hit
4-bit grid = the few keys the cheap keyboard actually has
post-training quantization = auto-tuning a freely-sung take onto the nearest key afterward
quantization-aware training = rehearsing on those keys all along, so every note already lands on one
accuracy cliff = how sour it sounds when auto-tune drags a far-off note onto a key

Quick glossary

Quantization-Aware Training (QAT) — Training (or fine-tuning) the model while simulating the low-bit rounding on every forward pass, so the weights learn to land on the quantization grid. The result is a checkpoint that holds up far better at low bit-width than the same model quantized after the fact.

Post-Training Quantization (PTQ) — The cheap default: take a finished full-precision model and round its weights to the low-bit grid afterward, with no retraining. Fast, but the rounding error it introduces is exactly what QAT is built to avoid. GPTQ and AWQ are PTQ methods.

Bit-width / precision — BF16 ("Brain float", 16 bits, from Google Brain) stores a weight in 2 bytes with a near-continuous range; INT4 stores it in 4 bits, i.e. only 16 possible values. Fewer bits = fewer grid points = bigger rounding error.

Q4_0 / GGUF — Q4_0 is a 4-bit weight format; GGUF is the on-disk file format llama.cpp loads (e.g. gemma-4-E2B-it-qat-q4_0.gguf). Gemma 4 also ships "compressed tensors" for serving in vLLM.

Straight-through estimator (STE) — Rounding has a zero gradient almost everywhere, so you can't normally backprop through it. STE is the trick QAT uses: round on the forward pass, but pass the gradient through as if rounding were the identity — letting training "feel" the grid.

Mixed precision by layer — Not every layer is equally fragile. Gemma 4's mobile format keeps the reasoning-critical layers at higher precision and pushes the bulky token-generation (decode) layers down to 2-bit, where the memory savings are largest.

The news. On June 5, 2026, Google released quantization-aware-trained checkpoints for the Gemma 4 family, spanning the compact E2B and E4B edge models up through 12B and larger sizes. Alongside the standard Q4_0 4-bit format, a new mobile schema applies targeted 2-bit quantization to the token-generation layers while keeping the core reasoning layers at higher precision, plus an optimized KV cache and static activations. With the mobile format, Gemma 4 E2B's reported footprint drops to about 1 GB. Checkpoints ship as GGUF for llama.cpp and as compressed tensors for vLLM. Read the announcement →

Picture the singer at a cheap keyboard that has only a handful of keys. The pitches she actually wants to sing live between those keys. The lazy way to record is to sing freely and then auto-tune the take onto the nearest key afterward — and if a note was sitting halfway between two keys, the snap yanks it a long way and the whole phrase sounds sour. That sour snap is post-training quantization: the model finishes training wherever its weights landed, and only then do you round them to the coarse low-bit grid. Quantization-aware training is the disciplined alternative — the singer rehearses on those exact keys the entire time, so every note she learns already lands on one. When you finally record in low fidelity, nothing has to move.

Underneath the metaphor, the "keys" are the grid of values a low-bit format can store, and the "sour snap" is rounding error. Drop a weight from 16-bit down to 4-bit and you go from a near-continuous range to just 16 representable values — so the rounding step has to shove each weight onto the nearest of those few points. PTQ does this once, at the end, to weights that never anticipated it. QAT instead simulates the rounding on every training step (using a straight-through estimator so gradients still flow), so the network learns weights that already sit on the grid — and learns to compensate elsewhere for the little that can't. That rounding gap, which widens as the bit-width drops, is exactly what PTQ pays and QAT trains away.
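The rounding-plus-STE mechanic can be sketched in a few lines. This is a minimal illustration, assuming a symmetric signed 4-bit grid and a one-weight toy loss; `fake_quantize`, `qat_step`, the `max_abs` range and the learning rate are hypothetical names and values chosen for the example, not Gemma's training recipe.

```python
def fake_quantize(w, bits=4, max_abs=1.0):
    """Snap w onto a signed integer grid with 2**bits levels (16 levels for 4-bit)."""
    qmax = 2 ** (bits - 1) - 1        # largest integer code, 7 for 4-bit
    qmin = -(2 ** (bits - 1))         # smallest integer code, -8 for 4-bit
    scale = max_abs / qmax            # spacing between neighbouring grid points
    code = max(qmin, min(qmax, round(w / scale)))
    return code * scale


def qat_step(w, x, y, lr=0.1, bits=4):
    """One training step on the loss (q(w) * x - y)**2 with a straight-through estimator."""
    w_q = fake_quantize(w, bits)      # forward pass sees the rounded weight
    error = w_q * x - y
    grad_wq = 2 * error * x           # exact gradient with respect to the rounded weight
    grad_w = grad_wq                  # STE: treat d(w_q)/d(w) as 1 instead of its true value 0
    return w - lr * grad_w            # the full-precision latent weight receives the update


# Toy run: the target 0.4 sits between two grid points, so the latent weight
# is pushed toward whichever grid point reduces the loss.
w = 0.0
for _ in range(3):
    w = qat_step(w, x=1.0, y=0.4)
print(w, fake_quantize(w))
```

The point of the sketch is the `grad_w = grad_wq` line: without it the gradient through `round` would be zero and the latent weight would never move. Real QAT applies the same idea per tensor or per block, with learned or calibrated scales.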

The reason anyone bothers is memory. Take a 2-billion-parameter model (illustrative — E2B is Google's compact "effective-2B" size). At BF16 (2 bytes per weight) the weights alone need 2,000,000,000 × 2 = 4 GB. Round them to 4-bit (about half a byte each) and that's roughly 1 GB — a ~4× shrink. Gemma 4's mobile format then pushes the bulky decode layers down to 2-bit while protecting the reasoning-critical layers, and — with the KV cache and activations optimized on top — Google reports that format brings E2B's footprint to about 1 GB, small enough to run on phone-class hardware. The mixed-precision-by-layer idea is simple: spend your bits where the model is fragile, save them where it is robust.
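The footprint arithmetic is easy to reproduce. The sketch below counts weight storage only, so it ignores the per-block scale metadata that real 4-bit formats carry, as well as the KV cache and activations. The 50/50 layer split is a made-up example, not Google's published layout, and `weight_gb` and `average_bits` are hypothetical helper names.

```python
GB = 1e9  # decimal gigabytes, matching the article's arithmetic


def weight_gb(n_params, bits_per_weight):
    """Storage for the weights alone, ignoring scales, KV cache and activations."""
    return n_params * bits_per_weight / 8 / GB


def average_bits(layer_mix):
    """layer_mix: list of (fraction_of_params, bits) pairs whose fractions sum to 1."""
    return sum(fraction * bits for fraction, bits in layer_mix)


n_params = 2_000_000_000           # illustrative 2B-parameter model
print(weight_gb(n_params, 16))     # BF16 weights
print(weight_gb(n_params, 4))      # 4-bit weights

# Hypothetical split, NOT Google's layout: half the weights at 2-bit, half at 4-bit.
mixed = average_bits([(0.5, 2), (0.5, 4)])
print(mixed, weight_gb(n_params, mixed))
```

Changing the fractions in `layer_mix` shows how quickly the average bit-width falls once the bulkiest layers drop to 2-bit.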

Approach	When rounding is applied	4-bit accuracy	Cost to produce
Post-training quantization (PTQ)	after training, once	falls off a cliff at very low bit-width (setup-dependent)	cheap — no retraining
Quantization-aware training (QAT)	simulated on every training step	higher quality than standard PTQ at 4-bit (Google)	needs a training / fine-tune pass

The catch is that QAT is not free: someone has to run that extra training pass, which is why it ships from a lab with the GPUs rather than as a one-line conversion you run at home. That is exactly why a vendor releasing pre-quantized QAT checkpoints matters — Google eats the training cost once, and everyone downloading the GGUF gets 4-bit weights without the accuracy cliff they'd hit by quantizing the model themselves. PTQ still has its place when you can't retrain, and aggressive low-bit work brings its own headaches like outlier weights — but for a model meant to live on a phone, training on the grid is what makes the small size honest.

Goes deeper in: LLM Internals → Quantization → The Quantization Process

Related explainers

LongLive 2.0 — NVFP4 W4A4 training and inference — the same "train at low precision, not just serve at it" idea, taken all the way to 4-bit activations
QCA — outlier injection for PTQ — the failure mode on the other side: what makes post-training quantization break, and how PTQ fights back without retraining
MobileMoE — DRAM-aware scaling — the other half of "fits on a phone": shrinking the active memory of a mixture-of-experts model, not its bit-width

FAQ

What is quantization-aware training (QAT)?

QAT trains or fine-tunes a model while simulating low-bit rounding on every forward pass, so the weights learn to land on the quantization grid. Because the network adapts to the rounding during training, the final checkpoint can be stored at low precision — Gemma 4 ships at 4-bit, with 2-bit decode layers in its mobile format — with much less quality loss than rounding the weights afterward.

How is QAT different from post-training quantization?

Post-training quantization (PTQ) rounds a finished full-precision model down to the low-bit grid once, at the end, with no retraining — cheap, but it introduces rounding error the model never learned to absorb, which becomes an accuracy cliff at very low bit-widths. QAT moves that rounding into training, so the weights already sit on the grid and the model compensates for what little error remains.

How does Gemma 4 fit in about 1 GB on a phone?

Two things stack. First, 4-bit weights are roughly 4× smaller than BF16 (about half a byte per weight instead of two bytes). Second, Gemma 4's mobile format pushes the bulky token-generation layers down to 2-bit while keeping reasoning-critical layers higher, and optimizes the KV cache and activations. Google reports the compact E2B size lands at about a 1 GB footprint with the mobile format, and QAT is what keeps that aggressive squeeze from wrecking quality.

Originally posted on Learn AI Visually.

MiniMax M3 Ships Open-Weight 1M Context: MiniMax Sparse Attention (MSA)

pueding — Fri, 12 Jun 2026 11:29:16 +0000

What: The MiniMax M3 release — an open-weight model with a 1M-token context and 59% on SWE-Bench Pro — is built on MiniMax Sparse Attention (MSA), a block-sparse attention that gathers only the slices of the cached past each token actually needs.

Why: At a million tokens, ordinary attention's cost grows with the square of the length. MSA reportedly cuts per-token compute about 20× and delivers >9× faster prefill and >15× faster decode — the kind of serving efficiency that helps a 1M-context model ship with open weights.

vs prior: Where dense (full) attention compares every token against every earlier token, MSA partitions the KV cache into blocks and selects only the relevant ones — and its authors say it partitions more precisely than earlier sparse schemes like DSA or MoBA, while matching full attention on the vast majority of capabilities.

Think of it as

a librarian who fetches only the few relevant shelves

                       ONE QUERY
                           │
              ┌────────────┴────────────┐
              │                         │
      ┌───────▼───────┐         ┌───────▼───────┐
      │ DENSE (full)  │         │  MSA (MiniMax)│
      │ attention     │         │  block gather │
      └───────┬───────┘         └───────┬───────┘
              │                         │
     re-reads every book        glances at the labels,
     in the whole library       pulls a few shelves
     for every question         that bear on the query
              │                         │
              ▼                         ▼
       ✗ cost grows with         ✓ ~20x less compute
         the square of             at 1M tokens —
         the length                same answer

query token = a reader asking one question
KV cache = the whole library of everything read so far
KV block = one labeled shelf of related notes
dense attention = re-reading every book for every question
MSA block gather = pulling only the handful of shelves that matter

Quick glossary

MSA (MiniMax Sparse Attention) — MiniMax M3's attention mechanism. Instead of every token attending to every earlier token, it cuts the cached past into blocks, scores which blocks matter for each query, and computes attention over only the selected few.

KV cache — The stored Key and Value vectors for every token already processed, so the model never recomputes the past. It grows with context length — at 1M tokens it is enormous. Background: KV Cache → Memory Cost.

Dense (full) attention — The standard mechanism: each query compares against all earlier keys, so the work scales with the square of the sequence length (O(n²)). See Attention → Computing Attention Scores.

Block-sparse attention — Skipping most of the attention matrix on purpose. The keys are grouped into contiguous blocks; a lightweight selector keeps only the blocks a query needs and ignores the rest — so the model computes far fewer comparisons without retraining a different model class.

KV-outer gather Q — MiniMax's name for MSA's memory access pattern: for each query (Q), the engine gathers the selected outer KV blocks from cache before computing attention. It is a gather (indexed) access pattern that reads only the chosen blocks, not a dense sweep over the whole cache.

Prefill vs decode — Prefill reads the whole prompt in parallel; decode emits one token at a time. MSA reports separate speedups for each (>9× prefill, >15× decode) because they stress the hardware differently.

DSA / MoBA — Earlier block-sparse attention schemes (DeepSeek's sparse attention and Mixture-of-Block-Attention). MiniMax says MSA partitions the KV cache more precisely than both, keeping quality closer to full attention.

SWE-Bench Pro — A hard software-engineering benchmark: the model must resolve real GitHub issues end to end. M3 reportedly scores 59%, putting an open-weight model in frontier coding territory.

The news. On June 1, 2026, MiniMax released M3, an open-weight model that pairs frontier-level coding (59% on SWE-Bench Pro), a 1M-token context window, and native multimodality. The headline architecture change is MiniMax Sparse Attention (MSA) — a block-sparse attention the team reports cuts per-token compute about 20× at one million tokens. Read the release →

Picture the metaphor for a moment. A reader walks into a vast library — every note the model has ever taken sits on the shelves — and asks one question. The lazy approach is to re-read every book in the building before answering. That always works, but the effort grows brutally: double the library and you roughly quadruple the reading, because each new question also has to consider every new book. A good librarian doesn't do that. The notes are filed onto labeled shelves, the librarian glances at the labels, and pulls only the handful of shelves that actually bear on the question. Same answer, a fraction of the walking.

That is exactly the trade dense attention makes — and exactly the one MSA refuses. In a standard transformer, every token has to compare itself against every earlier token, so the attention work scales with the square of the sequence length. At a few thousand tokens nobody notices; at a million it dominates everything else the model does.

MiniMax Sparse Attention replaces the full sweep with a gather. It cuts the cached past into blocks — think of each block as one labeled shelf — scores which blocks are relevant to the current query, and computes attention over only the selected blocks. MiniMax calls the resulting memory pattern a "KV-outer gather Q": for each query, the engine gathers the chosen KV blocks instead of streaming the whole cache. The team reports this partitions the cache more precisely than earlier block-sparse schemes like DSA or MoBA, which is why M3 holds quality — it matches full attention on the vast majority of capabilities while skipping most of the comparisons.
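The "score the shelves, then gather a few" step can be sketched for a single query. This is a generic block-sparse illustration, not MiniMax's implementation: scoring each block by the mean of its keys is an assumption made for the example, and `attend` and `block_sparse_attend` are hypothetical names.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    top = max(scores)                                  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def block_sparse_attend(q, keys, values, block_size, top_k):
    """Score each block by its mean key, keep the top_k blocks, attend only to those."""
    starts = range(0, len(keys), block_size)

    def label(start):                                  # the "shelf label": mean key of the block
        block = keys[start:start + block_size]
        return [sum(col) / len(block) for col in zip(*block)]

    ranked = sorted(starts, key=lambda s: dot(q, label(s)), reverse=True)
    chosen = sorted(ranked[:top_k])                    # keep original token order
    idx = [i for s in chosen for i in range(s, min(s + block_size, len(keys)))]
    out = attend(q, [keys[i] for i in idx], [values[i] for i in idx])
    return out, len(idx)                               # output and number of keys actually read


keys = [[1, 0], [1, 0], [0, 1], [0, 1], [-1, 0], [-1, 0], [0, -1], [0, -1]]
values = [[float(i)] for i in range(8)]
print(block_sparse_attend([1, 0], keys, values, block_size=2, top_k=1))
print(attend([1, 0], keys, values))
```

With `top_k` equal to the number of blocks the function reduces to dense attention, which makes the trade visible: fewer blocks means fewer keys read, at the risk of missing a block the selector scored too low.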

Where the ~20× actually comes from

Hold the setup fixed and walk the arithmetic. Picture a query at the one-millionth token. Dense attention compares it against all 1,000,000 cached keys. MSA first groups those keys into blocks — say 128 keys each, so roughly 7,800 blocks (illustrative) — scores them, and keeps only the ones that matter. If the selector keeps about 5% of blocks (illustrative), the query now touches ~50,000 keys instead of 1,000,000 — a 20× drop in per-token comparisons, which lines up with the ~20× per-token compute cut MiniMax reports at 1M context. The savings show up twice in serving: >9× faster prefill (the prompt is read in one parallel pass) and >15× faster decode (each new token now gathers a few blocks instead of the whole cache).
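The same arithmetic as a small calculator, using the article's illustrative block size and keep rate. It counts only the keys a query attends to and leaves out the cost of scoring the block labels, so treat the ratio as a rough upper bound on the saving. `sparse_key_budget` is a hypothetical helper name.

```python
import math


def sparse_key_budget(context_len, block_size, keep_fraction):
    """Return (n_blocks, keys_touched, reduction) for one query under block selection."""
    n_blocks = math.ceil(context_len / block_size)
    kept_blocks = max(1, round(n_blocks * keep_fraction))
    keys_touched = min(context_len, kept_blocks * block_size)
    return n_blocks, keys_touched, context_len / keys_touched


# Illustrative values from the text: 1M tokens, 128-key blocks, about 5% of blocks kept.
n_blocks, keys_touched, reduction = sparse_key_budget(1_000_000, 128, 0.05)
print(n_blocks, keys_touched, round(reduction, 1))
```

Varying `block_size` or `keep_fraction` shows why the multiplier is setup-dependent: halving the keep rate doubles the reduction, while the block size mostly changes how coarse the selection is.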

How the attention variants compare

Approach	What each query looks at	Cost vs context length	Note
Dense (full) attention	every earlier token	grows with the square (n²)	the baseline; exact but expensive
Sliding-window	a fixed nearby window	linear, but drops far context	cheap; loses long-range recall
DSA / MoBA (block selection)	top-scored blocks	sub-quadratic	prior block-sparse schemes
MSA (MiniMax)	top-scored KV blocks, gathered	~20× less per-token compute at 1M (MiniMax; setup-dependent)	"partitions more precisely than DSA / MoBA"

A caveat worth keeping: the ~20× compute, >9× prefill, and >15× decode figures are MiniMax's own numbers at the 1M-context operating point, and sparse-attention speedups are setup-dependent — block size, how many blocks the selector keeps, sequence length, and the hardware all move them. The qualitative win (gather a few shelves, not the whole library) is the durable lesson; the exact multiplier is a reported headline, not a guarantee at every length.

Goes deeper in: LLM Internals → Self-Attention → Computing Attention Scores

Related explainers

IO-optimal approximate attention — near-linear IO — a different route to sub-quadratic attention: cut memory traffic rather than select blocks
Tangram — per-head KV cache budgets — another way to shrink long-context attention cost, by sizing each head's KV budget instead of selecting blocks
Parallax — local linear attention — the linear-attention alternative to the block-sparse approach MSA takes

FAQ

What is MiniMax Sparse Attention (MSA)?

MSA is the attention mechanism inside MiniMax's open-weight M3 model. Instead of having every token attend to every earlier token (dense attention, whose cost grows with the square of the sequence length), MSA partitions the KV cache into blocks, scores which blocks are relevant to each query, and computes attention over only the selected few — a "KV-outer gather Q" access pattern. MiniMax reports it cuts per-token compute about 20× at a 1M-token context while matching full attention on most capabilities.

Why does MSA matter?

Long context is the binding cost for modern LLMs: at a million tokens, dense attention dominates both compute and memory bandwidth. By gathering only the relevant KV blocks, MSA reportedly delivers more than 9× faster prefill and more than 15× faster decode at 1M context, and over 4× faster than Flash-Sparse-Attention. That kind of serving efficiency is what helps make a frontier-coding (59% SWE-Bench Pro), 1M-context model practical to ship with open weights.

How does MSA relate to DSA, MoBA, and the KV cache?

MSA, DSA, and MoBA are all block-sparse attention schemes: each selects a subset of the KV cache to attend to, rather than the whole thing. MiniMax says MSA partitions the cache more precisely than DSA (DeepSeek sparse attention) or MoBA (mixture of block attention), which is why it keeps quality closer to full attention. MSA sits one layer above the KV cache itself: the cache stores every token's Key and Value vectors, and MSA decides which blocks of that cache each query is allowed to read.

Originally posted on Learn AI Visually.

Google Releases DiffusionGemma: Parallel Block Decoding

pueding — Thu, 11 Jun 2026 11:18:16 +0000

What: Google released DiffusionGemma, an open-weight model whose headline trick is parallel block decoding — it writes text by refining a whole block of tokens at once through iterative denoising, instead of predicting one next token at a time.

Why: Decoding is the slow, sequential part of running an LLM: emitting N tokens normally costs N forward passes that each wait on the last. Laying down a block in parallel is why DiffusionGemma reports up to 4x faster decode and 1000+ tokens/sec on an H100.

vs prior: Versus standard autoregressive decoding — left-to-right, one token per forward pass under a causal mask — DiffusionGemma starts from a canvas of 256 placeholder tokens and refines them all at once with bidirectional attention, so it can revise an early token using later context.

Think of it as

a Polaroid photo developing all at once vs a printer typing left to right

                       THE PARAGRAPH
                           │
             ┌─────────────┴─────────────┐
             │                           │
     ┌───────▼────────┐         ┌────────▼───────┐
     │    PRINTER     │         │    POLAROID    │
     │ (autoregress.) │         │  (diffusion)   │
     └───────┬────────┘         └────────┬───────┘
             │                           │
     types one token            lays the whole block
     left-to-right, each        down at once, then
     waiting on the last        sharpens it in passes
             │                           │
             ▼                           ▼
        ✗ N passes              ✓ a few parallel
          per N tokens            passes per block
        (sequential)            (up to ~4x faster)

printer (autoregressive) = types one token left-to-right, each waiting on the last
blank Polaroid = a block of 256 placeholder tokens, all present at once but unreadable
the photo developing = iterative denoising — the whole block sharpens in parallel over a few passes
no corner-first rule = bidirectional attention, so any token can use any other to fix itself

Quick glossary

Autoregressive decoding — The standard way LLMs write: predict the next token, append it, feed the longer sequence back in, repeat. Each token needs its own forward pass, and they happen strictly in order — that sequential chain is what makes decode slow.

Diffusion language model — A text model that borrows the recipe behind image generators: start from noise (random placeholder tokens) and repeatedly denoise toward a clean output. Unlike image diffusion it works over discrete tokens, refining a block rather than left-to-right.

Iterative denoising — The refinement loop. Each pass locks in the high-confidence tokens and re-evaluates the rest, so a blank block sharpens into readable text over a handful of passes instead of one token at a time.

Bidirectional (non-causal) attention — Attention with no left-to-right rule: every position can look at every other, future included. It is what lets the model fix an early token using context that appears later — the opposite of the causal mask autoregressive decoders rely on.

Forward pass — One run of the input through the network. Autoregressive decode pays one forward pass per token; DiffusionGemma emits 256 tokens from each pass, then spends a few more passes cleaning them up.

Mixture-of-Experts (MoE) — A model split into many expert sub-networks where each token activates only a few. DiffusionGemma is 26B total / ~3.8B active, so it has the knowledge of a big model but the per-token compute of a small one.

The news. On June 10, 2026, Google released DiffusionGemma, an Apache-2.0 model that generates text by iterative denoising rather than left-to-right sampling. It seeds a block with placeholder tokens and refines 256 tokens in parallel per forward pass using bidirectional attention, reaching 1000+ tokens/sec on an H100 and 700+ tokens/sec on an RTX 5090, and fitting in 18 GB of VRAM when quantized. It is a 26B-parameter mixture-of-experts model with about 3.8B active. Read the announcement →

Picture two machines printing the same paragraph. The first is a dot-matrix printer: it types one character at a time, left to right, and the next character can't start until the previous one lands. That is autoregressive decoding, the way nearly every LLM you have used writes one token at a time. The second is a Polaroid: the whole photo comes out at once, blank and blurry, then sharpens everywhere simultaneously over a few seconds. DiffusionGemma is the Polaroid. It lays down a whole block of placeholder tokens up front and then develops them in parallel, so the paragraph appears all at once and gets clearer with each pass.

Underneath the metaphor, "developing the photo" is iterative denoising. The model seeds a block with 256 noisy placeholder slots, then makes several refinement passes; each pass locks in the tokens it is now confident about and re-evaluates the rest. The trick that makes this legal is bidirectional attention — dropping the causal mask that forces a normal decoder to only look backward. Because every slot can attend to every other slot, future included, the model can self-correct an early token using words that only got resolved later. A left-to-right decoder can never do that: once it commits token 5, tokens 6 onward can lean on it, but it can't lean on them.
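The refinement loop can be sketched with a toy stand-in for the model. Everything here is an assumption for illustration: `toy_denoiser` simply knows the target text and fakes a confidence that rises when neighbours on either side are already resolved, and locking in a fixed number of slots per pass is one simple schedule, not necessarily DiffusionGemma's.

```python
MASK = "<mask>"


def toy_denoiser(block, target):
    """Hypothetical model stand-in: guess (token, confidence) for every masked slot.

    Confidence rises when neighbours on EITHER side are resolved, which mimics
    what bidirectional attention makes possible.
    """
    guesses = {}
    for i, tok in enumerate(block):
        if tok != MASK:
            continue
        left = i > 0 and block[i - 1] != MASK
        right = i < len(block) - 1 and block[i + 1] != MASK
        confidence = 0.2 + 0.4 * left + 0.4 * right + 0.01 * (i % 3)
        guesses[i] = (target[i], confidence)
    return guesses


def denoise_block(target, per_pass):
    """Start fully masked; each pass locks in the per_pass most confident slots."""
    block = [MASK] * len(target)
    passes = 0
    while MASK in block:
        guesses = toy_denoiser(block, target)          # re-evaluate every open slot
        best = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)[:per_pass]
        for i in best:
            block[i] = guesses[i][0]                   # lock in the confident ones
        passes += 1
    return block, passes


target = "the photo sharpens over a few quick passes".split()
block, passes = denoise_block(target, per_pass=2)
print(passes, " ".join(block))
```

Note that slots are not filled left to right: the order follows confidence, so a late word can resolve before an early one, which a causal decoder cannot do.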

Property	Autoregressive (standard Gemma)	Parallel block decoding (DiffusionGemma)
How a token is produced	predict the single next token, append, repeat	seed a block of placeholders, denoise all at once
Tokens per forward pass	1	256 (Google)
Attention	causal (look backward only)	bidirectional (look both ways)
Can fix an earlier token?	no — already committed	yes — re-evaluated each pass
Reported decode speed	baseline	up to ~4x faster, 1000+ tok/s on H100 (Google, reported)

Why does generating in blocks win? Run the numbers on a 512-token answer (illustrative). The autoregressive printer needs 512 forward passes, one per token, each stalled waiting on the previous one, which is exactly why decode is the latency-bound phase of LLM inference, limited by memory bandwidth rather than arithmetic. DiffusionGemma instead lays those 512 tokens down as two blocks of 256 and refines each over a handful of denoising passes, say ~16 passes total (illustrative; Google reports the speedup, not the pass count). That collapses hundreds of strictly sequential steps into a few parallel ones, and a parallel-friendly pass keeps the GPU busy, which is where the up to 4x faster decode and 1000+ tokens/sec on an H100 come from.
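The step count is simple enough to write down. In this sketch `passes_per_block=8` is an assumed value chosen so that two blocks give the illustrative 16 passes; Google does not publish the pass count, and the function names are hypothetical.

```python
import math


def autoregressive_steps(n_tokens):
    """One sequential forward pass per generated token."""
    return n_tokens


def block_diffusion_steps(n_tokens, block_size=256, passes_per_block=8):
    """Sequential passes when tokens are refined one block at a time."""
    n_blocks = math.ceil(n_tokens / block_size)
    return n_blocks * passes_per_block


ar = autoregressive_steps(512)
bd = block_diffusion_steps(512)
print(ar, bd, ar / bd)
```

The ratio of sequential steps is not the wall-clock speedup: each denoising pass processes a whole block and costs more than one autoregressive step, which is why the reported gain (up to 4x) is far smaller than the step ratio.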

The catch is that each denoising pass is heavier than a single autoregressive step. Bidirectional attention re-reads the whole block every pass, so it can't reuse a backward-only KV cache the way a causal decoder does, and the headline 4x is measured on dedicated GPUs where that parallel work has lanes to fill. DiffusionGemma offsets the cost with a mixture-of-experts design — 26B total parameters but only ~3.8B active per token — and ships in 18 GB of VRAM when quantized, so it still fits a high-end consumer GPU such as the RTX 5090 the source benchmarks. The payoff is a different shape of LLM: not a faster printer, but a model that drafts a paragraph all at once and sharpens it — a live, open-weight alternative to left-to-right decoding.

Goes deeper in: LLM Internals → Text Generation → One Token at a Time

Related explainers

PSD — parallel speculative decoding for diffusion LLMs — a different lever on the same family: speed up diffusion decoding by drafting and verifying, rather than tuning the denoising schedule
dMoE — block-level expert routing — the memory side of serving a diffusion-LLM mixture-of-experts, the model family DiffusionGemma belongs to
Gemma 4 12B — encoder-free multimodal projection — a sibling open Gemma release chasing efficiency from the architecture side instead of the decoding side

FAQ

What is parallel block decoding?

It is a way to generate text a whole block at a time instead of one token at a time. DiffusionGemma seeds a block with 256 placeholder tokens, then makes several "denoising" passes that lock in the confident tokens and re-evaluate the rest, so the whole block sharpens in parallel. Because it uses bidirectional attention, the model can revise an early token using context that appears later — something an autoregressive, left-to-right decoder cannot do.

Why is it faster than autoregressive generation?

Autoregressive decoding produces one token per forward pass, and the passes happen strictly in order, so a 512-token answer needs 512 sequential steps. DiffusionGemma emits 256 tokens per pass and finishes a block in a handful of passes, collapsing hundreds of serial steps into a few parallel ones. Google reports up to 4x faster decode and 1000+ tokens/sec on an H100. The trade-off is that each denoising pass is heavier and can't reuse a backward-only KV cache, so the win is largest on dedicated GPUs.

How does it relate to diffusion image models and normal text generation?

It borrows the core idea from image diffusion — start from noise and repeatedly denoise toward a clean result — but applies it to discrete tokens and refines a block rather than a 2D image. Compared with normal autoregressive text generation, it swaps "predict the next token under a causal mask" for "refine a whole block under bidirectional attention." The output is still text; only the decoding procedure changes.

Originally posted on Learn AI Visually.

Agent-Harness Scaling Law: Feedback Quality Predicts Success, Not Raw Compute: Effective Feedback Compute (EFC)

pueding — Wed, 10 Jun 2026 11:15:58 +0000

What: A new agent-harness scaling-law paper introduces Effective Feedback Compute (EFC) — a single quantity that predicts whether an agent finishes a task from the quality of the feedback its harness returns each step, scored on four axes and normalized by how hard the task is.

Why: It reframes agent reliability as a feedback-quality problem, not a token-budget problem — plotted against EFC, harness-run success follows a clean law (R²≈0.94–0.99), while against raw compute the same runs barely fit (R²≈0.33–0.42).

vs prior: Prior reliability work leaned on raw-compute scaling — more tokens, more tool calls, bigger reasoning budgets — but EFC shows that axis is nearly flat, since lifting only feedback quality moved success from 0.27 to 0.90 with cost and tool-call counts held fixed.

Think of it as

a student with a sharp tutor instead of just re-reading the textbook

                  SAME EXAM, SAME HOURS LOGGED
                             │
               ┌─────────────┴──────────────┐
               ▼                            ▼
       ┌───────────────┐          ┌───────────────┐
       │  RE-READ THE  │          │  SHARP TUTOR  │
       │    TEXTBOOK   │          │  per problem  │
       │ (raw compute) │          │ (feedback Q)  │
       └───────┬───────┘          └───────┬───────┘
               │                          │
      pages logged, but          points at the exact
      no correction lands        mistake — and it sticks
               │                          │
               ▼                          ▼
         ✗ grade ~0.27              ✓ grade ~0.90
         effort, no signal         signal absorbed

agent harness = the study setup that feeds you a correction each round
raw compute = hours logged and pages re-read
feedback quality = how useful the tutor's correction is each time
informativeness = the tutor points at the exact mistake, not "study harder"
validity = the correction is actually right, not misleading
non-redundancy = the tutor doesn't repeat a note you already wrote down
retention = you keep the correction in your notes for the next problem
EFC = total useful correction absorbed, divided by how hard the exam is

Quick glossary

EFC — Effective Feedback Compute — the paper's core metric. It measures how much useful feedback signal a harness feeds back into the agent loop, scored on four axes (informativeness, validity, non-redundancy, retention) and normalized by task demand. It is the x-axis of the proposed scaling law, replacing "tokens and tool calls spent."

Agent harness — The scaffolding around the model — the loop that runs tool calls, observes results, and feeds the next observation back to the model. The harness is what delivers feedback, so it is where EFC is won or lost. Covered in Agent Engineering → Production Harness Architecture.

Scaling law — An empirical curve that predicts an outcome (here, task success rate) from one quantity (here, EFC). A tight scaling law means the curve explains most of the variation; a loose one means the quantity is a poor predictor.

R² (fit quality) — The fraction of variation in success the curve explains, from 0 (the x-axis predicts nothing) to 1 (it predicts everything). EFC reaches R²≈0.94–0.99; the raw-compute baseline only 0.33–0.42. Higher R² = a better predictor.

The four feedback axes — Informativeness (does the message localize the error?), validity (is the correction actually right?), non-redundancy (is it new, or a repeat?), and retention (does the agent still have it later?). EFC is built from all four, so a harness can fail on any one of them.

Task demand — How much corrective signal a task actually needs to be solved. EFC divides feedback quality by task demand so harnesses can be compared fairly across easy and hard tasks — the same crisp feedback is worth more on a demanding task than a trivial one.

The news. On May 28, 2026, researchers posted an agent-harness scaling-law paper to arXiv introducing Effective Feedback Compute (EFC) — a metric that predicts agent success from the quality of feedback the harness returns, not the compute it spends. Plotted against EFC, harness-run success rates fit a clean scaling law (reported R²≈0.94–0.99 across datasets); plotted against raw compute, the same runs barely fit (R²≈0.33–0.42, rising to ~0.88 only with a hand-built multivariate baseline). In one controlled comparison, lifting feedback quality moved success from 0.27 to 0.90 with token cost and tool calls held fixed.

Picture two students prepping for the same exam. The first logs ten hours re-reading the textbook cover to cover — enormous effort, page after page. The second spends one hour with a sharp tutor who, after each practice problem, points at the exact line where the reasoning went wrong, confirms the fix is correct, never repeats a note already written down, and makes sure it lands in the margin for next time. On exam day the second student wins, and it is not close. The hours-logged number — the raw compute — told you almost nothing. The number that predicted the grade was how much useful correction actually got absorbed. That second number is what this paper names Effective Feedback Compute, and the claim is that agent harnesses behave the same way.

The mechanism is a re-definition of the x-axis. Instead of counting tokens or tool invocations, EFC measures the useful signal the harness feeds back each step — scored on four axes (informativeness, validity, non-redundancy, retention) — and then normalizes by task demand so a crisp correction counts for more on a hard task than an easy one. That normalized quantity becomes the horizontal axis of a scaling law that fits success rates across the paper's datasets. The practical reading for anyone building agents: the lever is not your reasoning budget but what your harness chooses to log and return after every tool call.
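How tightly a curve "fits" is what R² measures, and it takes only a few lines to compute. The sketch below is the standard coefficient of determination; the numbers in the sanity check are arbitrary and are not data from the paper.

```python
def r_squared(observed, predicted):
    """Fraction of the variance in observed that the predictions explain."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))   # unexplained variation
    ss_tot = sum((o - mean) ** 2 for o in observed)                   # total variation
    return 1 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0, 4.0]                      # arbitrary sanity-check values
print(r_squared(observed, observed))                 # perfect predictions
print(r_squared(observed, [2.5, 2.5, 2.5, 2.5]))     # always predicting the mean
```

A predictor that only ever guesses the average scores 0 and a perfect one scores 1, which is the scale on which the paper's reported 0.33 to 0.42 (raw compute) and 0.94 to 0.99 (EFC) should be read.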

This is why the raw-compute axis goes flat. A harness can burn an enormous budget returning low-quality feedback — a terse exit code 1 with no stack trace (low informativeness), a linter warning that is actually a false positive (low validity), the same "tests failed" string ten turns in a row (high redundancy), or an error the agent has already forgotten by the time it matters (low retention). All of that is real compute and real tool calls, and on the EFC axis it is worth almost nothing. The tutor who just says "study harder" for an hour spent the hour; the student learned nothing. Worse, in a long rollout the low-signal steps let compounding errors accumulate unchecked, so the spend actively buys you a longer path to the same failure.

Where the feedback gap actually comes from

Hold three variables fixed. One agent. One task. Two runs at the same budget — 40 tool calls, ~120K tokens each. The only difference is the harness's feedback quality. In Run A, every step returns a terse pass/fail string; say each step carries about 0.1 units of useful, valid, non-redundant, retained signal, so over 40 steps the agent accumulates 40 × 0.1 = 4 units. The task demands roughly 30 units to solve, so EFC = 4 / 30 ≈ 0.13 — low on the law's curve, landing near the 0.27 success rate the paper reports at the bottom of its range. In Run B, the harness returns the failing assertion, the offending input, and a one-line diff each step — call it 0.8 units per step, 40 × 0.8 = 32 units, EFC = 32 / 30 ≈ 1.07, high on the curve and up near 0.90 success. Same cost, same tool count, ~8× the effective feedback (illustrative decomposition calibrated to the paper's 0.27→0.90 and R² headline figures — the per-step unit values and task-demand figure are stand-ins, not measured constants). The success jump is the headline; the per-call yield jump is the deeper story.
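The arithmetic above can be replayed in a few lines. This is a minimal sketch of the illustrative decomposition only: the function name `effective_feedback_compute`, the per-step signal values, and the task-demand figure of 30 are stand-ins taken from the paragraph above, not the paper's actual estimator.

```python
def effective_feedback_compute(steps: int, signal_per_step: float, task_demand: float) -> float:
    """Accumulated useful signal divided by task demand (illustrative stand-in)."""
    return steps * signal_per_step / task_demand


TASK_DEMAND = 30.0  # assumed units of signal needed to solve the task
STEPS = 40          # same tool-call budget for both runs

efc_a = effective_feedback_compute(STEPS, 0.1, TASK_DEMAND)  # terse pass/fail strings
efc_b = effective_feedback_compute(STEPS, 0.8, TASK_DEMAND)  # assertion + input + diff

print(f"Run A EFC = {efc_a:.2f}")
print(f"Run B EFC = {efc_b:.2f}")
print(f"Run B / Run A = {efc_b / efc_a:.1f}x")
```

Because both runs share the same step count and task demand, the ratio between them reduces to the ratio of per-step signal, which is where the roughly 8× figure comes from.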

Scaling-law x-axis	What it counts	Fit to success (R²)
Raw compute	tokens + tool calls spent	~0.33–0.42 — poor (paper)
Multivariate compute baseline	several spend features combined	~0.88 — better, hand-built (paper)
Effective Feedback Compute (EFC)	4-axis feedback quality ÷ task demand	~0.94–0.99 — tight (paper)

A caveat worth stating plainly: this is a scaling-law fit on the paper's own datasets, and a tight fit is a strong correlation, not a guaranteed control knob. EFC is also harder to move than a token budget — "return better feedback" is a design problem, not a slider, and scoring the four axes reliably is itself non-trivial. The honest framing is that EFC gives you a yardstick and a direction: instrument the feedback your harness returns, A/B candidate changes in shadow, and treat feedback quality as a first-class number alongside latency and cost. Whether the exact coefficients transfer to your stack is exactly the kind of thing you should measure, not assume.

Related explainers

PushBench — Quantitative Goal Persistence (QGP) — another harness-level number for long-horizon agent reliability
FutureSim — harness-level agent eval — why evaluating the harness, not the model alone, is the trend
Cursor Composer 2.5 — targeted textual feedback RL — the training-time analogue: a sharp, targeted correction beats a blunt end-of-rollout reward

FAQ

What is Effective Feedback Compute (EFC)?

EFC is a metric that predicts agent-harness success from the quality of the feedback the harness returns each step, rather than from the raw compute it spends. It scores feedback on four axes — informativeness, validity, non-redundancy, and retention — and normalizes by task demand so harnesses can be compared fairly across easy and hard tasks. Plotted against EFC, the paper reports success rates fitting a scaling law at R²≈0.94–0.99, far tighter than the ~0.33–0.42 fit against raw compute.

Why does feedback quality predict success better than raw compute?

A harness can spend an enormous budget returning low-quality feedback — terse pass/fail strings, false-positive warnings, repeated messages, or errors the agent has already forgotten. That is real compute that carries almost no useful signal, so the raw-compute axis goes nearly flat. EFC captures the signal that actually reaches the agent, which is why it fits success so much more tightly. In one controlled comparison, lifting only feedback quality moved success from 0.27 to 0.90 with token cost and tool-call counts held fixed.

How do I improve a harness's EFC in practice?

Treat the feedback your harness returns as a first-class design surface: make tool-call results localize the error (informativeness), verify the signal is correct before returning it (validity), suppress repeated or stale messages (non-redundancy), and persist corrections so they survive later in the rollout (retention). Because EFC is a measurable yardstick rather than a slider, the practical loop is to instrument the feedback you return, A/B candidate changes in shadow mode, and track feedback quality alongside latency and cost.
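As one way to picture that loop, here is a small sketch of a feedback wrapper that covers three of the four axes. Everything in it is hypothetical: the `FeedbackChannel` class, its method names, and the message format are invented for illustration and do not come from the paper or any particular harness. Validity is deliberately left out, since checking that a signal is correct depends on the tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeedbackChannel:
    """Hypothetical wrapper around what a harness returns after each tool call."""
    first_seen: Dict[str, int] = field(default_factory=dict)  # feedback text -> step first returned
    notes: List[str] = field(default_factory=list)            # corrections kept for later steps

    def emit(self, step: int, status: str, detail: str = "") -> str:
        # Informativeness: attach the localized detail instead of a bare status.
        text = f"{status}: {detail}" if detail else status
        # Non-redundancy: do not repeat a message the agent has already been shown.
        if text in self.first_seen:
            return f"{status} (unchanged since step {self.first_seen[text]})"
        self.first_seen[text] = step
        # Retention: persist the correction so it can be replayed later in the rollout.
        self.notes.append(f"step {step}: {text}")
        return text

    def retained_notes(self) -> str:
        """Everything worth re-surfacing to the agent later on."""
        return "\n".join(self.notes)


channel = FeedbackChannel()
detail = "test_parse: expected 3 items, got 2 for input '1,2,,3'"
first = channel.emit(1, "tests failed", detail)
repeat = channel.emit(2, "tests failed", detail)
print(first)
print(repeat)
```

The second call returns a short pointer instead of the same string again, and `retained_notes()` gives the harness something to prepend when the original message has scrolled out of context.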

Originally posted on Learn AI Visually.

AutoLab Benchmarks Frontier Agents on Long-Horizon R&D Tasks: Iterative Experiment-Loop Evaluation

pueding — Tue, 09 Jun 2026 11:25:04 +0000

What: The AutoLab benchmark scores agents with iterative experiment-loop evaluation — 36 realistic R&D tasks (optimize a system, tune a CUDA kernel, build a model) where the agent has to propose a change, run an experiment, measure the result, and refine, over and over.

Why: Across 17 frontier models, the strongest predictor of success was sustained iteration that incorporates empirical feedback plus time-awareness — knowing when to keep going — rather than the quality of the first answer.

vs prior: Most LLM benchmarks grade a single answer once; AutoLab grades the whole propose → run → measure → refine loop under a budget, exposing two failure modes a one-shot score is blind to: stopping too early and burning the budget with no measured progress.

Think of it as

tuning a race car in the pit, reading lap times until qualifying closes

         SAME CAR, SAME LAP BUDGET (12 laps)
                          │
        ┌─────────────────┬─────────────────┐
        ▼                 ▼                 ▼
   ┌─────────┐       ┌─────────┐       ┌─────────┐
   │ PARK    │       │ RE-TUNE │       │ TIME +  │
   │ EARLY   │       │ NEVER   │       │ TUNE    │
   │         │       │ TIME    │       │ EVERY   │
   │ 4 laps, │       │ 12 laps,│       │ LAP     │
   │ then    │       │ no clock│       │ 8 timed │
   │ quit    │       │ reading │       │ laps    │
   └────┬────┘       └────┬────┘       └────┬────┘
        ▼                 ▼                 ▼
   stops at          random-walks       compounds to
   ~0.46             ~0.27              ~0.76
   ✗ budget          ✗ no measured      ✓ best lap
     left unused       progress           wins slot

task = set the fastest lap before qualifying closes
experiment loop = adjust the setup → run a lap → read the lap time → adjust again
empirical feedback = the lap time on the stopwatch, not a guess from the spec sheet
budget = the laps you have before the qualifying flag drops
stopping early = parking after four laps with time still on the clock
burning the budget = re-tuning every lap but never reading the timer
persistence = keep timing and tuning until the very last lap

Quick glossary

Long-horizon task — A task that takes many steps and a real budget to finish — not one question with one answer, but a goal you reach by doing work, checking it, and adjusting. AutoLab's tasks run for many tool-using steps.

Experiment loop — The repeating cycle at the heart of R&D work: propose a change → run an experiment or benchmark → measure the outcome → refine. AutoLab scores whether an agent actually keeps this loop turning, not just whether its first attempt looked good.

Empirical feedback — A result you measured by running something — a benchmark number, a test pass/fail, a latency reading — as opposed to a guess. The key move is conditioning the next edit on a number the agent ran itself.

Time-awareness — The agent's sense of how much budget is left and whether more iteration is worth it. Failing it shows up two ways: quitting with budget unspent, or thrashing until the budget runs out with nothing to show.

Agent harness — The runtime that wraps a model into an agent — it schedules tool calls, runs the experiments, and feeds results back into the loop. The same model in a better harness can score very differently.

CUDA-kernel optimization — One of AutoLab's four domains: rewrite a GPU kernel to run faster, then benchmark it to see if it actually did. It is a textbook measure-and-refine loop — and it ties this agent benchmark to the GPU & CUDA track.

The news. Posted to arXiv on June 3, 2026, AutoLab is a benchmark of 36 long-horizon R&D tasks across four domains (system optimization, puzzle & challenge, model development, and CUDA-kernel optimization) that ask an agent to propose changes, run experiments, measure outcomes, and iterate. Across the 17 state-of-the-art models evaluated, the dominant predictor of success was persistence in repeatedly benchmarking, editing, and incorporating empirical feedback, not the quality of the initial response. Most frontier models either stopped prematurely or burned their budget with minimal progress; Claude-opus-4.6 showed the strongest long-horizon optimization behavior.

Picture a pit crew with a fixed number of laps before qualifying closes. The car that wins the slot isn't the one that posted the best first lap — it's the one whose crew keeps reading the lap time, adjusting the setup, and sending it back out until the flag drops. AutoLab is built on exactly this insight for agents: it hands an agent a real engineering goal and a budget, then watches not the first attempt but whether the agent keeps the experiment loop — propose → run → measure → refine — turning all the way to the deadline.

That loop is the whole concept of iterative experiment-loop evaluation. A classic LLM benchmark asks one question and grades one answer; the agent never gets to run anything. AutoLab instead scores the agent on tasks where it must execute its own experiments and read its own results — the errors compound across a long trajectory, so the only way to climb is to measure, learn, and correct. Crucially, the useful signal here is empirical feedback the agent generates itself (it benchmarks its own kernel and reads the number), which is a different lever from feedback a harness hands back step-by-step.

The benchmark's headline finding is that frontier models fail this in two distinct ways, and both are about knowing when to stop. Some agents stop too early — they post a decent second attempt and quit with most of the budget unspent. Others burn the whole budget but skip the measure step: they keep editing without conditioning each change on a result, so the score random-walks and never compounds. The agents that did well — led by Claude-opus-4.6 — spent their reasoning budget on a disciplined measure-then-refine cadence, which is exactly the time-awareness a one-shot eval can never see.

Why does this matter beyond a leaderboard? Because it relocates the bottleneck for long-horizon agents from raw capability to behavior under a budget. The same skill that tops AutoLab — sustained, measured iteration — is what production teams care about when an agent tunes a config, optimizes a kernel, or chases a flaky test over an afternoon. That makes AutoLab a production-eval signal, not just an academic one: it predicts whether an agent will actually grind a real task to a good result instead of giving up or spinning.

AutoLab domain	What the agent iterates on	What it measures each loop
System optimization	Configs, flags, resource allocation	Throughput / latency of a benchmark run
CUDA-kernel optimization	A GPU kernel's implementation	Wall-clock kernel time vs a baseline
Model development	Training / architecture choices	A validation metric on a held-out set
Puzzle & challenge	Candidate solutions to a hard problem	Pass / fail against the checker

Four domains, 36 tasks total across them; the exact per-task scores are reported in the paper, and the row examples above are illustrative of the loop structure (AutoLab, arXiv 2606.05080).

Where the budget actually goes (numbers illustrative — AutoLab reports the model ranking and the persistence finding, not these per-task point values). Hold three things fixed: a budget of 12 experiment runs, a starting score of ~0.23 (the first answer — roughly the same for all three agents), and a per-loop gain that only lands when the agent measures. Agent A makes 4 measured runs at about +0.06 each, reaches ~0.46, then stops with 8 runs unused. Agent B spends all 12 runs but skips the measure step, so its edits aren't conditioned on a read result — its score random-walks around ~0.27 and never compounds. Agent C makes 8 measured runs, each conditioned on the last result, compounding to ~0.76. Same start, same budget; the entire gap comes from how the loop was spent, not from the first try.
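The three behaviours can be simulated with a toy loop. This sketch does not reproduce the figures above: the `run_agent` helper, the list of edit effects, and the keep-only-if-better rule are all made up to show the shape of the argument, and the measuring agent here uses the full 12-run budget.

```python
from typing import List, Optional


def run_agent(start: float, deltas: List[float], measure: bool,
              stop_after: Optional[int] = None) -> float:
    """Replay a fixed list of proposed edits under a run budget.

    measure=True  -> benchmark each edit and revert it if the score dropped.
    measure=False -> keep every edit without reading the result.
    stop_after    -> quit after this many runs (None = use the whole budget).
    """
    score = start
    budget = len(deltas) if stop_after is None else min(stop_after, len(deltas))
    for delta in deltas[:budget]:
        candidate = score + delta
        # Blind agents always keep the edit; measuring agents keep only gains.
        if not measure or candidate > score:
            score = candidate
    return score


# Made-up effect of 12 proposed edits on the score: half help, half hurt.
DELTAS = [0.06, -0.05, 0.07, -0.06, 0.06, -0.07, 0.08, -0.05, 0.05, -0.06, 0.06, -0.07]
START = 0.23

early = run_agent(START, DELTAS, measure=True, stop_after=4)
blind = run_agent(START, DELTAS, measure=False)
steady = run_agent(START, DELTAS, measure=True)
print(f"stops early: {early:.2f}  never measures: {blind:.2f}  measures every run: {steady:.2f}")
```

All three agents see the same edits in the same order. The only differences are whether each edit is conditioned on a measured result and how much of the budget gets used, which is the distinction the benchmark is built to surface.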

FAQ

What is iterative experiment-loop evaluation?

It is scoring an agent on whether it keeps a propose → run → measure → refine loop turning, rather than grading a single answer. AutoLab gives the agent a real R&D task and a budget, then rewards measured iteration toward a better result instead of a good-looking first attempt.

Why does sustained iteration beat initial answer quality?

On long-horizon tasks the first attempt is rarely the best one, and errors compound. The agents that win are the ones that read an empirical result, correct, and repeat — using their whole budget. AutoLab found this disposition, not first-shot quality, was the dominant predictor across 17 models.

How does AutoLab relate to metrics like EFC and QGP?

They are complementary lenses on long-horizon agent reliability. EFC isolates the quality of the feedback signal a harness returns; QGP measures whether an agent finishes a fixed count of work without spinning; AutoLab measures whether the agent sustains its own measure-and-refine loop under a budget on realistic R&D tasks.

Originally posted on Learn AI Visually.

MCP SEP-2106: Full JSON Schema 2020-12 in Tool I/O

pueding — Mon, 08 Jun 2026 11:18:18 +0000

What: MCP SEP-2106 — merged into the protocol on May 18, 2026 — lets an MCP tool describe its inputs and outputs with the full JSON Schema 2020-12 keyword set in inputSchema and outputSchema, and widens structuredContent from object-only to any JSON value.

Why: Composition (oneOf / anyOf / allOf), conditionals (if / then / else), and references ($ref / $defs) let a tool author push contract rules out of free-form description prose and into the schema, where runtimes and SDKs can validate them before the call ever reaches the tool.

vs prior: The previous MCP spec accepted only a narrow JSON Schema subset (object root with a basic type / properties / required vocabulary); composition, conditionals, refs, and non-object output shapes were not part of the wire vocabulary and had to live in tool description prose.

Think of it as

It's like a job application form with conditional sections, alternatives, and refs.

           THE TOOL'S inputSchema (a form)
                        │
        ┌───────────────┴────────────────┐
 ┌──────▼───────┐                 ┌───────▼──────┐
 │ BEFORE 2106  │                 │ AFTER 2106   │
 │ plain fields │                 │ fields PLUS  │
 │ + prose note │                 │ oneOf / if / │
 │ at the bottom│                 │ then / $ref  │
 └──────┬───────┘                 └───────┬──────┘
        │                                 │
  rules live in                    rules live in
  English prose                    the schema
        │                                 │
        ▼                                 ▼
 ✗ runtime CANNOT                 ✓ runtime REJECTS
   check them                       bad calls early

MCP tool inputSchema = the application form a tool requires from the agent
basic keywords (type / properties / required) = plain text fields and required checkboxes
oneOf / anyOf / allOf = pick exactly one / any combination / all of these alternatives
if / then / else = if you marked 'married', also fill spouse details
$ref / $defs = see the 'Company Address' subform on page 4

Quick glossary

MCP — The Model Context Protocol — a JSON-RPC wire protocol that lets LLM clients (Claude, ChatGPT, IDEs) discover and call tools served by external processes. See the MCP step in the Tool Use module.

SEP — A Specification Enhancement Proposal — the MCP equivalent of a Python PEP or a TC39 proposal. Each SEP is a numbered RFC merged into the spec only after review.

inputSchema / outputSchema — The two JSON Schema documents an MCP server attaches to a tool definition — one for the arguments the agent must send, one for the structured value the tool returns. The runtime validates traffic against them before either side sees a malformed payload.

structuredContent — The field inside a tool result that carries a typed value alongside the human-readable content blocks. Pre-SEP-2106 the TypeScript type was { [key: string]: unknown } — objects only; after SEP-2106 it is plain unknown, so arrays and primitives are wire-legal too.

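JSON Schema 2020-12 — The 2020-12 draft of the JSON Schema spec, the most recent stable version. Relative to the narrow subset MCP previously accepted, it brings composition (oneOf / anyOf / allOf / not), conditionals (if / then / else), references ($ref / $defs), and $dynamicRef. Within JSON Schema itself most of these are older than 2020-12: composition dates back to draft-04 and if / then / else to draft-07, while $defs arrived in 2019-09 and $dynamicRef in 2020-12.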

oneOf / anyOf / allOf — JSON Schema composition keywords. oneOf = match exactly one of N subschemas; anyOf = match at least one; allOf = match every subschema (intersection).

if / then / else — JSON Schema conditional keywords. If a value matches the if subschema, it must also match then; otherwise it must match else. Lets a single schema express "if roundTrip is true, return_date is required."

$ref / $defs — JSON Schema reference keywords. $defs declares reusable named subschemas; $ref points at one of them by JSON Pointer. Lets a long schema avoid copy-pasting the same address or money sub-shape three times.

The news. On May 18, 2026, SEP-2106 merged into the MCP specification. The change widens the schema vocabulary that tools may use to describe their input and output: inputSchema now allows the full JSON Schema 2020-12 keyword set inside its required type: "object" root, outputSchema drops the object-root constraint entirely and accepts any 2020-12 schema, and structuredContent is retyped from object-only to plain unknown. Loosening on paper — but the SEP is explicit that compatibility is asymmetric: a newer server emitting a non-object structuredContent or a composition-rich schema may be rejected by an older client that hasn't been updated, so the SEP recommends servers also emit a serialized TextContent fallback for non-object results during the transition.

Picture a job application that, until last week, only let you fill in plain text fields and checkboxes — name, address, "married?" yes/no. If the form needed something conditional ("if married, also provide spouse name") or alternative ("attach exactly ONE of passport, driver's license, or state ID"), the only way to express it was a paragraph of free-text instructions at the bottom of the page. SEP-2106 hands the form designer a richer template language: now the conditional, the alternatives, and the cross-references to other subforms are spelled out on the form itself, in a way the form's automated validator can actually check before the application gets routed.

The technical reason mirrors the metaphor. Before SEP-2106, the MCP wire spec implied a narrow JSON Schema subset — basically the keywords a 2014-era schema validator would understand: type, properties, required, items, enum, additionalProperties. If a tool needed to express "either a one-way booking (no return date) or a round-trip booking (return date required)," the schema author had two bad options: split it into two separate tools (now the model has to pick), or leave it as one tool with a permissive schema and a paragraph of natural-language instructions in description. The first option inflates the agent's tool registry; the second relies on the model honoring prose constraints that the runtime can't enforce.

Three surfaces, three changes

SEP-2106 touches three places on the wire, with slightly different shapes of change.

Surface	Before SEP-2106	After SEP-2106
`inputSchema` root	must be `type: "object"` (SEP-2106 commit)	must be `type: "object"` (unchanged)
`inputSchema` keywords inside the object root	restricted vocabulary the spec named — `type` / `properties` / `required` (SDKs typically also accepted `items`, `enum`, `additionalProperties`)	full JSON Schema 2020-12 — adds `oneOf` / `anyOf` / `allOf` / `not`, `if` / `then` / `else`, `$ref` / `$defs`, and the rest of the 2020-12 keyword set (SEP-2106 commit)
`outputSchema`	basic, object-rooted (mirrored `inputSchema`) (SEP-2106 commit)	fully flexible — any 2020-12 schema, including array roots, primitive roots, and composition (SEP-2106 commit)
`structuredContent` TypeScript type	`{ [key: string]: unknown }` — object only (SEP-2106 commit)	`unknown` — array, primitive, union, object all wire-legal (SEP-2106 commit)

The root constraint on inputSchema is preserved because every tool call still ships a JSON-RPC arguments object — the call is arguments: { ... }, not arguments: 7. What changed is everything inside that object, plus the symmetric story for what a tool can return.

A worked example

Picture a book_flight tool. Before SEP-2106, its inputSchema could declare four fields — from, to, departure, optional return — using the restricted vocabulary the spec named (type, properties, required). To express "round-trip flights require return, one-way flights forbid it," the author had three options: split into two tools (book_one_way, book_round_trip), leave a permissive schema and write a paragraph of description prose, or both. After SEP-2106, the same tool fits in one schema using composition:

{
  "type": "object",
  "properties": {
    "from": {...}, "to": {...},
    "departure": {...}, "return": {...},
    "roundTrip": { "type": "boolean" }
  },
  "required": ["from", "to", "departure", "roundTrip"],
  "oneOf": [
    {
      "properties": { "roundTrip": { "const": true } },
      "required": ["return"]
    },
    {
      "properties": { "roundTrip": { "const": false } },
      "not": { "required": ["return"] }
    }
  ]
}

The new schema reaches for oneOf, two branch subschemas with their own properties and required, a not, and two const guards. All of these keywords already existed in the JSON Schema 2020-12 standard, but oneOf, not, and const were outside the wire vocabulary MCP would accept before this SEP. The runtime can now reject a malformed call before it ever reaches the tool, instead of relying on the LLM to read and honor a paragraph of English in the description field.
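To see which calls the schema accepts, here is a toy validator that understands only the keywords this one schema uses. It is a sketch for illustration, not a JSON Schema implementation: a real runtime would use a full 2020-12 validator. The string-typed field definitions are an assumption, since the schema above elides them, and the sample airport codes and dates are invented.

```python
def is_valid(schema: dict, value) -> bool:
    """Check `value` against the handful of keywords the book_flight schema uses."""
    expected = {"object": dict, "string": str, "boolean": bool}.get(schema.get("type"))
    if expected is not None and not isinstance(value, expected):
        return False
    if "const" in schema:
        const = schema["const"]
        if type(value) is not type(const) or value != const:
            return False
    if isinstance(value, dict):
        if any(key not in value for key in schema.get("required", [])):
            return False
        for key, sub in schema.get("properties", {}).items():
            if key in value and not is_valid(sub, value[key]):
                return False
    if "not" in schema and is_valid(schema["not"], value):
        return False
    # oneOf: exactly one branch may match.
    if "oneOf" in schema and sum(is_valid(s, value) for s in schema["oneOf"]) != 1:
        return False
    return True


STRING = {"type": "string"}  # assumption: the article elides the field definitions
BOOK_FLIGHT = {
    "type": "object",
    "properties": {"from": STRING, "to": STRING, "departure": STRING,
                   "return": STRING, "roundTrip": {"type": "boolean"}},
    "required": ["from", "to", "departure", "roundTrip"],
    "oneOf": [
        {"properties": {"roundTrip": {"const": True}}, "required": ["return"]},
        {"properties": {"roundTrip": {"const": False}}, "not": {"required": ["return"]}},
    ],
}

base = {"from": "SFO", "to": "NRT", "departure": "2026-07-01"}
round_trip_ok = is_valid(BOOK_FLIGHT, {**base, "roundTrip": True, "return": "2026-07-10"})
round_trip_missing = is_valid(BOOK_FLIGHT, {**base, "roundTrip": True})
one_way_ok = is_valid(BOOK_FLIGHT, {**base, "roundTrip": False})
one_way_extra = is_valid(BOOK_FLIGHT, {**base, "roundTrip": False, "return": "2026-07-10"})
print(round_trip_ok, round_trip_missing, one_way_ok, one_way_extra)
```

The two rejected calls fail because neither oneOf branch matches: a round trip without `return` misses the first branch's `required`, and a one-way call with `return` trips the second branch's `not`.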

Why this lands now

Two pressures converged. First, tool authors kept hitting the prose-vs-schema boundary: every nontrivial real-world tool grew a description paragraph explaining what its schema couldn't say, and that paragraph then needed to be re-explained to every model that called the tool. Second, the structured tool I/O step of the agent stack — where output validation lives — assumed an object-rooted structuredContent shape that forced tools returning a list (list_files) or a scalar (count_rows) to wrap their result in { "value": ... }. Both pressures land at the schema vocabulary, so SEP-2106 widens both at once.

The rollout story is more nuanced than "strictly loosening." Existing tools that already used only the previously-allowed keywords keep working unchanged, and the wire protocol stays backward-compatible at the schema vocabulary level: composition keywords like oneOf are legal JSON either way, so an older client that doesn't validate them will simply skip the extra checks (the schema still parses, just with weaker validation). The friction is asymmetric: a newer server emitting a non-object structuredContent or a primitive-rooted outputSchema may be rejected by an older client whose type checks still expect an object, which is why the SEP recommends servers also emit a serialized TextContent fallback for non-object results during the transition. SDK consumers also see one TypeScript source break: the narrower { [key: string]: unknown } type is replaced by plain unknown, and any code that depended on the narrower type needs to widen its own annotations to match.
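A server-side sketch of that fallback might look like the following. The helper name `build_tool_result` is hypothetical, and the result is modelled as a plain dictionary with `structuredContent` and a list of `content` blocks; how a given SDK actually constructs results may differ, so treat this as an illustration of the idea only.

```python
import json
from typing import Any, Dict


def build_tool_result(value: Any) -> Dict[str, Any]:
    """Return a tool result carrying `value` as structuredContent.

    For non-object values, also include a serialized text block so that a
    client expecting object-only structuredContent still has something to read.
    """
    result: Dict[str, Any] = {"structuredContent": value, "content": []}
    if not isinstance(value, dict):
        result["content"].append({"type": "text", "text": json.dumps(value)})
    return result


listing = build_tool_result(["a.txt", "b.txt"])   # e.g. a list_files style tool
count = build_tool_result(7)                      # e.g. a count_rows style tool
record = build_tool_result({"rows": 7})           # object results need no fallback here
print(listing["content"][0]["text"])
print(count["content"][0]["text"])
```

With this shape a list or scalar no longer needs a `{ "value": ... }` wrapper, while older clients can still parse the text block.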

FAQ

What changed in MCP SEP-2106 in one sentence?

SEP-2106 lets MCP tool authors describe their inputs and outputs with the full JSON Schema 2020-12 keyword set — composition (oneOf / anyOf / allOf / not), conditionals (if / then / else), and references ($ref / $defs) — and widens structuredContent from an object-only TypeScript type to plain unknown, while keeping inputSchema's root type: "object" constraint unchanged.

Why does richer tool-schema vocabulary matter for agents?

The wire vocabulary is the only contract the runtime can validate before traffic reaches the tool. Anything that lives in the tool's free-form description prose has to be re-explained to every model that calls the tool, and the runtime can't reject a malformed call until the tool itself errors out. Pushing rules like "if roundTrip is true then return is required" into the schema means the SDK can reject the call before invocation and the model gets a structured error it can react to, instead of a tool-side stack trace.

Does SEP-2106 break existing MCP tools?

Existing tool definitions remain valid because the change only adds allowed keywords and widens types — nothing is removed. Compatibility is asymmetric, though: a newer server emitting a non-object structuredContent or a primitive-rooted outputSchema may be rejected by an older client whose type checks still expect an object. The SEP recommends servers also emit a serialized TextContent fallback for non-object results during the transition. There is also one source-level TypeScript break — consumers whose generic types narrowed structuredContent from unknown to { [key: string]: unknown } see a type error when they upgrade SDK versions, fixed by widening the consumer's type to match.

Originally posted on Learn AI Visually.

MarginGate: Margin-Gated Verification for Batch-Invariant Decoding

pueding — Sun, 07 Jun 2026 11:16:59 +0000

What: The MarginGate paper (arXiv) targets a subtle serving bug with margin-gated verification for batch-invariant decoding: temperature-0 BF16 decoding is treated as reproducible, yet the same prompt can emit different tokens decoded alone versus inside a larger batch.

Why: Reproducibility is load-bearing for debugging, evals, caching, and audits — yet in BF16 greedy serving, the batch a request lands in can silently change which token it emits from one run to the next.

vs prior: Always-on FP32 verification also restores determinism, but MarginGate re-checks only the sparse low-margin steps to reach it at roughly 2× less verification overhead in the paper.

Think of it as

An airport security line with a fast lane and a secondary-screening booth.

                      DECODE STEP
                          │
                 how wide is the margin?
                          │
           ┌──────────────┴──────────────┐
           │                             │
    ┌──────▼───────┐             ┌───────▼──────┐
    │  clean scan  │             │   near-tie   │
    │ wide margin  │             │  tiny margin │
    └──────┬───────┘             └───────┬──────┘
           │                             │
     FAST LANE (BF16)            SECONDARY (FP32)
     wave through                re-check the step
           │                             │
           ▼                             ▼
    ✓ same token, every         ✓ flip caught; K/V
      batch (no jitter)           column repaired

decode step = a traveler reaching the security checkpoint
logit margin = how clearly their boarding pass scans
high-margin step = a clean scan → waved through the fast lane (BF16)
low-margin step = a borderline scan → pulled into secondary screening (FP32)
K/V cache column repair = fixing the one mis-tagged bag before boarding

Quick glossary

BF16 (bfloat16) — A 16-bit floating-point format used for fast inference. It keeps FP32's exponent range but drops mantissa bits, so rounding errors are larger — enough that the order of a sum can change the result.

FP32 — 32-bit floating point — slower but far more precise. MarginGate uses it as the trusted reference to re-check only the steps that might be wrong.

logit margin — The gap between the top-1 and top-2 token scores at a decode step. A large margin means the winner is unambiguous; a tiny margin means a small numerical nudge can flip it.

greedy decoding (temperature 0) — Always emit the single highest-scoring token. People assume this is deterministic — the catch is that "highest-scoring" can change when the arithmetic changes.

floating-point reduction order — Summing numbers in a different order gives slightly different results in finite precision (addition isn't perfectly associative). GPU kernels pick their reduction order based on batch size — so the logits shift.

batch-invariance — The property MarginGate restores: a request produces the same tokens no matter how many other requests share its batch.

K/V cache — The cached keys and values from earlier tokens. When a step is repaired, MarginGate swaps the offending column of this cache so the rest of the sequence stays consistent. See the KV Cache module.

continuous batching — A serving technique where requests join and leave the running batch every step — which is exactly why a request's batch size (and its results) can vary run to run. See Batching.

The news. On May 28, 2026, a paper introduced MarginGate (arXiv 2605.30218), starting from an uncomfortable fact: temperature-0, greedy BF16 decoding is usually assumed to be reproducible, yet the same request can return different tokens depending on how many other requests happen to share its batch. MarginGate first measures how rare batch-induced token flips are, then verifies only the steps at risk.

Picture an airport security line. Almost every traveler has a boarding pass that scans cleanly, so the agent waves them straight through the fast lane — that's a decode step with a wide logit margin, where the top token wins by a mile and no amount of numerical jitter would change it. The trouble is the occasional borderline pass: a near-tie between the top two tokens. For those travelers, a tiny nudge decides which way they go — and at temperature 0, that nudge can come from something as invisible as the batch they were standing in.

Why would the batch matter? Because the GPU sums each token's scores in a reduction order that depends on batch size, and in BF16 addition isn't perfectly associative — re-order the sum and the last bit can change. For a confident step that is harmless. For a near-tie it can flip the winner, so the very same prompt emits one token when decoded alone and another when it rides inside a larger batch. The root cause lives one level down, in how BF16 trades mantissa bits for speed versus FP32.
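The order dependence is easy to reproduce on a CPU. The sketch below emulates bfloat16 by rounding a float32 bit pattern to its top 16 bits after every addition. This is a toy model of a reduction, not how any particular GPU kernel accumulates, and the terms and the rival logit are made up to make the effect visible.

```python
import struct
from typing import Iterable


def to_bf16(x: float) -> float:
    """Round a value to bfloat16 precision (round-to-nearest-even); finite inputs only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias for nearest-even
    bits &= 0xFFFF0000                    # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_sum(values: Iterable[float]) -> float:
    """Accumulate left to right, rounding to bfloat16 after every addition."""
    total = 0.0
    for v in values:
        total = to_bf16(total + v)
    return total


terms = [256.0, 1.0, 1.0, 1.0, 1.0]
big_first = bf16_sum(terms)               # 256.0: each +1 is rounded away
small_first = bf16_sum(reversed(terms))   # 260.0: the 1s combine to 4 before meeting 256
rival_logit = 258.0                       # a competing token's score (made up)

for label, logit in (("big term first", big_first), ("small terms first", small_first)):
    winner = "token A" if logit > rival_logit else "token B"
    print(f"{label}: logit={logit} -> {winner}")
```

Near 256 the representable bfloat16 values are 2 apart, so adding 1 at a time never moves the total, while adding the four 1s together first does. The same five numbers land on either side of the rival score depending only on the order in which they are summed.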

MarginGate's move is to gate on the margin. High-margin steps keep the cheap BF16 fast lane untouched. Only the sparse low-margin steps are sent to secondary screening — a re-computation in FP32, the same verify-then-correct shape that speculative decoding uses. If the trusted FP32 result disagrees with what BF16 produced, MarginGate repairs the step by swapping the offending column of the K/V cache so the rest of the sequence stays consistent. The expensive check fires on a handful of travelers, not the whole terminal.
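The gating logic described above can be sketched as follows. This is an assumption-laden illustration, not the paper's implementation: the function names, the threshold of 0.25, and the sample logits are invented, the FP32 logits are precomputed here where a real server would recompute them on demand, and the K/V cache repair is reduced to a comment.

```python
from typing import List, Tuple


def top2_margin(logits: List[float]) -> float:
    """Gap between the best and second-best score."""
    first, second = sorted(logits, reverse=True)[:2]
    return first - second


def argmax(logits: List[float]) -> int:
    return max(range(len(logits)), key=logits.__getitem__)


def gated_decode(steps: List[Tuple[List[float], List[float]]],
                 threshold: float) -> Tuple[List[int], int, int]:
    """Pick a token per step; re-check only low-margin steps against FP32.

    Each step is (bf16_logits, fp32_logits).
    Returns (tokens, steps_rechecked, flips_repaired).
    """
    tokens, rechecked, repaired = [], 0, 0
    for bf16_logits, fp32_logits in steps:
        token = argmax(bf16_logits)
        if top2_margin(bf16_logits) < threshold:
            rechecked += 1
            trusted = argmax(fp32_logits)
            if trusted != token:
                repaired += 1  # a real system would also repair the K/V cache here
                token = trusted
        tokens.append(token)
    return tokens, rechecked, repaired


steps = [
    ([9.0, 2.0, 1.0], [9.01, 2.0, 1.0]),        # wide margin: fast path
    ([4.0, 4.0625, 1.0], [4.05, 4.04, 1.0]),    # near-tie: BF16 and FP32 disagree
    ([3.0, 3.03125, 0.5], [3.0, 3.03, 0.5]),    # near-tie: both agree
]
tokens, rechecked, repaired = gated_decode(steps, threshold=0.25)
print(tokens, rechecked, repaired)
```

Only the two near-tie steps pay for the FP32 comparison, and only one of them changes the emitted token.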

How much does that save? Take a 1,000-token completion (illustrative). MarginGate flags the low-margin steps for an FP32 re-check: about 18% of all steps (the top of the paper's ~15–18% range), or ~180 steps, while the other ~820 keep the fast path. Of those 180, only a few are genuine flips: the paper measures flip rates of 0.3–1.3% of all steps (just 0.48% for Llama-3.1-8B on MATH500), so on the order of 3–13 tokens would actually have changed. In the paper's tested settings, MarginGate catches and repairs each one. Always-on verification would instead re-run all 1,000 steps in FP32 for the identical result, which is why margin-gating reports ~2× lower overhead than always-on verification (2.23× and 1.99× in the paper) while still restoring 100% sequence-level determinism on the models the paper tested (Llama-3.1-8B and Qwen2.5-14B).
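The same back-of-envelope numbers, written out so the rates can be swapped for your own workload. The variable names are ours, and the 18% gate rate is the illustrative figure used above rather than a measured value.

```python
TOTAL_STEPS = 1_000                 # illustrative completion length
GATE_RATE = 0.18                    # illustrative share of low-margin steps
FLIP_RATE_RANGE = (0.003, 0.013)    # paper: 0.3-1.3% of all steps flip

rechecked = round(TOTAL_STEPS * GATE_RATE)     # steps sent to the FP32 re-check
fast_path = TOTAL_STEPS - rechecked            # steps that stay on BF16
flips = tuple(round(TOTAL_STEPS * r) for r in FLIP_RATE_RANGE)
share_of_rechecks = tuple(f / rechecked for f in flips)

print(rechecked, fast_path, flips)             # 180 820 (3, 13)
print([f"{s:.1%}" for s in share_of_rechecks]) # ['1.7%', '7.2%']
```

Most re-checks therefore confirm the BF16 token rather than correct it. The gate is deliberately conservative, and it is still far cheaper than re-checking everything.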

Strategy	Steps re-checked	Determinism	Relative overhead
Trust BF16 (no verify)	none	✗ batch-dependent	1× (baseline)
Always-on FP32 verify	every step	✓ 100%	~2× MarginGate's overhead, varies by model (paper)
MarginGate (margin-gated)	~15–18% (paper)	✓ 100%	~2× lower than always-on (2.23× / 1.99×, paper)

The deeper lesson is that temperature 0 was never a determinism guarantee — it only fixes the sampling rule, not the arithmetic underneath it. MarginGate is cheap precisely because the failure is rare and predictable from the margin: you don't have to distrust every token, just the few that are genuinely on the fence.


FAQ

What is batch-invariant decoding?

Batch-invariant decoding means a request produces the exact same tokens regardless of how many other requests share its GPU batch. It is the property most people assume temperature-0 greedy decoding already has — and MarginGate is a method for restoring it cheaply when it has quietly broken.

Why does temperature-0 BF16 inference give different tokens in a batch?

Because the GPU sums each step's scores in a reduction order that depends on batch size, and BF16 addition isn't perfectly associative, the logits shift by a tiny amount. On a near-tie between the top two tokens (a low logit margin), that tiny shift can flip which token wins, so the same prompt can emit a different token alone versus inside a larger batch. The paper measures these flips at roughly 0.3–1.3% of steps on the models it tested.

How is MarginGate different from always-on FP32 verification?

Always-on verification re-checks every decode step in FP32; it restores determinism but carries roughly 2× the verification overhead MarginGate does in the paper. MarginGate verifies only the sparse low-margin steps — about 15–18% in the paper — and repairs a true flip by swapping the offending K/V cache column, reaching the same determinism the paper reports (100% sequence-level on Llama-3.1-8B and Qwen2.5-14B).

Originally posted on Learn AI Visually.

MCP 2026-07-28 RC: Stateless Transport

pueding — Sat, 06 Jun 2026 11:16:43 +0000

What: The MCP 2026-07-28 release candidate reworks transport so the tools/call request itself carries every field a server needs to handle it — protocol version, capabilities, auth context, routing keys. The headline framing is stateless transport: any server in a fleet can serve any request, with no per-session pin to a specific instance.

Why: The previous design forced sticky routing: a session was bound to a single server for its lifetime, so load balancers had to either pin connections by session ID or replicate session state out-of-band. Horizontal scaling, blue/green deploys, and crash recovery all suffered. The transport rework is the headline change of the 2026-07-28 RC, the candidate for the next stable MCP spec, and it touches every harness that talks to MCP.

vs prior: Earlier MCP transports treated the first request as a handshake that established server-local state; subsequent requests had to land on the same instance. The new design drops the in-process session: each request is self-contained, and when long-lived cross-request state is genuinely needed (subscriptions, sampling sessions, auth tokens) it lives in a shared store any server can read — not in one server's memory.

Think of it as

A self-addressed envelope at a post office with many windows.

                    tools/call  (one letter)
                              │
                ┌─────────────┴─────────────┐
                │                           │
        ┌───────▼───────┐           ┌───────▼───────┐
        │ sticky clerk  │           │ self-addressed│
        │ (one window)  │           │   envelope    │
        └───────┬───────┘           └───────┬───────┘
                │                           │
       state in HER drawer          address + tracking
       notebook, hers alone         ID on the envelope
                │                           │
                ▼                           ▼
       ✗ wait at her window         ✓ any open window
         again; if she's              serves it — open
         out, trail is lost           ten more, all equal

tools/call = handing a letter to a clerk
sticky routing = a clerk who only remembers your shipment from a notebook on her desk — come back to HER for status
self-addressed request = a letter with the destination, sender, and tracking ID printed on the envelope — any window reads it
shared session store (when needed) = the post office's central tracking database — any clerk queries it
horizontal scaling = open ten more windows in the same office; any one serves you

Quick glossary

MCP — The Model Context Protocol — an open protocol for connecting LLM hosts to external tool servers. The host runs the model and the agent's tool loop; servers expose tools, resources, and prompts over JSON-RPC.

SEP — Specification Enhancement Proposal — MCP's RFC-style change document. The 2026-07-28 RC bundles twenty-two scoped SEPs covering the transport rework, the new Extensions framework, MCP Apps, Tasks, and authorization fixes.

Sticky routing — A load-balancing pattern where a session ID is pinned to a single backend instance for its lifetime. The load balancer hashes the session ID and always routes to the same server. Works fine until that one server is overloaded, restarted, or replaced.

Self-contained request — A request shape where every field the server needs to handle it — protocol version, declared client capabilities, routing keys, auth context — travels with the request itself. The server does not assume any prior state from earlier messages on the same socket.

Shared session store — An out-of-process store (Redis-equivalent, a database, an object store) that any server in the fleet can read and write. Used for the small subset of MCP interactions that genuinely need cross-request state — long-lived subscriptions, sampling sessions, OAuth tokens. The transport itself is still stateless; the store is an implementation pattern for state that has to survive across requests.

Tasks extension (SEP-2663) — The async-handle model for long-running tools: a server returns a Task handle the client drives with tasks/get, tasks/update, tasks/cancel. It composes naturally with stateless transport because the task handle is the only cross-request key the client needs.

The news. On May 22, 2026, the MCP project landed PR #2750 — the blog announcement for the 2026-07-28 specification release candidate. The post leads with the stateless transport rework as the headline change, with a before/after HTTP example showing a self-contained tools/call request. Extensions, MCP Apps, and Tasks follow as the new capability story; the authorization changes are summarized by the failure modes they fix rather than enumerated SEP-by-SEP. All twenty-two scoped SEPs are linked from the announcement.

Picture the post office with many windows. The slow path is the sticky clerk: you hand your letter to clerk #3, and clerk #3 jots the details in a notebook only she keeps in her drawer. If you come back to check on your shipment, you have to wait at her window — none of the other clerks can tell you anything. If clerk #3 is busy, or goes on break, or quits, the trail of your shipment goes with her. The line at her window grows; the other windows are quiet. That is exactly what sticky-routed MCP looks like today. The agent's tool-use loop opens a session, the load balancer pins that session to one server, and every follow-up call has to land on that same server. One server gets the traffic; the others sit idle.

The fast path is the self-addressed envelope. You write the destination, the sender, and a tracking ID on the front of every letter, and the post office stops needing any one clerk to remember anything about your shipment. Any open window will do. That is the 2026-07-28 framing: each tools/call carries the protocol version it expects, the client capabilities it declared, any routing keys the server fleet needs, and the auth context — all in the request itself. The server reads the envelope and acts. No drawer notebook. No "come back to me." A second request half a second later can land on a different server entirely and produce identical behavior.
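To make the envelope idea concrete, here is a minimal sketch of a handler that reads everything it needs from the request. The JSON-RPC 2.0 envelope and the `tools/call` method with `name` and `arguments` are standard MCP. The `context` object and its field names (`protocol_version`, `capabilities`, `tenant`, `auth`) are placeholders invented for this example and should not be read as the RC's actual wire format.

```python
import json

def make_tools_call(request_id, tool, arguments, context):
    """Build a tools/call request that carries its own context.

    The "context" key and its fields are hypothetical placeholders.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments, "context": context},
    }

class Server:
    """A server that keeps no per-connection state between calls."""

    def __init__(self, name):
        self.name = name  # identifies the instance; never influences the result

    def handle(self, raw: str) -> dict:
        req = json.loads(raw)
        ctx = req["params"]["context"]        # everything comes from the request
        if ctx["protocol_version"] != "2026-07-28":
            return {"jsonrpc": "2.0", "id": req["id"],
                    "error": {"code": -32602, "message": "unsupported protocol version"}}
        args = req["params"]["arguments"]
        text = f"{ctx['tenant']}: {args['a'] + args['b']}"
        return {"jsonrpc": "2.0", "id": req["id"],
                "result": {"content": [{"type": "text", "text": text}]}}

raw = json.dumps(make_tools_call(
    1, "add", {"a": 2, "b": 3},
    {"protocol_version": "2026-07-28", "capabilities": {}, "tenant": "acme", "auth": "demo-token"},
))
# Two different instances, no shared memory, identical answers.
response_1 = Server("s1").handle(raw)
response_2 = Server("s2").handle(raw)
print(response_1 == response_2)  # True
```

Because `handle` touches nothing but its argument, a load balancer is free to send each call to whichever instance is available.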

There is a real subtlety worth saying out loud. A few MCP interactions genuinely do need cross-request memory — long-lived subscriptions, sampling sessions, OAuth tokens that have to outlive a single call. The new design does not pretend those don't exist. It externalizes them: the central tracking database the metaphor mentions is a shared store (a Redis-equivalent, a database, an object store) that any server queries when it needs to hydrate that bit of cross-request state. The transport is still stateless — the request itself is self-contained — and the implementation pattern of a shared store is what makes the small slice of stateful behavior work across a fleet. Mixing those two ideas up is easy and worth keeping straight: the protocol's change is at the transport layer; the shared store is one way servers can choose to persist what little state has to outlive a request.

The capacity argument writes itself. Consider 300 concurrent agent sessions, averaging ~2 calls per second each, hitting a fleet of 3 servers. Sticky routing assigns each session to one server at session open. Distribution is rarely uniform: three or four "power user" sessions can pin one server's load near saturation while the others sit far below it. Numerically: a typical sticky-imbalance run might leave S1 at ~92% utilization while S2 and S3 sit at ~8% and ~41% (illustrative). Under stateless transport with the same workload, the load balancer can spray every call independently. The same 600 calls/sec land on three servers at ~47% each (the mean of 92, 8, and 41; illustrative), so the hottest server drops from ~92% to ~47%: a ~2× improvement in usable fleet headroom before any vertical scaling.
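A toy model shows where the headroom comes from. It assumes a hypothetical per-server capacity of 400 calls/sec and made-up per-session rates chosen to total 600 calls/sec across 300 sessions. These inputs are ours, so the percentages it prints are not the utilization figures quoted above.

```python
def sticky_loads(sessions, n_servers):
    """Sum call rates per server when each session is pinned to one server."""
    loads = [0.0] * n_servers
    for server, calls_per_sec in sessions:
        loads[server] += calls_per_sec
    return loads

def sprayed_loads(sessions, n_servers):
    """Per-call balancing: every server receives an equal share of the total."""
    total = sum(rate for _, rate in sessions)
    return [total / n_servers] * n_servers

CAPACITY = 400.0  # hypothetical calls/sec one server can sustain

# (pinned server, calls/sec): 100 sessions per server, with very uneven rates.
sessions = [(0, 3.5)] * 100 + [(1, 0.5)] * 100 + [(2, 2.0)] * 100

sticky = [load / CAPACITY for load in sticky_loads(sessions, 3)]
sprayed = [load / CAPACITY for load in sprayed_loads(sessions, 3)]
headroom_gain = max(sticky) / max(sprayed)

print(sticky)         # [0.875, 0.125, 0.5]
print(sprayed)        # [0.5, 0.5, 0.5]
print(headroom_gain)  # 1.75
```

The fleet saturates when its hottest server does, so the ratio of peak utilizations is the useful number: it says how much more traffic the same three machines absorb once calls are no longer pinned.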

Where the rework earns its keep

Sticky routing's failure modes are well-known in the agent harness world: one hot server, blue/green deploys that have to drain sessions for minutes, crash recovery that can't transparently re-route. The 2026-07-28 RC closes all three at the transport level. Self-contained requests do not pin to anything, so a deploy that rolls a server out of rotation finishes in seconds — pending requests just hit the next server. A server that crashes drops its in-flight requests, and the client retries against the fleet — the next call lands somewhere else and proceeds. The only state that needs to survive the crash is whatever the workload chose to put in the shared store, which is the small minority of interactions.

The shape of what the RC actually changes is concrete. The table below contrasts the legacy and new transport.

Aspect	Sticky-routed transport (legacy)	Stateless transport (2026-07-28 RC)
Session lifetime	Bound to one server for the session's life	No per-session server binding
Routing key	Session ID hashed to a specific instance	None — any instance, any request
First request	Handshake that creates server-local state	Self-contained, no implicit setup
Cross-request state	In server memory	In a shared store, only when needed (subscriptions, sampling, auth)
Horizontal scale-out	Awkward — uneven load by session hash	Native — load balancer sprays calls
Server restart	Drops the session; client must rebuild	Drops in-flight; retry hits any other server

A related design point is worth knowing. The Tasks extension (SEP-2663) ships a complementary idea one layer up: it gives the client a long-lived taskId it can poll across reconnects. SEP-2663 needed the transport rework to be fully useful — a taskId polled across reconnects only works if the next tasks/get doesn't have to land on the same server that issued the handle. Stateless transport is what makes that work: the taskId is the only cross-request key the client carries, the server fleet hydrates the task's state from the shared store, and the polling call goes to whichever server is least busy.
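The interplay can be sketched with two server objects that share one store. Everything here is an assumption made for illustration: `SharedStore` is an in-memory stand-in for something like Redis, the class and method names are invented, and the status strings are illustrative rather than quoted from SEP-2663.

```python
import uuid

class SharedStore:
    """In-memory stand-in for an out-of-process store the whole fleet can reach."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

class FleetServer:
    """One instance in the fleet; it holds no task state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def start_task(self, tool, arguments):
        task_id = str(uuid.uuid4())
        self.store.put(task_id, {"status": "working", "tool": tool,
                                 "arguments": arguments, "result": None})
        return task_id

    def complete_task(self, task_id, result):
        task = self.store.get(task_id)
        self.store.put(task_id, {**task, "status": "completed", "result": result})

    def get_task(self, task_id):
        """Answer a tasks/get-style poll purely from the shared store."""
        return self.store.get(task_id)

store = SharedStore()
server_a, server_b = FleetServer("a", store), FleetServer("b", store)

task_id = server_a.start_task("build_report", {"quarter": "Q2"})
first_poll = server_b.get_task(task_id)["status"]    # a different server answers
server_a.complete_task(task_id, "report.pdf")
second_poll = server_b.get_task(task_id)
print(first_poll, second_poll["status"], second_poll["result"])
```

The client carries only `task_id`. Which server created the task and which one answers the poll no longer matters.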

The boundary of what the RC changes is the transport itself, not the protocol semantics. Tools still return tool results; resources still return resource contents; the wire format of a method call is the same JSON-RPC envelope. What changes is what a server is allowed to assume: nothing about prior calls on the same connection. That single discipline is enough to make every harness operator's life easier and to make the parallel-tool-call patterns the Cost & Latency module recommends actually achievable in a fleet.

FAQ

What does stateless transport mean in the MCP 2026-07-28 RC?

It means the tools/call request itself carries every field a server needs to handle it — protocol version, declared client capabilities, routing keys, auth context. The server is not allowed to assume any state from prior calls on the same connection. A consequence is that any server in a fleet can serve any request, so no sticky session binding is needed at the load balancer.

What replaces sticky routing for state that genuinely has to live across requests?

A shared store. The small subset of MCP interactions that need cross-request memory — long-lived subscriptions, sampling sessions, OAuth tokens — moves out of any one server's process and into a Redis-equivalent (or database, or object store) the entire fleet reads. The transport itself is still stateless; the shared store is an implementation pattern for the slice of state that must survive across requests.

How does the transport rework relate to the Tasks extension (SEP-2663)?

They compose. SEP-2663 lets a server return a long-lived taskId the client polls later. Stateless transport is what makes that poll robust across a fleet: the next tasks/get does not need to land on the same server that issued the handle. Together they let an agent harness survive server restarts, blue/green deploys, and load-balancer reshuffles without any session affinity.

What needs to change in existing MCP server code to support stateless transport?

Concretely: stop reading state from the connection. Any field the server used to learn once at session-establish and remember for the lifetime of the connection — declared client capabilities, protocol version, auth identity, routing tenant — must now be read from each tools/call request instead. Servers that already drove every decision off the incoming request payload need minimal changes. Servers that built up per-connection caches (negotiated capabilities, OAuth introspection results, tenant routing decisions) need to externalize those caches into a shared store the whole fleet reads, or push them to the client to re-send. Most production MCP servers will land in the middle: a few small migrations rather than a rewrite.

How does stateless transport affect MCP authentication and authorization?

Auth context becomes a per-request field rather than a per-session attribute. The 2026-07-28 RC expects every tools/call to carry whatever proof the server needs — a bearer token, a signed capability, a tenant identifier — so any server in the fleet can verify the call without consulting prior connection state. The net effect on a production stack is that a load-balancer reshuffle, a server restart, or a blue/green deploy mid-flight no longer drops the agent's authorization, because no server held it in process memory in the first place. Token introspection caches still live somewhere, but in a shared store the entire fleet shares (Redis-equivalent), not in any single server's per-connection state.
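One way to picture per-request proof is a signed token that any server can check on its own. The sketch below uses an HMAC over a tenant and an expiry time, verified with a key the fleet shares. This scheme is a stand-in chosen for brevity: MCP authorization is built on OAuth, and the token layout, key, and function names here are hypothetical.

```python
import hashlib
import hmac

FLEET_KEY = b"demo-only-shared-secret"  # hypothetical; use managed keys in practice

def _mac(tenant: str, expires_at: str) -> str:
    payload = f"{tenant}|{expires_at}".encode()
    return hmac.new(FLEET_KEY, payload, hashlib.sha256).hexdigest()

def sign(tenant: str, expires_at: int) -> str:
    """Issue a token of the form tenant|expiry|mac."""
    return f"{tenant}|{expires_at}|{_mac(tenant, str(expires_at))}"

def verify(token: str, now: int):
    """Return the tenant if the token is authentic and unexpired, else None.

    Needs only the token and the shared key: no session lookup, no memory of
    earlier calls, so any instance in the fleet gives the same verdict.
    """
    try:
        tenant, expires_at, mac = token.rsplit("|", 2)
        expiry = int(expires_at)
    except ValueError:
        return None
    expected = _mac(tenant, expires_at)
    if not hmac.compare_digest(mac.encode(), expected.encode()):
        return None
    return tenant if now < expiry else None

token = sign("acme", expires_at=2_000)
print(verify(token, now=1_000))                            # acme
print(verify(token, now=3_000))                            # None (expired)
print(verify(token.replace("acme", "evil"), now=1_000))    # None (tampered)
```

A restart or reshuffle cannot drop the caller's authorization here, because no server was ever holding it.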


Token Budgets Paper: Affine-Typed Budget Ownership

pueding — Fri, 05 Jun 2026 11:16:05 +0000

What: The Token Budgets paper catalogs 63 real LLM-agent cost-overrun incidents and ships a Rust crate that models a token/cost budget as an affine-typed (use-at-most-once) resource the compiler tracks.

Why: Cost is a production failure mode, and the paper finds it's multi-agent delegation — not single agents — that drives the overruns: fan out work to parallel sub-agents and each one quietly reserves budget against a cap nobody is decrementing.

vs prior: Versus a runtime budget guard — an assert that fires at spend time, after the tokens are already committed — affine typing makes an overrun a compile-time error, so the unsafe code path can't ship in the first place.

Think of it as

One prepaid gift card a group splits at dinner.

                  ONE $1,000 GIFT CARD
                          │
          ┌───────────────┴───────────────┐
          │                               │
  ┌───────▼───────┐               ┌───────▼───────┐
  │  PHOTOCOPY IT │               │   SPLIT IT    │
  │ (static copy) │               │ (affine move) │
  └───────┬───────┘               └───────┬───────┘
          │                               │
 4 copies x $350 each         $300+$220+$260+$220
 nobody debits the card       money moves out, no copy
          │                               │
          ▼                               ▼
   ✗ bill = $1,400               ✓ total = $1,000
     over a $1,000 cap             bounded to the cap

token budget = the card's balance ($1,000)
sub-agent = a friend who wants to spend
static reservation = everyone photocopies the card and assumes the full balance
overshoot = four copies each spend $350 — the bill hits $1,400 on a $1,000 card
affine ownership = split the card into prepaid sub-cards — money moves out, can't be photocopied

Quick glossary

Token / cost budget — A hard cap on how many tokens (and therefore dollars) one agent task is allowed to spend. Where those tokens go is the first thing a production agent has to account for.

Affine type — A type that may be used at most once. The compiler tracks the value's ownership, so you can move or split it but never copy it — exactly the property a budget needs.

Delegation fan-out — When an orchestrator hands a task to several sub-agents running in parallel. Each child needs some budget, and the question is who keeps the shared total honest.

Static vs adaptive reservation — Static reservation grabs a fixed slice up front and over-provisions 4–6×; adaptive reservation re-estimates per call and over-provisions 2.11× — fewer wasted tokens, but still a runtime accounting trick.

Compile-time vs runtime check — A runtime check tests the budget while the agent runs (too late to un-spend); a compile-time check rejects the unsafe program before it ever runs. Affine typing moves the cap into the second category.

Cohen's kappa — An inter-rater agreement score (1.0 = perfect). The paper's 8-category failure taxonomy reaches 0.837, i.e. two independent reviewers classified the incidents almost identically.
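For readers who have not met Cohen's kappa before: it compares the observed agreement between two raters with the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label lists. The function name and the tiny example are ours and have nothing to do with the paper's data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(rater_a)
    # Observed agreement: share of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

a = ["retry", "retry", "fanout", "fanout"]
b = ["retry", "retry", "fanout", "retry"]
kappa = cohens_kappa(a, b)
print(kappa)  # 0.5
```

Here the raters agree on 3 of 4 items (p_o = 0.75) but would agree half the time by chance (p_e = 0.5), so kappa is 0.5. A score of 0.837 sits much closer to perfect agreement.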

The news. On June 2, 2026, the Token Budgets paper landed: an empirical catalog of 63 production cost-overrun incidents in LLM-agent systems, pulled from a review of 21 orchestration frameworks spanning 2023–2026 and clustered into an 8-category failure taxonomy (inter-rater Cohen's kappa 0.837). As a mitigation, the authors ship a 1,180-line Rust crate that uses affine-type ownership to turn budget violations into compile-time errors. In controlled tests, single-agent runs never overshot (0/30) while multi-agent asyncio delegation overshot every time (30/30); the mitigated runs then logged 0 cap violations across 160 live-API tests. Read the paper →

Picture the group dinner. There's one prepaid gift card with $1,000 on it, and four friends who all want to order. The cheap, lazy move is for everyone to photocopy the card and assume they each have the full balance — four copies, four people each cheerfully spending $350, and a $1,400 bill arrives against a card that only ever held $1,000. The card was never debited as people spent, so nothing stopped the overshoot until the bill came. Affine-typed budget ownership is the opposite rule: there is exactly one card, and the only legal operation is to split it into prepaid sub-cards — the money physically moves out of the original, and a photocopy simply isn't allowed.

In an agent system the "photocopy" bug is a delegation fan-out: an orchestrator spawns parallel sub-agents, and each one reserves a chunk of the token budget against a cap that no single owner is decrementing. The paper's headline number is that this pattern overshot 30 out of 30 runs, while a single agent, which spends against one running total, overshot 0 of 30. The fix is to make the budget an affine value: the Rust compiler tracks it as use-at-most-once, so a code path where two sub-agents could both hold the same budget fails to type-check. The cap is enforced by construction rather than by an assert that fires after the tokens are already gone. It is the same runtime-to-compile-time shift that separates a retry loop that can quietly re-bill you from one that cannot.
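The photocopy bug is easy to reproduce with asyncio. In this sketch each sub-agent checks the remaining budget, then awaits (standing in for an LLM call), then records its spend. The names and numbers are ours, not the paper's test harness. Because all four checks run before any spend is recorded, every one of them sees the full balance.

```python
import asyncio

CAP = 1_000

class SharedLedger:
    """A cap plus a running total that nobody updates until after the call."""

    def __init__(self, cap):
        self.cap = cap
        self.spent = 0

async def sub_agent(ledger, request):
    # Check-then-spend: the check reads a snapshot that goes stale at the await.
    if ledger.cap - ledger.spent >= request:
        await asyncio.sleep(0)       # stands in for the LLM call; yields to siblings
        ledger.spent += request
        return request
    return 0

async def fan_out():
    ledger = SharedLedger(CAP)
    await asyncio.gather(*(sub_agent(ledger, 350) for _ in range(4)))
    return ledger.spent

total = asyncio.run(fan_out())
print(total)  # 1400
```

Every individual check passed, and the total still lands 400 over the cap. A runtime guard placed after the await would only report the damage.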

Where the budget actually goes

A back-of-envelope walk-through (illustrative cap and slice sizes; the overshoot and over-reservation counts are the paper's). Say the shared cap is 1,000 tokens and the orchestrator fans out to four sub-agents. Under static reservation each child grabs a fixed 350, and because the reservations are effectively copies, the total claimed is 4 × 350 = 1,400 — a 400-token (40%) overshoot that nothing rejects until the spend lands. Make the budget affine and the same 1,000 is split into owned slices — say 300 + 220 + 260 + 220 = 1,000 — where the fourth claim can only take what the first three left behind. The sum is bounded to the cap by construction, which is the property the paper's Rust crate enforces: across 160 live-API tests it logged 0 cap violations, where unbounded multi-agent delegation had overshot all 30 runs. Static reservation's habit of grabbing 4–6× the budget it needs (adaptive trims that to 2.11×) is the same waste, viewed from the other side.
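The split-instead-of-copy rule can be approximated in Python, with one important caveat: this is a runtime emulation. Python cannot reject a bad program at compile time and cannot stop two variables from aliasing one object, which is precisely what the Rust crate's affine types add. The `Budget` class below is a hypothetical illustration, not the crate's API.

```python
class BudgetError(Exception):
    """Raised when a split exceeds the balance or a moved budget is reused."""

class Budget:
    """A token budget whose only operations debit it immediately."""

    def __init__(self, tokens):
        self._tokens = tokens
        self._moved = False

    @property
    def remaining(self):
        return self._tokens

    def split(self, amount):
        """Carve an owned slice out of this budget; the parent shrinks at once."""
        self._check_live()
        if amount > self._tokens:
            raise BudgetError(f"requested {amount}, only {self._tokens} left")
        self._tokens -= amount
        return Budget(amount)

    def move(self):
        """Hand the whole balance to a new owner; this handle becomes unusable."""
        self._check_live()
        moved, self._tokens, self._moved = Budget(self._tokens), 0, True
        return moved

    def _check_live(self):
        if self._moved:
            raise BudgetError("budget was already moved")

cap = Budget(1_000)
slices = [cap.split(n) for n in (300, 220, 260, 220)]
print(sum(s.remaining for s in slices), cap.remaining)  # 1000 0

try:
    cap.split(1)                 # a fifth claim finds nothing left
except BudgetError as err:
    print("rejected:", err)
```

The debit happens synchronously inside `split`, before any sub-agent runs, so the slices can never sum past the cap. What the type system contributes on top is the guarantee that no code path skips this discipline.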

Approach	When the cap is checked	Multi-agent overshoot	Over-reservation
Runtime budget guard	at spend time — after tokens commit	possible (the default failure)	—
Static reservation	up front, no shared cap	30/30 runs (Token Budgets paper)	~4–6× (paper)
Adaptive reservation	re-estimated per call	not reported (paper)	~2.11× (paper)
Affine-typed ownership	compile time — won't type-check	0 violations / 160 tests (paper)	bounded to the cap

The catch is that this only buys you safety where you can express ownership in the type system. Rust's ownership model provides that natively; a Python orchestrator built on asyncio.gather does not, which is exactly where the paper's 30/30 overshoots came from. But the lesson generalizes past the language: in a multi-agent team the budget is a shared resource, and who is allowed to hold it, and whether they can copy it, is a design decision, not something to discover when the bill arrives.


Related explainers

StreamMA — Streaming inter-agent reasoning — a different multi-agent cost: wall-clock latency from serial handoffs, cut by pipelining rather than by bounding tokens
Maestro — RL orchestrator over frozen experts — the orchestrator-over-sub-agents topology where this fan-out budget problem lives
EFC — feedback-quality scaling law — what actually predicts agent-harness success, the other half of "spend the budget well"

FAQ

What is affine-typed budget ownership?

It models an agent's token or cost budget as an affine-typed value — one the compiler allows you to use at most once. You can split the budget into smaller owned slices or move it to a sub-agent, but you can't copy it, so two parts of the system can never both spend against the same cap. The Token Budgets paper implements this in a Rust crate and reports 0 cap violations across 160 live-API tests.

Why do multi-agent systems overshoot their token budget?

Because delegation fans the work out to parallel sub-agents that each reserve budget against a cap no single owner is decrementing. The reservations behave like copies, so their sum can exceed the real limit. In the paper's controlled tests, multi-agent asyncio delegation overshot 30 of 30 runs while a single agent — spending against one running total — overshot 0 of 30.

How is a compile-time budget check different from a runtime guard?

A runtime guard (an assert or limiter) checks the budget while the agent runs, which is too late to un-spend tokens already committed. A compile-time check rejects the unsafe program before it runs: with affine typing, a code path where two sub-agents could hold the same budget simply fails to type-check, so the cap is enforced by construction rather than by hoping the guard fires in time.
