DEV Community: Amit

Reset Windows Are Product Design

Amit — Sat, 06 Jun 2026 22:21:20 +0000

TL;DR

AI subscriptions do not primarily differ by model quality anymore. They differ by reset window design.
Claude uses a five-hour reset after you hit the included limit. Perplexity restores each Pro Search credit exactly 24 hours after you use it. Devin mixes daily and weekly quota. Copilot and Cursor are closer to monthly credit accounting.
These windows shape user behavior. Burst windows reward sprints. Rolling restore windows reward daily pacing. Weekly quotas reward batching. Monthly credit buckets reward explicit budget management.
The useful comparison is not unlimited versus limited. It is whether the reset logic matches the actual rhythm of the work.
The market still hides this layer too often. Reset mechanics should be front-page product information, not something users reverse-engineer after they stall out mid-task.

The most important design choice in an AI subscription is usually not the model. It is the reset window.

That sounds like billing trivia until you use these products heavily. Then it becomes obvious that a five-hour reset, a rolling 24-hour restore, a daily cap, and a monthly credit bucket do not feel remotely the same. They create different working habits, different failure modes, and different emotional contracts between the user and the vendor.

This is why the current AI subscription market is so confusing. The checkout pages look similar. The behavior design underneath is not.

The Real UX Layer

Here is the layer that matters.

Product	Publicly documented reset design	What it teaches the user
Claude	Included usage resets every five hours once you hit the limit	Work in hard sprints, then stop or pay
Perplexity Pro	Each Pro Search credit returns 24 hours after use	Spend steadily; each search has its own timer
Devin Pro / Max	Pro: daily and weekly quota. Max: weekly quota, no daily cap	Scope jobs and batch autonomous work deliberately
GitHub Copilot	Monthly AI credit allowance	Treat agents as metered features inside a seat
Cursor	Monthly billing cycle with included usage and paid overage	Treat the subscription like a budget starter pack
Gemini	Mostly daily feature limits that can change	Use broadly, but within daily feature lanes
ChatGPT + Codex	Reset exists, but no single universal public window across plans	Learn the system by usage, not by policy page

That table is product design, not accounting.

The reset logic tells users how much the vendor wants usage to spike, how much it wants usage to smooth out, and how much operational unpredictability it is willing to surface.

Burst Windows Produce Sprint Behavior

Claude's paid-plan usage docs are unusually explicit. When you hit the included limit, the plan resets every five hours. If you enable usage credits, you can keep going at standard API rates instead of waiting.

That produces a very specific user rhythm.

You do not casually idle in Claude the way you idle in a chat app. The optimal behavior is to line up work, enter with a bounded task, push hard, and exit when the work is done. The window trains you toward burst discipline because stray conversation and long transcript drift are no longer harmless. They cannibalize the same five-hour allowance you need for the next real task.

This is one reason Anthropic's own guidance emphasizes shorter conversations, lighter tool usage, and fresh threads for new topics. The reset window and the usage advice are the same design choice seen from two angles.

Five hours is not just a limit. It is a behavior shaper.
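A minimal sketch of that burst model can make the rhythm concrete. It assumes the window opens at the first spend and the whole pool resets five hours later. The class name, the abstract capacity units, and the window-start rule are assumptions made for illustration, not Anthropic's documented accounting.

```python
from datetime import datetime, timedelta
from typing import Optional


class FixedWindowBudget:
    """Burst-style allowance: one shared pool that resets when the window ends.

    Assumption (hypothetical): the window opens at the first spend.
    """

    def __init__(self, capacity: float, window: timedelta = timedelta(hours=5)):
        self.capacity = capacity
        self.window = window
        self.window_start: Optional[datetime] = None
        self.used = 0.0

    def _roll(self, now: datetime) -> None:
        # Once the window has elapsed, the whole pool comes back at once.
        if self.window_start is not None and now - self.window_start >= self.window:
            self.window_start = None
            self.used = 0.0

    def spend(self, now: datetime, amount: float) -> bool:
        self._roll(now)
        if self.used + amount > self.capacity:
            return False  # wall hit: wait for the reset or pay for overflow
        if self.window_start is None:
            self.window_start = now
        self.used += amount
        return True

    def resets_at(self, now: datetime) -> Optional[datetime]:
        self._roll(now)
        return None if self.window_start is None else self.window_start + self.window


budget = FixedWindowBudget(capacity=100)
t0 = datetime(2026, 6, 6, 9, 0)
budget.spend(t0, 90)                        # True
budget.spend(t0 + timedelta(hours=1), 20)   # False: would exceed the pool
budget.spend(t0 + timedelta(hours=5), 20)   # True: fresh window
```

Under this model every stray message draws from the same pool as the next real task, which is why the sprint habit follows from the window shape.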

Rolling Restore Windows Produce Pacing

Perplexity Pro uses a different philosophy. Users get at least 300 Pro Searches per day, and each credit is restored exactly 24 hours after it is used.

That is cleaner than the classic midnight reset model because it maps the budget to actual activity instead of the calendar. If I spend 40 searches at 2:15 PM, those 40 searches come back at 2:15 PM tomorrow. The system is local to my behavior.
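The rolling restore described above can be sketched in a few lines. This is an illustrative model, not Perplexity's implementation: the class name and method names are hypothetical, and it assumes spends are recorded in time order.

```python
from collections import deque
from datetime import datetime, timedelta


class RollingRestoreBudget:
    """Each spent credit comes back a fixed interval after it was used."""

    def __init__(self, capacity: int, restore_after: timedelta = timedelta(hours=24)):
        self.capacity = capacity
        self.restore_after = restore_after
        self._spent_at = deque()  # timestamps of outstanding credits, oldest first

    def available(self, now: datetime) -> int:
        # Drop every spend whose individual timer has expired.
        while self._spent_at and now - self._spent_at[0] >= self.restore_after:
            self._spent_at.popleft()
        return self.capacity - len(self._spent_at)

    def spend(self, now: datetime, count: int = 1) -> bool:
        if count > self.available(now):
            return False
        self._spent_at.extend([now] * count)
        return True


budget = RollingRestoreBudget(capacity=300)
t0 = datetime(2026, 6, 6, 14, 15)
budget.spend(t0, 40)
budget.available(t0 + timedelta(hours=23))  # 260: nothing has restored yet
budget.available(t0 + timedelta(hours=24))  # 300: the 2:15 PM spend is back
```

There is no midnight in this model. The only clock that matters is the one attached to each spend.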

That design quietly encourages pacing over bingeing. It discourages the feeling that you should burn everything before midnight because tomorrow is a fresh bucket anyway. It also makes the product easier to reason about when the unit of work is discrete, as search requests generally are.

This is why research subscriptions often feel calmer than coding subscriptions. Search is naturally chunked. Autonomous coding work is not.

Daily and Weekly Quotas Produce Portfolio Thinking

Devin's self-serve billing docs are the clearest example of an agent-native reset model. Pro combines daily and weekly quota. Max removes the daily cap and keeps the weekly quota. The usage docs add two important details: idle sleep does not materially consume usage, and there is no limit on simultaneous sessions.

Those policies change the operational mental model completely.

You stop thinking in terms of one conversation. You start thinking like a portfolio manager. Which jobs deserve to run today? Which ones can wait until tomorrow's daily refresh? Which ones are worth spending from the weekly pool because they may unlock downstream work? Which tasks should be split into smaller parallel sessions because that improves throughput without wasting idle time?

That is not a chat product mentality. That is queue design.

The weekly layer matters because autonomous agents often have spiky value. Some days you want ten scoped runs. Some days you want zero. A weekly envelope absorbs that variation better than a strict daily wall, but only if the daily cap does not pinch too early. Devin Max is effectively selling that flexibility.
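A rough sketch of how a daily cap and a weekly pool interact, under stated assumptions: the units are abstract, the limits are made-up numbers, and the field names are hypothetical. The source does not describe Devin's actual quota units, so none of this reflects its real accounting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaState:
    weekly_limit: float
    daily_limit: Optional[float]  # None models a plan with no daily cap
    used_today: float = 0.0
    used_this_week: float = 0.0

    def headroom(self) -> float:
        """Usage still available right now under both limits."""
        weekly_left = self.weekly_limit - self.used_this_week
        if self.daily_limit is None:
            return max(weekly_left, 0.0)
        daily_left = self.daily_limit - self.used_today
        # The tighter of the two limits decides what can run today.
        return max(min(daily_left, weekly_left), 0.0)

    def can_run(self, estimated_cost: float) -> bool:
        return estimated_cost <= self.headroom()


# Hypothetical numbers: same weekly pool, with and without a daily wall.
with_daily_cap = QuotaState(weekly_limit=100, daily_limit=20, used_today=18, used_this_week=40)
weekly_only = QuotaState(weekly_limit=100, daily_limit=None, used_this_week=40)

with_daily_cap.can_run(10)  # False: the daily cap pinches first
weekly_only.can_run(10)     # True: only the weekly envelope applies
```

The two plans hold the same weekly budget in this example. The difference is purely which limit binds on a spiky day.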

Monthly Buckets Produce Budget Awareness

Cursor's pricing docs say Pro includes $20 of API agent usage plus bonus usage, and its billing docs tie the reset to the billing cycle. GitHub Copilot's pricing and billing docs do the same with AI credits layered on top of unlimited completions.

Monthly buckets produce a different behavior again.

They do not tell you when to work during the day. They tell you how honest to be about your budget. This is where Cursor's documentation is strong: it explicitly says daily agent users often land above the sticker price. The user is invited to think in monthly spend, not monthly entitlement.

Monthly resets fit better when the work itself is already budgeted monthly. Teams buy seats. Individuals expense subscriptions. Managers reconcile spend at the end of the month. The product behavior lines up with the purchasing system.

The downside is that the system can hide waste longer. A bad five-hour window hurts immediately. A sloppy monthly bucket can drift for three weeks before anyone notices the overage logic.
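One way to catch that drift early is a simple linear projection of end-of-cycle spend. This is a sketch under stated assumptions: the function name is hypothetical, the 30-day cycle is a default for illustration, and real usage is rarely linear.

```python
def projected_overage(included: float, spent_so_far: float,
                      day_of_cycle: int, cycle_days: int = 30) -> float:
    """Linear projection of end-of-cycle overage from the burn rate so far."""
    if not 1 <= day_of_cycle <= cycle_days:
        raise ValueError("day_of_cycle must fall inside the billing cycle")
    projected_total = spent_so_far * cycle_days / day_of_cycle
    return max(projected_total - included, 0.0)


# Hypothetical: a $20 included bucket with $14 used by day 10.
projected_overage(included=20.0, spent_so_far=14.0, day_of_cycle=10)  # 22.0
```

In that example the bucket looks healthy on day 10, with $6 still unspent, yet the same pace implies roughly $22 of overage by the end of the cycle.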

Opaque Windows Create Learned Helplessness

This is where ChatGPT + Codex becomes interesting.

OpenAI is becoming more transparent under the hood. Codex moved to token-based credit pricing on April 2, 2026. Flexible credits for Plus and Pro make the overflow path explicit. But the reset layer for the bundled consumer experience is still harder to reason about than the reset layers of Claude, Perplexity, Devin, or Cursor.

That opacity matters more than people admit. When users cannot predict when capacity will come back, they start self-throttling in ways the product did not intend. They save prompts. They avoid ambitious tasks. Or they push until failure and then feel arbitrarily punished. None of that is good design.

This is the hidden cost of soft boundaries. They feel friendly at signup. They become fuzzy and stressful under load.

Why This Matters More For Agents

Reset windows mattered less when these tools were mostly chat.

They matter much more when the product is expected to run autonomous loops, use tools, inspect repos, or produce long artifacts. How Do AI Agents Spend Your Money? found agentic coding tasks can consume roughly 1000 times more tokens than simpler coding interactions, with up to 30 times variance on the same task. That means the reset system is no longer a background billing detail. It directly governs whether users trust the product enough to hand over real work.

The harder the product leans into autonomous behavior, the more the reset policy becomes part of the product surface.

This is why AI subscriptions are drifting away from classic SaaS psychology. A seat that can burn unpredictable compute in bursts does not behave like email, storage, or project management software. It behaves more like a gateway to a volatile infrastructure budget with a UX wrapper on top.

So What

The best way to compare these products is not by asking which one has the highest headline limit.

The better question is: which reset window matches the natural rhythm of the work?

If the work comes in intense bursts, a five-hour system like Claude makes sense. If the work is steady daily research, Perplexity's rolling restore model is better. If the work is a queue of scoped autonomous jobs, weekly quota plus idle sleep semantics, like Devin's, is much closer to the actual workload. If the work is fundamentally monthly budget management, Cursor and Copilot are more honest contracts.

The missing piece is standardization. Vendors still talk about models, features, and price more clearly than they talk about reset mechanics, even though reset mechanics do more to determine how the subscription actually feels.

The open thread I am still stuck on: do these products eventually standardize around explicit reset language the way cloud products standardized around pricing primitives, or do vendors keep treating reset logic as a soft, semi-hidden UX lever because ambiguity sells better than clarity?

Part 2 of the Agent Economics series.
← Part 1: AI Subscriptions Are Secretly Usage Models · Part 3: Autonomous Agents Break Flat-Rate Pricing →

How To Optimize Agent Subscriptions Without Getting Tricked

Amit — Sat, 06 Jun 2026 22:20:44 +0000

TL;DR

The highest-return optimization is not writing better prompts. It is running better workloads.
Across the major plans, the same rules keep showing up in different language: shorter sessions, smaller context, fewer unnecessary tools, bounded retries, and more scoped tasks. Anthropic says this explicitly. Devin says it explicitly. Cursor and Copilot make the cost structure visible enough that the same lesson is hard to miss.
The research supports the operational view. How Do AI Agents Spend Your Money? found coding-agent tasks can be around 1000x more expensive than simpler coding interactions, with up to 30x cost variance on the same task. Evaluating AGENTS.md found that more repo context can increase cost and reduce success.
If you optimize for autonomous runs, think like an operator: queue design, context hygiene, retry policy, reset timing, and memory strategy.
The trap is mistaking included usage for free slack. It is not slack. It is a budget with nicer branding.

Most people try to optimize AI subscriptions at the prompt layer first.

That is usually the wrong layer.

Once you are using these products for real coding or research work, the biggest wins come from workload design: what you ask the system to do, how much context you give it, how often you force it to restart, and when you choose to spend from a reset window versus waiting for the next one.

The good news is that the optimization patterns are consistent across products. The bad news is that most users still behave as if the subscription is infinite until a warning banner appears.

Treat The Plan Like A Budget, Not A Perk

The first mistake is psychological.

Subscriptions train people to think in entitlement: I paid for the plan, so I should use it freely. That instinct is mostly harmless in old SaaS categories. It is destructive here.

Claude's paid-plan docs tell users to plan intensive work around five-hour windows. OpenAI's Codex docs expose token-based credit rates under the surface. Cursor's docs say daily agent users often land well above the sticker price. Copilot's docs split unlimited completions from credit-burning agent actions.

The products are already telling you what they are: budgets.

If you keep thinking of them as perks, you will use them lazily and then be surprised when the overage model or reset wall shows up.

Scope Is The Highest-Return Control

The single best way to get more useful autonomous work out of a plan is to narrow the task boundary.

One agent run should do one bounded thing: review this diff, refactor this file set, investigate this test failure, summarize this document pack, draft this migration plan. The moment the task becomes "and then also check these five other things while you're there," the run becomes harder to reason about and easier to overpay for.

Devin's usage guidance is unusually explicit on this point. It recommends splitting large projects into smaller sessions and notes there is no limit on simultaneous sessions. That is not just a convenience feature. It is a cost-control pattern.

Multiple small runs beat one sprawling run because they give you clearer stopping points, smaller context footprints, and fewer useless retries after the original task is already done.

Context Hygiene Beats Prompt Cleverness

The second major lever is context size.

This is where users consistently sabotage themselves. They carry giant transcript history forward. They stuff in entire repos because "more context should help." They keep irrelevant tool definitions active. Then they blame the model when the run gets slower, worse, and more expensive.

The evidence here is now pretty direct. Anthropic's usage-limit best practices recommend shorter conversations, fewer tools, and more careful project usage. Evaluating AGENTS.md found repository-level context files often increased inference cost by more than 20% and reduced success in the tested settings.

More context is not free, and it is not neutral. It changes the economics and often the quality.

The operational rule is simple:

Reuse stable project memory where the product supports it.
Retrieve only the files needed for the current task.
Start a new thread when the job changes.
Delete stale instructions instead of layering new ones on top.

That is not glamorous advice. It works.

Bound Retries Before They Become The Whole Bill

Autonomous systems fail in loops.

A tool call fails. The agent retries. The retry partially works but leaves bad state. The agent reasons about the bad state, retries the wrong thing, and now the entire run is spending tokens on recovery rather than progress.

This is one reason How Do AI Agents Spend Your Money? matters so much. The paper does not just show that agentic coding can be drastically more expensive than simpler interactions. It shows large variance on the same task. That variance is where retries, poor branching, and transcript bloat tend to hide.

The operator move is to set hard personal rules:

If the agent misses the frame twice, restart with a narrower task.
If the tool loop is thrashing, stop the run and inspect state manually.
If the job requires a new subproblem, open a new session instead of letting the current one absorb it.

This is the same logic distributed systems operators use. A retry policy without a termination policy is not resilience. It is budget leakage.
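A retry policy with an explicit termination policy might look like the sketch below. Everything here is hypothetical scaffolding: `attempt` stands in for whatever launches one agent run and is assumed to return a result (or `None` on failure) plus the cost of that attempt. No vendor exposes this exact interface.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


class RunAborted(Exception):
    """Raised when the run hits its attempt or cost ceiling."""


def run_bounded(attempt: Callable[[], Tuple[Optional[T], float]],
                max_attempts: int = 2,
                max_cost: float = 5.0) -> Tuple[T, float]:
    """Retry until success, but stop on whichever ceiling is reached first."""
    spent = 0.0
    for n in range(1, max_attempts + 1):
        result, cost = attempt()
        spent += cost
        if result is not None:
            return result, spent
        if spent >= max_cost:
            # Termination policy: recovery is now costing more than it is worth.
            raise RunAborted(f"cost cap reached after {n} attempts ({spent:.2f})")
    raise RunAborted(f"no success after {max_attempts} attempts ({spent:.2f})")
```

The default of two attempts mirrors the "misses the frame twice" rule above. When `RunAborted` fires, the intended next step is a human one: inspect state, narrow the task, and start a fresh session.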

Use Reset Windows Deliberately

The next optimization layer is timing.

If the product uses burst resets, like Claude's five-hour window, do expensive work inside the window and administrative cleanup outside it. Do not waste high-value capacity on vague exploratory chatting if you know you have a real code task coming.

If the product uses rolling restore, like Perplexity's 24-hour return of each Pro Search credit, steady pacing is better than bingeing. If the product uses monthly buckets, like Cursor or Copilot, track what kinds of runs are actually driving spend so you can stop pretending that all usage is equally valuable.

Reset design is not just vendor policy. It is scheduling information.

Match The Product To The Workload

The best plan is often the one whose economic shape fits the job before any optimization begins.

Claude is strong for bursty, hands-on coding if you are disciplined about fresh sessions and context control.
Devin is stronger for queued autonomous jobs because the product contract already assumes scoped sessions, sleeping agents, and parallel work.
Cursor is fine if you want a transparent hybrid and accept that heavy use becomes metered quickly.
Copilot is strongest when completions still carry a large share of your workflow, because those remain unlimited while premium agent behaviors burn credits.
Perplexity and Gemini are better treated as research and general AI work subscriptions than as primary autonomous coding engines.

Optimization cannot fully rescue a plan that is economically mismatched to the workload.

Memory Strategy Matters More Than People Think

The long-run optimization is not prompt craft. It is memory architecture.

Beyond the Context Window argues that persistent memory systems can outperform naive long-context replay on both cost and performance. That lines up with what the best commercial tools are converging toward: retrieval, scoped project knowledge, cache reuse, and lighter working context.

This also connects directly to prompt caching, which already changed the economics of repeated coding sessions. When a workflow keeps resending the same stable tokens, caching and reusable memory are pure advantage. When a workflow keeps dragging forward useless history, no pricing plan will save it.

If I had to reduce the entire category to one sentence, it would be this: the people who get the most out of AI subscriptions are not the people with the prettiest prompts, but the people with the cleanest workload boundaries.

A Practical Operating Pattern

The pattern I trust most across vendors is simple:

Queue work as discrete jobs.
Start each run with only the context needed for that job.
Stop early when the run is drifting.
Restart narrower instead of arguing with a bloated thread.
Save reusable knowledge outside the transcript whenever the product supports it.
Spend burst windows on the expensive parts of the work.
Review what actually consumed budget at the end of the day or week.
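The last step in that list, reviewing what actually consumed budget, can be as small as the sketch below. The run log format (a label plus a cost per run) and the function name are assumptions for illustration. Most products expose usage in their own dashboards or exports, and the shape will differ.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def spend_by_kind(runs: Iterable[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Total cost per kind of run, most expensive kind first."""
    totals = defaultdict(float)
    for kind, cost in runs:
        totals[kind] += cost
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical week of runs, in whatever unit the plan meters.
week = [("refactor", 4.0), ("chat", 0.5), ("refactor", 6.0), ("debug", 3.0), ("chat", 0.25)]
spend_by_kind(week)  # [("refactor", 10.0), ("debug", 3.0), ("chat", 0.75)]
```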

None of this is magic. That is the point.

The market keeps encouraging users to think the magic is in the model. For power users, most of the advantage is in the operating discipline wrapped around the model.

So What

If you want more value from these plans, stop optimizing at the sentence level first.

Optimize the run shape. Optimize the reset timing. Optimize the retry boundaries. Optimize the memory strategy. Those are the levers that decide whether a $20 plan feels generous, whether a $100 plan feels justified, and whether an autonomous agent actually saves time instead of quietly turning into a premium background process.

The open thread I am still sitting with: how much of this discipline should remain a user skill, and how much should the products themselves enforce? A lot of these tools already know when a run is bloating, thrashing, or dragging dead context. I am not sure users should have to notice that manually forever.

Part 4 of the Agent Economics series. ← Part 3: Autonomous Agents Break Flat-Rate Pricing

Autonomous Agents Break Flat-Rate Pricing

Amit — Sat, 06 Jun 2026 22:20:08 +0000

TL;DR

Flat-rate SaaS assumes usage is relatively predictable. Autonomous agents destroy that assumption.
OpenAI moved Codex to token-based pricing under the hood on April 2, 2026. Cursor, Claude, Devin, and Copilot all now expose some mix of quotas, credits, or overages because the old “all you can use” story does not survive heavy agentic workloads.
The research backs this up. How Do AI Agents Spend Your Money? found coding-agent tasks can consume around 1000x more tokens than simpler coding tasks, with up to 30x cost variance on the same task.
This is why the market is converging on hybrid contracts: subscription wrapper up front, infrastructure metering underneath.
The real strategic question is not whether flat rate survives. It is whether vendors make the meter explicit or keep disguising it as “plan limits.”

Flat-rate pricing works when the vendor can predict the cost of serving a user well enough to hide the variance.

Autonomous agents make that much harder.

The issue is not that AI companies suddenly forgot how subscriptions work. It is that the workload shape changed. A chat user who asks ten bounded questions is one type of customer. A user who launches five long-running coding agents against a large repo with tool access is another type entirely. Pricing both people as if they were the same seat was always going to break.

The market is now admitting that, even if it still uses subscription language on the outside.

The Cost Shape Changed First

Classic SaaS seats work because marginal usage is usually small, smooth, and socially constrained. People do not open 300 spreadsheets at once. They do not ask email to recursively generate more email until a budget disappears.

Agentic systems are different.

An autonomous coding agent can inspect dozens of files, call tools repeatedly, retry after failures, expand context, write artifacts, and continue for long stretches without a human interrupting it. That makes the workload both expensive and noisy. The cost is not just higher. It is burstier and harder to predict.

The best public number on this still comes from the paper How Do AI Agents Spend Your Money? It found that agentic coding tasks can consume roughly 1000x more tokens than code chat and code reasoning tasks, with up to 30x variance in cost on the same task across runs. That is fatal to simple flat-rate logic.

If one user's "one task" can cost thirty times more than another user's "same task," then the vendor either needs very high prices, strict caps, opaque throttles, or some form of metering.

The market chose all four.

The Subscription Is Still There. The Meter Is Back Underneath.

The cleanest tell is OpenAI's Codex rate card. On April 2, 2026, OpenAI changed Codex from per-message pricing to a token-based credit model aligned with API usage. That is not a cosmetic tweak. It is an admission that agentic coding behaves more like infrastructure consumption than like chat volume.

The subscription did not disappear. Codex still ships inside eligible ChatGPT plans. But the economics underneath now map directly to input, cached input, and output tokens. Flexible credits then handle overflow when included usage runs out.

That is the new market pattern in one product:

Sell a subscription because users like predictable monthly commitments.
Meter the expensive behaviors because the vendor needs cost recovery.
Preserve the language of plan limits so the experience still feels like membership rather than raw infrastructure billing.

Everyone else is converging toward the same shape from different starting points.
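The token-based shape described above, where cost maps to input, cached input, and output tokens, reduces to a small formula. The sketch below uses placeholder rates chosen only to make the arithmetic visible. They are not OpenAI's rate card or any vendor's real prices, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRates:
    """Credits per one million tokens. Placeholder values, not a real rate card."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


def run_cost(rates: TokenRates, input_tokens: int,
             cached_input_tokens: int, output_tokens: int) -> float:
    """Credits consumed by one run under a three-part token meter."""
    total = (input_tokens * rates.input_per_m
             + cached_input_tokens * rates.cached_input_per_m
             + output_tokens * rates.output_per_m)
    return total / 1_000_000


rates = TokenRates(input_per_m=10.0, cached_input_per_m=1.0, output_per_m=40.0)

# Same one million tokens of context, with and without cache hits.
run_cost(rates, 1_000_000, 0, 50_000)      # 12.0 credits
run_cost(rates, 200_000, 800_000, 50_000)  # 4.8 credits
```

With these made-up rates the same run costs less than half as much when most of its context is served from cache, which is why a per-message price cannot represent the real cost of an agent run.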

The Product Examples Are All Variations Of The Same Economic Truth

Claude

Anthropic's usage credits are a direct bridge from subscription to consumption. The plan includes usage. When you exceed it, you wait for the five-hour reset or continue at standard API rates. That is a hybrid contract.

Cursor

Cursor Pro looks like a normal $20/month plan until you read the docs and see that it includes $20 of API agent usage plus bonus usage. That is not flat rate. It is a prepaid usage bundle with a familiar wrapper.

Devin / Windsurf

Devin's plans use daily and weekly quota plus on-demand credits. Its usage docs explicitly tie usage to actual work performed and note that idle sleep does not materially consume usage. That is not how software seats are normally described. It is how computational workloads are described.

GitHub Copilot

Copilot's AI credits model makes the split even sharper. Completions remain unlimited on paid plans, while the more expensive agentic behaviors burn credits. GitHub is effectively saying: the cheap behavior can stay flat-rate, the expensive behavior cannot.

Gemini and Perplexity

Gemini and Perplexity Pro are useful contrasts because they meter at the feature level more than the token level in the consumer experience. That works better for research because the workload units are more discrete. Even there, though, the flat-rate illusion has already weakened into caps, rolling restores, or hard daily quotas.

These are not different philosophies. They are different disguises for the same economic constraint.

Why Flat Rate Breaks Faster Under Agents Than Under Chat

Two things make agents especially hostile to flat pricing.

First, they compound context cost. Evaluating AGENTS.md found repository-level context files often increased inference cost by more than 20% while reducing success in the tested setup. Agent systems do not just answer the question you asked. They drag around instructions, tools, file state, retries, and accumulated transcript weight. The cost surface expands faster than the visible user action suggests.

Second, they create retry loops and long tails. An agent that hits a tool failure may recover, retry, branch, or continue exploring. That means "run one task" is not a stable unit of work. It can terminate quickly or sprawl.

This is why Beyond the Context Window matters commercially, not just technically. Persistent memory and scoped retrieval are not nice optimizations. They are survival mechanisms for any vendor trying to offer agentic behavior without melting the unit economics.

Flat-rate products can survive high volume if each action is cheap and bounded. Autonomous agent actions are neither.

The Hidden Fight Is Between Marketing Simplicity And Economic Honesty

Users want a number they can budget against. Vendors want a contract they can survive.

That tension is producing the current mess of Pro, Max, credits, quota, premium requests, usage windows, and flexible pricing. The language varies. The structure underneath is converging: some included usage, some throttling layer, some overflow rule, and some attempt to smooth the user into accepting that heavy autonomous work costs more.

The Yale paper Menu Pricing of Large Language Models is useful here because it frames the market as a menu-pricing problem. Once demand becomes heterogeneous and spiky, vendors need multiple tiers and multiple pricing instruments. A single flat price stops being rational.

That is exactly what the subscription market now looks like.

Why This Is Not Just A Vendor Problem

It is tempting to read this as a story about companies quietly pulling back from generous plans.

That is too shallow.

The deeper issue is that users still talk about these tools as if they are buying software seats, while the vendors are increasingly selling access to a volatile compute system with a seat-like onboarding experience. Those are not the same thing. If buyers keep using SaaS instincts, they will keep being surprised by invisible meters, reset windows, and overage paths.

The products are not lying so much as compressing an uncomfortable truth into cleaner packaging: autonomous work is expensive, variable, and not well matched to flat entitlements.

So What

The right question is not whether AI subscriptions will stay subscriptions.

They will, because the wrapper is useful.

The real question is what percentage of the economic truth stays hidden underneath that wrapper. Will the market normalize explicit credit balances, token equivalents, and cost-per-agent-run? Or will it keep selling "higher limits" and "priority access" because that language is easier to market?

My read is that the hybrid model wins. A subscription gets you in. Metering determines how far you can actually go. The only real variation is how much of that metering the vendor is willing to admit in public.

The open thread I am still stuck on: if agents become more capable and more autonomous, does the market eventually stop pretending these are software subscriptions at all and start selling them the way cloud compute is sold, or does the subscription wrapper remain politically necessary even after everyone knows what is happening underneath?

Part 3 of the Agent Economics series.
← Part 2: Reset Windows Are Product Design · Part 4: How To Optimize Agent Subscriptions Without Getting Tricked →

AI Subscriptions Are Secretly Usage Models

Amit — Sat, 06 Jun 2026 22:19:33 +0000

TL;DR

The market now has two adjacent categories: coding-agent subscriptions and broader AI work subscriptions. The first group is Claude, ChatGPT + Codex, Cursor, Devin/Windsurf, GitHub Copilot, and Grok. The second group is Google AI Pro / Gemini and Perplexity Pro.
The product question is not price. It is reset cadence, overflow rule, and whether the subscription survives sustained autonomous runs.
Claude is the cleanest burst model: shared usage across surfaces, five-hour reset, optional usage credits. Devin is the most agent-native: daily and weekly quota on Pro, weekly-only on Max, on-demand credits, explicit support for parallel scoped sessions. Cursor and Copilot are the clearest examples of subscriptions turning into included credit bundles.
Gemini and Perplexity belong in the comparison because they are widely bought in the same budget conversation, but they are better understood as research and general AI work subscriptions than primary autonomous coding plans.
The research now points in the same direction as the pricing: agentic coding is structurally expensive, highly variable, and sensitive to context engineering. Token burn is not a side effect. It is the product.

The AI subscription market is no longer one market.

There is now a core category of coding-agent subscriptions and an adjacent category of AI work subscriptions that people buy from the same budget line. They get compared because they all look like monthly plans. They behave differently because the workloads underneath are different.

The thesis: these products are not really selling flat access anymore. They are selling reset policies, credit buckets, overflow rules, and different levels of tolerance for autonomous agentic work.

This is the opener in a short Agent Economics series because the comparison only makes sense once the underlying mechanisms are separated: reset windows, overflow rules, and autonomous-run cost behavior each deserve their own cut.

The Category Map

Here is the landscape I would use as of June 6, 2026.

Core coding and agent subscriptions

Adjacent AI work subscriptions

This is one market from the buyer side. It is not one market from the usage-policy side.

The $20 Decision Matrix

If I had to help someone choose at roughly the $20 price point on June 6, 2026, I would not cut the market only by developer type.

That misses how these plans are actually spreading. Students, researchers, analysts, founders, managers, writers, and mid-career knowledge workers are all buying from the same menu now. The better cut is by work pattern.

Age matters here, but mostly indirectly. It changes digital confidence, attention budget, tolerance for opaque limits, and whether someone wants one general AI membership or multiple specialized subscriptions. I would still avoid hard age stereotypes. The more reliable signal is the shape of the work.

Persona	Main workload	Best fit at around $20	Why
Student or early-career learner	Study help, writing, summarization, occasional coding	ChatGPT Plus	Broadest general-purpose bundle, easiest single subscription if you need one AI plan for many different tasks
Research-heavy knowledge worker	Search, synthesis, source discovery, report building	Perplexity Pro	Clear daily research budget, strongest fit when the core activity is information retrieval and synthesis
Google-centric professional	Docs, Gmail, search, browser, general office work	Google AI Pro	Broad workflow integration and explicit daily caps, better fit for mixed productivity than pure coding throughput
Solo builder or founder	Writing, planning, product thinking, some coding, some research	ChatGPT Plus or Claude Pro	ChatGPT if the need is broad and cross-functional. Claude if the work comes in sharper deep-work bursts
Solo coder who works in intense bursts	Heavy focused coding sprints	Claude Pro	Shared chat plus Claude Code pool, clean five-hour reset logic, strongest fit for bounded sprint work
IDE-first builder who wants explicit overage behavior	Coding inside an editor, with occasional heavy agent runs	Cursor Pro	Best fit if you accept that the $20 plan is a starter budget, not a power-user ceiling
Completion-heavy developer with lighter agent use	Frequent code completion, lighter chat and agent usage	GitHub Copilot Pro	Cheapest serious coding seat; unlimited completions matter if autonomous runs are not the center of the workflow
Operator experimenting with autonomous job queues	Scoped agent runs, repetitive build tasks, delegated execution	Devin Pro	Closest thing to an agent-native contract at the price, but still an entry tier with quota constraints
Writer, marketer, or general creator	Drafting, rewriting, brainstorming, multi-format content work	ChatGPT Plus or Claude Pro	ChatGPT is the broader generalist bundle. Claude is stronger if the value comes from long, focused drafting sessions

Two caveats matter.

First, there is no real flat-rate power-user winner at $20. The whole market is too computationally volatile for that now.

Second, Grok's current direct plan is $30 for SuperGrok, so it belongs in the market map but not in the strict $20 decision set.

The broader social point is that this market is already segmenting the way telecom and cloud did: one bundle for the mainstream, one research-heavy option, one productivity-suite option, one coding-native option, and one higher-intensity tier for people whose workloads are simply more expensive to serve.

The Table That Matters

This is the real comparison.

Product	Entry paid plan	Reset cadence	Overflow model	What power users can squeeze out	Transparency
Claude	$20 Pro	Five-hour reset after limit hit	Wait, upgrade, or enable usage credits at standard API pricing	Strong burst usage if you keep sessions short, clear between tasks, and avoid large context drag	High
ChatGPT + Codex	$20 Plus	Reset exists, but OpenAI does not publish one simple universal public window for all plans	Some Plus and Pro users can add credits; others upgrade or wait	Good for many scoped tasks; expensive long-context runs because Codex is token-metered underneath	Medium
Cursor	$20 Pro	Monthly billing cycle	Buy additional usage at cost or upgrade	Fine for light daily agent use; sustained autonomous work usually pushes you past included usage	High
Devin / Windsurf	$20 Pro	Pro: daily + weekly. Max: weekly only, no daily cap	On-demand credits continue work without interruption	Best fit for autonomous queues; idle sleep does not materially consume usage, and scoped parallel sessions are explicitly supported	High
GitHub Copilot	$10 Pro	Monthly AI credit allowance	Buy more usage via AI credits and usage-based billing	Strong value if your workflow leans on completions, because completions stay unlimited while agents and chat burn credits	High
Grok	$30 SuperGrok or X Premium+ at $40 in the U.S.	No clear official public reset cadence I could verify	Multiple billing surfaces: Grok.com, X subscription, API	Hard to optimize because the limits are not clearly documented	Low
Google AI Pro / Gemini	$19.99	Mostly daily feature caps; limits may change without notice	Upgrade to higher Google AI tier; separate monthly AI credits for some media tools	Better for research and general AI workflows than coding-agent saturation; coding value comes through Jules and Gemini CLI, not pure autonomous coding throughput	Medium
Perplexity Pro	$20/month	At least 300 Pro Searches per day, with each credit restored 24 hours after use	Not really overflow; mostly a rolling search-credit model plus separate API credits	Strong for research throughput, weak as a primary autonomous coding subscription	Medium

That table captures the entire category better than any benchmark chart.

What The Best Plans Are Actually Optimized For

The plans look similar from the checkout page. They are optimized for different behaviors.

Claude: burst intensity

Claude Code is included with Pro and Max, and Anthropic says usage is shared across Claude surfaces. That makes Claude the cleanest "one account, one pool" design in the market.

It is also the clearest burst model. Across its usage and best-practices docs, Anthropic documents the five-hour reset that applies once a paid plan hits its limit and explicitly recommends keeping conversations shorter, reducing tool usage, and keeping the context window under control.

Claude rewards people who work in sprints. Open a clean session. Do one bounded task. Exit. Reset. Repeat.

ChatGPT Codex: broad membership, softer boundaries

Codex is included in eligible ChatGPT plans. That means coding is one surface inside a broader membership rather than a dedicated coding contract.

The important shift is hidden in the mechanics. OpenAI moved Codex to a token-based credit rate card on April 2, 2026. That is a major tell. It means the subscription wrapper is still consumer-friendly, but the meter underneath now maps directly to input, cached input, and output tokens.

That makes Codex more economically legible, but it also means the plan no longer behaves like a flat rate. It is good for many small and medium tasks. It becomes more expensive and less predictable when you let context sprawl.
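To make the context-sprawl point concrete, here is a minimal sketch of a token-metered session. The rate card values, the `RateCard` and `session_cost` names, and the assumption that every turn replays the full prior transcript are all illustrative; they are not OpenAI's published rates or billing logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateCard:
    """Credits charged per one million tokens. Values used below are made up."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


def turn_cost(rates: RateCard, input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Credits for one turn. cached_tokens is the slice of input billed at the cached rate."""
    fresh_tokens = input_tokens - cached_tokens
    credits = (
        fresh_tokens * rates.input_per_m
        + cached_tokens * rates.cached_input_per_m
        + output_tokens * rates.output_per_m
    )
    return credits / 1_000_000


def session_cost(rates: RateCard, turns: int, new_input_per_turn: int,
                 output_per_turn: int, cache_hit_ratio: float = 0.0) -> float:
    """Total credits when every turn replays the whole prior transcript as input."""
    total = 0.0
    history_tokens = 0
    for _ in range(turns):
        input_tokens = history_tokens + new_input_per_turn
        cached_tokens = int(history_tokens * cache_hit_ratio)
        total += turn_cost(rates, input_tokens, cached_tokens, output_per_turn)
        # The next turn carries this turn's input and output as history.
        history_tokens = input_tokens + output_per_turn
    return total


# Hypothetical rates, chosen only to show the shape of the curve.
HYPOTHETICAL_RATES = RateCard(input_per_m=10.0, cached_input_per_m=1.0, output_per_m=40.0)

one_long = session_cost(HYPOTHETICAL_RATES, turns=20, new_input_per_turn=1_000, output_per_turn=500)
four_short = 4 * session_cost(HYPOTHETICAL_RATES, turns=5, new_input_per_turn=1_000, output_per_turn=500)
print(f"one 20-turn session:  {one_long:.3f} credits")
print(f"four 5-turn sessions: {four_short:.3f} credits")
```

Under these assumptions the same 20 turns cost more in one long thread than in four short ones, because replayed input grows with every turn. Raising `cache_hit_ratio` narrows the gap but does not remove it.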

Cursor: the honest hybrid

Cursor Pro is $20/month, but the docs say Pro includes $20 of API agent usage plus bonus usage. Cursor also says daily agent users typically land in the $60-$100/month range, and power users often exceed $200/month.

That is the most honest sentence in the category.

Cursor does not pretend the $20 plan is enough for heavy autonomous work. It treats the subscription as a soft commit with included value and clear overage behavior. That is much closer to cloud economics than classic SaaS economics.

Devin / Windsurf: the most agent-native contract

Devin's self-serve plan docs are the clearest public explanation of agent-native billing I found. Pro includes a daily and weekly quota shared across Devin sessions, Devin for Terminal, and the Windsurf IDE. Max keeps the weekly quota but removes the daily cap. Overages are covered by on-demand credits.

The more important part is in the usage docs. Devin says usage reflects actual work performed. Sleep does not consume usage. Inactive sessions go to sleep. And large projects should be split across multiple sessions because there is no limit on simultaneous sessions.

That is what agent-native pricing looks like. The contract assumes you will run many scoped jobs, not one long chat.

GitHub Copilot: cheaper seat, clearer metering

Copilot Pro is $10/month, Pro+ is $39, and GitHub says Copilot Max is built for sustained, heavy agent-driven workflows and includes $100/month in GitHub AI Credits. GitHub's docs now frame usage in AI credits, where one credit equals one cent.

The sharp distinction is this: code completions remain unlimited on paid plans, while agent mode, chat, cloud agent, code review, and CLI burn credits.

Copilot is therefore strongest when your workflow still includes a large completion layer, not only autonomous delegation. It is cheaper than the $20 coding plans because it meters the expensive parts more directly.

Grok: fragmented and opaque

xAI sells SuperGrok on Grok.com. X sells Premium and Premium+ with Grok access inside X. X Premium+ in the U.S. is $40/month. Grok Build exists as a CLI product, but public limit documentation remains thin.

Grok is the easiest product in the set to misunderstand because the brand is unified and the billing surfaces are not.

Before comparing Grok to the others, you first have to ask: which commercial boundary are you actually buying?

Gemini and Perplexity: adjacent, not identical

Google AI Pro is $19.99/month. Gemini's help docs publish unusually explicit feature caps: up to 100 prompts per day on Gemini 3 Pro for AI Pro, up to 500 for AI Ultra, up to 20 Deep Research reports per day on Pro, and up to 200 on Ultra. That is clearer than most consumer AI plans.

But Gemini is not optimized around coding-agent saturation. Its subscription is broader: Gemini app, Gmail, Docs, NotebookLM, Jules, Flow, and media generation.

Perplexity Pro is even clearer about its shape. It gives users at least 300 Pro Searches per day, and each used credit is restored exactly 24 hours later. That is a rolling daily meter, not a hard midnight reset. Perplexity belongs in the budget conversation because it competes for the same dollars. It does not belong in the same operational bucket as Claude Code or Devin if the main use case is autonomous coding.
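The difference between a rolling restore and a hard reset is easier to see in code. The sketch below models a meter where each spent credit comes back a fixed interval after it was used. The class name, method names, and the assumption that calls arrive in time order are mine; this is not Perplexity's implementation.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Optional


class RollingCreditMeter:
    """Each spent credit is restored `restore_after` after the moment it was used."""

    def __init__(self, capacity: int, restore_after: timedelta = timedelta(hours=24)):
        self.capacity = capacity
        self.restore_after = restore_after
        self._used_at = deque()  # spend timestamps still outstanding, oldest first

    def _restore(self, now: datetime) -> None:
        # Drop every credit whose own timer has expired.
        while self._used_at and now - self._used_at[0] >= self.restore_after:
            self._used_at.popleft()

    def available(self, now: datetime) -> int:
        self._restore(now)
        return self.capacity - len(self._used_at)

    def spend(self, now: datetime) -> bool:
        """Use one credit if any is available. Assumes `now` never goes backwards."""
        if self.available(now) <= 0:
            return False
        self._used_at.append(now)
        return True

    def next_restore(self, now: datetime) -> Optional[datetime]:
        """When the oldest outstanding credit returns, or None if nothing is outstanding."""
        self._restore(now)
        return self._used_at[0] + self.restore_after if self._used_at else None


start = datetime(2026, 6, 6, 9, 0)
meter = RollingCreditMeter(capacity=2)
meter.spend(start)
meter.spend(start + timedelta(hours=1))
print(meter.available(start + timedelta(hours=2)))
print(meter.next_restore(start + timedelta(hours=2)))
```

With a hard midnight reset both credits would return together. Here they return one at a time, which is why steady pacing fits this meter better than a single burst.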

How To Squeeze More Autonomous Work Out Of These Plans

The common optimization pattern is not clever prompting. It is cost hygiene.

Keep runs narrow. Long, wandering sessions are the fastest path to hidden spend.
Keep context short. The most expensive token is often the repeated input token, not the output.
Split large jobs into independent subtasks instead of asking one agent to do everything in one thread.
Bound retries. Autonomous loops that continue after a tool failure can quietly become the whole bill.
Prefer reusable memory or retrieval over replaying giant histories.
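The "bound retries" rule can be sketched as a small wrapper that stops on either an attempt cap or a token budget, whichever comes first. The `run_bounded` name, the `(succeeded, tokens_spent)` step contract, and the default limits are assumptions for illustration, not any vendor's API.

```python
from typing import Callable


def run_bounded(step: Callable[[], tuple[bool, int]],
                max_attempts: int = 3,
                token_budget: int = 50_000) -> dict:
    """Retry `step` until it succeeds, attempts run out, or the token budget is spent.

    `step` is a hypothetical callable returning (succeeded, tokens_spent).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    spent = 0
    for attempt in range(1, max_attempts + 1):
        succeeded, tokens = step()
        spent += tokens
        if succeeded:
            return {"ok": True, "attempts": attempt, "tokens": spent}
        if spent >= token_budget:
            break  # stop before a failing loop becomes the whole bill
    return {"ok": False, "attempts": attempt, "tokens": spent}


def always_failing_tool() -> tuple[bool, int]:
    # Stand-in for an agent step that keeps hitting the same tool failure.
    return (False, 10_000)


print(run_bounded(always_failing_tool, max_attempts=3))
print(run_bounded(always_failing_tool, max_attempts=10, token_budget=15_000))
```

Two limits are used on purpose: the attempt cap protects against cheap infinite loops, and the token budget protects against a few very expensive ones.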

This is especially obvious in the products with the clearest policies. Anthropic explicitly recommends shorter conversations and fewer active tools. Devin explicitly recommends multiple scoped sessions. Cursor explicitly tells you what typical daily agent usage costs. The products are telling you how to survive them.

How People Are Actually Starting To Think About This

This is not only an academic framing.

Researchers are helping make the pattern legible, but the shift is broader than that. Builders, founders, students, operators, independent professionals, and general knowledge workers are all running into the same realization from different directions: these products do not feel like normal software subscriptions once you use them seriously.

They feel more like governed access to compute.

The research now gives sharper language to that intuition. How Do AI Agents Spend Your Money? shows that agentic coding workloads can be drastically more expensive and more variable than simpler interactions. Evaluating AGENTS.md shows that more repository context can increase cost while hurting results. Beyond the Context Window shows that memory design matters economically, not only technically. The Yale paper Menu Pricing of Large Language Models explains why the market keeps drifting toward hybrid menus instead of clean flat-rate plans.

But none of those papers created the underlying feeling.

The feeling came first. People started noticing that the same $20 label could mean a five-hour burst pool, a monthly credit bucket, a daily research allowance, or a soft entry point into metered overages. They started noticing that autonomous work burns budget differently from chat. They started noticing that "use it more" and "use it well" are no longer the same thing.

That is the real shift.

The category is teaching ordinary users to think a little more like operators. Not because everyone suddenly cares about token accounting in the abstract, but because the products themselves now force questions like:

What kind of work am I actually buying this for?
Which usage pattern am I likely to hit first: daily cap, weekly quota, five-hour reset, or monthly credits?
When does a second specialized subscription make more sense than asking one generalist subscription to do everything?
Which parts of my work are chat, which are research, and which are autonomous execution?

That is why this market matters beyond the coding crowd. It is becoming part of how a much wider slice of society allocates attention, software budget, and cognitive outsourcing.

What The Research Says

The academic picture is finally catching up with the product behavior.

How Do AI Agents Spend Your Money? is the most relevant paper I found. The headline result is brutal: agentic coding tasks can consume roughly 1000x more tokens than code chat and code reasoning tasks, and the same task can vary by up to 30x across runs. Higher cost does not reliably mean higher accuracy. Cost often peaks past the point of useful return.

Evaluating AGENTS.md found that repository-level context files often increased inference cost by more than 20% and reduced success rates in the tested setups. More context is not automatically better context.

Beyond the Context Window argues that persistent memory systems can beat naive long-context replay on cost and performance. That maps directly to what the best commercial products are doing: retrieval, memory, scoped context, not endless transcript accumulation.

The economics paper Menu Pricing of Large Language Models is not about coding agents specifically, but it frames the category correctly. The market is moving toward token-budget menus, max tiers, and hybrid subscriptions because flat pricing breaks under spiky agent demand.

So What

The wrong question is "Which $20 plan is best?"

The better questions are:

Which plan has the reset cadence I can live with?
Which plan has the overflow rule I trust?
Which plan is transparent enough that I can tell when autonomous runs are going off the rails?
Which plan is optimized for my actual workload: coding, research, general AI work, or autonomous job queues?

Claude is the cleanest burst subscription. ChatGPT Codex is the broadest bundled membership. Cursor is the most honest hybrid. Devin is the most agent-native. Copilot is the cheapest serious coding entry point. Gemini is the clearest generalist plan. Perplexity is the clearest research plan. Grok is the most fragmented.

The open thread I am still sitting with: does this market converge toward explicit infrastructure-style metering, or do vendors keep the subscription wrapper because people will accept hidden compute budgets longer than they will accept visible compute bills?

Part 1 of the Agent Economics series. Part 2: Reset Windows Are Product Design →

Your AI Agent Needs Communication Modes, Not a Voice Clone

Amit — Sat, 06 Jun 2026 07:21:21 +0000

TL;DR

Every AI platform collapses communication into a single flat voice profile — but knowledge workers switch between at least six distinct registers daily (casual, professional, leadership, field, publishing, builder), and averaging them produces output that's wrong for every context.
The fix is engrams: mode-specific profiles with tone calibration, vocabulary boundaries, structural patterns, values integration, and — most importantly — an anti-pattern library. Anti-patterns are more distinctive than positive examples.
Agent output should amplify intent, not clone raw voice. A casual voice message delivers intent; the engram-calibrated agent delivers a draft that exceeds real-time output quality for that register.
Automatic mode detection (from a config-backed priority hierarchy: override → recipient → role → channel → intent keywords) eliminates manual mode selection entirely.
Spend one hour building six mode engrams. Every AI interaction improves for the rest of the year.

Every AI assistant on the market today offers some version of "write like me." Upload your writing samples, set a style preference, and the model will dutifully mimic your patterns. The output reads like a slightly off photocopy — recognizably shaped like you, but missing the judgment calls that make communication actually work.

The problem is not that voice cloning fails technically. It is that voice cloning answers the wrong question. The question is not "how do I sound?" The question is: "how should this message land, given who it is for and what it needs to do?"

Knowledge workers do not have one voice. They have six.

The Flat Persona Trap

Every major AI platform now offers persistent voice customization. ChatGPT has Custom GPTs and Projects with system instructions. Claude has Projects, custom instructions (including a "Taste Interviewer" prompt pattern that extracts voice DNA from conversation), and a recently added Styles feature where you pre-select formal, concise, or explanatory modes or upload custom examples. Gemini has Gems.

All of these tools treat voice as a single axis. You feed the model writing samples, it pattern-matches your sentence length, vocabulary, and quirks, and then every output comes through that same filter. A casual DM to a friend sounds the same as an exec briefing to a VP. A customer email sounds the same as an internal team message. A published thought piece sounds the same as a handoff task to another agent.

As one practitioner put it: "The standard advice for getting AI to match your voice is to feed it samples and say 'write like this.' This barely works." The reason it barely works is not sample quality. It is that a single-mode voice profile collapses context that professionals spend years learning to calibrate.

Linguistics has a word for what happens next: register. Register is the form that language takes in different circumstances — and "code switching" is the ability to move between registers guided by context. UCLA research confirms what anyone in a professional setting already knows: people employ casual, slang-infused language among peers while adopting structured, formal language with leadership. This is not inconsistency. It is competence.

A flat persona strips that competence away.

Six Modes, Not One Voice

If you work in any knowledge-intensive role — product management, solutions architecture, engineering leadership, GTM strategy — you switch between at least six distinct communication modes every day:

Mode	When	What it needs to sound like
Casual / Inner Circle	DMs with close colleagues, peers you trust	Direct, warm, zero ceremony. Short. Familiar but not sloppy.
Professional / Peer-to-Peer	Cross-functional threads, team channels, project syncs	Strategic, data-specific, action-oriented. Advisory posture — flag, connect, advise.
Leadership / Upward	Exec emails, endorsement requests, VP briefings	Personal but purposeful. Confident, not deferential. Ask first, context later.
Field / External	Customer emails, partner comms, external stakeholders	Customer-obsessed, growth-oriented, warm but measured.
Publishing / Thought Leadership	Blog posts, strategy docs, public writing	Evidence-based, opinionated, universally framed. Not personal — "If you work in..."
Builder / Technical	Handoff tasks, system docs, architecture, code	Precise, structured, executable. Written for machines and humans simultaneously.

Each mode has different vocabulary, sentence structure, opening patterns, closing patterns, and — critically — a different set of things you would never say. A casual message that opens with "I hope this note finds you well" is wrong. A leadership message that opens with "Hey dude" is wrong. A published post that opens with "In my role as..." is wrong.

The voice is not the variable. The mode is.

The Architecture: Engrams

An engram is a mode-specific voice profile. Not a flat style guide — a structured analysis of how communication should work in a specific register, for a specific audience, with a specific intent.

Each engram contains five components:

1. Tone Calibration
Not "friendly and professional" — that describes 90% of the internet. Instead: "Direct. No ceremony. Two sentences max for the opener. Get to the ask within the first three lines." Specificity is the entire point.

2. Vocabulary Boundaries
What to use and — more importantly — what to never use. The never-use list is more distinctive than the use list. Everyone uses "thanks." Not everyone avoids "appreciate it, brother." The anti-pattern is the fingerprint.

3. Structural Patterns
How messages open, flow, and close. Casual mode: no opener, straight to content. Leadership mode: the ask comes first, the context follows only if they say yes. Publishing mode: the surprise or finding leads, not the setup.

4. Organizational Values Integration
For organizations with articulated operating principles, each mode emphasizes different values. Casual mode leans on speed and directness. Leadership mode leans on trust-building. Publishing mode leans on big-picture thinking and customer focus. The values are not decorative — they calibrate judgment.

5. Anti-Pattern Library
The explicit list of phrases, structures, and behaviors that are wrong for this mode. This is the highest-signal component. "No worries if not" at the end of a leadership ask signals lack of confidence. "I can pull together a one-pager" in an advisory message signals doer posture when advisor posture is required. "In my role as..." in a published post signals credential framing when universal framing is needed.

The anti-patterns catch failures that positive instructions miss. "Be confident" is vague. "Never end a leadership message with an opt-out phrase like 'either way' or 'no pressure'" is actionable.
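One way to picture the five components is as a small data structure with a check that flags anti-patterns in a draft. The `Engram` class, its field names, and the plain substring matching are assumptions for illustration; the post does not prescribe a schema.

```python
from dataclasses import dataclass, field


@dataclass
class Engram:
    """Hypothetical container for one mode's profile, one field per component."""
    mode: str
    tone: str                                                # 1. tone calibration
    never_use: list[str] = field(default_factory=list)       # 2. vocabulary boundaries
    structure: str = ""                                      # 3. structural patterns
    values: list[str] = field(default_factory=list)          # 4. values integration
    anti_patterns: list[str] = field(default_factory=list)   # 5. anti-pattern library

    def violations(self, draft: str) -> list[str]:
        """Return every banned phrase found in the draft (case-insensitive substring match)."""
        text = draft.lower()
        return [p for p in self.never_use + self.anti_patterns if p.lower() in text]


leadership = Engram(
    mode="leadership",
    tone="Confident, not deferential. Ask first, context later.",
    never_use=["hey dude"],
    structure="Ask in the first lines; context only if they say yes.",
    values=["trust-building"],
    anti_patterns=["no worries if not", "either way", "no pressure"],
)

draft = "Could you endorse my promo case this week? No worries if not."
print(leadership.violations(draft))
```

Substring matching is the crudest possible check. It is enough to show why a concrete never-say list is easier to enforce than an instruction like "be confident".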

Why Amplification, Not Cloning

The critical distinction: agent output should not sound like a raw transcript of the human. It should sound like an amplified version of the human's intent.

When someone dictates a quick voice message — "hey can you reach out to Kevin and tell him I talked to Sarah about the promo thing and it'd be great if he could put in a good word" — they are not delivering final copy. They are delivering intent. The raw transcript captures the meaning but not the polish. A flat voice clone would reproduce the filler words, the incomplete thoughts, the verbal tics.

An engram-calibrated agent does something different. It takes the intent, identifies the correct mode (casual / inner circle), applies the mode's structural patterns (direct opener, no ceremony, short), checks against the anti-pattern library (no sycophantic closings, no emoji overuse, no hedging), and produces output that is more cohesive than what the human would have typed themselves — while remaining unmistakably shaped by the human's values and directness.

This is not ghostwriting. It is amplification. The human reviews, edits, and sends — but they start from a draft that already exceeds their typical real-time output quality for that register.

The Axios guide to building AI writing clones gets halfway there: "Don't ask the AI to go find your voice. Give it your voice. The CEO who uploads 50 documents gets a 10x better clone than the one who types a few simple prompts." True — but the 50 documents still produce one flat clone. The upgrade is giving it 50 documents tagged by mode, so it knows which version of the voice to invoke.

What the Competition Offers Today

A quick landscape of how current tools handle voice personalization:

Platform	Approach	What It Gets Right	What It Misses
ChatGPT (Custom GPTs / Projects)	System instructions + uploaded samples	Persistent context across conversations	Single mode per GPT/Project — no register switching
Claude (Projects / Styles)	Preset styles (formal, concise, explanatory) + custom examples	Recently added Styles feature with mode selection	Styles are generic presets, not user-specific mode profiles
Claude Skills	Markdown files encoding voice + workflow	Eliminates the "Blank Slate Tax" — voice persists across sessions	One skill = one voice. No multi-mode architecture.
Gemini Gems	Custom instruction sets per Gem	Quick setup, integrated with Google ecosystem	Same single-mode limitation as Custom GPTs
Voice cloning prompts	Feed samples → extract patterns → reproduce	"Taste Interviewer" pattern produces detailed voice DNA	Clones the raw voice including flaws — no amplification, no mode switching

None of these platforms offer mode-specific profiles. You can create separate GPTs or Projects per mode — but there is no architecture that automatically selects the right profile based on context (audience, channel, intent). The mode selection is entirely manual.

Building It: The Practical Loop

Here is how an engram system works in practice:

Step 1 — Collect samples across modes. Pull your chat DMs (casual mode), sent emails (professional + leadership modes), published posts (publishing mode), and agent conversations (raw intent signal). Tag each sample by mode.

Step 2 — Analyze per mode. For each mode, produce a structured analysis: tone, vocabulary, sentence patterns, openers/closers, anti-patterns, values emphasis. The analysis should be 500–800 words per mode — specific enough to calibrate, short enough to fit in a system prompt.

Step 3 — Build the anti-pattern library. This is the highest-value step. Review agent outputs that you've corrected. Every correction is an anti-pattern: "Don't say 'appreciate it brother.'" "Don't hedge with 'either way.'" "Don't volunteer to build deliverables — flag, connect, advise." Corrections are more distinctive than examples.

Step 4 — Save as persistent profiles. Each mode becomes a named engram that the agent loads based on context. Writing a DM to a close colleague → load casual engram. Drafting an email to a VP → load leadership engram. Writing a blog post → load publishing engram.

Step 5 — Iterate from corrections. Every time you correct an agent draft, the correction feeds back into the relevant engram's anti-pattern library. The profiles sharpen over time through use, not through re-training.

The Enterprise Implication

For individual builders, engrams solve the "my AI sounds generic" problem. For organizations, the implication is larger.

Institutional voice is not one voice. It is a set of registers that encode how the organization communicates in different contexts — with customers, with leadership, with the field, with the public. Today, that institutional knowledge lives in the heads of senior practitioners who have spent years calibrating their register-switching. When they leave, the calibration leaves with them.

Engrams make that calibration portable. A senior practitioner builds mode-specific profiles. A new team member's agent loads those profiles and immediately communicates at a higher calibration than they could achieve alone — not replacing their judgment, but starting them at a higher baseline.

This is not homogenization. Each person's anti-patterns are different, their vocabulary boundaries are different, their structural preferences are different. But the architecture — modes, not flat personas — can be shared. The organization provides the mode taxonomy and the values integration. The individual provides the voice within each mode.

Automatic Mode Detection

Manual mode switching breaks the flow — nobody wants to tell the agent "use casual mode" before every message. The fix is a classification function backed by a config file that encodes a signal priority hierarchy: explicit override → recipient-specific override → role-based mapping → channel/medium detection → intent keyword matching. The agent resolves the correct engram before generating a single word.
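The priority hierarchy above can be sketched as a single resolver function. The config keys, the example addresses and keywords, and the `professional` fallback when no signal matches are all assumptions added for illustration.

```python
from typing import Optional

# Hypothetical config. In practice this would live in a file the agent loads.
CONFIG = {
    "recipient_overrides": {"kevin@example.com": "casual"},
    "role_modes": {"vp": "leadership", "customer": "field"},
    "channel_modes": {"dm": "casual", "blog": "publishing", "handoff": "builder"},
    "intent_keywords": {"endorse": "leadership", "architecture": "builder"},
    "default": "professional",  # assumed fallback, not specified in the post
}


def resolve_mode(message: str,
                 recipient: Optional[str] = None,
                 role: Optional[str] = None,
                 channel: Optional[str] = None,
                 override: Optional[str] = None,
                 config: dict = CONFIG) -> str:
    """Return the engram mode, checking signals from highest to lowest priority."""
    if override:                                      # 1. explicit override
        return override
    if recipient in config["recipient_overrides"]:    # 2. recipient-specific override
        return config["recipient_overrides"][recipient]
    if role in config["role_modes"]:                  # 3. role-based mapping
        return config["role_modes"][role]
    if channel in config["channel_modes"]:            # 4. channel/medium detection
        return config["channel_modes"][channel]
    text = message.lower()
    for keyword, mode in config["intent_keywords"].items():  # 5. intent keywords
        if keyword in text:
            return mode
    return config["default"]


print(resolve_mode("Quick question", recipient="kevin@example.com", role="vp"))
print(resolve_mode("Can you endorse my case?"))
```

Ordering the checks from most to least specific means a named recipient always wins over a generic role or channel rule, so exceptions stay cheap to express in config.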

Anti-pattern extraction from corrections is the second architectural piece. When you reject a draft — "don't say 'appreciate it brother'" — that correction should auto-classify to the relevant mode and append to that mode's engram. Corrections are the highest-signal input the system receives. Every rejection is a fingerprint.

So What

"Friendly and professional" is not a voice. It is the absence of one. Knowledge workers switch between six or more distinct communication registers every day, and every AI platform on the market collapses them into a single flat profile.

The fix is not better voice cloning. It is mode-specific profiles — engrams — that capture how communication should work for a specific audience, intent, and register. Anti-patterns over patterns. Amplification over imitation. Organizational values as behavioral calibration, not decoration.

The person who invests an hour building six mode engrams will get better output from every AI interaction for the rest of the year. The organization that standardizes mode taxonomies will ship institutional communication quality that does not walk out the door when senior practitioners leave.

Your agent does not need your voice. It needs your judgment about which voice to use when.

This post is part of a series on building mode-specific voice profiles for AI agents. The next post covers what the engram builder gives you — and what you have to add on top.

Part 1 of the Voice & Engrams series. Part 2: Your Agent Needs Six Voices, Not One →

What Is an Agent — And What Isn't

Amit — Sat, 06 Jun 2026 07:20:45 +0000

TL;DR

An agent is defined by one thing: the loop — perceive, reason, act, observe, repeat. Chatbots, copilots, fixed workflows, and RPA scripts are not agents; they lack autonomous iteration.
The three required properties: goal-directed behavior (outcome, not response), tool access (interacts with the world), autonomy (multi-step without confirmation at every step).
Claude Code earned 84.6K GitHub stars and a 46% "most loved" rating among developers — compared to Cursor at 19% and Copilot at 9% — because it actually loops: writes, tests, observes failures, fixes.
If you conflate chatbots with agents, you underinvest in the harness, set the wrong reliability bar, and build fixed pipelines that break when the world changes.
Apply the loop test before building: does the system perceive, reason, act, observe, and iterate? If not, name it correctly and architect accordingly.

Every product launch in 2026 uses the word "agent." Customer support chatbots are agents. Autocomplete plugins are agents. Cron jobs with an LLM wrapper are agents. A Zapier flow with a model step is an agent now, apparently.

None of those are agents.

The word has become meaningless through overuse, and the confusion is not academic. If you think your chatbot is an agent, you will build the wrong thing, staff the wrong team, and set the wrong expectations with customers. The distinction matters because the architecture, the reliability requirements, and the trust model are fundamentally different.

This post draws the line.

The Agent Loop — The Only Definition That Matters

An AI agent is a system that can autonomously take actions to accomplish a goal — not a system that responds to a single prompt and waits for the next one.

The defining primitive is the loop. An agent operates in a continuous cycle:

Perceive — take in information from the environment (files, APIs, user input, tool outputs)
Reason — decide what to do next (which tool to call, what parameters to pass, whether the goal is met)
Act — execute the chosen action (call a tool, write a file, send a message, run code)
Observe — evaluate the result (did the action succeed? did the state change? is the goal closer?)

Then it loops back. Perceiving the new state, reasoning again, acting again. This loop can run once for a simple query or iterate dozens of times for a complex workflow. The key: the agent decides when to stop — not the human.
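The loop can be sketched with a toy version of the folder-organizing goal. Everything here is a stand-in: `reason` is a hard-coded rule where a real agent would call a model, the environment is a dictionary rather than a file system, and `max_steps` is an assumed harness safety cap.

```python
from typing import Optional

# Toy environment: filename -> folder, where None means "not filed yet".
State = dict[str, Optional[str]]
Action = tuple[str, str]  # (filename, target folder)


def perceive(env: State) -> State:
    """Take a snapshot of the current environment."""
    return dict(env)


def reason(observation: State) -> Optional[Action]:
    """Choose the next action, or None when the goal is already met."""
    for name, folder in observation.items():
        if folder is None:
            project = name.split("_", 1)[0]  # assumed naming rule: "<project>_<rest>"
            return (name, project)
    return None


def act(env: State, action: Action) -> None:
    name, folder = action
    env[name] = folder


def observe(env: State, action: Action) -> bool:
    """Check that the action had the intended effect."""
    name, folder = action
    return env.get(name) == folder


def run_agent(env: State, max_steps: int = 10) -> list[tuple[Action, bool]]:
    log = []
    for _ in range(max_steps):           # safety cap owned by the harness
        action = reason(perceive(env))
        if action is None:
            break                        # the agent decides the goal is met
        act(env, action)
        log.append((action, observe(env, action)))
    return log


files: State = {"alpha_notes.txt": None, "beta_plan.md": None, "alpha_todo.txt": "alpha"}
for step in run_agent(files):
    print(step)
```

The caller supplies only the goal state and the cap. The number of iterations is decided inside the loop, which is the property the loop test checks for.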

Three properties separate a real agent from everything that calls itself one:

Goal-directed behavior. The agent pursues an outcome, not a response. "Organize this folder by project" is a goal. "What files are in this folder?" is a query.
Tool access. The agent interacts with the external world — file systems, APIs, databases, browsers, shell commands. Without tools, the model is a brain in a jar.
Autonomy. The agent takes multi-step actions without requiring human confirmation at every step. The degree of autonomy varies (more on that below), but zero autonomy means zero agency.

This is the formula from Tian Pan's anatomy of an agent harness: Agent = Model + Harness. The model handles reasoning. The harness handles everything else — tool execution, context management, memory, safety. The loop is what ties them together.

What an Agent Is NOT

The confusion comes from four categories that look agent-like but are not.

Chatbots Are Not Agents

A chatbot receives a prompt and returns a response. One turn. The user drives every interaction. There is no loop — the system does not perceive, act, observe, or iterate. ChatGPT in default mode is a chatbot. Claude.ai in a single-turn conversation is a chatbot. They are useful. They are not agents.

The test: once given a goal, does the system choose and take actions in the world on its own? A chatbot does not. It waits for the next prompt.

Copilots Are Not Agents

A copilot suggests. It autocompletes code. It offers a draft. It highlights errors. The human accepts or rejects every suggestion. GitHub Copilot, in its original autocomplete mode, is a copilot — the model proposes, the human disposes.

The distinction is the approval gate. A copilot has a human-in-the-loop at every step. An agent has a human-in-the-loop at the goal level ("organize my downloads folder") but not at every action level ("rename file X, move file Y, create folder Z").

AI tools require humans to operate them. AI agents operate on behalf of humans. That distinction changes everything about how work gets done.

Workflows Are Not Agents

A workflow is a fixed DAG — a directed acyclic graph of predetermined steps. Step 1 always leads to Step 2, which always leads to Step 3. The path is decided at design time, not runtime. LangChain chains, Airflow DAGs, Step Functions — these are workflows. They are deterministic. They do not reason about which step to take next.

An agent selects its next action based on the current state. If a tool call fails, the agent can try a different tool, adjust parameters, or abandon that approach entirely. A workflow cannot — it follows the graph or it fails.
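A minimal sketch of that difference, using hypothetical stub tools (`flaky_search` and `cached_search` are invented for illustration). The ordered tuple inside `agent()` stands in for a model's runtime choice; a real agent would pick the next tool from the observed error rather than from a fixed list.

```python
def flaky_search(query: str) -> str:
    raise TimeoutError("search backend unavailable")

def cached_search(query: str) -> str:
    return f"cached results for {query!r}"

def workflow(query: str) -> str:
    # Fixed path chosen at design time: one step, no alternative.
    return flaky_search(query)

def agent(query: str) -> str:
    # Path chosen at runtime: after a failure, try a different tool.
    for tool in (flaky_search, cached_search):
        try:
            return tool(query)
        except TimeoutError:
            continue  # observe the failure and move on
    return "gave up: no tool succeeded"

print(agent("Q3 pipeline"))
```

Calling `workflow("Q3 pipeline")` raises `TimeoutError`; calling `agent` with the same input recovers through the second tool.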

RPA Is Not Agentic

Robotic process automation scripts follow brittle, pixel-mapped sequences. Click here, type there, wait for this element. They break when the UI changes. They cannot recover from unexpected states. They have no reasoning layer.

An agent navigating a browser can handle unexpected popups, changed layouts, and missing elements because the model reasons about the visual state and adapts. An RPA script cannot.

The Spectrum of Autonomy

Not all agents are fully autonomous. The reality is a spectrum:

Level	Description	Example
Confirm every action	Agent proposes, human approves each step	Early Claude Code (pre-trust)
Confirm risky actions	Agent acts freely on reads, confirms on writes/sends	Amazon Quick Desktop default mode
Fire and forget	Agent runs to completion, human reviews the output	Claude Code with `--dangerously-skip-permissions`
Continuous autonomous	Agent runs on a schedule with no human trigger	Amazon Quick Desktop scheduled agents

The correct operating point depends on the stakes. Renaming files in a personal folder? Fire and forget. Sending an email to a VP on the user's behalf? Confirm that action.

This is the trust ramp: start at the top of the table and move down as the agent proves reliability. The ramp is not a product decision. It is a per-user, per-workflow decision that evolves over time.
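One way to picture the levels in the table is as an approval gate the harness consults before every action. The sketch below is an assumption-laden illustration: the `Autonomy` enum, the `RISKY_ACTIONS` set, and `needs_confirmation` are hypothetical names, and the rule "anything that writes or sends is risky" is a simplification.

```python
from enum import Enum

class Autonomy(Enum):
    CONFIRM_EVERY = 1
    CONFIRM_RISKY = 2
    FIRE_AND_FORGET = 3
    CONTINUOUS = 4

# Hypothetical risk classification: anything that writes or sends is risky.
RISKY_ACTIONS = {"file_write", "email_send", "post_message"}

def needs_confirmation(level: Autonomy, action: str) -> bool:
    """Decide whether a human must approve this action at this autonomy level."""
    if level is Autonomy.CONFIRM_EVERY:
        return True
    if level is Autonomy.CONFIRM_RISKY:
        return action in RISKY_ACTIONS
    return False  # fire-and-forget and continuous agents act without a gate

print(needs_confirmation(Autonomy.CONFIRM_RISKY, "email_send"))
print(needs_confirmation(Autonomy.CONFIRM_RISKY, "file_read"))
```

Storing the level per user and per workflow, rather than globally, is what lets the setting move as trust accumulates.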

Four Agents, Four Architectures

The agent loop is universal. How it is implemented — what harness wraps the model, what tools are available, what the interaction surface looks like — varies by product. Here are four agents that demonstrate the range.

Claude Code — The Terminal Agent

Claude Code is Anthropic's terminal-native coding agent. It lives in the terminal. No IDE. No browser. No GUI. You describe a task in natural language, and the agent reads your codebase, plans multi-file edits, writes code, runs tests, observes failures, fixes them, and commits the result.

The agent loop is visible: the model reads a file (perceive), decides what to change (reason), edits the code and runs the test suite (act), sees if the tests pass (observe), then iterates until they do.

Claude Code hit 84.6K GitHub stars by March 2026 and earned a 46% "most loved" rating among developers, compared to Cursor at 19% and GitHub Copilot at 9%. It integrates with 150+ tools via MCP (Model Context Protocol), spawns sub-agents for parallel work, and supports automatic memory across sessions via CLAUDE.md files.

This is not autocomplete. This is an agent that writes, tests, and ships software.

Claude Cowork — The Desktop Agent

Claude Cowork launched on January 12, 2026 as a research preview inside the Claude Desktop app. It brought the same agentic architecture that powers Claude Code to non-developers.

You grant Cowork access to a folder. You describe the outcome. It reads, edits, creates, and organizes files within that scope — sorting chaotic downloads folders, pulling expense data from receipt screenshots, synthesizing research documents. Powered by Claude Opus 4.6 with a one-million-token context window, it plans an approach, executes across local files and connected applications, and returns a finished deliverable.

The market noticed. Investors wiped $285 billion from software stocks within days as the implications sank in: an AI capable of autonomous knowledge work, running on a desktop, for $20/month.

Cowork is not a chatbot with file access. It is an agent that takes a goal and works until the goal is met.

Kiro — The Spec-Driven IDE Agent

Kiro is AWS's agentic IDE, built on a fundamentally different philosophy. Where Claude Code and most AI coding tools start with code, Kiro starts with specifications.

The approach is called spec-driven development. Before the agent writes anything, it generates structured specifications — requirements with acceptance criteria, a technical design document, and a numbered task list. You review and edit the specs. Then the agent implements from the spec.

This inverts the model that Cursor, Copilot, and most AI assistants use. In Kiro, the spec is source-of-truth and code is a build artifact. The agent loop runs at a higher level of abstraction: perceive (read the spec), reason (plan the implementation), act (write code, generate docs, create tests), observe (validate against acceptance criteria).

Built on Amazon Bedrock with multiple foundation models, Kiro treats the decisions made during development as first-class artifacts — not ephemeral chat messages that vanish when the tab closes.

Amazon Quick Desktop — The Knowledge Work Agent

Amazon Quick is an agentic AI-powered digital workspace that operates across the full surface of knowledge work — not code, but everything else.

Running Quick Desktop daily with 250+ tools connected in a single conversation, the agent loop operates across Slack, Outlook email and calendar, Salesforce, SharePoint, file systems, knowledge graphs, web search, browser automation, Python/JavaScript execution, and image generation. The agent triages your inbox, drafts replies, searches across indexed folders, posts structured Slack threads, builds dashboards, and manages account context files — all driven by skills (encoded methodology) and a persistent memory system.

A concrete example: the Slack MCP server exposes tools like post_message and search_messages. The Outlook MCP server exposes tools like email_reply and calendar_view. The Salesforce MCP server exposes tools like query_opportunities and update_contact. These MCP servers are the connectors; the tools are the individual functions the agent calls inside its loop. The distinction matters: you don't connect "Slack" to an agent — you connect a Slack MCP server that exposes a set of tools the agent can invoke.

Quick Desktop's distinguishing feature: scheduled agents that run 24/7 in the cloud. These are continuous autonomous agents — they run on cron schedules or event triggers (new Slack message, new email, upcoming calendar event) without any human initiation. A morning briefing agent fires at 7:56 AM, scans email and Slack, classifies by priority, and posts a triage report before you open the app.

This is the farthest point on the autonomy spectrum: agents that operate on behalf of humans with no human in the loop at execution time.

Why the Distinction Matters

If you conflate chatbots with agents, three things go wrong.

You underinvest in the harness. A chatbot needs a prompt and a model. An agent needs tool execution infrastructure, context management, memory, safety enforcement, error recovery, and human-in-the-loop workflows. The harness — the infrastructure that wraps the model — is where reliability lives. Skip it, and your agent fails in production even if the model is brilliant.

You set the wrong trust expectations. Users expect chatbots to be wrong sometimes and shrug it off. Users expect agents — systems acting on their behalf — to be reliable. A chatbot that hallucinates wastes 30 seconds. An agent that sends a hallucinated email to a VP wastes a career. The reliability bar is categorically different.

You build fixed workflows instead of adaptive systems. If you think "agent" means "workflow with an LLM step," you will build rigid pipelines that break when the world changes. Real agents adapt. They recover from tool failures, try alternative approaches, and ask for help when stuck. That adaptability requires the loop — and the loop requires a harness.

The Agent Loop Is the Primitive

Everything in the agent ecosystem builds on top of the loop.

Tools give the agent hands — the ability to interact with the external world. Without tools, the agent is a brain in a jar. But tools alone are atomic: read a file, send a message, query a database. Each tool call is a single action in a single iteration of the loop. MCP servers are the connectors that expose those tools — the Slack MCP server, the Outlook MCP server, the Salesforce MCP server. The tools are the individual functions those servers make available to the agent.

Skills give the agent methodology — the knowledge of how and when to act. A skill encodes a workflow: when to trigger, what inputs to gather, what tools to use in what order, what quality checks to run. Skills are what prevent the agent from reinventing its approach every session.

The harness gives the agent a body — context management, memory, safety enforcement, error recovery, state persistence. The harness is the infrastructure that keeps the loop running reliably across sessions, across tools, and across failures.

Agent → Loop → Tools → Skills → Harness. Each concept builds on the one before it. Get the agent definition wrong, and the rest of the stack is built on sand.

The industry will keep calling everything an agent. That does not mean everything is one. The loop is the line. If the system does not perceive, reason, act, observe, and iterate — it is something else. Call it what it is.

Part 1 of the Agent Primitives series. Part 2: What Is a Tool →

What Is an Agent Harness — The Infrastructure That Makes Agents Actually Work

Amit — Sat, 06 Jun 2026 07:20:10 +0000

TL;DR

Agent = Model + Harness. The model handles reasoning; the harness handles everything else: tool execution, context management, memory, state persistence, safety enforcement, error recovery, and human-in-the-loop workflows.
Claude Code and Claude Cowork run the same underlying model — their experiences are entirely different because the harness is different. The model is a component; the harness is the product.
Two agents using the same model but different harnesses produce wildly different results. This is not a metaphor — it is the literal architecture of every working agent in 2026.
Models are commoditizing; harnesses are differentiating. Skills, memory, learned preferences, and institutional knowledge all live at the harness layer.
Stop evaluating agents by model benchmarks. Evaluate by harness: does it persist memory, enforce safety, recover from failures, and compound institutional knowledge across sessions?

The industry talks about models. Which one is smartest. Which context window is largest. Which benchmark score is highest. That conversation misses the point entirely.

The model is the brain. Without a body, a brain sits in a jar.

The Formula

Agent = Model + Harness.

The model handles reasoning — what to do next, how to interpret results, when to change approach. The harness handles everything else: tool execution, context management, memory, state persistence, safety enforcement, error recovery, and human-in-the-loop workflows.

This is not a metaphor. It is the literal architecture of every working agent system in 2026. Strip the harness away and you have a stateless text-completion API. Add the harness back and you have a system that reads your codebase, triages your Slack, books your meetings, and runs overnight pipelines while you sleep.

Phil Schmid puts it directly: "An Agent Harness is the infrastructure that wraps around an AI model to manage long-running tasks. It is not the agent itself. It is the software system that governs how the agent operates."

Firecrawl's definition sharpens it further: "An agent harness is everything that wraps around an LLM — tool execution, memory, context management, state persistence — excluding the model itself."

The model decides what to do. The harness decides how it gets done.

What a Harness Provides

LLMs are stateless by default. No memory across sessions. No tool access. No file system. No persistence. No safety boundaries. The harness adds all of it.

Tool execution. The model emits a structured tool call — read_file("report.md"). The harness routes that call to the actual API, handles authentication, manages rate limits, and returns the result. Without the harness, the model's tool call is a JSON blob that goes nowhere.
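A rough sketch of that routing step, assuming the model emits its call as JSON text with `name` and `arguments` fields (the exact shape varies by platform). `REGISTRY` and `execute_tool_call` are hypothetical names, and `read_file` is a stub; authentication and rate limiting are left out.

```python
import json

def read_file(path: str) -> str:
    # Stub: a real harness would touch the filesystem here.
    return f"<contents of {path}>"

REGISTRY = {"read_file": read_file}

def execute_tool_call(raw: str) -> dict:
    """Route a model-emitted tool call (JSON text) to a registered function."""
    call = json.loads(raw)
    fn = REGISTRY.get(call["name"])
    if fn is None:
        return {"ok": False, "error": f"unknown tool: {call['name']}"}
    try:
        return {"ok": True, "result": fn(**call["arguments"])}
    except TypeError as exc:  # wrong or missing arguments
        return {"ok": False, "error": str(exc)}

print(execute_tool_call('{"name": "read_file", "arguments": {"path": "report.md"}}'))
```

Errors come back as data instead of exceptions so the model can observe the failure and reason about what to try next.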

Context management. A million-token context window sounds infinite until you try to fit a codebase, a conversation history, a knowledge graph, and forty tool schemas into it simultaneously. The harness decides what enters the context window and what stays out — retrieval, summarization, priority ranking.
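Priority ranking under a token budget might look like the greedy packer below. It is a deliberately simple sketch: the item names, token counts, and priorities are invented, and a production harness would likely combine this with retrieval and summarization.

```python
def pack_context(items: list, budget: int) -> list:
    """Greedy packing: highest priority first, skip anything that would overflow."""
    chosen, used = [], 0
    for item in sorted(items, key=lambda i: i["priority"], reverse=True):
        if used + item["tokens"] <= budget:
            chosen.append(item["name"])
            used += item["tokens"]
    return chosen

items = [
    {"name": "tool_schemas", "tokens": 40_000, "priority": 3},
    {"name": "recent_turns", "tokens": 30_000, "priority": 5},
    {"name": "full_codebase", "tokens": 900_000, "priority": 2},
    {"name": "codebase_summary", "tokens": 20_000, "priority": 4},
]
print(pack_context(items, budget=100_000))
```

With a 100,000-token budget the full codebase is left out and its summary goes in, which is the trade the paragraph above describes.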

Memory. Short-term: the current conversation. Long-term: what the agent learned three weeks ago about your Slack triage preferences. Cross-session: the knowledge graph that compounds entity relationships across every email, Slack message, and meeting note the agent processes. The model has none of this natively. The harness provides all of it.

State persistence. Sessions survive restarts. Conversations resume. Work products are saved. A 90-minute research task that gets interrupted at minute 47 picks up where it left off. Without state persistence, every interruption restarts from zero.
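Resuming after an interruption comes down to checkpointing progress somewhere durable. A minimal sketch, assuming a JSON file is an acceptable store; `run_task`, the step names, and the `stop_after` parameter (used here only to simulate an interruption) are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

def run_task(steps: list, checkpoint: Path, stop_after: Optional[int] = None) -> list:
    """Run steps in order, saving progress after each so a restart can resume."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    for step in steps[len(done):]:
        if stop_after is not None and len(done) >= stop_after:
            break                      # simulate an interruption
        done.append(step)              # a real harness would do the work here
        checkpoint.write_text(json.dumps(done))
    return done

with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "task.json"
    steps = ["search", "read", "summarize", "write_report"]
    first = run_task(steps, ckpt, stop_after=2)   # interrupted after two steps
    second = run_task(steps, ckpt)                # resumes at step three

print(first, second)
```

The second call never repeats `search` or `read`; it reads the checkpoint and continues from `summarize`.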

Safety enforcement. Permission boundaries — which folders can the agent read? Content filtering — does this output contain PII? Action approval gates — should a Slack message to #general require human confirmation? The harness enforces all of these. The model has no inherent concept of "don't post to the wrong channel."
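One of these checks, the folder permission boundary, can be sketched in a few lines. This assumes the harness resolves every requested path before touching it; `is_within_scope` and the `/tmp/workspace` root are illustrative names, not any product's API.

```python
from pathlib import Path

def is_within_scope(requested: str, allowed_root: str) -> bool:
    """True only if the resolved path stays inside the granted folder."""
    root = Path(allowed_root).resolve()
    target = (root / requested).resolve()   # collapses "..", follows symlinks
    return target == root or root in target.parents

print(is_within_scope("notes/todo.md", "/tmp/workspace"))
print(is_within_scope("../secrets.txt", "/tmp/workspace"))
```

Resolving before comparing matters: a naive string-prefix check would accept `../secrets.txt` because the unresolved path still starts with the root.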

Error recovery. Retry logic when an API call fails. Fallback strategies when a tool is rate-limited. Graceful degradation when context overflows. The model generates one response; the harness manages the recovery loop around it.
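Retry plus fallback is the simplest version of that recovery loop. The sketch below assumes transient failures surface as `ConnectionError`; `call_with_recovery` and `flaky_api` are hypothetical, and the delays are kept tiny so the example runs quickly.

```python
import time

def call_with_recovery(primary, fallback, retries: int = 3, base_delay: float = 0.01):
    """Retry the primary tool with exponential backoff, then fall back."""
    for attempt in range(retries):
        try:
            return primary()
        except ConnectionError:
            time.sleep(base_delay * 2 ** attempt)
    return fallback()

attempts = {"n": 0}

def flaky_api():
    # Fails twice, then succeeds, to exercise the retry path.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "live data"

result = call_with_recovery(flaky_api, fallback=lambda: "cached data")
print(result, attempts["n"])
```

If every retry fails, the caller still gets an answer from the fallback, which is the graceful-degradation behavior described above.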

Human-in-the-loop. Trust ramps — confirm every action on Day 1, approve only high-risk actions by Day 30, fully autonomous by Day 60. The harness implements this progression. The model doesn't know what day it is.

Sub-agent orchestration. Spawning four parallel research agents, aggregating their results, managing dependencies between sequential steps. The model can reason about parallelism; the harness actually executes it.
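The fan-out and gather pattern can be sketched with the standard library. This is an illustration only: `research_subagent` is a stub where a real sub-agent would run its own loop with its own tools, and the topic names are invented.

```python
from concurrent.futures import ThreadPoolExecutor

def research_subagent(topic: str) -> dict:
    # Stub: a real sub-agent would run its own loop with its own tools.
    return {"topic": topic, "finding": f"summary of {topic}"}

def orchestrate(topics: list) -> list:
    """Fan out one sub-agent per topic, then gather results in input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(research_subagent, topics))

results = orchestrate(["pricing", "competitors", "churn", "roadmap"])
print([r["topic"] for r in results])
```

`pool.map` returns results in the order the topics were submitted, so the aggregation step does not need to re-sort them.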

Same Model, Different Harness, Completely Different Experience

This is the key insight. Two products can use the exact same underlying model and produce radically different user experiences — because the harness is different.

Claude Code — The Terminal Harness

Claude Code is Anthropic's terminal-native coding agent — 84.6K GitHub stars, 46% "most loved" rating among developers.

The harness is optimized for software engineering. Filesystem access scoped to the project directory. Git-aware — understands branches, diffs, commit history. Shell command execution. Sub-agent spawning for parallel work — the model reasons about which files to edit, the harness executes the edits, runs tests, observes failures, and routes results back. CLAUDE.md files provide persistent project context that survives session restarts. Hooks enforce custom policies before and after every tool call.

The model is Claude. The harness is a terminal runtime built for code.

Claude Cowork — The Desktop Harness

Claude Cowork launched January 12, 2026 as a research preview inside the Claude Desktop app. Powered by Claude Opus 4.6 with a one-million-token context window.

The harness is optimized for knowledge workers who never open a terminal. Folder-scoped filesystem access — the user grants access to a specific folder. The agent reads, edits, creates, renames, sorts, and deletes files within that scope. App automation connects to web and desktop applications. No shell. No git. No code execution.

Same underlying model family. Completely different harness. Completely different user.

Kiro — The IDE Harness

Kiro is AWS's agentic IDE, built on Amazon Bedrock with multiple foundation models.

The harness inverts the model familiar from Cursor and Copilot. The spec is the source of truth; code is a build artifact. Before writing a single line, the harness generates structured specifications — requirements with acceptance criteria, technical design, numbered task list. The user reviews and edits. Then the agent implements from the spec.

The harness is optimized for structured development — spec-driven, document-first, implementation-second. The model generates; the harness enforces the spec → design → task → code sequence.

Amazon Quick Desktop — The Knowledge Work Harness

Amazon Quick Desktop is a knowledge work agent that surfaces hundreds of tool functions in a single conversation. These tools come from connected MCP servers — Slack, Outlook email, calendar, Salesforce, SharePoint, OneDrive, web search, browser automation, image generation — alongside sandboxed Python and JavaScript execution and a local knowledge graph. All of it is accessible without switching apps. The harness decides which MCP servers to connect and which tools to surface to the model; that is the tool exposure point.

The harness is optimized for cross-tool knowledge work. Scheduled agents run in the cloud 24/7 — they execute even when the user is offline. A skills system encodes reusable methodology (not prompts — methodology). Long-term memory compounds across sessions. A knowledge graph connects entities extracted from Slack, email, calendar, and local files. Feed notifications surface agent output as prioritized cards.

The model reasons. The harness manages connections to dozens of MCP servers and their exposed tools, persists institutional knowledge across sessions, and orchestrates parallel sub-agents.

Bedrock AgentCore — The Managed Harness

On April 22, 2026, AWS announced a managed agent harness within Amazon Bedrock AgentCore. Developers declare an agent's model, system prompt, and tools, then run it in three API calls.

The harness manages the full agent loop — reasoning, tool selection, action execution, response streaming — inside a dedicated microVM spun up for each session. No orchestration code required. AgentCore Gateway provides governed connectivity to APIs and MCP servers with built-in auth, access control, and policy enforcement.

The harness is optimized for developers who want to build custom agents without reinventing infrastructure. The model plugs in (Claude, Llama, Mistral — any Bedrock model). The harness provides everything else. This is covered in depth in a separate post in this series.

Harness Engineering Is Becoming a Discipline

The term comes from Mitchell Hashimoto, creator of Terraform and Ghostty. His definition: "Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again."

That is harness engineering. Not prompt engineering — the model's instructions are one input. Not fine-tuning — the model's weights are unchanged. Harness engineering is the practice of improving the infrastructure around the model so that reliability increases with every failure.

Blake Crosley frames the mental model precisely: "An AI coding agent is a programmable runtime with an LLM kernel. Every action the model takes passes through hooks you control. You define policies, not prompts."
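"Policies, not prompts" can be made concrete with a pre-call hook that is allowed to veto a tool call. The sketch below is a generic illustration and does not reproduce any product's hook format; `PolicyViolation`, `block_general_channel`, and `run_with_hooks` are hypothetical names.

```python
from typing import Callable

class PolicyViolation(Exception):
    pass

def block_general_channel(tool: str, args: dict) -> None:
    # Hypothetical policy: never post to #general without a human.
    if tool == "post_message" and args.get("channel") == "#general":
        raise PolicyViolation("posting to #general requires approval")

def run_with_hooks(tool: str, args: dict, fn: Callable[..., str],
                   pre_hooks: list, audit_log: list) -> str:
    for hook in pre_hooks:          # pre-hooks can veto the call
        hook(tool, args)
    result = fn(**args)
    audit_log.append((tool, args))  # post-hook: record what actually ran
    return result

log = []
post = lambda channel, text: f"posted to {channel}"
print(run_with_hooks("post_message", {"channel": "#triage", "text": "hi"},
                     post, [block_general_channel], log))
```

The same call aimed at `#general` raises `PolicyViolation` before the tool runs, no matter what the prompt said.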

The discipline has matured quickly. Both OpenAI and Anthropic now use the term, Martin Fowler has written about it, and an arXiv paper formalizes the pattern. This is not a buzzword. It is the missing architectural layer that determines whether AI agents work in production.

The harness is where reliability lives. Models hallucinate; harnesses catch hallucinations. Models forget; harnesses persist memory. Models don't know your tools; harnesses expose the right tools at the right time.

Why the Harness Matters More Than the Model

Three reasons.

Models are commoditizing. Harnesses are differentiating. You can swap Sonnet for Opus or Haiku and the harness stays the same. The model is a component. The harness is the product. Claude Code, Claude Cowork, Kiro, and Amazon Quick Desktop all have access to the same models; their differentiation is entirely in the harness.

Two agents using the same model but different harnesses produce wildly different results. Give Claude Sonnet a terminal harness and it writes code. Give the same model a knowledge work harness and it triages your inbox. The model is identical. The experience is not.

The harness is where institutional knowledge lives. Skills, memory, learned preferences, safety policies, workflow patterns — all harness-layer concerns. The model has no concept of "last time this customer asked about pricing, here's how we responded." The harness does.

The Convergence

Every harness category is expanding into the others.

Terminal harnesses (Claude Code) are adding knowledge work features — memory, web search, MCP integrations with 150+ tools. Desktop harnesses (Cowork) are adding coding features — file manipulation, structured outputs. IDE harnesses (Kiro, Cursor) are adding agentic loops — autonomous multi-step execution beyond autocomplete. Knowledge work harnesses (Quick Desktop) are adding builder features — agent delegation to Claude Code and Kiro, with git and cloud account access on the roadmap.

The winning harness will unify all four surfaces: terminal, desktop, IDE, and knowledge work — in a single runtime where the model switches modes but the harness provides continuity.

The model conversation is nearly over. The harness conversation is where the actual competition lives.

Part 3 of the Agent Primitives series.
← Part 2: What Is a Tool · Part 4: What Is a Skill →

What Is a Tool — The API Call Your Agent Makes on Your Behalf

Amit — Sat, 06 Jun 2026 07:19:34 +0000

TL;DR

A tool is a single atomic capability with a JSON schema: name, description, parameters, output. The Slack API has existed since 2013; what's new is an AI model autonomously deciding which one to call and when.
MCP standardized tool discovery and invocation across platforms: 97 million monthly SDK downloads and 13,000+ public servers within 16 months of launch. The protocol won; the long-tail integration problem is now plug-in, not build.
Work happens when the agent chains tool calls into a workflow: six calls and a reasoning step, one coherent sequence, no human orchestrating the order.
Having 250 tools does not make an agent capable. The bottleneck shifts to methodology: which channel, what tone, what prior context to reference. That's the skill layer.
Stop adding more tools. The tool layer is solved. Build the methodology layer above it.

APIs have existed for thirty years. Your agent calling one on your behalf — that's new.

If you've used an AI agent that reads your email, posts to Slack, queries your CRM, and books a meeting — all in one conversation — each of those actions was a tool call. A tool is a single, well-defined capability the agent can invoke. Read a file. Search the web. Create a calendar event. Send a message. Every tool has a name, a description, input parameters, and an output format. The agent reads the description, decides when to call it, fills in the parameters, and interprets the result.

That sounds simple. It is simple. The interesting part is everything around it.

This is the second post in the Agent Primitives series, four posts that cut through the confusion around agents, tools, agent harnesses, and skills. The first post defined what an agent is (and isn't). This one defines the atomic layer underneath the rest: tools. Skills, the reusable encoded methodology built on top of tools, get their own post later in the series.

A Tool Is a Function With a JSON Schema

Strip away the marketing and a tool is a function signature. It has:

A name: email_send, file_read, calendar_view, web_search
A description: natural language explaining what the function does — this is what the model reads to decide whether to use it
Input parameters: typed fields with descriptions (e.g., to: string, subject: string, body: string)
An output: whatever the function returns — a list of emails, a file's content, search results, a confirmation

The agent's reasoning model reads the description and decides: given my current goal and the tools available, which one should I call next, and with what parameters?
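The four parts listed above map onto a small data structure. Here is a sketch for `email_send`, using JSON Schema for the inputs. The top-level key names (`parameters` here) differ between platforms, and the description text and the `missing_arguments` helper are assumptions made for illustration.

```python
import json

email_send_tool = {
    "name": "email_send",
    "description": "Send an email to one recipient. Use only after the draft is approved.",
    "parameters": {               # JSON Schema describing the inputs
        "type": "object",
        "properties": {
            "to": {"type": "string", "description": "Recipient address"},
            "subject": {"type": "string", "description": "Subject line"},
            "body": {"type": "string", "description": "Plain-text body"},
        },
        "required": ["to", "subject", "body"],
    },
}

def missing_arguments(tool: dict, arguments: dict) -> list:
    """Return required parameters the model failed to supply."""
    return [p for p in tool["parameters"]["required"] if p not in arguments]

print(json.dumps(email_send_tool, indent=2))
print(missing_arguments(email_send_tool, {"to": "a@example.com"}))
```

The description field does the heavy lifting: it is the only part the model reads as prose, so vague descriptions lead to wrong tool choices.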

That decision — the model selecting and parameterizing a tool call at runtime — is the fundamental shift. APIs existed long before AI agents. SDKs, webhooks, REST endpoints, GraphQL queries. All the plumbing was already there. What changed is that the human is no longer the one deciding which API to call. The model is.

Tools Are Not New. Who Calls Them Is.

A Slack API endpoint for posting a message has existed since 2013. An Outlook API for reading email has existed since 2015. A Salesforce API for querying opportunities has existed since the early 2000s. None of this is novel infrastructure.

What's novel: an AI model sitting in a reasoning loop, examining your goal ("triage my inbox and flag anything from Tier-1 accounts"), scanning its available tools, and autonomously deciding:

Call email_inbox to pull the last 50 messages
Call file_read on accounts.csv to load the Tier-1 list
For each email, reason about sender, subject, and body against the Tier-1 list
Call file_write to produce a triage report
Call conversations_add_message to post the summary to Slack

No human selected those tools. No human wrote that sequence. The agent reasoned through it based on the goal and the tools available.

This is why the tool layer matters: it's the interface between the agent's reasoning and the external world. Without tools, the agent is a brain in a jar — it can think about your email, but it can't read it.

Seven Categories of Tools

In practice, tools cluster into functional categories. Here's what a production knowledge-work agent actually uses:

Category	Examples	What It Enables
Communication	Slack MCP server → tools: `post_message`, `search_messages`, `add_reaction`; Outlook MCP server → tools: `email_read`, `email_reply`, `email_forward`	Agent reads and writes to your communication buses
Knowledge	File read/write, semantic search (RAG), knowledge graph queries	Agent accesses and updates your information layer
Data	Salesforce MCP server → tools: `search_opportunities`, `fetch_account_details`, `update_opportunity`; dashboard and spreadsheet tools	Agent queries structured business data
Calendar	View events, check availability, book meetings, find rooms	Agent manages your time
Web	Search, fetch URLs, browser automation (click, type, screenshot)	Agent reaches beyond your local corpus
Code	Run Python, run JavaScript, execute in sandboxed environments	Agent computes, transforms, analyzes
Generation	Create images, transcribe audio, generate documents (PPTX, PDF, DOCX)	Agent produces artifacts

No single tool is interesting on its own. file_read by itself is cat. email_inbox by itself is Outlook. The power comes from what the agent does with the result of one tool to decide the next tool call. That's the agent loop, and it's the primitive that makes tools useful.

MCP: The Universal Tool Protocol

Before 2024, every agent platform defined its own tool format. OpenAI had function calling. LangChain had tool classes. Anthropic had tool-use blocks. If you built a Slack integration for one platform, you rebuilt it for every other one.

Then Anthropic released the Model Context Protocol — MCP — in late 2024. It standardized how agents discover, authenticate with, and invoke tools. JSON-RPC over stdio or Streamable HTTP. Language-agnostic. Open standard.
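To make the wire format concrete, here is roughly what tool discovery and invocation look like as JSON-RPC 2.0 messages. `tools/list` and `tools/call` are method names from the MCP specification; the `post_message` arguments (`channel`, `text`) are assumed for illustration, and the initialization handshake that precedes these calls is omitted.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched

def jsonrpc_request(method: str, params: dict) -> str:
    """Serialize a JSON-RPC 2.0 request with a fresh id."""
    return json.dumps({"jsonrpc": "2.0", "id": next(_ids),
                       "method": method, "params": params})

list_req = jsonrpc_request("tools/list", {})
call_req = jsonrpc_request("tools/call", {
    "name": "post_message",
    "arguments": {"channel": "#triage", "text": "3 T1 items"},
})
print(list_req)
print(call_req)
```

Either message can be written to a server's stdin or sent over HTTP; the payload is the same regardless of transport.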

The adoption curve has been vertical:

97 million monthly SDK downloads by March 2026 — up from ~2 million at launch. 4,750% growth in 16 months.
13,000+ public MCP servers on GitHub as of April 2026, spanning databases, dev tools, communication platforms, and cloud infrastructure.
Governed by the Linux Foundation's Agentic AI Foundation (AAIF) with backing from Anthropic, OpenAI, Google, Microsoft, and AWS.
Natively supported in Claude, Cursor, Windsurf, VS Code, Kiro, and 200+ other tools.

MCP is to agent tools what HTTP was to web services — the universal transport. Before HTTP, every networked application had its own protocol. After HTTP, you built one server and any client could talk to it. Before MCP, every agent had its own tool format. After MCP, you build one server and any MCP-compatible agent can discover and call it.

The architecture has three roles:

Host: the application holding the LLM (Claude Desktop, Amazon Quick Desktop, Cursor)
Client: maintains a stateful connection to a specific MCP server
Server: an independent process exposing tools, resources, and prompts for the agent to use

A single host connects to multiple servers simultaneously. Each server exposes its own tools. The agent sees all of them in one unified namespace.
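A sketch of how a host might merge tools from several servers into one namespace. Prefixing each tool with its server name is an assumption made here to avoid collisions; hosts are free to do this differently. `build_namespace` is a hypothetical helper, and the tool names are the ones used as examples in this series.

```python
def build_namespace(servers: dict) -> dict:
    """Map each visible tool name to (server, tool) so calls can be routed back."""
    namespace = {}
    for server, tools in servers.items():
        for tool in tools:
            namespace[f"{server}__{tool}"] = (server, tool)
    return namespace

ns = build_namespace({
    "slack": ["post_message", "search_messages"],
    "outlook": ["email_reply", "calendar_view"],
})
print(sorted(ns))
print(ns["slack__post_message"])
```

The model sees four names in one flat list; the host keeps the mapping it needs to send each call to the right server.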

This is why MCP matters: it turns the long tail of integrations from a build problem into a plug-in problem. A team at Particula Tech shipped eleven enterprise integrations in nine days because every one spoke MCP. Two years earlier, that would have been three months of custom plumbing per integration.

Tool Composition: Where the Real Work Happens

A single tool call is not interesting. file_read returns text. email_inbox returns a list. web_search returns results. None of that constitutes work.

Work happens when the agent chains tool calls into a workflow:

Example — Morning triage:

email_inbox → pull last 50 messages from priority inbox folder
file_read → load triage-rules.csv for sender-tier classification
file_read → load accounts.csv for account-level context
calendar_view → pull today's meetings for cross-reference
Agent reasons: classify each email as T1/T2/T3 based on sender, keywords, account context, calendar overlap
file_write → produce a structured triage report with decision-lines
conversations_add_message → post the T1 items to Slack

Seven steps, six of them tool calls. One coherent workflow. The agent selected each tool, parameterized it, interpreted the result, and decided the next step. No human orchestrated the sequence.
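The morning triage chain can be traced with stub tools. Every return value below is canned, the `triage-report.md` file name and `#triage` channel are invented, and the sequence is hardcoded so the trace is easy to read; in a real agent the model picks each next call at runtime.

```python
calls = []  # ordered record of tool invocations

def tool(name, result):
    """Build a stub tool that logs its invocation and returns canned data."""
    def run(*args):
        calls.append(name)
        return result
    return run

email_inbox = tool("email_inbox", [{"from": "cfo@bigco.com", "subject": "Renewal"},
                                   {"from": "news@vendor.io", "subject": "Webinar"}])
file_read = tool("file_read", {"bigco.com": "T1"})  # same canned table for both files
calendar_view = tool("calendar_view", ["10:00 BigCo renewal call"])
file_write = tool("file_write", "ok")
conversations_add_message = tool("conversations_add_message", "ok")

def triage():
    emails = email_inbox(50)
    rules = file_read("triage-rules.csv")
    file_read("accounts.csv")     # account context, unused by this toy classifier
    calendar_view("today")        # calendar context, also unused here
    # Reasoning step (no tool call): a lookup stands in for the model's judgment.
    tiers = {e["subject"]: rules.get(e["from"].split("@")[1], "T3") for e in emails}
    file_write("triage-report.md", tiers)
    t1 = [subject for subject, tier in tiers.items() if tier == "T1"]
    conversations_add_message("#triage", t1)
    return t1

t1_items = triage()
print(calls, t1_items)
```

`len(calls)` is 6: the classification in the middle is reasoning, not a tool call.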

That chain — tool selection, invocation, interpretation, next-step reasoning — is not a tool. It's the agent loop using tools. The distinction matters because people confuse tool access with capability. Having 250 tools does not make your agent capable. Having methodology for when and how to combine those tools does.

That methodology is a skill. And that's the boundary between the two concepts.

The 250-Tools-One-Conversation Reality

The scale of tool access in production agents has blown past what anyone expected two years ago.

Amazon Quick Desktop runs 250+ tools in a single conversation. Those 250 are individual tool functions spread across all connected MCP servers: Slack MCP server tools (read, write, search, react), Outlook MCP server tools (email and calendar), SharePoint tools, Salesforce MCP server tools, file system tools, Python and JavaScript execution, web search, browser automation, image generation, audio transcription, and more. On top of those sit native harness capabilities like AgentCore's code interpreter sandboxes and cloud browser sessions. (AgentCore is an AWS service with its own MCP server that exposes runtime primitives to agents; it gets its own dedicated post.) All are available simultaneously. The agent selects from the full set based on the user's intent.

Claude Code integrates with 150+ tools via MCP — file system, git, shell commands, web search, browser automation, and any MCP server the developer adds. Sub-agents can spawn with their own tool access to work in parallel.

Kiro connects to AWS services natively through Amazon Bedrock, with MCP servers for additional tool access — SAM templates, CloudWatch, DynamoDB, Lambda.

The number of tools is not the flex. The number is the precondition. Once you have 250 tools available, the bottleneck shifts. The question stops being "can the agent send a Slack message?" and becomes "does the agent know which channel to post in, what tone to use, when to thread vs. top-level, and what prior context to reference?"

That's not a tool problem. That's a skill problem. Tools provide the hands. Skills provide the judgment.

Tools vs. Skills: The Critical Distinction

This is where the industry gets confused. Arcade.dev put it precisely: "Tools and skills get used interchangeably in marketing decks and conference talks, but they represent fundamentally different approaches to extending agent capabilities."

Here's the difference:

Dimension	Tool	Skill
Scope	Single atomic capability	Multi-step methodology
State	Stateless — call and return	Stateful — encodes workflow, rules, quality checks
Analogy	A hand	A brain directing the hand
Example	`email_send(to, subject, body)`	"Draft a reply to the highest-priority email, using the right voice mode, following the triage classification rules, and checking against the quality bar before sending"
Persistence	None — defined once in a schema	Versioned, shareable, evolves from experience
Knowledge	What to do (send email)	How and when to do it (which email, what voice, what rules)

A tool is an API call. A skill is institutional knowledge about when, why, and how to make that API call — and the six calls that should follow it.

The GTM AI Podcast framed it well: "Tools let agents act. Skills provide the knowledge of how and when to act — including the company-specific, team-specific, and user-specific context that separates a capable AI from a competent one."

If you're building an agent and you think the answer is "add more tools," you're solving the wrong problem. The agent already has the tools. What it's missing is the methodology to use them well. That's the skill layer, and it's the topic of Part 4 of this series.

So What

Tools are the atomic capabilities that let agents interact with the world. They are necessary but not sufficient. MCP standardized how tools are discovered and invoked — that's a protocol win comparable to HTTP. The ecosystem response has been 97 million SDK downloads and 13,000+ servers in 16 months.

But tools alone don't produce work. A tool is a single function call. Work is a chain of function calls governed by methodology — which tools to call, in what order, with what parameters, interpreted against what context, and checked against what quality bar.

The tool layer is solved. MCP won. The open question is the layer above it: who encodes the methodology that makes those tools useful? That's the skill layer, and it's the new frontier.

Part 2 of the Agent Primitives series.
← Part 1: What Is an Agent · Part 3: What Is an Agent Harness →

What Is a Skill — Why Methodology Resets Every Session Without One

Amit — Sat, 06 Jun 2026 07:18:59 +0000

TL;DR

Every new agent session resets methodology — the model knows how to reason, the tools know the API, neither knows your 11-step triage heuristic or your sender-tier rules. That gap is where skills live.
A skill is a structured SKILL.md file encoding reusable methodology: when to trigger, what inputs to gather, what tools in what order, quality gates, and explicit anti-patterns. Portable across 30+ platforms.
Skills are not prompts (ephemeral) or tools (stateless atoms) — they are durable, versioned, testable institutional knowledge that survives model upgrades and context resets.
One caught failure encoded into a skill means every future session — yours and every teammate who installs it — inherits the fix automatically.
Build the skill layer: tools give agents hands, skills give agents methodology. Without skills, 250 connected tools still produce mediocre output.

Every agent session starts from zero. The model is brilliant. The tools are connected. And still — you spend the first ten minutes re-explaining how you work.

That's the methodology reset problem. And skills are the fix.

The Reset Nobody Talks About

Open a new session with any AI agent. Ask it to triage your Slack. It will read your channels, classify messages, and produce a summary. The summary will be wrong — not factually, but methodologically. It doesn't know that your manager's DMs are always Tier 1. It doesn't know that "Bedrock" appears in 44% of your emails and shouldn't auto-escalate to urgent. It doesn't know your account team handles Cursor through three specific Slack channels, not email.

You explain this. The agent adjusts. The output improves. Forty minutes later, you have a workflow that works.

Tomorrow, you open a new session. The agent has no memory of any of this. You start over.

This happens because agents have two layers and are missing a third. They have reasoning (the model) and capabilities (the tools). What they don't have is methodology — the encoded knowledge of how YOU approach work. The model knows how to reason about Slack messages. The Slack MCP server exposes tools that know how to call the Slack API. Neither knows that your triage rules classify senders into three tiers based on a contacts registry, apply an 11-step heuristic in strict priority order, and output decision-lines instead of summaries.

The gap between "agent can do a thing" and "agent does the thing the way I need it done" — that's where skills live.

What a Skill Actually Is

A skill is a structured file — typically SKILL.md — that encodes reusable methodology for an AI agent. It tells the agent: when to activate, what inputs to gather, what tools to use in what order, what quality checks to run, and what mistakes to avoid.

The format has converged across 30+ agent platforms. SKILL.md works in Claude Code, GitHub Copilot, Cursor, Gemini CLI, OpenAI Codex, Windsurf, Roo Code, and others — a single skill file, portable across environments. YAML frontmatter declares metadata. Markdown body contains the instructions. Optional scripts/, references/, and assets/ directories carry supporting materials.

The architecture uses progressive disclosure: metadata (~100 tokens) loads always, full instructions (<5,000 tokens) load only when triggered, and resources load only during execution. This keeps token costs low while making the full methodology available on demand.
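Progressive disclosure is easier to see in code. This is a rough sketch, not how any specific platform loads skills: the file contents are invented, `split_skill` and `SkillIndex` are hypothetical names, and the frontmatter parser only handles flat `key: value` lines, not full YAML.

```python
# Invented SKILL.md contents, for illustration only.
SKILL_MD = """---
name: slack-triage
description: Classify unread Slack messages into tiers and emit decision-lines.
---
# Slack Triage

1. Check sender tier first.
2. Then noise patterns.
"""

def split_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md string into (frontmatter metadata, markdown body)."""
    _, front, body = text.split("---\n", 2)
    meta: dict[str, str] = {}
    for line in front.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()

class SkillIndex:
    """Keeps metadata for every skill loaded; reads a body only on demand."""

    def __init__(self, files: dict[str, str]):
        self._files = files
        # The small part: always in context.
        self.metadata = {name: split_skill(text)[0] for name, text in files.items()}

    def instructions(self, name: str) -> str:
        # The large part: fetched only when the skill is triggered.
        return split_skill(self._files[name])[1]
```

The agent pays for `metadata` on every turn and for `instructions(...)` only when a trigger matches.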

A concrete example. My "Slack Triage" skill encodes:

Trigger: "triage slack", "check my slack", "what did I miss"
Inputs: time window, channel scope (tier-1 only, deal-rooms, all)
Data sources: a contacts registry (54 contacts with tiers), a classification rules file (88 rules), a monitored channels list
Methodology: 11-step classification heuristic — check sender tier first, then noise patterns, then auto-sender detection, then keywords, then thread context, then account-name matching
Quality gates: "Bedrock" over-trigger guard (prevents 44% of emails from flooding Tier 1), thread collapsing (same conversation → one item, highest tier wins)
Anti-patterns: Don't classify based on subject line alone. Don't treat all @company.com senders as Tier 1. Don't skip the calendar cross-reference.
Output format: Decision-lines, not summaries. Each line is an action ("Reply yes/no", "Read before 11am"), not a description of what happened.

None of that information lives in the model. None of it lives in the Slack API. It lives in the skill. Remove the skill, and the agent produces a generic Slack summary that ignores your sender tiers, your channel priorities, and your action-oriented output format.
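To make the methodology concrete, here is a toy version of three pieces of it: strict priority order, the over-trigger guard, and thread collapsing. The registry, keyword sets, and function names are assumptions for illustration. The real skill has 11 steps and loads its data from files; this sketch has two steps and hard-codes everything.

```python
from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    text: str
    thread_id: str

# Hypothetical data; the real skill reads a contacts registry and a rules file.
SENDER_TIERS = {"manager": 1, "teammate": 2}
KEYWORDS = {"outage", "escalation", "bedrock"}
GUARDED = {"bedrock"}  # too common to escalate on by itself

def classify(msg: Message) -> int:
    """Return a tier from 1 to 3. Rules run in priority order; first match wins."""
    if msg.sender in SENDER_TIERS:  # step 1: sender tier
        return SENDER_TIERS[msg.sender]
    hits = set(msg.text.lower().split()) & KEYWORDS
    if hits - GUARDED:  # later step: keywords, minus guarded terms
        return 1
    return 3

def collapse_threads(msgs: list[Message]) -> dict[str, int]:
    """One item per thread; the highest tier (lowest number) wins."""
    tiers: dict[str, int] = {}
    for m in msgs:
        t = classify(m)
        tiers[m.thread_id] = min(t, tiers.get(m.thread_id, t))
    return tiers
```

A message that only mentions "Bedrock" falls through to Tier 3. The same thread with a later "outage" message collapses to a single Tier 1 item.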

Skills vs. Tools vs. Functions vs. Prompts

The industry uses these terms interchangeably. They are four different things.

Primitive	What it is	Analogy	Persistence
Function	A single API endpoint the model can call. JSON schema: name, parameters, return type.	A single verb — "read", "send", "search"	None — stateless
Tool	An atomic capability exposed to the agent — read a file, post to Slack, query a database. A superset of functions (some tools compose multiple functions).	The hands	None — stateless
Prompt	A one-shot instruction to the model. No structure, no persistence, no versioning.	A sticky note	Session only — gone tomorrow
Skill	Encoded methodology combining multiple tools with domain knowledge, quality checks, and anti-patterns.	The brain telling the hands what to do, in what order, and why	Durable — versioned, portable, shareable

Arcade.dev states it directly: "'Tools' and 'skills' get used interchangeably in marketing decks and conference talks, but they represent fundamentally different approaches to extending agent capabilities. Understanding this distinction is the difference between building agents that work in demos versus agents that work in production."

Another way to frame it, from GTM AI Podcast: "Tools let agents act. Skills provide the knowledge of how and when to act — including the company-specific, team-specific, and user-specific context that separates a capable AI from a competent one."

The implication: you can give an agent 250 tools and it will still produce mediocre output if it lacks the methodology to use them correctly. Tools are necessary but not sufficient. Skills are what close the gap between capability and competence.

The Separation Principle

This is not a cosmetic distinction. It is an architectural one: "The model provides reasoning; the skill provides context; the composition produces behaviour that neither could generate alone."

Skills separate what the model can do from what the model should do in this specific context. The model can draft an email. The skill knows that emails to VP+ recipients use the leadership voice, never hedge the close, and always lead with the direct ask. The model can search Slack. The skill knows that `from:` queries require aliases, not display names, and that DM channel IDs starting with `D` don't work with the `in:` filter.

This separation matters because it means:

Skills survive model upgrades. Swap Sonnet for Opus. The skill still works. The methodology is independent of the reasoning engine.
Skills survive context window resets. New session, same skill file. No re-explanation needed.
Skills are diffable and versionable. They're Markdown files. Git tracks every change. You can review what changed, when, and why.
Skills are testable. You can define eval cases — specific inputs that should produce specific outputs — and verify the skill produces correct behavior after changes.
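An eval case can be as small as an input paired with the output you expect. The sketch below assumes the behavior under test can be wrapped in a plain function; `tier_for`, `EVAL_CASES`, and `run_evals` are hypothetical names, and a real eval would drive the agent with the skill installed rather than a dictionary lookup.

```python
from typing import Callable

# Hypothetical eval cases: (sender, expected tier).
EVAL_CASES = [("manager", 1), ("newsletter-bot", 3), ("teammate", 2)]

def tier_for(sender: str) -> int:
    """Stand-in for the skill-driven behavior under test."""
    return {"manager": 1, "teammate": 2}.get(sender, 3)

def run_evals(fn: Callable[[str], int], cases: list[tuple[str, int]]) -> list[str]:
    """Return one line per failing case; an empty list means the skill passed."""
    failures = []
    for given, want in cases:
        got = fn(given)
        if got != want:
            failures.append(f"{given!r}: expected {want}, got {got}")
    return failures
```

Run it after every change to the skill file. A non-empty list is a regression you caught before a teammate did.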

The 8-Phase Skill Lifecycle

Skills are not static files. They evolve through a lifecycle — and the lifecycle is what separates a personal hack from institutional infrastructure.

1. Catch — An agent makes a mistake. You correct it. This is the raw material. Example: the agent replied to the wrong message in an email thread because it used the first itemId instead of the target sender's itemId.

2. Author — You encode the correction as a skill. The email-thread-reply skill now resolves the correct itemId for the target sender's message before calling the reply tool. The failure mode is baked into the methodology so it never recurs.

3. Discover — Others find the skill. A shared store, shared knowledge spaces, a git repo — discovery is the prerequisite for distribution. An MIT/UCSB study validated that flat skill libraries fail without structured discovery and adaptation mechanisms.

4. Chain — Skills compose. The "morning briefing" isn't one skill — it's slack-triage → email-triage → calendar-triage → draft-responder, sequenced by a scheduler. Each skill is independent; the chain produces emergent capability.

5. Scrub — Before sharing, strip PII. Personal file paths, Slack channel IDs, customer names, CRM IDs — all of it gets replaced with parameters. The skill becomes portable.

6. Distribute — Push to a shared store. Today this happens through git repos, shared folders, or shared knowledge spaces acting as skill stores. Tomorrow it should be a native platform capability.

7. Adapt — A teammate installs the skill and adjusts it for their context. Different Slack channels. Different sender tiers. Different voice settings. The methodology stays; the parameters change.

8. Evolve — The skill improves from experience. A new failure mode is caught in phase 1 and baked back into the skill in phase 2. The cycle repeats. Every iteration makes the skill more durable.

This lifecycle is not theoretical. I run 40+ skills through it. When I catch a triage failure at 8am, the fix is in the skill by 8:15am. Every future session — mine and anyone who installs the skill — inherits the fix automatically.
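The catch in phase 1 and the fix in phase 2 fit in a few lines. This is a sketch under assumptions: `ThreadMessage` and `resolve_reply_target` are hypothetical names, and picking the target sender's most recent message is my reading of "the correct itemId", not a documented rule of any mail API.

```python
from dataclasses import dataclass

@dataclass
class ThreadMessage:
    item_id: str
    sender: str

def resolve_reply_target(thread: list[ThreadMessage], target_sender: str) -> str:
    """Return the itemId of the target sender's most recent message.

    The failure mode was the equivalent of `thread[0].item_id`, which replies
    to whoever started the thread instead of the person you meant.
    """
    for msg in reversed(thread):
        if msg.sender == target_sender:
            return msg.item_id
    raise LookupError(f"no message from {target_sender!r} in thread")
```

Raising on a missing sender is deliberate: a skill that guesses a fallback itemId would reintroduce the bug it was written to prevent.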

Skills Are Institutional Knowledge

Here's the argument that changes how you think about skills.

When one person catches a failure mode and encodes it in a skill, everyone who installs that skill inherits the fix. The knowledge compounds across people without meetings, training sessions, or documentation review cycles.

Traditional institutional knowledge flows look like this: someone discovers a better way to do something → writes a wiki page → nobody reads it → the knowledge dies with the person when they change teams.

Skill-based institutional knowledge flows look like this: someone discovers a better way to do something → encodes it in a skill → pushes to a shared store → anyone who installs it gets the improvement automatically → when they encounter a new failure, they push a fix back → the skill compounds.

Christopher Spencer Penn captures it: "In modern agentic AI systems, agents can use skills, and skills can invoke agents. For example, I might have a skill called 'find the bloody bug' that kicks off three different kinds of debugging agents."

Skills are executable. Wiki pages are not. That's the difference between knowledge that sits and knowledge that works.

The Skill Distribution Frontier

Building a great skill means nothing if others can't find, install, and adapt it.

This is the frontier. Today, skill distribution is duct tape — shared folders, git repos, manual copy-paste. The 8-phase lifecycle works for a solo builder maintaining 40 skills. It breaks at team scale without native platform support.

What's needed:

Discovery: Semantic search over a skill store — "I need a skill for triaging customer emails" should return relevant skills ranked by quality and adoption.
Install: One-click install that parameterizes the skill for the user's context — their channels, their registries, their voice settings.
Adaptation: Fork a skill, adjust it, contribute improvements back. Git for skills.
Quality signals: Usage metrics, failure rates, user ratings. Not every skill is worth installing.
PII scrubbing as a first-class gate: Before any skill leaves a personal workspace, it passes through automated PII detection. File paths, channel IDs, customer names, CRM IDs — all parameterized.
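A scrubbing gate could start as a list of patterns mapped to parameter names. Treat this as a sketch: the three regexes are rough assumptions about what channel IDs, home directories, and email addresses look like, the `{{...}}` placeholder syntax is invented, and a real gate would need a vetted detector for customer names and CRM IDs.

```python
import re

# Assumed patterns, for illustration only.
PATTERNS = [
    (re.compile(r"\b[CDG][A-Z0-9]{8,}\b"), "{{CHANNEL_ID}}"),
    (re.compile(r"/(?:Users|home)/[^/\s]+"), "{{HOME_DIR}}"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "{{EMAIL}}"),
]

def scrub(text: str) -> tuple[str, int]:
    """Replace matches with named parameters; return (scrubbed text, hit count)."""
    total = 0
    for pattern, placeholder in PATTERNS:
        text, hits = pattern.subn(placeholder, text)
        total += hits
    return text, total
```

Returning the hit count lets the gate block a share when the count is non-zero and the author has not reviewed the replacements.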

CalmOps describes the maturity curve: "Unlike generic tools that provide single functions, skills encapsulate the complete knowledge and logic required to handle a specialized domain."

The platform that solves skill distribution — discovery, installation, adaptation, and quality feedback — will own the institutional knowledge layer for agent-native work. Every competing platform (Anthropic, Salesforce, ServiceNow, Glean) is building toward this. The winner will be the one that treats the full 8-phase lifecycle as a first-class system, not a marketplace bolted on top.

So What

Skills are the missing primitive between tools and agents. Tools give agents hands. Skills give agents methodology. Without skills, every session starts from zero — the agent re-discovers your workflow, your quality bar, your anti-patterns through trial and error. With skills, the first session establishes the methodology and every subsequent session inherits it.

Memories fade. Context windows reset. Prompts are ephemeral.

Skills persist.

The rest of this series covers the other three primitives: what an agent is (the loop), what a tool is (the hands), and what a harness is (the infrastructure). Skills are the brain that coordinates all three — the layer where institutional knowledge becomes executable, shareable, and compounding.

Part 4 of the Agent Primitives series.
← Part 3: What Is an Agent Harness

We're All Builders Now

Amit — Sat, 06 Jun 2026 07:18:23 +0000

TL;DR

AI didn't make building easier — a 2025 METR study found experienced developers using AI tools took 19% longer on complex tasks. The tool is an access enabler, not a simplifier.
The gate that moved was technical credential, not judgment. Domain experts who always had the clearest view of the problem now have direct access to execution primitives.
The internet democratized information access; AI democratizes creation access. Both left the hard part hard.
What didn't move: knowing what to build, conviction, distribution, customer obsession, and the stubbornness to finish well.
If you can see the problem clearly and you're waiting for someone with an engineering credential to build it — that argument is gone.

Building was never easy.

Writing software took years to learn. Most people spent careers getting good at it — understanding systems, debugging at 2am, learning the hard way why certain architectural decisions rot over time. That's real craft. Nobody is taking that away.

But access to creation? That was always a different problem. And that's what changed.

The Internet Did This First

When the internet arrived, it didn't make information easier to create or understand. Journalism still required skill. Research still required rigor. Writing a good book was still hard.

What it did was remove the gatekeeping between knowledge and the people who needed it. You no longer had to live near a great library, or know someone who subscribed to the right journals, or have an institution behind you to access what had been written. Geography, wealth, and institutional access stopped being filters. The knowledge was always there. The internet removed the walls around it.

AI is doing the same thing — one layer up. Not to information. To creation.

What Actually Changed

For most of computing history, building digital things required a specific kind of access: the ability to write code. Or know someone who could. Or wait for a roadmap item to survive a planning cycle and land in an engineer's queue.

That wasn't a commentary on who had the best ideas. Domain experts — the consultant who had seen the same broken workflow at forty companies, the operator who knew exactly where the process fell apart every quarter, the analyst who could have told you what the dashboard should show three years ago if anyone had asked — often had the clearest view of the problem. They couldn't build the solution themselves. The execution layer required a credential they didn't have.

AI removed that specific gate. Not by making building easy. By making access to building primitives broadly available — to anyone with a clear enough problem and genuine enough curiosity to try.

Gartner projects the low-code market will hit $44.5 billion by 2026, with 75% of new applications incorporating no-code or low-code solutions. Operations managers are shipping dashboards. Product leads are building internal tools. Finance analysts are deploying apps. The through-line in every case: the tool caught up with the intent. The thing that used to require engineering cycles now requires an afternoon and a clear enough problem statement.

That's democratization. Not simplification.

The Hard Part Is Still Hard

Here's what AI didn't touch: the judgment gap.

Knowing what to build — that's still human. Most people solve the wrong problem. They build clever solutions to the problem they wished existed, not the one that's actually costing someone time and money every week. Domain knowledge plus genuine customer obsession is still how you find the right problem. No model gives you that.

Conviction — the willingness to commit to a specific bet when everything is uncertain — is still human. AI can generate options indefinitely. It cannot choose. The person who can look at an ambiguous situation and say "this specific thing, now, for these people" — that's still the rare thing.

Distribution. Drive. Scaling judgment. The ability to know which customers to listen to and which to ignore. The stubbornness to keep going when it's not working. The clarity to kill something that's not working, and to do it fast enough. None of that moved.

The METR study (2025) makes this concrete: in a randomized controlled trial, experienced open-source developers working on complex, real-world tasks from their own repositories took 19% longer when they were allowed to use AI tools than when they were not. The tool is not a simplifier. It is an access enabler. What you do with that access still depends entirely on what you bring to it.

The Gate Moved, Not the Climb

The analogy that keeps coming up: "The Internet Democratized Information, AI Democratizes Intelligence." That's close. But for builders specifically, the sharper version is: the internet democratized access to information, AI democratizes access to creation.

Both leave the hard part hard. Reading everything ever written about surgery doesn't make you a surgeon. Having access to every building primitive in existence doesn't make you a good builder. What it does is remove the argument that someone else has to build the thing you can see clearly and they can't.

The gate was always at the wrong place. It filtered by technical credential when it should have filtered by judgment, curiosity, and understanding of the problem. That filter is now gone.

Which means the people who always had the deepest understanding of the problem — the domain experts, the operators, the people closest to real pain — now have the tools to do something about it.

That's the shift. Not that building got easier. That access to the act of building got broader.

So What

Builder mode isn't a job title. It isn't a technical credential. It isn't something you earn after enough years in an IDE.

It's an orientation — the decision to see a problem clearly, take it seriously, and do something about it rather than wait for someone else to. That operating mode is available to anyone with the curiosity to look hard at a problem and the drive to show up with something real.

The access barrier is gone. The hard part — conviction, judgment, customer obsession, drive — was always human. Still is.

What I haven't worked out: whether broader access actually produces better outcomes, or whether the judgment gap is wide enough that it just produces more noise. The METR data suggests experienced builders get slower when AI is in the loop on complex work — which implies the thing that separates good builders from the rest isn't access to tools. If that's true, opening the gate changes who can start. It doesn't change who finishes well.

Part 2 of the Builders series. ← Part 1: Builder Is an Operating Mode, Not a Job Title

Voice Is a Layer, Not a Setting

Amit — Sat, 06 Jun 2026 07:17:47 +0000

TL;DR

Five writing skills with embedded voice instructions = five drifting definitions; the same person sounds like different writers within months.
The fix is four independent layers: mode detection → voice + quality → format → publish. Voice lives in one place, called by everything else.
A single correction to the centralized voice layer propagates instantly across blog posts, Slack threads, emails, and strategy docs — no hunting across five skills.
Mode detection runs before a single word is written, resolving context from a five-signal hierarchy (explicit override, recipient, role, channel, intent keywords) with no manual selection required.
Every major tool treats voice as a setting. Separate the layers, centralize the voice, and the drift problem disappears.

The person is the constant. The mode is the variable. The medium is irrelevant.

If you have five writing skills and each one defines voice separately, you have five competing voice definitions. Over time they drift. A blog post and a Slack thread about the same topic come out sounding like different people wrote them. Not because the agent changed — because the voice instructions were never in the same place.

The fix is architecture, not better prompts. Voice is a layer. It belongs in one place, called by everything else.

The Problem With Embedded Voice

Every AI writing tool faces the same temptation: put the voice instructions where the writing happens. The blog skill says "write in a practitioner voice, evidence-based, no hedging." The email skill says "write professionally, direct, data-specific." The Slack skill says "keep it short, action-oriented."

Three skills, three definitions of "professional." None of them wrong. All of them slightly different. The drift is imperceptible at first — a slightly different sentence rhythm here, a slightly different threshold for hedging there. After a few months of iterating each skill independently, the same person sounds like three different writers depending on which skill ran.

This is not a voice problem. It is an architecture problem.

Four Layers

The fix is separating concerns that were bundled together:

Layer 1: MODE DETECTION
  What voice variant to use — casual, professional, leadership,
  field, publishing, or builder. Resolved from context before
  writing begins. Never manual.

Layer 2: VOICE + QUALITY
  The universal standards that apply regardless of mode.
  Cliché guard. Citation rules. Quality checklist. Anti-patterns.
  One definition. Called by everything.

Layer 3: FORMAT
  Structure, length, frontmatter, conventions.
  Blog format. Slack format. Email format. Strategy doc format.
  Each content type has its own format layer.
  Format knows nothing about voice.

Layer 4: PUBLISH
  Upload, verify, RAG optimization.
  Always a separate explicit step.
  Never bundled into format.

The calling skill provides the format layer. It calls the voice layer. The voice layer calls mode detection. The result: the voice is consistent across every content type because it lives in one place, not five.

Layer 1: Mode Detection

Mode detection runs before a single word is written. A five-signal priority hierarchy resolves the correct voice variant from context:

Explicit override — "keep it casual" or "exec tone" wins immediately
Recipient override — per-person config for people who always get a specific mode
Role mapping — looks up the recipient in a contacts registry, maps relationship (peer, manager, customer, close colleague) to mode
Channel detection — Slack public channel → professional; email to external domain → field; blog post → publishing
Intent keywords — "ping him," "heads up" → casual; "endorsement request" → leadership; "write a post" → publishing

Default: professional.

The agent never asks which mode to use. The signal is already there — recipient, channel, intent. The hierarchy reads it.

What makes this maintainable: the detection logic lives in a YAML config file, not code. Adding a new recipient override is a one-line edit. Adjusting a keyword mapping takes ten seconds. No code change needed when the context changes.
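Here is one way the hierarchy could read, with a dict standing in for the YAML file. The signal order and the example phrases come from the list above. The recipient names, the role-to-mode mapping, the channel keys, and the choice to map "exec tone" to leadership are assumptions made for the sketch.

```python
from typing import Optional

# Hypothetical config; in practice this would be loaded from the YAML file.
CONFIG = {
    "explicit": {"keep it casual": "casual", "exec tone": "leadership"},
    "recipient_overrides": {"dana": "casual"},
    "roles": {"sam": "manager"},
    "role_to_mode": {"manager": "leadership", "customer": "field"},
    "channels": {"slack_public": "professional", "external_email": "field",
                 "blog": "publishing"},
    "intent_keywords": {"ping him": "casual", "heads up": "casual",
                        "write a post": "publishing"},
}

def detect_mode(request: str, recipient: Optional[str] = None,
                channel: Optional[str] = None) -> str:
    """Walk the five signals in priority order; the first hit wins."""
    text = request.lower()
    for phrase, mode in CONFIG["explicit"].items():          # 1. explicit override
        if phrase in text:
            return mode
    if recipient in CONFIG["recipient_overrides"]:           # 2. recipient override
        return CONFIG["recipient_overrides"][recipient]
    role = CONFIG["roles"].get(recipient)                    # 3. role mapping
    if role in CONFIG["role_to_mode"]:
        return CONFIG["role_to_mode"][role]
    if channel in CONFIG["channels"]:                        # 4. channel detection
        return CONFIG["channels"][channel]
    for phrase, mode in CONFIG["intent_keywords"].items():   # 5. intent keywords
        if phrase in text:
            return mode
    return "professional"                                    # default
```

Adding a recipient override is one new key in `recipient_overrides`. The function body never changes, which is the point of keeping the logic in config.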

Layer 2: Voice + Quality

This is the layer most tools skip. Every writing skill embeds its own voice definition. The four-layer architecture pulls that definition out and centralizes it.

The voice layer owns:

The cliché guard — a universal banned-phrase list that runs on every piece of output regardless of mode or format. "Robust," "seamless," "comprehensive," "game-changing" — banned everywhere, always, because they are placeholders for the specific thing the writer actually means. The guard does not restrict expression. It forces specificity.

The never_say lists — mode-specific bans that load with the engram. Casual mode bans "I hope this note finds you well." Leadership mode bans "either way, no worries if not." Publishing mode bans credential framing. The bans are decisions, not style preferences — they encode what the writer has explicitly rejected in real output.

The quality checklist — conditions that must be met before output returns: opens with outcome not setup; every falsifiable claim has a source or "in my experience" label; no credential framing; has a "so what"; ends on action not opt-out.

Citation rules — inline links for every factual claim, "in my experience" for unlinkable observations. Not footnotes. Not optional.

Because this layer is centralized, a correction made in one place propagates everywhere. When "robust" gets added to the cliché guard, it is banned in blog posts, Slack threads, emails, and strategy docs simultaneously. No hunting across five skills to update five separate voice definitions.
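A sketch of the guard as a single function every format skill would call. The banned phrases are the ones quoted above; the `violations` name, the data layout, and the whole-word matching rule are assumptions.

```python
import re

# Universal bans, applied in every mode.
CLICHE_GUARD = ["robust", "seamless", "comprehensive", "game-changing"]

# Mode-specific bans; only the two quoted phrases are included here.
NEVER_SAY = {
    "casual": ["I hope this note finds you well"],
    "leadership": ["either way, no worries if not"],
}

def violations(text: str, mode: str) -> list[str]:
    """Return every banned phrase found in text, universal bans first."""
    banned = CLICHE_GUARD + NEVER_SAY.get(mode, [])
    found = []
    for phrase in banned:
        # Whole-word, case-insensitive match (an assumption of this sketch).
        if re.search(r"\b" + re.escape(phrase) + r"\b", text, flags=re.IGNORECASE):
            found.append(phrase)
    return found
```

Because blog, Slack, and email skills would all call the same `violations`, appending one phrase to `CLICHE_GUARD` bans it everywhere at once.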

Layer 3: Format

Format is what changes by content type. A blog post needs frontmatter, a filename convention, a category, a length target. A Slack thread needs a hook, a body, a close. An email needs subject, greeting, body, action. A strategy doc needs thesis, evidence, what's missing, so what.

Format skills are pluggable. Any format skill can call the voice layer. Blog format + publishing voice. Slack format + casual voice. Strategy doc format + leadership voice. The combination is arbitrary because the layers are independent.

This is the same principle behind separation of concerns in software architecture. The format skill does not know about voice. The voice layer does not know about format. Both apply — simultaneously, independently.

Layer 4: Publish

Publishing is always a separate explicit step. Never bundled into format.

The format skill produces a draft. When the draft is ready, a publish step handles the mechanics: RAG optimization for AI-readable structure, filename validation, upload, verification. One publish skill works for any content to any destination — because publish is format-agnostic.

Why separate? Because "format" and "ready to publish" are different states. A draft can be formatted correctly and still need review. Separating the layers makes that review natural — the format skill delivers a draft, the author reviews, the publish step runs when ready.

What the Market Offers

Every major tool treats voice as a setting, not a layer:

Approach	What it does	What's missing
Custom GPT / Claude Styles	Single voice profile from samples	No mode switching. DM = exec email = blog post.
Per-skill voice encoding	Voice defined inside each writing skill	5 skills = 5 definitions = drift
Engram builder (native)	Extracts one profile from message corpus	Single mode. No auto-detection. No never_say.
Brand voice guides	Organizational standards	Not machine-readable. Not enforced at write time.

Nobody has separated mode detection, voice quality, format, and publish into independent layers with clean interfaces between them. The closest analog is what Google did for visual identity with DESIGN.md — a single machine-readable source of truth for brand standards, called by any agent building UI. The writing equivalent is a centralized voice layer, called by any skill producing written output.

The Consistency Principle in Practice

What changes by mode: casual is shorter and warmer. Professional is strategic and data-specific. Leadership is personal and confident. Field is customer-obsessed. Publishing is universal and evidence-based. Builder is precise and structured.

What never changes: evidence-backed claims. No clichés. No hedging. Specific over vague. Peer voice, not trainer voice.

A blog post and a Slack thread about the same topic should feel like the same person wrote them. One is longer and more structured. The other is shorter and more direct. But the thinking, the specificity, the conviction, and the anti-patterns are identical — because those properties live in the voice layer, not in the blog skill or the Slack skill.

The person is the constant. The mode is the variable. The medium is irrelevant.

This is the final post in a four-part series on building mode-specific voice profiles for AI agents. The series starts here.

Two AI interfaces. Same desktop. Completely different jobs.

Amit — Sat, 06 Jun 2026 07:17:10 +0000

TL;DR

Two interfaces, same model, same MCP servers, same tools. The only difference is who drives: interactive co-pilot vs. autonomous delegate.
Precision work (strategy docs, KB edits, blog drafts) belongs in co-pilot mode; you need judgment at every step. Routine work (Slack triage, insight summaries, handoff drafts) belongs in delegate mode; approving every step kills the time savings.
The dividing line: if you'd feel comfortable not watching it work, delegate. If you'd feel nervous, co-pilot.
Using one mode for everything produces mediocre results from both tools; the friction is a mismatch signal, not a tool failure.
Ask "what role do I want to play in this task?" before opening any AI interface.

I've been running Claude Code and CoWork side by side for weeks. They use the same model. The same MCP servers. The same tools. So why do they feel like completely different things?

I'm not a developer. My work is strategy and knowledge work — positioning, research, content, handoffs. I don't write code for a living. But I spend a lot of time in AI interfaces, and for the past several weeks I've had two of them open simultaneously: Claude Code in the left panel, CoWork on the right.

At first I kept asking myself: why do I need both? They run on the same model. They connect to the same tools — my email, calendar, Slack, Salesforce. They can both draft a document, search my files, or pull insights from a Slack channel. They're the same thing.

Except they're not. And once I understood why, how I work with AI changed completely.

The thing nobody says clearly

Most comparisons of AI tools focus on the model, the features, the connectors, the pricing. That's the wrong frame.

The actual difference between Claude Code and CoWork isn't the model. It isn't the surface — I use Claude Code in a desktop interface, not a terminal. It isn't even the capabilities.

The difference is who drives.

Claude Code is an interactive tool. Every step gets surfaced. I approve tool calls. I redirect when something's off. The AI proposes; I dispose. This makes it slow and deliberate by design.

CoWork is an autonomous agent. I hand it a task and walk away. It executes end-to-end, using connectors and skills to complete multi-step workflows without me narrating every move.

Same intelligence. Different workflow model. That's the whole thing.

What this actually means for a knowledge worker

If you're a developer, the framing makes obvious sense: Claude Code is for precision work where you want to review each diff. Fine.

But I'm not a developer. So why do I need both?

It took me a while to see it, but the answer is exactly the same — just applied to different kinds of work.

Some of my work requires precision and judgment at each step. When I'm editing KB content, writing positioning docs, or modifying a repo structure, I want to see each move. An agent that acts autonomously in my codebase without my oversight is a liability. Claude Code wins here: interactive, deliberate, controlled.

Some of my work is routine and high-frequency. Triaging Slack channels, pulling last week's customer insights, drafting a handoff doc, summarizing email threads. I don't need to narrate these. I don't want to narrate these. If I have to approve every step of "find the key themes from a customer insights channel this week," I've already lost the time savings. CoWork wins here: autonomous, fast, hands-off.

The question isn't which interface is better. The question is whether you want to co-pilot or delegate.

The prior art gap

I went looking for this framing written down somewhere. The closest I found is Ethan Mollick at One Useful Thing, who writes thoughtfully about AI as a collaborative tool for knowledge workers. The interactive/autonomous distinction shows up in agent architecture discussions. Microsoft Copilot and Google Workspace AI also orbit this space.

But the specific insight — two AI interfaces running on the same desktop, same model, same tools, distinguished only by workflow mode, for a non-developer knowledge worker — I couldn't find it named.

People write about AI assistants versus AI agents as if they're different products. In my experience, they're different modes of the same product. The interface that makes sense for a given task depends entirely on whether you want to stay in the loop or get out of it.

I'd call this the co-pilot/delegate distinction. The fundamental question to ask before opening an AI interface isn't "what can this tool do?" It's "do I want to co-pilot this task or delegate it?"

The practical split

Here's how it actually shakes out for me:

Task	Mode	Why
Editing KB content, repos	Claude Code (interactive)	Each change matters — I want to review
Positioning docs, strategy	Claude Code (interactive)	High judgment required throughout
Weekly Slack triage	CoWork (autonomous)	Routine, high-frequency, well-defined
Customer insight summaries	CoWork (autonomous)	Pattern work, same structure every time
Drafting handoffs from notes	CoWork (autonomous)	I define the output; it executes
Blog drafts	Claude Code (interactive)	Voice and tone need my judgment at every step

The dividing line isn't hard to find: if you'd feel comfortable not watching it work, delegate. If you'd feel nervous not watching it work, co-pilot.
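
The dividing-line question can be written down as a one-line routing rule. This is an illustrative sketch only: the `Task` and `Mode` types and the `comfortable_unwatched` field are hypothetical names, and the answer to the question still has to come from you, not from the code.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    CO_PILOT = "interactive"  # you review each step
    DELEGATE = "autonomous"   # you hand off and walk away


@dataclass
class Task:
    name: str
    comfortable_unwatched: bool  # the dividing-line question, answered by the user


def choose_mode(task: Task) -> Mode:
    """Delegate only when you would be comfortable not watching the work."""
    return Mode.DELEGATE if task.comfortable_unwatched else Mode.CO_PILOT


# Two rows from the table above, with the question answered the way the table implies.
tasks = [
    Task("Weekly Slack triage", comfortable_unwatched=True),
    Task("Positioning docs, strategy", comfortable_unwatched=False),
]
for task in tasks:
    print(f"{task.name}: {choose_mode(task).value}")
```

The rule is deliberately a single boolean. Anything more elaborate would hide the fact that the decision is about your role in the task, not about the tool.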

What changes when you see it this way

Before I had this frame, I was treating the two tools as interchangeable and getting mediocre results from both. I'd use Claude Code for routine tasks and resent that I had to approve every step. I'd use CoWork for precision work and get anxious that I wasn't reviewing the output.

The tools weren't failing me. I was using them in the wrong mode.

Once I separated the work by mode (interactive vs. autonomous), the friction dropped. CoWork handles the steady-state workflow. Claude Code handles the work that needs my judgment. Between the two, almost nothing falls through.

That's not a productivity hack. It's a different mental model for how to work with AI: not "what tool should I use?" but "what role do I want to play in this task?"

If you're a non-developer figuring out how to structure AI-assisted knowledge work, I'd be interested to hear how you're thinking about it.