DEV Community: Kunal Sharda

We added up the real cost of our 7-tool delivery stack. Licenses were 15% of it.

Kunal Sharda — Thu, 11 Jun 2026 09:51:59 +0000

Every tool sprawl thread I read starts with license math, and license math is a decoy. Last quarter I added up what our seven-tool delivery stack actually cost us, and the subscriptions came to about 15% of the total. The other 85% never appears on an invoice, which is exactly why nobody budgets for it and nobody fixes it.

Some background so you can judge whether my numbers transfer to your team. I spent years building automation in banking before running my own product team, so I am professionally allergic to process waste. Despite that, our stack had drifted into the usual shape: Jira for tickets, Confluence for docs, Lucidchart for architecture, TestRail for test cases, two spreadsheets doing unpaid overtime in the gaps, and an AI chatbot bolted on the side that had never seen any of it.

The licenses for all of that, for six people, ran about $700 a month. Annoying. Not a crisis. And that is precisely why the "consolidate your tools" pitch dies in so many budget conversations. Saving a few hundred dollars a month does not justify a migration, and everyone in the room knows it. If licenses were the real cost, I would side with the skeptics.

The audit: two weeks of logging every re-key

So we measured the part nobody measures. For two weeks, everyone on the team logged every re-key: any moment a human moved or restated information that already existed in another tool. Copying acceptance criteria from Confluence into a Jira ticket. Updating TestRail because a story changed shape. Redrawing a Lucidchart flow that had drifted from the code. Reassembling a status update by hand from three tabs. Pasting project context into the chatbot, again, because it forgot everything since yesterday.

The rules were strict so the number would survive scrutiny. Log transfer time only, not thinking time. Round down when unsure. If the same fact got re-keyed twice, log it twice, because it cost twice.

Each entry went into a shared CSV with four columns, and this script turned it into the number we showed the budget owner:

# rekey_audit.py - run after a two-week logging window
import csv

LOADED_RATE = 75  # $/hr, fully loaded cost per engineer
with open("rekeys.csv", newline="") as f:  # who,from_tool,to_tool,minutes
    rows = list(csv.DictReader(f))

total_min = sum(int(r["minutes"]) for r in rows)
weekly_cost = (total_min / 2 / 60) * LOADED_RATE  # two-week window
print(f"{len(rows)} re-keys logged, {total_min/60:.1f} hours")
print(f"~${weekly_cost:,.0f}/week, ~${weekly_cost*48:,.0f}/year")  # annualized over 48 weeks

Our result: 298 re-key events in the window, a little over 25 hours of pure transfer work. Call it 12.5 hours a week across six people. At a loaded rate of $75 an hour, that is roughly $940 a week, or about $45,000 a year over 48 working weeks. Our licenses were $8,400 a year. Add the two together and the invoice was about 15% of the total measurable cost, and the measurable cost is the conservative one.
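As a quick sanity check on that arithmetic, here is a minimal sketch that recomputes the annual figures and the license share. Every input is a number quoted in this post; the variable names are mine, and the 48-week year is the same multiplier the audit script uses.

```python
# Back-of-envelope check of the figures quoted above.
HOURS_PER_WEEK = 12.5      # about 25 logged hours over a two-week window
LOADED_RATE = 75           # $/hr, fully loaded
WORK_WEEKS = 48            # same annualization the audit script uses
LICENSES_PER_MONTH = 700   # $ for six seats

rekey_per_year = HOURS_PER_WEEK * LOADED_RATE * WORK_WEEKS
licenses_per_year = LICENSES_PER_MONTH * 12
total = rekey_per_year + licenses_per_year

print(f"re-keying: ${rekey_per_year:,.0f}/year")
print(f"licenses:  ${licenses_per_year:,.0f}/year")
print(f"license share of total: {licenses_per_year / total:.1%}")
```

The share is measured against the combined total (licenses plus re-keying), which is where the "about 15%" comes from.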

The part that never makes it into the CSV

Because here is what the log cannot capture. A re-key is information moving late, and late information is where the genuinely expensive failures live. The sprint planned against a roadmap doc that was two weeks stale. The feature built on an architecture decision that had quietly changed in a diagram nobody reopened. Those do not cost five minutes. They cost a sprint.

The AI line item deserves its own sentence, because it is new and it is growing. Every prompt to our bolted-on chatbot started with several minutes of pasting stories and constraints into the model's context window, and that pasted snapshot was stale by the same afternoon. We were paying for AI and then paying again, in human minutes, to feed it a degraded copy of what our tools already knew. That is sprawl tax compounding on itself.

Before you consolidate anything, try the cheaper fix

I want to be honest here, because the obvious reading of this post is "buy an all-in-one" and that is not always the right move. Tool sprawl is usually a context problem wearing a tooling costume. Five tools is fine if context flows between them. So look at your log and find the worst handoff first. Sometimes a webhook, a sync job, or simply killing a redundant spreadsheet deletes a tool's worth of re-keying without buying anything.
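Finding the worst handoff is a small grouping job over the same log. The sketch below assumes the four-column CSV described earlier (who, from_tool, to_tool, minutes); the sample rows and the function name are made up for illustration, not taken from our real log.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows in the who,from_tool,to_tool,minutes shape.
SAMPLE = """who,from_tool,to_tool,minutes
ana,Confluence,Jira,5
ben,Jira,TestRail,10
ana,Confluence,Jira,8
cy,Jira,chatbot,4
ben,Jira,TestRail,6
"""

def minutes_by_handoff(csv_file):
    """Sum logged minutes per (from_tool, to_tool) pair."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        totals[(row["from_tool"], row["to_tool"])] += int(row["minutes"])
    return totals

totals = minutes_by_handoff(io.StringIO(SAMPLE))
for (src, dst), minutes in totals.most_common():
    print(f"{src} -> {dst}: {minutes} min")
```

To run it against a real log, pass an open file handle for rekeys.csv instead of the in-memory sample. The pair at the top of the list is the handoff to fix first.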

And sometimes the incumbent earns its seat. If you are a large org whose audit process is built around deep Jira workflows, that configurability is load-bearing and you should keep it. If your pain is purely issue tracking, Linear is the best pure tracker I have used and its mobile app is more mature than anything in the all-in-one category, mine included. Notion remains a nicer place to write a long document than anything a delivery platform ships. If your two-week log comes back small, your stack is fine. Close this tab.

Consolidation is the heavier hammer, for when the handoffs cannot be fixed because the tools were never designed to talk. That was our situation. The re-keying was not an integration gap, it was the architecture: five databases, each holding a partial copy of the truth, each going stale at its own speed.

Where I landed

I went far enough down this hole that I built a product around the alternative, so apply the appropriate salt to this paragraph. Stride is my answer: plan, design, tests, and process on one connected graph, where a story links to its tests and its architecture decision instead of being re-typed near them, and an MCP server lets coding agents read the live graph instead of a pasted snapshot. That is my bias, stated plainly. But you do not need my product to act on this post. The CSV and ten lines of Python are free, and the number they print will start a much better conversation with whoever owns your budget than any license comparison ever has.

So before you run it, make a guess. If your team logged every re-key for two weeks, what would the annualized number be? Write your guess down, run the audit, and come back and tell me how far off you were. I will go first: I guessed $20,000 and the real number was more than double that.

Spec-driven development with AI is real now. The stale spec is the part nobody fixed.

Kunal Sharda — Mon, 08 Jun 2026 18:09:18 +0000

Spec-driven development won the argument. A year ago, writing a spec before you let an agent touch the code sounded like process for its own sake. Now GitHub's Spec Kit has tens of thousands of stars, AWS shipped Kiro around the same idea, and Claude Code, Cursor, and most of the rest have some version of write-the-spec-first baked in. The method is settled. What almost nobody talks about is the failure that shows up three weeks later, when the spec you carefully wrote no longer matches the thing you shipped, and your agent is now confidently building from a document that lies.

I came to this from years of automation work in regulated banking, where "just prompt it and see" was never going to fly past a risk team. So when the SDD tooling landed I was the easy convert. The pitch is clean. The spec is the source of truth, the code is a regenerable output, and the agent builds from intent instead of from a paragraph you pasted into a chat window. Spec Kit even formalizes it into a flow, Spec then Plan then Tasks then Implement, anchored by a "constitution" file of principles that are not supposed to change. Kiro walks you through requirements, then design, then tasks before it writes a line. Both are good. I am not here to dunk on them.

Here is where it cracked for me.

The spec is excellent at time zero. You write it, the agent builds from it, and the first-pass result is genuinely better than vibe coding. Then you ship. An edge case comes back from support, someone patches the code directly to stop the bleeding, and the spec is now quietly wrong. Nobody updates it, because updating a separate document is unrewarded work that no sprint board tracks. Two sprints later a new feature lands next to the old one, the agent reads the old spec to orient itself, and it builds on top of a description of a system that stopped existing a month ago. The model did not hallucinate. The spec did.

The fix is not a better spec format

I tried that route first. Better templates, a tidier folder, a longer constitution. It bought me about a week. The actual fix is making spec drift visible the same way a failing test is visible. A spec earns its keep only when it is tied to the thing that proves it true, which is a test, and when breaking that tie shows up somewhere people already look. That starts with writing acceptance criteria that an agent and a test runner can both read, not prose that only a human can interpret on a good day.

# checkout.feature  (lives next to the code, not in a wiki)
@story:PLAN-412 @verifies:checkout_guest.spec.ts
Feature: Guest checkout

  Scenario: Valid payment creates an order
    Given a guest with items in their cart
    When they submit a valid card
    Then an order is created
    And a receipt email is sent

That file is the spec and the test contract at the same time. The @verifies tag points at the test that proves the scenario, and the @story tag points back at the work that requested it. Now you can add a CI step that greps every .feature for its @verifies target and fails the build when the test is missing or red. Drift stops being something you hope a reviewer notices. It turns into a broken build, which is the one signal engineers cannot ignore.
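A minimal sketch of that CI check, written in Python rather than grep so it can also report why a link is broken. The tag name comes from the example above; the function names, the refunds.feature sample, and the shape of the test_status mapping are assumptions for illustration. In a real pipeline you would build both dictionaries from the repo and the test runner's report.

```python
import re

VERIFIES = re.compile(r"@verifies:(\S+)")

def verifies_targets(feature_text):
    """Return every test file named by an @verifies tag in a .feature file."""
    return VERIFIES.findall(feature_text)

def broken_links(features, test_status):
    """List (feature, target, reason) for targets that are missing or failing.

    features:    {feature path: file text}
    test_status: {test path: True if its last run passed}
    """
    problems = []
    for path, text in sorted(features.items()):
        targets = verifies_targets(text)
        if not targets:
            problems.append((path, None, "no @verifies tag"))
        for target in targets:
            if target not in test_status:
                problems.append((path, target, "test missing"))
            elif not test_status[target]:
                problems.append((path, target, "test failing"))
    return problems

# Hypothetical inputs: one healthy link, one pointing at a test that is gone.
features = {
    "checkout.feature": "@story:PLAN-412 @verifies:checkout_guest.spec.ts\nFeature: Guest checkout\n",
    "refunds.feature": "@story:PLAN-431 @verifies:refunds.spec.ts\nFeature: Refunds\n",
}
test_status = {"checkout_guest.spec.ts": True}

problems = broken_links(features, test_status)
for feature, target, reason in problems:
    print(f"{feature}: {target} ({reason})")
```

In CI, exit with a non-zero status whenever problems is non-empty so the broken link fails the build.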

You can do all of this today with files in your repo and a few lines of CI. No vendor required. If you only take one thing from this post, take that: link the spec to a test, and make the broken link fail loudly.

Where one repo stops being enough

The repo-local version works until the rest of delivery gets involved, and it always does. The story lives in Jira or Linear. The architecture decision that made the feature risky lives in a diagram nobody has opened since kickoff. The prose spec lives in Notion. The test lives in CI. Each tool is good at its slice. Notion is a pleasant place to write the narrative spec and get a team to actually agree on it. Linear is the cleanest pure tracker I have used. Jira will bend to almost any workflow you can dream up, which is exactly why big orgs keep it. None of them hold the spec, the story, the decision, and the test as linked nodes. So the linkage that makes SDD durable is the precise thing your stack drops on the floor between tabs.

There is a second reason linkage matters now, and it is the agent itself. A spec is only as useful as what the agent can see at the moment it runs. Paste the spec into context and it works for one task, then goes stale inside the same session as the agent edits around it. What you actually want is for the agent to query the live spec, the linked story, and the current test status, so it builds from what is true today instead of what you typed last month. That is the whole "context is the model, not the prompt" idea, and it is why I stopped thinking of a spec as a document and started thinking of it as a node with edges.

I went far enough down this hole that I built a product around it, so read the next sentence with the appropriate amount of salt. Stride keeps plan, design, tests, and process on one connected graph, and runs an MCP server so Claude Code and Codex read the real stories and tests instead of a snapshot you pasted. That is my bias, said out loud. But you do not need my tool to get most of the value here. You need the spec wired to a test, and you need drift to break the build. A .feature file and one CI check get you embarrassingly far before you have to buy anything.

The part I still have not solved cleanly is the prose half of a spec. Acceptance criteria keep behavior honest because a test can check them. The why behind a decision, the tradeoff you weighed and rejected, the constraint from a compliance review, that reasoning does not reduce to a Given/When/Then, and it rots the quietest of all. So I will hand it to you, because I am genuinely trying to steal whatever is working. If you are running spec-driven development with an agent right now, how do you keep the intent and the reasoning from going stale, not just the acceptance tests?

Jira, Linear, or one connected graph? An honest take for 2026

Kunal Sharda — Sun, 07 Jun 2026 10:40:10 +0000

I have shipped on Jira, Linear, and a homegrown stack of five tools held together with hope. People ask me which one to pick, and the honest answer is that the question is usually wrong. You are not choosing a tracker. You are choosing how much of your delivery context is allowed to fall on the floor between tools. Let me make the real tradeoff visible, including where I would tell you to pick the competitor.

Where Linear genuinely wins

If you need world-class issue tracking and not much else, Linear is the best at it, and it is not close. The keyboard-first UX, the speed, the opinionated simplicity. Their mobile app is more mature than most of the market. For an engineering-only team that wants to track work and nothing more, Linear is a fine and possibly correct choice. I will not pretend otherwise.

Where it leaves you stranded is everything that is not an issue. Your architecture decisions live somewhere else. Your test coverage lives somewhere else. Your docs go stale in a third place. Linear is excellent at the box it draws and silent about everything outside it.

Where Jira genuinely wins

If you are a large org with dedicated admins and genuinely complex workflows, Jira earns its reputation. Twenty years of configurability, field-level permissions, a marketplace with thousands of apps. If you run a 200-state workflow with 40 custom screens, the lighter tools will frustrate you within a week, and Jira is the right call.

The cost is the one everybody knows. It is heavy, configuration is a job, and the AI bolted on top is a chatbot that does not understand your actual project. You pay in setup time and in the tax of context that never quite connects.

The thing both of them share

Here is the pattern that took me years to name. Tracker, docs, diagrams, and tests are four different tools, and the context does not flow between them. So a story has no idea which test verifies it. A test has no idea which architecture decision made it necessary. A defect does not update the ambiguous criterion that caused it. Every artifact is an island, and your team spends a quiet 15 to 20 percent of its time, by my rough estimate, being the human glue that carries information across the gaps.

When you add AI to that world, the AI inherits the fragmentation. It works from snippets you paste, because there is no connected picture for it to read. That is why most "AI in your tracker" features feel like a toy. The model is fine. The substrate is broken.

The alternative is not "one giant tool"

The instinct when you hear "all in one" is to picture a bloated suite that does everything badly. That is the wrong mental model. The thing that actually matters is not consolidation for its own sake, it is a graph: every artifact is a node with typed links to the others. Stories link to tests. Tests link to decisions. Defects link back to the story and the criterion they came from.

Once that graph exists, two things change. Questions that used to take 15 minutes of cross-referencing become a query.

"which open defects block the next release?"
  release -> stories scheduled -> defects(status: open)
  # two hops. a list, not a guess, and the AI can answer it in one call.

And AI prompts run against the real structure instead of a pasted snippet, so the answers are grounded in your actual product.
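To make the two-hop query concrete, here is a toy sketch over plain dictionaries. The IDs, the data, and the function name are hypothetical, and a real graph would live in a store rather than in memory, but the traversal has the same shape.

```python
# Hypothetical in-memory graph: release -> stories -> defects.
releases = {"2026.07": ["STORY-1", "STORY-2"]}
stories = {"STORY-1": ["DEF-10", "DEF-11"], "STORY-2": ["DEF-12"]}
defects = {"DEF-10": "open", "DEF-11": "closed", "DEF-12": "open"}

def open_defects_blocking(release_id):
    """Hop 1: release -> scheduled stories. Hop 2: story -> open defects."""
    return [
        (story_id, defect_id)
        for story_id in releases.get(release_id, [])
        for defect_id in stories.get(story_id, [])
        if defects.get(defect_id) == "open"
    ]

blocking = open_defects_blocking("2026.07")
for story_id, defect_id in blocking:
    print(f"{defect_id} blocks {story_id}")
```

The answer is a list derived from links, so it stays correct as long as the links do.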

This is the bet I made building Stride: plan, design, QA, and process on one connected graph, with the AI reading the graph rather than a snippet. I am obviously not neutral. So weigh this section accordingly and judge it against your own pain.

How to actually choose

Stop comparing feature lists and ask what hurts.

If the pain is "our issue tracking is clunky" and nothing else, get Linear. If the pain is "we are a big regulated org with complex workflows and dedicated admins," Jira is built for you, and I wrote a fuller honest take on when to stay on Jira if that is you. If the pain is "our context is scattered across five tools and our AI is useless because it cannot see any of it," that is the gap a connected graph is built for, and the lighter trackers will not close it no matter how nice the UI is.

The honest trap to avoid is picking a tracker to solve a problem that is not about tracking. A lot of "Jira is too heavy" is really "our process has too many states," and no tool fixes a process problem. Be clear-eyed about which problem you actually have before you migrate, because migrations are expensive and a wrong one is worse than staying put.

My one-line version

Linear if you only need issues. Jira if you are big and complex and have admins. A connected graph if your real problem is that nothing in your delivery stack talks to anything else and your AI is paying the price.

What is the actual pain that has you shopping? Drop it in the comments and I will give you my honest read, even if the answer is "stay where you are."

In regulated software, traceability is the deliverable. Stop building it by hand.

Kunal Sharda — Wed, 03 Jun 2026 20:46:39 +0000

I spent a chunk of my career wiring automation into banks, which means I have built more traceability matrices by hand than I would like to admit. The job always looked the same. An audit is six weeks out, someone asks for proof that every requirement was tested, and a person, usually me, starts screenshotting tickets from Jira, exporting test runs from a separate QA tool, and stitching the two together in a spreadsheet that is already wrong by the time it finishes printing.

Here is the opinion it took me too long to say out loud. In regulated software, the traceability matrix is the deliverable, not a report you generate under duress. If you are rebuilding it from screenshots the week before the auditor arrives, you do not have traceability. You have a fire drill that produces a document shaped like traceability.

The matrix is real. The way most teams build it is broken.

A traceability matrix is a simple idea. Every requirement maps to the design that satisfies it, the test that verifies it, and the defect history that shows the test actually means something. Auditors in finance, medical devices, and aerospace ask for it because it answers one blunt question: can you show that what you said you would build is the thing you actually tested and shipped?

The idea is sound. The implementation is where it falls apart. In most shops the requirement lives in Jira, the test lives in TestRail or Xray or a tab in Excel, the architecture decision that made the requirement risky lives in Confluence, and the defect sits back in Jira under a different project key. None of those tools share a single identity for the requirement, so the matrix only exists when a human re-joins them by hand. It is a join query executed by a tired person at 9pm, and it has all the reliability you would expect from that.

Why the hand-built matrix is worse than no matrix

A matrix you assemble at audit time certifies a snapshot. It was true for the twenty minutes it took to export, then the next merge made it slightly false, and nobody re-ran it because re-running it is another day gone. So you end up with a document that looks authoritative and is quietly out of date. That is arguably worse than admitting you cannot produce one, because now you have signed your name under a join that may not hold.

The auditor's real question is never "do you have a matrix." It is "prove, right now, that requirement 14 was tested and passed." If answering that means an engineer spends an afternoon reconstructing links, then traceability is not a property of your system. It is a performance you put on when asked. And the whole point of traceability is that it is meant to be continuous, the way version control is continuous. You do not rebuild your git history before an audit. You should not be rebuilding your requirement-to-test history either.

Traceability is a graph problem wearing a document costume

Here is the reframe that changed how I build this. A matrix is a flat view of something that was never flat. The real structure is a graph. A requirement has acceptance criteria. Each criterion is verified by one or more tests. Each test carries a pass or fail history. Defects link back to the criterion that was too vague to catch them. The "matrix" is just one projection of that graph, the row-and-column view an auditor likes to read.

Store the links as a graph and the matrix becomes a query instead of a craft project. Better still, the gaps become a query too.

# The audit question, expressed as a query over the graph:
# "Which requirements have no passing test?" That is the first thing an auditor hunts for.

requirement
  -> acceptance_criteria
  -> verified_by(tests where status = passing)   # the actual proof
WHERE count(verified_by) == 0                     # the hole in your coverage

# If this returns rows, your matrix has gaps. Run it in CI, not in March.

The day that query runs in your pipeline instead of in a spreadsheet, traceability stops being an event and turns into a property. A requirement that merges without a linked passing test fails the build, the same way an untested path can fail a coverage gate. You are not preparing for the audit. You are continuously sitting in the state the audit checks for.
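Here is a runnable sketch of that gap query, assuming the graph is small enough to hold as plain Python objects. The class names, field names, and sample requirements are all illustrative; they are not a schema from any particular tool.

```python
from dataclasses import dataclass, field

@dataclass
class LinkedTest:
    name: str
    passing: bool

@dataclass
class Criterion:
    text: str
    verified_by: list = field(default_factory=list)

@dataclass
class Requirement:
    req_id: str
    criteria: list = field(default_factory=list)

def uncovered(requirements):
    """Return IDs of requirements with no passing test on any criterion."""
    gaps = []
    for req in requirements:
        passing = [t for c in req.criteria for t in c.verified_by if t.passing]
        if not passing:
            gaps.append(req.req_id)
    return gaps

# Hypothetical data: one covered, one with a failing test, one with no test.
requirements = [
    Requirement("REQ-14", [Criterion("limit enforced", [LinkedTest("test_limit", True)])]),
    Requirement("REQ-15", [Criterion("approver logged", [LinkedTest("test_audit_log", False)])]),
    Requirement("REQ-16", [Criterion("export signed", [])]),
]

gaps = uncovered(requirements)
print("requirements with no passing test:", gaps)
```

A pipeline step that fails when gaps is non-empty is the build-breaking version of the audit question.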

What this looks like on a normal Tuesday

You do not need a graph database and a research team to start. You need a stable identity for each requirement and the discipline to link the test back to that identity, not to a free-text title that someone will reword next sprint. A convention as dumb as REQ-IDs in your test names, plus a script that diffs the set of requirements against the set of referenced IDs, will catch most missing links. The tooling matters less than the rule: a requirement is not done when the code merges, it is done when a test references it by ID and passes.
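The diff script can be very small. This sketch assumes requirement IDs look like REQ-14 and appear verbatim in test names; the ID pattern, the sample names, and the function name are assumptions, so adjust them to your own convention.

```python
import re

REQ_ID = re.compile(r"REQ-\d+")

def missing_links(requirement_ids, test_names):
    """Diff the known requirement IDs against the IDs referenced in test names.

    Returns (untested, orphaned): requirements no test mentions, and IDs that
    tests mention but the requirement list does not contain.
    """
    known = set(requirement_ids)
    referenced = set()
    for name in test_names:
        referenced.update(REQ_ID.findall(name))
    return sorted(known - referenced), sorted(referenced - known)

# Hypothetical inputs.
requirement_ids = ["REQ-14", "REQ-15", "REQ-16"]
test_names = [
    "REQ-14 transfer over limit is rejected",
    "REQ-16 audit log records approver",
    "REQ-99 legacy export still works",
]

untested, orphaned = missing_links(requirement_ids, test_names)
print("untested requirements:", untested)
print("tests citing unknown IDs:", orphaned)
```

This only proves a test references the requirement. Whether that test passes still has to come from the test runner's results.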

It also forces a distinction the spreadsheet hides. Test coverage and requirement coverage are not the same number. You can sit at 90 percent line coverage and still have a requirement that nothing verifies, because coverage counts code, not promises. The graph counts promises, which is the thing the auditor actually came to check.

Where the heavyweight tools genuinely win

I am not going to tell you to rip out a validated toolchain. If you are in medical devices or aerospace and you already run something like Polarion or DOORS inside a qualified, auditor-accepted process, that tooling is part of your validation, and replacing it is itself an audit risk nobody on a sales call will warn you about. The same goes for a mature Jira setup with Xray or Zephyr that your auditors already accept. The integration may be clunky, but clunky and accepted beats elegant and unproven when a regulator is in the room.

Jira also still wins on deep custom workflows and the size of its marketplace. If your compliance process genuinely needs forty fields and a fourteen-state approval, the lighter tools will fight you the whole way. And in fairness to the tools that do not pretend, Notion and Linear are honest about not being compliance systems. Linear tracks issues better than almost anything and makes no claim to be your system of record for an audit. Do not bolt a traceability process onto a tool built to be fast and opinionated about tickets, because you will spend your life maintaining the duct tape.

Where I landed

After enough of those 9pm spreadsheets, I stopped treating the matrix as a document to produce and started treating it as a graph to maintain. That is the thesis I built Stride around: stories, acceptance criteria, tests, and defects live as linked nodes on one graph, so the matrix is a live view and the gaps surface while you work instead of the week before the auditor lands. I am the founder, so read that with the salt it deserves. The principle does not depend on my product. You can start tomorrow with REQ-IDs and a diff script. The goal is to make traceability continuous instead of heroic.

So I will put it to the people who have actually survived an audit: what is the worst traceability gap you found at the worst possible moment, and when you fixed it, did you fix the process or just patch the spreadsheet? I am collecting the failure modes, because they teach more than any vendor checklist.

I changed how I write acceptance criteria, and my AI agent stopped building the wrong thing

Kunal Sharda — Wed, 03 Jun 2026 20:44:57 +0000

For a while I blamed the model. The agent would build something plausible and wrong, and I would assume it needed a smarter brain. Then I went back and read the tickets I had handed it, and the problem was obvious. My acceptance criteria were wishes, not specifications. The agent built exactly what I wrote. I just had not written what I meant.

Here is the change that fixed most of it, and it has nothing to do with the model.

Prose acceptance criteria are where intent goes to die

Most ACs read like this:

The export should handle large files gracefully and not time out.

Every word in that sentence is a landmine. How big is "large"? What behavior counts as "gracefully"? "Time out" at what threshold? A human reviewer fills those gaps with assumptions, usually different from the ones the author had in mind. An AI agent fills them too, just faster and more confidently. You get working code for a spec nobody actually agreed on.

The fix is to stop writing criteria as description and start writing them as something checkable. If a criterion cannot become a pass or fail, it is not a criterion. It is a vibe.

The format that travels

I moved everything to a given / when / then shape. Boring on purpose.

Given a CSV with 100,000 rows
When the user triggers an export
Then the file streams to download and completes within 30 seconds
And peak memory stays under 512 MB

Now there is nothing to assume. The thresholds are explicit. An engineer, QA, and the agent all read it the same way. And the Then clauses are the quiet hero: they make the criterion testable. You can write the test before the code, and the agent can check itself against it.
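As a sketch of what "testable" means here, the criterion above can be turned almost line for line into a check. The streaming exporter below is a stand-in I made up so the example runs on its own, and tracemalloc only counts memory allocated by Python, so treat the memory figure as an approximation rather than a process-level measurement.

```python
import csv
import io
import time
import tracemalloc

def stream_export(rows):
    """Yield CSV text one row at a time instead of building the whole file."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow(row)
        yield buffer.getvalue()
        buffer.seek(0)
        buffer.truncate(0)

def measure_export(row_count):
    """Run the export and return (rows written, seconds elapsed, peak bytes)."""
    rows = ((i, f"customer-{i}") for i in range(row_count))
    tracemalloc.start()
    started = time.perf_counter()
    written = sum(1 for _ in stream_export(rows))
    elapsed = time.perf_counter() - started
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return written, elapsed, peak

# Given a CSV with 100,000 rows, when the user triggers an export...
written, elapsed, peak = measure_export(100_000)

# ...then it completes within 30 seconds and peak memory stays under 512 MB.
assert written == 100_000
assert elapsed < 30
assert peak < 512 * 1024 * 1024
```

Each threshold in the criterion maps to one assertion, which is the property the prose version never had.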

A few rules I hold to now:

Numbers, not adjectives. "Fast" becomes "under 200ms at p95." "Large" becomes a row count. If you cannot put a number on it, you do not understand the requirement yet, and neither will the agent.

One behavior per criterion. The moment a criterion has an "and also," split it. Compound criteria are how half-finished features pass review.

State the unhappy path explicitly. Most agent failures live here. What happens on an empty input, a duplicate, a permission error. If you do not write it, the agent will invent it, and you will not like what it invents.
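The first rule is easy to check mechanically. This is a rough heuristic sketch, not a real linter: the vague-word list is my own guess at common offenders, and "contains a digit" is a crude proxy for "has a threshold" that will flag some perfectly good criteria.

```python
import re

# Hypothetical starter list of adjectives and adverbs that hide a missing number.
VAGUE = {"fast", "large", "gracefully", "snappy", "quickly", "robust", "easily"}

def lint_criterion(text):
    """Return a list of problems found in one acceptance criterion."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    problems = [f"vague word: {word}" for word in sorted(words & VAGUE)]
    if not re.search(r"\d", text):
        problems.append("no number or threshold")
    return problems

before = "The export should handle large files gracefully and not time out."
after = "Then the file streams to download and completes within 30 seconds"

print(lint_criterion(before))
print(lint_criterion(after))
```

Run it over a ticket before handing the ticket to an agent, and treat every hit as a question to answer rather than an automatic failure.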

Why this matters more with AI, not less

A human engineer who reads a vague AC will often stop and ask. Slack you, raise it in refinement, push back. That friction is annoying and it is also a safety net. The vague spec gets clarified because a person refused to guess.

An agent does not refuse to guess. It guesses instantly and commits. So the vague AC that a human would have flagged sails straight through into code. The discipline that you could get away with skipping when humans were the only readers is now load-bearing.

This is the part people miss when they say AI lets you move faster. It does, but it removes the human who used to catch your underspecified tickets. You have to put that rigor back into the spec, because the agent will not supply it for you.

Where good criteria alone are not enough

Honesty time. A sharp AC fixes the "built the wrong thing" failure. It does not fix the "could not see the rest of the system" failure. The agent can perfectly satisfy a criterion and still duplicate an existing utility or violate an architecture decision it never knew about, because that context lived in another tool.

So the AC is necessary, not sufficient. The agent needs the criterion AND the surrounding truth: the existing tests, the relevant decisions, the related stories. When I write criteria as checkable statements and the agent can query them along with the rest of the project, the output stops being plausible and starts being correct.

That is the thesis behind what I am building at Stride: the AI writes and reads acceptance criteria as linked nodes next to the tests and decisions they relate to, so a criterion is never three tabs away from the thing that proves it. But you do not need any particular tool to get most of this benefit today. You need to stop writing wishes.

Try this on your next ticket

Take the next thing you are about to hand an agent. Find every adjective in the acceptance criteria and replace it with a number or a concrete behavior. Add the unhappy path. Then run the agent. The difference is not subtle.

What is the worst acceptance criterion you have shipped, in hindsight? I will go first: "should feel snappy." Caused a week of rework. Your turn.

Your AI coding agent doesn't need a smarter model. It needs your backlog.

Kunal Sharda — Sat, 30 May 2026 20:03:26 +0000

Here is the uncomfortable thing I have landed on after a year of watching coding agents succeed and fail on real work: the model is almost never the bottleneck. Claude Code and Codex are both more than capable of the feature you are asking for. What breaks the run is that the agent cannot see the truth it is supposed to build against. The story. The acceptance criteria. The architecture decision it is meant to respect. The test that already exists for the thing it is about to rewrite.

So it guesses. The guess is locally reasonable and globally wrong, and you spend the afternoon unwinding it. The instinct is to reach for a smarter model. The fix is to give the model your backlog.

Why pasting context stops working

Most of us feed an agent context by pasting it. You paste the ticket, a few file paths, maybe a paragraph of background, and you let it run. This works for a self-contained task and falls apart the moment the work touches the rest of the system.

The reason is simple. Pasted context is a snapshot, and snapshots go stale inside the same session. The agent makes a change on step three that invalidates the assumption you pasted on step one, but the pasted text does not update, so by step seven it is reasoning about a version of the project that no longer exists. You are not giving it context. You are giving it a photograph of context and asking it to navigate a moving room.

The second problem is that the things that actually matter for a real feature are relationships, not paragraphs. Which architecture decision constrains this story. Which test verifies this acceptance criterion. What defect we last saw in this area. None of that lives in a paragraph you can paste. It lives in the links between artifacts, and a paste flattens all of it into prose the agent has to re-infer.

To be clear, this is not an argument that models do not matter. A better model is genuinely better at reasoning once it has the right inputs. The claim is narrower and more useful: for the failures most teams actually hit on bigger tasks, fixing the inputs beats upgrading the model, and it is cheaper.

What the agent actually needs

It needs a source of truth it can query on demand, not a wall of text you pasted once.

When the agent can query, it pulls the current state at the moment it needs it. It asks "what are the acceptance criteria for this story" right before it writes the code, not at the start of a session that has since drifted. It asks "what tests already cover this module" before it rewrites the module, so it stops breaking things it did not know existed. It asks "which decision governs this boundary" before it crosses the boundary. The context is live because it is fetched, not remembered.

For that to work, two things have to be true. The truth has to exist in a structured, linked form, and the agent has to have a way to reach it. The first is a product problem. The second is a protocol problem, and the protocol now exists.

MCP is the part that just got easy

The Model Context Protocol is the reason this is suddenly practical rather than a research project. MCP is an open standard that lets an agent such as Claude Code or Codex call out to an external system and read or write structured data. Instead of you copying your backlog into a prompt, the agent connects to a server and queries the backlog directly, the same way it would call any other tool.

// Instead of pasting context, the agent fetches it the moment it needs it.
// Illustrative only: `mcp` stands for an already-connected MCP client, and
// `get_story` is a tool the server would have to expose.
const story = await mcp.call("get_story", { id: "STR-481" });
// -> { title, acceptanceCriteria, linkedTests, linkedADRs, status }
//
// Now it writes against the real acceptance criteria and the tests that
// already exist, not a snapshot you pasted at the top of the session.

It is worth being precise about why this beats the usual "AI that knows your data" pitch, which almost always means vector search. Embedding your docs and retrieving the most similar passage is fine for "summarize this page" and useless for "which decision constrains this story," because similarity is not the same as relationship. A graph answers the relationship question by traversal: this story, to the decisions in its epic, to the ones touching the same boundary. The retrieval is structural, not statistical, and structure is exactly what a coding agent needs when the task spans more than one file.
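
To make the traversal concrete, here is a rough sketch over a toy edge list. The ids, the edge kinds (`in_epic`, `decided_for`, `touches`, `governs`), and the helper names are all assumptions made up for this example, not the schema of any particular product:

```typescript
// Hypothetical artifact graph: every id and edge kind here is invented.
type EdgeKind = "in_epic" | "decided_for" | "touches" | "governs";

interface Edge {
  from: string;
  kind: EdgeKind;
  to: string;
}

const edges: Edge[] = [
  { from: "STR-481", kind: "in_epic", to: "EPIC-12" },
  { from: "ADR-007", kind: "decided_for", to: "EPIC-12" },
  { from: "STR-481", kind: "touches", to: "boundary:gateway" },
  { from: "ADR-019", kind: "governs", to: "boundary:gateway" },
  { from: "ADR-030", kind: "governs", to: "boundary:billing" },
];

// Follow outgoing edges of one kind from a node.
function out(from: string, kind: EdgeKind): string[] {
  return edges.filter((e) => e.from === from && e.kind === kind).map((e) => e.to);
}

// Follow incoming edges of one kind into a node.
function into(to: string, kind: EdgeKind): string[] {
  return edges.filter((e) => e.to === to && e.kind === kind).map((e) => e.from);
}

// "Which decisions constrain this story?" answered by traversal:
// story -> epic -> decisions made for that epic, plus
// story -> boundary -> decisions governing the same boundary.
function decisionsConstraining(storyId: string): string[] {
  const found: string[] = [];
  for (const epic of out(storyId, "in_epic")) {
    for (const adr of into(epic, "decided_for")) found.push(adr);
  }
  for (const boundary of out(storyId, "touches")) {
    for (const adr of into(boundary, "governs")) found.push(adr);
  }
  // De-duplicate while keeping first-seen order.
  return found.filter((id, i) => found.indexOf(id) === i);
}

console.log(decisionsConstraining("STR-481")); // ADR-007 and ADR-019, not ADR-030
```

ADR-030 is left out because no path connects it to the story, however similar its wording might be. An embedding search has no equivalent of that exclusion.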

A concrete before and after

Take a normal request: add rate limiting to an API endpoint.

In the paste workflow, you copy the ticket, mention the endpoint, and let the agent go. It writes a reasonable rate limiter. It does not know you already have a rate-limiting utility in the codebase because that was not in the paste, so now you have two. It does not know the architecture decision that says limits live at the gateway, not the handler, because that ADR is in a separate tool nobody linked. It writes a test, but not one that matches the acceptance criterion about per-tenant limits, because the AC was three tabs away. The code looks fine in review and is wrong in three quiet ways.

In the queryable workflow, the agent reads the story, sees the per-tenant acceptance criterion, queries the architecture decisions for the area and finds the gateway rule, checks existing tests and finds the utility, and writes against all of it. The pull request that comes back is not just plausible, it is consistent with how your system already works. You review intent, not archaeology.

The model was identical in both runs. The inputs were not.
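
The queryable run can be sketched as three lookups made before any code is written. Everything here is hypothetical: the `SourceOfTruth` interface, the in-memory data, the ADR number, and the test file path are placeholders, and a real setup would answer these queries from a server rather than from constants:

```typescript
// All names below (SourceOfTruth, the ids, the file path) are hypothetical.
interface SourceOfTruth {
  acceptanceCriteria(storyId: string): string[];
  decisionsFor(area: string): string[];
  testsCovering(area: string): string[];
}

// In-memory stand-in so the sketch runs on its own.
const inMemory: SourceOfTruth = {
  acceptanceCriteria: (storyId) =>
    storyId === "STR-481" ? ["Limits are enforced per tenant"] : [],
  decisionsFor: (area) =>
    area === "rate-limiting" ? ["ADR-019: limits live at the gateway, not the handler"] : [],
  testsCovering: (area) =>
    area === "rate-limiting" ? ["tests/rateLimiter.test.ts (existing utility)"] : [],
};

// What the agent checks before writing any code for the story.
function preflight(source: SourceOfTruth, storyId: string, area: string): string[] {
  const notes: string[] = [];
  for (const ac of source.acceptanceCriteria(storyId)) notes.push("Must satisfy: " + ac);
  for (const adr of source.decisionsFor(area)) notes.push("Constrained by: " + adr);
  for (const t of source.testsCovering(area)) notes.push("Already covered by: " + t);
  return notes;
}

console.log(preflight(inMemory, "STR-481", "rate-limiting").join("\n"));
```

Each note maps to one of the three quiet errors from the paste run: the missed per-tenant criterion, the ignored gateway decision, and the duplicated utility.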

A quick way to tell if context is your problem

Look at your last five agent failures and sort them. If the agent produced code that was wrong about how your system works, that is a context problem, and plumbing fixes it. If it produced code that was technically fine but solved the wrong thing, that is a clarity problem, and better acceptance criteria fix it. If it produced code that was just low quality on a simple task, that is the one case where a better model actually helps. In my experience the first bucket is the largest by a wide margin, and it is the cheapest to fix.
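
If you want to run that sort as an actual tally, a small sketch might look like the following. The two boolean flags and the rule that "wrong about the system" takes precedence are my assumptions, and the sample entries are made up purely to show the shape:

```typescript
type Bucket = "context" | "clarity" | "model";

interface Failure {
  summary: string;
  wrongAboutSystem: boolean; // contradicted how the system actually works
  solvedWrongThing: boolean; // technically fine code, wrong problem
}

// Assumption: a failure that is wrong about the system counts as a context
// problem even if it also solved the wrong thing; anything left over is
// treated as plain low quality.
function classify(f: Failure): Bucket {
  if (f.wrongAboutSystem) return "context";
  if (f.solvedWrongThing) return "clarity";
  return "model";
}

function tally(failures: Failure[]): Record<Bucket, number> {
  const counts: Record<Bucket, number> = { context: 0, clarity: 0, model: 0 };
  for (const f of failures) counts[classify(f)] += 1;
  return counts;
}

// Made-up entries, only to show the shape of the exercise.
const lastFailures: Failure[] = [
  { summary: "duplicated an existing utility", wrongAboutSystem: true, solvedWrongThing: false },
  { summary: "ignored a gateway-level decision", wrongAboutSystem: true, solvedWrongThing: false },
  { summary: "built a feature nobody asked for", wrongAboutSystem: false, solvedWrongThing: true },
];

console.log(tally(lastFailures)); // { context: 2, clarity: 1, model: 0 }
```

Swap in your own five failures; the point is only to force each one into a single bucket so the largest bucket becomes visible.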

Where this does not help, and where simpler is right

If your tasks are genuinely small and self-contained (scripts, one-file changes, throwaway prototypes), none of this matters. Paste the context and move on. Wiring up a source of truth for work that fits in one screen is overkill.

If your context problem is actually a clarity problem, no amount of plumbing fixes it. In my experience, about half of "the agent did the wrong thing" is really "nobody ever defined what done meant in checkable terms." If your acceptance criteria are vague prose, the agent will build vague prose.

And if you live entirely inside one tool that your agent already integrates with deeply, you may have enough of this already. The gap shows up when the truth the agent needs is spread across your tracker, your docs, your diagrams, and your test tool, none of which talk to each other.

The shift in how I think about agents now

I used to treat the agent as the thing to improve. Better prompts, better model, better tooling around the prompt. I now treat the agent as fixed and the context as the variable. Given a capable model, the quality of the output is mostly a function of what the agent can see at the moment it acts. Improve what it can see and the same model gets noticeably better, on the same task, on the same day.

That reframing is freeing, because context is something you control. You cannot make the model smarter this afternoon. You can absolutely give it your backlog this afternoon.

This is the thesis I ended up building Stride around: one connected graph of stories, tests, and architecture decisions, exposed to your coding agents over MCP so they read the real thing instead of a paste. But the idea stands on its own no matter what you use. Give your agent your backlog, not a photograph of it.

What are the rest of you doing to keep agents grounded once the task is bigger than a single file? I am collecting approaches and would genuinely like to hear them.