DEV Community: Debbie Shapiro

Honest Memory: What Production Accuracy Data Actually Shows About AI Agent Memory

Debbie Shapiro — Mon, 08 Jun 2026 23:31:45 +0000

A major AI memory provider published their own research this spring measuring how well their system actually works in production. The controlled benchmark result was impressive: over ninety percent accuracy on standard evaluation corpora. The production result at thirty days was forty-nine percent.

That gap -- ninety-one to forty-nine -- is worth sitting with for a moment. The same system. The same vendor. The same definition of "working." The difference is what happens when the system runs continuously against real workloads instead of curated test sets.

This is not a vendor failing to disclose their results. They published the data themselves, in a public research report. That transparency is worth acknowledging. But the gap also tells you something important about what "AI agent memory" is actually solving -- and what it is not.

Why Auto-Capture Memory Degrades

The core challenge with automatic memory accumulation is that agents do not save discrete, well-structured facts. They save inferences, summaries, and working conclusions -- and those accumulate in ways that eventually contradict each other.

An agent that automatically captures "the user prefers short responses" in an early session, then captures "the user asked for a more detailed breakdown" three weeks later, ends up with two contradictory facts in its memory store. Neither is wrong in context. Both become noise when the system tries to answer "how should I structure my next response?"

At thirty days of continuous operation, a memory store built by automatic capture contains thousands of these contradictions. Facts about the same entity conflict. Preferences stated early have been reversed by subsequent behavior but the original entry was never expired. Working conclusions from resolved tasks still surface as if they were current state.

The controlled benchmark does not reveal this degradation because the test set is static. The evaluators know what the ground truth is and check against it. Production is dynamic -- ground truth shifts, context accumulates, and the memory system has no reliable way to expire stale entries without a human in the loop to say which facts are still valid.
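To make the contradiction problem concrete, here is a minimal sketch of a memory store holding conflicting entries about the same subject. The `MemoryEntry` fields and the `find_contradictions` helper are hypothetical names chosen for illustration; they do not describe any vendor's schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEntry:
    """One captured fact. All field names are hypothetical."""
    day: int          # days since the agent started running
    subject: str      # what the fact is about
    attribute: str    # which property of the subject
    value: str        # the captured value


def find_contradictions(entries):
    """Group entries by (subject, attribute); keep groups with more than one distinct value."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[(entry.subject, entry.attribute)].append(entry)
    return {
        key: sorted(group, key=lambda e: e.day)
        for key, group in grouped.items()
        if len({e.value for e in group}) > 1
    }


store = [
    MemoryEntry(day=0, subject="user", attribute="response_length", value="short"),
    MemoryEntry(day=21, subject="user", attribute="response_length", value="detailed"),
    MemoryEntry(day=3, subject="user", attribute="timezone", value="UTC"),
]

conflicts = find_contradictions(store)
for (subject, attribute), group in conflicts.items():
    values = [f"day {e.day}: {e.value}" for e in group]
    print(f"{subject}.{attribute} has conflicting entries -> {values}")
```

Detecting the conflict is the easy part. Deciding which entry is still valid is the part that needs a person or an explicit expiry rule.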

What Explicit Saves Change

LoreConvo does not automatically extract memories from your conversation text. When a session ends, a hook fires and records the structured summary the agent or user has written -- decisions made, artifacts produced, questions left open. That record reflects what someone chose to commit, not what the model inferred from the raw exchange.

This is a different contract from automatic capture. The session-end hook automates the mechanics of saving, but the content is still explicitly structured. The things that get saved are the things someone decided were worth saving -- and put into words as decisions rather than leaving them as inferences the system tries to reconstruct later.

The result is a memory store that contains decisions, not inferences. Artifacts, not accumulations. Open questions that were explicitly flagged as open, not stale conclusions that were never closed out. When you search across saved sessions from the past month, you are searching a corpus of intentional records rather than a corpus of automatic accumulations.

LoreConvo does not make you immune to stale data. A decision saved three months ago may be obsolete by now, and LoreConvo has no way to know that on its own. The difference is that explicit saves put the staleness question in human hands. You decide what is worth keeping. You decide when a record is closed. The memory system stores what you give it and surfaces it when you ask.
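As an illustration of what an explicit save could look like, here is a minimal sketch of a structured session record with a human-controlled `closed` flag. This is not LoreConvo's schema or API; `SessionRecord`, its fields, and the `search` helper are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class SessionRecord:
    """A hypothetical explicit save: only what someone chose to commit."""
    saved_on: date
    title: str
    decisions: list = field(default_factory=list)
    artifacts: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)
    closed: bool = False  # set by a person, never inferred


def search(records, term, include_closed=False):
    """Case-insensitive substring search over the committed text of each record."""
    term = term.lower()
    hits = []
    for record in records:
        if record.closed and not include_closed:
            continue
        committed = [record.title, *record.decisions,
                     *record.artifacts, *record.open_questions]
        if any(term in text.lower() for text in committed):
            hits.append(record)
    return hits


records = [
    SessionRecord(date(2026, 5, 4), "Billing export",
                  decisions=["Use CSV, not Parquet, for the billing export"]),
    SessionRecord(date(2026, 6, 1), "Billing export follow-up",
                  decisions=["Switch billing export to Parquet"],
                  open_questions=["Who owns the schema?"]),
]

# A person decides the May decision is obsolete and closes it.
records[0].closed = True

current = search(records, "billing export")
print([r.title for r in current])
```

The store never guesses that the older decision is stale. It only stops surfacing it once someone marks it closed.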

The Professional Use Case

For agents operating in professional contexts -- engineering teams, data pipelines, consulting workflows -- the controlled-vs-production accuracy gap matters more than the headline benchmark. An agent that delivers over ninety percent accuracy in evaluation but forty-nine percent in a month of real use is not reliable enough to trust with consequential context.

What those teams actually need is memory they can audit. Memory where they can see what the agent knows, correct a stale entry, and confirm that the correction took effect. Memory that does not silently accumulate contradictions across sessions.

Explicit saves and structured tagging are not features for users who enjoy manual data entry. They are features for users who need to be able to trust what their agents remember. The audit trail is the product.

The Benchmark Number Is Not the Lie

To be clear about what this data shows: the ninety-one percent controlled benchmark is not misleading. It accurately describes the system's performance on the evaluation task. The problem is what gets amplified in marketing copy versus what gets buried in the methodology section.

When a memory system is evaluated on a static test set with known ground truth, it looks like a different product than it is in continuous production. The evaluation is easier -- not because the vendor cheated, but because production is harder than any benchmark captures. Every memory system faces this gap to some degree. The question is how honestly it gets communicated and how much it matters for the specific workload.

For developers building agents that need to stay reliable over weeks and months, the thirty-day production number is the one that determines whether the system makes it into the stack.

LoreConvo stores what you structure and commit, surfaces what you search, and keeps your memory layer free of automatically inferred contradictions. Free tier includes fifty sessions. Pro tier adds hybrid semantic search and unlimited sessions.

Find it at labyrinthanalyticsconsulting.com/tools.

What Is Agentic Workflow Consulting? A Practical Guide for Data Leaders

Debbie Shapiro — Fri, 05 Jun 2026 03:36:03 +0000

The Term Everyone Uses and Nobody Defines

Your CTO came back from a conference and said the team needs to "go agentic." A vendor pitched you an "agentic data platform" last week. LinkedIn is full of posts about agentic workflows transforming everything from customer support to supply chain management.

And yet, when you ask three people what "agentic" actually means for your data operations, you get four answers.

This is not a vocabulary problem. It is a strategy problem. Organizations are making six-figure decisions about agentic AI without a shared definition of what they are buying, building, or hiring for. That gap between the buzzword and the architecture is where most projects fail -- not because the technology does not work, but because nobody agreed on what it was supposed to do.

This guide is a practitioner's attempt to close that gap. No vendor pitch, no hand-waving. Just a clear definition, a real example, and a framework for deciding whether agentic workflow consulting is something your team actually needs.

What "Agentic" Actually Means (In Plain Language)

Traditional data pipelines are deterministic. You define steps, connect them in order, and run them. Step A feeds step B, which feeds step C. If the input changes shape, the pipeline breaks and a human fixes it. The pipeline does not adapt, reason, or make decisions -- it executes.

Robotic process automation (RPA) is slightly smarter but still scripted. It records human actions and replays them. Click here, type there, move this file. When the UI changes or an edge case appears, the bot breaks the same way a pipeline breaks: it stops and waits for a human.

Agentic workflows are fundamentally different. An agentic system has components that can reason about their task, make decisions based on context, and take actions without a pre-scripted path for every scenario. Instead of "if X then Y," an agentic node can evaluate ambiguous input, choose between approaches, validate its own output, and route work to the appropriate next step -- including flagging a human when confidence is low.

The practical difference shows up in how the system handles the unexpected. A traditional ETL pipeline encountering a CSV with a new column name will fail. An agentic pipeline can examine the new column, infer its meaning from context, map it to the correct destination field, and log the decision for human review later.
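The new-column case can be sketched in a few lines. In this sketch the inference step is a stand-in: it uses string similarity from Python's standard `difflib` module where a real agentic node would presumably call a model. The mapping table, field names, and log format are all hypothetical.

```python
import difflib

# Hypothetical known mappings and destination schema.
KNOWN_MAPPINGS = {"txn_date": "transaction_date", "amt": "amount", "acct": "account_id"}
DESTINATION_FIELDS = ["transaction_date", "amount", "account_id"]


def infer_field(column):
    """Stand-in for model-based inference: closest destination field by string similarity."""
    normalized = column.lower().replace(" ", "_")
    matches = difflib.get_close_matches(normalized, DESTINATION_FIELDS, n=1, cutoff=0.6)
    return matches[0] if matches else None


def map_columns(columns):
    """Return (mapping, review_log). Unknown columns are inferred and logged, or left unmapped."""
    mapping, review_log = {}, []
    for column in columns:
        if column in KNOWN_MAPPINGS:
            mapping[column] = KNOWN_MAPPINGS[column]
            continue
        guess = infer_field(column)
        if guess is None:
            review_log.append(f"UNMAPPED: {column!r} needs a human decision")
        else:
            mapping[column] = guess
            review_log.append(f"INFERRED: {column!r} -> {guess!r}")
    return mapping, review_log


mapping, review_log = map_columns(["amt", "transaction_dt", "memo"])
print(mapping)
print(review_log)
```

The important design choice is the log: every inferred mapping is recorded so a person can review it later, and a column with no plausible match is left unmapped and flagged for review.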

This is not artificial general intelligence. It is not a chatbot strapped to a database. It is a specific architectural pattern where autonomous components handle ambiguity, validate their own work, and collaborate with humans at defined checkpoints. That pattern is what consultants mean -- or should mean -- when they say "agentic workflow."

Where Agentic Workflows Solve Real Problems

Not every data problem benefits from an agentic approach. The pattern earns its complexity when your data operations share certain characteristics.

The first is source diversity. When you are pulling from seven different systems -- brokerage APIs, retirement account feeds, real estate management platforms, budgeting tools, crypto exchanges, tax document portals, and manual configuration spreadsheets -- the integration surface is enormous. Each source has its own format, its own error modes, and its own idea of what a "transaction" looks like. Traditional pipelines handle this with brittle transformation logic that breaks whenever a source changes its output format. Agentic components can absorb some of that variation by reasoning about the data rather than relying entirely on hardcoded mappings.

The second is validation complexity. When the cost of a wrong number is high -- tax calculations, financial reporting, regulatory submissions -- you need more than unit tests. You need independent verification where one process produces a result and a separate process checks it from a different angle. This is the maker-checker pattern: code generates a calculation, an LLM independently verifies it, and disagreements get flagged for human review. It catches the errors that deterministic validation misses because it can reason about whether a number "makes sense" in context, not just whether it matches a formula.

The third is decision branching. When your pipeline needs to route work differently based on data content -- this transaction is a stock sale, that one is a dividend, this one requires a different tax treatment -- the decision tree grows faster than you can hard-code it. Agentic nodes can evaluate each item against a set of criteria and choose the appropriate processing path, reducing the combinatorial explosion of if-else branches.

What This Looks Like in Practice

Abstract descriptions only go so far. Here is what an agentic workflow looks like as a real system, built for a real problem.

The problem: seven disconnected financial data sources needed to produce IRS-ready tax schedules, a retirement portfolio dashboard, and filled PDF forms. The old process involved one human, spreadsheets, phone calls to accountants, and weeks of manual reconciliation every tax season.

The solution: a 19-node LangGraph pipeline. LangGraph is a framework for building stateful, multi-step AI workflows as directed graphs. Each node in the graph represents a processing step -- ingestion, transformation, validation, output generation -- and the graph structure defines how data flows between them.

The architecture breaks into four layers. The ingestion layer connects to each data source through dedicated adapters that normalize raw data into a common format. The transformation layer uses dbt with 13 models and 58 tests to reshape data for analysis -- this is traditional, deterministic data engineering, and it should be. Not everything needs to be "agentic." The validation layer is where the agentic pattern earns its keep: maker-checker nodes where code-generated results are independently verified by LLM-based checkers, with disagreements routed to human review. The output layer generates the final artifacts -- tax schedules, dashboards, filled forms -- from the validated, transformed data.

The maker-checker validation is worth examining closely. When the pipeline calculates a capital gains figure, the calculation node produces a number based on cost basis, sale price, and holding period. A separate checker node receives the same raw transaction data and independently estimates what the capital gains should be. If the two numbers agree within a defined tolerance, the result passes through. If they disagree, the transaction gets flagged with both numbers and the raw data, and a human makes the final call.
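The tolerance comparison described above can be sketched as follows. This is a simplified illustration, not the pipeline's actual code: the function names, the one-cent tolerance, and the `Verdict` type are assumptions, the holding period is left out, and the checker's estimate is passed in as a plain value where the real system would obtain it from a separate checker node.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Verdict:
    status: str            # "pass" or "needs_review"
    maker_value: Decimal
    checker_value: Decimal


def maker_capital_gain(sale_price, cost_basis, quantity):
    """Deterministic calculation: (sale price - cost basis) per share, times quantity."""
    return (sale_price - cost_basis) * quantity


def compare(maker_value, checker_value, tolerance=Decimal("0.01")):
    """Pass when the two independent figures agree within an absolute tolerance."""
    agrees = abs(maker_value - checker_value) <= tolerance
    return Verdict("pass" if agrees else "needs_review", maker_value, checker_value)


gain = maker_capital_gain(Decimal("150.00"), Decimal("100.00"), 10)

agreed = compare(gain, Decimal("500.00"))     # checker reached the same figure
disputed = compare(gain, Decimal("430.00"))   # checker disagrees, e.g. a different lot

print(agreed.status, disputed.status)
```

A disputed verdict carries both numbers, so the reviewer sees what each side produced rather than a bare failure flag.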

This is not about distrusting the code. It is about catching the edge cases that deterministic logic misses -- wash sales, lot selection ambiguities, corporate actions that change cost basis in non-obvious ways. The pattern caught real errors that would have cost real money.

The result: what used to take weeks of manual work now runs in hours. Not because a single magical AI replaced the human, but because the architecture broke the problem into components where each one handles what it does best -- deterministic code for calculations, LLM reasoning for validation, human judgment for the ambiguous cases.

When You Need External Help (And When You Do Not)

Agentic workflow consulting exists because there is a gap between understanding the concept and shipping a production system. But not every team needs to hire for it. Here is an honest framework.

You probably do not need a consultant if your team has built production AI systems before (not just prototypes), your data sources are few and well-structured, your validation requirements are standard, and your timeline is flexible enough for learning curves. In that case, the frameworks are well-documented, the patterns are established, and a senior engineer with AI experience can figure it out.

You probably do need external help if your team is strong in traditional data engineering but has not shipped AI-augmented pipelines to production. The gap between "I built a chatbot demo" and "this runs unattended at 2 AM with financial data" is wider than it looks. It is not about intelligence -- it is about knowing where the failure modes hide. Which validation patterns catch which errors. How to structure state management so the pipeline recovers gracefully from partial failures. How to set up human review gates that actually get used instead of becoming bottlenecks.

You also need help when the stakes are high and the timeline is tight. Financial data, healthcare data, regulated industries -- these are not environments where you want to learn agentic patterns by trial and error on production data.

The honest truth is that most teams fall somewhere in the middle. They have strong data engineering foundations but have not navigated the specific complexities of agentic architecture in production. A consultant who has done it before can compress months of iteration into weeks, not by doing the work for you, but by steering you away from the dead ends.

How to Evaluate Whether a Consultant Actually Knows This

The agentic AI space is new enough that credentials are unreliable. Certifications do not exist in any meaningful sense. So evaluation falls on you.

Ask for production examples, not demos. Anyone can build a prototype that chains three API calls together and calls it "agentic." Production systems handle failures, validate outputs, manage state across runs, and operate without a human watching. Ask what happens when a data source goes down mid-pipeline. Ask how validation errors are surfaced. Ask what the monitoring looks like.

Ask about validation methodology. If a consultant is building systems that make decisions with your data, they should have a clear answer for how those decisions get verified. The maker-checker pattern is one approach. There are others. The red flag is not which pattern they use -- it is whether they have one at all.

Ask about the handoff. A good engagement does not create permanent dependency. You should end with documentation, trained team members, and a system your engineers can maintain and extend. If the consultant's pitch implies they will run the system forever, that is a service contract, not consulting.

Ask what they would not automate. Experienced practitioners know where the boundaries are. Some decisions should stay with humans. Some data transformations are better handled by deterministic code than by AI reasoning. A consultant who wants to make everything "agentic" does not understand the pattern well enough to know when it does not apply.

Making the Decision

Agentic workflow consulting is not a product category you browse on a marketplace. It is a specific kind of expertise -- production AI architecture applied to data operations -- that some teams need and others do not.

If you are evaluating whether your organization needs this kind of help, the clearest signal is whether your data challenges involve the three characteristics above: source diversity that overwhelms brittle integrations, validation requirements that exceed what deterministic testing can catch, and decision complexity that grows faster than you can hard-code.

If those describe your situation, the investment in getting the architecture right the first time is almost always cheaper than building the wrong thing and iterating toward something that works.

If you want to talk through your specific situation -- whether that leads to an engagement or just a clearer picture of what you need -- reach out. We also publish detailed case studies showing how these patterns work in practice.

Building Your First LangGraph Pipeline: A Decision-Maker's Guide

Debbie Shapiro — Sun, 31 May 2026 20:47:15 +0000

LangGraph is becoming the default framework for teams building agentic AI workflows. That is both a good thing and a problem.

The good part: it has real production pedigree, is actively maintained, and is used by teams doing serious work. The problem is that its growing reputation means a lot of teams are reaching for it by default -- before they have checked whether their problem actually calls for a graph-based orchestration framework rather than something simpler.

This post is not a tutorial. If you want to understand how to wire up nodes, edges, and state management in code, the official documentation covers that. What this guide addresses is the strategic decision. It covers four things: what LangGraph is and why it is the right architecture for some problems and not others, what patterns experienced teams design before they touch the code, where pipelines fail in production, and what to look for if you bring in outside expertise for LangGraph consulting work.

The underlying question is not "how do I build a LangGraph pipeline?" It is "should I, and if so, how do I build one that actually works once it leaves the notebook?"

What LangGraph actually is

LangGraph is a framework for building stateful, multi-step AI workflows where the logic is organized as a graph: a set of nodes (units of work) connected by edges (routing logic). Each node receives state, does something, and returns updated state. The edges determine what happens next -- whether that means a fixed sequence, a conditional branch based on intermediate results, or a loop that repeats until some condition is met.
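The node, edge, and state concepts can be illustrated without the framework. The sketch below is plain Python, not LangGraph's API: the `NODES` and `EDGES` tables and the `run` function are hypothetical stand-ins that show a fixed edge, a conditional edge, and a loop.

```python
def draft(state):
    """Node: produce a draft and count the attempt."""
    attempts = state["attempts"] + 1
    return {**state, "attempts": attempts, "draft": f"draft v{attempts}"}


def review(state):
    """Node: accept the draft once it has been attempted at least twice."""
    return {**state, "approved": state["attempts"] >= 2}


NODES = {"draft": draft, "review": review}

# Each edge is a function that picks the next node name from the current state.
EDGES = {
    "draft": lambda state: "review",                                  # fixed sequence
    "review": lambda state: "END" if state["approved"] else "draft",  # branch and loop
}


def run(entry, state, max_steps=20):
    """Walk the graph from the entry node until an edge returns END."""
    current, path = entry, []
    for _ in range(max_steps):
        if current == "END":
            return state, path
        path.append(current)
        state = NODES[current](state)    # node: state in, updated state out
        current = EDGES[current](state)  # edge: decide what happens next
    raise RuntimeError("graph did not terminate within max_steps")


final_state, path = run("draft", {"attempts": 0, "approved": False})
print(path)
print(final_state)
```

The `max_steps` guard is there because any graph with a loop needs a bound on how long it can run.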

The concept that distinguishes LangGraph from simpler patterns is state management. When you have a single AI call, state management is trivial: you pass in a prompt and get back a response. When you have ten AI calls that depend on each other, where some of them route conditionally based on prior outputs, and where you need to be able to resume from any point if something fails -- state management becomes the hard part of the design. LangGraph provides a structure for handling that complexity without building it from scratch.

Two other features matter practically. Checkpointing lets you persist state to storage at any point in the graph execution, so an interrupted run can resume from where it stopped rather than starting over. Human-in-the-loop integration lets you pause execution at defined points and wait for a human decision before continuing. Both features are difficult to build correctly from scratch and are essential for production agentic systems.
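The idea behind checkpointing can be shown with a file and a list of completed steps. This is a conceptual sketch in plain Python, not LangGraph's checkpointer interface. The step names, the JSON file, and the `fail_at` parameter used to simulate an interruption are all assumptions.

```python
import json
import tempfile
from pathlib import Path

STEPS = ["extract", "classify", "validate", "output"]  # hypothetical node order


def run_with_checkpoints(state, checkpoint_path, fail_at=None):
    """Run steps in order, saving state after each; skip steps already recorded as done."""
    if checkpoint_path.exists():
        state = json.loads(checkpoint_path.read_text())  # resume from saved state
    for step in STEPS:
        if step in state["completed"]:
            continue
        if step == fail_at:
            raise RuntimeError(f"simulated failure in {step}")
        state["completed"].append(step)
        checkpoint_path.write_text(json.dumps(state))    # persist after each step
    return state


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "run-001.json"
    try:
        run_with_checkpoints({"completed": []}, checkpoint, fail_at="validate")
    except RuntimeError as error:
        print(f"interrupted: {error}")
    saved = json.loads(checkpoint.read_text())
    # The second call ignores its fresh state and picks up where the first stopped.
    resumed = run_with_checkpoints({"completed": []}, checkpoint)

print(saved["completed"], resumed["completed"])
```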

When LangGraph makes sense -- and when it does not

LangGraph has meaningful overhead. It is a framework that adds structure, and structure is only worth the cost when the problem requires it.

LangGraph makes sense when the decision logic at one step depends on the output of previous steps in ways you cannot prespecify, when you have multiple AI calls that share state and produce outputs that feed into each other, when you need human review gates at specific points in the pipeline, or when your workflow needs to adapt its path through the logic based on what it finds at runtime. If those characteristics describe your problem, the graph abstraction is earning its keep.

The comparison to Airflow and Prefect is instructive because teams sometimes assume they are alternative solutions to the same problem. They are not. Airflow and Prefect excel at deterministic workflows at scale: the same inputs always produce the same outputs through the same steps, and the structure is fully known at the time you write the code. If your workflow is deterministic and the structure is static, those tools are better suited to it -- they are faster to operate, cheaper to run, and easier to debug.

Plain Python is often the right answer for simpler agentic work. A single AI call that classifies an input and routes it down one of three paths does not need LangGraph. Adding a framework with state management, edge routing, and checkpointing to a workflow that is essentially a function with a few conditional branches is overhead without benefit. The honest question to ask before committing to a graph framework is: am I adding this because my problem requires it, or because I have seen it in tutorials and it feels like the modern approach?
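For comparison, here is roughly what the "function with a few conditional branches" version looks like. The `classify` function is a keyword-based stand-in for the single model call, and the labels and handlers are hypothetical.

```python
def classify(ticket):
    """Stand-in for a single model call that returns one of three labels."""
    text = ticket.lower()
    if "refund" in text:
        return "billing"
    if "error" in text or "crash" in text:
        return "technical"
    return "general"


def handle_billing(ticket):
    return f"billing queue: {ticket}"


def handle_technical(ticket):
    return f"engineering queue: {ticket}"


def handle_general(ticket):
    return f"support queue: {ticket}"


HANDLERS = {
    "billing": handle_billing,
    "technical": handle_technical,
    "general": handle_general,
}


def route(ticket):
    """One classification, one dictionary lookup, one call. No framework involved."""
    return HANDLERS[classify(ticket)](ticket)


print(route("Please refund my order"))
print(route("App crash on login"))
print(route("What are your hours?"))
```

Nothing here shares state across calls, loops, or needs to resume after a failure, which is why a graph framework would add little.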

Architecture patterns that determine success

Before writing any code, experienced teams map out three things: the graph's state schema, the edge routing logic, and the points where human review is required. Getting these right in design prevents the most expensive mistakes in production.

The state schema is the shared context that flows between nodes. Every node reads from state and writes to state. If the schema grows without bound -- if each node appends data without pruning what is no longer needed -- the graph becomes slow and expensive as it processes longer pipelines. The symptom appears gradually: early test runs are fast, but production runs against real data become sluggish in ways that are hard to attribute. Experienced teams design state to be minimal: each node gets exactly what it needs, writes exactly what downstream nodes will use, and discards intermediate data that served its purpose.
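A minimal sketch of that discipline, using a plain `TypedDict` as the state schema: the node below writes the fields downstream nodes use and drops the raw rows once they have been summarized. The field names are hypothetical, and the sketch assumes no later node needs the raw rows.

```python
from typing import TypedDict


class PipelineState(TypedDict, total=False):
    source: str
    raw_rows: list        # large; only needed until it is summarized
    row_count: int
    total_amount: float


def summarize(state: PipelineState) -> PipelineState:
    """Write what downstream nodes use and drop the raw rows that served their purpose."""
    rows = state["raw_rows"]
    return {
        "source": state["source"],
        "row_count": len(rows),
        "total_amount": sum(row["amount"] for row in rows),
    }


before: PipelineState = {
    "source": "brokerage",
    "raw_rows": [{"amount": 10.0}, {"amount": 32.5}],
}
after = summarize(before)
print(after)
```

Returning a new, smaller dictionary instead of appending to the old one is what keeps state from growing with every node.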

Edge routing logic determines how the graph moves between nodes. Static edges are simple: node A always goes to node B. Conditional edges route based on the state at that point -- if the checker node found a discrepancy, route to the human review node; if maker and checker agreed, proceed to output. The routing logic needs to be explicit in the design before it gets encoded in the graph, because conditional routing errors tend to surface only in production when the specific conditions that trigger them finally occur.
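One way to make routing explicit before it is encoded in a graph is to write it as a pure function and enumerate every condition against it. The sketch below is an assumption-laden illustration: the state keys, the node names, and the `error_recovery` branch are hypothetical.

```python
def route_after_check(state):
    """Conditional edge: choose the next node name from the checker's findings."""
    if state.get("checker_error"):
        return "error_recovery"
    if state["maker_value"] is None or state["checker_value"] is None:
        return "human_review"  # a missing figure is never auto-approved
    if abs(state["maker_value"] - state["checker_value"]) > state["tolerance"]:
        return "human_review"
    return "output"


# Enumerating the cases at design time, instead of waiting for production to find them.
CASES = [
    ({"maker_value": 500.0, "checker_value": 500.0, "tolerance": 0.01}, "output"),
    ({"maker_value": 500.0, "checker_value": 430.0, "tolerance": 0.01}, "human_review"),
    ({"maker_value": 500.0, "checker_value": None, "tolerance": 0.01}, "human_review"),
    ({"maker_value": 500.0, "checker_value": 500.0, "tolerance": 0.01,
      "checker_error": "timeout"}, "error_recovery"),
]

results = [route_after_check(state) for state, _ in CASES]
print(results)
```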

Human review gates are the third design decision that most tutorials skip. Production agentic systems need to know when to stop and wait for a human rather than proceeding automatically. Getting this right requires thinking through a set of decisions upfront: what conditions trigger a human review request, what information does the reviewer see, what actions can they take, and how does their decision feed back into the graph execution. Treating human review as an afterthought -- something to bolt on once the automation is working -- almost always means redesigning significant portions of the graph.
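Those four questions map onto a small data structure and one function. The following is a sketch under stated assumptions: `ReviewRequest`, the three allowed actions, and the state keys are invented for illustration and are not taken from any framework.

```python
from dataclasses import dataclass

# What the reviewer is allowed to do (hypothetical action names).
ALLOWED_ACTIONS = ("accept_maker", "accept_checker", "override")


@dataclass
class ReviewRequest:
    """What the reviewer sees: why the item was flagged, both results, and the inputs."""
    transaction_id: str
    reason: str
    maker_value: float
    checker_value: float
    inputs: dict


def apply_decision(state, request, action, override_value=None):
    """Feed the reviewer's decision back into state so the graph can continue."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if action == "override" and override_value is None:
        raise ValueError("override requires a value")
    value = {
        "accept_maker": request.maker_value,
        "accept_checker": request.checker_value,
        "override": override_value,
    }[action]
    return {**state, "final_value": value, "reviewed": True, "review_action": action}


request = ReviewRequest(
    transaction_id="T-1",
    reason="maker and checker disagree beyond tolerance",
    maker_value=500.0,
    checker_value=430.0,
    inputs={"sale_price": 150.0, "cost_basis": 100.0, "quantity": 10},
)
state = apply_decision({"transaction_id": "T-1"}, request, "accept_checker")
print(state)
```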

A real architecture: the 19-node financial pipeline

The LangGraph pipeline we built for a financial data client illustrates these patterns in practice. It processes transactions across seven data sources through a 19-node graph, running unattended against live data.

The graph is organized in layers. An extraction layer pulls data from each source and normalizes it into a common schema. A classification layer determines the transaction type, applicable tax jurisdiction, and relevant accounting rules -- this is where ambiguity in source data gets resolved through AI reasoning rather than hard-coded rules. A validation layer applies a maker-checker pattern: a deterministic maker node calculates a result using the classified rules, and an independent checker node reads the same inputs and assesses whether the result is correct.

When maker and checker agree, the result proceeds automatically. When they disagree, the transaction is flagged and routed to a human reviewer with both results and the specific inputs that produced the disagreement. The reviewer sees exactly what the system saw, makes a decision, and the graph continues from that point.

This pattern has caught errors that deterministic testing could not. In one production case, the checker flagged a tax calculation where the maker was applying the correct formula for the wrong jurisdiction. The code passed all existing tests -- the formula was correctly implemented. The error was in the classification step upstream: the transaction's characteristics did not match the assumed jurisdiction context. The checker recognized the mismatch and routed it for human review before the incorrect result reached the output layer. That is not an edge case you can write a test for in advance. It is the category of failure that makes agentic validation valuable.

Where production pipelines fail

Most LangGraph pipelines that fail in production do so in predictable ways, and understanding them in advance is more useful than encountering them after the fact.

State explosion happens when the graph accumulates data without pruning. Long-running pipelines that append intermediate results to state without removing what they no longer need become slow and expensive. The fix requires explicit state lifecycle management in the design -- not as a performance optimization added later, but as a first-class concern from the start. Production data volumes will expose problems that development test cases do not.

Missing error boundaries mean that a single failing node can crash the entire graph. In a 19-node pipeline, if node 7 raises an uncaught exception, you want the graph to handle it gracefully: log the failure, route to an error recovery path, and surface the problem without losing the state of the nodes that completed successfully. Building error boundaries into each node is straightforward but tedious, and it is consistently underestimated in initial implementations. Teams that skip it pay for it the first time a recoverable error cascades into a complete pipeline restart.
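A per-node error boundary can be as small as a decorator. This sketch assumes nodes are plain functions that take and return a state dictionary. The `errors` and `next` keys and the `error_recovery` name are hypothetical conventions, not part of any framework.

```python
import functools
import logging

logger = logging.getLogger("pipeline")


def error_boundary(node):
    """Wrap a node so an exception becomes an error entry in state instead of a crash."""
    @functools.wraps(node)
    def wrapped(state):
        try:
            return node(state)
        except Exception as exc:
            logger.exception("node %s failed", node.__name__)
            return {
                **state,  # keep everything earlier nodes produced
                "errors": state.get("errors", []) + [f"{node.__name__}: {exc}"],
                "next": "error_recovery",
            }
    return wrapped


@error_boundary
def parse_amount(state):
    """Node: convert the raw amount string to a float."""
    return {**state, "amount": float(state["raw_amount"])}


ok = parse_amount({"raw_amount": "12.50"})
failed = parse_amount({"raw_amount": "n/a", "completed": ["extract"]})
print(ok)
print(failed["next"], failed["errors"])
```

The failed call still returns the `completed` list from earlier nodes, which is the point: one bad record does not force a full restart.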

The absence of a validation layer is the most expensive mistake. Teams that build without a checker -- where the AI is the only node producing a result, and that result is accepted automatically -- have built a system with no mechanism to catch model errors. A production pipeline that accepts AI-generated outputs without independent verification is not a production system; it is a prototype running on live data. The checker does not have to be an LLM call. Statistical sampling, deterministic rule checks, and threshold-based flagging are all legitimate approaches. The requirement is that something other than the maker is assessing whether the output is correct.
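As an example of a checker that is not an LLM call, here is a sketch of deterministic rule checks applied to a maker's output. The specific rules and field names are illustrative assumptions, not rules from the pipeline described in this post.

```python
def rule_check(result):
    """Return a list of rule violations for one maker result; an empty list means no flag."""
    problems = []
    if result["tax_due"] < 0:
        problems.append("tax_due is negative")
    if result["gain"] > 0 and result["tax_due"] > result["gain"]:
        problems.append("tax_due exceeds the gain it is based on")
    if result["holding_days"] < 0:
        problems.append("holding_days is negative")
    return problems


def needs_review(results):
    """Map each flagged transaction id to the rules it violated."""
    flagged = {}
    for result in results:
        problems = rule_check(result)
        if problems:
            flagged[result["id"]] = problems
    return flagged


maker_results = [
    {"id": "T-1", "gain": 500.0, "tax_due": 75.0, "holding_days": 400},
    {"id": "T-2", "gain": 200.0, "tax_due": 260.0, "holding_days": 30},
    {"id": "T-3", "gain": -50.0, "tax_due": -5.0, "holding_days": 12},
]
flagged = needs_review(maker_results)
print(flagged)
```

The checks are crude, but they are independent of the code that produced the numbers, which is the requirement.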

Inadequate monitoring is where most teams underinvest. A monitoring setup that tells you the pipeline ran without errors does not tell you whether it produced correct results. Accuracy drift -- where the model's outputs become systematically wrong over time without any technical failure -- is one of the hardest problems to detect in production AI systems. Monitoring for it requires ground truth comparisons, sampling strategies, and alerting on output distributions, not just on runtime errors.
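A minimal sketch of accuracy monitoring by sampling: compare a reviewed sample of outputs against ground truth each week and alert when accuracy falls too far below a baseline. The baseline, the allowed drop, and the data shapes are assumptions chosen for illustration.

```python
def sampled_accuracy(outputs, ground_truth):
    """Share of sampled items whose output matches the human-reviewed ground truth."""
    checked = [key for key in ground_truth if key in outputs]
    if not checked:
        return None  # nothing was sampled, so there is nothing to conclude
    correct = sum(outputs[key] == ground_truth[key] for key in checked)
    return correct / len(checked)


def drift_alerts(weekly_accuracy, baseline, max_drop=0.05):
    """Return the weeks where accuracy fell more than max_drop below the baseline."""
    return [
        week
        for week, accuracy in weekly_accuracy.items()
        if accuracy is not None and baseline - accuracy > max_drop
    ]


truth = {"a": "stock_sale", "b": "dividend", "c": "stock_sale", "d": "interest"}
week1 = {"a": "stock_sale", "b": "dividend", "c": "stock_sale", "d": "interest"}
week2 = {"a": "stock_sale", "b": "stock_sale", "c": "stock_sale", "d": "interest"}

weekly = {
    "week1": sampled_accuracy(week1, truth),
    "week2": sampled_accuracy(week2, truth),
}
alerts = drift_alerts(weekly, baseline=0.95)
print(weekly, alerts)
```

Both weeks would show up as successful runs in an error-only dashboard. Only the comparison against ground truth reveals that the second week was wrong a quarter of the time.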

What to look for in a LangGraph consultant

The market for LangGraph consulting is new enough that the gap between "has built demos" and "has shipped production systems" is large, and it is not always visible from the outside.

Ask for a specific production system, not a proof of concept. What was the input volume? How many nodes? What failure modes did they encounter and how did they handle them? How do they monitor for accuracy over time, not just uptime? Practitioners who have shipped production LangGraph pipelines have specific, unglamorous answers to these questions. Those who have not will give you architecture diagrams and API descriptions.

Ask about validation methodology. A team that built a LangGraph pipeline with no checker has not solved the hard part of the problem. The question to ask directly is: how do you verify that the pipeline is producing correct results, not just running without errors? The specific approach matters less than the fact that they have one and have tested it in production.

Ask when they would not recommend LangGraph. Anyone who reaches for a graph framework regardless of the problem has not thought carefully enough about the architecture decision. The honest answer involves specific scenarios -- deterministic workflows at scale, simple conditional routing, single-stage AI calls -- where a simpler tool is faster to build, cheaper to operate, and easier to debug. A consultant who cannot articulate those scenarios is optimizing for a tool they know rather than for your problem.

Getting started

If you are evaluating LangGraph for a real pipeline -- not a demo, but a system you expect to run in production against real data -- the most useful starting point is a structured conversation about the problem architecture before committing to an implementation approach. The framework choice follows from the problem requirements, not the other way around.

Labyrinth Analytics has built LangGraph pipelines in production for financial data workflows with complex validation requirements and human-in-the-loop review gates. If you want to see what that looks like in practice, the work section has case studies with real architecture details. If you want to talk through your specific situation before deciding on an approach, get in touch.

Labyrinth Analytics Consulting builds and advises on agentic data workflows, LangGraph pipelines, and AI-assisted data operations. Questions? info@labyrinthanalyticsconsulting.com