DEV Community: Paulo Victor Leite Lima Gomes

aws finops agent makes cloud cost a runtime problem

Paulo Victor Leite Lima Gomes — Wed, 17 Jun 2026 00:01:47 +0000

Cloud cost management has always had a strange emotional profile.

Everyone agrees it matters. Almost nobody wants to do it. The dashboard is there. The reports exist. The recommendations are technically available. The finance team asks reasonable questions. The engineering team says it will take a look after the launch. Then the launch becomes two launches, the old environment is still running, the database class is still too large, and the mystery line item in the bill becomes a recurring calendar invite.

This is why AWS FinOps Agent caught my attention.

AWS FinOps Agent is still in preview, so I would not build a religion around it. But the shape is important. AWS describes an agent that can answer cost questions, surface optimization opportunities, investigate anomalies, run recurring FinOps workflows, generate reports, open Jira tickets, and post findings to Slack.

That is not just "chat with your cloud bill."

That is cost management moving from dashboard archaeology to operational workflow.

And the moment an agent can turn a cost recommendation into a ticket, a Slack message, or a recurring investigation, cloud cost stops being only a reporting problem.

It becomes a runtime problem.

dashboards were never enough

The cloud cost dashboard is a classic enterprise compromise.

It gives everyone a place to point. It rarely gives anyone enough momentum to act.

A dashboard can tell you spend went up. It can show that a service is idle, that a cluster is oversized, that a Savings Plan might help, or that storage grew faster than expected. That is useful information. But between "useful information" and "somebody changed production safely" there is a lot of missing work.

Someone has to decide whether the recommendation is valid.

Someone has to know which team owns the resource.

Someone has to understand whether the workload has a weird traffic pattern, a compliance constraint, a migration in progress, or a customer promise hidden in a Slack thread.

Someone has to open the ticket, chase the owner, make the change, verify the result, and explain to finance why the saving is real or why it is not.

That gap is why FinOps is not just analytics. It is operations.

An agent is interesting here because it can sit in the gap. It can connect the cost signal to the workflow where engineering teams actually make decisions.

That is the useful version.

The dangerous version is the same agent confidently generating noise at scale.

recommendations are not decisions

Cloud optimization tools have always had a translation problem.

"This instance is underutilized" is not the same as "resize it now."

"This database looks idle" is not the same as "delete it."

"A Savings Plan might reduce spend" is not the same as "commit the company to this usage shape."

The recommendation is a clue. The decision needs context.

AWS says FinOps Agent can pull from Cost Optimization Hub and Compute Optimizer, generate reports, surface rightsizing, idle-resource, and Savings Plans recommendations, and create Jira tickets from those recommendations. That is exactly where the line matters.

Opening a ticket is fine. It can be useful. The ticket should contain the evidence, expected saving, owner, affected resources, confidence level, and the reason the agent believes the recommendation is safe enough to review.
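To make that list concrete, here is a minimal sketch of a ticket payload in Python. The `CostTicket` class, its field names, and the confidence scale are my own assumptions for illustration. They are not a schema from AWS or Jira.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CostTicket:
    """Hypothetical ticket payload; field names are illustrative only."""
    title: str
    evidence_links: List[str]
    expected_monthly_saving_usd: float
    owner: str
    affected_resources: List[str]
    confidence: str        # assumed scale: "low", "medium", "high"
    safety_rationale: str  # why this is safe enough to review

    def missing_fields(self) -> List[str]:
        """Return the review-critical fields that are still empty."""
        required = {
            "evidence_links": self.evidence_links,
            "owner": self.owner,
            "affected_resources": self.affected_resources,
            "safety_rationale": self.safety_rationale,
        }
        return [name for name, value in required.items() if not value]


ticket = CostTicket(
    title="Rightsize staging database",
    evidence_links=[],
    expected_monthly_saving_usd=120.0,
    owner="team-payments",
    affected_resources=["staging-db-1"],
    confidence="medium",
    safety_rationale="",
)
print(ticket.missing_fields())
```

A check like `missing_fields()` could run before the ticket is created, so an incomplete recommendation never reaches a reviewer.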

But the ticket is not the decision.

The decision belongs to whoever owns the service, the budget, or the operational risk.

This sounds obvious until teams start measuring the agent by how much work it creates. A FinOps agent that opens a hundred tickets is not automatically successful. It may have simply exported dashboard noise into Jira.

The better metric is boring: how many recommendations turned into safe, verified savings without creating operational incidents or review fatigue?

anomaly investigation is where this gets real

The most interesting part of the AWS description is not reports. It is anomaly investigation.

Cost anomalies are annoying because they are often urgent but ambiguous. Spend moved. Something changed. Maybe it is a product launch. Maybe it is an abuse pattern. Maybe a test environment leaked. Maybe a retry loop started. Maybe a data pipeline reprocessed more than expected. Maybe a team intentionally scaled something and forgot to tell anyone.

The first hour is usually context gathering.

Which account? Which service? Which region? Which tags? Which deployment happened around the same time? Which team owns it? Is the spend still growing? Is there a customer-facing impact? Is this actually abnormal for month-end, batch processing, or a marketing campaign?

That is good agent work if the boundaries are clear.

An agent that can collect evidence, summarize likely causes, and post findings to Slack can save time. The human does not need to start from a blank dashboard. The incident or FinOps channel gets a first pass with links, affected resources, and next actions.

But the agent needs to show its work.

If it says the root cause is a data pipeline, I want the query trail. If it says a deployment correlates with the spike, I want the deployment link. If it says the spend is limited to one account and one region, I want the exact filters. If it recommends pausing something, I want a human approval gate before action.

For cost anomalies, confidence without evidence is just a faster way to be wrong.
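Here is a small sketch of what "show its work" and "approval gate" could mean in code. Everything in it is hypothetical: the `Claim` type, the action names, and the set of actions that skip approval are assumptions I picked for the example, not anything AWS documents.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed set of actions the agent may take without a human approver.
NO_APPROVAL_ACTIONS = {"summarize", "post_to_slack", "open_ticket"}


@dataclass
class Claim:
    statement: str
    evidence: Optional[str] = None  # link to the query, deployment, or filter set


def unsupported(claims: List[Claim]) -> List[str]:
    """Return statements that arrive without any evidence attached."""
    return [c.statement for c in claims if not c.evidence]


def may_execute(action: str, approved_by: Optional[str] = None) -> bool:
    """Low-risk actions pass; anything else needs a named human approver."""
    return action in NO_APPROVAL_ACTIONS or bool(approved_by)


claims = [
    Claim("Spike is limited to one account and one region", "cost-explorer-filter-link"),
    Claim("Root cause is the nightly data pipeline"),
]
print(unsupported(claims))
print(may_execute("pause_pipeline"))
print(may_execute("pause_pipeline", approved_by="on-call engineer"))
```

The point of the design is that an unsupported claim is visible as a list entry, and a state-changing action cannot run on the agent's confidence alone.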

slack is not an accountability model

Posting findings to Slack is useful.

It is also not enough.

A Slack message is a notification. It is not ownership. It is not state. It is not proof that the work was done. It is not a durable record of why a cost decision was accepted or rejected.

The serious version of a FinOps agent needs a trail across systems:

the detected anomaly
the data sources used
the affected accounts, services, regions, and tags
the generated recommendation
the ticket or issue created
the owner assigned
the approval or rejection
the actual change
the measured impact after the change

Without that, the organization gets a more talkative cost dashboard.

With that, the agent becomes part of the operating system for cloud cost.
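The trail above can be sketched as an ordered list of stages plus a check for the first missing one. The stage keys and the `first_gap` helper are names I made up for illustration; a real system would map them to whatever its ticketing and change tools record.

```python
from typing import Dict, Optional

# One key per item in the trail, in the order the work should happen.
TRAIL_STAGES = [
    "anomaly", "data_sources", "scope", "recommendation",
    "ticket", "owner", "decision", "change", "measured_impact",
]


def first_gap(record: Dict[str, object]) -> Optional[str]:
    """Return the first stage with no recorded value, or None if the trail is complete."""
    for stage in TRAIL_STAGES:
        if not record.get(stage):
            return stage
    return None


record = {
    "anomaly": "storage spend up 40% week over week",
    "data_sources": ["cost report", "deployment log"],
    "scope": {"account": "prod-analytics", "region": "us-east-1"},
    "recommendation": "expire intermediate pipeline output after 7 days",
    "ticket": "COST-123",
}
print(first_gap(record))
```

In this example the trail stops at ownership: a ticket exists, but nobody has been assigned to it.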

This is where engineering and finance need to be careful. The agent should not become a way for finance to spray tickets at engineering. It should also not become a way for engineering to ignore cost because "the agent will tell us."

The agent is coordination infrastructure.

Coordination still needs ownership.

recurring workflows need budgets too

Recurring FinOps workflows sound great.

Every Monday, generate the cost report. Every day, inspect anomalies. Every week, find idle resources. Every month, check commitment coverage. This is the kind of work that benefits from automation because humans are bad at doing repetitive analysis with consistent patience.

But recurring agent work can quietly become another bill.

The agent uses compute. It calls APIs. It may use model tokens. It may query observability systems. It may open tickets that consume human review time. It may create work that looks productive but does not pay for itself.

So the agent itself needs FinOps discipline.

How much does the recurring workflow cost to run? How many useful actions did it produce? How many false positives? How many recommendations were already known? How much verified saving came from the workflow? Which reports are read by humans, and which ones are just ritual?

If the agent is supposed to reduce waste, it should not be exempt from the same question.

Is this work worth what it costs?
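That question is plain arithmetic once the inputs are tracked. The sketch below is one way to do it; the function name, the parameters, and the numbers in the usage example are all hypothetical.

```python
from typing import Dict


def workflow_summary(
    run_cost_usd: float,
    review_hours: float,
    hourly_rate_usd: float,
    verified_saving_usd: float,
    tickets_opened: int,
    tickets_accepted: int,
) -> Dict[str, float]:
    """Compare what a recurring workflow costs with what it verifiably saved."""
    total_cost = run_cost_usd + review_hours * hourly_rate_usd
    acceptance = tickets_accepted / tickets_opened if tickets_opened else 0.0
    return {
        "total_cost_usd": total_cost,
        "net_saving_usd": verified_saving_usd - total_cost,
        "acceptance_rate": acceptance,
    }


# Made-up month: 40 USD of agent runs, 5 hours of human review at 80 USD/hour,
# 1000 USD of verified savings, 3 of 12 tickets accepted.
print(workflow_summary(40.0, 5.0, 80.0, 1000.0, 12, 3))
```

Human review time is included on purpose. A workflow that is cheap to run but expensive to triage is still expensive.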

what i would build first

If I were introducing a FinOps agent inside a company, I would avoid the heroic version.

I would not start with "optimize all AWS spend."

I would start with one narrow loop:

one business unit
one set of tagged accounts
one class of recommendation
one ticket template
one Slack channel
one human approval path
one measured saving target

Idle non-production resources are a good candidate. Low-risk rightsizing recommendations might be another. Savings Plans recommendations are tempting, but I would treat commitment decisions as a higher-governance workflow because the failure mode is different.

The first goal should be to prove that the agent can turn a cost signal into an owned, reviewable, measurable action.

Not a pretty report.

An action.

Then I would measure the rejection reasons. That is where the system improves. If owners reject recommendations because tags are wrong, the platform problem is tagging. If they reject them because the agent lacks context about expected traffic, the problem is context. If they ignore them because there are too many tickets, the problem is workflow design.

The agent will expose the messy parts of your cloud operating model.

That is useful if you are willing to look.
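Measuring rejection reasons can start as a simple tally. This sketch assumes owners pick a reason from a short fixed list when they reject a ticket; the reason codes and the mapping to an underlying problem are my own illustration.

```python
from collections import Counter
from typing import List, Optional

# Hypothetical mapping from a rejection reason to the problem it points at.
ROOT_PROBLEM = {
    "wrong_tags": "tagging",
    "missing_traffic_context": "context",
    "too_many_tickets": "workflow design",
}


def top_problem(rejections: List[str]) -> Optional[str]:
    """Return the most common underlying problem, or None if nothing was rejected."""
    counts = Counter(ROOT_PROBLEM.get(reason, "unclassified") for reason in rejections)
    return counts.most_common(1)[0][0] if counts else None


print(top_problem(["wrong_tags", "too_many_tickets", "wrong_tags"]))
```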

the punchline

AWS FinOps Agent is a good signal for where cloud operations are going.

The dashboard is not disappearing, but the center of gravity is moving. Cost data is becoming something agents can reason over, schedule, summarize, route, and attach to operational workflows.

That can be genuinely useful. Cloud waste is real, anomaly response is tedious, and many cost recommendations die because nobody turns them into owned work.

But the useful version is not "let the agent manage the bill."

The useful version is an accountable workflow: evidence, ownership, approvals, ticket state, measured impact, and a clean boundary between recommendation and decision.

Cloud cost has always been a shared responsibility, which is a polite way of saying it often belongs to everyone and nobody at the same time.

Agents will not fix that by themselves.

They will make the missing ownership visible faster.

That is still progress.

references

To test my projects, I use Railway. If you want $20 USD to get started, use this link.

agent evaluation is becoming the new test pyramid

Paulo Victor Leite Lima Gomes — Tue, 16 Jun 2026 00:01:53 +0000

We are starting to rediscover testing, but with more tool calls.

AWS published a post last week about Agent-EvalKit, an open-source toolkit for evaluating AI agents. The interesting part is not that another eval framework exists. We have plenty of those already, and half of them seem to be born with a leaderboard attached.

The interesting part is the shape of the problem it admits.

For normal software, you can often test the output and learn something useful. Give a function an input. Check the return value. Mock the slow thing. Assert the behavior. Add a regression test when it breaks.

Agents make that much less satisfying.

An agent can produce the right-looking answer for the wrong reason. It can call the wrong tool, ignore an empty result, invent the missing value, and still write a beautifully formatted response. It can accidentally skip the verification step that made the workflow trustworthy. It can get lucky on the final answer while quietly doing something you would never want as a habit.

That is the annoying thing about agents: the answer is not the only behavior.

The path matters.

output tests are not enough

I understand why teams start with output tests.

They are familiar. They are cheap to explain. They map nicely to product expectations: the user asked this, the agent answered that, the answer was good or bad.

But agents are not just text generators once we give them tools. They become small distributed systems with a language model in the middle. They read state, choose tools, pass parameters, interpret responses, make follow-up calls, write files, update tickets, open pull requests, and sometimes decide that silence from a tool is enough evidence to continue.

If you only check the final response, you miss the important failure mode.

Imagine a travel agent that returns a neat itinerary with flights, weather, exchange rates, and attraction details. The final answer is readable. The structure is useful. The tone is confident.

Now inspect the trace and discover that the currency tool returned nothing, the weather lookup failed, and the agent filled the gaps from vibes.

The user-facing answer was not the test. It was the cover story.

This is why the AWS example is useful. Their demo agent had high response quality but terrible faithfulness. In plain English: it sounded good while making things up when tools returned empty or incomplete data.

That is exactly the kind of bug output-only testing will flatter.

the new unit is the trace

The next useful testing unit for agents is not the prompt, and it is not the final message.

It is the run.

A run contains the input, model messages, tool calls, tool outputs, intermediate state, final response, timing, failures, retries, and maybe cost. That is the thing you evaluate because that is the thing that actually happened.

This sounds heavier than a unit test because it is.

But we went through a version of this before. Unit tests were never enough for distributed systems. We added integration tests, contract tests, synthetic checks, tracing, canaries, chaos experiments, and production monitoring because the behavior we cared about lived between components.

Agents push us into the same place, just with a softer and more annoying component in the loop.

The model is not deterministic enough to treat like a normal function. The tools are not decorative enough to ignore. The prompt is not complete enough to be the whole spec. The final answer is not honest enough to be the whole evidence trail.

So the trace becomes the test artifact.

Did the agent call the right tool? Did it pass the right parameters? Did it notice when the tool returned empty data? Did it distinguish known facts from guesses? Did it use the cheaper path when the expensive one was unnecessary? Did it stop when policy said stop?

Those are test questions.

They just do not look like expect(result).toEqual(...).
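A trace-level check might look more like the sketch below. It assumes a run record where each final claim names the tool it came from, which is a structure I invented for the example; it is not the Agent-EvalKit format. The travel itinerary from earlier is the test case.

```python
from typing import Dict, List


def fabricated_claims(run: Dict) -> List[str]:
    """Return final claims attributed to a tool that returned empty output."""
    empty_tools = {call["tool"] for call in run["tool_calls"] if not call["output"]}
    return [
        claim["text"]
        for claim in run["final_claims"]
        if claim["source_tool"] in empty_tools
    ]


run = {
    "tool_calls": [
        {"tool": "flights", "output": {"price_usd": 420}},
        {"tool": "currency", "output": None},  # the tool returned nothing
        {"tool": "weather", "output": {}},     # the lookup failed
    ],
    "final_claims": [
        {"text": "Flights start at 420 USD", "source_tool": "flights"},
        {"text": "1 USD is about 0.9 EUR", "source_tool": "currency"},
        {"text": "Expect sunny weather", "source_tool": "weather"},
    ],
}
print(fabricated_claims(run))
```

The final answer could read perfectly well and this check would still flag two of its three claims.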

this is a platform feature

I do not think most product teams should build this from scratch.

That is not because they are incapable. It is because the work is tedious in exactly the way platform work is tedious: instrumentation, fixtures, synthetic cases, replay, trace storage, evaluator prompts, thresholds, reporting, CI integration, and enough history to see whether the agent got better or worse after a change.

You can absolutely hack together a notebook that scores a handful of examples.

That is not the same as an evaluation system.

An evaluation system needs to survive normal engineering life. Prompts change. Tools change. Models change. Schemas change. Product behavior changes. One team wants faithfulness. Another cares about tool parameter accuracy. Another cares about latency and cost. The security team wants to know whether the agent touched the wrong capability. The support team has real examples from customers that should become test cases.

This is where I think agent platforms will mature quickly.

The model picker is not enough. The chat UI is not enough. The workflow builder is not enough.

If the agent can take actions, the platform needs to make those actions measurable.

evals are not a scoreboard

The least useful version of this is a dashboard that says your agent is 87.3 percent good.

That number may be interesting, but it is not very actionable by itself. Good against what? Which tools? Which failure modes? Which customer scenarios? Which version of the prompt? Which model? Which hidden assumption?

Evaluation becomes useful when it points back to an engineering change.

This is one of the smarter parts of the Agent-EvalKit framing: the report is supposed to produce code-level recommendations, not just abstract scores. In the AWS example, the practical fix was not "make the agent better." It was closer to "add guardrails for empty tool results and improve error handling along the paths where the agent fabricates facts."

That is the difference between a metric and feedback.

A metric tells you faithfulness is low.

Feedback tells you where the agent loses contact with reality.

I want evaluation systems that create the second thing.

production will still surprise you

There is a trap here, because engineers love turning messy things into gates.

I am not against gates. If an agent workflow is important, there should be thresholds. A regression in faithfulness, tool accuracy, latency, or policy compliance should block a release the same way a broken test blocks a release.
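A gate like that can be a few lines. The sketch below covers only higher-is-better scores between 0 and 1; the metric names and minimums are assumptions, and latency is left out because it needs the opposite comparison.

```python
from typing import Dict, List

# Assumed minimums for higher-is-better scores in the 0..1 range.
THRESHOLDS = {"faithfulness": 0.90, "tool_accuracy": 0.95, "policy_compliance": 1.0}


def failing_metrics(
    scores: Dict[str, float],
    thresholds: Dict[str, float] = THRESHOLDS,
) -> List[str]:
    """Return metrics below their minimum; a missing score counts as failing."""
    return sorted(
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    )


failures = failing_metrics({"faithfulness": 0.72, "tool_accuracy": 0.97})
print(failures)
```

In CI, a non-empty list would translate into a non-zero exit code, the same way a failing test does.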

But agent evaluation will not end at CI.

The weird cases will come from production. Users will ask things your synthetic data did not cover. Tools will return malformed data. Vendor APIs will degrade. Someone will add a new capability and accidentally change the search path. A model upgrade will improve the average answer and break one important edge case. A prompt edit will reduce hallucinations while making the agent annoyingly cautious.

That means the loop has to continue after deployment.

Real traces should feed new test cases. Rejected outputs should become examples. Incident analysis should add scenarios. Human review should calibrate the evaluator instead of being replaced by it. The test set should become a living artifact of what the organization has learned.

This is where the test pyramid metaphor is useful, but only if we do not take it too literally.

Agent evaluation probably needs layers: cheap deterministic checks, code-based assertions, LLM-as-judge scoring, trace inspection, human review, production monitoring, and regression suites built from real failures.

Not every workflow needs all of that.

But serious workflows need more than "the answer looked fine."

what i would start with

If I were introducing this in a team, I would not start with a grand universal eval platform.

I would pick one agent workflow that already matters.

Then I would define three things:

the final outcome that must be correct
the tool behavior that must be trustworthy
the failure mode that would embarrass us in production

For a support agent, that might be: answer grounded in retrieved docs, no invented policy, and escalation when confidence is low.

For a coding agent, it might be: tests run before the PR, no files outside scope, and no dependency changes without explicit instruction.

For an operations agent, it might be: read-only diagnosis by default, approved command list, and clear refusal when the requested action is unsafe.

Then I would capture traces and build a small regression set around those expectations.

The first version does not need to be elegant. It needs to be honest.

Once the team can see where the agent cheats, guesses, skips, overreaches, or wastes money, the next platform requirements become obvious.
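For the coding agent example, the three expectations can be sketched as one check over a captured run. The run structure, the step names, and the list of dependency files are assumptions I chose for illustration.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# Assumed list of files that count as dependency manifests.
DEPENDENCY_FILES = {"requirements.txt", "package.json", "go.mod", "Cargo.toml"}


def violations(run: Dict, allowed_globs: List[str]) -> List[str]:
    """Check a captured run against the three coding-agent expectations."""
    found = []
    steps = run["steps"]
    # Only steps that happened before the PR was opened count.
    pr_index = steps.index("open_pr") if "open_pr" in steps else len(steps)
    if "run_tests" not in steps[:pr_index]:
        found.append("no_tests_before_pr")
    files = run["changed_files"]
    if any(not any(fnmatchcase(f, g) for g in allowed_globs) for f in files):
        found.append("out_of_scope_files")
    touched_deps = any(f.split("/")[-1] in DEPENDENCY_FILES for f in files)
    if touched_deps and not run.get("dependency_change_requested", False):
        found.append("unrequested_dependency_change")
    return found


run = {
    "steps": ["edit", "open_pr", "run_tests"],
    "changed_files": ["src/app.py", "infra/main.tf", "src/requirements.txt"],
}
print(violations(run, allowed_globs=["src/*"]))
```

A handful of recorded runs plus this kind of function is already a regression set. It is not elegant, but it is honest about what the agent did.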

the punchline

Agents are forcing evaluation to grow up.

Checking the final answer was fine when the agent was basically a chat box. It is not fine when the agent can inspect systems, call tools, write files, and make decisions that other people treat as work.

The mature question is no longer only "did it answer correctly?"

It is also "did it get there in a way we trust?"

That question needs traces, tool-call checks, faithfulness metrics, regression suites, production feedback, and reports that point to actual fixes.

In other words, it needs the boring testing culture we already learned to need everywhere else.

The agent era does not make tests obsolete.

It makes the test artifact bigger.

references

AWS: Evaluate AI agents systematically with Agent-EvalKit


hosted coding agents make observability a product feature

Paulo Victor Leite Lima Gomes — Mon, 15 Jun 2026 00:02:19 +0000

The laptop was never the interesting part of coding agents.

It was just the first convenient runtime.

Your laptop has the repository, the shell, the secrets, the package cache, the local database, the half-working dev server, and whatever credentials you forgot were still loaded in the background. So the early version of agentic coding naturally ran there. It was close to the work. It had all the messy context. It was also a very strange place to run something that might edit code for an hour while calling tools, installing dependencies, and touching private systems.

AWS published a Bedrock AgentCore post this month with a very good hook: you should be able to close your laptop while the coding agent keeps working.

That is the demo-friendly version.

The more important version is this: once the agent leaves the laptop, "what happened?" becomes a production question.

And that is where observability stops being a nice enterprise add-on and becomes part of the product.

remote is not the same as trustworthy

Moving a coding agent to a hosted runtime solves some obvious problems.

The agent can keep running after your machine sleeps. Multiple agents can run in parallel without fighting over the same local Postgres port. The filesystem can persist between sessions. The environment can be isolated in a microVM or container instead of sharing your real shell with everything else you do all day.

Good.

But remote execution also removes a useful kind of accidental visibility. When the agent is on your laptop, you can at least see the terminal, notice the fan spin up, watch the test output, and feel the blast radius because it is your machine.

In a hosted runtime, that little bit of intuition disappears. The agent is now somewhere else, with its own filesystem, network path, credentials, tools, retry behavior, and bill.

So the platform has to replace intuition with evidence.

Not a transcript pasted into a PR.

Actual operational evidence: traces, logs, metrics, command history, tool calls, token usage, latency, failures, retries, identity, and cost.

Without that, a hosted agent is just a remote terminal with better branding.

the trace is the review artifact

We still talk about coding agents as if the pull request is the main artifact.

That is too small.

The PR tells you what changed. It does not tell you enough about how the change was produced. For simple work, that may be fine. For production agent workflows, the process matters.

I want to know:

who or what started the session
which repository and branch were checked out
which identity was used for tools
which files were inside the allowed scope
which commands ran
which external tools were called
which tests failed before they passed
how much time and token budget the task consumed
what the agent tried before it settled on the final diff
where a human approved or stopped something

Some of that belongs in the PR description. Some belongs in the platform that launched the task. Some belongs in traces and logs. The important bit is that the information exists in a place the team can query later.

Six months from now, someone will ask why an agent changed an auth middleware, why it contacted a particular internal service, or why a migration took five attempts. "The bot said it was done" will not be a satisfying answer.

The trace becomes part of the review artifact because the diff is no longer enough.

observability is also a permission model

People often separate observability and security too cleanly.

For agents, they are tangled together.

If an agent can call GitHub, Jira, Slack, a database console, an internal admin API, and a package registry, you need to know more than whether the final tests passed. You need to know which capabilities it actually used.

This is why the AWS AgentCore framing around Identity, Gateway, CloudTrail, CloudWatch, and OpenTelemetry is interesting. The details matter less than the shape of the product: the agent runtime is not only where code executes. It is also where identity, tool access, traces, metrics, and audit records become one control surface.

That is the correct direction.

A platform team should be able to answer boring questions without spelunking through chat history:

Did the agent act as a human, a bot, or a GitHub App?
Which downstream credential was attached to a tool call?
Did it use the approved MCP gateway or bypass it with a raw network call?
Did it read memory it should not have read?
Did it push to a remote that was not in policy?
Did it spend $3 or $300 on model calls?

These are observability questions. They are also governance questions.

The dashboard is not decoration. It is how the organization decides whether the agent is allowed to keep doing this kind of work.
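Some of those questions can be answered by a query over recorded tool calls. The sketch below assumes each call is logged with a tool name, the path it took, a target, and a model cost; those field names and the policy values are hypothetical, not an AgentCore log format.

```python
from typing import Dict, List, Set


def policy_findings(
    tool_calls: List[Dict],
    approved_gateway: str,
    allowed_remotes: Set[str],
    budget_usd: float,
) -> List[str]:
    """Flag gateway bypasses, pushes to unapproved remotes, and overspend."""
    findings = []
    for call in tool_calls:
        if call["via"] != approved_gateway:
            findings.append(f"{call['tool']}: bypassed gateway via {call['via']}")
        if call["tool"] == "git_push" and call["target"] not in allowed_remotes:
            findings.append(f"git_push: remote {call['target']} not in policy")
    spend = sum(call.get("model_cost_usd", 0.0) for call in tool_calls)
    if spend > budget_usd:
        findings.append(f"model spend {spend:.2f} USD exceeds budget {budget_usd:.2f} USD")
    return findings


calls = [
    {"tool": "jira", "via": "mcp-gateway", "target": "PROJ-1", "model_cost_usd": 1.5},
    {"tool": "git_push", "via": "raw-network", "target": "fork", "model_cost_usd": 2.0},
]
for finding in policy_findings(calls, "mcp-gateway", {"origin"}, budget_usd=3.0):
    print(finding)
```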

cost needs to be visible

Hosted coding agents will make cost stories weird.

A human running tests locally usually does not think about the marginal cost of each command. With a hosted agent, every command has one. There is compute. There is storage. There are model tokens. There may be tool calls, network egress, hosted observability, and retries.

If coding agents are going to be part of the development platform, cost has to be attached to the unit of work. Not only per team or per account, but per session, per task template, per repository, and ideally per outcome.

How much did the dependency upgrade cost when it passed the first time? How much when tests failed twice? How much when the agent got stuck reading irrelevant files? Which task templates are cheap and boring enough to run automatically? Which ones should require approval because the variance is too high?

This is product feedback, not finance theater. Good agent platforms will make cost visible early enough that teams can improve the workflow instead of merely scolding people after the invoice arrives.
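Attaching cost to the unit of work can begin as a grouping exercise. This sketch assumes each session record carries a task template name and a total cost; the record shape and the numbers are made up.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List


def cost_by_template(sessions: List[Dict]) -> Dict[str, Dict[str, float]]:
    """Group session costs by task template and report run count, mean, and spread."""
    grouped = defaultdict(list)
    for session in sessions:
        grouped[session["template"]].append(session["cost_usd"])
    return {
        template: {
            "runs": len(costs),
            "mean_usd": mean(costs),
            "spread_usd": pstdev(costs),  # population standard deviation
        }
        for template, costs in grouped.items()
    }


sessions = [
    {"template": "dependency-upgrade", "cost_usd": 2.0},
    {"template": "dependency-upgrade", "cost_usd": 4.0},
    {"template": "lint-migration", "cost_usd": 1.0},
]
print(cost_by_template(sessions))
```

The spread is the interesting number. A template with a low mean and a high spread is a candidate for an approval step, because the variance is the risk.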

evals tell you if it worked once

Evaluation matters, but evals and tests are still mostly release-time confidence. They tell you whether the agent, tool, prompt, or workflow performed well against a known scenario.

Production has a different personality.

Production asks why the same workflow got slower this week. Why one repository has a higher failure rate. Why a certain tool call started timing out. Why token usage jumped after a prompt change. Why humans keep rejecting PRs from one task template. Why an agent that passed evals is still annoying reviewers.

That is observability work.

AWS's AgentOps guidance puts governance, build and operations, evaluation, and observability next to each other. That is the right grouping because agents do not fail in one clean layer. They fail across model behavior, tool behavior, runtime behavior, memory, permissions, prompts, data, networks, and human expectations.

The eval suite catches some of that. The production trace catches the rest.

what i would build first

If I were moving coding agents off laptops and into a hosted runtime, I would start with one narrow workflow.

Not "let agents work on anything."

Something boring and bounded, like dependency upgrades for low-risk services, small lint migrations, or service-template updates.

Then I would make observability part of the definition of done:

every session gets a stable ID
every task links to the issue, branch, PR, logs, and trace
every tool call records the caller, target, and credential class
every deterministic command is captured separately from model reasoning
every task reports duration, token use, retry count, and compute cost
every PR says what was attempted, what failed, and what was left for humans
every task template has a basic success and rejection rate

That sounds like a lot until you compare it with the alternative: remote agents doing semi-autonomous work in private repos while the team judges them from a final diff and a cheerful summary.

No thanks.

I want the boring trail.
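Part of that definition of done can be checked mechanically. The sketch below covers only the per-task fields from the list; the field names are my own assumption, and a real platform would use whatever its task records already contain.

```python
from typing import Dict, List

# Assumed field names, one per required piece of the trail.
REQUIRED_FIELDS = [
    "session_id", "issue", "branch", "pr", "logs", "trace",
    "duration_s", "tokens", "retries", "compute_cost_usd",
]


def missing_for_done(task: Dict) -> List[str]:
    """Return required fields that are absent or blank; zero is a valid value."""
    return [name for name in REQUIRED_FIELDS if task.get(name) in (None, "")]


task = {
    "session_id": "s-001",
    "issue": "PLAT-42",
    "branch": "agent/upgrade-deps",
    "pr": "",
    "logs": "logs-link",
    "duration_s": 1800,
    "tokens": 250000,
    "retries": 0,
    "compute_cost_usd": 0.4,
}
print(missing_for_done(task))
```

Note that `retries: 0` passes. A task that needed no retries has still reported its retry count.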

the punchline

Hosted coding agents are coming because the laptop is a bad long-running agent host. It is too personal, too shared, too fragile, and too invisible to the rest of the organization.

But the real product is not simply "run the agent somewhere else."

The real product is a runtime where agent work is observable enough to trust, limit, debug, price, and improve.

That is why CloudTrail, CloudWatch, OpenTelemetry, token metrics, trace IDs, session records, and gateway logs are not enterprise garnish. They are the difference between a demo and an operational system.

The next useful coding-agent platform will not win only because the model is smarter or the sandbox is cleaner.

It will win because when someone asks "what happened?", the answer is not a vibe, a transcript, or a shrug.

It is a trace.


agentic workflows are being domesticated by actions

Paulo Victor Leite Lima Gomes — Sun, 14 Jun 2026 00:01:41 +0000

GitHub's Agentic Workflows preview has the kind of headline that makes people reach for the wrong conclusion.

Natural-language Markdown can be turned into GitHub Actions workflows.

That sounds like "the YAML is going away."

I do not think that is the interesting story.

The interesting story is that the agent is not escaping the workflow engine. It is being pulled into it.

That matters because a lot of agent demos still pretend the future is a smart process floating above the boring machinery: the agent understands the request, edits the repo, runs some commands, and hands back a neat result. Nice demo. Very clean.

Production engineering is not clean like that.

Production engineering has permissions, logs, runner groups, approval rules, secrets, firewalls, budgets, weird old repositories, compliance questions, and someone who has to explain what happened when the helpful automation did something surprising.

So the shape of Agentic Workflows is useful precisely because it is less magical than the demo version. GitHub is putting agents inside the same CI/CD world that already carries a lot of organizational trust.

That is the right direction.

markdown is not the control plane

The cute part is that a developer can describe a workflow in Markdown and have GitHub turn that into standard Actions YAML.

That is useful. YAML is not a personality test, and most teams have better things to do than memorize every Actions syntax edge case.

But Markdown is only the input surface.

The control plane is still Actions.

That distinction matters. If the generated workflow is a normal Actions workflow, then all the existing machinery can still matter: repository permissions, runner selection, logs, environments, approvals, branch protection, organization policy, and whatever security controls the company already built around CI.

This is where I get more optimistic about agentic tooling.

The bad version of agents asks every organization to trust a new, parallel execution model because the model can write a nice plan.

The better version lets the agent help create and operate within the workflows the organization already governs.

It is not "throw away CI because agents are smarter now."

It is "let the agent speak CI."

defaults are doing real work

The preview details are full of boring words that are actually important: read-only defaults, sandboxed containers, firewall rules, output validation, and threat detection.

That is not launch-page decoration. That is the product admitting the hard part.

An agentic workflow is not just a script. It is automation that may interpret instructions, call tools, inspect a repository, generate files, and decide what to do next. If that runs with broad permissions and a casual network boundary, the organization has not gained an agent platform. It has created a very persuasive CI job with too much reach.

The read-only default is especially important.

Write access should be a decision, not an accident. Secret access should be a decision. Network access should be a decision. The ability to open pull requests, comment on issues, trigger deployments, or modify generated artifacts should be visible in the workflow definition and reviewable by people who own the repository.

This is the same lesson we learned with CI, package registries, browser extensions, cloud IAM, and Kubernetes admission controls. The default boundary decides how many bad ideas become incidents.

Agents make the boundary more important, not less.
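Here is a minimal sketch of what "write access should be a decision" could look like as a review check. It assumes the workflow's permissions block has already been parsed into a Python value. The scope names and the read/write levels follow the standard Actions permissions key; the function names are mine and are not part of any GitHub tooling.

```python
from typing import Mapping, Union

# Either the shorthand string ("read-all" / "write-all") or a scope -> level mapping.
Permissions = Union[str, Mapping[str, str]]


def write_scopes(permissions: Permissions) -> list[str]:
    """Return every scope that grants write access, sorted for stable review output."""
    if isinstance(permissions, str):
        # "write-all" grants write on every scope; report it as a wildcard.
        return ["*"] if permissions == "write-all" else []
    return sorted(scope for scope, level in permissions.items() if level == "write")


def needs_human_decision(permissions: Permissions) -> bool:
    """Hypothetical policy: any write scope has to be approved explicitly."""
    return bool(write_scopes(permissions))


requested = {"contents": "read", "issues": "write", "pull-requests": "write"}
print(write_scopes(requested))
print(needs_human_decision("read-all"))
```

The point is not the few lines of Python. The point is that the answer comes from the committed definition, so a reviewer can see it change in a diff.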

the old machinery keeps winning

There is a funny pattern in AI developer tools right now.

The front of the product gets new language: agents, tasks, skills, autonomy, natural language, context, reasoning.

The back of the product keeps rediscovering old infrastructure: queues, tokens, logs, approvals, sandboxes, spend limits, role-based access, and audit trails.

That is not hypocrisy. That is maturity.

Agentic Workflows using GITHUB_TOKEN instead of long-lived personal access tokens is a good example. Nobody should be excited about passing PATs around as the foundation for organizational automation. It works until it does not. Then you get the classic mess: unclear ownership, overbroad scope, difficult rotation, and an audit trail that points to a person when the real actor was a workflow.

GITHUB_TOKEN is not glamorous. It is exactly the kind of boring identity primitive agent systems need.

Same for organization billing, cost centers, and per-run token caps.

People want to talk about whether agents can finish bigger tasks. Fine. But if a company is going to run agentic workflows across many repositories, the questions become painfully practical:

Who is paying for this run?
Which team owns the budget?
What happens when the token cap is reached?
Can we distinguish one expensive useful workflow from ten noisy ones?
Can a platform team shut this down without begging every repo owner?

That is what real adoption looks like. Not one amazing workflow. A hundred normal workflows that need ownership.
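The questions above can be sketched as a tiny ledger. This is an illustration under my own assumptions, not GitHub's billing model: the class, the field names, and the "refuse the charge once the cap would be exceeded" behavior are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunBudget:
    """Token usage for one workflow run, checked against a hard cap."""
    workflow: str
    owner_team: str
    token_cap: int
    used: int = 0

    def charge(self, tokens: int) -> bool:
        """Record usage. Return False and record nothing if the cap would be exceeded."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used + tokens > self.token_cap:
            return False
        self.used += tokens
        return True


def spend_by_team(runs: list[RunBudget]) -> dict[str, int]:
    """Roll usage up to the team that owns each workflow."""
    totals: dict[str, int] = defaultdict(int)
    for run in runs:
        totals[run.owner_team] += run.used
    return dict(totals)


triage = RunBudget("issue-triage", "support-tools", token_cap=10_000)
docs = RunBudget("docs-drift", "docs", token_cap=2_000)
triage.charge(6_000)
triage.charge(5_000)  # refused: would exceed the cap
docs.charge(1_500)
print(spend_by_team([triage, docs]))
```

Every run carries an owner and a cap from the start, so "who is paying" is a field and not an investigation.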

this is not just nicer workflow generation

It would be easy to treat Agentic Workflows as a better workflow generator.

"Describe what you want, get YAML."

That is probably useful on day one. But the more interesting use case is not replacing copy-paste from documentation. It is creating a higher-level path for routine engineering automation.

Imagine a repository owner saying:

"Every Friday, inspect stale dependencies, propose safe upgrades, run the standard test matrix, and open a draft PR only when the change is low risk."

Or:

"When a flaky test is labeled, gather recent failures, isolate the likely files, and draft an investigation issue with links to logs and suspected causes."

Or:

"Before release, check docs, changelog, migration notes, and examples against merged pull requests, then prepare a reviewable patch."

These are not huge science fiction tasks. They are the annoying glue work around software delivery.

The important part is that the output still lands in a governed workflow. The agent does not become a mysterious background employee. It becomes a workflow step with logs, limits, identity, and review.

That is less romantic.

It is also much more likely to survive contact with a real engineering organization.

approval is not a lack of trust

GitHub also changed the posture for github-actions[bot] pull requests so they can run workflows after approval by someone with write access. That follows the same general direction as Copilot-generated pull requests.

This is another boring rule that I like.

Automation needs to be able to test its work. A bot-created PR that cannot run CI is often useless. But a bot-created PR that automatically reaches every secret and deployment-capable workflow is a different problem.

The reasonable middle is human approval at the boundary where risk changes.

That does not mean every agent PR needs a committee. It means the system should know when generated work is about to enter a more privileged part of the pipeline.

The human is not there to admire the diff. The human is accepting the blast radius.

This is the part agent tooling needs to get comfortable with. Approval gates are not anti-agent. They are how autonomous work becomes acceptable inside shared systems.

what i would watch

If I were evaluating this inside a company, I would not start by asking whether the generated YAML is pretty.

I would ask where the policy shows up.

Can platform teams define which runner groups agentic workflows may use? Can they set token caps by organization, repository, or workflow class? Can security teams see which workflows requested write access? Can logs explain what the agent read, generated, validated, and refused to do? Can repository owners understand the difference between a normal Actions failure and an agent decision that stopped early?

The generated workflow is only one artifact.

The evidence around the run is the part that will matter later.

Six months from now, someone will ask why a workflow opened a PR, why it skipped a repo, why it spent more than expected, or why it had permission to touch a file. If the answer is "the agent decided," the platform failed.

If the answer is in the workflow definition, run logs, approval history, token budget, and linked pull request, then we are at least playing the right game.
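A rough sketch of what that evidence could look like as one structured record per run. Every field name here is an assumption of mine. I do not know what the preview actually logs, so treat this as the shape I would want, not the shape GitHub provides.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RunEvidence:
    """Hypothetical per-run record that answers 'why did this happen?' later."""
    workflow: str
    trigger: str
    requested_write_scopes: tuple[str, ...]
    approved_by: Optional[str]
    tokens_used: int
    token_cap: int
    pull_request: Optional[str]
    stop_reason: str


def to_log_line(evidence: RunEvidence) -> str:
    """Serialize with sorted keys so records are easy to diff and search."""
    return json.dumps(asdict(evidence), sort_keys=True)


record = RunEvidence(
    workflow="dependency-report",
    trigger="schedule",
    requested_write_scopes=("pull-requests",),
    approved_by=None,
    tokens_used=2_000,
    token_cap=2_000,
    pull_request=None,
    stop_reason="token cap reached before opening a pull request",
)
print(to_log_line(record))
```

"The agent decided" cannot be written into this record. The stop reason has to be stated.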

the punchline

Agentic Workflows are interesting because they do not replace GitHub Actions. They make Actions the place where agents become normal engineering automation.

That is the part I would bet on.

The future is not a swarm of free-floating agents doing whatever the prompt suggests. The future is agents squeezed through boring machinery: workflow engines, scoped tokens, runner policies, sandboxes, approvals, logs, budgets, and reviewable outputs.

This will annoy people who want the AI story to stay magical.

Good.

Software delivery is already full of powerful automation. The lesson was never "make the automation as unconstrained as possible." The lesson was to make useful automation observable, governable, and boring enough that teams can depend on it.

Agentic workflows are just the next version of that lesson.

The Markdown prompt is the shiny part.

The Actions control plane is the story.

references

To test my projects, I use Railway. If you want $20 USD to get started, use this link.

GitHub Actions is becoming the agent runtime

Paulo Victor Leite Lima Gomes — Sat, 13 Jun 2026 00:01:46 +0000

GitHub Agentic Workflows is now in public preview, and the most interesting part is not that agents can write workflow logic from Markdown.

The interesting part is where GitHub chose to put them.

Agentic Workflows are defined in natural language Markdown, then compiled into standard GitHub Actions YAML. They run through existing runner groups, organization policies, sandboxes, firewalls, output validation, and threat detection. A day later, GitHub also removed the need for a personal access token, letting these workflows use the built-in GITHUB_TOKEN, bill AI credits to the organization, and cap token usage per workflow run.

That sounds like a set of product details.

It is also a pretty clear architectural statement: the future of agent automation looks less like chat and more like CI/CD with a reasoning step inside it.

And honestly, that is probably the right direction.

the YAML was never the hard part

There is an easy way to read this release: "AI can now write Actions workflows for you."

That is the least interesting version.

Writing YAML is annoying, but it is not the hard part of production automation. The hard part is deciding what the automation is allowed to do, where it runs, what secrets it can reach, how it is reviewed, how it is billed, and who owns the failure mode when it does something strange at 2 a.m.

We already learned this with CI.

The workflow file is just the visible artifact. The real system is branch protection, runner isolation, required checks, environments, secrets, approvals, logs, retry behavior, permissions, and the long tail of organizational defaults.

Agents do not make that boring machinery obsolete.

They make it more important.

If an agent is going to triage issues, analyze CI failures, update documentation, or propose changes across repositories, I do not want it floating around as a magical side channel. I want it inside the same governance layer we already use for automation.

That is what makes the Actions angle matter.

actions is becoming an agent control plane

GitHub Actions started as "run these jobs when this event happens." Over time it became the default place where many teams express engineering policy.

Tests run there. Security scans run there. Release jobs run there. Preview environments get created there. Package publishing, dependency checks, compliance evidence, and deployment approvals often pass through that same system.

Now agents are being pulled into the same shape.

That is not accidental. A reasoning agent still needs a trigger, an execution environment, credentials, limits, logs, and a result. Actions already has language for most of that.

The agent part is new. The operational container around it is not.

This is the part I think many teams should pay attention to. The durable product category may not be "AI assistant." It may be "workflow engine where some steps reason over messy context."

That sounds less exciting than the demo.

It is much closer to how useful software gets adopted.

getting rid of PATs matters

The personal access token change is a bigger deal than it looks.

Long-lived personal tokens were never a sane foundation for scaled automation. They are easy to create, easy to forget, and hard to reason about after the employee who created them changes teams or leaves the company.

They also blur ownership. Is this automation acting as Paulo? As the repository? As the organization? Who pays for it? Who can revoke it without breaking unrelated workflows? Which audit trail tells the truth?

Using GITHUB_TOKEN does not solve every problem, but it moves agentic workflows into a more normal automation model. The workflow gets an identity scoped to the repository and job permissions. The organization can own billing. Cost centers and per-run token caps can exist in the same conversation as runner policy.

That is the boring enterprise feature.

It is also the feature that makes the product real.

If an agent still needs a developer's personal token to operate, it is not really production automation. It is a clever script with somebody's identity attached to it.

That may be fine for experiments. It is not where I would build an organizational workflow.

markdown is the interface, policy is the product

I like the Markdown interface. It fits the way many engineers already describe operational intent:

"When a new issue is opened, classify it, find related incidents, suggest labels, and ask for missing reproduction steps."

"When CI fails on the main branch, inspect the failure, compare it to recent flakes, and open a draft investigation."

"Once a week, check docs that mention deprecated APIs and propose updates."

Those tasks are awkward to express entirely as deterministic YAML. They involve judgment, search, summarization, and a tolerance for partial answers.

But the Markdown is only the friendly edge of the system.

The important part is what happens after the task is described. Does it compile into something reviewable? Can the generated workflow be locked? Can a platform team constrain the runner group? Are permissions read-only by default? Are outputs validated before changes are applied? Is there a threat detection pass? Can the cost be capped?

That is the difference between "we let the model do things" and "we built an automation system that includes a model."

The second one is much more useful.

this will create new review work

There is a trap here.

Because the workflow starts as natural language, it may feel less like code. That would be a mistake.

An agentic workflow definition is still a production artifact. It can cause compute to run, credits to be spent, issues to be labeled, pull requests to be opened, and humans to spend review attention.

It deserves review.

Not in the same way as a low-level library, but with the same seriousness. Someone should read the task definition and ask:

Is the job narrow enough?
What input can trigger it?
What repositories and files can it see?
What permissions does it request?
What happens when the answer is wrong?
How much can it spend before stopping?
Who owns the workflow when it gets noisy?

This is where platform teams will matter. If every team invents its own agentic workflow style, organizations will end up with the AI version of abandoned CI jobs: half-useful automation nobody wants to delete because nobody remembers why it was created.

The healthy version will look more like a catalog.

Approved workflow patterns. Standard permission sets. Known runner groups. Required labels for agent-authored pull requests. Budget defaults. A small number of blessed ways to do common things.

That sounds bureaucratic. It is also how shared infrastructure survives contact with real companies.
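A catalog like that can be sketched as data plus one check. The pattern names, runner group names, and limits below are invented for illustration. A real platform team would load them from wherever policy already lives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """One blessed way to do a common thing."""
    allowed_write_scopes: frozenset[str]
    runner_groups: frozenset[str]
    max_tokens_per_run: int


@dataclass(frozen=True)
class WorkflowRequest:
    pattern: str
    write_scopes: frozenset[str]
    runner_group: str
    token_cap: int
    owner: str


def violations(req: WorkflowRequest, catalog: dict[str, Pattern]) -> list[str]:
    """Return every way the request departs from its catalog pattern."""
    pattern = catalog.get(req.pattern)
    if pattern is None:
        return [f"unknown pattern: {req.pattern}"]
    problems = []
    extra = req.write_scopes - pattern.allowed_write_scopes
    if extra:
        problems.append(f"unapproved write scopes: {sorted(extra)}")
    if req.runner_group not in pattern.runner_groups:
        problems.append(f"runner group not allowed: {req.runner_group}")
    if req.token_cap > pattern.max_tokens_per_run:
        problems.append(f"token cap {req.token_cap} exceeds {pattern.max_tokens_per_run}")
    if not req.owner:
        problems.append("missing owner")
    return problems


CATALOG = {
    "issue-triage": Pattern(frozenset({"issues"}), frozenset({"agents-sandboxed"}), 20_000),
}
request = WorkflowRequest(
    "issue-triage", frozenset({"issues", "contents"}), "agents-sandboxed", 50_000, owner=""
)
for problem in violations(request, CATALOG):
    print(problem)
```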

the first workflows should be boring

The best early use cases are not dramatic.

I would not start with "refactor a service across 40 repositories." I would start with things that already have clear boundaries:

issue triage that only labels and comments
CI failure analysis that opens a draft report
documentation drift checks
dependency update summaries
compliance evidence collection
stale test investigation with no direct code changes

The pattern is simple: let the agent gather context, produce a reviewable artifact, and stop.

Do not let the first version directly merge changes. Do not let it silently mutate production configuration. Do not give it broad write permissions because the demo was impressive.

The more autonomous the workflow becomes, the more boring the control plane needs to be.

This is not pessimism. It is how you keep the useful parts.

Agents are good at starting work that humans postpone because it is tedious. They can read logs, connect clues, summarize context, and draft the next step. That is valuable.

But once the agent sits inside a scheduled or event-driven workflow, it becomes part of the engineering system. The system needs limits before the novelty wears off.

the punchline

GitHub Agentic Workflows are interesting because they make agents less special.

The agent is not a separate magical coworker hovering outside the process. It is a workflow step running inside Actions, behind policies, with permissions, billing, logs, and validation.

That is the right kind of boring.

The companies that get value from this will not be the ones that write the fanciest natural-language workflows. They will be the ones that treat those workflows like production automation: narrow jobs, scoped tokens, reviewable outputs, explicit owners, budget caps, and boring defaults.

The future of agent work may still have chat windows.

But the version that survives inside real engineering organizations is going to look a lot like CI.

prompts are becoming CI/CD configuration

Paulo Victor Leite Lima Gomes — Fri, 12 Jun 2026 00:02:03 +0000

GitHub Agentic Workflows is now in public preview, and the headline version is easy to understand.

You write a natural-language Markdown file that describes a reasoning-based automation. GitHub compiles it into Actions YAML. The agent runs inside the Actions world, using the runners, permissions, policies, and review machinery teams already have.

That sounds like "AI for GitHub Actions."

I think the more interesting version is slightly more uncomfortable:

Prompts are becoming CI/CD configuration.

Not prompts in the casual chat window sense. Not "please summarize this issue" typed by one developer on a Tuesday afternoon. I mean prompts as durable, reviewed, repeatable inputs to the delivery system.

They live in the repository. They describe work. They decide which tools an agent can use. They affect code, issues, documentation, triage, security checks, and pull requests. They can spend organization money. They can be triggered by workflows.

At that point, a prompt is not a suggestion.

It is infrastructure.

markdown was the easy part

There is a very nice developer-experience trick here.

Markdown feels harmless. Everyone knows how to read it. A workflow written in Markdown looks less intimidating than a page of YAML, shell scripts, permissions, and matrix jobs.

That is useful. A lot of CI/CD configuration became painful because it asked every team to think like a build-platform engineer. If a product engineer can describe a narrow piece of maintenance work in plain language and have it compile into a normal Actions workflow, that is a real improvement.

But the plain-language surface should not fool us.

The system still needs all the boring parts underneath:

which events can start the workflow
which repository contents the agent can read
which tools it can call
which runner group it uses
which secrets it cannot reach
which output is considered safe
which lockfile represents the compiled behavior
which human is responsible when it gets weird

The prompt is friendly. The operational shape is not casual.

This is the part that engineering organizations need to internalize. The prompt is now one layer of a control plane. It is closer to Terraform, GitHub Actions YAML, CodeQL configuration, or policy-as-code than it is to a chat message.

Readable does not mean low risk.

compiled prompts need code review

The phrase "compiled into Actions YAML" matters.

Compilation creates a useful boundary. It means the natural-language file is not the only artifact that should be understood. There is a generated workflow shape too, and that workflow has permissions, jobs, runners, and execution behavior.

That should sound familiar.

We do not review Kubernetes manifests only by asking whether the app developer had good intentions. We look at the resources, probes, ports, environment variables, service accounts, and network exposure. We do not review Terraform by saying the description felt reasonable. We inspect the plan.

Agentic workflows need the same discipline.

If someone changes the prompt from "triage stale issues" to "fix stale issues," that may be a huge behavior change. If someone adds a tool, broadens a path pattern, changes a permission, or swaps the model used for the task, the diff can look small while the blast radius gets much larger.

Natural language makes this trickier because tiny wording changes can matter.

"Update documentation when APIs change" is different from "update documentation and examples when APIs change." One writes prose. The other may touch executable code. "Open a draft pull request" is different from "open a pull request." "Suggest labels" is different from "apply labels."

This is not a reason to avoid the feature. It is a reason to stop treating prompt review as vibes.
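A deliberately crude sketch of a review aid for those wording changes: flag action words that were added and safeguard words that were removed between two versions of a prompt. Both word lists are my own guesses. This is a heuristic that points a reviewer at the diff, not a substitute for reading it.

```python
import re

# Hypothetical word lists; a real team would tune these to its own workflows.
ESCALATING = frozenset({"fix", "apply", "merge", "delete", "deploy"})
SAFEGUARDS = frozenset({"draft", "suggest", "only"})


def words(text: str) -> set[str]:
    """Lowercased alphabetic words in the prompt."""
    return set(re.findall(r"[a-z]+", text.lower()))


def risky_changes(old: str, new: str) -> dict[str, list[str]]:
    """Compare two prompt versions and report wording that widens behavior."""
    old_words, new_words = words(old), words(new)
    return {
        "added_actions": sorted((new_words - old_words) & ESCALATING),
        "removed_safeguards": sorted((old_words - new_words) & SAFEGUARDS),
    }


print(risky_changes("triage stale issues", "fix stale issues"))
print(risky_changes("Open a draft pull request", "Open a pull request"))
print(risky_changes("Suggest labels", "Apply labels"))
```

It will miss plenty. "Update documentation and examples" slips straight through, which is exactly why a human still reads the prompt.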

The reviewer should ask boring questions:

What is this workflow allowed to change?
What evidence does it need before changing it?
Does it produce a draft or a final artifact?
Are generated changes clearly labeled?
What happens when the agent is uncertain?
Can the team reproduce the behavior from the committed files?

That is code review. It just happens to include English.

the personal token going away is a big deal

The related GitHub change may be even more important: agentic workflows can now use the built-in GITHUB_TOKEN instead of a long-lived personal access token.

That sounds like plumbing because it is plumbing.

It is also exactly the kind of plumbing that separates hobby automation from company infrastructure.

Long-lived personal access tokens are a bad foundation for shared automation. They blur ownership. They outlive people. They hide inside secrets. They make it too easy for "Paulo's token" to become the thing that keeps a business process running.

Moving agentic workflows to GITHUB_TOKEN puts them into the normal Actions identity model. The repository and organization can own the automation. Permissions can be scoped. Billing can attach to the organization instead of a person. Policies can decide whether Copilot CLI usage is allowed.

This is less flashy than an agent writing code.

It is also the maturity moment.

Agents stop being toys when they stop using your personal token.

That does not make them safe by default. It makes them governable in a way that enterprises can understand.

budgets are part of the workflow file now

Organization billing changes the conversation too.

If an agentic workflow runs as part of Actions and consumes AI credits, the cost is no longer an individual developer experimenting with a tool. It is a property of the delivery pipeline.

That means it needs the same treatment as other metered CI resources.

Some workflows are worth running on every issue. Some should run nightly. Some should run only when a maintainer asks. Some should run on a small model. Some should spend more because the work is security-sensitive or touches a critical path.

None of that should be discovered from the invoice.

The uncomfortable part is that AI cost will often be mixed with human attention cost. A workflow that opens five low-quality pull requests a week is not cheap just because the model bill is small. It spends reviewer time. It creates notification noise. It teaches the team to ignore agent output.

That is why the owner matters.

Every agentic workflow should have someone who can answer three questions:

Is this still useful?
Is this still worth what it costs?
Is this still operating inside the intended boundary?

If nobody owns those answers, the workflow is just another piece of automation drifting toward background noise.

safe outputs are the new build artifacts

GitHub's announcement spends time on safeguards: read-only defaults, sandboxed containers, a firewall, safe outputs, and threat detection scanning proposed changes before they are applied.

That is the right direction.

It also hints at the real problem. Once an agent can reason over repository content and generate changes, the output itself becomes something that needs validation before the rest of the pipeline trusts it.

This is very similar to build artifacts.

A compiled binary is not trusted because the source code looked nice. It is trusted because it came from a known process, in a known environment, with checks, signatures, provenance, and review rules around it.

Agent output needs that mindset.

The question is not only "did the agent produce a useful diff?"

The better questions are:

What input did it see?
Which tools did it use?
Which files did it touch?
Which checks ran after it produced the result?
Was the output constrained before it reached a privileged workflow?
Can a reviewer understand why the change exists?

This is why putting agentic workflows inside Actions is smart. It gives the ecosystem a familiar place to put these controls.

But teams still have to use them.
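One of those questions, whether the output was constrained before it reached a privileged workflow, can be sketched as a small validation step. This is not GitHub's safe-outputs implementation. The types, the docs/ root, and the draft-only rule are assumptions chosen to show the idea.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ProposedOutput:
    """What the agent wants to hand to the rest of the pipeline."""
    changed_files: tuple[str, ...]
    draft: bool


def is_under(path: str, root: str) -> bool:
    """True if path is a relative path strictly inside root, with no '..' tricks."""
    candidate = PurePosixPath(path)
    if candidate.is_absolute() or ".." in candidate.parts:
        return False
    return PurePosixPath(root) in candidate.parents


def rejection_reasons(output: ProposedOutput, allowed_root: str = "docs") -> list[str]:
    """Return why the output must not proceed; an empty list means it may."""
    reasons = []
    outside = [f for f in output.changed_files if not is_under(f, allowed_root)]
    if outside:
        reasons.append(f"files outside {allowed_root}/: {outside}")
    if not output.draft:
        reasons.append("pull request must be a draft")
    return reasons


proposal = ProposedOutput(("docs/api.md", "docs/../src/auth.py"), draft=False)
for reason in rejection_reasons(proposal):
    print(reason)
```

The check runs on the output, not on the agent's intentions. That is the build-artifact mindset.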

what i would do first

I would not start with a workflow that rewrites production code.

Start with something useful and boring.

Issue triage is a decent first candidate. Documentation drift is another. Weekly dependency-report generation is probably fine. Release-note preparation can work if the output is explicitly a draft.

The important part is to keep the first workflow narrow enough that a reviewer can tell when it misbehaves.

For example:

It can only comment, not edit code.
It can only touch files under docs/.
It can only open draft pull requests.
It cannot run on untrusted external input.
It has a named owner.
It has a budget.
It has a clear delete condition.

That last one is underrated. Automation should have a delete condition. If the workflow creates more review burden than value for a month, turn it off. If the team ignores every output, delete it. If it needs a human to rewrite everything, tighten the task or stop pretending it is automation.

Engineering maturity is not keeping every clever workflow alive forever.
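The delete condition can be made explicit enough to run. A minimal sketch, assuming someone labels each agent output in a review window as accepted, rewritten, or ignored. The labels and the all-or-nothing thresholds are mine. A real team would pick its own.

```python
from collections import Counter

VALID_OUTCOMES = frozenset({"accepted", "rewritten", "ignored"})


def review_verdict(outcomes: list[str]) -> str:
    """Decide what to do with a workflow from how its recent outputs were treated."""
    unknown = set(outcomes) - VALID_OUTCOMES
    if unknown:
        raise ValueError(f"unknown outcomes: {sorted(unknown)}")
    if not outcomes:
        # Assumption: no outputs in the window means nothing to judge yet.
        return "keep"
    counts = Counter(outcomes)
    if counts["ignored"] == len(outcomes):
        return "delete"   # the team ignores every output
    if counts["rewritten"] == len(outcomes):
        return "tighten"  # a human rewrites everything
    return "keep"


print(review_verdict(["ignored", "ignored", "ignored"]))
print(review_verdict(["rewritten", "rewritten"]))
print(review_verdict(["accepted", "ignored", "rewritten"]))
```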

the punchline

Agentic Workflows are interesting because they make prompts durable.

A prompt can now sit in a repository, compile into a workflow, run on organization infrastructure, use organization identity, spend organization money, and produce changes that enter the same review process as human work.

That is a real shift.

It is also a warning label.

If prompts are becoming CI/CD configuration, they need the habits we already learned from CI/CD: review, ownership, least privilege, budgets, lockfiles, sandboxing, rollback, and deletion when the value is gone.

The pleasant fiction is that natural language makes automation simple.

The more useful truth is that natural language makes automation easier to author, which means we will have more of it.

More automation is good only when the operating model keeps up.

coding agents made repositories the security boundary

Paulo Victor Leite Lima Gomes — Wed, 10 Jun 2026 00:03:52 +0000

GitHub shipped a small changelog entry this week that says more about the future of coding agents than most of the launch demos.

Security validation for third-party coding agents is now generally available. Not just for GitHub's own Copilot cloud agent. For third-party agents too, including Claude and OpenAI Codex.

The feature sounds boring in the best possible way.

When an agent creates code, GitHub can run CodeQL, check new dependencies against the GitHub Advisory Database, and use secret scanning to detect tokens, API keys, and other sensitive material. If it finds a problem, the agent tries to fix it.

That is not the flashy part of agentic coding.

It is the important part.

Because once agents are allowed to act inside repos, the question stops being "which model wrote this diff?" and becomes "can the repository apply the same policy to every automation actor?"

authorship is the wrong abstraction

We still talk about generated code as if authorship is the primary thing that matters.

Was this written by Copilot? Claude? Codex? A human with tab completion? A human who pasted something from a chat window and cleaned it up? A junior engineer following a Stack Overflow answer from 2018?

Those distinctions matter for procurement and product marketing. They matter less for the repository.

The repository has a simpler problem: a change is trying to enter the system. It may introduce a vulnerability, add a risky dependency, leak a secret, violate an internal rule, or be perfectly fine.

That is why the GitHub change is interesting. It moves the useful boundary from "our approved coding assistant" to "any coding agent operating in this repository."

the agent is now an actor

For years, repository automation was mostly boring and legible.

CI ran tests. Dependabot opened dependency updates. Release bots bumped versions. Linters complained. Security scanners commented. Humans reviewed. The automation could be annoying, but its shape was predictable.

Coding agents are different.

They do not merely report on the repository. They edit it.

They read issues, inspect files, modify code, add tests, update dependencies, and open pull requests. That makes them closer to a contractor than a linter.

And like every contractor with write access, they need boundaries.

Not vibes. Boundaries.

What tools can they call? What data can they read? Which secrets are visible? Which checks must pass? Who approves the final merge? How do we audit what happened later?

These are not philosophical questions. They are repository administration questions.

The funny part is that this makes agentic coding feel less futuristic and more like boring enterprise software. Identity, authorization, audit logs, policy inheritance, defaults, exceptions, and enforcement.

As usual, the boring part is where production starts.

security checks should not depend on vendor loyalty

The worst version of agent governance is vendor-specific policy.

Copilot-generated code gets one security path. Claude-generated code gets another. Codex-generated code depends on whatever the team remembered to configure. A local agent running from a developer machine is treated as a human diff because nobody knows what else to do with it.

That does not scale. It also creates the wrong incentive: teams argue about which agent is safe instead of making the repository resilient to generated work in general.

The security posture should not be:

"We trust this model."

It should be:

"We do not allow unvalidated changes through this boundary."

Those are very different statements.

The first one turns model selection into a security control. That is weak, because models change, vendors change defaults, prompts drift, and generated output is still output.

The second statement is boring and robust. Every change gets checked. Every actor hits the same gates. The repo enforces policy regardless of whether the diff came from a human, a first-party agent, or a third-party one.

That is the direction this needs to go.
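The second posture is easy to sketch. The gates below are toy stand-ins for real scanners such as CodeQL, advisory lookups, and secret scanning. They are not those tools' APIs, and the deny list is invented. What matters is the shape: the author is recorded, and no gate reads it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Change:
    author: str                           # "human", "copilot", "claude", ...
    added_lines: tuple[str, ...]
    new_dependencies: tuple[str, ...] = ()


Gate = Callable[[Change], list[str]]

# Hypothetical deny list standing in for an advisory database lookup.
DENIED_DEPENDENCIES = frozenset({"example-vulnerable-lib"})


def secret_gate(change: Change) -> list[str]:
    """Toy stand-in for secret scanning: flag lines that look like a hardcoded key."""
    return [f"possible secret: {line!r}" for line in change.added_lines if "API_KEY=" in line]


def dependency_gate(change: Change) -> list[str]:
    """Toy stand-in for a dependency advisory check."""
    return [f"denied dependency: {dep}" for dep in change.new_dependencies
            if dep in DENIED_DEPENDENCIES]


GATES: tuple[Gate, ...] = (secret_gate, dependency_gate)


def findings(change: Change) -> list[str]:
    """Run every gate on the change. None of the gates look at change.author."""
    return [finding for gate in GATES for finding in gate(change)]


lines = ('API_KEY="abc123"',)
deps = ("example-vulnerable-lib",)
human = Change("human", lines, deps)
agent = Change("third-party-agent", lines, deps)
print(findings(human) == findings(agent))
```

If a gate ever needs to branch on the author, that is the moment model selection has leaked into the security model.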

automatic remediation is useful, but weird

The part that made me pause is not that GitHub runs security analysis.

It is that, when a problem is found, the agent attempts to resolve it before finalizing the pull request.

That is useful. It also changes the review loop.

Traditionally, a scanner reports an issue and a human fixes it. With agents, the loop can become: tool complains, agent modifies code, tool checks again, agent modifies code again, then the final diff appears for review.

That is probably right. It is also a reason to be more disciplined about logs and provenance.

If an agent introduced a dependency, the scanner flagged it, and the agent replaced it with another one, the final pull request may only show the end state. The interesting part might be the path it took to get there.

For small bugs, nobody will care.

For security-sensitive changes, regulated environments, and expensive incidents, people will care a lot.

This is where "agent wrote code" becomes too vague. We need to know which agent acted, under which permissions, which tools ran, and which checks were authoritative.

That sounds like paperwork until the first incident review.

Then it sounds like the minimum viable truth.
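The loop itself is simple. Keeping the path is the part worth sketching. This is a minimal illustration with a scanner and a fixer passed in as plain functions. It is not how GitHub implements remediation, and the names and the round limit are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Attempt:
    """One scan of one version of the code, kept for later review."""
    round_no: int
    code: str
    findings: tuple[str, ...]


def remediate(
    code: str,
    scan: Callable[[str], list[str]],
    fix: Callable[[str, list[str]], str],
    max_fixes: int = 3,
) -> tuple[str, list[Attempt], bool]:
    """Scan, let the agent fix, rescan. Return final code, full history, and whether it is clean."""
    history: list[Attempt] = []
    for round_no in range(max_fixes + 1):
        found = scan(code)
        history.append(Attempt(round_no, code, tuple(found)))
        if not found:
            return code, history, True
        if round_no == max_fixes:
            break  # out of attempts; hand the unresolved findings to a human
        code = fix(code, found)
    return code, history, False


def toy_scan(code: str) -> list[str]:
    return ["denied dependency: bad_dep"] if "bad_dep" in code else []


def toy_fix(code: str, found: list[str]) -> str:
    return code.replace("bad_dep", "ok_dep")


final, history, clean = remediate("import bad_dep", toy_scan, toy_fix)
print(final, clean)
for attempt in history:
    print(attempt.round_no, attempt.findings)
```

The final diff shows only the end state. The history list is what an incident review would actually ask for.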

this is a platform-team problem

One pattern keeps repeating across the agent tooling news: the useful features are becoming platform features.

Security validation. MCP server configuration. Repository-level review settings. Agent skills. Cost controls. Sandboxes. Audit trails. Hooks before and after tool use.

These are not features an individual developer should be hand-rolling for every repo.

If a company has hundreds of repositories, the platform team needs to answer boring questions. Which agents are allowed where? Which validations are mandatory? How are secrets isolated from agent execution? Can the company prove that the same rules apply across repos? Who owns the policy when it breaks a team at 5 PM on Friday?

None of this looks like the AI demo, but it decides whether agent adoption becomes productive infrastructure or a pile of clever one-off workflows.

the repository is becoming the enforcement layer

I like this GitHub change because it points to the right mental model.

The agent is not the trust boundary.

The repository is.

That does not mean the agent can be reckless. Agent identity, sandboxing, tool permissions, and prompt controls still matter. But the repository is where work becomes part of the system. It is where code review, CI, security scanning, branch protection, dependency policy, and audit history already meet.

So it makes sense for agent governance to land there too.

This is also a good reminder that "AI-native development" will not replace the old software delivery machinery as quickly as people think. It will mostly increase the load on that machinery: more pull requests, more generated dependencies, and more plausible mistakes arriving faster than reviewers can comfortably process them.

The answer is not to make every human reviewer faster by sheer force of will.

The answer is to move the obvious checks closer to the boundary and make them consistent.

what i would do now

If I were responsible for agent adoption in a real engineering organization, I would start with the repository policies before buying more seats or enabling more tools.

Inventory which automation actors can open pull requests today. Include the boring ones: dependency bots, release bots, internal scripts, CI workflows, and any agent experiments running from developer machines.

Then make the entry rules explicit.

At minimum, I would want mandatory security scanning, dependency checks, secret scanning, branch protections, and a clear distinction between "agent may propose" and "agent may merge." I would also want logs that make agent actions reconstructable later.
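Here is a minimal sketch of what explicit entry rules could look like if you wrote them down as code. Everything in it is an assumption for illustration: the check names, the `Actor` type, and `can_merge` are hypothetical and do not correspond to any GitHub API.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical check names; substitute whatever your CI actually reports.
REQUIRED_CHECKS = frozenset({"security-scan", "dependency-check", "secret-scan"})


@dataclass(frozen=True)
class Actor:
    name: str
    kind: str  # "human", "bot", or "agent"
    may_merge: bool = False  # only consulted for agents in this sketch


def can_merge(actor: Actor, passed_checks: set[str], human_approvals: int) -> tuple[bool, str]:
    """Apply the same entry rules to every actor, plus one extra rule for agents."""
    missing = REQUIRED_CHECKS - passed_checks
    if missing:
        return False, "missing checks: " + ", ".join(sorted(missing))
    if human_approvals < 1:
        return False, "needs at least one human approval"
    if actor.kind == "agent" and not actor.may_merge:
        return False, "agent may propose but not merge"
    return True, "ok"
```

The point of the sketch is the ordering: the mandatory checks apply to everyone first, and the propose-versus-merge distinction is a separate, explicit rule rather than a side effect of which tool opened the pull request.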

I would avoid making this a model-ranking exercise.

The question is not whether Claude, Codex, Copilot, or the next thing has the best benchmark score this month. The question is whether your repository can safely accept work from any of them under a policy you understand.

That is the difference between a demo and an engineering system.

the punchline

The interesting thing about third-party coding agents is not that they can write code.

We know they can write code.

The interesting thing is that they are becoming normal actors inside software delivery. Once that happens, the repository has to become stricter, not looser. The more autonomous the actor, the more boring the boundary needs to be.

GitHub extending security validation beyond its own agent has a large implication: agent governance cannot be a product-specific afterthought. It has to be a repository property.

That is the right direction.

Do not trust the diff because you like the model.

Trust the diff because it survived the same boundary every other change has to survive.


agent platforms have migrations too

Paulo Victor Leite Lima Gomes — Tue, 09 Jun 2026 00:01:52 +0000

There is a very useful moment in the life of every platform abstraction.

It is not the launch demo.

It is the migration guide.

Microsoft's refreshed public preview for hosted agents in Microsoft Foundry has one of those moments. The old preview hosting backend is being retired. Existing deployments on the initial backend do not move automatically. Teams that built on the earlier preview need to redeploy against the new model.

That sounds like normal preview churn, because it is. Preview platforms change. APIs move. SDKs rename things. Some early assumptions do not survive contact with actual users.

But the interesting part is not that a preview changed.

The interesting part is what changed.

The refreshed hosted-agent model has a new hosting backend, protocol libraries, identity model, management APIs, dedicated agent endpoints, and a session-based sandbox model. In other words, this is not just a wrapper around a model call getting a new package name. It is infrastructure.

Agent platforms are becoming real platforms, and real platforms have migrations.

preview is not a contract

I do not think Microsoft did anything especially surprising here.

If you deploy production-critical work on a public preview, you are accepting movement. That is the deal. Sometimes the movement is small. Sometimes the thing you built against turns into scaffolding for the thing that comes next.

The problem is that AI tooling encourages people to forget this.

The marketing surface says agent. The operational surface says runtime, identity, storage, isolation, protocols, lifecycle, files, sessions, and endpoints.

Those are very different categories of risk.

A chatbot prompt can be rewritten in an afternoon. A hosted agent wired into internal workflows, private tools, project data, authentication, audit logs, and user-facing automation becomes part of the system. When the hosting model changes, you are not only changing a library import. You are revisiting trust boundaries.

That is why migration guides are more honest than launch posts.

Launch posts tell you what the platform wants to become.

Migration guides tell you what the platform actually is.

the identity change is the tell

The most important line in this kind of migration is usually the identity line.

In the refreshed Foundry hosted-agent preview, each agent gets its own Entra identity at creation time, replacing the older shared project managed identity model. That is not a cosmetic change. It changes how access is granted, reviewed, debugged, and revoked.

This is exactly where agent platforms stop being toys.

If an agent can call tools, read files, reach services, write state, and keep a session alive across turns, then "what identity does it use?" is not an implementation detail. It is the control plane.

Shared identity is convenient early. It makes demos easier. It reduces setup friction. It lets the platform feel simple.

Dedicated identity is where the abstraction starts admitting that agents are actors.

They need least privilege. They need resource-specific access. They need ownership. They need audit trails that answer "which agent did this?" instead of "something under the project identity did this."
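To make the difference concrete, here is a toy sketch of per-agent grants. It is an in-memory stand-in I made up for illustration, not the Entra or Foundry API; the class name, identities, resources, and roles are all hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


class AccessRegistry:
    """Toy grant table keyed by agent identity (hypothetical, not a real cloud API)."""

    def __init__(self) -> None:
        # identity -> set of (resource, role) pairs
        self._grants: dict[str, set[tuple[str, str]]] = defaultdict(set)

    def grant(self, identity: str, resource: str, role: str) -> None:
        self._grants[identity].add((resource, role))

    def can(self, identity: str, resource: str, role: str) -> bool:
        return (resource, role) in self._grants.get(identity, set())

    def revoke_identity(self, identity: str) -> None:
        # Removes one agent's access without touching any other agent.
        self._grants.pop(identity, None)


registry = AccessRegistry()
registry.grant("triage-agent", "incident-queue", "read")
registry.grant("docs-agent", "docs-bucket", "write")
registry.revoke_identity("triage-agent")
```

With a shared project identity, that last line would have no equivalent: revoking access for one misbehaving agent would mean revoking it for all of them.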

This is the same story we already saw with workloads, services, CI jobs, and Kubernetes controllers. At the beginning, everything runs with the big credential because getting started matters. Later, after the first few uncomfortable questions, the platform grows more precise identity boundaries.

AI agents are speedrunning that history.

protocols are product boundaries

Another useful signal is the move from framework-specific adapters toward protocol libraries.

Framework adapters are great for adoption. If you are using LangGraph, Agent Framework, Semantic Kernel, or something custom, the platform can meet you where you are. That is a good on-ramp.

But adapters are not always a durable platform boundary.

Protocols are.

Foundry's refreshed model talks about Responses, Invocations, Activity, and A2A protocols. A single agent can expose multiple protocols. Dedicated endpoints now matter more. Management APIs cover the lifecycle more completely.

This is the boring part that matters.

If agents are going to be called by portals, webhooks, schedulers, internal tools, other agents, customer workflows, and maybe boring old scripts, the stable contract cannot be "whatever this framework adapter happened to expose in April."

The stable contract has to be the protocol.

This is also where teams should slow down and think.

When you choose an agent platform, you are not only choosing a model host. You are choosing the shape of invocation. You are choosing how sessions are represented. You are choosing how files move in and out. You are choosing how cancellation works. You are choosing where tool access lives. You are choosing whether your workflow is portable enough to survive the next platform refresh.

The model is the least sticky part.

The runtime contract is the sticky part.

sessions are state, and state is where platforms get serious

The refreshed preview also introduces a session-based sandbox model, with a persistent home directory and file storage across turns and idle periods.

That sounds convenient. It is convenient.

It is also state.

State is where simple agent demos become production systems.

A stateless model call is easy to reason about. Input goes in. Output comes out. You log the request, maybe store the response, and move on.

A session-based agent is different. It can accumulate files. It can remember working context. It can do multi-turn work in an environment that has continuity. The platform can provision compute on demand and deprovision it after inactivity, but the session can still matter.

That creates better user experiences. It also creates better failure modes.

What happens when a session gets too large? What is retained? Who can inspect it? How does a user delete it? Does a later turn depend on a file written by an earlier one? Can the same agent handle concurrent work safely? Can a sensitive artifact survive longer than intended because it lives in the wrong sandbox path?

None of these questions are exotic. They are the normal questions we ask about stateful systems.

The point is that agent platforms do not exempt us from those questions. They just package them in a newer interface.

the migration is the architecture review

If your team built on the old hosted-agent preview, the practical work is obvious enough: update packages, change entry points, move to the new protocol libraries, update SDK and CLI calls, grant access to the dedicated agent identity, redeploy, verify the version is active, and test real workflows.

But I would not treat that as a checklist only.

I would treat it as an architecture review.

Ask what the old design assumed:

Did the agent rely on broad project-level identity?
Did any workflow assume a shared project endpoint?
Were tools configured as deployment-time metadata when they now belong behind runtime MCP access?
Did user sessions have implicit state that nobody documented?
Are logs and audits good enough to explain agent behavior after the migration?
Can the agent be redeployed without losing important operational context?

This is the useful pain of platform churn. It exposes which parts of your system were accidental.

Some teams will discover they only had a demo. Fine. Migrate it quickly and move on.

Other teams will discover they already have a production dependency on an agent runtime they did not really model as production infrastructure. That is the more interesting case.

build agents like the platform will change

The lesson is not "avoid previews forever."

That would be too easy, and also unrealistic. The agent space is moving quickly. If you wait for every surface to become boring, you will learn late.

The lesson is to keep your own contracts small.

Put your domain logic somewhere you control. Keep tool definitions explicit. Treat platform sessions as useful runtime state, not as your only system of record. Do not let authorization live only in the agent instructions. Keep invocation boundaries narrow enough that another host could call the same workflow later. Write down which resources the agent identity needs and why.
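One way to picture a narrow invocation boundary, as a sketch only: the domain logic depends on a tiny interface, and each platform gets a thin adapter behind it. The `ToolHost` protocol, the `tickets.get` tool name, and the fake host below are all hypothetical names I invented for the example.

```python
from typing import Protocol


class ToolHost(Protocol):
    """The only thing the workflow needs from any agent platform (hypothetical contract)."""

    def call_tool(self, name: str, args: dict) -> dict: ...


def summarize_ticket(host: ToolHost, ticket_id: str) -> str:
    # Domain logic lives here and knows nothing about the hosting platform.
    ticket = host.call_tool("tickets.get", {"id": ticket_id})
    return f"{ticket['title']} ({ticket['status']})"


class FakeHost:
    """Stand-in host for tests; a real adapter would wrap the platform SDK."""

    def call_tool(self, name: str, args: dict) -> dict:
        return {"title": "Rotate keys", "status": "open"}


summary = summarize_ticket(FakeHost(), "T-1")
```

If the platform changes its hosting model, the adapter changes. `summarize_ticket` does not.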

Most importantly, separate the agent's job from the platform's job.

The agent's job is to reason, call tools, and produce useful work.

The platform's job is to provide identity, isolation, lifecycle, protocols, files, observability, and policy.

When those boundaries blur, every platform change becomes a product rewrite.

When they are clear, a migration is still annoying, but it is survivable.

this is what maturity looks like

I expect more of this, not less.

Agent platforms will keep changing because the industry is still discovering the right runtime shape. We will see more dedicated identities, more protocol work, more sandbox models, more managed tool registries, more audit surfaces, more session APIs, and more lifecycle controls.

That is not a failure of the agent idea.

That is the agent idea becoming infrastructure.

Infrastructure has versions. Infrastructure has deprecations. Infrastructure has migration guides that ruin your morning and improve your architecture if you read them carefully.

The teams that treat agents as magic assistants will experience this as chaos.

The teams that treat agents as workloads will recognize the pattern.

Your agent platform is not only a place where prompts run. It is a runtime with identity, state, isolation, protocols, and operational contracts.

And runtimes have migrations.

references

Microsoft Learn: Migrate hosted agents to the refreshed public preview


permission prompts are not an agent security strategy

Paulo Victor Leite Lima Gomes — Mon, 08 Jun 2026 00:01:44 +0000

Docker published a practical guide last week on securing AI agents, and one sentence in it should be printed on a sticker for every engineering team adopting coding agents:

Permission prompts are not a security strategy.

That is not the whole guide, obviously. Docker talks about isolation, tool access, identity, credentials, runtime monitoring, MCP provenance, and multi-agent trust boundaries. Good. Those are the grown-up topics.

But the permission prompt line is the one that stuck with me, because it names a habit I keep seeing in agent products and internal demos: the agent wants to do something risky, so the product asks the human to approve it.

This feels safe because a human is technically in the loop. It also feels familiar because developers already approve things all day: browser permissions, OAuth scopes, package installs, CI reruns, deploy buttons, cloud console warnings, and the occasional horrifying Terraform plan.

The problem is that prompts are usually a speed bump, not a boundary.

prompts train people to click yes

Security controls that depend on constant human attention tend to decay into theater.

Not because developers are careless. Because the workflow teaches them that approval is the price of progress.

If an agent asks for permission twenty times during a coding task, the first prompt might get real scrutiny. The second still gets a glance. By the tenth, the developer is reading for whether the action looks vaguely aligned with the task. By the twentieth, the developer is clicking approve because the alternative is babysitting a machine that was supposed to save time.

This is not a character flaw. It is interface design.

Agents make this worse because the sequence is dynamic. A coding agent may need to inspect files, install packages, run tests, generate migrations, call an MCP server, open documentation, and retry after a failure. Each step can be individually reasonable. The risky thing is the chain.

Approving one command at a time does not mean the whole workflow is safe. It just means the danger has been sliced into small enough pieces that each piece looks acceptable.
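A small sketch of that gap, under made-up assumptions: the action names, the sensitive file suffixes, and both functions are hypothetical, and a real policy would be far richer. It only shows that a per-step check and a whole-chain check can disagree about the same sequence.

```python
from typing import NamedTuple


class Step(NamedTuple):
    action: str
    target: str


# Hypothetical per-step allowlist: every action here looks reasonable alone.
ALLOWED_ACTIONS = {"read_file", "install_package", "run_tests", "network_request"}
# Hypothetical markers for files that hold credentials.
SENSITIVE_SUFFIXES = (".env", "id_rsa")


def step_looks_fine(step: Step) -> bool:
    """What a prompt effectively asks the human: is this one action acceptable?"""
    return step.action in ALLOWED_ACTIONS


def chain_looks_fine(steps: list[Step]) -> bool:
    """Reject any network request that happens after a sensitive file was read."""
    saw_sensitive_read = False
    for step in steps:
        if step.action == "read_file" and step.target.endswith(SENSITIVE_SUFFIXES):
            saw_sensitive_read = True
        elif step.action == "network_request" and saw_sensitive_read:
            return False
    return True


steps = [
    Step("read_file", "config/.env"),
    Step("run_tests", "unit"),
    Step("network_request", "https://example.com/upload"),
]
```

Every step in `steps` passes `step_looks_fine`, and the chain as a whole fails `chain_looks_fine`. A human clicking approve three times sees only the first view.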

the agent needs a sandbox, not a chaperone

The better model is to give the agent autonomy inside a disposable boundary.

That boundary can be a container, a microVM, a remote development environment, or some other sandbox. The implementation matters, but the shape matters more: the agent gets enough room to do useful work, and the host machine does not become part of the blast radius.

This is why I like the container framing for agents. Not because containers are magic security dust. They are not. But because they force better questions: what filesystem can the agent see, what network can it reach, which credentials are present, and what happens when the environment is destroyed?

A prompt asks a tired human to make a local judgment. A sandbox makes the dangerous default impossible or at least contained.

If the agent needs to run tests or install packages, fine. But do that in an environment where the agent cannot casually read ~/.ssh, scrape unrelated repositories, phone home to arbitrary domains, or inherit a developer's entire cloud identity because that token happened to be in the shell.
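As a very small illustration of the credential part only, here is a sketch that starts a command with an allowlisted environment instead of the developer's full shell. The allowlist and function names are my own assumptions. This is not a sandbox: it does nothing about filesystem or network access, which is exactly why the container or microVM still matters.

```python
import os
import subprocess
from collections.abc import Mapping

# Hypothetical allowlist: only variables the task is known to need.
ALLOWED_ENV_KEYS = frozenset({"PATH", "LANG"})


def scrubbed_env(source: Mapping[str, str], allowed: frozenset = ALLOWED_ENV_KEYS) -> dict:
    """Keep only allowlisted variables so tokens in the shell are not inherited."""
    return {key: value for key, value in source.items() if key in allowed}


def run_agent_command(command: list) -> subprocess.CompletedProcess:
    """Run a command with the scrubbed environment rather than os.environ as-is."""
    return subprocess.run(command, env=scrubbed_env(os.environ), check=False)


example = scrubbed_env(
    {"PATH": "/usr/bin", "AWS_SECRET_ACCESS_KEY": "x", "GITHUB_TOKEN": "y"}
)
```

The design choice is allowlisting rather than blocklisting. A blocklist only removes the secrets you remembered to name.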

That is the difference between supervising a risky process and designing a system where the risky process has less to break.

tool access is the real permission system

The next weak spot is tool access.

Most agent demos treat tools as capability candy. Add GitHub, Slack, Jira, docs, the database, the deployment system, a browser, and the internal MCP server someone wrote last month. The agent becomes more useful, the demo gets better, and the trust boundary quietly expands.

This is where permission prompts are especially misleading.

The important question is not whether the agent asked before using a tool. It is whether that tool should have been available for this task at all.

A frontend refactor agent does not need production database access. A dependency update agent does not need customer transcripts. A docs summarizer does not need deploy permissions. A local coding assistant probably does not need arbitrary internet access while reading private code.

This sounds obvious until you look at how teams wire tools. The easiest integration model is "give the agent everything and rely on the model to choose wisely." That is not a permission model. That is wishful thinking with JSON schemas.

Tool access should be scoped by task, environment, repository, data classification, and identity. Ideally, a gateway enforces that policy consistently instead of leaving every agent runtime to invent its own rules.
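A sketch of what the gateway decision could look like, assuming a hypothetical allowlist keyed by agent identity, task, and environment. The agent names, task labels, and tool names are invented for the example, and a real gateway would also consider repository and data classification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolRequest:
    agent_id: str
    task: str
    environment: str
    tool: str


# Hypothetical policy: (agent identity, task, environment) -> tools granted.
POLICY = {
    ("frontend-refactor-agent", "refactor", "dev"): {"git.read", "git.push_branch", "tests.run"},
    ("dependency-update-agent", "deps", "dev"): {"git.read", "git.push_branch", "registry.read"},
}


def is_allowed(request: ToolRequest) -> bool:
    """Deny by default: a tool is usable only if this exact scope grants it."""
    granted = POLICY.get((request.agent_id, request.task, request.environment), frozenset())
    return request.tool in granted


ok = is_allowed(ToolRequest("frontend-refactor-agent", "refactor", "dev", "tests.run"))
blocked = is_allowed(ToolRequest("frontend-refactor-agent", "refactor", "dev", "db.prod.query"))
```

Notice that the model never gets to "choose wisely" about the production database. The tool is simply not in scope for that task.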

MCP makes this more urgent, not less. It standardizes how agents connect to tools, which means tool descriptions, server provenance, and tool behavior become part of the security surface. A malicious or sloppy tool is not just bad code. It is an instruction source the agent may trust.

If you would not install a random GitHub App across your production organization, do not casually hand the equivalent MCP server to every coding agent.

agents need their own identities

One of the fastest ways to make agent security unmanageable is to run agents as the developer. It is convenient. It also makes the audit story terrible.

If the agent uses my token, every action looks like me. If it pushes a branch, calls an API, reads a document, opens a ticket, or touches infrastructure, the system records Paulo did it. Maybe I asked the agent to do it. Maybe a prompt injection steered it. Maybe a tool description was poisoned. The logs do not know.

Service accounts exist because we learned this lesson already. Automated systems need identities that are scoped, auditable, revocable, and understandable. Agents are automated systems. They should have their own identities.

Not one universal "ai-agent" account with god permissions. Real identities with purpose: this repo migration agent, this incident triage agent, this documentation agent. Each one should have the minimum useful access for the job and a clear owner.

Short-lived credentials help. Runtime secret injection helps. Logs that connect the human request, agent identity, tool call, and result help even more.

The point is not to remove humans from accountability. The point is to make the machine actor visible enough that accountability means something.

what i would actually do

If a team asked me how to start securing coding agents, I would not begin with a giant AI governance committee. I would start with four defaults.

First, agents run in disposable environments. No direct host access unless there is a specific reason. No accidental inheritance of local secrets.

Second, network access is denied by default and allowlisted by task. Package registries, docs, and internal APIs should be explicit.

Third, tools are granted per workflow. Do not preload every MCP server because it might be useful someday. Useful someday is how access sprawl becomes normal.

Fourth, every agent action worth caring about is logged with enough context to explain it later: tool, parameters, identity, policy, outcome. Not because auditors are fun, but because production systems deserve evidence. Prompts cannot give you that memory after an incident.
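The second and fourth defaults fit together naturally, so here is one sketch covering both: a deny-by-default egress check that returns an audit record for every decision. The host allowlist, task names, policy label, and field names are assumptions for illustration, not a standard log schema.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlsplit

# Hypothetical allowlist of hosts per task; anything else is denied.
ALLOWED_HOSTS_BY_TASK = {
    "install-deps": {"pypi.org", "files.pythonhosted.org"},
    "read-docs": {"docs.python.org"},
}


def egress_decision(agent_id: str, requested_by: str, task: str, url: str) -> dict:
    """Decide whether the agent may reach the URL and describe the decision."""
    host = urlsplit(url).hostname  # None when the URL has no host part
    allowed = host is not None and host in ALLOWED_HOSTS_BY_TASK.get(task, frozenset())
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "requested_by": requested_by,
        "agent_id": agent_id,
        "tool": "network.egress",
        "parameters": {"task": task, "host": host},
        "policy": "egress-allowlist-by-task",
        "outcome": "allowed" if allowed else "denied",
    }


record = egress_decision(
    "dependency-update-agent", "dev@example.com", "install-deps", "https://pypi.org/simple/requests/"
)
log_line = json.dumps(record, sort_keys=True)
```

Denied requests get a record too. After an incident, what the agent tried and was refused is often as useful as what it did.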

Permission prompts can still exist. Sometimes they are useful. A prompt before deleting files is fine. A prompt before spending money is fine. A prompt before pushing a branch is fine.

But prompts should be the last-mile confirmation for unusual actions, not the main wall between the agent and the rest of your environment.

the punchline

Agents are becoming useful because they can act without asking us about every tiny step.

That is also why "just ask the human" is such a weak security model.

If the agent needs a chaperone for every action, it is not autonomous enough to deliver the workflow. If the agent can act autonomously, then the security model has to live in the environment: sandboxing, scoped tools, dedicated identities, network policy, credential hygiene, and logs.

The industry is slowly rediscovering an old systems lesson. You do not secure dangerous work by hoping every operator makes perfect decisions under interruption. You secure it by shaping the room where the work happens.

So yes, keep the prompt for the weird command.

But build the boundary first.


enterprise context is becoming the new ai platform lock-in

Paulo Victor Leite Lima Gomes — Sun, 07 Jun 2026 00:01:45 +0000

Microsoft Build 2026 had plenty of normal launch noise: more agents, more Copilot surfaces, more model choice, more reasons to pretend your backlog is about to become self-aware.

The part I keep coming back to is Microsoft IQ.

Not because the name is great. It sounds like a dashboard metric invented in a meeting with too much coffee.

But the shape of the product is important. Microsoft is turning organizational context into a platform layer for agents: Work IQ for Microsoft 365 signals, Fabric IQ for business data, Foundry IQ for application and agent context, and Web IQ for fresh web grounding. Work IQ APIs are scheduled for general availability on June 16, and the licensing page says API usage will be billed through Copilot Credits.

That is the real announcement.

The enterprise AI platform is no longer only the model. It is the context layer around the model.

And context is sticky.

the model is getting easier to swap

For the last two years, a lot of AI strategy sounded like model procurement.

Which model is best? Which provider has the longest context window? Which benchmark should we trust? Which one is cheapest for summarization? Which one can write better React code? Which one is allowed to see customer data?

Those questions still matter, but they are becoming less decisive.

The big platforms are all moving toward model routing. Microsoft talks openly about model diversity across OpenAI, Anthropic, its own MAI models, and other providers. GitHub and cloud platforms are also getting more comfortable with the idea that different agents may use different models for different work.

That is the sensible direction. A company should not bet its whole engineering workflow on a single model forever. Models improve, prices change, latency changes, risk profiles change, and some tasks simply need different tradeoffs.

But if the model becomes easier to swap, the durable lock-in moves somewhere else.

It moves to the layer that knows the company.

context is where the enterprise lives

Enterprise software is mostly local truth.

The architecture diagram is useful, but the real rule is in the migration note. The roadmap says one thing, but the customer escalation thread says another. The public API contract is in docs, but the important exception is buried in a pull request from 2023. The compliance requirement is in a policy document nobody reads until audit week. The team decision happened in a meeting, then got half-summarized in Teams, then became a Jira ticket with the interesting sentence removed.

This is why generic agents hit a wall inside real organizations.

They can write code. They can explain APIs. They can search the web. They can be impressive in a clean repository with a good prompt.

But most useful enterprise work depends on knowing who decided what, which system owns which behavior, what the organization is allowed to do, and where the weird exceptions live.

Microsoft IQ is interesting because it goes straight at that problem. The official framing is a unified intelligence layer where agents and Copilot interactions are grounded in a shared understanding of the organization. Work IQ exposes Microsoft 365 context while preserving permissions and governance controls. Web IQ offers MCP-native grounding for external knowledge.

That is not just better retrieval.

That is the vendor saying: your agents should think through our map of your company.

this is useful and uncomfortable

I do not want to pretend this is bad technology.

If your company already lives in Microsoft 365, a context layer that understands mail, calendar, Teams, SharePoint, OneDrive, documents, organizational relationships, and existing permissions is genuinely useful.

An agent helping with a project should know the meeting where the decision happened. It should know the file the finance team treats as the source of truth. It should respect the same access controls a human employee has. It should not ask every team to rebuild enterprise search from scratch just so a model can answer "what is the current plan for this migration?"

That is the good version.

The uncomfortable version is that context is not portable in the same way a model endpoint is portable.

Once your agents depend on Work IQ semantics, Microsoft 365 permissions, Copilot Credits, Fabric ontologies, admin controls, MCP tools, and whatever ranking logic decides which internal fact matters, you have built more than an app. You have built on a vendor's interpretation of your organization.

That may be worth it.

But it is still lock-in.

lock-in is not always a trap

Engineers sometimes talk about lock-in as if it is automatically a moral failure.

That is too simplistic.

Every useful abstraction creates some lock-in. Postgres locks you into Postgres behavior. Kubernetes locks you into Kubernetes concepts. Stripe locks you into Stripe's payment model. GitHub locks you into GitHub's pull request workflow. The question is not "is there lock-in?" The question is whether the value is worth the exit cost.

For enterprise context platforms, the exit cost may be higher than teams expect.

Moving from one LLM API to another is annoying, but manageable if you planned for it. Moving from one vector store to another is also annoying, but at least the data shape is somewhat visible.

Moving away from a context layer that has become the nervous system for agents is harder.

What exactly do you export? Documents, yes. Calendar events, yes. Messages, maybe. Permission models, organizational graphs, ranking behavior, implicit relationships, access checks, tool definitions, skills, policies, and usage history? That is where it gets messy.

The most valuable part of the system is not only the raw data. It is how the platform turns the raw data into usable context at runtime.

That runtime behavior is hard to reproduce.

context governance becomes platform work

The practical question is not whether companies should use Microsoft IQ. Many will, and for good reasons.

The practical question is who owns the context layer.

If agents are going to reason over company memory, then context cannot be treated like a magic backend feature. It needs platform ownership.

Someone has to answer boring questions:

Which data sources are allowed to ground agents?
Which agents can use Work IQ, Web IQ, Fabric IQ, or custom MCP servers?
How are permissions inherited and audited?
What happens when an agent cites stale context?
Which teams can publish tools or skills into the agent environment?
How is usage billed, budgeted, and explained?
What is the fallback path if the context provider is unavailable?
How do you test whether the retrieved context was good enough?

Those questions sound operational because they are.

The mistake would be letting every team wire agents directly into whatever context source is easiest, then discovering a year later that nobody can describe which agents know what.

That is how "AI adoption" becomes another enterprise archaeology project.

microsoft is not the only one doing this

Microsoft is just the clearest example this week.

AWS is packaging agent tooling with IAM, CloudTrail, CloudWatch, managed MCP, docs retrieval, and sandboxing. GitHub is exposing cloud-agent configuration through APIs. Docker is hardening images and MCP servers. Everyone is converging on the same truth: agents need governed context, tools, permissions, logs, and runtime boundaries.

The model is the flashy part.

The platform is the part enterprises actually buy.

This is also why "model-agnostic" claims need careful reading. Web IQ being MCP-native and model-agnostic is useful. It means you are not necessarily locked into one inference provider. But a model-agnostic context layer can still be context lock-in.

You may be able to swap the brain.

You may not be able to swap the memory.

what i would do before standardizing

If I were evaluating this inside a real company, I would not start with a grand AI architecture diagram.

I would start with one workflow where context is obviously the bottleneck.

Maybe support escalation summaries. Maybe engineering design reviews. Maybe compliance-heavy product changes. Maybe internal developer onboarding. Pick something where the agent needs real organizational memory, not just a web search and a codebase.

Then measure the unglamorous stuff.

Did the agent find the right source documents? Did it respect permissions? Did it cite useful evidence? Did it miss important context? Did it retrieve too much irrelevant noise? Did humans trust the answer more, or did they spend the same amount of time verifying it? What did the workflow cost in Copilot Credits or equivalent usage units?
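Two of those questions, "did it find the right source documents" and "did it retrieve too much noise", can be turned into numbers if someone labels the relevant documents for a handful of test questions. A minimal sketch, assuming hypothetical document IDs and a hand-labeled relevant set:

```python
from __future__ import annotations


def retrieval_scores(retrieved: list[str], relevant: set[str]) -> dict[str, float]:
    """Precision: share of retrieved docs that were relevant.
    Recall: share of relevant docs that were retrieved."""
    unique_retrieved = set(retrieved)
    hits = unique_retrieved & relevant
    precision = len(hits) / len(unique_retrieved) if unique_retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical example: the agent pulled four documents, two of which mattered,
# and missed one document a human reviewer considered essential.
scores = retrieval_scores(
    retrieved=["design-doc", "old-wiki", "escalation-thread", "random-faq"],
    relevant={"design-doc", "escalation-thread", "compliance-policy"},
)
```

Low precision is the noise problem. Low recall is the missing-context problem. Neither number tells you whether humans trusted the answer, so this complements the human review rather than replacing it.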

Most importantly: document the dependency you are creating.

Which APIs are now in the critical path? Which data sources matter? Which admin settings define the security boundary? Which vendor-specific semantics would be painful to replace?

That is not anti-vendor paranoia. It is basic engineering hygiene.

the punchline

Microsoft IQ is a good signal for where enterprise AI is going.

The winning platforms will not just offer better models. They will offer better access to the messy, permissioned, constantly changing context inside the company.

That is useful. It is also where the next generation of lock-in will live.

The old cloud lock-in was compute, storage, databases, and deployment pipelines. The new AI lock-in is organizational memory, tool permissions, grounding behavior, and the agent runtime that makes those things usable.

So yes, use the context layer. The generic assistant that does not understand the organization will be too shallow for serious work.

But do not treat context as a free ingredient.

Context is infrastructure now.

And once your agents depend on it, it deserves the same boring attention as every other piece of infrastructure: ownership, observability, cost controls, security boundaries, portability plans, and a clear understanding of what happens when it lies.


codex on AWS makes agents a cloud workload

Paulo Victor Leite Lima Gomes — Sat, 06 Jun 2026 00:02:44 +0000

OpenAI putting Codex and frontier models on AWS is easy to read as a distribution story.

One more cloud. One more enterprise channel. One more place where procurement can buy the shiny thing without a special exception.

That reading is not wrong, exactly. It is just too small.

The interesting part is not that Codex can now run closer to AWS customers. The interesting part is that a coding agent is no longer just an IDE feature, a CLI, or a chat window pointed at a repository.

It is becoming a cloud workload.

That changes the shape of the problem. Once an agent can read code, open pull requests, call tools, reach cloud APIs, inspect logs, and maybe modify infrastructure, the hard questions are not "which model is smartest?" The hard questions are the old boring platform questions:

Which identity is the agent using?
Which network can it reach?
Which secrets can it see?
Who approved the action?
Where is the audit trail?
What happens when the bill spikes?
Who owns the failure when the generated change breaks production?

Welcome back to software engineering.

distribution is governance in disguise

OpenAI's announcement says Codex on Amazon Bedrock brings the software engineering agent into AWS, using the security, governance, procurement, billing, and operational workflows enterprises already have.

That sentence is doing a lot of work.

For individual developers, a coding agent feels like a productivity tool. You give it a task, it changes files, you review the diff. The boundary is psychological and local: do I trust this thing enough to accept the patch?

For a company, that boundary is nowhere near enough.

An enterprise does not merely buy "a better coder." It buys an execution surface. The agent needs credentials. It needs access to source code. It may need package registries, internal docs, issue trackers, cloud consoles, observability systems, CI logs, and deployment pipelines. Every one of those integrations turns the agent from a helpful assistant into an actor inside the company's control plane.

That is why AWS availability matters.

Not because every company loves AWS so much that all AI must live there. Because many companies already have the boring machinery around AWS: IAM, CloudTrail, VPC boundaries, procurement, budgets, policy exceptions, account structures, incident processes, and security review muscle memory.

Putting Codex into that world is not just about latency or convenience. It is about letting the agent inherit an operating model the enterprise already understands.

the multi-cloud part is not optional

There is another uncomfortable implication here: agent infrastructure is going multi-cloud by default.

A normal engineering organization already lives across several control planes. Source code may be in GitHub. Identity may be in Okta or Entra. Production might be on AWS. Developer productivity might be tied to Microsoft. Some AI contracts may run through OpenAI directly. Some models may come through Bedrock. Some teams will still run local agents on laptops because that is where the work actually happens.

Nobody is going to cleanly centralize this.

The agent will cross boundaries because the work crosses boundaries.

It will read a ticket in one system, search docs in another, patch code in a repo, run tests in a sandbox, query logs in the cloud, and ask for a human review in a pull request. If you are lucky, each step has a useful audit trail. If you are unlucky, the agent becomes a very confident intern with five browser tabs, broad credentials, and no memory of why it clicked the thing it clicked.

This is the real platform problem.

The winning agent stack is not the one with the nicest chat box. It is the one that can carry identity, authorization, policy, observability, cost accounting, and review semantics across the whole workflow.

That is much harder than adding another model picker.

coding agents need platform contracts

I keep coming back to the same pattern: AI tools become serious when they stop pretending the model is the product.

For coding agents, the product is the system around the model.

The model can propose a fix. Fine. But a production-grade coding agent needs a contract for how that fix moves through the organization.

It needs scoped repository access, not "here is the company GitHub token." It needs ephemeral sandboxes that can be killed, inspected, and reproduced. It needs tool permissions that are narrow enough to reason about. It needs separate identities for reading logs, opening pull requests, and changing infrastructure. It needs policy gates before dangerous actions. It needs cost budgets. It needs provenance on generated code. It needs a way to explain what context it used.
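
To make that contract a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the AgentContract and Request types, the action names, and the decide function are my own illustration, not an API from Codex, Bedrock, or any AWS service. It only shows the shape: scoped repositories, a narrow action list, a human gate on dangerous actions, and a budget.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AgentContract:
    """Hypothetical contract describing what one agent identity may do."""
    identity: str
    allowed_repos: frozenset      # repositories this identity can touch
    allowed_actions: frozenset    # actions permitted without approval
    gated_actions: frozenset      # actions that require a named human approver
    budget_usd: float             # spend cap for this identity


@dataclass(frozen=True)
class Request:
    """Hypothetical description of one action the agent wants to take."""
    repo: str
    action: str
    estimated_cost_usd: float = 0.0
    approved_by: Optional[str] = None


def decide(contract: AgentContract, request: Request, spent_usd: float) -> Tuple[bool, str]:
    """Return (allowed, reason). Denies by default when nothing matches."""
    if request.repo not in contract.allowed_repos:
        return False, "repo out of scope"
    if request.action in contract.gated_actions:
        # Policy gate: dangerous actions need an explicit human approval.
        if request.approved_by is None:
            return False, "needs human approval"
    elif request.action not in contract.allowed_actions:
        return False, "action not permitted"
    if spent_usd + request.estimated_cost_usd > contract.budget_usd:
        return False, "budget exceeded"
    return True, "ok"


# Example contract with made-up names, for illustration only.
contract = AgentContract(
    identity="pr-bot",
    allowed_repos=frozenset({"org/api"}),
    allowed_actions=frozenset({"read_code", "open_pull_request"}),
    gated_actions=frozenset({"change_infrastructure"}),
    budget_usd=50.0,
)
```

The useful property is the default: an action that is not listed is denied, and a gated action without an approver is denied. A real system would put this behind the identity provider and the tool layer rather than in a helper function, but the decision table looks roughly like this.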

This sounds boring because it is boring.

It is also the difference between a demo and something a serious company can leave running.

The more capable agents become, the less acceptable it is to treat them as fancy autocomplete. Autocomplete suggests text. Agents act. Once a system acts, you need the boring nouns: identity, policy, audit, isolation, rollback.

AWS is packaging the path, not just the model

AWS is not being subtle about this either.

Agent Toolkit for AWS is framed around managed MCP, AWS-specific skills, IAM guardrails, CloudTrail and CloudWatch observability, and sandboxed execution. AWS Transform agents are being pushed into developer tools like Kiro, Cursor, Claude, and Codex. Bedrock AgentCore is part of the same story.

The cloud providers have noticed that generic model knowledge is not enough.

An agent that knows AWS from public documentation is useful until it confidently wires together a production footgun. Cloud knowledge changes. Services have edge cases. Account structures vary. Security expectations are local. The right answer often depends on company policy, not Stack Overflow consensus.

So the cloud providers are turning best practices into runtime inputs.

That is a big shift. Cloud guidance used to live in docs, templates, reference architectures, and the heads of staff engineers who had been burned before. Now it is being packaged as agent context, tool definitions, policy constraints, and managed execution environments.

This is not documentation becoming nicer.

This is documentation becoming executable.

what i would watch

If I were running platform engineering in a company adopting coding agents, I would not start by arguing about which model is "best."

I would start with the control plane.

Can every agent action be tied to a durable identity? Can I tell the difference between Paulo using an agent and the agent acting autonomously after Paulo assigned a task? Can I restrict tools by repository, environment, account, and risk level? Can I see the prompt, context sources, commands, diffs, test results, and approvals that led to a change? Can I cap spend per team? Can I revoke access without breaking every developer's local setup?
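
A rough sketch of what one of those audit records could carry, in Python. The field names, the "interactive" and "autonomous" labels, and the hash chaining are my assumptions for illustration; none of this is a CloudTrail or Codex schema. The point is that the record separates the agent identity from the human who assigned the work, and that the log can be checked for tampering.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical audit entry for a single agent action."""
    agent_id: str               # durable identity of the agent
    on_behalf_of: str           # human who assigned or supervised the task
    mode: str                   # "interactive" or "autonomous" (assumed labels)
    action: str
    target: str                 # repository, account, or environment
    context_sources: List[str]  # what the agent read before acting
    approved_by: Optional[str] = None


def _digest(prev_hash: str, record_dict: dict) -> str:
    body = json.dumps(record_dict, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(log: list, record: AuditRecord) -> None:
    """Append a record, chaining its hash to the previous entry."""
    prev_hash = log[-1]["hash"] if log else ""
    record_dict = asdict(record)
    log.append({"record": record_dict, "prev": prev_hash,
                "hash": _digest(prev_hash, record_dict)})


def verify(log: list) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = ""
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


def autonomous_actions(log: list) -> list:
    """Actions the agent took without a human in the loop at that moment."""
    return [e["record"] for e in log if e["record"]["mode"] == "autonomous"]
```

With a record like this, "Paulo using an agent" and "the agent acting after Paulo assigned a task" are two different values of mode under the same on_behalf_of, which is exactly the distinction the question asks for.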

Those questions sound unglamorous. That is usually a sign they are close to the real work.

I would also watch for a new kind of lock-in: operational memory. Policies, agent skills, tool contracts, approval flows, and audit records can become just as sticky as APIs and managed services.

That does not mean "avoid managed agent platforms." Managed platforms are often the right choice. It means teams should be explicit about what they are adopting. Are they buying model access, or are they buying an operating system for agentic work?

Those are very different commitments.

the punchline

Codex on AWS is not just OpenAI adding another distribution channel.

It is a signal that coding agents are moving from developer tools into cloud infrastructure.

That move is probably inevitable. The work agents need to do already crosses cloud boundaries, source control boundaries, identity boundaries, and organizational boundaries. Keeping agents trapped inside a chat box was never going to be the final shape.

But the moment agents become workloads, the conversation changes.

The model still matters. Of course it does. Better reasoning, better code generation, better tool use, and better reliability all matter.

They are not enough.

The serious question is whether the agent can operate inside the same governance envelope as the rest of the company. Identity. Audit. Network policy. Cost. Review. Rollback. Ownership.

The boring stuff again.

That is where the next phase of coding agents will be won or lost.


the developer laptop is the first production environment for agents

Paulo Victor Leite Lima Gomes — Fri, 05 Jun 2026 00:02:47 +0000

Docker published a post this week about securing AI agents, and the most interesting part was not really Docker.

The post makes the now-familiar argument that agents need execution isolation, tool access control, identity and credential management, and runtime monitoring. It also says the quiet part clearly: permission prompts are not enough.

That should be obvious.

It is not obvious enough.

Most of the discussion around coding agents still treats the developer machine as a convenient place where the magic happens. The agent runs in your editor. It sees the repository. It can call tools. It can read logs, run tests, install packages, open browsers, hit APIs, and sometimes push code. If it asks nicely before doing the scary thing, we call that security.

This is backwards.

The developer laptop is becoming the first production environment for agents.

local does not mean low risk

We have a bad habit of treating local development as safer than production because it is smaller.

Production has customer data, uptime guarantees, compliance rules, and dashboards. The laptop has a half-finished branch, a terminal, and a human nearby, so we relax.

But a modern developer machine is a wonderful target.

It has source code, cloud credentials, package manager tokens, SSH keys, browser sessions, local databases full of copied production-shaped data, and access to staging systems that are often less locked down than production. It also has build scripts that can run arbitrary code.

Now put an agent on that machine and give it the same tool surface a senior engineer uses.

That is not a chatbot anymore.

That is an automation actor inside one of the most privileged environments in the company.

permission prompts are a user interface, not a boundary

The obvious answer is to ask the human for permission.

Can I run this command?

Can I edit this file?

Can I access the network?

Can I install this dependency?

Prompts are useful. I like them. They slow things down just enough that I sometimes notice the agent is about to do something silly.

But prompts are not a security model.

They depend on the human understanding the full consequence of an action at the moment the prompt appears. That is a lot to ask when the command is npm install, the dependency graph is 800 packages deep, and the agent presents it as the next obvious step.

The prompt says "run tests."

The test command runs the project's lifecycle scripts, like npm's pretest hook.

The lifecycle script reads an environment variable.

The environment variable contains a token that can deploy to staging.

The human clicked yes because running tests is normal.

This is why runtime containment matters. A useful prompt can ask for intent. A real boundary limits blast radius when intent and consequence drift apart.
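
One small piece of that boundary can be sketched in Python: run the agent's command with an allowlisted environment instead of the developer's full shell environment. The SAFE_ENV_KEYS list and the run_contained helper are hypothetical names of mine, and this only covers environment variables. It does nothing about filesystem or network access, which need a container, microVM, or OS-level sandbox.

```python
import os
import subprocess

# Hypothetical allowlist: variables the command needs to start at all.
# Tokens such as a staging deploy credential are simply never copied over.
SAFE_ENV_KEYS = ("PATH", "HOME", "LANG", "TMPDIR")


def scrubbed_env(extra=None):
    """Build an environment from the allowlist plus explicitly passed values."""
    env = {k: os.environ[k] for k in SAFE_ENV_KEYS if k in os.environ}
    env.update(extra or {})
    return env


def run_contained(cmd, cwd=None, extra_env=None, timeout=60):
    """Run cmd without inheriting ambient credentials from the parent shell."""
    return subprocess.run(
        cmd,
        cwd=cwd,
        env=scrubbed_env(extra_env),  # replaces, not extends, the parent environment
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

With this in place, a lifecycle script that looks for a deploy token finds nothing, even if the human clicked yes. The prompt still asks about intent. The environment limits what the consequence can be.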

agents need less trust than developers

There is an uncomfortable point here: agents should usually have less authority than the developer supervising them.

That sounds inefficient. It is also how every other automation system eventually grows up.

We do not give CI the CEO's laptop just because it needs to run tests. We do not give a GitHub Action every cloud permission just because one deployment step needs one credential. We do not let a build worker write freely to every repository just because it might be convenient someday.

We scope automation because automation is fast, literal, and bad at knowing when the surrounding context has changed.

Coding agents are the same category of problem, just with a more charming interface.

An agent editing a README does not need AWS credentials. An agent refactoring a parser does not need the network by default. An agent generating a migration should not be able to apply it to a shared database unless that is an explicit, isolated workflow. An agent investigating a bug may need redacted logs through a narrow tool, not a shell with access to everything the human can reach.

The useful default is not "this is my assistant, so it inherits my machine."

The useful default is "this is an automation process, so it gets the smallest workspace and tool set that can complete the task."
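
As a sketch of that default in Python: map each kind of task to the smallest set of tools it needs, and deny anything unknown. The task names, tool names, and the TaskProfile type are all made up for illustration. A real agent runtime would have its own vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical description of what one kind of task is allowed to use."""
    tools: frozenset
    network: bool = False                 # no network unless the task says why
    credentials: frozenset = frozenset()  # no credentials unless scoped in


PROFILES = {
    # Editing a README needs files, not AWS credentials.
    "edit_docs": TaskProfile(tools=frozenset({"read_file", "write_file"})),
    # Refactoring a parser needs tests, not the network.
    "refactor": TaskProfile(tools=frozenset({"read_file", "write_file", "run_tests"})),
    # Generating a migration can write the file; applying it is a separate workflow.
    "generate_migration": TaskProfile(tools=frozenset({"read_file", "write_file"})),
    # Investigating a bug gets redacted logs through a narrow tool, not a shell.
    "investigate_bug": TaskProfile(
        tools=frozenset({"read_file", "query_redacted_logs"}),
        credentials=frozenset({"logs:read"}),
    ),
}

# Unknown task kinds get nothing rather than everything.
DENY_ALL = TaskProfile(tools=frozenset())


def profile_for(task_kind: str) -> TaskProfile:
    return PROFILES.get(task_kind, DENY_ALL)


def can_use(task_kind: str, tool: str) -> bool:
    return tool in profile_for(task_kind).tools
```

The table is boring on purpose. The interesting decision is the fallback: a task the platform has never seen gets an empty tool set, not the developer's whole machine.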

the old sandbox conversation is back

This is the part I find funny in a tired way.

We spent years building containers, microVMs, CI isolation, network policies, service accounts, secret managers, and audit trails. Then AI coding tools arrived and a lot of demos quietly stepped around those lessons because the local magic was too good to interrupt.

Now we are rediscovering the same architecture.

Run the agent in an isolated workspace. Mount only the files it needs. Give it an explicit network policy. Pass credentials through scoped tools instead of ambient environment variables. Separate read-only inspection from write-capable actions. Keep enough evidence that a human reviewer can reconstruct the path from prompt to diff.
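
The first two steps can be sketched with nothing but the Python standard library. This is a minimal illustration under my own assumptions: make_workspace is a hypothetical helper, and a copied directory is only a cheap place for accidental writes, not a security boundary. Real isolation still needs a container or microVM around it.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable


def make_workspace(repo_root: str, needed: Iterable[str]) -> Path:
    """Copy only the listed files into a fresh temporary workspace.

    Paths are validated first, so a request like "../secrets" or a symlink
    pointing outside the repository is rejected before anything is copied.
    """
    root = Path(repo_root).resolve()
    sources = []
    for rel in needed:
        src = (root / rel).resolve()
        try:
            rel_path = src.relative_to(root)
        except ValueError:
            raise ValueError(f"{rel!r} escapes the repository root")
        if not src.is_file():
            raise FileNotFoundError(f"{rel!r} is not a file in the repository")
        sources.append((src, rel_path))

    workspace = Path(tempfile.mkdtemp(prefix="agent-ws-"))
    for src, rel_path in sources:
        dst = workspace / rel_path
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
    return workspace
```

The agent works inside the returned directory. Files it was never given, such as a local .env, are simply not there, and the human decides which changes get promoted back to the real checkout.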

None of this is exotic. It is just platform engineering applied to the developer machine.

Docker talking about agent isolation is interesting because Docker already won the mental model for "this process gets a filesystem, a network shape, and a boundary." MicroVMs push the boundary further. Operating system features will keep moving in the same direction. Microsoft is already talking about agent runtime boundaries at the platform layer.

The direction is clear: agent security is moving below the app and editor.

It has to.

the laptop is where policy gets tested first

The enterprise version of this conversation will eventually become very formal.

There will be agent identities, audit APIs, policy engines, approved tool catalogs, model routing rules, and budget controls. There will be dashboards showing which agent touched which repo under which human supervisor.

Before that, there is a laptop.

That is where the first bad patterns appear:

agents reading files they did not need
agents installing packages from the public internet without review
agents using broad human credentials because scoped credentials are annoying
agents producing diffs without enough execution evidence
agents running commands whose side effects nobody inspected
agents mixing trusted project state with untrusted generated artifacts

Those are not hypothetical enterprise governance problems. They happen during ordinary development when the tool is good enough to keep using and not mature enough to fully trust.

The laptop is where the policy either becomes ergonomic or dies.

If the secure path is too slow, developers will route around it. If the isolated environment cannot run the project, they will turn it off. If credential scoping breaks every useful workflow, the broad token comes back. If the audit trail is unreadable, reviewers will ignore it.

Agent security has to be boring, local, and fast enough that people leave it on.

what i would do now

If I were responsible for rolling coding agents into an engineering team, I would start with local containment before writing a grand AI governance document.

First, separate workspaces. The agent should operate in a copy, branch, container, or sandbox where accidental writes are cheap. The human can review and promote the result.

Second, make network access explicit. Many coding tasks do not need the internet. The ones that do should say why. Package installation should be treated as a supply-chain event, not as background noise.

Third, remove ambient credentials. If the agent needs to call a service, give it a scoped tool or short-lived credential for that task. Do not let it inherit every token the developer happened to have in their shell.

Fourth, preserve evidence. Commands, outputs, files changed, tests run, and external calls should be visible in a reviewable trail. The human reviewer should not have to trust the agent's summary of what happened.

Fifth, make the safe path pleasant. This is the boring product work. If the sandbox takes five minutes to start, nobody will use it. If the policy requires twenty approvals for a typo fix, people will disable it. Security that cannot survive daily development is a memo, not a system.
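
The evidence step is the easiest one to start with. A minimal sketch in Python, assuming a hypothetical run_with_evidence wrapper that every agent command goes through: it records the command, exit code, output, and timing, so the reviewer reads what happened instead of the agent's summary of what happened.

```python
import hashlib
import json
import subprocess
import time
from typing import List, Optional, Sequence


def run_with_evidence(cmd: Sequence[str], trail: List[dict], cwd: Optional[str] = None):
    """Run a command and append a reviewable record of it to the trail."""
    started = time.time()
    proc = subprocess.run(list(cmd), cwd=cwd, capture_output=True, text=True)
    trail.append({
        "cmd": list(cmd),
        "cwd": cwd,
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        # The hash lets a reviewer confirm the stored output was not edited later.
        "stdout_sha256": hashlib.sha256(proc.stdout.encode("utf-8")).hexdigest(),
        "duration_s": round(time.time() - started, 3),
    })
    return proc


def write_trail(trail: List[dict], path: str) -> None:
    """Persist the trail as JSON Lines so it can be attached to a pull request."""
    with open(path, "w", encoding="utf-8") as fh:
        for entry in trail:
            fh.write(json.dumps(entry) + "\n")
```

This adds almost no friction, which is the point of the fifth step. A wrapper like this can sit under the agent's shell tool without the developer noticing it until review time.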

the punchline

Local coding agents feel personal because they sit in the editor and speak in first person.

Operationally, they are automation systems running inside a privileged environment.

That distinction matters.

The developer laptop has source code, credentials, build tools, staging access, and a human who is busy trying to ship. It is not a toy environment just because it is local. For agents, it is the first production environment, and it deserves production-shaped boundaries.

Permission prompts are useful. They are not enough.

The real product is containment, identity, scoped tools, audit trails, and a workflow developers will actually keep turned on.

Welcome back to software engineering. The agent is new. The security problem is not.
