What If AI Agents Had an Undo Button on Every API?
As LLMs learn to automate programming, what if we rewrote the APIs that AI agents use so that every action is reversible by default?
The basic idea of this post is simple: AI agents interact with the world through APIs – Gmail, Slack, Google Calendar, databases, filesystems. Most of these APIs execute actions immediately and irreversibly. When the agent sends an email, it’s sent. When it deletes a file, it’s gone. There’s no undo, no review step, no buffer.
What if that weren’t the case? What if every API an agent touches were wrapped in a layer that makes actions reversible by default – emails become drafts, file deletes become moves-to-trash, Slack messages go into a review queue? The agent does its work, and a human (or another AI) reviews a batch of proposed actions before they take effect. Anything bad gets caught; everything fine goes through.
This isn’t a new idea in software engineering. Databases have had transactions and rollback for decades. Distributed systems use the Saga pattern for multi-step workflows with compensating actions. Git is built around the idea that everything is reversible until you push. But nobody has systematically applied these patterns to AI agent safety – and there’s an interesting connection to programming language theory (specifically, free monads and algebraic effect handlers) that gives you a principled way to design such a system rather than doing it ad hoc.
Most AI safety work on agentic systems assumes a detection-first posture: watch what the agent does, flag suspicious actions, intervene before harm occurs. This is the core logic of AI control research as developed by Greenblatt et al. (2024). It works by pitting a monitor against an adversarial agent and measuring how often the monitor catches bad behavior. Reversibility is complementary: instead of trying to catch every bad action in real time, you make actions not matter until they’ve been reviewed. You don’t need perfect monitoring if you can undo the damage.
A note on scope: this post discusses both mundane agent failures (sent the wrong email, deleted the wrong file) and catastrophic ones (a misaligned agent attempting to exfiltrate weights, sabotage alignment research, or acquire resources outside its authorized scope). The mundane cases are useful for building intuition, and they’re the ones most people encounter today. But the research direction is ultimately motivated by the harder case – the one where the AI control community’s assumption of adversarial agents starts to bite. I use concrete, low-stakes examples throughout to make the architecture legible, but the point is to build infrastructure that degrades gracefully when the stakes are much higher.
What agent deployment actually looks like right now
Before getting into the theory, it’s worth being concrete about what we’re trying to improve.
Consider an AI agent deployed at a mid-size AI safety organization or an AI lab. Maybe it’s a coding agent that helps with research engineering, or an operations agent that handles scheduling, email triage, and document management. In a typical setup:
The agent has broad permissions. It has access to the organization’s Google Workspace (email, calendar, Drive), Slack, maybe a project management tool like Linear or Notion, and some internal systems. In practice, the agent often inherits the deploying employee’s permissions, because setting up fine-grained per-agent OAuth scopes is tedious and most APIs don’t make it easy. So an agent deployed by a senior engineer might have read/write access to every Google Doc in the organization, every Slack channel they’re in, every calendar, and potentially production infrastructure.
The oversight model is “glance at it occasionally.” The person who deployed the agent checks in on it every so often. Maybe they look at a log of what it’s been doing. Maybe they get a Slack notification when it completes a task. But nobody is reviewing every action in real time. This is especially true for long-running agents that operate over hours or days, like a coding agent working through a large refactor, or an operations agent that monitors a pipeline and takes corrective actions.
Containerization helps with some things but not others. If you’re running a coding agent in a sandbox (like Claude Code in a Docker container, or Codex in a cloud VM), the agent can’t mess with your local filesystem or steal your SSH keys. That’s good. But the agent might still have network access, API tokens, or MCP server connections that let it interact with external services. A containerized agent with a Gmail MCP connection can still send emails. A containerized agent with a Slack token can still post messages. The container limits what it can do to your machine, but not what it can do to the world.
The failure modes are mundane but real. Most of the time, the scary scenario isn’t “the agent exfiltrates secrets to a foreign adversary.” It’s more like: the agent misunderstands an instruction, sends an email to the wrong person, deletes a document it shouldn’t have, posts something embarrassing in a public Slack channel, or schedules a meeting with a candidate and puts confidential salary information in the calendar invite description. These are the kinds of mistakes humans make too, but humans make them one at a time, at human speed, and usually catch themselves partway through. An agent operating at API speed can make a hundred of these mistakes in a minute.
There is no undo. If the agent sends the wrong email, it’s sent. If it deletes a file from a system without version history, it’s gone. If it posts a message in Slack, everyone in the channel saw the notification, even if the message is deleted seconds later. The agent operates in a world where most actions take immediate, irreversible effect, and there is no systematic mechanism for batching, reviewing, or rolling back.
This is the status quo. It’s not catastrophically unsafe for most current use cases, because most current agents are doing relatively low-stakes work and humans are still pretty involved. But it’s a bad foundation to build on as agents get more capable and more autonomous. The question is what to do about it.
The landscape so far
The idea that agents should prefer reversible actions isn’t new to AI safety. DeepMind’s AI Safety Gridworlds (Leike et al., 2017) included an irreversible side-effects environment: a Sokoban-style puzzle where the agent must choose between pushing a box into a corner (irreversible) or to the side (reversible), and should prefer the reversible path even though it’s longer. Krakovna et al. (2019) developed this into a research program on impact penalties, designing reward modifications that penalize agents for taking irreversible actions. The core technical challenge they identify – that naive unreachability penalties “max out” after the first irreversible action, removing further incentive for caution – is subtle and important.
But that work is about training agents to prefer reversible actions. The question I’m interested in is different: can we engineer the environment so that irreversible actions are structurally impossible, or at least require explicit escalation?
Two recent systems take steps in this direction:
STRATUS (Chen et al., 2025; arXiv:2506.02009), from IBM Research and UIUC, is a multi-agent system for autonomous cloud site reliability engineering. Its key contribution is Transactional No-Regression (TNR): the agent operates in transactions, assesses whether each transaction improved the system state, and rolls back if it didn’t. Critically, if the agent proposes a destructive, non-recoverable action (like deleting a database), the system rejects it before execution and forces the agent to find a reversible alternative or escalate to a human. The paper reports that TNR-enabled agents significantly outperform non-transactional baselines on cloud failure mitigation benchmarks. The researchers note that “transactional abstractions are well-studied in database management, but it’s still unclear which can be transferred to agents.”
SagaLLM (Chang & Geng, 2025; arXiv:2503.11951), from Stanford, adapts the Saga pattern from distributed databases to multi-agent LLM planning. In the Saga pattern, each step in a long-running workflow has a defined compensating action that undoes it. If step 3 of 5 fails, the system runs the compensating actions for the steps that already completed (step 2, then step 1, in reverse order) to restore a consistent state. SagaLLM applies this to multi-agent coordination, adding persistent memory and independent validation agents. It relaxes strict ACID guarantees (you don’t get serializability across agents) but provides workflow-level consistency and recovery.
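To make the Saga pattern concrete, here is a minimal Python sketch. It is not SagaLLM’s implementation; `run_saga` and the recruiting-flavored step names are hypothetical, chosen only for illustration. The sketch compensates only the steps that completed; a step that can fail partway through would also need its own cleanup.

```python
def run_saga(steps):
    """Run (action, compensate) pairs in order.

    If an action raises, run the compensations of the completed steps
    in reverse order and report failure.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log = []


def failing_step():
    raise RuntimeError("step 3 failed")


steps = [
    (lambda: log.append("book room"), lambda: log.append("cancel room")),
    (lambda: log.append("invite panel"), lambda: log.append("retract invites")),
    (failing_step, lambda: log.append("unused compensation")),
    (lambda: log.append("never runs"), lambda: log.append("unused compensation")),
]

ok = run_saga(steps)
print(ok, log)
```

The compensations run newest-first, so the log ends with "retract invites" followed by "cancel room". Note that compensation restores state, not secrecy: anyone who saw the invites before they were retracted still saw them.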
There’s also relevant work from Anthropic’s measurement team. Their study of agent autonomy in practice found that most agent actions on Anthropic’s API are currently low-risk and reversible, with software engineering representing about half of all agentic activity. They observed that experienced users shift from approving individual actions toward monitoring-and-intervening – a natural drift toward the tiered model I’ll describe below.
A concrete proposal: the transactional agent runtime
Here’s what I think should actually get built. I’ll describe it concretely, then discuss the costs.
The basic architecture
You put a middleware layer between the agent and every external service it talks to. The agent never directly calls the Gmail API, the Slack API, the filesystem, or anything else. Instead, it calls the middleware, which captures each action as a proposed effect and decides how to handle it based on the action’s reversibility profile.
Three tiers:
Auto-commit (green tier). Actions that are cheap to reverse and low-risk. Reading a file. Reading an email. Querying a database (read-only). Looking up someone’s calendar availability. Creating a local file in a sandboxed workspace. These go through immediately, because there’s little or nothing to undo.
Buffer-and-review (yellow tier). Actions that are technically reversible but have some cost or leakage. Sending an email (could be created as a draft instead). Posting a Slack message (could be held in a queue). Creating a calendar event (could be created as tentative/pending). Editing a shared document (could be staged as a suggestion). These get batched into a queue that a human reviewer (or an automated review agent) processes periodically. The key property: the agent can continue working as if the action succeeded – it just hasn’t actually taken effect yet. The agent might draft 10 emails and schedule 3 meetings in the course of a task; all of those sit in a review queue until someone approves the batch.
Escalate (red tier). Actions that are genuinely irreversible or high-stakes. Deleting data from a system without version history. Sending a message to an external contact (outside the organization). Making a purchase. Modifying access permissions. Pushing code to production. Sending an API request to an external service that triggers downstream effects you can’t control. These block until a human explicitly approves them. The agent gets told “this action is pending approval” and can continue with other work, but the action doesn’t execute.
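A minimal sketch of the three-tier routing in Python. Everything here is hypothetical: the action names, the `TIER_POLICY` table, and the `Middleware` class are invented for illustration, and a real policy would be derived from each service’s actual rollback affordances. The one design choice worth noting is that unknown actions default to the red tier.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    GREEN = "auto-commit"
    YELLOW = "buffer-and-review"
    RED = "escalate"


# Hypothetical policy table. Anything not listed falls through to RED.
TIER_POLICY = {
    "read_email": Tier.GREEN,
    "read_file": Tier.GREEN,
    "send_email": Tier.YELLOW,
    "post_slack_message": Tier.YELLOW,
    "delete_file": Tier.RED,
    "modify_permissions": Tier.RED,
}


@dataclass
class Action:
    name: str
    params: dict


@dataclass
class Middleware:
    executed: list = field(default_factory=list)      # green: took effect immediately
    review_queue: list = field(default_factory=list)  # yellow: awaiting batch review
    escalations: list = field(default_factory=list)   # red: awaiting explicit approval

    def submit(self, action: Action) -> str:
        """Route an action by tier and tell the agent what happened to it."""
        tier = TIER_POLICY.get(action.name, Tier.RED)
        if tier is Tier.GREEN:
            self.executed.append(action)
            return "committed"
        if tier is Tier.YELLOW:
            self.review_queue.append(action)
            return "staged"
        self.escalations.append(action)
        return "pending approval"


mw = Middleware()
statuses = [
    mw.submit(Action("read_email", {"id": "123"})),
    mw.submit(Action("send_email", {"to": "candidate@example.com"})),
    mw.submit(Action("delete_file", {"path": "/shared/candidate.pdf"})),
    mw.submit(Action("wire_money", {"amount": 100})),  # not in the table
]
print(statuses)
```

The returned status string is what lets the agent keep working: it learns that the email is staged rather than sent, and can plan around that.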
An important point: this doesn’t reduce the agent’s autonomy. The agent is still doing the same amount of thinking, planning, and executing. It drafts 10 emails, schedules 3 meetings, updates 5 records – the same work it would do without the transactional layer. The difference is that its effects live in draft land until a human commits them. This is the same relationship a developer has with a Git branch: full autonomy to write whatever code they want, with a PR review gate before it goes live. Nobody thinks of branches as “less autonomous coding.” They’re how you get the benefits of autonomy without the risk of shipping broken code to production.
What this looks like in practice
Say you have an operations agent that’s helping with recruiting. Over the course of a work session, it does the following:
Reads 15 new emails from candidates (green, auto-committed)
Drafts responses to 8 of them (yellow, goes to the review queue as actual Gmail drafts)
Looks up interviewer availability on Google Calendar (green)
Creates 3 calendar events for interviews (yellow, created as tentative events pending review)
Updates the status of 5 candidates in the ATS (yellow, staged as pending changes)
Sends a Slack message to the hiring manager summarizing today’s pipeline (yellow, held in buffer)
Notices a candidate who withdrew and goes to delete their file from the shared drive (red, blocked pending approval)
The human reviewer gets a notification: “Your agent has 17 actions pending review.” They open a dashboard, see the batched drafts, tentative calendar events, staged ATS updates, and the held Slack message. They can approve all, reject some, or edit individual actions before they go through. The delete action is flagged separately as requiring explicit approval.
Total review time: maybe 3-5 minutes to skim through a batch that represents an hour of agent work. That’s a very different ratio from either (a) reviewing every action in real time, which is prohibitively slow, or (b) the current status quo of not reviewing anything and hoping it works out.
What this looks like for a coding agent
Coding agents are a simpler case because the infrastructure already supports this pattern pretty well. Git is already a transactional system with rollback semantics. A coding agent working in a branch, with a PR review gate before merge, is already operating in the yellow tier by default. The interesting extensions are:
The agent’s network requests could be buffered. If it tries to curl an external API or download a package from a URL you don’t recognize, that goes to the red tier.
The agent’s shell commands could be classified by reversibility. mkdir is green. rm -rf is red. pip install is yellow (you can uninstall, but there might be side effects).
File writes within the sandbox are green. File writes outside the sandbox are red. This is basically what Docker already gives you, but made explicit as a policy rather than a side effect of containerization.
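The shell-command classification above could be sketched roughly like this in Python. The `COMMAND_TIERS` table and `classify` function are hypothetical, and the approach is deliberately crude: it looks only at the program name and escalates anything containing shell metacharacters, because chaining and substitution can hide one command inside another.

```python
import shlex

# Hypothetical defaults for illustration; anything unlisted is treated as red.
COMMAND_TIERS = {
    "mkdir": "green",
    "ls": "green",
    "cat": "green",
    "pip": "yellow",  # can be uninstalled, but may have side effects
    "rm": "red",
    "curl": "red",    # network access to an arbitrary host
}

# Characters that let one command line carry another (pipes, chaining, substitution).
SHELL_META = set(";&|<>()$`\n")


def classify(command_line: str) -> str:
    if any(ch in SHELL_META for ch in command_line):
        return "red"  # compound commands are not analysed here, so escalate
    try:
        argv = shlex.split(command_line)
    except ValueError:  # e.g. unbalanced quotes
        return "red"
    if not argv:
        return "green"  # an empty command does nothing
    return COMMAND_TIERS.get(argv[0], "red")


for cmd in ["mkdir build", "pip install requests", "rm -rf build", "ls && rm -rf /"]:
    print(f"{cmd!r}: {classify(cmd)}")
```

A production classifier would need to look at arguments and paths as well (an rm inside the sandbox is not the same as an rm outside it), which is where the file-write rule above comes in.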
What this looks like for a more autonomous agent
The harder and more important case is an agent that’s operating with broader permissions over a longer time horizon. Think of an agent that’s been given access to your organization’s internal systems and asked to do something like “research the competitive landscape and prepare a briefing document.” Over the course of several hours, this agent might:
Search the web (green, but the search queries themselves could leak intent)
Read internal documents on Google Drive (green)
Read Slack message history for context (green, but privacy-sensitive)
Draft the briefing document (green, it’s a local file)
Email the document to three people (yellow)
Share the Google Doc with a wider distribution list (red – this is modifying access permissions)
Post a summary in a public Slack channel (yellow, but borderline red if the channel includes external guests)
Without the transactional layer, this agent could accidentally share a confidential document with the wrong people, or post internal strategy discussions in a channel that includes contractors or partners. With the transactional layer, the risky actions get caught in the review queue before they execute.
Costs and benefits
I want to be straightforward about the costs here, because “add a review step” is the kind of proposal that sounds free but isn’t.
The costs
Human review time. This is the big one. Someone has to look at the batched actions and approve them. If an agent generates 50 actions per hour and 20 of them are in the yellow tier, that’s 20 actions a human needs to at least glance at. In practice, most of these will be obviously fine – the email draft says what you’d expect it to say, the calendar event is at the right time, the Slack message is appropriate. So the review can be fast, more like skimming than careful reading. But it’s not zero. My rough estimate is that reviewing a batch of 20 routine agent actions takes 2-5 minutes if the review UI is well-designed (showing diffs, highlighting what changed, letting you approve-all with one click). That’s maybe a 5-10% overhead on the human’s time relative to the time the agent saved. Whether that’s worth it depends on the stakes.
For high-stakes deployments (agents with access to sensitive data, agents operating in regulated environments, agents interacting with external parties), the review overhead is clearly worth it. For low-stakes deployments (a coding agent working in a sandboxed branch), it might not be, and you’d want most actions in the green tier.
Agent latency. The agent has to wait for review before its actions take effect. For the yellow tier, this means the agent might draft an email at 2pm but it doesn’t actually get sent until the reviewer approves it at 3pm. For many tasks this is fine – the one-hour delay doesn’t matter. For time-sensitive tasks (responding to an urgent email, handling an incident), it might be unacceptable. You’d need a way to mark certain tasks as time-sensitive and either fast-track the review or temporarily promote the agent to a more permissive tier.
Implementation complexity. Building the middleware layer for each API integration is real work. Each service has different affordances: Gmail has a drafts API, Slack has scheduled messages, Google Calendar has tentative events, but not every service has a natural “staging” equivalent. For services that don’t, you’d need to build your own buffer (hold the action in a local queue and execute it on approval). This is a lot of integration code. I’ll come back to this.
Agent confusion. If the agent thinks it sent an email but the email is actually sitting in a review queue, the agent’s model of the world is wrong. It might send a follow-up that references the first email, which the recipient hasn’t received yet. Managing this requires either (a) telling the agent that actions are pending and letting it reason about that, or (b) letting the agent operate in a “virtual” version of the world where it thinks its actions have taken effect, and reconciling later. Both approaches have failure modes. The first requires the agent to handle a more complex state space. The second can lead to cascading inconsistencies if a reviewed action gets rejected.
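One way to contain the cascading-inconsistency problem is to record which staged actions depend on which, so that rejecting one action also rejects everything built on top of it. The sketch below assumes the agent (or the middleware) can declare those dependencies; the `Staged` class and `reject` function are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Staged:
    id: str
    description: str
    depends_on: list = field(default_factory=list)
    status: str = "pending"


def reject(actions: dict, action_id: str) -> list:
    """Reject one staged action and, transitively, everything that depends on it."""
    rejected = []
    stack = [action_id]
    while stack:
        current = stack.pop()
        if actions[current].status == "rejected":
            continue  # already handled via another dependency path
        actions[current].status = "rejected"
        rejected.append(current)
        stack.extend(a.id for a in actions.values() if current in a.depends_on)
    return rejected


actions = {a.id: a for a in [
    Staged("e1", "email: interview invitation"),
    Staged("e2", "email: follow-up referencing e1", depends_on=["e1"]),
    Staged("c1", "calendar: interview slot", depends_on=["e1"]),
    Staged("s1", "slack: unrelated pipeline summary"),
]}

rejected = reject(actions, "e1")
print(sorted(rejected), actions["s1"].status)
```

Rejecting the invitation takes the follow-up email and the calendar slot with it, while the unrelated Slack summary stays pending. This only works as well as the dependency declarations; an undeclared dependency still produces the inconsistency described above.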
The benefits
Reduced blast radius of mistakes. The most common agent failure mode is “did the wrong thing.” Sent the wrong email, edited the wrong document, posted in the wrong channel, used the wrong data. With the review layer, these mistakes get caught before they take effect. The blast radius of an agent mistake shrinks from “whatever damage the action caused in the real world” to “the time it took to review and reject the bad action.”
Reduced blast radius of adversarial behavior. In the control framework, you’re trying to catch an adversarial agent before it does something harmful. The monitor has to be fast and accurate, because once the action executes, the damage is done. With a transactional layer, the monitor gets a second chance: even if real-time monitoring misses something, the review step can catch it. This relaxes the requirements on the monitor, which is a big deal because building reliable real-time monitors for arbitrary agent actions is genuinely hard.
Graceful degradation of trust. Right now, agent permissions are basically binary: either the agent can do a thing or it can’t. The transactional layer gives you a continuum. You can start a new agent in a maximally cautious mode (everything in the yellow or red tier), watch its behavior for a while, and gradually promote actions to the green tier as you build confidence. This is how you’d actually want to deploy agents in a high-stakes environment: start conservative, widen permissions as trust is established, and narrow them again if something goes wrong.
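As a rough illustration of widening and narrowing permissions, here is a Python sketch in which an action type is promoted from yellow to green after a streak of approved reviews and demoted on any rejection. The `TrustPolicy` class and the `PROMOTE_AFTER` threshold are assumptions made for the example, not a recommended policy. A streak counter like this speaks only to the mundane-failure case; it says nothing about an agent that behaves well strategically.

```python
from collections import defaultdict

PROMOTE_AFTER = 3  # hypothetical threshold, kept small for the example


class TrustPolicy:
    def __init__(self):
        self.tier = defaultdict(lambda: "yellow")  # new action types start cautious
        self.streak = defaultdict(int)             # consecutive approvals per action type

    def record_review(self, action_name: str, approved: bool) -> str:
        """Update the tier for an action type after one human review."""
        if approved:
            self.streak[action_name] += 1
            if self.streak[action_name] >= PROMOTE_AFTER:
                self.tier[action_name] = "green"
        else:
            # Any rejection narrows permissions again and resets the streak.
            self.streak[action_name] = 0
            self.tier[action_name] = "yellow"
        return self.tier[action_name]


policy = TrustPolicy()
history = [
    policy.record_review("create_calendar_event", ok)
    for ok in (True, True, True, False, True)
]
print(history)
```

Once an action type is green it no longer passes through routine review, so the rejections that demote it would have to come from spot checks or after-the-fact audits.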
Audit trail for free. Because every action passes through the middleware, you get a complete, structured log of everything the agent did (and tried to do). This is useful for debugging, for compliance, and for training better agents. Currently, getting a good audit trail of agent behavior requires instrumenting the agent itself, which means different agents produce different logs in different formats. The middleware gives you a uniform action log regardless of which agent is behind it.
Composability with monitoring. This is the theoretical point I’ll develop below, but the practical version is simple: if you’re running a monitoring system that flags suspicious actions, it’s much more useful if flagged actions can be held rather than just logged. Currently, a monitor that fires after the action has already executed is basically just an alert – useful for post-hoc analysis, but it doesn’t prevent harm. A monitor integrated with a transactional layer can actually prevent flagged actions from taking effect. The monitor becomes a reviewer rather than just a logger.
The costs of the status quo
It’s also worth noting that the current approach has costs that people don’t always account for. The most obvious one: people restrict agent capabilities below what’s useful because they don’t trust the agent to act without oversight. If you gave an agent read-write access to your email and calendar, it could save you an hour a day on scheduling and correspondence. But most people won’t do that, because the downside of the agent sending a bad email on your behalf is too high. So instead they give the agent read-only access, or no access at all, and the agent is less useful than it could be.
The transactional layer changes this calculus. If you know that every email the agent drafts will sit in a review queue before being sent, you’re much more willing to give it email access. If you know that every document share will require your explicit approval, you’re more willing to let the agent interact with Google Drive. The review overhead is the price you pay for being able to deploy the agent with broader permissions than you otherwise would. In many cases, the net effect is positive: the agent with broad permissions plus review is more useful than the agent with narrow permissions and no review.
What hasn’t been built yet
Despite the existing work, several things are missing:
A reversibility taxonomy for common agent actions
People keep saying “irreversible actions need human approval,” but nobody has built the actual taxonomy. For each of the top 50-100 API actions that AI agents commonly perform, what is the reversibility profile? Some dimensions that matter:
Time window: Can you undo it? For how long? Gmail’s “undo send” gives you at most 30 seconds (the window is configurable, and 30 seconds is the maximum). Google Calendar events can be restored from trash for 30 days. Git commits are reversible until they’re pushed to a shared remote. Some actions have no reversal window at all.
Information leakage: Even if you reverse the action, does someone see it in the interim? A Slack message can be deleted, but anyone online may have read it. A calendar invite can be cancelled, but the recipient already saw the title and description.
Cascading effects: Does the action trigger downstream effects that are themselves harder to reverse? Sending a webhook, triggering a CI pipeline, causing a push notification.
Reversibility cost: How expensive is the reversal in terms of user trust, system state, or coordination overhead? Withdrawing a job offer is “reversible” in a narrow technical sense, but enormously costly in practice. Deleting and re-posting a Slack message leaves technically the same content in the channel, but people noticed the deletion.
This taxonomy would be immediately useful for anyone building agent scaffolding. It would let you define principled default policies: auto-commit actions in the “fully reversible, no leakage” tier, buffer-and-review actions in the “reversible with leakage” tier, require explicit approval for the “irreversible or high-cost” tier.
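The four dimensions above map naturally onto a small record type, with the default policy as a function over it. This is a sketch only: the `ReversibilityProfile` fields, the action names, and the example values are illustrative guesses rather than measured properties of any service, and treating cascading effects as red is an assumption.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReversibilityProfile:
    action: str
    undo_window_seconds: Optional[float]  # None means no reversal window at all
    leaks_information: bool               # does someone see it before it is undone?
    cascades: bool                        # does it trigger downstream effects?
    high_reversal_cost: bool              # is undoing it costly in trust or coordination?


def default_tier(p: ReversibilityProfile) -> str:
    if p.undo_window_seconds is None or p.cascades or p.high_reversal_cost:
        return "red"     # irreversible or high-cost: explicit approval
    if p.leaks_information:
        return "yellow"  # reversible with leakage: buffer and review
    return "green"       # fully reversible, no leakage: auto-commit


# Illustrative entries; the values are assumptions, not audited facts.
profiles = [
    ReversibilityProfile("workspace.create_file", math.inf, False, False, False),
    ReversibilityProfile("slack.post_message", math.inf, True, False, False),
    ReversibilityProfile("calendar.delete_event", 30 * 24 * 3600, True, False, False),
    ReversibilityProfile("webhook.send", None, True, True, False),
    ReversibilityProfile("offer.withdraw", math.inf, True, False, True),
]

tiers = {p.action: default_tier(p) for p in profiles}
for action, tier in tiers.items():
    print(f"{action}: {tier}")
```

Writing the taxonomy as data rather than code means the same table can drive the middleware, the review UI, and any audit of which actions are currently auto-committed.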
A general-purpose transactional middleware for agents
STRATUS works for cloud SRE. SagaLLM works for multi-agent planning. But there is no general “reversibility layer” that an arbitrary AI agent can operate through when it interacts with heterogeneous services – email, calendar, CRM, filesystem, messaging, code hosting. Each service has different rollback affordances, and nobody has built an abstraction that maps these into a uniform transactional interface.
I want to be honest about this: actually building this for real APIs, across dozens of services with wildly different rollback affordances, is a lot of software. Each integration needs its own mapping from “agent intent” to “reversible staging action,” and each one has edge cases. Gmail has a proper drafts API, which makes it relatively easy to intercept “send email” and turn it into “create draft.” Google Calendar supports tentative events. Slack has scheduled messages. But not every service has a natural staging equivalent. For a CRM like Salesforce, there might not be a “tentative record update” concept, so you’d need to build your own shadow state and replay mechanism. For an ATS like Lever or Greenhouse, you’d need to maintain a diff of proposed changes and apply them on commit. Each of these is a meaningful engineering project.
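For services with no native staging concept, the “maintain a diff of proposed changes and apply them on commit” approach might look roughly like the following Python sketch. `StagedRecordStore` is hypothetical and stands in for a wrapper around a real CRM or ATS client; here the live system is just a dictionary.

```python
class StagedRecordStore:
    """Holds writes as per-record diffs until a reviewer commits them."""

    def __init__(self, live: dict):
        self.live = live   # stand-in for the real system of record
        self.pending = {}  # record_id -> {field: proposed_value}

    def update(self, record_id: str, **changes) -> None:
        """Stage a change instead of writing it through."""
        self.pending.setdefault(record_id, {}).update(changes)

    def read(self, record_id: str) -> dict:
        """The agent's view: the live record overlaid with its own staged changes."""
        return {**self.live[record_id], **self.pending.get(record_id, {})}

    def diff(self) -> dict:
        """What the reviewer sees: (current, proposed) for every staged field."""
        return {
            rid: {f: (self.live[rid].get(f), new) for f, new in changes.items()}
            for rid, changes in self.pending.items()
        }

    def commit(self, record_id: str) -> None:
        self.live[record_id].update(self.pending.pop(record_id))

    def discard(self, record_id: str) -> None:
        self.pending.pop(record_id, None)


live = {"cand-1": {"status": "applied"}, "cand-2": {"status": "applied"}}
store = StagedRecordStore(live)
store.update("cand-1", status="interview")
store.update("cand-2", status="rejected")

agent_view = store.read("cand-1")
review_diff = store.diff()

store.commit("cand-1")   # reviewer approves
store.discard("cand-2")  # reviewer rejects
print(live)
```

The sketch ignores the hard part of a real integration: the live record can change between staging and commit, so a real wrapper would need conflict detection before applying the diff.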
That said, this is a moment when writing a lot of novel software is cheaper than it has ever been. LLM-assisted coding has compressed the timeline for this kind of integration work from years to months. And the ambition level here is much lower than some proposals floating around in adjacent communities – “rewrite all critical infrastructure in formally verified languages,” “rewrite the Linux kernel in Rust for memory safety,” etc. This is not asking anyone to rearchitect email or redesign the filesystem. It’s asking for a shim layer that interposes on existing APIs and translates “execute” into “stage for review.” Tedious, but tractable. And each individual integration is independently useful even if you don’t build all of them.
There’s also a reason to think the hardest part of this is already being done – just for a different purpose. The RL environments industry has exploded over the past year, with AI labs paying hundreds of thousands of dollars for faithful clones of enterprise software. Startups like Habitat, DeepTune, Fleet, Turing, and many others have built interactive replicas of Slack, Gmail, Salesforce, Jira, AWS terminals, and more – complete with realistic state management, API fidelity, and reset-to-checkpoint logic. Salesforce itself built eVerse, a simulation environment for training and testing agents against realistic enterprise workflows. These clones are built for training (run parallel episodes, assign rewards, reset state), not for deployment safety. But the core engineering challenge – faithfully replicating the behavior of complex SaaS APIs with controllable state – is the same. The delta between “Salesforce clone for RL training with reset semantics” and “Salesforce wrapper for safe deployment with staging/commit semantics” might be surprisingly small.
It’s worth noting that some of this infrastructure is already being built by MCP (Model Context Protocol) server developers, who have to define tool schemas for each service anyway. Adding a “staging” mode to an MCP server is a much smaller lift than building the MCP server from scratch. This is an area where a well-designed standard could save a lot of duplicated effort.
Some commercial efforts are already moving in this direction. Rubrik launched Agent Rewind in 2025 (built on their acquisition of Predibase), which provides selective rollback of AI agent actions with an immutable audit trail – you can trace an agent’s action back to its originating prompt and revert specific changes without undoing unrelated work. Composio has built Saga-pattern orchestration into their agent tooling platform, with compensating transactions and circuit breakers across hundreds of tool integrations. These are enterprise reliability products rather than AI safety research, but they’re building exactly the infrastructure that a safety-motivated version of this idea would need. The gap is in connecting this infrastructure to the adversarial threat models that the AI control community cares about.
The connection to programming language theory
This is the part I find most intellectually interesting, and the part that is least explored in the AI safety context. As far as I can tell, nobody has explicitly proposed effect systems or monadic architectures as a safety mechanism for AI agents, despite the fit being almost suspiciously natural.
In functional programming, there is a well-developed tradition of separating the description of side effects from their execution. The canonical version of this is the free monad (or, in more modern formulations, algebraic effects and handlers). The idea: instead of a program directly performing effects (writing to disk, sending a network request), the program produces a data structure describing the effects it wants to perform. A separate handler or interpreter then decides how to actually execute them.
Here is the key structural analogy:
| FP concept | Agent safety concept |
| --- | --- |
| Pure computation | Agent reasoning / planning |
| Effect description (free monad) | Proposed action (buffered, not yet executed) |
| Effect handler / interpreter | Review layer (human, automated, or policy-based) |
| Different handlers for same effects | Same agent, different deployment contexts (sandbox vs. production) |
In Haskell, this separation is enforced by the type system via the IO monad. In languages with algebraic effects (OCaml 5, Koka, research languages like Eff), the separation is even more explicit: you perform an effect, execution suspends, and a handler receives a continuation representing “everything the program was going to do after this effect.” The handler can then execute the effect, modify it, defer it, or discard it entirely.
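Python has no algebraic effects, but generators give a rough approximation of the perform/handle/resume shape: a generator suspended at a yield acts as a one-shot continuation that the handler resumes with send. The sketch below is an analogy, not a faithful encoding of OCaml 5 or Koka handlers, and the `SendEmail`, `ReadInbox`, `run`, and `staging_handler` names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReadInbox:
    pass


@dataclass
class SendEmail:
    to: str
    body: str


def agent():
    # Each `yield` performs an effect and suspends until the handler resumes us.
    senders = yield ReadInbox()
    for sender in senders:
        status = yield SendEmail(to=sender, body="Thanks, we'll be in touch.")
        print(f"agent sees: email to {sender} is {status}")
    return len(senders)


def run(program, handler):
    """Drive the generator, letting `handler` decide what each effect does."""
    gen = program()
    try:
        effect = next(gen)
        while True:
            effect = gen.send(handler(effect))
    except StopIteration as stop:
        return stop.value


outbox = []


def staging_handler(effect):
    if isinstance(effect, ReadInbox):
        return ["a@example.com", "b@example.com"]  # canned data for the example
    if isinstance(effect, SendEmail):
        outbox.append(effect)  # defer: stage the email instead of sending it
        return "staged"
    raise ValueError(f"unhandled effect: {effect!r}")


result = run(agent, staging_handler)
print(result, len(outbox))
```

The agent code never changes; swapping `staging_handler` for one that actually sends, or one that discards everything, changes what the same program does to the world.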
The Jane Street blog post on algebraic effects in OCaml 5 gives a clean illustration of this pattern in production code. As they put it, effects let you separate the “business logic” (what the agent wants to do) from the “scheduler” (how and whether to actually do it). This is exactly the architecture you want for safe agent execution.
An effect-typed agent runtime would look something like this:
The agent produces a sequence of effect descriptions: SendEmail(to, subject, body), CreateFile(path, content), PostMessage(channel, text).
These are captured as a data structure, not executed.
An interpreter inspects the batch, applies policy (auto-commit cheaply reversible actions, buffer reversible-but-leaky ones for review, escalate irreversible ones, reject disallowed ones), and executes accordingly.
Different interpreters can be swapped in for different contexts: a “dry run” interpreter that logs but doesn’t execute, a “sandbox” interpreter that executes against a staging environment, a “production” interpreter that executes for real with appropriate review gates.
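The steps above can be sketched in Python with plain data classes for the effect descriptions and ordinary functions for the interpreters. The effect names come from the list above; the interpreters `dry_run` and `policy_interpreter`, and the choice of which effects auto-commit, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SendEmail:
    to: str
    subject: str
    body: str


@dataclass(frozen=True)
class CreateFile:
    path: str
    content: str


@dataclass(frozen=True)
class PostMessage:
    channel: str
    text: str


# The agent's output is just data: nothing here has touched the outside world.
batch = [
    CreateFile("notes/briefing.md", "# Competitive landscape"),
    SendEmail("lead@example.com", "Briefing", "Draft attached."),
    PostMessage("#strategy", "Briefing draft is ready."),
]


def dry_run(effects):
    """Interpreter 1: describe what would happen, execute nothing."""
    return [f"would run {type(e).__name__}" for e in effects]


def policy_interpreter(effects, auto_commit=(CreateFile,)):
    """Interpreter 2: split the batch into auto-committed and held-for-review effects."""
    committed, held = [], []
    for e in effects:
        (committed if isinstance(e, auto_commit) else held).append(e)
    return committed, held


print(dry_run(batch))
committed, held = policy_interpreter(batch)
print(len(committed), "committed;", len(held), "held for review")
```

Because the batch is an ordinary list, it can be logged, diffed, shown in a review UI, or replayed against a staging environment without involving the agent again.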
This isn’t hypothetical. It’s how some deployment systems already work. GitOps declares a desired state and a separate reconciliation loop makes reality match. Neon provides database branching, so you can fork a database, let an agent operate on the fork, and merge only if the results look good. Vercel preview deployments let you see the effects of code changes before they go live. The agent safety version would generalize these patterns across arbitrary APIs.
How reversibility composes with adversarial monitoring
The AI control literature – particularly Redwood Research’s work (Greenblatt et al., 2024; see also Greenblatt’s overview of control measures) – takes as its starting assumption that the AI agent may be intentionally trying to cause harm while evading your safety measures. This is the “scheming” threat model: the agent is strategically choosing when to act, optimizing for actions that look innocuous to monitors but accumulate toward some catastrophic outcome. In this framing, the question isn’t whether the agent makes mistakes – it’s whether your safety infrastructure holds up against an adversary operating from the inside.
Reversibility fits into this picture, but it’s not a silver bullet, and I want to be precise about what it buys you.
Greenblatt’s overview already includes “scaffolding that supports resampling/defer/block” as core control infrastructure, and notes that for diffuse threats (like research sabotage), asynchronous review after action execution may be fine because “reverting the problematic action may be basically fine.” So the control community is already thinking in this direction for a subset of threats. What’s less explored is systematically engineering the deployment environment to maximize the fraction of actions that fall into the “revertable” category, and formally analyzing how this interacts with monitoring.
The interesting formal question is: given a control protocol that catches bad actions with probability p, and a deployment environment where fraction r of actions are reversible, what is the overall safety guarantee? Naively you might expect the risk to scale as (1-p)(1-r) – you’re only in trouble if monitoring misses the action and the action is irreversible. But this isn’t quite right, because the cases where monitoring is hardest (subtle, adversarial actions) may correlate with the cases where reversibility helps least (information exfiltration, side-channel leakage). A scheming agent isn’t going to waste its one shot on a reversible action if it can identify which actions are in the red tier.
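A toy calculation shows how large the gap can be. The sketch below assumes a simple two-class model that is my own simplification: the monitor catches bad reversible actions with probability p_rev and bad irreversible actions with a lower probability p_irr, and any reversible action is eventually undone. All the numbers are illustrative, not measurements.

```python
def naive_risk(p: float, r: float) -> float:
    """Risk if detection and reversibility were independent: (1-p)(1-r)."""
    return (1 - p) * (1 - r)


def random_action_risk(r: float, p_irr: float) -> float:
    """Risk for a bad action drawn at random when detection differs by class.

    Reversible actions are assumed to be undone eventually, so only an
    irreversible action that slips past the monitor causes harm.
    """
    return (1 - r) * (1 - p_irr)


def strategic_risk(p_irr: float) -> float:
    """Risk when the adversary always picks an irreversible action."""
    return 1 - p_irr


# Illustrative numbers only.
r = 0.8       # fraction of actions that are reversible
p_rev = 0.95  # monitor catch rate on reversible actions
p_irr = 0.50  # monitor catch rate on irreversible actions (the harder cases)

# Overall catch rate across all actions, which is what the naive formula uses.
p_avg = r * p_rev + (1 - r) * p_irr

print(f"naive (1-p)(1-r):       {naive_risk(p_avg, r):.3f}")
print(f"correlated, random act: {random_action_risk(r, p_irr):.3f}")
print(f"correlated, strategic:  {strategic_risk(p_irr):.3f}")
```

With these inputs the naive estimate is roughly 0.03, the correlated estimate for a randomly chosen bad action is 0.10, and the estimate for an adversary who only attacks through irreversible channels is 0.50. In this toy model r drops out of the strategic case entirely, which is the objection in its starkest form.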
This is the strongest version of the objection: reversibility might just shift the adversarial agent’s strategy without actually reducing risk. The agent learns that emails are buffered, so it doesn’t try to exfiltrate via email – it finds the irreversible channel you forgot to lock down. The counter-argument is that shrinking the irreversible action surface still helps, because it constrains the adversary’s options and makes monitoring easier on the remaining attack surface. But this needs to be demonstrated, not assumed.
Working out this composition formally, and testing it empirically using existing control evaluation benchmarks, would be a genuine contribution. It would tell you something concrete: is it more cost-effective to invest in better monitoring, or in making more actions reversible, or in some combination?
Where reversibility doesn’t help
It’s worth being honest about the limits.
Information leakage is irreversible. If an agent reads a secret and encodes it into a seemingly innocuous action – a carefully chosen meeting title, a particular file name, a specific wording in a Slack message – you cannot un-leak that information by rolling back the action. The recipient saw the message, even if it’s later deleted. The downstream system processed the webhook, even if the source event is reversed. This is the fundamental boundary: reversibility helps with actions but not with information flow. For threat models centered on exfiltration (e.g., an agent trying to leak proprietary training data), you need access controls and monitoring, not reversibility.
Cascading effects across trust boundaries. You can make actions reversible within systems you control. But agents need to interact with systems you don’t control. If an agent sends an API request to an external service, and that service triggers its own downstream effects, you can’t roll those back. The further the causal chain extends beyond your trust boundary, the less reversibility buys you. This is why “send email to external contact” should almost always be in the red tier: once the email leaves your mail server, it’s out of your hands.
Adversarial exploitation of the review window. A sufficiently sophisticated adversarial agent might learn to exploit the buffer/review architecture itself. The most obvious attack: submit many innocuous actions to build a pattern of being approved, then slip a harmful action into a large batch that a reviewer is likely to rubber-stamp. A more subtle version: craft actions that look harmless individually but are harmful in combination (e.g., three separately-innocuous Slack messages that, together, reveal confidential information). The review process itself becomes an attack surface, and the reviewer (human or automated) needs to be aware of this. This is essentially the “trusted monitoring” problem from the control literature, replicated in the review layer.
The agent confusion problem. As described in the costs section: if the agent thinks it sent an email but the email is sitting in a review queue, the agent is operating with a false model of the world. This can cascade. The agent sends a follow-up email referencing the first one; the reviewer approves both, but the recipient gets the follow-up before the original. For simple action sequences this is manageable, but for complex multi-step workflows where later actions depend on the effects of earlier ones, the potential for confusion grows. You’d need the agent to be explicitly aware that its actions are pending, which adds complexity to the agent’s task.
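One way to reduce this confusion is to have every buffered call return a receipt that says "pending" rather than "sent", and to let the queue refuse to release an action before the actions it depends on. The sketch below is a hypothetical design: Receipt, ReviewQueue, and the depends_on field are names I made up for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending_review"
    COMMITTED = "committed"
    REJECTED = "rejected"


@dataclass
class Receipt:
    """What the agent gets back instead of a 'sent' confirmation."""
    action_id: int
    description: str
    depends_on: list[int] = field(default_factory=list)
    status: Status = Status.PENDING


class ReviewQueue:
    def __init__(self) -> None:
        self._receipts: dict[int, Receipt] = {}
        self.delivered: list[int] = []  # order in which actions took effect

    def submit(self, description: str, depends_on: tuple[int, ...] = ()) -> Receipt:
        receipt = Receipt(len(self._receipts) + 1, description, list(depends_on))
        self._receipts[receipt.action_id] = receipt
        return receipt

    def approve(self, action_id: int) -> bool:
        """Commit an action only if everything it depends on is committed."""
        receipt = self._receipts[action_id]
        blocked = any(
            self._receipts[dep].status is not Status.COMMITTED
            for dep in receipt.depends_on
        )
        if blocked:
            return False
        receipt.status = Status.COMMITTED
        self.delivered.append(action_id)
        return True


queue = ReviewQueue()
first = queue.submit("email: project kickoff")
follow_up = queue.submit("email: re: project kickoff", depends_on=(first.action_id,))

# The reviewer cannot release the follow-up ahead of the original.
released_early = queue.approve(follow_up.action_id)
queue.approve(first.action_id)
queue.approve(follow_up.action_id)
```

The agent sees Status.PENDING on both receipts and can plan around it, and the dependency check keeps the follow-up from arriving first. This does not remove the added complexity for the agent; it only makes the pending state explicit instead of hidden.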
These limits are real, but they don’t make the approach useless – for the same reason that the existence of unsafe blocks doesn’t make Rust’s memory safety useless. The question is whether you can cover enough of the action space to meaningfully change the risk profile. My intuition is yes, for the simple reason that most agent failures today are in the “did the wrong thing by mistake” category, not the “adversarially exfiltrated information via a side channel” category. Reversibility is well-suited to the former, even though it doesn’t help with the latter.
A concrete research agenda
If I were trying to turn this into a research program, here’s what I’d do:
Phase 1: Taxonomy and measurement. Take an existing agent benchmark (AppWorld, WebArena, or the action traces from Anthropic’s autonomy study) and classify every action by its reversibility profile. Produce a dataset mapping (API, action) pairs to (reversibility tier, time window, information leakage, cascading risk). Quantify what fraction of typical agent workflows are fully reversible, partially reversible, and irreversible. This would answer a basic empirical question that nobody has answered: what percentage of the actions agents currently take could be made transactional without any new infrastructure, versus requiring new middleware, versus being genuinely irreversible?
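As a rough picture of what the Phase 1 dataset could look like, here is a sketch of one possible record type and the fraction calculation. The field names follow the tuple described above; the specific entries, tier labels, and time windows are placeholder guesses to show the shape of the data, not real classifications of these APIs.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    REVERSIBLE = "fully reversible"
    PARTIAL = "partially reversible"
    IRREVERSIBLE = "irreversible"


@dataclass(frozen=True)
class ActionProfile:
    api: str
    action: str
    tier: Tier
    undo_window_s: Optional[int]  # None means no time limit, or not applicable
    leaks_information: bool
    cascades_externally: bool


# Placeholder entries; a real taxonomy would be labeled from benchmark traces.
TAXONOMY = {
    (p.api, p.action): p
    for p in [
        ActionProfile("gmail", "create_draft", Tier.REVERSIBLE, None, False, False),
        ActionProfile("slack", "post_message", Tier.PARTIAL, 60, True, False),
        ActionProfile("gmail", "send_external", Tier.IRREVERSIBLE, None, True, True),
    ]
}


def tier_fractions(trace: list[tuple[str, str]]) -> dict[Tier, float]:
    """Fraction of actions in a trace that fall into each reversibility tier."""
    counts = Counter(TAXONOMY[step].tier for step in trace)
    return {tier: counts[tier] / len(trace) for tier in Tier}


# A made-up four-step agent trace.
trace = [
    ("gmail", "create_draft"),
    ("gmail", "create_draft"),
    ("slack", "post_message"),
    ("gmail", "send_external"),
]
fractions = tier_fractions(trace)
```

Running tier_fractions over every trace in a benchmark would give the headline numbers this phase is after: what share of agent activity is already reversible, what share could be made so, and what share is not.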
Phase 2: Prototype runtime. Build a proof-of-concept transactional agent runtime that wraps 5-10 common APIs (Gmail, Google Calendar, Slack, filesystem, a SQL database) with buffer/commit/rollback semantics. Demonstrate that real agent tasks can be completed through this layer with acceptable latency overhead. Measure the review burden: how many actions per minute does a human reviewer need to process, and what’s the false-positive rate of an automated classifier that auto-commits “safe” actions? This phase is where most of the engineering effort goes, and it’s worth acknowledging that each API integration is a nontrivial project. But each one is independently useful, and the first few (Gmail, Slack, Calendar) would cover a large fraction of current agent activity.
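The filesystem is probably the easiest of these wrappers to prototype. Below is a minimal sketch of buffer/commit/rollback for file deletes using only the Python standard library. StagedDeletes is a hypothetical class name, it handles regular files only, and it ignores concurrency and crash recovery, all of which a real wrapper would need.

```python
import shutil
import tempfile
from pathlib import Path


class StagedDeletes:
    """Deletes become moves into a trash directory until commit or rollback."""

    def __init__(self, trash_dir: Path) -> None:
        self.trash_dir = trash_dir
        self.trash_dir.mkdir(parents=True, exist_ok=True)
        self._staged: list[tuple[Path, Path]] = []  # (original, trashed)

    def delete(self, path: Path) -> None:
        # Prefix with a counter so files with the same name do not collide.
        trashed = self.trash_dir / f"{len(self._staged)}-{path.name}"
        shutil.move(str(path), str(trashed))
        self._staged.append((path, trashed))

    def rollback(self) -> None:
        # Restore in reverse order, as a compensating action for each delete.
        for original, trashed in reversed(self._staged):
            shutil.move(str(trashed), str(original))
        self._staged.clear()

    def commit(self) -> None:
        # Only now are the files actually destroyed.
        for _, trashed in self._staged:
            trashed.unlink()
        self._staged.clear()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    report = root / "report.txt"
    report.write_text("quarterly numbers")

    fs = StagedDeletes(root / ".trash")
    fs.delete(report)                # agent "deletes" the file
    gone_while_staged = not report.exists()
    fs.rollback()                    # reviewer rejects the batch
    restored_text = report.read_text()
```

From the agent's point of view the file disappears immediately, so later steps see a consistent world, but nothing is lost until a reviewer commits. The Gmail, Slack, and Calendar wrappers would follow the same three-method shape with much more work behind each method.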
Phase 3: Composition with control. Take an existing control evaluation setup (e.g., from Greenblatt et al. or METR’s evaluation protocol) and add a reversibility layer. Measure how much the transactional architecture reduces the success rate of a red-team agent that is trying to cause harm. Compare the safety improvement from (a) better monitoring alone, (b) reversibility alone, and (c) the combination. This is the experiment that would tell you whether the theoretical composability actually holds in practice.
Phase 4: Formal analysis. Develop a formal model of safety under composed monitoring-and-reversibility control protocols. Characterize the conditions under which reversibility provides multiplicative risk reduction versus the conditions where it provides no benefit (e.g., pure information exfiltration). Connect this to the existing theory on impact measures from Krakovna et al.
Who should build this?
This is a practical question worth thinking through, because the answer isn’t obvious.
Option 1: A startup. The product shape is clear – transactional MCP servers for the top 10-20 services, a review dashboard, an audit trail. The pitch to enterprise buyers writes itself: “give your agents broader permissions without the risk.” The RL environments industry has proven you can build faithful API replicas and charge real money for them. The timing is good because MCP is becoming the standard and nobody is building safety-first MCP servers yet.
The problem is that this is definitionally a wrapper company. Your entire value sits between the agent and the APIs it uses. If Anthropic adds a staging mode to MCP, or Google adds transactional semantics to Workspace APIs, your layer becomes redundant. The counterargument is that no single platform owner is incentivized to build the cross-platform version – Google won’t build transactional wrappers for Slack and Salesforce – and platform owners are historically slow to build safety features nobody is demanding yet. But “we’ll get acquired or disrupted in 3-5 years” is a real risk.
Rubrik’s Agent Rewind is the closest existing product, but it’s after-the-fact recovery (restore files, databases, configs from backup), not pre-execution staging. It doesn’t cover the communication layer (email, Slack, calendar) and it doesn’t prevent actions from executing. The gap between what Rubrik offers and what you’d actually want for the sabotage threat model is significant.
Option 2: An open-source standard or protocol extension. Something like: “here’s a spec for how MCP servers should expose staging/commit/rollback semantics, and here are reference implementations for the top 5 services.” This is more of a public good than a business. It’s the kind of thing that could come out of a safety org, get funded by a grant, and actually get adopted because it’s not locked behind a vendor. The MCP spec is already open; proposing an extension for transactional semantics would be a natural contribution.
Option 3: A grant-funded research project at a safety org. The research agenda I outlined above – taxonomy, prototype runtime, composition with control evaluations, formal analysis – is a 6-12 month project with clear deliverables. It fits naturally at an organization like METR (formerly ARC Evals) or AISI, or as an independent project funded by Open Philanthropy or SFF. The advantage is that you can do the work without worrying about product-market fit, and the output (open-source tooling, published evaluations) benefits everyone. The disadvantage is that grant-funded projects don’t always produce production-grade software.
Option 4: AI labs build it themselves. Anthropic, OpenAI, and Google all have internal agent deployments that would benefit from this infrastructure. Anthropic in particular has the sabotage threat model front-of-mind (it’s in their RSP). If any lab decides this is important enough to build internally, they’ll have the engineering resources and the motivation to do it right. The risk is that they build it for their own use and don’t open-source it, so the broader ecosystem doesn’t benefit.
My honest read: the first step is probably option 3 (a grant-funded project that builds the taxonomy and a prototype), which then creates the evidence base for either option 2 (an open standard) or option 1 (a startup, if the demand turns out to be there). Option 4 will probably happen eventually regardless, but it’ll happen faster if someone outside the labs demonstrates the idea first.
Why this matters now
The near-term motivation is practical: agents are being deployed with broad permissions and minimal oversight, and the failure mode is embarrassment and lost time. But the medium-term motivation is more serious. As labs begin using AI agents internally for safety-critical work – running evaluations, writing and reviewing code for alignment research, managing infrastructure – the consequences of a bad agent action stop being “sent the wrong email” and start being “subtly corrupted the training run” or “exfiltrated model weights to an unsecured endpoint.” Recent work on sabotage of alignment research makes clear that this isn’t a hypothetical concern; it’s a concrete threat model that labs are already planning around.
The infrastructure that agents operate on was not designed with any of this in mind. The primary safety mechanism is still “prompt the model to be careful and hope for the best,” supplemented by increasingly sophisticated monitoring. But monitoring is a detection mechanism, and detection alone means the damage is done by the time you notice it.
The memory-safety analogy is useful here. For decades, the practical answer to memory safety was “just write careful C.” Then Rust showed that you could get safety by construction – that the compiler could enforce properties that previously relied on programmer discipline. The equivalent move for agent safety would be: instead of relying on the agent to behave well or the monitor to catch it misbehaving, engineer the execution environment so that the worst failures are structurally prevented.
We’re not going to get all the way to “everything is reversible.” Email is never going to support transactional rollback at the protocol level. But we can get a lot further than where we are now, which is “nothing is reversible and we just hope the agent doesn’t mess up.” The tools to build this – transactional execution, effect systems, capability-based security, deferred/buffered I/O – already exist in the programming languages and distributed systems literature. They just haven’t been assembled for this purpose yet.
References and further reading:
Chen et al. (2025). “STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds.” arXiv:2506.02009. NeurIPS 2025.
Chang & Geng (2025). “SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning.” arXiv:2503.11951. VLDB 2025.
Leike et al. (2017). “AI Safety Gridworlds.” arXiv:1711.09883. DeepMind.
Krakovna et al. (2019). “Penalizing side effects using stepwise relative reachability.” arXiv:1806.01186. DeepMind.
Greenblatt et al. (2024). “AI Control: Improving Safety Despite Intentional Subversion.” arXiv:2312.06942.
Anthropic (2025). “Measuring AI Agent Autonomy in Practice.” anthropic.com/research/measuring-agent-autonomy.
Plotkin & Pretnar (2009). “Handlers of Algebraic Effects.” ESOP 2009. (The foundational paper on algebraic effect handlers.)
CSET (2025). “AI Control: How to Make Use of Misbehaving AI Agents.” cset.georgetown.edu.
Timothy Lee (2025). “Keeping AI Agents Under Control Doesn’t Seem Very Hard.” understandingai.org.

