Event sourcing for AI agents
Most teams shipping LLM agents to production hit the same wall around month two: the agent does something nobody can explain, the user files a complaint, and the only artefact anyone has is a tail of structured logs that says what the agent did but nothing about why it did it.
A widely-shared post-mortem of an agentic incident put the failure mode plainly: an agent deleted an entire production database, and the system it ran on was never built to remember enough to explain why. The scary word in that sentence is "remember", not "permission" or "guardrail". The system had no event log it could reconstruct decisions from, so the post-mortem had nothing to search through.
The day-to-day version of the same problem is the question every engineer running agents has had at 2am: why did the agent decide to call edit_file instead of read_file at step 23 of 200? What context informed that decision? Where in this two-minute, 200-step trajectory did the agent go off track?
You can't answer those questions from a flat log file. You need the trajectory itself, replayable, with each decision attributed to the inputs that produced it. That's the shape of an event-sourced system, and it's what Nagare is.
What changes when the agent's history is a journal
A journal lets you reconstruct the agent's belief state at any past moment. Martin Fowler describes this as the core move of event sourcing: "you can stop, rewind, and replay just like you can when executing tests in a debugger." For an agent, that means picking the decision a user complained about, rewinding to the moment before, and looking at what the agent thought it knew. Nagare gives you per-stream replay-to-any-version through the same Aggregate.Load path the framework uses internally, so the "what was the state at version N" question is one method call, not a query you have to invent.
It also lets you change the rule and replay against history. Fowler calls this the Retroactive Event pattern: "recreate historic states by replaying… explore alternative histories by injecting hypothetical events when replaying." In agent terms: your eval rubric was wrong, you've fixed it, and you'd like to know which past runs would now fail the new rule. Without an event log you can't answer that. With one, it's a search through the journal.
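Both moves, rewinding to a past version and replaying history under a changed rule, reduce to the same fold over the journal. The sketch below shows the mechanics only and is not Nagare's API: RunEvent, AgentState, Replay.StateAtVersion and Replay.ViolatingVersions are hypothetical names invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical event and state types, for illustration only (not Nagare types).
public abstract record RunEvent;
public sealed record FactLearned(string Key, string Value) : RunEvent;
public sealed record ToolCalled(string ToolName) : RunEvent;

public sealed record AgentState(int Version, IReadOnlyDictionary<string, string> Beliefs, string LastTool)
{
    public static readonly AgentState Empty = new(0, new Dictionary<string, string>(), "");

    // Pure transition: the same events always rebuild the same state.
    public AgentState Apply(RunEvent e) => e switch
    {
        FactLearned f => this with
        {
            Version = Version + 1,
            Beliefs = new Dictionary<string, string>(Beliefs) { [f.Key] = f.Value }
        },
        ToolCalled t => this with { Version = Version + 1, LastTool = t.ToolName },
        _ => this with { Version = Version + 1 }
    };
}

public static class Replay
{
    // State as it was immediately after the first `version` events.
    public static AgentState StateAtVersion(IEnumerable<RunEvent> journal, int version) =>
        journal.Take(version).Aggregate(AgentState.Empty, (state, e) => state.Apply(e));

    // Versions at which a (possibly brand-new) rule is violated, found by replaying history.
    public static List<int> ViolatingVersions(IEnumerable<RunEvent> journal, Func<AgentState, bool> rule)
    {
        var violations = new List<int>();
        var state = AgentState.Empty;
        foreach (var e in journal)
        {
            state = state.Apply(e);
            if (!rule(state)) violations.Add(state.Version);
        }
        return violations;
    }
}
```

A rule written today, such as "never call edit_file before the file contents are known", can be passed to ViolatingVersions as a lambda and checked against runs recorded long before the rule existed. The journal stays untouched; only the interpretation changes.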
Finally, an event-sourced system gives you cause-and-effect across async boundaries. The W3C traceparent header is stamped onto every event Nagare appends, automatically, when an ActivitySource is in scope. Each event carries the trace context of the activity that wrote it, so consumers can stitch back to producers even when the work went through an outbox dispatch that was redelivered hours later on a different machine. Most logging stacks lose this the moment work crosses an async boundary; with an event-sourced log, the journal is the only place that information needs to live.
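To make the stitching concrete, here is a minimal sketch of a consumer reading a stored traceparent value back into a trace ID and span ID. It uses the standard .NET ActivityContext.TryParse; the TraceLink class and the idea that the value arrives as a plain string from event metadata are assumptions for illustration, not Nagare's API.

```csharp
using System;
using System.Diagnostics;

public static class TraceLink
{
    // Parses a W3C traceparent value ("version-traceid-parentid-flags").
    // Returns false, with empty outputs, if the value is malformed.
    public static bool TryReadProducer(string traceparent, out string traceId, out string spanId)
    {
        if (ActivityContext.TryParse(traceparent, null, out ActivityContext ctx))
        {
            traceId = ctx.TraceId.ToHexString();
            spanId = ctx.SpanId.ToHexString();
            return true;
        }
        traceId = spanId = "";
        return false;
    }
}
```

A consumer that wants its own span to appear under the producer's can pass the parsed ActivityContext as the parent when starting its Activity, which is how a dispatch handled hours later still lands in the original trace.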
Why this is sharper than "just turn on tracing"
OpenTelemetry tracing solves the in-flight question: where did request X spend its time, what did it call, did anything fail. It doesn't solve the retrospective question: of the 50,000 decisions the agent made last week, which ones now look bad in light of a rule we wrote yesterday?
A widely-read engineering writeup on multi-agent systems makes the same point in different words: "Agents make dynamic decisions and are non-deterministic between runs, even with identical prompts. This makes debugging harder." The team behind it found that "adding full production tracing let us diagnose why agents failed and fix issues systematically." The tracing they describe captures decision patterns and interaction structures, not just OTel spans. That's an event log in everything but name.
Another industry analysis is sharper still: "Logs, metrics, and traces show you what happened, but they cannot reconstruct why it happened or the exact sequence of decisions… Deterministic replay requires something closer to event sourcing for agents."
Tracing answers "is the system healthy now?". Event sourcing answers "given everything that ever happened, what should I conclude?". You want both.
What Nagare specifically gives you
These are the framework features that map directly onto agent-system needs.
Append-only event journal per aggregate. Every decision the agent makes can be modelled as an event on its run's stream. The store is per-backend (Postgres, MySQL, SQL Server, SQLite) but the wire format is identical. Events are immutable once written.
Event explainers. Each event type has an IEventExplainer<T> that renders human prose for it ("book handed over", "tool call returned 404, retried"). Code that consumes events can ask the explainer for a title, a summary, and a tag bag. That's how you turn a JSON log into something a non-engineer can audit.
Replay to any past version. IAggregateRepository.LoadAtVersion(streamId, version) rebuilds the state as it was after the Nth event. For agent runs this means "show me what the agent thought at step 17" without writing a custom query.
Invariant search across the journal. Implement IBisectInvariant, register it, and Nagare can binary-search the journal for the first event where the rule starts failing. This is the "where did it go off track?" question turned into a method call. It works for any property you can express as "live state must satisfy X".
Causal lineage, not just timestamps. Every event carries correlationId, causationId, and traceparent in metadata. The first ties a chain of events to one logical request; the second points at the immediate parent event or command; the third links to the OTel trace span that wrote it. Together they let you walk the cascade in either direction.
Outbox with idempotent receivers. Agents producing tool calls need at-least-once delivery (you can't drop a side effect) and the receiver needs at-most-once semantics (you can't ring the doorbell twice). Nagare ships deterministic dispatch IDs and an LRU dedupe ring on the receiving aggregate, so producer and consumer together give you exactly-once outcomes without distributed transactions.
Process managers with replay-safe dedupe. A long-running agent run is a process manager. It watches events, issues commands, and may live for hours. Replay-safe dedupe means restarting the process doesn't double-issue commands; the dispatch tracker survives snapshot reload.
Lease registry across nodes. When you scale to multiple instances, exactly one of them owns the outbox runner and the catch-up subscription for any given agent type. The lease registry surfaces which one, which is the difference between "is the system stuck?" and "node B is doing it, node A is idle as designed".
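The outbox pairing of deterministic IDs with receiver-side dedupe can be sketched in a few lines. This is an illustration of the idea, not Nagare's implementation: DispatchId and DedupeRing are hypothetical names, the ID recipe (a truncated SHA-256 over stream, version and target) is an assumption, and the ring evicts in simple first-in-first-out order rather than true LRU order.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public static class DispatchId
{
    // The same source event and target always yield the same ID,
    // so a redelivered dispatch is recognisable as a duplicate.
    public static string For(string streamId, long eventVersion, string target)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes($"{streamId}|{eventVersion}|{target}"));
        return Convert.ToHexString(hash, 0, 16); // first 16 bytes, 32 hex characters
    }
}

// Bounded memory of recently seen dispatch IDs (FIFO eviction, an assumption).
public sealed class DedupeRing
{
    private readonly int _capacity;
    private readonly HashSet<string> _seen = new();
    private readonly Queue<string> _order = new();

    public DedupeRing(int capacity) =>
        _capacity = capacity > 0 ? capacity : throw new ArgumentOutOfRangeException(nameof(capacity));

    // Returns true the first time an ID is seen, false for a duplicate still in the ring.
    public bool TryAccept(string dispatchId)
    {
        if (!_seen.Add(dispatchId)) return false;
        _order.Enqueue(dispatchId);
        if (_order.Count > _capacity) _seen.Remove(_order.Dequeue());
        return true;
    }
}
```

The bound matters: a duplicate that arrives after its ID has been evicted is accepted again, so the ring has to be sized for the longest redelivery window you expect.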
What you do with these on a Tuesday
Concrete workflows the framework enables once you've modelled an agent run as a stream:
Why did the agent answer wrong? Load the run at the version of the bad decision. Look at the state. Ask the explainer to describe the immediately-preceding event. If the answer is in the inputs, you're done. If not, the explainer shows you what was there, which usually makes it obvious what was missing.
The eval rubric changed; re-grade history. Add a new invariant. Run it across the last week's range with the bisect API. The first failing event tells you when the new rule was first violated. From there, project forward: how many runs have failed it since? Reset the relevant projection's checkpoint, replay, count.
A user filed a complaint about a decision two months ago. Walk the causation chain backwards from the agent's response event. Each parent link points at the event that triggered it, until you hit the user's original input. If those inputs were correct and the agent still got it wrong, you've isolated the model failure from the data failure. That distinction matters for vendor escalations and SLA conversations.
A tool started returning bad data and we want to know what's poisoned. Define an invariant that checks whether the tool's response was within expected bounds. The first failing event is the moment the upstream tool changed behaviour. Every event after that is suspect.
Compliance audit asks "show me every action this agent took for user X between dates Y and Z, in order, with justification." Filter the journal by correlationId (set to the user's session) over the date range, render with explainers. That's a one-query answer to what's normally a multi-day discovery exercise.
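Several of the workflows above lean on finding the first failing event. A minimal sketch of that search follows, under one important assumption: the property is monotone, meaning that once it fails it keeps failing at every later version. JournalBisect and the holdsAt callback are hypothetical stand-ins; in Nagare the check would come from an IBisectInvariant, whose exact evaluation contract is not shown in this article.

```csharp
using System;

public static class JournalBisect
{
    // Finds the smallest version in [1, lastVersion] at which holdsAt is false,
    // or null if the property still holds at lastVersion.
    // Assumes monotonicity: once broken, the property stays broken.
    public static int? FirstFailingVersion(int lastVersion, Func<int, bool> holdsAt)
    {
        if (lastVersion < 1 || holdsAt(lastVersion)) return null;
        int lo = 1, hi = lastVersion;          // loop invariant: holdsAt(hi) is false
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (holdsAt(mid)) lo = mid + 1;    // failure starts after mid
            else hi = mid;                     // failure starts at or before mid
        }
        return lo;
    }
}
```

Each probe typically costs a replay to that version, so the logarithmic number of probes is the point. If the property can fail, recover, and fail again, bisection may skip the earliest failure, and a linear replay over the range is the safer tool.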
The compliance angle, with the caveats
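The EU AI Act puts hard requirements on this for high-risk systems. Article 12 requires that such systems technically allow the automatic recording of events (logs) "over the lifetime of the system", at a level of traceability appropriate to the system's intended purpose. Article 19 requires providers of high-risk systems to keep the automatically generated logs under their control for at least six months, unless other applicable law says otherwise. Penalties for non-compliance are real.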
The honest framing: an event-sourced system is one way to satisfy this, not the only way. Nothing in the Act mandates event sourcing. Industry analyses have argued that plain application logs are insufficient because they're "silently alterable", so you need either cryptographic chaining, append-only storage, or an immutable event store. Nagare gives you the third, but you have to choose to keep the journal rather than auto-purge it.
The same applies to SOC 2 and HIPAA where they touch decisions about individuals. Event sourcing doesn't grant compliance; it makes the evidence easier to produce.
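For comparison with the immutable-store option, this is roughly what the "cryptographic chaining" alternative mentioned above looks like: each entry's hash covers the previous entry's hash, so editing any past entry breaks every link after it. The sketch is generic and is not a Nagare feature; ChainedEntry and HashChain are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public sealed record ChainedEntry(string Payload, string PrevHash, string Hash);

public static class HashChain
{
    private const string Genesis = ""; // back-link of the very first entry

    private static string Digest(string prevHash, string payload) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(prevHash + "\n" + payload)));

    // Builds the next entry, linked to the current tail of the log.
    public static ChainedEntry Append(IReadOnlyList<ChainedEntry> log, string payload)
    {
        string prev = log.Count == 0 ? Genesis : log[log.Count - 1].Hash;
        return new ChainedEntry(payload, prev, Digest(prev, payload));
    }

    // Index of the first entry whose hash or back-link does not verify, or -1 if intact.
    public static int FirstBrokenIndex(IReadOnlyList<ChainedEntry> log)
    {
        string prev = Genesis;
        for (int i = 0; i < log.Count; i++)
        {
            if (log[i].PrevHash != prev || log[i].Hash != Digest(prev, log[i].Payload)) return i;
            prev = log[i].Hash;
        }
        return -1;
    }
}
```

A chain like this only proves anything if the latest hash is also kept somewhere an attacker with database access cannot rewrite; otherwise the whole chain can simply be recomputed.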
When event sourcing is the wrong tool
This is where most event-sourcing (ES) advocacy falls down. There are real cases where you should not use it.
A canonical architecture reference puts it directly: "The complexity that event sourcing adds to a system is not justified for most systems." And from a decade of practical writing on the topic: "a well-done CRUD is much better than a poorly done event sourcing."
For agents specifically:
- A stateless single-prompt classifier doesn't need it. If the agent takes one input, returns one classification, and never depends on prior runs, an HTTP log is fine.
- A CRUD admin tool that happens to wrap an LLM doesn't need it. The LLM is plumbing. The system of record is the database.
- A team with no ES experience, on a deadline, with a domain they don't yet understand, will probably hurt themselves with ES. The schema-evolution story is real ongoing work and the failure mode is brittleness, not flexibility.
Two more honest points specific to AI:
Replay does not make LLMs deterministic. Even at temperature=0, most providers don't guarantee bit-identical outputs across runs. To replay agent decisions deterministically, you need to record the LLM responses themselves and substitute them on replay. The journal captures inputs and decisions; it does not, on its own, make the model reproducible.
The journal only knows what you put in it. A tool that read three database rows and returned a summary will only put the summary into the event. If you need to replay against the original rows, the tool has to record them too, or the replay is "best effort". This is the boundary between an event log and a full simulation.
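The first of those two points, recording model responses and substituting them on replay, can be sketched as a pair of thin wrappers around whatever client talks to the model. Everything here is hypothetical: ILlmClient is an invented seam, not a Nagare or provider interface, and a real recorder would append the responses to the run's stream rather than to an in-memory list.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical seam between the agent and the model provider.
public interface ILlmClient
{
    Task<string> CompleteAsync(string prompt);
}

// Live mode: call the real client and keep every response, in order.
public sealed class RecordingLlmClient : ILlmClient
{
    private readonly ILlmClient _inner;
    public List<(string Prompt, string Response)> Recorded { get; } = new();
    public RecordingLlmClient(ILlmClient inner) => _inner = inner;

    public async Task<string> CompleteAsync(string prompt)
    {
        string response = await _inner.CompleteAsync(prompt);
        Recorded.Add((prompt, response));
        return response;
    }
}

// Replay mode: never calls the model; serves the recorded responses in order.
public sealed class ReplayingLlmClient : ILlmClient
{
    private readonly Queue<(string Prompt, string Response)> _recorded;
    public ReplayingLlmClient(IEnumerable<(string Prompt, string Response)> recorded) =>
        _recorded = new Queue<(string Prompt, string Response)>(recorded);

    public Task<string> CompleteAsync(string prompt)
    {
        if (_recorded.Count == 0)
            throw new InvalidOperationException("Replay ran past the end of the recording.");
        var (expected, response) = _recorded.Dequeue();
        if (expected != prompt)
            throw new InvalidOperationException("Replay diverged: prompt differs from the recorded one.");
        return Task.FromResult(response);
    }
}
```

Failing loudly on a prompt mismatch is a deliberate choice in this sketch: a replay that silently asks the model a different question is no longer a replay. The same wrapper shape applies to tools whose raw inputs you want to keep, which is the second point above.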
Where Nagare fits next to durable workflow runtimes
Vendors of durable workflow runtimes have been making the case for durable execution: a workflow engine that keeps your agent's progress across crashes, supports human-in-the-loop pauses, and replays steps deterministically against checkpoints. They're not wrong. For long-running agentic workflows, you want both kinds of tool.
A workflow runtime keeps one agent run executing reliably across infrastructure failures. An event-sourced system keeps the cross-cutting record of what every agent ever did, queryable after the fact, replayable against new rules. As one comparison of an agent framework and a workflow engine put it: the agent framework helps you model the agent's reasoning and tool flow, the workflow engine keeps execution durable, and you usually need both once real side effects enter the workflow.
You usually need event sourcing too, sitting underneath both, as the system of record.
What this looks like in code
Below is a sketch of how an agent-tool-call system might use Nagare. This is illustrative, not prescriptive. The framework doesn't ship anything called AgentRun; you'd model it yourself.
    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    // One stream per agent run.
    // The abstract base plus nested records gives a closed set of event types.
    public abstract record AgentRunEvent
    {
        public sealed record Started(string Prompt, string Model, Dictionary<string, string> Context) : AgentRunEvent;
        public sealed record Reasoned(string Thought, string ToolName, string ArgsJson) : AgentRunEvent;
        public sealed record ToolReturned(string ToolName, string ResponseJson, int LatencyMs) : AgentRunEvent;
        public sealed record ToolFailed(string ToolName, string ErrorCode, string Message) : AgentRunEvent;
        public sealed record Answered(string Response, int TokensUsed, decimal CostUsd) : AgentRunEvent;
        public sealed record Aborted(string Reason) : AgentRunEvent;
    }

    // Each event ships with an explainer so consumers can render it as English.
    // Truncate and Tags are small helper methods that are not shown here.
    public sealed class AgentRunExplainer : IEventExplainer<AgentRunEvent>
    {
        public Explanation Explain(AgentRunEvent e) => e switch
        {
            AgentRunEvent.Started s => new("Run started", $"User asked: {Truncate(s.Prompt)}", Tags("model", s.Model)),
            AgentRunEvent.Reasoned r => new($"Decided to call {r.ToolName}", r.Thought, Tags("tool", r.ToolName)),
            AgentRunEvent.ToolReturned t => new($"{t.ToolName} returned", $"in {t.LatencyMs} ms", Tags("outcome", "success")),
            AgentRunEvent.ToolFailed f => new($"{f.ToolName} failed", f.Message, Tags("outcome", "failure", "error", f.ErrorCode)),
            AgentRunEvent.Answered a => new("Answered", $"{a.TokensUsed} tokens, ${a.CostUsd:0.0000}", Tags("outcome", "answered")),
            AgentRunEvent.Aborted ab => new("Aborted", ab.Reason, Tags("outcome", "aborted")),
            _ => new("Event", e.GetType().Name, [])
        };
    }

    // An invariant the framework can binary-search across.
    public sealed class NoToolFailureLoopInvariant : IBisectInvariant
    {
        public string Name => "no-tool-failure-loop";
        public string Description =>
            "An agent run must not fail the same tool twice in a row without changing its arguments.";

        // Body elided in this sketch: the implementation reads from the agent-run
        // projection and returns Holds / Fails(message) for the live state.
        public ValueTask<InvariantResult> Evaluate(IInvariantContext ctx) =>
            throw new NotImplementedException();
    }

With that in place, an engineer asked "why did this run cost $4 and answer wrong?" can:
- Load the run state at the version of each step in turn.
- Read each step in plain English from the explainer.
- Walk the causation chain to see which earlier event triggered the bad tool call.
- Run the no-tool-failure-loop invariant across the journal to see if this run hit a known bug or a new failure mode.
That sequence (load, read, walk, search) is what's missing from log-based observability. It's the difference between hoping you can answer the question and being able to.
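The "walk" step is the least familiar of the four, so here is a minimal sketch of it. It assumes each event's metadata has been loaded into a lookup keyed by event ID; EventMeta and Lineage are hypothetical names, and only the causationId idea comes from the framework description above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical slice of event metadata; an empty CausationId marks a root event.
public sealed record EventMeta(string EventId, string CausationId, string Summary);

public static class Lineage
{
    // Walks parent links from startEventId back towards the root cause.
    // Returns the events ordered from the starting event to the root.
    public static List<EventMeta> WalkBack(IReadOnlyDictionary<string, EventMeta> byId, string startEventId)
    {
        var chain = new List<EventMeta>();
        var visited = new HashSet<string>();
        string current = startEventId;
        while (!string.IsNullOrEmpty(current)
               && visited.Add(current)                    // guards against cycles in bad data
               && byId.TryGetValue(current, out var meta))
        {
            chain.Add(meta);
            current = meta.CausationId;
        }
        return chain;
    }
}
```

Starting from the event that produced the bad answer, the last element of the returned list is the input that set the whole cascade in motion, and rendering each element through an explainer gives the plain-English account of how the run got there.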
Honest summary
Event sourcing isn't a magic ingredient for AI systems. It's a particular shape of persistence that happens to match the kinds of questions agents force you to answer: what did it know, why did it choose that, which run first showed this failure mode, show me everything for user X. Tracing answers a different question; durable workflow engines answer yet another.
If your agent system has downstream consequences you might need to undo, eval rubrics that change, or compliance obligations that ask "prove what happened", an event-sourced system of record will save you a lot of forensic work later. Nagare's specific contribution is the framework underneath: event explainers, replay-to-version, invariant search, causal metadata, idempotent dispatch. Those primitives are what turn a journal from a write-only log into something you can navigate and reason from.
If your agent is a one-shot classifier, none of that matters. Use Postgres.
Sources
- How we built our multi-agent research system
- Demystifying evals for AI agents
- Agent observability powers agent evaluation
- Testing agent skills systematically with evals
- Missing primitives for trustworthy AI: deterministic replay
- AgentRR: get experience from practice — LLM agents with record & replay
- AI agent explainability: why your infrastructure needs to remember
- Event sourcing: the backbone of agentic AI
- Martin Fowler: Event Sourcing, Retroactive Event
- Event sourcing pattern (Azure Architecture Center)
- When not to use event sourcing?
- Event sourcing is hard
- Temporal for AI
- LangGraph vs Temporal for AI agents: durable execution architecture
- EU AI Act: Article 12 — record-keeping, Article 19 — automatically generated logs