Crash Containment: Treat Stale IPC State as Toxic

When a worker process dies, restart is not enough if the host still holds artifacts from the dead lifetime: ring buffers whose writer is gone, cached descriptors that still look valid, host-side readers, and in-flight RPCs whose replies will never arrive. None of these fail loudly. They produce data that is structurally valid but no longer real. At that point recovery stops being "start a new process." It becomes a decision about which pieces of state are still trustworthy and which must be treated as dead along with the worker.

A crash boundary is easier to reason about when it is drawn explicitly:

Before crash: one coherent lifetime

[worker generation A] --> [ring generation A] --> [host reader generation A] --> [UI generation A]

After crash, before cleanup: same pointers, broken contract

[dead worker generation A] x
                           [ring generation A] --> [host reader generation A] --> [UI]

The memory may still be mapped and the handles may still be non-null, but the two-sided protocol that made them trustworthy is gone.

Why "just restart" is not enough

If a worker runs in strict isolation, restart is often enough. The OS reclaims its address space, the next instance starts from zero, and the damage stays local.

This article is about the harder case: a worker that shares state with its host because low latency matters. Think shared-memory rings with writer and reader indices updated from different processes, cached channel descriptors on the host, a UI thread reading from those descriptors, and background log drains or event readers that stay alive independently of the worker.

When that worker dies, every one of those artifacts becomes suspect at the same instant.

Consider a concrete example. A market-data worker publishes ticks into a shared-memory ring. The host has already opened that ring, mapped it, and handed a Reader object to a plotting thread. The worker crashes after incrementing the writer sequence but before finishing the payload, or after finishing the payload but before the host learns the writer is gone. The ring still exists in host memory. The descriptor still has a valid pointer. The reader can still call next(). Every shallow check says "this looks usable."

It is not usable. The protocol that made those bytes meaningful was two-sided, and one side has disappeared.

That is why stale IPC state is dangerous. It keeps enough of its shape to pass lightweight validation. The host is not reading obvious garbage; it is reading something that still looks live after it stopped being part of a coherent system.
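To make the "passes lightweight validation" point concrete, here is a minimal sketch of a torn write. The slot layout (a sequence number followed by a price) and the names SLOT, looks_ready, and simulate_torn_write are assumptions made for illustration, not the layout of any particular ring implementation. A plain bytearray stands in for the shared mapping.

```python
import struct

# Hypothetical slot layout: 8-byte sequence number, then an 8-byte price.
SLOT = struct.Struct("<Qd")


def looks_ready(buf: bytearray, expected_seq: int) -> bool:
    """Shallow check: the slot's sequence matches what the reader expects next."""
    seq, _price = SLOT.unpack_from(buf, 0)
    return seq == expected_seq


def simulate_torn_write() -> tuple:
    buf = bytearray(SLOT.size)
    SLOT.pack_into(buf, 0, 1, 101.25)   # tick 1: sequence and payload both written
    struct.pack_into("<Q", buf, 0, 2)   # tick 2: sequence bumped, then the writer dies
    seq, price = SLOT.unpack_from(buf, 0)
    return looks_ready(buf, 2), seq, price


ready, seq, price = simulate_torn_write()
print(ready, seq, price)  # the slot passes the check but still carries tick 1's price
```

Nothing in the buffer says the writer is gone. The check succeeds, the bytes decode cleanly, and the value belongs to a tick that was never published.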

If the restart path reuses any of it, the new worker inherits the dead worker's leftovers. A chart may show a few old samples followed by new ones and quietly look "mostly fine." A log drain may continue reading from the old ring while a replacement worker publishes to a fresh one, and the UI merges the streams as if they were one lifetime. Some runs produce visibly wrong numbers. Some produce a crash later in unrelated code. Almost none point cleanly back to the original worker death.

The right model is stricter: once the worker dies, every shared artifact it touched is stale by default. The supervisor does not get to reuse state unless it can prove that state is outside the crashed lifetime.

What is stale and what is not

After a crash, the useful question is not "what object still exists?" It is "what was coupled to the dead process?"

Treat these as stale immediately:

Ring buffers written by the dead worker. The writer side of the protocol died with the process.
Host-side readers and descriptors for those rings. They still point at the old lifetime, even if the pointers look valid.
Cached shared-memory pointers. Their meaning depended on a peer that no longer exists.
In-flight RPCs to the worker. The reply is never coming, so waiting on them can wedge recovery.
Duplicate crash notifications for the same lifetime. After teardown starts, they add noise, not new information.
Startup liveness flags. "Was alive a moment ago" is not useful after a crash.

Still reusable across the crash boundary:

Static configuration. It describes how to build a lifetime, not how to continue one.
Host-owned identity. Session names and restart policy belong to the supervisor.
Supervisor lifetime-management state. It exists to manage generations rather than participate in one.
The generation-assignment mechanism. It is the tool that separates old and new artifacts.

This split is not encoded in the types. A shared_ptr<Reader> looks perfectly healthy even when its writer died 200 milliseconds ago. A descriptor with non-null pointers is still stale if those pointers name the previous lifetime. The code needs an explicit contract because the types will not infer one for you.
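One way to make the contract explicit is to keep the two kinds of state in separate containers, so that retiring a lifetime means dropping one object rather than auditing many fields. The sketch below is an assumption about how that could look. HostConfig, GenerationState, and Supervisor are hypothetical names, and the lists hold placeholder objects rather than real readers or descriptors.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class HostConfig:
    """Reusable across crashes: describes how to build a lifetime."""
    session_name: str
    ring_capacity: int


@dataclass
class GenerationState:
    """Coupled to one worker lifetime; dropped as a unit when it ends."""
    generation: int
    readers: List[object] = field(default_factory=list)
    descriptors: List[object] = field(default_factory=list)
    pending_rpcs: List[object] = field(default_factory=list)


@dataclass
class Supervisor:
    config: HostConfig
    next_generation: int = 1
    current: Optional[GenerationState] = None

    def begin(self) -> GenerationState:
        # Every new lifetime gets a fresh container and a fresh number.
        self.current = GenerationState(self.next_generation)
        self.next_generation += 1
        return self.current

    def retire(self) -> None:
        # Nothing inside GenerationState is salvaged; the whole object goes.
        self.current = None
```

This only models ownership. A real retire step would also have to close readers and unmap memory before letting go of the references.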

The mental shift

The design becomes much clearer once a crash is treated as a generation transition instead of a process-management event.

Every shared artifact belongs to a generation. If a worker created ticks:ring:generation-a, then every descriptor, reader, callback, and UI reference to that ring belongs to generation A as well. When the worker crashes, generation A is over. The next worker must publish to generation B, not "the same ring again."

That sounds cosmetic until you look at what it buys you:

Old and new lifetimes cannot silently collide.
Readers can cheaply reject stale state by comparing generations.
Cleanup has a crisp goal: retire one generation completely before creating the next.

generation A lifetime

start --> publish --> observe --> crash
                               |
                               v
                    teardown readers/descriptors/RPCs
                               |
                               v
                      advance current generation
                               |
                               v
                           start generation B

Suppose a plotting thread wakes up after the crash still holding a reader for generation A. Without generations, it may keep reading until a deeper check fails, or worse, until it produces a plausible but wrong plot. With generations, the thread can see "I am holding generation A, the supervisor is now on generation B" and drop the reader immediately. That is the entire point: stale state should fail fast at the boundary, not linger until it causes an unrelated symptom.
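A minimal sketch of that check, assuming a host-owned counter that readers can consult cheaply. GenerationClock, GuardedReader, and StaleGeneration are hypothetical names, and an ordinary Python iterable stands in for the ring.

```python
import threading


class StaleGeneration(Exception):
    """Raised when a handle outlives the generation it was created in."""


class GenerationClock:
    """Host-owned marker for the current generation (hypothetical helper)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._current = 1

    def current(self) -> int:
        with self._lock:
            return self._current

    def advance(self) -> int:
        with self._lock:
            self._current += 1
            return self._current


class GuardedReader:
    """Wraps an iterator and remembers the generation it attached to."""

    def __init__(self, clock: GenerationClock, source) -> None:
        self._clock = clock
        self._source = iter(source)
        self.generation = clock.current()

    def next(self):
        # Compare before touching the underlying data at all.
        if self.generation != self._clock.current():
            raise StaleGeneration(
                f"holding generation {self.generation}, "
                f"current is {self._clock.current()}"
            )
        return next(self._source)
```

The comparison narrows the window but does not close it: a reader can pass the check just before the generation advances. That is one reason the comparison complements explicit teardown rather than replacing it.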

Once you think in generations, "restart the worker" is no longer the core action. The core action is "end the old generation cleanly, then decide whether to create a new one."

Cleanup as an explicit contract

The practical consequence is that cleanup cannot be an accidental side effect of object destruction. It has to be a named, ordered sequence.

One reasonable sequence looks like this:

1. Mark the crashed lifetime as no longer current and suppress duplicate crash handling for that same generation.
2. Detach observers and fail in-flight RPCs promptly so no thread keeps waiting for a dead peer.
3. Drop host-side readers, descriptors, shared-memory mappings, and any queued work items that still reference the dead generation.
4. Advance the generation marker so stale consumers can no longer treat their handles as current.
5. Only then decide whether to start a replacement worker.
6. If restarting, rebuild through the same startup path used for a cold start so recovery does not grow its own special-case stack.

The order matters more than the individual API names.
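The sequence above can be sketched as a single handler whose body follows the listed order. This is an illustration under stated assumptions, not a reference implementation. TeardownSupervisor, WorkerDied, and the start_worker callback are hypothetical, readers are assumed to expose a close() method, and concurrent.futures.Future stands in for whatever the IPC layer uses to represent a pending call.

```python
from concurrent.futures import Future


class WorkerDied(Exception):
    """Delivered to callers whose RPC can never complete."""


class TeardownSupervisor:
    def __init__(self, start_worker) -> None:
        self.generation = 1
        self.pending = []        # in-flight RPC futures for the current generation
        self.readers = []        # host-side readers; each exposes close()
        self._retired = set()    # generations whose teardown has already begun
        self._start_worker = start_worker

    def on_crash(self, crashed: int, restart: bool = True) -> bool:
        # 1. Mark the lifetime as not current; ignore duplicates and late reports.
        if crashed != self.generation or crashed in self._retired:
            return False
        self._retired.add(crashed)

        # 2. Fail in-flight RPCs so nobody waits on a dead peer.
        for fut in self.pending:
            if not fut.done():
                fut.set_exception(WorkerDied(f"generation {crashed} crashed"))
        self.pending.clear()

        # 3. Drop host-side readers and descriptors tied to the dead generation.
        for reader in self.readers:
            reader.close()
        self.readers.clear()

        # 4. Advance the marker; old handles stop comparing as current.
        self.generation += 1

        # 5 and 6. Only now decide on a restart, via the cold-start path.
        if restart:
            self._start_worker(self.generation)
        return True
```

The handler returns False for a duplicate or late notification, so callers can tell whether their report actually triggered a teardown. A production version would also need locking if crash reports can arrive on more than one thread.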

Worker generation A     Supervisor              Host readers / UI
-------------------     ----------              -----------------
crash -------------->   detect crash
                        mark generation A not current
                        fail dead RPCs -------> waiting calls resolve as failure
                        drop descriptors -----> stale readers detach
                        advance to gen B -----> old handles no longer compare current
                        optional restart -----> new readers attach to generation B only

Imagine restarting too early. The supervisor spawns a replacement worker before the old log-drain thread has released its descriptor. The new worker begins publishing to a fresh ring, but the old thread is still forwarding the tail end of generation A into the UI. The system now has two live-looking sources feeding one consumer. That is exactly the overlap crash containment is supposed to prevent.

Generations also make cleanup sequencing easier to reason about. If the generation advances only after readers and descriptors have been torn down, then "current generation" really means "the only lifetime that is allowed to publish or be observed as live." That is a clean boundary.

One thing deliberately does not survive the boundary: historical data the dead worker had already consumed internally. A restarted worker does not get to pretend it can continue from the exact point of failure. The host usually does not know what internal state the worker had derived, and the worker's in-memory state vanished with the crash. Starting from live data and admitting the gap is more honest than fabricating continuity.
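One way to admit the gap is to put it in the data the consumer sees. The sketch below assumes each sample carries the generation that produced it, and inserts an explicit marker wherever the generation changes. Tick, Gap, and with_gaps are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Union


@dataclass(frozen=True)
class Tick:
    generation: int
    value: float


@dataclass(frozen=True)
class Gap:
    """Marks a crash boundary: data between the two lifetimes is unknown."""
    ended: int
    started: int


def with_gaps(ticks: Iterable[Tick]) -> Iterator[Union[Tick, Gap]]:
    previous: Optional[int] = None
    for tick in ticks:
        # Emit a marker whenever the producing generation changes.
        if previous is not None and tick.generation != previous:
            yield Gap(ended=previous, started=tick.generation)
        previous = tick.generation
        yield tick
```

A plotting layer can then render a Gap as a break in the line instead of connecting the last sample of one lifetime to the first sample of the next.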

Where this becomes expensive

This design is worth the ceremony, but it is not free.

Cleanup latency is real. Releasing readers, tearing down mappings, draining callbacks, and rebuilding IPC artifacts costs time. On a worker with many output channels, operators will notice a pause before the replacement is live. That pause is not accidental overhead; it is the time spent re-establishing a coherent boundary.

Dead RPCs must fail quickly. After a crash, an outstanding call is not "slow." It is impossible to complete. If the IPC layer cannot turn that into an immediate failure, recovery can wedge behind a reply that will never arrive.
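A minimal sketch of a pending-call table that turns a crash into an immediate failure, assuming calls are tracked by id on the host side. PendingCalls and PeerGone are hypothetical names, and concurrent.futures.Future is used only as a convenient stand-in for a pending reply.

```python
import threading
from concurrent.futures import Future


class PeerGone(Exception):
    """The worker that would have answered this call no longer exists."""


class PendingCalls:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls = {}
        self._closed_reason = None

    def register(self, call_id: int) -> Future:
        fut = Future()
        with self._lock:
            if self._closed_reason is not None:
                # Calls issued after the crash fail at once instead of queuing.
                fut.set_exception(self._closed_reason)
            else:
                self._calls[call_id] = fut
        return fut

    def complete(self, call_id: int, value) -> None:
        with self._lock:
            fut = self._calls.pop(call_id, None)
        if fut is not None:
            fut.set_result(value)

    def fail_all(self, reason: Exception) -> int:
        # Swap the table out under the lock, then resolve outside it.
        with self._lock:
            self._closed_reason = reason
            calls, self._calls = self._calls, {}
        for fut in calls.values():
            fut.set_exception(reason)
        return len(calls)
```

Any thread blocked in fut.result() wakes up with PeerGone as soon as fail_all runs, and calls registered after the crash fail without ever being queued. In this sketch a table belongs to one generation, so a replacement worker would get a fresh one.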

Containment is not replay. Anything still in the dead worker's private memory is gone. If your correctness model requires exact replay of the worker's last in-flight state, you need durable state and a different architecture. Crash containment gives you a clean restart, not time travel.

The supervisor becomes more explicit. A cleanup-first restart path has more moving parts than "worker died, start another one." That extra code is the cost of forcing lifecycle boundaries into the open instead of hiding them behind optimistic reuse.

Crash loops still exist. If the worker contains a deterministic startup bug, every fresh generation may die the same way. Backoff, cooldowns, and restart limits are still sensible. They just solve a different problem. Policy only helps after the state model is coherent.
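As an illustration of the policy layer that sits on top of a clean teardown, here is a small sketch of exponential backoff with a restart limit inside a sliding window. The function name and all default values (base, cap, limit, window) are assumptions chosen for the example, not recommendations.

```python
from typing import Optional, Sequence


def next_restart_delay(
    crash_times: Sequence[float],
    now: float,
    *,
    base: float = 0.5,
    cap: float = 30.0,
    limit: int = 5,
    window: float = 60.0,
) -> Optional[float]:
    """Return seconds to wait before restarting, or None to stop restarting.

    crash_times holds timestamps of crashes, including the one just handled.
    """
    # Only crashes inside the window count toward the limit and the backoff.
    recent = [t for t in crash_times if now - t <= window]
    if len(recent) >= limit:
        return None
    # The delay doubles with each recent crash, up to the cap.
    return min(cap, base * 2 ** max(0, len(recent) - 1))
```

The function only answers "when, if ever." It assumes the previous generation has already been retired by the time it is consulted.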

The mental model is short: when a worker dies, the generation dies with it. Everything tied to that generation is stale until proven otherwise, and in practice almost all of it should be dropped.

That is why the contract is not "restart the process." The contract is "retire the old generation, invalidate every shared artifact that came with it, and only then decide what comes next."

Systems that get this wrong rarely fail loudly. They fail by producing believable but incorrect behavior: mixed generations, ghost readers, stuck RPC waits, and UI views built from stale data that still looks structurally valid. Systems that get it right usually pause after a crash, clean up aggressively, and restart from scratch. That pause is the cost of returning to a state the host can trust.