Reliable Systems on Unreliable Data

Some systems get to assume their inputs are basically correct and their dependencies basically up. This writeup is about the other kind — where the data is wrong often enough that you have to design around it, the upstreams have no real API and fall over on their own schedule, and the whole thing has to keep working with nobody watching.

The problem

The job: take a stream of records, enrich each one by talking to a few external systems, and produce a correct result for every record — unattended, indefinitely.

The catch is everything around it:

The source-of-truth system exposes no usable API. You're working through an interface that was built for humans, not programs.
A downstream service returns a zoo of error codes, some retryable, some not, and some that look retryable but aren't.
The data itself is dirty often enough that "assume it's valid" guarantees a steady trickle of silent wrong answers.

"Handle the happy path, wrap it in a retry, page someone on failure" is not a design here. It's a way to generate pages forever.

The solution

The way through was to stop thinking about the happy path at all and start from what must always be true.

Every record ends in a known state

The load-bearing invariant: when the dust settles, every record is done, in a typed error state, or a documented skip, and never an ambiguous nothing. That single rule forces you to enumerate the failure modes up front, because "I didn't think about that" is no longer an allowed resting state. Most of the design is just deciding, for each way a step can go wrong, which of those three buckets it lands in.
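
As a rough illustration, the invariant can be written down as a check that runs at the end of every batch. This is a minimal sketch, not the actual implementation: the names State, Record, and unresolved are hypothetical, and so is the rule that an error or skip must carry a non-empty detail string.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    PENDING = "pending"   # in flight; not an allowed resting state
    DONE = "done"
    ERROR = "error"       # must carry a typed error code
    SKIPPED = "skipped"   # must carry a documented reason


TERMINAL = {State.DONE, State.ERROR, State.SKIPPED}


@dataclass
class Record:
    record_id: str
    state: State = State.PENDING
    detail: Optional[str] = None  # error code or skip reason (assumed field)


def unresolved(records):
    """Return the ids of records that violate the invariant."""
    bad = []
    for r in records:
        if r.state not in TERMINAL:
            bad.append(r.record_id)
        elif r.state in (State.ERROR, State.SKIPPED) and not r.detail:
            # An error or skip with no explanation is still ambiguous.
            bad.append(r.record_id)
    return bad
```

A run would finish with something like `assert not unresolved(batch)`, so a record left in limbo fails the run instead of disappearing.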

Classify failures instead of retrying them

A blanket retry treats a transient timeout and a permanently malformed record the same way, and punishes you for it. Instead each failure is classified: retryable (back off and try again, with a cap), droppable (record a typed error and move on — retrying can't help), or stop-the-line (something is wrong enough that continuing would do damage, so halt loudly). The taxonomy is the design; the code is just a switch statement over it.
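
To make the "switch statement over a taxonomy" concrete, here is a hedged sketch. The error codes, the StepError and StopTheLine exceptions, and the backoff parameters are all invented for illustration. Treating an unrecognized code as stop-the-line is also an assumption on my part, chosen so that an unclassified failure cannot slip through quietly.

```python
import time
from enum import Enum


class FailureClass(Enum):
    RETRYABLE = "retryable"
    DROPPABLE = "droppable"
    STOP_THE_LINE = "stop_the_line"


class StepError(Exception):
    """Hypothetical error raised by a step, carrying an upstream code."""
    def __init__(self, code):
        super().__init__(code)
        self.code = code


class StopTheLine(Exception):
    """Raised to halt the whole run loudly."""


# Hypothetical codes; the real table comes from enumerating the upstream's errors.
CLASSIFICATION = {
    "TIMEOUT": FailureClass.RETRYABLE,
    "RATE_LIMITED": FailureClass.RETRYABLE,
    "MALFORMED_RECORD": FailureClass.DROPPABLE,
    "NOT_FOUND": FailureClass.DROPPABLE,
    "AUTH_REVOKED": FailureClass.STOP_THE_LINE,
}


def classify(code):
    # Assumption: an unclassified code halts the run rather than being retried.
    return CLASSIFICATION.get(code, FailureClass.STOP_THE_LINE)


def run_step(step, record, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Return ("done", result) or ("error", code); raise StopTheLine to halt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return ("done", step(record))
        except StepError as exc:
            kind = classify(exc.code)
            if kind is FailureClass.STOP_THE_LINE:
                raise StopTheLine(exc.code) from exc
            if kind is FailureClass.DROPPABLE:
                return ("error", exc.code)  # retrying cannot help
            if attempt == max_attempts:
                return ("error", "RETRIES_EXHAUSTED:" + exc.code)
            sleep(base_delay * 2 ** (attempt - 1))  # capped exponential backoff
```

Note that a retryable failure that exhausts its attempts still lands in a typed error, so the cap does not create a fourth, ambiguous outcome.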

Fail loudly, never silently fall back

The most dangerous code I removed from this system was a convenience: a missing identifier got a default so work would "just run." That's the textbook quiet assumption: it turns a loud, catchable error into a silent wrong write. Now a missing identifier fails the unit of work on the spot. A stopped step is visible; a quietly-wrong one is an incident you find out about weeks later.
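
A small sketch of the difference, assuming records are plain dicts. The field name account_id, the helper require_identifier, and the exception type are placeholders, not names from the real system.

```python
class MissingIdentifierError(Exception):
    """Raised when a record lacks the identifier a write depends on."""


def require_identifier(record, field="account_id"):
    """Return the identifier, or fail the unit of work if it is absent or blank."""
    value = record.get(field)
    if value is None or str(value).strip() == "":
        raise MissingIdentifierError(
            f"record {record.get('id', '<unknown>')!r} has no {field}"
        )
    return value


# The removed pattern, for contrast (DEFAULT_ACCOUNT is hypothetical):
#     account_id = record.get("account_id", DEFAULT_ACCOUNT)
# It never raises, so a missing value becomes a write against the wrong account.
```

The caller catches MissingIdentifierError at the unit-of-work boundary and records a typed error, which keeps the failure visible without taking down unrelated records.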

Self-heal the recoverable, ask for help on the rest

For the failures that can be repaired automatically, such as a record that didn't match because of a formatting quirk or a value that can be recovered from a secondary source, the system applies the repair itself and writes the fix back, so the next occurrence is free. For the genuinely ambiguous ones, it surfaces a single, human-readable alert rather than guessing. The bar is simple: nothing unknown proceeds silently.
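
One way this could look for the formatting-quirk case is sketched below. The normalization rules (case, whitespace, dashes), the in-memory aliases dict standing in for wherever fixes are written back, and the alerts list standing in for the alerting channel are all assumptions for illustration.

```python
def normalize(key):
    # Assumed formatting quirks: case, surrounding whitespace, dashes, spaces.
    return key.strip().upper().replace("-", "").replace(" ", "")


def resolve(key, known, aliases, alerts):
    """Return the canonical key, or None after raising a single alert."""
    if key in known:
        return key
    if key in aliases:
        return aliases[key]            # healed on a previous run: free this time
    matches = [k for k in known if normalize(k) == normalize(key)]
    if len(matches) == 1:
        aliases[key] = matches[0]      # write the fix back
        return matches[0]
    # Zero or several candidates: ambiguous, so ask for help instead of guessing.
    alerts.append(f"cannot resolve {key!r}: {len(matches)} candidates")
    return None
```

The repair only fires when exactly one candidate matches. Two plausible matches count as ambiguous and go to a human, which is what keeps the self-healing from becoming another silent fallback.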

The result

None of this makes the inputs reliable. It makes the system reliable on top of them. No matter how many scheduled runs fail, how often an upstream is down, or how much bad data shows up, the work converges on a correct, known state without anyone intervening.

It isn't a perfect system — it's a known one. The complexity is hidden, the errors are legible, and the workflow keeps moving on its own. That, more than any uptime number, is what I mean by reliable.