About

An engineer who hunts assumptions

I’ve spent the last 12+ years building systems across media, analytics, healthcare, cloud infrastructure, and distributed platforms.

I’ve worked on real-time analytics engines, video processing systems, Kafka pipelines, multi-tenant SaaS platforms, healthcare automation products, observability stacks, and the infrastructure that keeps them running.

The technologies changed repeatedly. The problems rarely did. The question I keep coming back to is:

What must always be true?

Most production failures aren’t caused by complicated bugs. They’re caused by assumptions — something everyone believed was true turns out not to be.

A message was assumed to be delivered once.
A queue was assumed to be ordered.
A service was assumed to be available.
A user was assumed to belong to the right tenant.
A piece of state was assumed to exist.

Eventually reality asks the question anyway. I prefer asking it first.

I call the assumptions that must always hold invariants. They’re the system’s safety net.

A request should never read another tenant’s data.
A message should never be acknowledged before its work is complete.
Every workflow should end in a known state.
An audit trail should exist regardless of which service performed the action.

Once the invariants are identified, architecture becomes less about opinion and more about enforcement.
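As a minimal sketch of what enforcement can look like, take the invariant "a message should never be acknowledged before its work is complete." The `InMemoryQueue` class and `consume` function below are hypothetical stand-ins for illustration, not a real broker client API. The point is only the ordering: the acknowledgement happens after the handler returns, and never on the failure path.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class InMemoryQueue:
    """Hypothetical stand-in for a broker; not a real client library."""
    pending: list[str] = field(default_factory=list)
    acked: list[str] = field(default_factory=list)

    def ack(self, message: str) -> None:
        # Acknowledging removes the message from the redelivery set.
        self.pending.remove(message)
        self.acked.append(message)


def consume(
    queue: InMemoryQueue, handler: Callable[[str], None]
) -> list[tuple[str, Exception]]:
    """Run the handler on each pending message, acking only after success."""
    failures: list[tuple[str, Exception]] = []
    for message in list(queue.pending):  # iterate over a copy; ack mutates pending
        try:
            handler(message)
        except Exception as exc:
            # The work did not complete, so the message stays unacknowledged
            # and the failure is reported to the caller instead of being hidden.
            failures.append((message, exc))
            continue
        queue.ack(message)  # the only ack path, reached only after the work is done
    return failures
```

Because there is a single ack call and it sits after the handler, the invariant holds by construction rather than by convention. The cost is that a failed message may be delivered again, so the handler has to tolerate duplicates.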

The implementation evolves. The invariants tend to survive.

Most of my career has been spent finding the next bottleneck

For several years I worked on systems that were growing continuously. Every order of magnitude exposed a different problem.

At small scale, counting views is trivial. At larger scale, every path that touches a view count becomes part of the problem: aggregation, storage, caching, replication, hot keys, recovery, and operational visibility.

The same pattern repeated everywhere.

Analytics stopped being a reporting problem and became a streaming problem.
Video processing stopped being a media problem and became a scheduling problem.
Infrastructure stopped being provisioning and became systems design.
Kafka stopped being a queue and became a state-management problem.

Every time the system grew, some design that previously worked stopped working.

Those experiences taught me to be skeptical of permanent solutions. Most architectures are temporary. The goal isn’t to find a design that lasts forever — it’s to understand which assumptions will eventually break, and make change affordable when they do.

Reliability and scale are the same discipline

One lesson that surprised me is how similar reliability and scalability really are.

Both require clear ownership.
Both require explicit state transitions.
Both require well-defined boundaries.
Both punish hidden assumptions.

The systems that scaled well were usually easier to reason about. The systems that were reliable were usually easier to scale.

Many of the lessons that kept systems alive under load turned out to be the same lessons that kept them correct during failures.

State is harder than technology

I’ve worked with PostgreSQL, Redis, MongoDB, Cassandra, Elasticsearch, Kafka, RabbitMQ, Kubernetes, Terraform, and a long list of other technologies. One of my favorite observations from those years is:

Every database looks good when you’re not using it.

Every tool has strengths. Every tool eventually exposes weaknesses. Most engineering discussions happen around technologies; most production problems happen around state.

Who owns it?
Who can modify it?
How does it move?
How does it recover?
What happens when two systems disagree?

Those questions usually matter far more than framework choices.

I generally prefer explicit ownership over hidden magic, recovery over coordination, and making incorrect states structurally impossible whenever practical.
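One way to approximate "incorrect states are impossible" is to make every state transition go through a single function that consults an explicit table. The sketch below is illustrative only: the state names, the `ALLOWED` table, and the `transition` function are assumptions, not taken from any system described here. In Python this is a runtime check rather than a compile-time guarantee, so "structurally impossible" becomes "rejected at the only door."

```python
from enum import Enum


class WorkflowState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Hypothetical transition table: every legal move is listed, nothing else is legal.
ALLOWED: dict[WorkflowState, set[WorkflowState]] = {
    WorkflowState.PENDING: {WorkflowState.RUNNING},
    WorkflowState.RUNNING: {WorkflowState.SUCCEEDED, WorkflowState.FAILED},
    WorkflowState.SUCCEEDED: set(),  # terminal
    WorkflowState.FAILED: set(),     # terminal
}


class IllegalTransition(Exception):
    """Raised when a caller attempts a move that the table does not list."""


def transition(current: WorkflowState, target: WorkflowState) -> WorkflowState:
    """Return the new state, or raise if the move is not explicitly allowed."""
    if target not in ALLOWED[current]:
        raise IllegalTransition(f"{current.value} -> {target.value} is not allowed")
    return target
```

With the table as the single owner of the rules, the questions "who can modify it?" and "how does it move?" have one answer, and a workflow cannot drift from a terminal state back into a running one without an exception.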

Builder first

I’ve always enjoyed building things from scratch.

Many of the systems I’ve worked on started as a vague requirement and gradually became platforms: infrastructure, deployment pipelines, observability, access control, integrations, automation, operational tooling, and the product itself evolving together.

The part I enjoy most isn’t writing code. It’s taking something ambiguous and gradually making it understandable.

Defining boundaries.
Removing assumptions.
Finding the simplest model that still survives reality.

Fail loudly. Recover predictably.

I assume every component will fail eventually.

A broker will disappear.
A deployment will go wrong.
A dependency will time out.
A message will be delivered twice.
A network partition will happen.

The goal isn’t to prevent every failure. The goal is to know what happens next.

I generally prefer systems that fail loudly and predictably over systems that continue operating quietly in a broken state.

Not perfect systems. Understandable systems — where recovery is part of the design instead of an afterthought.
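A small sketch of both halves, under stated assumptions: the `IdempotentProcessor` class, its in-memory `seen` set, and the "negative amount is invalid" rule are all hypothetical. A duplicate delivery is absorbed quietly because it is expected; invalid input raises because it is not.

```python
class IdempotentProcessor:
    """Hypothetical handler that tolerates duplicate delivery of the same message."""

    def __init__(self) -> None:
        self.seen: set[str] = set()  # a real system would persist this
        self.total = 0

    def handle(self, message_id: str, amount: int) -> bool:
        """Apply the message once. Return False if it was already applied."""
        if message_id in self.seen:
            # Expected failure mode: the message was delivered twice. Do nothing.
            return False
        if amount < 0:
            # Unexpected input: fail loudly instead of recording a broken state.
            raise ValueError(f"message {message_id} has a negative amount: {amount}")
        self.total += amount
        self.seen.add(message_id)  # marked as seen only after the work succeeded
        return True
```

A second delivery of the same message ID leaves `total` unchanged, and a rejected message is not marked as seen, so a corrected redelivery can still be processed.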

Selected work

Multi-tenant isolation the database enforces.
Designed and shipped Postgres row-level security in production, so the database refuses another tenant’s rows even when a query forgets to filter — layered under application authorization as defense in depth.
A self-managed Kafka cluster, and a migration without losing it.
Built and operate a 3-broker KRaft cluster (no ZooKeeper) as code, then moved it onto encrypted disks via a blue-green partition reassignment with ~3 minutes of downtime.
An entire production cloud footprint as code, 0→1.
Stood up VPC, multi-AZ Postgres, self-managed Kafka, search, CDN, and VPN as Terraform for a HIPAA-regulated startup, with plan-gated CI/CD and a zero-long-lived-credential identity model.
A real-time analytics engine at ~5k events/sec.
A small DSL compiled analytical queries into a single runtime-reconfigurable consumer that aggregated, reduced resolution, and materialized results — no new service per metric.
Serverless video processing at ~50,000 videos/day.
Recognized encoding as bursty and parallelizable, and built it on serverless FFmpeg for low-cost elastic throughput at peak.
Accuracy engineering on noisy ground truth.
Drove insurance-eligibility extraction from an ~83% baseline toward a 95% target by drilling one payer/parameter bucket at a time against real ground truth; a Medicare secondary classifier at 94.7% over 456 cases.
Integrating a system that has no API.
Drove a cloud browser as the access layer for an EHR with no usable API, reverse-engineering its session model from a 2,675-session, 90-day log analysis.
Applied LLMs, behind deterministic gates.
Document and card extraction, and structured parsing of free-text notes into typed data — with correctness-bearing decisions kept deterministic and every model verdict checkable.

What I work with

Distributed systems: Kafka (KRaft, consumer groups, delivery semantics) · event-driven pipelines · idempotency & commutativity · fail-stop design · circuit breakers
Data & storage: PostgreSQL (RLS, advisory locks, JSONB, WAL/replication internals) · Redis · Cassandra · Elasticsearch / OpenSearch · MongoDB
Cloud & infrastructure: AWS (VPC, RDS, ASG, Lambda, CloudFront, Route 53, KMS, EventBridge…) · Terraform · Packer / AMIs · Kubernetes · OIDC CI/CD
Observability: Prometheus · Grafana · Loki / Promtail · alerting design · incident root-cause analysis
Security & compliance: HIPAA & SOC 2 controls · row-level security · RBAC · threat modeling · secrets management · DNSSEC
Applied LLM: structured extraction · vision / OCR pipelines · evaluation harnesses · determinism where correctness matters
Languages: TypeScript / Node.js · Go · Python · SQL

Experience

Founding Engineer · Kairos Health (formerly Shunya Health)

2025 – present

Single-handedly own the full stack of a HIPAA-regulated healthcare-automation platform: distributed pipelines, database-enforced multi-tenancy, the entire cloud footprint as Terraform (~20 AWS services), observability, security & compliance, and applied-LLM features.

Principal Architect · Addverb Technology

2024

Improved the Kubernetes platform, streamlined deployments across multiple clients, and added multi-tenancy for better resource utilization.

Senior Director of Technology · Rizzle

2019 – 2024

Owned the complete distributed backend of a large-scale consumer video product through several orders of magnitude of growth — a ~5k-events/sec analytics engine, ~50k-videos/day serverless encoding, five production datastores — kept highly available for five years.

Tech Head · Cuddll

2018 – 2019

Took a monolith to microservices, cut cloud spend ~50%, and ran a push backend delivering ~1M notifications a day.

Earlier · Tech Lead, Lead Backend, Backend Developer

2014 – 2018

REST APIs, media delivery, and backends for early-stage consumer products.

Outside of implementation, I enjoy design discussions, architecture reviews, and understanding how systems actually behave in production.

A lot of the writing on this site comes from that curiosity: notes from building systems, operating them, redesigning them, and occasionally discovering that something I was certain about yesterday was wrong today. That’s usually where the interesting lessons are.

“Scale and reliability are much closer than they appear. Both require good state management, clear ownership, and well-defined boundaries.”