Platform — Status

Six services, probed from outside, measured over 90 days. When something breaks, the postmortem is published here.

ALL SYSTEMS OPERATIONAL90-day window — updated daily

[ 01 ]Services — uptime, 90 days

operational planned dip incident

Y0 Runtime

run execution, scheduling, traces
99.96%
2026-03-132026-06-10

Context Graph

memory, retrieval, indexing
99.96%
2026-03-132026-06-10

Trust Kernel

permissions, scopes, audit log
99.98%
2026-03-132026-06-10

Gateway

API edge, auth, rate limiting
99.96%
2026-03-132026-06-10

Analytics

usage, cost and quality reporting
99.98%
2026-03-132026-06-10

Billing

metering, invoicing, receipts
99.98%
2026-03-132026-06-10

[ 02 ]Incident history

Three incidents in the window. Each one is written up honestly — detection time included, even when it embarrasses us.

[INC-003]2026-05-28Gateway38 min✓ resolved

Elevated 5xx errors at the API edge

A deploy shipped with a connection-pool ceiling sized for staging, not production. Under normal evening traffic the pool exhausted within minutes and the edge returned 502s for roughly a third of requests. We rolled back in 9 minutes; the remaining time was queue drain. The fix was boring and real: pool limits are now part of config review, and a synthetic load check runs against every release candidate before it can promote. No data was lost and no runs were double-executed.

[INC-002]2026-04-22Context Graph1h 12m✓ resolved

Slow retrieval during index rebuild

A scheduled index rebuild took a write lock that our read path was not supposed to wait on — but a regression three weeks earlier had quietly re-coupled them. Retrieval latency climbed from ~80ms to multi-second for just over an hour. We aborted the rebuild, restored read throughput, and re-ran the rebuild against a shadow index overnight. The regression had shipped without a test; that test now exists, and rebuilds run shadow-first as standard practice.

[INC-001]2026-03-15Y0 Runtime26 min✓ resolved

Run queue stall from a bad scheduler canary

A canary build of the run scheduler deadlocked on a rare retry path and the canary slice of the queue stopped draining. Detection took longer than it should have — 11 minutes — because our alert measured throughput, not queue age. We killed the canary, the queue drained, and every stalled run completed without manual intervention. Two changes followed: queue-age alerting with a 90-second threshold, and canaries now auto-revert on stall instead of waiting for a human.

[ 03 ]Stay informed

Subscribe to updates and we'll tell you when something breaks — and what we did about it.

Subscribe to updates
[ note ]incidents only — no marketing