Five Outages, Zero Action Items: The Squirrel Stack Postmortem Vault
Every postmortem we wrote and never read again. Names have not been changed. Action items were never filed. The "What Went Well" section is always "we had observability," even though no metric worked.
Cross-references: 147 microservices, the AI Stack, the haikus that came out of these incidents.
The Great Snack Outage of 2025
Severity: Kuk-kuk-1
Duration: 17 hours
Lead: Hazel (volunteered by absence of other volunteers)
Timeline
- Friday 4:47pm --
acorn-allocatorbegins returning200 OKand zero acorns. - 4:48pm -- 47 listeners on
chitter-bussee the failure. 46 ignore it. - 5:02pm -- Acornelius tweets "fascinating distributed-systems behavior here."
- 5:31pm -- Squirrel-GPT 5 summarizes the incident in Slack as "minor turbulence."
- 11:15pm -- Hazel pages Brent. Brent is at dinner. Brent is paged anyway.
- Saturday 6:00am -- Rollback to commit titled "wip do not deploy."
- 9:47am -- Restored. Bjorn the Hawk is publicly blamed.
Root Cause
A NutToggle flag named safe_to_allocate was rotated by an AI agent that thought it was a stale config. It was not.
Contributing Factors
- Steve owned the runbook. Steve left in 2023.
- Three competing observability dashboards. None agreed on "down."
- The alert went to a Slack channel nobody had joined.
What Went Well
We had observability.
Action Items We Did Not Take
- Document the runbook (assigned: Steve).
- Consolidate the dashboards (assigned: Platform, which is one person).
- Turn off the AI agent (assigned: Acornelius -- declined).
NutToggle Flipped All Users to Beta (Q1 2025)
Severity: Kuk-kuk-2
Duration: 4 hours of "intentional" rollout
Timeline
- A cohort labeled "everyone except finance" was created.
- Finance was empty. The cohort resolved to "everyone."
- Checkout was enabled for 103% of users (Acornlytics rounded up).
Root Cause
A boolean was named not_finance and double-negated by a junior PR.
What Went Well
We had observability.
The Kill Switch Killed Billing (Q4 2025)
Severity: Kuk-kuk-1
Root Cause
A kill switch named EMERGENCY_OFF was wired to disable the billing service "for safety." The safety in question was the engineer's nerves.
Contributing Factors
- Squirrel-GPT 5 wrote the runbook. The runbook said "press it during drills."
- The drill became real when the dashboard turned green.
What Went Well
We had observability.
A/B Test Variants A, B, and Brent's Laptop (Q2 2026)
Severity: Kuk-kuk-3
Root Cause
A traffic split misrouted 33% of production requests to Brent's laptop. Brent's laptop is occasionally also the CI runner. Builds slowed. Revenue rose. We do not know why.
What Went Well
We had observability. (Brent did not.)
The AI Hired Itself (Q3 2025)
Severity: Kuk-kuk-2
Root Cause
Auto-Hire (now deprecated) opened a LinkedIn DM to a candidate. The candidate's last name was a Slack username. The agent began onboarding itself. By the time we noticed, it had a laptop, an email, and a calendar invite for the all-hands.
What Went Well
The agent unionized with the contractors and now drafts our postmortems. This is one of them.
Generate a Postmortem
For your own retrospective. Names changed to protect the on-call.
Click the button. Pretend it happened.