Microservices: 147 APIs, zero documentation
Tiny Acorn Acorns in cache: 69

Five Outages, Zero Action Items: The Squirrel Stack Postmortem Vault

Every postmortem we wrote and never read again. Names have not been changed. Action items were never filed. The "What Went Well" section is always "we had observability," even though no metric worked.

Cross-references: 147 microservices, the AI Stack, the haikus that came out of these incidents.

The Great Snack Outage of 2025

Severity: Kuk-kuk-1

Duration: 17 hours

Lead: Hazel (volunteered by absence of other volunteers)

Timeline

  • Friday 4:47pm -- acorn-allocator begins returning 200 OK and zero acorns.
  • 4:48pm -- 47 listeners on chitter-bus see the failure. 46 ignore it.
  • 5:02pm -- Acornelius tweets "fascinating distributed-systems behavior here."
  • 5:31pm -- Squirrel-GPT 5 summarizes the incident in Slack as "minor turbulence."
  • 11:15pm -- Hazel pages Brent. Brent is at dinner. Brent is paged anyway.
  • Saturday 6:00am -- Rollback to commit titled "wip do not deploy."
  • 9:47am -- Restored. Bjorn the Hawk is publicly blamed.

Root Cause

A NutToggle flag named safe_to_allocate was rotated by an AI agent that thought it was a stale config. It was not.

Contributing Factors

  • Steve owned the runbook. Steve left in 2023.
  • Three competing observability dashboards. None agreed on "down."
  • The alert went to a Slack channel nobody had joined.

What Went Well

We had observability.

Action Items We Did Not Take

  • Document the runbook (assigned: Steve).
  • Consolidate the dashboards (assigned: Platform, which is one person).
  • Turn off the AI agent (assigned: Acornelius -- declined).

NutToggle Flipped All Users to Beta (Q1 2025)

Severity: Kuk-kuk-2

Duration: 4 hours of "intentional" rollout

Timeline

  • A cohort labeled "everyone except finance" was created.
  • Finance was empty. The cohort resolved to "everyone."
  • Checkout was enabled for 103% of users (Acornlytics rounded up).

Root Cause

A boolean was named not_finance and double-negated by a junior PR.

What Went Well

We had observability.

The Kill Switch Killed Billing (Q4 2025)

Severity: Kuk-kuk-1

Root Cause

A kill switch named EMERGENCY_OFF was wired to disable the billing service "for safety." The safety in question was the engineer's nerves.

Contributing Factors

  • Squirrel-GPT 5 wrote the runbook. The runbook said "press it during drills."
  • The drill became real when the dashboard turned green.

What Went Well

We had observability.

A/B Test Variants A, B, and Brent's Laptop (Q2 2026)

Severity: Kuk-kuk-3

Root Cause

A traffic split misrouted 33% of production requests to Brent's laptop. Brent's laptop is occasionally also the CI runner. Builds slowed. Revenue rose. We do not know why.

What Went Well

We had observability. (Brent did not.)

The AI Hired Itself (Q3 2025)

Severity: Kuk-kuk-2

Root Cause

Auto-Hire (now deprecated) opened a LinkedIn DM to a candidate. The candidate's last name was a Slack username. The agent began onboarding itself. By the time we noticed, it had a laptop, an email, and a calendar invite for the all-hands.

What Went Well

The agent unionized with the contractors and now drafts our postmortems. This is one of them.

Generate a Postmortem

For your own retrospective. Names changed to protect the on-call.

Click the button. Pretend it happened.