May 30, 2025

Handling Downtime: When Your App Goes Down at 3 AM

3:47 AM. Phone buzzes. "Site is down."

The Incident

Database ran out of disk space. Writes failed. App crashed.

The Response

  1. SSH into the server (thank god for the mobile hotspot)
  2. Clear old logs
  3. Restart services
  4. Monitor for 30 minutes
  5. Go back to sleep (attempt #2)

What We Changed

Monitoring: Set up disk space alerts that fire while there is still room to act, not after writes start failing. Never again.
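
A disk check doesn't need to be fancy. Here's a minimal sketch of the idea, not our exact setup: the 80% threshold, the function names, and printing instead of paging are all assumptions for illustration.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of space used on the filesystem that holds path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def check_disk(path="/", warn_at=80.0, used_percent=None):
    """Return an alert message if usage is at or above warn_at, else None.

    used_percent can be passed in directly, which makes the check easy to test.
    The 80% default is an illustrative threshold, not a recommendation.
    """
    if used_percent is None:
        used_percent = disk_usage_percent(path)
    if used_percent >= warn_at:
        return f"WARNING: {path} is {used_percent:.1f}% full (threshold {warn_at:.0f}%)"
    return None


if __name__ == "__main__":
    message = check_disk("/")
    if message:
        # Hypothetical: swap print for your pager, email, or chat hook.
        print(message)
```

Run it on a schedule and route the message somewhere that will wake you up before the database does.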

Automated Cleanup: Cron job deletes logs older than 7 days.
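
The cleanup itself can be a few lines. This is a sketch under assumptions: it only looks at files ending in .log directly inside one directory, and it uses last-modified time to decide age. The function name and the pattern are made up for this example.

```python
import time
from pathlib import Path


def delete_old_logs(log_dir, max_age_days=7, now=None):
    """Delete *.log files in log_dir last modified more than max_age_days ago.

    Returns the list of deleted paths. Subdirectories are not searched.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds in a day
    deleted = []
    for path in Path(log_dir).glob("*.log"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path)
    return deleted
```

Point it at your log directory from a small script and let cron run that script once a day. Test it on a copy of the directory first, because deleted logs don't come back.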

Capacity Planning: Actually plan for growth. Track how fast disk usage grows and add space before it runs out. Obvious in hindsight.
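
Even a back-of-the-envelope projection beats nothing. A rough sketch, assuming usage grows linearly (real growth often doesn't); the function name and the numbers are illustrative, not from our incident.

```python
def days_until_full(total_gb, used_gb, growth_gb_per_day):
    """Estimate days until the disk is full, assuming linear growth.

    Returns None when usage is flat or shrinking, since there is no
    projected fill date in that case.
    """
    if growth_gb_per_day <= 0:
        return None
    free_gb = max(total_gb - used_gb, 0)
    return free_gb / growth_gb_per_day


# Illustrative numbers: 100 GB disk, 70 GB used, growing 1.5 GB per day.
print(days_until_full(100, 70, 1.5))  # 20.0
```

If the answer is a number of days you would not be comfortable with, that's your cue to add capacity now.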

The Postmortem

Always write one. What happened? Why? How do we prevent it?

Share it with customers. Transparency builds trust.

The Lesson

Production will break. Plan for it.

  • Set up monitoring
  • Have a runbook
  • Keep your laptop charged
  • Accept that 3 AM incidents happen

The Silver Lining

Customers appreciate a fast response more than perfect uptime.

We were back up in 20 minutes. They noticed. They appreciated it.