
Database Outage Runbook

November 17, 2025 · By The Art of CTO · 8 min read

Step-by-step incident response playbook for database outages with clear actions, diagnosis steps, and post-incident procedures.

Overview

This runbook provides a systematic approach to responding to database outages. Follow these steps in order to minimize downtime and ensure proper incident management.

Severity: Critical
Estimated Time: 30-120 minutes
Roles Required: On-call Engineer, DBA, Engineering Manager, Incident Commander

Detection

  • Alert received from monitoring (Datadog, New Relic, PagerDuty)
  • Verify impact scope (partial vs full outage)
  • Check internal status dashboard
  • Confirm the outage from at least one other source (another service, or the direct probe sketched below)
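
A quick way to confirm scope independently of the alerting pipeline is to probe the database and an application health endpoint directly. A minimal sketch, assuming a PostgreSQL primary; the host name db-primary.internal and the health URL are placeholders for your environment:

bash
# Probe the database port directly (hypothetical host/port)
pg_isready -h db-primary.internal -p 5432

# Cross-check with an application health endpoint (hypothetical URL)
curl -sf https://app.example.com/healthz || echo "app health check failed"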

Immediate Actions (First 5 minutes)

1. Alert the Team

  • Page on-call DBA immediately
  • Create incident in incident management system (e.g., incident.io, PagerDuty)
  • Post in #incidents Slack channel with severity
  • Update external status page (if customer-facing)

2. Establish Incident Command

  • Designate Incident Commander (usually on-call engineering manager)
  • Set up war room (Zoom/Slack thread)
  • Begin incident timeline documentation

3. Initial Assessment

  • Check database health dashboard
  • Verify database server status (CPU, memory, disk I/O)
  • Check connection pool status
  • Review error logs (last 15 minutes); a scripted triage sketch follows this list
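
These first checks can be run as a single triage pass. A minimal sketch, assuming shell access to the database host, a local psql connection, and PostgreSQL; the log path is a placeholder that varies by distribution and version:

bash
# Host-level health: load, memory, disk
uptime
free -m
df -h

# Database reachability and current connection count
pg_isready
psql -c "SELECT count(*) AS connections FROM pg_stat_activity;"

# Recent errors (log path is a placeholder; adjust for your install)
tail -n 200 /var/log/postgresql/postgresql.log | grep -iE "error|fatal"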

Diagnosis (5-15 minutes)

Common Causes Checklist

Connection Issues

  • Connection pool exhausted?
    • Check active connections vs pool limit (see the query sketch after this checklist)
    • Look for connection leaks
  • Network connectivity?
    • Ping database server
    • Check security group rules
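
For the connection questions above, a minimal PostgreSQL sketch: compare current connections to the server-side limit, and look for sessions stuck in idle in transaction, which usually points to a connection leak in the application:

bash
# Current connections vs the server-side limit
psql -c "SELECT count(*) AS current, current_setting('max_connections') AS max_connections FROM pg_stat_activity;"

# Breakdown by state; many 'idle in transaction' sessions suggest a leak
psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;"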

Performance Issues

  • Slow queries blocking operations?
    • Run SHOW PROCESSLIST (MySQL) or query pg_stat_activity (PostgreSQL)
    • Identify long-running queries
  • Lock contention?
    • Check for deadlocks and blocking sessions (see the sketch after this checklist)
    • Look for table locks
  • Disk full?
    • Check disk space on database server
    • Review transaction log size
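
For lock contention, PostgreSQL can report which session is blocking which. A minimal sketch using pg_blocking_pids() (PostgreSQL 9.6+):

bash
# List blocked sessions together with the pids blocking them
psql <<'SQL'
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       wait_event_type,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
SQL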

Infrastructure Issues

  • Database server crashed?
    • Check server uptime
    • Review system logs
  • Cloud provider issue?
    • Check AWS/GCP/Azure status page
  • Failover needed?
    • Check replica lag
    • Verify replica health (see the replication check sketched after this list)
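
A minimal sketch for the replica checks, assuming PostgreSQL streaming replication (column names are for PostgreSQL 10+):

bash
# On the primary: connected replicas and their streaming state
psql -c "SELECT client_addr, state, sync_state, replay_lsn FROM pg_stat_replication;"

# On the replica: confirm it is actually in recovery mode
psql -c "SELECT pg_is_in_recovery();"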

Resolution Steps

For Connection Pool Exhaustion

bash
# 1. Temporarily increase connection pool size
# In application config or via environment variable
MAX_POOL_SIZE=50 # Increase from default 20

# 2. Restart application servers (rolling restart)
kubectl rollout restart deployment/app-server

# 3. Monitor connection usage
watch -n 1 'psql -c "SELECT count(*) FROM pg_stat_activity;"'

For Slow Query Issues

sql
-- 1. Identify slow queries (PostgreSQL)
SELECT pid, usename, query, state, query_start
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < NOW() - INTERVAL '2 minutes'
ORDER BY query_start;

-- 2. Kill problematic queries (use with caution)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid = <problematic_pid>;

-- 3. For MySQL
SHOW FULL PROCESSLIST;
KILL <thread_id>;
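
Before terminating a PostgreSQL backend outright, it is often worth trying to cancel only the running query, which keeps the client connection intact. A minimal sketch (the pid placeholder is the one identified above):

bash
# Cancel the current query only; escalate to pg_terminate_backend if this has no effect
psql -c "SELECT pg_cancel_backend(<problematic_pid>);"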

For Database Server Crash

bash
# 1. Check database service status
systemctl status postgresql  # or mysql

# 2. Attempt restart
systemctl restart postgresql

# 3. If restart fails, check logs
tail -f /var/log/postgresql/postgresql.log

# 4. If corruption detected, restore from backup
# (Follow backup restoration procedure)
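
If the service will not stay up after a restart, the system journal and kernel log usually show why (a common culprit is the OOM killer). A minimal sketch, assuming a systemd host and a unit named postgresql; the unit name may differ on your distribution:

bash
# Recent service-level messages for the database unit
journalctl -u postgresql --since "1 hour ago" --no-pager | tail -n 100

# Kernel log evidence of out-of-memory kills
dmesg -T | grep -i "out of memory"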

For Failover to Replica

bash
# 1. Verify replica is ready
# Check replication lag on the replica
psql -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"

# 2. If lag acceptable (<30s), promote replica
pg_ctl promote -D /var/lib/postgresql/data

# 3. Update application database connection string
# (DNS change or config update)

# 4. Monitor new primary
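
Before repointing application traffic, confirm the promoted node has actually left recovery and accepts writes. A minimal sketch; the smoke-test table name is a hypothetical placeholder:

bash
# Should return 'f' (false) once promotion has completed
psql -c "SELECT pg_is_in_recovery();"

# Optional write smoke test against a hypothetical throwaway table
psql -c "CREATE TABLE IF NOT EXISTS failover_smoke_test (checked_at timestamptz DEFAULT now());"
psql -c "INSERT INTO failover_smoke_test DEFAULT VALUES;"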

Communication Templates

Initial Incident Post (Slack #incidents)

🚨 **INCIDENT DECLARED** - Database Outage

**Severity**: Critical
**Impact**: [Describe customer impact]
**Incident Commander**: @[name]
**War Room**: [Zoom/Slack thread link]

**Current Status**: Investigating
**ETA**: Unknown

Responders please join war room immediately.

Status Page Update

We are currently experiencing issues with our database service.
This may affect [affected features]. Our team is actively
investigating and working to resolve the issue.

Updates will be provided every 15 minutes.

Resolution Announcement

✅ **INCIDENT RESOLVED** - Database Outage

**Duration**: [X minutes]
**Root Cause**: [Brief description]
**Resolution**: [What was done]

**Impact**: [Number of affected users/requests]

Full postmortem will be shared within 48 hours.

Escalation Paths

Tier 1 Escalation (30 minutes)

If not resolved within 30 minutes:

  • Escalate to VP Engineering
  • Page additional DBAs
  • Consider engaging vendor support (if managed database)

Tier 2 Escalation (60 minutes)

If not resolved within 60 minutes:

  • Escalate to CTO
  • Notify executive team
  • Prepare customer communication plan
  • Consider fallback options (read-only mode, maintenance mode)

Tier 3 Escalation (90 minutes)

If not resolved within 90 minutes:

  • Activate disaster recovery plan
  • Consider restoring from backup to new instance
  • Engage external consultants if needed

Post-Incident Actions

Immediate (Within 1 hour of resolution)

  • Post final resolution announcement
  • Update status page to "All Systems Operational"
  • Thank responders in incident channel
  • Export incident timeline
  • Close incident in incident management system

Short-term (Within 24 hours)

  • Schedule postmortem meeting (within 48 hours)
  • Gather metrics:
    • Time to detect
    • Time to respond
    • Time to resolve
    • Number of affected users
    • Revenue impact (if applicable)
  • Collect feedback from responders

Medium-term (Within 1 week)

  • Complete postmortem document
    • Timeline of events
    • Root cause analysis
    • What went well
    • What didn't go well
    • Action items with owners and deadlines
  • Share postmortem with engineering team
  • Create follow-up tickets for improvements
  • Update this runbook with learnings

Preventive Measures

Monitoring Improvements

  • Set up proactive alerts for warning signs (example check queries follow this list)
    • Connection pool usage >70%
    • Query execution time >2s
    • Replication lag >10s
    • Disk usage >80%
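
Each threshold maps to a check a monitoring agent can run directly. A minimal sketch of the underlying queries, using server-side connection counts as a proxy for pool usage (an application-exported pool metric is more precise if available); the mount point is a placeholder:

bash
# Connection usage as a percentage of max_connections
psql -t -c "SELECT round(100.0 * count(*) / current_setting('max_connections')::int, 1) FROM pg_stat_activity;"

# Replication lag in seconds (run on the replica)
psql -t -c "SELECT COALESCE(EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())), 0);"

# Disk usage on the database volume (mount point is a placeholder)
df --output=pcent /var/lib/postgresql | tail -n 1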

Infrastructure Improvements

  • Implement automatic failover
  • Set up read replicas for failover
  • Configure connection pool monitoring
  • Implement query timeout limits (see the statement_timeout sketch below)
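
Query timeouts can be enforced server-side with statement_timeout so a single runaway query cannot hold resources indefinitely. A minimal sketch, assuming a hypothetical application database appdb and role app_user; the 30-second value is an example, not a recommendation:

bash
# Default statement timeout for the application's database and role
psql -c "ALTER DATABASE appdb SET statement_timeout = '30s';"
psql -c "ALTER ROLE app_user SET statement_timeout = '30s';"

# New connections pick the setting up automatically; verify with:
psql -d appdb -c "SHOW statement_timeout;"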

Process Improvements

  • Regular database performance reviews
  • Monthly failover drills
  • Quarterly runbook reviews
  • Annual disaster recovery testing

References

  • Database Monitoring Dashboard: [Link]
  • Incident Management System: [Link]
  • Database Documentation: [Link]
  • Vendor Support: [Contact details]

Last Updated: 2025-11-17
Runbook Owner: Platform Team
Review Frequency: Quarterly
