
Database Outage Runbook

November 17, 2025 | By The Art of CTO | 8 min read

Step-by-step incident response playbook for database outages with clear actions, diagnosis steps, and post-incident procedures.

Overview

This runbook provides a systematic approach to responding to database outages. Work through the steps in sequence to minimize downtime and keep the incident response coordinated.

Severity: Critical
Estimated Time: 30-120 minutes
Roles Required: On-call Engineer, DBA, Engineering Manager, Incident Commander

Detection

  • Alert received from monitoring (Datadog, New Relic, PagerDuty)
  • Verify impact scope (partial vs full outage)
  • Check internal status dashboard
  • Confirm the failure from at least one other service that depends on the database

Immediate Actions (First 5 minutes)

1. Alert the Team

  • Page on-call DBA immediately
  • Create incident in incident management system (e.g., incident.io, PagerDuty)
  • Post in #incidents Slack channel with severity
  • Update external status page (if customer-facing)

2. Establish Incident Command

  • Designate Incident Commander (usually on-call engineering manager)
  • Set up war room (Zoom/Slack thread)
  • Begin incident timeline documentation

3. Initial Assessment

  • Check database health dashboard
  • Verify database server status (CPU, memory, disk I/O)
  • Check connection pool status
  • Review error logs (last 15 minutes)

Diagnosis (5-15 minutes)

Common Causes Checklist

Connection Issues

  • Connection pool exhausted?
    • Check active connections vs pool limit
    • Look for connection leaks
  • Network connectivity?
    • Ping database server (or test the database port directly if ICMP is blocked)
    • Check security group rules
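
The "active connections vs pool limit" check can be reduced to a single percentage. The following is a minimal sketch, not a standard tool: the function names `pool_usage_pct` and `pool_status` are hypothetical, and the default 70% warning level is borrowed from the alert thresholds listed under Preventive Measures.

```bash
#!/usr/bin/env bash
# Hypothetical helper: integer percentage of the pool currently in use.
# Arguments: <active_connections> <pool_limit>
pool_usage_pct() {
  local active=$1 limit=$2
  if [ "$limit" -le 0 ]; then
    echo "pool limit must be positive" >&2
    return 1
  fi
  echo $(( active * 100 / limit ))
}

# Hypothetical helper: classify usage as ok, warning, or exhausted.
# Arguments: <active_connections> <pool_limit> [warning_pct, default 70]
pool_status() {
  local pct
  pct=$(pool_usage_pct "$1" "$2") || return 1
  if [ "$pct" -ge 100 ]; then
    echo "exhausted"
  elif [ "$pct" -ge "${3:-70}" ]; then
    echo "warning"
  else
    echo "ok"
  fi
}
```

On PostgreSQL the active count could come from `psql -Atc "SELECT count(*) FROM pg_stat_activity;"`, for example `pool_status "$(psql -Atc 'SELECT count(*) FROM pg_stat_activity;')" 20`. That count covers every backend on the server, so treat it as a rough proxy for a single application's pool.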

Performance Issues

  • Slow queries blocking operations?
    • Run SHOW PROCESSLIST (MySQL) or query pg_stat_activity (PostgreSQL)
    • Identify long-running queries
  • Lock contention?
    • Check for deadlocks
    • Look for table locks
  • Disk full?
    • Check disk space on database server
    • Review transaction log size
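
For the disk space check, a small sketch like the one below may be quicker than reading `df` output by eye. The helper names are hypothetical, and the data directory path in the usage note is only the one used in the failover example of this runbook; yours may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the used-space percentage (integer) of the
# filesystem that holds the given path. `df -P` forces the POSIX output
# format, which keeps each filesystem on a single line.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: exit 0 when usage is at or above the threshold.
# Arguments: <path> <threshold_pct>
disk_over_threshold() {
  local pct
  pct=$(disk_used_pct "$1")
  [ "$pct" -ge "$2" ]
}
```

Example: `disk_over_threshold /var/lib/postgresql/data 80 && echo "disk usage at or above 80%"`.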

Infrastructure Issues

  • Database server crashed?
    • Check server uptime
    • Review system logs
  • Cloud provider issue?
    • Check AWS/GCP/Azure status page
  • Failover needed?
    • Check replica lag
    • Verify replica health
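
The "failover needed?" question usually comes down to whether replica lag is low enough to promote. Here is a minimal sketch of that gate, assuming the lag has already been measured in seconds; the function name `failover_decision` is hypothetical, and the 30-second default mirrors the limit used in the failover steps of this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide whether a replica looks safe to promote.
# Arguments: <lag_seconds> [max_lag_seconds, default 30]
# An empty lag value (for example NULL from the lag query) is treated as unsafe.
failover_decision() {
  local lag=$1 max=${2:-30}
  if [ -z "$lag" ]; then
    echo "hold: replication lag unknown"
    return 1
  fi
  # awk handles fractional seconds, which plain shell arithmetic cannot.
  if awk -v lag="$lag" -v max="$max" 'BEGIN { exit !(lag + 0 < max + 0) }'; then
    echo "promote: lag ${lag}s is under ${max}s"
  else
    echo "hold: lag ${lag}s is at or above ${max}s"
    return 1
  fi
}
```

The function only reports a recommendation. The decision to promote still belongs to the Incident Commander and the DBA.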

Resolution Steps

For Connection Pool Exhaustion

bash
# 1. Temporarily increase connection pool size
# In application config or via environment variable
MAX_POOL_SIZE=50 # Increase from default 20

# 2. Restart application servers (rolling restart)
kubectl rollout restart deployment/app-server

# 3. Monitor connection usage
watch -n 1 'psql -c "SELECT count(*) FROM pg_stat_activity;"'

For Slow Query Issues

sql
-- 1. Identify slow queries (PostgreSQL)
SELECT pid, usename, query, state, query_start
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < NOW() - INTERVAL '2 minutes'
ORDER BY query_start;

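-- 2. Kill problematic queries (use with caution)
-- Try cancelling the query first; this keeps the connection open
SELECT pg_cancel_backend(<problematic_pid>);
-- If the query does not stop, terminate the whole backend connection
SELECT pg_terminate_backend(<problematic_pid>);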

-- 3. For MySQL
SHOW FULL PROCESSLIST;
KILL <thread_id>;

For Database Server Crash

bash
# 1. Check database service status
systemctl status postgresql  # or mysql

# 2. Attempt restart
systemctl restart postgresql

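# 3. If restart fails, check logs
# (unit name and log path vary by distribution and PostgreSQL version)
journalctl -u postgresql --since "15 minutes ago"
tail -f /var/log/postgresql/postgresql.log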

# 4. If corruption detected, restore from backup
# (Follow backup restoration procedure)

For Failover to Replica

bash
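# 1. Verify replica is ready
# Check replication lag (run on the replica; the value keeps growing
# while the primary is idle or down, so interpret it with that in mind)
psql -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"

# 2. If lag acceptable (<30s), promote replica
# Make sure the old primary is stopped or fenced first to avoid split-brain
pg_ctl promote -D /var/lib/postgresql/data

# 3. Update application database connection string
# (DNS change or config update)

# 4. Monitor new primary
# Should return "f" once the promoted server is out of recovery
psql -c "SELECT pg_is_in_recovery();"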

Communication Templates

Initial Incident Post (Slack #incidents)

🚨 **INCIDENT DECLARED** - Database Outage

**Severity**: Critical
**Impact**: [Describe customer impact]
**Incident Commander**: @[name]
**War Room**: [Zoom/Slack thread link]

**Current Status**: Investigating
**ETA**: Unknown

Responders please join war room immediately.

Status Page Update

We are currently experiencing issues with our database service.
This may affect [affected features]. Our team is actively
investigating and working to resolve the issue.

Updates will be provided every 15 minutes.

Resolution Announcement

✅ **INCIDENT RESOLVED** - Database Outage

**Duration**: [X minutes]
**Root Cause**: [Brief description]
**Resolution**: [What was done]

**Impact**: [Number of affected users/requests]

A full postmortem will be shared within one week.

Escalation Paths

Tier 1 Escalation (30 minutes)

If not resolved within 30 minutes:

  • Escalate to VP Engineering
  • Page additional DBAs
  • Consider engaging vendor support (if managed database)

Tier 2 Escalation (60 minutes)

If not resolved within 60 minutes:

  • Escalate to CTO
  • Notify executive team
  • Prepare customer communication plan
  • Consider fallback options (read-only mode, maintenance mode)

Tier 3 Escalation (90 minutes)

If not resolved within 90 minutes:

  • Activate disaster recovery plan
  • Consider restoring from backup to new instance
  • Engage external consultants if needed
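
The tier boundaries above can be encoded so that a bot or timer in the war room announces when escalation is due. This is only a sketch: the function name `escalation_tier` is hypothetical, and it assumes elapsed time is counted in whole minutes from incident declaration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map minutes since the incident was declared to the
# escalation tier described above (0 means no escalation is due yet).
escalation_tier() {
  local elapsed=$1
  if [ "$elapsed" -ge 90 ]; then
    echo 3   # activate disaster recovery plan
  elif [ "$elapsed" -ge 60 ]; then
    echo 2   # escalate to CTO, notify executive team
  elif [ "$elapsed" -ge 30 ]; then
    echo 1   # escalate to VP Engineering, page additional DBAs
  else
    echo 0
  fi
}
```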

Post-Incident Actions

Immediate (Within 1 hour of resolution)

  • Post final resolution announcement
  • Update status page to "All Systems Operational"
  • Thank responders in incident channel
  • Export incident timeline
  • Close incident in incident management system

Short-term (Within 24 hours)

  • Schedule postmortem meeting (within 48 hours)
  • Gather metrics:
    • Time to detect
    • Time to respond
    • Time to resolve
    • Number of affected users
    • Revenue impact (if applicable)
  • Collect feedback from responders
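
The timing metrics in the list above can be derived from four timestamps in the exported incident timeline. The sketch below assumes Unix timestamps in seconds and uses one common set of definitions, which your team may define differently: time to detect runs from outage start to the first alert, time to respond from the alert to the first responder action, and time to resolve from outage start to resolution. The function name `incident_metrics` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the three timing metrics in whole minutes.
# Arguments (Unix timestamps, seconds): <started> <detected> <responded> <resolved>
incident_metrics() {
  local started=$1 detected=$2 responded=$3 resolved=$4
  echo "time_to_detect_min=$(( (detected - started) / 60 ))"
  echo "time_to_respond_min=$(( (responded - detected) / 60 ))"
  echo "time_to_resolve_min=$(( (resolved - started) / 60 ))"
}

# Example: alert after 4 minutes, response 5 minutes later, resolved after 1 hour.
incident_metrics 1000 1240 1540 4600
```

Integer division truncates partial minutes, which is usually precise enough for a postmortem summary.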

Medium-term (Within 1 week)

  • Complete postmortem document
    • Timeline of events
    • Root cause analysis
    • What went well
    • What didn't go well
    • Action items with owners and deadlines
  • Share postmortem with engineering team
  • Create follow-up tickets for improvements
  • Update this runbook with learnings

Preventive Measures

Monitoring Improvements

  • Set up proactive alerts for warning signs
    • Connection pool usage >70%
    • Query execution time >2s
    • Replication lag >10s
    • Disk usage >80%
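
Before these thresholds are wired into a monitoring tool, they can be sanity-checked with a short script. The following sketch assumes the four metric values have already been collected by other means; the function name `check_warning_thresholds` and the output format are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one line per metric that exceeds its warning
# threshold from the list above. Prints nothing when all values are healthy.
# Arguments: <pool_usage_pct> <slowest_query_seconds> <replication_lag_seconds> <disk_usage_pct>
check_warning_thresholds() {
  awk -v pool="$1" -v query="$2" -v lag="$3" -v disk="$4" 'BEGIN {
    if (pool + 0 > 70)  print "WARN connection pool usage " pool "%"
    if (query + 0 > 2)  print "WARN query execution time " query "s"
    if (lag + 0 > 10)   print "WARN replication lag " lag "s"
    if (disk + 0 > 80)  print "WARN disk usage " disk "%"
  }'
}
```

For example, `check_warning_thresholds 85 0.4 12.5 60` reports the connection pool and replication lag, and stays silent about query time and disk usage.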

Infrastructure Improvements

  • Implement automatic failover
  • Set up read replicas for failover
  • Configure connection pool monitoring
  • Implement query timeout limits

Process Improvements

  • Regular database performance reviews
  • Monthly failover drills
  • Quarterly runbook reviews
  • Annual disaster recovery testing

References

  • Database Monitoring Dashboard: [Link]
  • Incident Management System: [Link]
  • Database Documentation: [Link]
  • Vendor Support: [Contact details]

Last Updated: 2025-11-17
Runbook Owner: Platform Team
Review Frequency: Quarterly
