Database Outage Runbook

Overview

This runbook provides a systematic approach to responding to database outages. Work through the steps in sequence to minimize downtime and keep the incident properly managed.

Severity: Critical
Estimated Time: 30-120 minutes
Roles Required: On-call Engineer, DBA, Engineering Manager, Incident Commander

Detection

Alert received from monitoring or paging (Datadog, New Relic, PagerDuty)
Verify impact scope (partial vs full outage)
Check internal status dashboard
Confirm the failure from at least one other service that depends on the database, to rule out a single-client problem

Immediate Actions (First 5 minutes)

1. Alert the Team

Page on-call DBA immediately
Create incident in incident management system (e.g., incident.io, PagerDuty)
Post in #incidents Slack channel with severity
Update external status page (if customer-facing)

2. Establish Incident Command

Designate Incident Commander (usually on-call engineering manager)
Set up war room (Zoom/Slack thread)
Begin incident timeline documentation

3. Initial Assessment

Check database health dashboard
Verify database server status (CPU, memory, disk I/O)
Check connection pool status
Review error logs (last 15 minutes)
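The log review in step 3 can be partly scripted. The sketch below is one possible approach, assuming PostgreSQL-style log lines in which a severe entry carries the tag ERROR:, FATAL: or PANIC:. The function name count_severe_lines is hypothetical and not part of any existing tooling.

```shell
#!/usr/bin/env bash
# Count log lines on stdin that carry a severe PostgreSQL severity tag.
# Assumption: severities appear as "ERROR:", "FATAL:" or "PANIC:" in the line.
count_severe_lines() {
  # grep -c prints 0 and exits non-zero when nothing matches, so mask the status.
  grep -cE '(ERROR|FATAL|PANIC):' || true
}
```

For example, tail -n 500 /var/log/postgresql/postgresql.log | count_severe_lines gives a quick count over the most recent 500 lines, a rough stand-in for the last 15 minutes. The log path is the one used later in this runbook and may differ on your system.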

Diagnosis (5-15 minutes)

Common Causes Checklist

Connection Issues

Connection pool exhausted?
- Check active connections vs pool limit
- Look for connection leaks
Network connectivity?
- Ping database server
- Check security group rules
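To make "active connections vs pool limit" concrete, the following is a minimal sketch that turns the two numbers into a status. The function names pool_usage_pct and pool_status are hypothetical, and the 70% warning level is borrowed from the alert thresholds listed under Preventive Measures.

```shell
#!/usr/bin/env bash
# Print pool usage as a whole-number percentage (rounded down).
# Arguments: active connection count, pool limit.
pool_usage_pct() {
  local active=$1 limit=$2
  if [ "$limit" -le 0 ]; then
    echo "pool limit must be positive" >&2
    return 1
  fi
  echo $(( active * 100 / limit ))
}

# Classify usage: "exhausted" at 100% or more, "warning" above 70%, else "ok".
pool_status() {
  local pct
  pct=$(pool_usage_pct "$1" "$2") || return 1
  if [ "$pct" -ge 100 ]; then
    echo exhausted
  elif [ "$pct" -gt 70 ]; then
    echo warning
  else
    echo ok
  fi
}
```

The active count could come from psql -At -c "SELECT count(*) FROM pg_stat_activity;", bearing in mind that this counts every session on the server, not only those opened by one application's pool.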

Performance Issues

Slow queries blocking operations?
- Run SHOW PROCESSLIST (MySQL) or query the pg_stat_activity view (PostgreSQL)
- Identify long-running queries
Lock contention?
- Check for deadlocks
- Look for table locks
Disk full?
- Check disk space on database server
- Review transaction log size
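The disk space check can be sketched as a small helper built on df. This is an illustration under assumptions: the function names are hypothetical, the default threshold of 80 mirrors the disk alert level listed under Preventive Measures, and the parsing assumes the filesystem name contains no spaces.

```shell
#!/usr/bin/env bash
# Print the percentage of used space on the filesystem holding the given path.
# Uses POSIX "df -P" so each filesystem is reported on a single line.
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage is above the threshold (default 80).
# Returns 2 when the usage could not be determined.
disk_above_threshold() {
  local pct
  pct=$(disk_usage_pct "$1")
  [ -n "$pct" ] || return 2
  [ "$pct" -gt "${2:-80}" ]
}
```

On the database server this might be called as disk_above_threshold /var/lib/postgresql/data && echo "disk nearly full", using whichever path holds your data directory.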

Infrastructure Issues

Database server crashed?
- Check server uptime
- Review system logs
Cloud provider issue?
- Check AWS/GCP/Azure status page
Failover needed?
- Check replica lag
- Verify replica health

Resolution Steps

For Connection Pool Exhaustion

bash

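# 1. Temporarily increase connection pool size
# Set this in the application config or deployment environment; a variable
# set only in your own shell does not reach the running application.
# Keep (pool size x app replicas) below the database's max_connections.
MAX_POOL_SIZE=50 # Increase from default 20

# 2. Restart application servers (rolling restart) and wait for it to finish
kubectl rollout restart deployment/app-server
kubectl rollout status deployment/app-server

# 3. Monitor connection usage (the count includes this monitoring session)
watch -n 1 'psql -c "SELECT count(*) FROM pg_stat_activity;"'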
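Before raising the pool size, it may help to check that the new total still fits within what the database accepts. The sketch below assumes PostgreSQL, where max_connections caps all sessions and a few slots are held back for superusers (superuser_reserved_connections, 3 by default). The function name pool_size_fits is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed when every application replica can fill its pool without exceeding
# the connections the database accepts from regular clients.
# Arguments: max_connections, app replica count, pool size per replica,
#            reserved connections (default 3).
pool_size_fits() {
  local max_conn=$1 replicas=$2 pool_size=$3 reserved=${4:-3}
  [ $(( replicas * pool_size )) -le $(( max_conn - reserved )) ]
}
```

With max_connections at 100 and four application replicas, pool_size_fits 100 4 20 succeeds (80 connections) while pool_size_fits 100 4 50 fails (200 connections), which would move the exhaustion from the pool to the database itself. The current limit can be read with psql -At -c "SHOW max_connections;".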

For Slow Query Issues

sql

-- 1. Identify slow queries (PostgreSQL)
SELECT pid, usename, query, state, query_start
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < NOW() - INTERVAL '2 minutes'
ORDER BY query_start;

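-- 2. Stop problematic queries (use with caution)
-- Try cancelling the query first; this keeps the session open
SELECT pg_cancel_backend(<problematic_pid>);
-- If the query does not stop, terminate the whole backend session
SELECT pg_terminate_backend(<problematic_pid>);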

-- 3. For MySQL
SHOW FULL PROCESSLIST;
KILL <thread_id>;
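When many sessions are active, filtering the list from the shell can make the worst offenders easier to spot. This is a sketch under assumptions: the function name long_running is hypothetical, input lines have the form pid|seconds_running|query, and each query fits on one line.

```shell
#!/usr/bin/env bash
# Read "pid|seconds_running|query" lines on stdin and print those running
# longer than the given threshold in seconds (default 120), longest first.
long_running() {
  local threshold=${1:-120}
  awk -F'|' -v t="$threshold" '$2 + 0 > t + 0' | sort -t'|' -k2,2nr
}
```

Suitable input could be produced with psql -At -F'|' -c "SELECT pid, EXTRACT(EPOCH FROM (now() - query_start)), query FROM pg_stat_activity WHERE state = 'active';" and piped into long_running 120, which matches the two-minute cutoff used in the SQL above.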

For Database Server Crash

bash

# 1. Check database service status
systemctl status postgresql  # or mysql

# 2. Attempt restart
systemctl restart postgresql

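# 3. If restart fails, check logs
# (the log file path varies by distribution and PostgreSQL version)
journalctl -u postgresql --since "15 minutes ago"
tail -f /var/log/postgresql/postgresql.log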

# 4. If corruption detected, restore from backup
# (Follow backup restoration procedure)

For Failover to Replica

bash

# 1. Verify replica is ready
# Check replication lag (run on the replica)
psql -At -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"

# 2. If lag acceptable (<30s), promote replica
pg_ctl promote -D /var/lib/postgresql/data

# 3. Update application database connection string
# (DNS change or config update)

# 4. Monitor new primary
# pg_is_in_recovery() returns f once promotion has completed
psql -At -c "SELECT pg_is_in_recovery();"
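The "lag acceptable (<30s)" decision in step 2 can be written down as a guard so it is applied the same way under pressure. A minimal sketch follows, with a hypothetical function name promotion_decision. It treats an empty or non-numeric lag as a reason to hold, since PostgreSQL returns NULL when the replica has not replayed any transaction, and a negative value usually points to clock skew.

```shell
#!/usr/bin/env bash
# Decide whether a replica is safe to promote based on replay lag in seconds.
# Prints "promote" when lag is a non-negative number below the limit
# (default 30), otherwise "hold".
promotion_decision() {
  local lag=$1 limit=${2:-30}
  case $lag in
    ''|.|*[!0-9.]*) echo hold; return 0 ;;
  esac
  awk -v lag="$lag" -v limit="$limit" \
    'BEGIN { if (lag + 0 < limit + 0) print "promote"; else print "hold" }'
}
```

The lag value is the lag_seconds figure from step 1. Note that it measures time since the last replayed transaction, so it keeps growing once the primary stops sending changes; a "hold" result in that situation calls for human judgment rather than waiting.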

Communication Templates

Initial Incident Post (Slack #incidents)

🚨 **INCIDENT DECLARED** - Database Outage

**Severity**: Critical
**Impact**: [Describe customer impact]
**Incident Commander**: @[name]
**War Room**: [Zoom/Slack thread link]

**Current Status**: Investigating
**ETA**: Unknown

Responders please join war room immediately.

Status Page Update

We are currently experiencing issues with our database service.
This may affect [affected features]. Our team is actively
investigating and working to resolve the issue.

Updates will be provided every 15 minutes.

Resolution Announcement

✅ **INCIDENT RESOLVED** - Database Outage

**Duration**: [X minutes]
**Root Cause**: [Brief description]
**Resolution**: [What was done]

**Impact**: [Number of affected users/requests]

A postmortem meeting will be held within 48 hours, and the written postmortem will be shared within one week.

Escalation Paths

Tier 1 Escalation (30 minutes)

If not resolved within 30 minutes:

Escalate to VP Engineering
Page additional DBAs
Consider engaging vendor support (if managed database)

Tier 2 Escalation (60 minutes)

If not resolved within 60 minutes:

Escalate to CTO
Notify executive team
Prepare customer communication plan
Consider fallback options (read-only mode, maintenance mode)

Tier 3 Escalation (90 minutes)

If not resolved within 90 minutes:

Activate disaster recovery plan
Consider restoring from backup to new instance
Engage external consultants if needed
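The three tiers above map elapsed time to an escalation level, which can be sketched as a lookup. The function name escalation_tier is hypothetical, and measuring from the moment the incident was declared is an assumption; the runbook does not say which clock to use.

```shell
#!/usr/bin/env bash
# Print the escalation tier (0-3) for the whole minutes elapsed since the
# incident was declared: tier 1 at 30 minutes, tier 2 at 60, tier 3 at 90.
escalation_tier() {
  local minutes=$1
  if [ "$minutes" -ge 90 ]; then
    echo 3
  elif [ "$minutes" -ge 60 ]; then
    echo 2
  elif [ "$minutes" -ge 30 ]; then
    echo 1
  else
    echo 0
  fi
}
```

An incident bot or the Incident Commander's checklist could call escalation_tier with the elapsed minutes and prompt for the matching actions when the value changes.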

Post-Incident Actions

Immediate (Within 1 hour of resolution)

Post final resolution announcement
Update status page to "All Systems Operational"
Thank responders in incident channel
Export incident timeline
Close incident in incident management system

Short-term (Within 24 hours)

Schedule postmortem meeting (within 48 hours)
Gather metrics:
- Time to detect
- Time to respond
- Time to resolve
- Number of affected users
- Revenue impact (if applicable)
Collect feedback from responders
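The timing metrics can be derived from the incident timeline. The sketch below shows one way to do it; the function name incident_metrics is hypothetical, and the definitions are assumptions (detect: impact start to first alert; respond: first alert to first responder action; resolve: impact start to resolution), so adjust them to your organization's conventions.

```shell
#!/usr/bin/env bash
# Print incident timing metrics in whole minutes (rounded down) from four
# epoch timestamps in seconds: impact start, first alert, first responder
# action, resolution.
incident_metrics() {
  local started=$1 detected=$2 responded=$3 resolved=$4
  echo "time_to_detect_min=$(( (detected - started) / 60 ))"
  echo "time_to_respond_min=$(( (responded - detected) / 60 ))"
  echo "time_to_resolve_min=$(( (resolved - started) / 60 ))"
}
```

Epoch timestamps can be captured with date +%s as each event happens, or converted afterwards from the exported incident timeline.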

Medium-term (Within 1 week)

Complete postmortem document
- Timeline of events
- Root cause analysis
- What went well
- What didn't go well
- Action items with owners and deadlines
Share postmortem with engineering team
Create follow-up tickets for improvements
Update this runbook with learnings

Preventive Measures

Monitoring Improvements

Set up proactive alerts for warning signs
- Connection pool usage >70%
- Query execution time >2s
- Replication lag >10s
- Disk usage >80%
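The four warning thresholds can be expressed as a single check, which is useful for testing alert logic before wiring it into a monitoring tool. This is a sketch only: the function name check_warning_thresholds is hypothetical, and how each input value is collected is left to your monitoring stack.

```shell
#!/usr/bin/env bash
# Print one line per breached warning threshold; print nothing when all clear.
# Arguments: pool usage %, slowest query seconds, replication lag seconds,
#            disk usage %. Decimal values are accepted.
check_warning_thresholds() {
  awk -v pool="$1" -v query="$2" -v lag="$3" -v disk="$4" 'BEGIN {
    if (pool + 0 > 70)  print "connection pool usage above 70%: " pool
    if (query + 0 > 2)  print "query execution time above 2s: " query
    if (lag + 0 > 10)   print "replication lag above 10s: " lag
    if (disk + 0 > 80)  print "disk usage above 80%: " disk
  }'
}
```

For instance, check_warning_thresholds 85 0.4 12.5 80 reports the pool and replication lag breaches and stays silent about disk usage, because 80 is not above the 80% threshold.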

Infrastructure Improvements

Implement automatic failover
Set up read replicas for failover
Configure connection pool monitoring
Implement query timeout limits

Process Improvements

Regular database performance reviews
Monthly failover drills
Quarterly runbook reviews
Annual disaster recovery testing

References

Database Monitoring Dashboard: [Link]
Incident Management System: [Link]
Database Documentation: [Link]
Vendor Support: [Contact details]

Last Updated: 2025-11-17
Runbook Owner: Platform Team
Review Frequency: Quarterly
