Database Outage Runbook
Step-by-step incident response playbook for database outages with clear actions, diagnosis steps, and post-incident procedures.
Overview
This runbook provides a systematic approach to responding to database outages. Follow these steps in order to minimize downtime and ensure proper incident management.
Severity: Critical
Estimated Time: 30-120 minutes
Roles Required: On-call Engineer, DBA, Engineering Manager, Incident Commander
Detection
- Alert received from monitoring (DataDog, New Relic, PagerDuty)
- Verify impact scope (partial vs full outage)
- Check internal status dashboard
- Confirm impact against at least one other signal (application error rates or a second monitoring source)
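A direct connectivity probe from an application host is often the fastest way to confirm scope; a minimal sketch where host, port, and user are placeholders:
# Quick reachability probe from an app host (placeholders: host, port, user)
pg_isready -h db.internal.example.com -p 5432 -U app_user      # PostgreSQL
# mysqladmin -h db.internal.example.com -u app_user -p ping    # MySQL equivalent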
Immediate Actions (First 5 minutes)
1. Alert the Team
- Page on-call DBA immediately
- Create incident in incident management system (e.g., incident.io, PagerDuty)
- Post in the #incidents Slack channel with the severity
- Update external status page (if customer-facing)
2. Establish Incident Command
- Designate Incident Commander (usually on-call engineering manager)
- Set up war room (Zoom/Slack thread)
- Begin incident timeline documentation
3. Initial Assessment
- Check database health dashboard
- Verify database server status (CPU, memory, disk I/O)
- Check connection pool status
- Review error logs (last 15 minutes)
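If the health dashboard is unavailable, the same signals can be gathered directly on the database host; a rough sketch (the log path assumes a default PostgreSQL install and will differ per distribution):
# Resource snapshot on the database host
uptime            # load average
free -m           # memory
df -h             # disk usage
iostat -x 1 3     # disk I/O (requires sysstat)
# Recent errors in the tail of the log (path is an assumption)
tail -n 200 /var/log/postgresql/postgresql.log | grep -iE 'error|fatal'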
Diagnosis (5-15 minutes)
Common Causes Checklist
Connection Issues
- Connection pool exhausted?
- Check active connections vs pool limit
- Look for connection leaks
- Network connectivity?
- Ping database server
- Check security group rules
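To check for pool exhaustion from the server side, compare the current session count against the server limit; a minimal sketch for PostgreSQL (the pool limit itself lives in your application or pooler configuration):
-- Current sessions vs server connection limit (PostgreSQL)
SELECT count(*) AS total_connections,
       count(*) FILTER (WHERE state = 'active') AS active,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;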
Performance Issues
- Slow queries blocking operations?
- Run SHOW PROCESSLIST (MySQL) or query pg_stat_activity (PostgreSQL)
- Identify long-running queries
- Lock contention?
- Check for deadlocks
- Look for table locks
- Disk full?
- Check disk space on database server
- Review transaction log size
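For lock contention, this query shows which sessions are waiting and which sessions block them; a sketch for PostgreSQL 9.6 or newer:
-- Sessions waiting on locks and the PIDs blocking them (PostgreSQL)
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;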
Infrastructure Issues
- Database server crashed?
- Check server uptime
- Review system logs
- Cloud provider issue?
- Check AWS/GCP/Azure status page
- Failover needed?
- Check replica lag
- Verify replica health
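If the primary is still reachable, replication health can also be checked from its side; a sketch assuming PostgreSQL 10+ streaming replication:
-- On the primary: connected replicas and their replay lag in bytes
SELECT client_addr, state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;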
Resolution Steps
For Connection Pool Exhaustion
# 1. Temporarily increase connection pool size
# In application config or via environment variable
MAX_POOL_SIZE=50 # Increase from default 20
# 2. Restart application servers (rolling restart)
kubectl rollout restart deployment/app-server
# 3. Monitor connection usage
watch -n 1 'psql -c "SELECT count(*) FROM pg_stat_activity;"'
For Slow Query Issues
-- 1. Identify slow queries (PostgreSQL)
SELECT pid, usename, query, state, query_start
FROM pg_stat_activity
WHERE state = 'active'
AND query_start < NOW() - INTERVAL '2 minutes'
ORDER BY query_start;
-- 2. Kill problematic queries (use with caution)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid = <problematic_pid>;
-- 3. For MySQL
SHOW FULL PROCESSLIST;
KILL <thread_id>;
For Database Server Crash
# 1. Check database service status
systemctl status postgresql # or mysql
# 2. Attempt restart
systemctl restart postgresql
# 3. If restart fails, check logs
tail -f /var/log/postgresql/postgresql.log
# 4. If corruption detected, restore from backup
# (Follow backup restoration procedure)
For Failover to Replica
# 1. Verify replica is ready
# Check replication lag (run this query on the replica)
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;
# 2. If lag acceptable (<30s), promote replica
pg_ctl promote -D /var/lib/postgresql/data
# 3. Update application database connection string
# (DNS change or config update)
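# (Optional verification, a sketch: pg_is_in_recovery() should return 'f'
#  on the new primary once promotion completes; host is a placeholder)
psql -h <new_primary_host> -c "SELECT pg_is_in_recovery();"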
# 4. Monitor the new primary
Communication Templates
Initial Incident Post (Slack #incidents)
🚨 **INCIDENT DECLARED** - Database Outage
**Severity**: Critical
**Impact**: [Describe customer impact]
**Incident Commander**: @[name]
**War Room**: [Zoom/Slack thread link]
**Current Status**: Investigating
**ETA**: Unknown
Responders please join war room immediately.
Status Page Update
We are currently experiencing issues with our database service.
This may affect [affected features]. Our team is actively
investigating and working to resolve the issue.
Updates will be provided every 15 minutes.
Resolution Announcement
✅ **INCIDENT RESOLVED** - Database Outage
**Duration**: [X minutes]
**Root Cause**: [Brief description]
**Resolution**: [What was done]
**Impact**: [Number of affected users/requests]
Full postmortem will be shared within 48 hours.
Escalation Paths
Tier 1 Escalation (30 minutes)
If not resolved within 30 minutes:
- Escalate to VP Engineering
- Page additional DBAs
- Consider engaging vendor support (if managed database)
Tier 2 Escalation (60 minutes)
If not resolved within 60 minutes:
- Escalate to CTO
- Notify executive team
- Prepare customer communication plan
- Consider fallback options (read-only mode, maintenance mode)
Tier 3 Escalation (90 minutes)
If not resolved within 90 minutes:
- Activate disaster recovery plan
- Consider restoring from backup to new instance
- Engage external consultants if needed
Post-Incident Actions
Immediate (Within 1 hour of resolution)
- Post final resolution announcement
- Update status page to "All Systems Operational"
- Thank responders in incident channel
- Export incident timeline
- Close incident in incident management system
Short-term (Within 24 hours)
- Schedule postmortem meeting (within 48 hours)
- Gather metrics:
- Time to detect
- Time to respond
- Time to resolve
- Number of affected users
- Revenue impact (if applicable)
- Collect feedback from responders
Medium-term (Within 1 week)
- Complete postmortem document
- Timeline of events
- Root cause analysis
- What went well
- What didn't go well
- Action items with owners and deadlines
- Share postmortem with engineering team
- Create follow-up tickets for improvements
- Update this runbook with learnings
Preventive Measures
Monitoring Improvements
- Set up proactive alerts for warning signs
- Connection pool usage >70%
- Query execution time >2s
- Replication lag >10s
- Disk usage >80%
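The connection-pool threshold above can be computed with a simple query that most monitoring agents can run; a sketch for PostgreSQL (the 70% alert threshold itself is configured in the monitoring tool):
-- Percentage of max_connections currently in use (alert when above 70)
SELECT round(100.0 * count(*) / current_setting('max_connections')::int, 1)
       AS pct_connections_used
FROM pg_stat_activity;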
Infrastructure Improvements
- Implement automatic failover
- Set up read replicas for failover
- Configure connection pool monitoring
- Implement query timeout limits
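Query timeout limits can be enforced at the role or database level in PostgreSQL; a sketch where app_user and app_db are placeholder names and the 2-second value mirrors the alert threshold above:
-- Cap query runtime for the application role (placeholder names)
ALTER ROLE app_user SET statement_timeout = '2s';
-- Or for the whole database
ALTER DATABASE app_db SET statement_timeout = '2s';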
Process Improvements
- Regular database performance reviews
- Monthly failover drills
- Quarterly runbook reviews
- Annual disaster recovery testing
Related Runbooks
References
- Database Monitoring Dashboard: [Link]
- Incident Management System: [Link]
- Database Documentation: [Link]
- Vendor Support: [Contact details]
Last Updated: 2025-11-17
Runbook Owner: Platform Team
Review Frequency: Quarterly