Runbook Template
A structured template for documenting operational procedures, troubleshooting steps, and incident response.
Runbooks document operational procedures that enable anyone on-call to handle incidents and perform routine operations. Good runbooks reduce mean time to recovery (MTTR) and minimize reliance on tribal knowledge.
Why Use Runbooks?
Benefits:
- Enables consistent incident response
- Reduces dependency on specific individuals
- Decreases mean time to recovery (MTTR)
- Supports on-call engineers effectively
- Creates institutional knowledge
When to write a runbook:
- New services going to production
- After incidents (capture learnings)
- Common operational tasks
- Complex troubleshooting procedures
- Compliance requirements
The Template
# Runbook: [Service/System Name]
**Version:** [1.0]
**Last Updated:** [YYYY-MM-DD]
**Owner:** [Team Name]
**On-Call:** [Rotation/Contact Info]
## Overview
**Service Description:**
[Brief description of what this service does]
**Criticality:** [P0/P1/P2/P3]
**Dependencies:**
- Upstream: [Services this depends on]
- Downstream: [Services that depend on this]
**Documentation:**
- Architecture: [Link]
- API Docs: [Link]
- Dashboard: [Link]
## Quick Reference
### Key URLs
| Resource | URL |
|----------|-----|
| Production | [URL] |
| Staging | [URL] |
| Logs | [URL] |
| Metrics | [URL] |
| Alerts | [URL] |
### Key Commands
```bash
# Check service status
[command]
# View logs
[command]
# Restart service
[command]
```
Escalation Path
| Level | Contact | When |
|---|---|---|
| L1 | [On-call engineer] | Initial response |
| L2 | [Team lead] | >30 min unresolved |
| L3 | [Engineering manager] | P0 incidents |
Health Checks
Service Health
Endpoint: GET /health
Expected Response:
```json
{
  "status": "healthy",
  "version": "1.2.3",
  "dependencies": {
    "database": "healthy",
    "cache": "healthy"
  }
}
```
Key Metrics
| Metric | Healthy Range | Alert Threshold |
|---|---|---|
| [Metric 1] | [Range] | [Threshold] |
| [Metric 2] | [Range] | [Threshold] |
Common Alerts
Alert: [Alert Name]
Severity: [Critical/Warning/Info]
Description: [What this alert means]
Possible Causes:
- [Cause 1]
- [Cause 2]
Investigation Steps:
- [Step 1]
- [Step 2]
Resolution:
- [Resolution step 1]
- [Resolution step 2]
Escalation: [When and how to escalate]
[Repeat for each alert]
Troubleshooting
Symptom: [Description]
Diagnosis:
```bash
# Commands to diagnose
[command]
```
Possible Causes:
- [Cause 1]
- [Cause 2]
Resolution:
- [Step 1]
- [Step 2]
[Repeat for common symptoms]
Operational Procedures
Procedure: [Name]
Purpose: [Why this is done]
When to Use: [Conditions]
Prerequisites:
- [Prerequisite 1]
- [Prerequisite 2]
Steps:
- [Step 1]
  ```bash
  [command]
  ```
- [Step 2]
- [Step 3]
Verification:
- [How to verify success]
Rollback:
- [Rollback step 1]
- [Rollback step 2]
[Repeat for each procedure]
Emergency Procedures
Service Down - Complete Outage
Immediate Actions:
- Check dashboard for related alerts
- Verify infrastructure status
- Check recent deployments
- Communicate in incident channel
Diagnosis:
```bash
# Check pod status
kubectl get pods -n [namespace]
# Check recent events
kubectl get events -n [namespace] --sort-by='.lastTimestamp'
# Check logs
kubectl logs -n [namespace] [pod] --tail=100
```
Recovery Options:
- Rollback deployment (if recent deploy)
- Scale up (if capacity issue)
- Failover (if regional issue)
Database Connection Issues
[Specific procedures]
High Latency
[Specific procedures]
Maintenance Procedures
Scheduled Maintenance Window
Pre-Maintenance Checklist:
- Notify stakeholders
- Verify backup
- Prepare rollback plan
- Schedule maintenance window
During Maintenance:
- [Steps]
Post-Maintenance:
- Verify service health
- Monitor metrics
- Update documentation
Recovery Procedures
Restore from Backup
Prerequisites:
- Access to backup storage
- Database credentials
Steps:
- [Detailed steps]
Disaster Recovery
RTO: [Time]
RPO: [Time]
Steps:
- [DR procedure]
Contacts
| Role | Name | Contact |
|---|---|---|
| Team Lead | [Name] | [Email/Slack] |
| DBA | [Name] | [Email/Slack] |
| Security | [Name] | [Email/Slack] |
| Vendor Support | [Company] | [Contact] |
Appendix
Environment Variables
| Variable | Description | Example |
|---|---|---|
| [VAR] | [Description] | [Example] |
Architecture Diagram
[ASCII diagram]
Changelog
| Date | Author | Changes |
|---|---|---|
| [Date] | [Name] | [Description] |
## Complete Example
# Runbook: Payment Processing Service
**Version:** 2.3
**Last Updated:** 2025-10-15
**Owner:** Payments Team
**On-Call:** #payments-oncall (PagerDuty rotation)
## Overview
**Service Description:**
The Payment Processing Service handles all payment transactions including credit cards, digital wallets, and bank transfers. It integrates with Stripe, PayPal, and direct bank APIs.
**Criticality:** P0 (Revenue-critical)
**Dependencies:**
- Upstream: Order Service, Customer Service, Auth Service
- Downstream: Fulfillment Service, Notification Service, Analytics
**Documentation:**
- Architecture: [link to design doc]
- API Docs: https://api-docs.internal/payments
- Dashboard: https://datadog.com/dash/payments
- PCI Compliance: [link to compliance docs]
## Quick Reference
### Key URLs
| Resource | URL |
|----------|-----|
| Production API | https://api.example.com/payments |
| Staging API | https://api.staging.example.com/payments |
| Logs (Datadog) | https://datadog.com/logs?service=payment |
| Metrics Dashboard | https://datadog.com/dash/payment-svc |
| Alerts | https://pagerduty.com/services/payments |
| Stripe Dashboard | https://dashboard.stripe.com |
### Key Commands
```bash
# Check service status
kubectl get pods -n payments -l app=payment-service
# View recent logs
kubectl logs -n payments -l app=payment-service --tail=100 -f
# Restart service (rolling)
kubectl rollout restart deployment/payment-service -n payments
# Check Stripe API status
curl -s https://status.stripe.com/api/v2/status.json | jq '.status'
# Force circuit breaker reset
curl -X POST http://localhost:8080/admin/circuit-breaker/reset
```
Escalation Path
| Level | Contact | When |
|---|---|---|
| L1 | On-call engineer | Initial response |
| L2 | @payments-lead | >15 min unresolved or revenue impact |
| L3 | @engineering-manager | P0 incidents, data breach |
| External | Stripe Support (priority line) | Provider issues |
Health Checks
Service Health
Endpoint: GET /health
Expected Response:
```json
{
  "status": "healthy",
  "version": "3.2.1",
  "uptime": "15d 4h 32m",
  "dependencies": {
    "database": "healthy",
    "redis": "healthy",
    "stripe": "healthy",
    "paypal": "degraded"
  }
}
```
Unhealthy Response Actions:
- `database: unhealthy` → Check RDS, see Database section
- `redis: unhealthy` → Check ElastiCache, see Cache section
- `stripe: unhealthy` → Check Stripe status page
- `paypal: unhealthy` → Check PayPal status, may need failover
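A health response like the one in this section can be triaged mechanically. The following is a minimal sketch, assuming the payload shape shown here; the function name `unhealthy_dependencies` is hypothetical and not part of the service.

```python
import json


def unhealthy_dependencies(health: dict) -> dict:
    """Return {dependency: status} for every dependency not reporting "healthy"."""
    deps = health.get("dependencies", {})
    return {name: status for name, status in deps.items() if status != "healthy"}


# Sample payload mirroring the expected response in this section.
sample = json.loads("""
{
  "status": "healthy",
  "version": "3.2.1",
  "uptime": "15d 4h 32m",
  "dependencies": {
    "database": "healthy",
    "redis": "healthy",
    "stripe": "healthy",
    "paypal": "degraded"
  }
}
""")

print(unhealthy_dependencies(sample))  # {'paypal': 'degraded'}
```

Anything the function returns maps to one of the response actions listed for this health check. Treating `degraded` the same as `unhealthy` is a choice made for this sketch.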
Key Metrics
| Metric | Healthy Range | Alert Threshold |
|---|---|---|
| payment_success_rate | >99% | <98% |
| payment_latency_p99 | <2s | >3s |
| stripe_api_latency_p99 | <500ms | >1s |
| active_connections | 10-50 | >80 |
| error_rate | <0.5% | >1% |
| circuit_breaker_open | 0 | >0 |
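The thresholds in the Key Metrics table can be turned into a quick check when reading metrics by hand. This is a sketch only: `ALERT_RULES` and `breached_alerts` are hypothetical names, and the units (percent for rates, seconds for latencies) are assumptions made for the example rather than the monitoring system's actual configuration.

```python
import operator

# Alert thresholds transcribed from the Key Metrics table.
# Assumed units: rates in percent, latencies in seconds.
ALERT_RULES = {
    "payment_success_rate": (operator.lt, 98.0),
    "payment_latency_p99": (operator.gt, 3.0),
    "stripe_api_latency_p99": (operator.gt, 1.0),
    "active_connections": (operator.gt, 80),
    "error_rate": (operator.gt, 1.0),
    "circuit_breaker_open": (operator.gt, 0),
}


def breached_alerts(metrics: dict) -> list:
    """Return the sorted names of metrics whose value crosses its alert threshold."""
    breached = []
    for name, value in metrics.items():
        rule = ALERT_RULES.get(name)
        if rule is None:
            continue  # metric not in the table: ignored in this sketch
        compare, threshold = rule
        if compare(value, threshold):
            breached.append(name)
    return sorted(breached)


current = {"payment_success_rate": 97.2, "payment_latency_p99": 1.4, "error_rate": 2.8}
print(breached_alerts(current))  # ['error_rate', 'payment_success_rate']
```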
Common Alerts
Alert: PaymentSuccessRateLow
Severity: Critical (pages immediately)
Description: Payment success rate has dropped below 98% over 5-minute window.
Possible Causes:
- Payment provider outage (Stripe/PayPal)
- Database connection issues
- Fraud detection blocking legitimate payments
- Network issues to payment providers
- Bad deployment
Investigation Steps:
- Check payment provider status pages
  - Stripe: https://status.stripe.com
  - PayPal: https://www.paypal-status.com
- Check error breakdown in Datadog:
  ```
  service:payment-service status:error | top 10 @error.type
  ```
- Check recent deployments:
  ```bash
  kubectl rollout history deployment/payment-service -n payments
  ```
- Check database health:
  ```bash
  psql -h $DB_HOST -U $DB_USER -c "SELECT 1"
  ```
- Check Stripe API specifically:
  ```bash
  # Look for Stripe errors
  kubectl logs -n payments -l app=payment-service --tail=500 | grep -i stripe
  ```
Resolution:
If provider outage:
- Acknowledge alert
- Post in #incidents with provider status link
- Enable fallback provider if available
- Monitor until resolved
If database issue:
- Check RDS metrics in AWS console
- Look for connection exhaustion
- Restart connection pool if needed:
  ```bash
  curl -X POST http://localhost:8080/admin/db/reconnect
  ```
If bad deployment:
- Rollback immediately:
  ```bash
  kubectl rollout undo deployment/payment-service -n payments
  ```
- Verify recovery
- Investigate root cause
Escalation:
- If unresolved after 15 minutes → Page team lead
- If revenue loss >$10K → Page engineering manager
- If provider outage → Contact Stripe priority support
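The escalation rules for this alert are simple enough to encode, which helps when wiring them into tooling or a checklist. A minimal sketch, assuming "after 15 minutes" means 15 minutes or more; `escalation_targets` is a hypothetical helper, not an existing tool.

```python
def escalation_targets(minutes_unresolved: float,
                       revenue_loss_usd: float,
                       provider_outage: bool) -> list:
    """Return who to contact, in the order the escalation rules list them."""
    targets = []
    if minutes_unresolved >= 15:
        targets.append("team lead")
    if revenue_loss_usd > 10_000:
        targets.append("engineering manager")
    if provider_outage:
        targets.append("Stripe priority support")
    return targets


print(escalation_targets(20, 500, provider_outage=False))    # ['team lead']
print(escalation_targets(5, 25_000, provider_outage=True))   # ['engineering manager', 'Stripe priority support']
```

The rules are independent, so more than one can fire at once.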
Alert: PaymentLatencyHigh
Severity: Warning
Description: Payment API p99 latency exceeds 3 seconds.
Possible Causes:
- Stripe API slowness
- Database query performance
- High traffic/capacity issues
- Network latency
- Fraud check taking too long
Investigation Steps:
- Check which payment provider is slow:
  ```sql
  -- p99 computed with PostgreSQL's percentile_cont aggregate
  SELECT provider,
         avg(latency_ms) AS avg_latency_ms,
         percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99_latency_ms
  FROM payment_metrics
  WHERE timestamp > NOW() - INTERVAL '15 minutes'
  GROUP BY provider;
  ```
- Check if it's the database:
  ```bash
  # Look at slow queries
  kubectl logs -n payments -l app=payment-service | grep "slow query"
  ```
- Check pod resource usage:
  ```bash
  kubectl top pods -n payments
  ```
Resolution:
If Stripe slowness:
- Usually self-resolving
- Consider increasing timeouts temporarily
- Check Stripe status page
If database:
- Check for missing indexes
- Look for connection pool exhaustion
- Consider read replica routing
If capacity:
```bash
# Scale up pods
kubectl scale deployment/payment-service -n payments --replicas=10
```
Alert: CircuitBreakerOpen
Severity: Critical
Description: Circuit breaker for external payment provider is open, blocking all requests.
Possible Causes:
- Payment provider completely down
- Network partition
- Provider rate limiting us
Investigation Steps:
- Check which circuit is open:
  ```bash
  curl http://localhost:8080/admin/circuit-breaker/status | jq
  ```
- Check provider status pages
- Check our request rate:
  ```sql
  SELECT COUNT(*)
  FROM payment_requests
  WHERE provider = 'stripe'
    AND timestamp > NOW() - INTERVAL '1 minute';
  ```
Resolution:
- If provider is recovering, manually reset:
  ```bash
  curl -X POST http://localhost:8080/admin/circuit-breaker/reset/stripe
  ```
- If provider is down, enable fallback:
  ```bash
  curl -X POST http://localhost:8080/admin/failover/enable/paypal
  ```
- Communicate status to stakeholders
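For on-call engineers unfamiliar with the pattern, it helps to see what "open" and "reset" mean. The runbook does not describe how the service implements its breaker, so the class below is a generic illustration. The failure threshold, cooldown, and all names are assumptions.

```python
import time


class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures, allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, let a trial request through (half-open).
        return self.clock() - self.opened_at >= self.reset_timeout_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def reset(self) -> None:
        """Manual reset, analogous in spirit to calling an admin reset endpoint."""
        self.record_success()


breaker = CircuitBreaker(failure_threshold=3)
for _ in range(3):
    breaker.record_failure()
print(breaker.allow_request())  # False: the circuit is open
breaker.reset()
print(breaker.allow_request())  # True: closed again after the manual reset
```

This is why a manual reset only makes sense once the provider is recovering: if calls keep failing, the breaker reopens after a few more requests.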
Troubleshooting
Symptom: Payments failing with "Card Declined"
Diagnosis:
```bash
# Check decline reasons
kubectl logs -n payments -l app=payment-service --tail=1000 | \
  grep "card_declined" | jq '.decline_code' | sort | uniq -c
```
Possible Causes:
- Legitimate fraud detection
- Stripe Radar rules too aggressive
- Card network issues
- Customer's bank issues
Resolution:
- Check if specific decline code is spiking
- Review Stripe Radar rules if false positives
- Advise customer service on messaging
Symptom: Duplicate charges appearing
Diagnosis:
```sql
-- Find duplicate transactions (rows sharing customer, amount, and idempotency key)
SELECT customer_id, amount, idempotency_key, COUNT(*) AS charge_count
FROM payments
WHERE created_at > NOW() - INTERVAL '1 hour'
GROUP BY customer_id, amount, idempotency_key
HAVING COUNT(*) > 1;
```
Resolution:
- Identify affected transactions
- Process refunds via Stripe dashboard
- Investigate idempotency key generation
- Check for retry logic issues
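When the database is not reachable, or the data is already in an export, the same duplicate check can be run in memory. A minimal sketch, assuming each record is a dict with the column names used in the diagnosis query; `find_duplicates` is a hypothetical helper.

```python
from collections import Counter


def find_duplicates(payments: list) -> dict:
    """Count rows per (customer_id, amount, idempotency_key) and keep groups seen more than once."""
    counts = Counter(
        (p["customer_id"], p["amount"], p["idempotency_key"]) for p in payments
    )
    return {key: n for key, n in counts.items() if n > 1}


rows = [
    {"customer_id": "c1", "amount": 4999, "idempotency_key": "k-1"},
    {"customer_id": "c1", "amount": 4999, "idempotency_key": "k-1"},
    {"customer_id": "c2", "amount": 1500, "idempotency_key": "k-2"},
]
print(find_duplicates(rows))  # {('c1', 4999, 'k-1'): 2}
```

A group that shares one idempotency key points at the idempotency check itself. Charges that repeat with different keys point at key generation or retry logic instead, and need grouping by customer and amount only.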
Operational Procedures
Procedure: Manual Payment Refund
Purpose: Process refunds that can't go through normal flow
When to Use: Customer service escalation, system errors
Prerequisites:
- Stripe dashboard access
- Original transaction ID
- Customer verification complete
Steps:
- Find transaction in Stripe:
  Dashboard → Payments → Search by ID
- Verify transaction details match request
- Click "Refund" and enter amount
- Add metadata:
  ```json
  {
    "reason": "customer_request",
    "ticket": "CS-12345",
    "processed_by": "your_email"
  }
  ```
- Update internal database:
  ```sql
  UPDATE payments
  SET status = 'refunded',
      refund_reason = 'manual - CS-12345',
      updated_at = NOW()
  WHERE stripe_payment_id = 'pi_xxx';
  ```
Verification:
- Refund shows in Stripe dashboard
- Customer notified
- Internal records updated
Procedure: Rotate Stripe API Keys
Purpose: Regular security rotation or key compromise
When to Use: Quarterly rotation or security incident
Prerequisites:
- Stripe admin access
- AWS Secrets Manager access
- Deployment access
Steps:
- Generate new key in Stripe Dashboard:
  Developers → API Keys → Create restricted key
- Update secret in AWS:
  ```bash
  aws secretsmanager update-secret \
    --secret-id payment-service/stripe-api-key \
    --secret-string '{"key": "sk_live_xxx"}'
  ```
- Trigger secret refresh:
  ```bash
  kubectl rollout restart deployment/payment-service -n payments
  ```
- Verify new key is working:
  ```bash
  kubectl logs -n payments -l app=payment-service | grep "Stripe client initialized"
  ```
- Revoke old key in Stripe Dashboard
Rollback: If issues occur, restore old key from Secrets Manager version history.
Emergency Procedures
Complete Payment Outage
Immediate Actions (First 5 minutes):
- Acknowledge alert in PagerDuty
- Join #incident-payments channel
- Post initial status:
  ```
  🔴 Investigating payment processing issues.
  Payments may be failing.
  Impact: All payment types affected
  Started: [time]
  ```
- Run quick diagnostics:
  ```bash
  # Service status
  kubectl get pods -n payments
  # Recent errors
  kubectl logs -n payments -l app=payment-service --tail=50 | grep ERROR
  # Provider status
  curl -s https://status.stripe.com/api/v2/status.json
  ```
Diagnosis Tree:
```
Payments failing?
├── All providers failing?
│   ├── Yes → Check our infrastructure
│   │   ├── Pods running? → Restart/scale
│   │   ├── DB connected? → Check RDS
│   │   └── Network issue? → Check VPC/SG
│   └── No → Provider-specific issue
│       ├── Stripe only → Check Stripe status
│       └── PayPal only → Check PayPal status
```
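The diagnosis tree can also be read as a small decision function. This sketch is one interpretation: it reads "Pods running? → Restart/scale" as "if pods are not running, restart or scale", and the name `next_check` and its parameters are hypothetical.

```python
def next_check(all_providers_failing: bool,
               pods_running: bool = True,
               db_connected: bool = True,
               failing_provider: str = "") -> str:
    """Walk the diagnosis tree and return the next thing to look at."""
    if all_providers_failing:
        # Every provider failing points at our own infrastructure.
        if not pods_running:
            return "Restart or scale pods"
        if not db_connected:
            return "Check RDS"
        return "Check VPC and security groups"
    # Only some providers failing points at a provider-specific issue.
    if failing_provider == "stripe":
        return "Check Stripe status"
    if failing_provider == "paypal":
        return "Check PayPal status"
    return "No branch matched; keep investigating"


print(next_check(True, pods_running=False))           # Restart or scale pods
print(next_check(False, failing_provider="stripe"))   # Check Stripe status
```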
Recovery Actions:
Infrastructure issue:
```bash
# Scale up
kubectl scale deployment/payment-service -n payments --replicas=10
# Force restart
kubectl delete pods -n payments -l app=payment-service
# Check and fix database
psql -h $DB_HOST -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction';"
```
Provider issue:
```bash
# Failover to backup provider
kubectl set env deployment/payment-service \
  PRIMARY_PROVIDER=paypal \
  -n payments
```
Communication:
- Update #incident-payments every 15 minutes
- Notify customer service team
- Update status page if the outage lasts more than 10 minutes
Suspected Fraud or Data Breach
Immediate Actions:
- DO NOT discuss in public channels
- Page Security team immediately
- Document exact time of discovery
- Preserve logs (do not delete anything)
If active breach:
```bash
# Block suspicious IPs (if identified)
# Contact Security team for specific commands
```
Contacts
| Role | Name | Contact |
|---|---|---|
| Team Lead | Sarah Chen | @sarah-chen, +1-555-0123 |
| Engineering Manager | Mike Johnson | @mike-j |
| DBA | David Kim | @david-kim |
| Security | Lisa Wang | @security-oncall |
| Stripe Support | Priority Line | support@stripe.com, 1-888-xxx |
Appendix
Environment Variables
| Variable | Description | Example |
|---|---|---|
| STRIPE_API_KEY | Stripe secret key | sk_live_xxx |
| STRIPE_WEBHOOK_SECRET | Webhook signing secret | whsec_xxx |
| DB_CONNECTION_STRING | PostgreSQL connection | postgres://... |
| REDIS_URL | Redis connection | redis://... |
| FRAUD_CHECK_ENABLED | Enable fraud checks | true |
Changelog
| Date | Author | Changes |
|---|---|---|
| 2025-10-15 | Sarah Chen | Added PayPal failover procedure |
| 2025-09-01 | Mike Johnson | Updated Stripe key rotation |
| 2025-08-15 | David Kim | Added database troubleshooting |
## Runbook Best Practices
### 1. Assume a Tired, Stressed Reader
- Clear, numbered steps
- One action per step
- Copy-paste ready commands
- No ambiguity
### 2. Keep Current
- Review after every incident
- Update at least quarterly
- Version control changes
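The quarterly review rule can be checked automatically against a runbook's **Last Updated** field. A minimal sketch, assuming the date uses the `YYYY-MM-DD` format from the template and that "quarterly" means 90 days; `is_stale` is a hypothetical helper.

```python
from datetime import date


def is_stale(last_updated: str, today: date, max_age_days: int = 90) -> bool:
    """Return True when the runbook was last updated more than max_age_days ago."""
    updated = date.fromisoformat(last_updated)
    return (today - updated).days > max_age_days


print(is_stale("2025-10-15", today=date(2026, 2, 1)))   # True (109 days old)
print(is_stale("2025-10-15", today=date(2025, 12, 1)))  # False (47 days old)
```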
### 3. Test Procedures
- Practice runbooks in staging
- Include in game days
- Verify commands work
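One way to make command verification routine is to pull every shell block out of the runbook so the commands can be reviewed or replayed in staging. The sketch below only extracts them, it does not run anything. It assumes commands live in fenced blocks tagged `bash`, and `extract_shell_blocks` is a hypothetical helper.

```python
import re

FENCE = "`" * 3  # built this way so the sketch can itself sit inside a fenced block


def extract_shell_blocks(markdown_text: str) -> list:
    """Return the contents of every fenced block tagged bash, in document order."""
    pattern = re.compile(FENCE + r"bash\n(.*?)" + FENCE, re.DOTALL)
    return [block.strip() for block in pattern.findall(markdown_text)]


runbook = (
    "### Key Commands\n"
    + FENCE + "bash\n"
    + "# Check service status\n"
    + "kubectl get pods -n payments -l app=payment-service\n"
    + FENCE + "\n"
)

for block in extract_shell_blocks(runbook):
    print(block)
```

The extracted blocks can then be read through during a game day or checked for placeholders such as `[command]` that were never filled in.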
### 4. Include Context
- Why, not just what
- Links to deeper documentation
- Escalation paths
---
*A runbook's value is proven at 3 AM during an incident. Write for that moment.*