# Runbook Template

Runbooks document operational procedures that enable anyone on-call to handle incidents and perform routine operations. Good runbooks reduce mean time to recovery (MTTR) and minimize reliance on tribal knowledge.

## Why Use Runbooks?

**Benefits:**

- Enables consistent incident response
- Reduces dependency on specific individuals
- Decreases mean time to recovery (MTTR)
- Supports on-call engineers effectively
- Creates institutional knowledge

**When to write a runbook:**

- New services going to production
- After incidents (capture learnings)
- Common operational tasks
- Complex troubleshooting procedures
- Compliance requirements

## The Template


# Runbook: [Service/System Name]

**Version:** [1.0]
**Last Updated:** [YYYY-MM-DD]
**Owner:** [Team Name]
**On-Call:** [Rotation/Contact Info]

## Overview

**Service Description:**
[Brief description of what this service does]

**Criticality:** [P0/P1/P2/P3]

**Dependencies:**
- Upstream: [Services this depends on]
- Downstream: [Services that depend on this]

**Documentation:**
- Architecture: [Link]
- API Docs: [Link]
- Dashboard: [Link]

## Quick Reference

### Key URLs

| Resource | URL |
|----------|-----|
| Production | [URL] |
| Staging | [URL] |
| Logs | [URL] |
| Metrics | [URL] |
| Alerts | [URL] |

### Key Commands

```bash
# Check service status
[command]

# View logs
[command]

# Restart service
[command]
```

### Escalation Path

| Level | Contact | When |
|-------|---------|------|
| L1 | [On-call engineer] | Initial response |
| L2 | [Team lead] | >30 min unresolved |
| L3 | [Engineering manager] | P0 incidents |

## Health Checks

### Service Health

**Endpoint:** `GET /health`

**Expected Response:**

```json
{
  "status": "healthy",
  "version": "1.2.3",
  "dependencies": {
    "database": "healthy",
    "cache": "healthy"
  }
}
```

### Key Metrics

| Metric | Healthy Range | Alert Threshold |
|--------|---------------|-----------------|
| [Metric 1] | [Range] | [Threshold] |
| [Metric 2] | [Range] | [Threshold] |

## Common Alerts

### Alert: [Alert Name]

**Severity:** [Critical/Warning/Info]

**Description:** [What this alert means]

**Possible Causes:**

- [Cause 1]
- [Cause 2]

**Investigation Steps:**

1. [Step 1]
2. [Step 2]

**Resolution:**

1. [Resolution step 1]
2. [Resolution step 2]

**Escalation:** [When and how to escalate]

[Repeat for each alert]

## Troubleshooting

### Symptom: [Description]

**Diagnosis:**

```bash
# Commands to diagnose
[command]
```

**Possible Causes:**

- [Cause 1]
- [Cause 2]

**Resolution:**

1. [Step 1]
2. [Step 2]

[Repeat for common symptoms]

## Operational Procedures

### Procedure: [Name]

**Purpose:** [Why this is done]

**When to Use:** [Conditions]

**Prerequisites:**

- [Prerequisite 1]
- [Prerequisite 2]

**Steps:**

1. [Step 1]
   ```bash
   [command]
   ```
2. [Step 2]
3. [Step 3]

**Verification:**

- [How to verify success]

**Rollback:**

1. [Rollback step 1]
2. [Rollback step 2]

[Repeat for each procedure]

## Emergency Procedures

### Service Down - Complete Outage

**Immediate Actions:**

1. Check dashboard for related alerts
2. Verify infrastructure status
3. Check recent deployments
4. Communicate in incident channel

**Diagnosis:**

```bash
# Check pod status
kubectl get pods -n [namespace]

# Check recent events
kubectl get events -n [namespace] --sort-by='.lastTimestamp'

# Check logs
kubectl logs -n [namespace] [pod] --tail=100
```

**Recovery Options:**

1. Rollback deployment (if recent deploy)
2. Scale up (if capacity issue)
3. Failover (if regional issue)

### Database Connection Issues

[Specific procedures]

### High Latency

[Specific procedures]

## Maintenance Procedures

### Scheduled Maintenance Window

**Pre-Maintenance Checklist:**

- Notify stakeholders
- Verify backup
- Prepare rollback plan
- Schedule maintenance window

**During Maintenance:**

[Steps]

**Post-Maintenance:**

- Verify service health
- Monitor metrics
- Update documentation

## Recovery Procedures

### Restore from Backup

**Prerequisites:**

- Access to backup storage
- Database credentials

**Steps:**

[Detailed steps]

### Disaster Recovery

**RTO:** [Time]
**RPO:** [Time]

**Steps:**

[DR procedure]

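## Contacts

| Role | Name | Contact |
|------|------|---------|
| Team Lead | [Name] | [Email/Slack] |
| DBA | [Name] | [Email/Slack] |
| Security | [Name] | [Email/Slack] |
| Vendor Support | [Company] | [Contact] |

## Appendix

### Environment Variables

| Variable | Description | Example |
|----------|-------------|---------|
| [VAR] | [Description] | [Example] |

### Architecture Diagram

[ASCII diagram]

### Changelog

| Date | Author | Changes |
|------|--------|---------|
| [Date] | [Name] | [Description] |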


## Complete Example

# Runbook: Payment Processing Service

**Version:** 2.3
**Last Updated:** 2025-10-15
**Owner:** Payments Team
**On-Call:** #payments-oncall (PagerDuty rotation)

## Overview

**Service Description:**
The Payment Processing Service handles all payment transactions including credit cards, digital wallets, and bank transfers. It integrates with Stripe, PayPal, and direct bank APIs.

**Criticality:** P0 (Revenue-critical)

**Dependencies:**
- Upstream: Order Service, Customer Service, Auth Service
- Downstream: Fulfillment Service, Notification Service, Analytics

**Documentation:**
- Architecture: [link to design doc]
- API Docs: https://api-docs.internal/payments
- Dashboard: https://datadog.com/dash/payments
- PCI Compliance: [link to compliance docs]

## Quick Reference

### Key URLs

| Resource | URL |
|----------|-----|
| Production API | https://api.example.com/payments |
| Staging API | https://api.staging.example.com/payments |
| Logs (Datadog) | https://datadog.com/logs?service=payment |
| Metrics Dashboard | https://datadog.com/dash/payment-svc |
| Alerts | https://pagerduty.com/services/payments |
| Stripe Dashboard | https://dashboard.stripe.com |

### Key Commands

```bash
# Check service status
kubectl get pods -n payments -l app=payment-service

# View recent logs
kubectl logs -n payments -l app=payment-service --tail=100 -f

# Restart service (rolling)
kubectl rollout restart deployment/payment-service -n payments

# Check Stripe API status
curl -s https://status.stripe.com/api/v2/status.json | jq '.status'

# Force circuit breaker reset
curl -X POST http://localhost:8080/admin/circuit-breaker/reset
```

### Escalation Path

| Level | Contact | When |
|-------|---------|------|
| L1 | On-call engineer | Initial response |
| L2 | @payments-lead | >15 min unresolved or revenue impact |
| L3 | @engineering-manager | P0 incidents, data breach |
| External | Stripe Support (priority line) | Provider issues |

## Health Checks

### Service Health

**Endpoint:** `GET /health`

**Expected Response:**

```json
{
  "status": "healthy",
  "version": "3.2.1",
  "uptime": "15d 4h 32m",
  "dependencies": {
    "database": "healthy",
    "redis": "healthy",
    "stripe": "healthy",
    "paypal": "degraded"
  }
}
```

**Unhealthy Response Actions:**

- `database: unhealthy` → Check RDS, see Database section
- `redis: unhealthy` → Check ElastiCache, see Cache section
- `stripe: unhealthy` → Check Stripe status page
- `paypal: unhealthy` → Check PayPal status, may need failover
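The mapping above can be expressed as a small lookup that an on-call engineer runs against the health output. This is a minimal sketch rather than part of the service: the function name `health_action` is hypothetical, and treating any status other than `healthy` (for example `degraded`) as actionable is an assumption.

```bash
#!/usr/bin/env bash
# health_action DEPENDENCY STATUS
# Prints the runbook action for a dependency reported by GET /health.
# Assumption: every status other than "healthy" needs attention.
health_action() {
  local dep="$1" status="$2"
  if [ "$status" = "healthy" ]; then
    echo "no action"
    return 0
  fi
  case "$dep" in
    database) echo "Check RDS, see Database section" ;;
    redis)    echo "Check ElastiCache, see Cache section" ;;
    stripe)   echo "Check Stripe status page" ;;
    paypal)   echo "Check PayPal status, may need failover" ;;
    *)        echo "No runbook entry for $dep" ;;
  esac
}
```

For the sample response above, `health_action paypal degraded` prints the PayPal action from the list.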

### Key Metrics

| Metric | Healthy Range | Alert Threshold |
|--------|---------------|-----------------|
| payment_success_rate | >99% | <98% |
| payment_latency_p99 | <2s | >3s |
| stripe_api_latency_p99 | <500ms | >1s |
| active_connections | 10-50 | >80 |
| error_rate | <0.5% | >1% |
| circuit_breaker_open | 0 | >0 |
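The alert thresholds in the table can be checked by hand with a short helper. The sketch below is illustrative only: `alert_rule`, `breaches`, and `metric_state` are hypothetical names, and the units (percent for rates, seconds for `payment_latency_p99`, milliseconds for `stripe_api_latency_p99`) are assumptions. Only the alert threshold is encoded, so a value between the healthy range and the threshold reports `OK`.

```bash
#!/usr/bin/env bash
# alert_rule METRIC -> prints "<operator> <threshold>" from the table above.
alert_rule() {
  case "$1" in
    payment_success_rate)   echo "< 98"   ;;  # percent
    payment_latency_p99)    echo "> 3"    ;;  # seconds
    stripe_api_latency_p99) echo "> 1000" ;;  # milliseconds (1s)
    active_connections)     echo "> 80"   ;;
    error_rate)             echo "> 1"    ;;  # percent
    circuit_breaker_open)   echo "> 0"    ;;
    *) return 1 ;;
  esac
}

# breaches VALUE OP THRESHOLD -> exit 0 when the comparison holds.
# awk is used because the shell cannot compare decimal numbers.
breaches() {
  awk -v v="$1" -v op="$2" -v t="$3" 'BEGIN {
    if (op == "<") exit !(v + 0 < t + 0)
    if (op == ">") exit !(v + 0 > t + 0)
    exit 2
  }'
}

# metric_state METRIC VALUE -> prints ALERT, OK, or UNKNOWN.
metric_state() {
  local rule op threshold
  rule="$(alert_rule "$1")" || { echo "UNKNOWN"; return 0; }
  op="${rule% *}"
  threshold="${rule#* }"
  if breaches "$2" "$op" "$threshold"; then echo "ALERT"; else echo "OK"; fi
}
```

For example, `metric_state payment_success_rate 97.2` reports an alert, while `metric_state payment_success_rate 99.4` does not.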

## Common Alerts

### Alert: PaymentSuccessRateLow

**Severity:** Critical (pages immediately)

**Description:** Payment success rate has dropped below 98% over a 5-minute window.

**Possible Causes:**

- Payment provider outage (Stripe/PayPal)
- Database connection issues
- Fraud detection blocking legitimate payments
- Network issues to payment providers
- Bad deployment

**Investigation Steps:**

1. Check payment provider status pages:
   - Stripe: https://status.stripe.com
   - PayPal: https://www.paypal-status.com

2. Check error breakdown in Datadog:
   ```
   service:payment-service status:error | top 10 @error.type
   ```

3. Check recent deployments:
   ```bash
   kubectl rollout history deployment/payment-service -n payments
   ```

4. Check database health:
   ```bash
   psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT 1"
   ```

5. Check Stripe API specifically:
   ```bash
   # Look for Stripe errors
   kubectl logs -n payments -l app=payment-service --tail=500 | grep -i stripe
   ```

**Resolution:**

**If provider outage:**

1. Acknowledge alert
2. Post in #incidents with provider status link
3. Enable fallback provider if available
4. Monitor until resolved

**If database issue:**

1. Check RDS metrics in AWS console
2. Look for connection exhaustion
3. Restart connection pool if needed:
   ```bash
   curl -X POST http://localhost:8080/admin/db/reconnect
   ```

**If bad deployment:**

1. Roll back immediately:
   ```bash
   kubectl rollout undo deployment/payment-service -n payments
   ```
2. Verify recovery
3. Investigate root cause

**Escalation:**

- If unresolved after 15 minutes → Page team lead
- If revenue loss >$10K → Page engineering manager
- If provider outage → Contact Stripe priority support
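The escalation rules above are simple enough to encode, which removes a judgment call at 3 AM. The following is a sketch under stated assumptions: `escalation_targets` is a hypothetical helper, its three inputs are supplied by hand by the responder, and "after 15 minutes" is read as 15 minutes or more.

```bash
#!/usr/bin/env bash
# escalation_targets MINUTES_UNRESOLVED REVENUE_LOSS_USD PROVIDER_OUTAGE
# PROVIDER_OUTAGE is "yes" or "no". Prints one escalation action per line;
# prints nothing when no rule applies.
escalation_targets() {
  local minutes="$1" loss_usd="$2" provider_outage="$3"
  [ "$minutes" -ge 15 ] && echo "page team lead"
  [ "$loss_usd" -gt 10000 ] && echo "page engineering manager"
  [ "$provider_outage" = "yes" ] && echo "contact Stripe priority support"
  return 0
}
```

Calling `escalation_targets 20 25000 yes` lists all three actions, while `escalation_targets 5 0 no` prints nothing.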

### Alert: PaymentLatencyHigh

**Severity:** Warning

**Description:** Payment API p99 latency exceeds 3 seconds.

**Possible Causes:**

- Stripe API slowness
- Database query performance
- High traffic/capacity issues
- Network latency
- Fraud check taking too long

**Investigation Steps:**

1. Check which payment provider is slow:
   ```sql
   -- PostgreSQL has no built-in p99(); percentile_cont is the equivalent aggregate
   SELECT provider,
          avg(latency_ms) AS avg_latency_ms,
          percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99_latency_ms
   FROM payment_metrics
   WHERE timestamp > NOW() - INTERVAL '15 minutes'
   GROUP BY provider;
   ```

2. Check if it's the database:
   ```bash
   # Look at slow queries (with -l, kubectl logs shows only the last 10 lines per pod unless --tail is set)
   kubectl logs -n payments -l app=payment-service --tail=1000 | grep "slow query"
   ```

3. Check pod resource usage:
   ```bash
   kubectl top pods -n payments
   ```

**Resolution:**

**If Stripe slowness:**

- Usually self-resolving
- Consider increasing timeouts temporarily
- Check Stripe status page

**If database:**

- Check for missing indexes
- Look for connection pool exhaustion
- Consider read replica routing

**If capacity:**

```bash
# Scale up pods
kubectl scale deployment/payment-service -n payments --replicas=10
```

### Alert: CircuitBreakerOpen

**Severity:** Critical

**Description:** Circuit breaker for external payment provider is open, blocking all requests.

**Possible Causes:**

- Payment provider completely down
- Network partition
- Provider rate limiting us

**Investigation Steps:**

1. Check which circuit is open:
   ```bash
   curl -s http://localhost:8080/admin/circuit-breaker/status | jq .
   ```

2. Check provider status pages

3. Check our request rate:
   ```sql
   SELECT COUNT(*) FROM payment_requests
   WHERE provider = 'stripe' AND timestamp > NOW() - INTERVAL '1 minute';
   ```

**Resolution:**

1. If provider is recovering, manually reset:
   ```bash
   curl -X POST http://localhost:8080/admin/circuit-breaker/reset/stripe
   ```

2. If provider is down, enable fallback:
   ```bash
   curl -X POST http://localhost:8080/admin/failover/enable/paypal
   ```

3. Communicate status to stakeholders

## Troubleshooting

### Symptom: Payments failing with "Card Declined"

**Diagnosis:**

```bash
# Check decline reasons
kubectl logs -n payments -l app=payment-service --tail=1000 | \
  grep "card_declined" | jq '.decline_code' | sort | uniq -c
```

**Possible Causes:**

- Legitimate fraud detection
- Stripe Radar rules too aggressive
- Card network issues
- Customer's bank issues

**Resolution:**

1. Check if specific decline code is spiking
2. Review Stripe Radar rules if false positives
3. Advise customer service on messaging

### Symptom: Duplicate charges appearing

**Diagnosis:**

```sql
-- Find duplicate transactions
SELECT customer_id, amount, COUNT(*)
FROM payments
WHERE created_at > NOW() - INTERVAL '1 hour'
GROUP BY customer_id, amount, idempotency_key
HAVING COUNT(*) > 1;
```

**Resolution:**

1. Identify affected transactions
2. Process refunds via Stripe dashboard
3. Investigate idempotency key generation
4. Check for retry logic issues

## Operational Procedures

### Procedure: Manual Payment Refund

**Purpose:** Process refunds that can't go through normal flow

**When to Use:** Customer service escalation, system errors

**Prerequisites:**

- Stripe dashboard access
- Original transaction ID
- Customer verification complete

**Steps:**

1. Find transaction in Stripe:
   - Dashboard → Payments → Search by ID

2. Verify transaction details match request

3. Click "Refund" and enter amount

4. Add metadata:
   ```json
   {
     "reason": "customer_request",
     "ticket": "CS-12345",
     "processed_by": "your_email"
   }
   ```

5. Update internal database:
   ```sql
   UPDATE payments
   SET status = 'refunded',
       refund_reason = 'manual - CS-12345',
       updated_at = NOW()
   WHERE stripe_payment_id = 'pi_xxx';
   ```

**Verification:**

- Refund shows in Stripe dashboard
- Customer notified
- Internal records updated

### Procedure: Rotate Stripe API Keys

**Purpose:** Regular security rotation or key compromise

**When to Use:** Quarterly rotation or security incident

**Prerequisites:**

- Stripe admin access
- AWS Secrets Manager access
- Deployment access

**Steps:**

1. Generate new key in Stripe Dashboard:
   - Developers → API Keys → Create restricted key

2. Update secret in AWS:
   ```bash
   aws secretsmanager update-secret \
     --secret-id payment-service/stripe-api-key \
     --secret-string '{"key": "sk_live_xxx"}'
   ```

3. Trigger secret refresh:
   ```bash
   kubectl rollout restart deployment/payment-service -n payments
   ```

4. Verify new key is working:
   ```bash
   # With -l, kubectl logs shows only the last 10 lines per pod unless --tail is set
   kubectl logs -n payments -l app=payment-service --tail=1000 | grep "Stripe client initialized"
   ```

5. Revoke old key in Stripe Dashboard

**Rollback:** If issues occur, restore old key from Secrets Manager version history.

## Emergency Procedures

### Complete Payment Outage

**Immediate Actions (First 5 minutes):**

1. Acknowledge alert in PagerDuty

2. Join #incident-payments channel

3. Post initial status:
   ```
   🔴 Investigating payment processing issues. Payments may be failing.
   Impact: All payment types affected
   Started: [time]
   ```

4. Run quick diagnostics:
   ```bash
   # Service status
   kubectl get pods -n payments

   # Recent errors
   kubectl logs -n payments -l app=payment-service --tail=50 | grep ERROR

   # Provider status
   curl -s https://status.stripe.com/api/v2/status.json
   ```
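The initial status post in step 3 is easier to send quickly if it is generated rather than typed. A minimal sketch follows, assuming a hypothetical helper named `status_post` whose output is pasted into the incident channel by hand; it does not call any chat API.

```bash
#!/usr/bin/env bash
# status_post SUMMARY IMPACT STARTED
# Prints the three-line status message used in the incident channel.
status_post() {
  printf '🔴 %s\n' "$1"
  printf 'Impact: %s\n' "$2"
  printf 'Started: %s\n' "$3"
}

# Example (start time in UTC):
# status_post "Investigating payment processing issues. Payments may be failing." \
#   "All payment types affected" "$(date -u +%H:%M) UTC"
```

Keeping the message format in one place also keeps later updates consistent with the first post.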

**Diagnosis Tree:**

```
Payments failing?
├── All providers failing?
│   ├── Yes → Check our infrastructure
│   │   ├── Pods running? → Restart/scale
│   │   ├── DB connected? → Check RDS
│   │   └── Network issue? → Check VPC/SG
│   └── No → Provider-specific issue
│       ├── Stripe only → Check Stripe status
│       └── PayPal only → Check PayPal status
```
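The first two levels of the tree can be sketched as a function that turns observed provider failures into the next check. This is an illustration only: `next_check` is a hypothetical name, the inputs are entered by the responder, and "all providers" is assumed to mean Stripe and PayPal (direct bank APIs are left out).

```bash
#!/usr/bin/env bash
# next_check STRIPE_FAILING PAYPAL_FAILING   (each "yes" or "no")
# Prints the next branch of the diagnosis tree to follow.
next_check() {
  local stripe="$1" paypal="$2"
  if [ "$stripe" = "yes" ] && [ "$paypal" = "yes" ]; then
    # All providers failing: the problem is most likely on our side.
    echo "Check our infrastructure: pods, DB connection, network (VPC/SG)"
  elif [ "$stripe" = "yes" ]; then
    echo "Check Stripe status"
  elif [ "$paypal" = "yes" ]; then
    echo "Check PayPal status"
  else
    echo "No provider failing"
  fi
}
```

For instance, `next_check yes no` points at the Stripe branch, while `next_check yes yes` points at the infrastructure checks.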

**Recovery Actions:**

**Infrastructure issue:**

```bash
# Scale up
kubectl scale deployment/payment-service -n payments --replicas=10

# Force restart
kubectl delete pods -n payments -l app=payment-service

# Check and fix database
psql -h "$DB_HOST" -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction';"
```

**Provider issue:**

```bash
# Failover to backup provider
kubectl set env deployment/payment-service \
  PRIMARY_PROVIDER=paypal \
  -n payments
```

**Communication:**

- Update #incident-payments every 15 minutes
- Notify customer service team
- Update status page if >10 minutes

### Suspected Fraud or Data Breach

**Immediate Actions:**

1. **DO NOT** discuss in public channels
2. Page Security team immediately
3. Document exact time of discovery
4. Preserve logs (do not delete anything)

**If active breach:**

```bash
# Block suspicious IPs (if identified)
# Contact Security team for specific commands
```

## Contacts

| Role | Name | Contact |
|------|------|---------|
| Team Lead | Sarah Chen | @sarah-chen, +1-555-0123 |
| Engineering Manager | Mike Johnson | @mike-j |
| DBA | David Kim | @david-kim |
| Security | Lisa Wang | @security-oncall |
| Stripe Support | Priority Line | support@stripe.com, 1-888-xxx |

## Appendix

### Environment Variables

| Variable | Description | Example |
|----------|-------------|---------|
| STRIPE_API_KEY | Stripe secret key | sk_live_xxx |
| STRIPE_WEBHOOK_SECRET | Webhook signing secret | whsec_xxx |
| DB_CONNECTION_STRING | PostgreSQL connection | postgres://... |
| REDIS_URL | Redis connection | redis://... |
| FRAUD_CHECK_ENABLED | Enable fraud checks | true |

### Changelog

| Date | Author | Changes |
|------|--------|---------|
| 2025-10-15 | Sarah Chen | Added PayPal failover procedure |
| 2025-09-01 | Mike Johnson | Updated Stripe key rotation |
| 2025-08-15 | David Kim | Added database troubleshooting |


## Runbook Best Practices

### 1. Assume a Tired, Stressed Reader

- Clear, numbered steps
- One action per step
- Copy-paste ready commands
- No ambiguity

### 2. Keep Current

- Review after every incident
- Update at least quarterly
- Version control changes
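
A review cadence is easier to keep when staleness is checked mechanically. The sketch below compares a runbook's "Last Updated" date with a given date. It assumes GNU `date` (the `-d` flag), treats "quarterly" as 90 days, and uses the hypothetical helper names `days_between` and `needs_review`.

```bash
#!/usr/bin/env bash
# days_between FROM TO -> whole days between two YYYY-MM-DD dates.
# Assumes GNU date; -u avoids daylight-saving offsets.
days_between() {
  local from to
  from="$(date -u -d "$1" +%s)" || return 1
  to="$(date -u -d "$2" +%s)" || return 1
  echo $(( (to - from) / 86400 ))
}

# needs_review LAST_UPDATED TODAY -> exit 0 when older than 90 days.
needs_review() {
  local age
  age="$(days_between "$1" "$2")" || return 2
  [ "$age" -gt 90 ]
}

# Example:
# needs_review 2025-10-15 "$(date -u +%F)" && echo "Runbook is due for review"
```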

### 3. Test Procedures

- Practice runbooks in staging
- Include in game days
- Verify commands work
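
One inexpensive way to verify commands is to confirm that every CLI a runbook relies on is installed on the machine an on-call engineer will use. A minimal sketch, assuming a hypothetical helper named `missing_tools`:

```bash
#!/usr/bin/env bash
# missing_tools TOOL...
# Prints each tool that is not found on PATH, one per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example: the CLIs used by the payment runbook above
# missing_tools kubectl jq psql aws curl
```

An empty result means every listed tool is available. It does not prove that credentials or cluster access are in place.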

### 4. Include Context

- Why, not just what
- Links to deeper documentation
- Escalation paths

---

*A runbook's value is proven at 3 AM during an incident. Write for that moment.*