Skip to main content

Runbook Template

October 15, 2025By CTO25 min read
...
templates

A structured template for documenting operational procedures, troubleshooting steps, and incident response.

Template Type:Documentation

Runbook Template

Runbooks document operational procedures that enable anyone on-call to handle incidents and perform routine operations. Good runbooks reduce mean time to recovery (MTTR) and minimize reliance on tribal knowledge.

Why Use Runbooks?

Benefits:

  • Enables consistent incident response
  • Reduces dependency on specific individuals
  • Decreases mean time to recovery (MTTR)
  • Supports on-call engineers effectively
  • Creates institutional knowledge

When to write a runbook:

  • New services going to production
  • After incidents (capture learnings)
  • Common operational tasks
  • Complex troubleshooting procedures
  • Compliance requirements

The Template

markdown
# Runbook: [Service/System Name]

**Version:** [1.0]
**Last Updated:** [YYYY-MM-DD]
**Owner:** [Team Name]
**On-Call:** [Rotation/Contact Info]

## Overview

**Service Description:**
[Brief description of what this service does]

**Criticality:** [P0/P1/P2/P3]

**Dependencies:**
- Upstream: [Services this depends on]
- Downstream: [Services that depend on this]

**Documentation:**
- Architecture: [Link]
- API Docs: [Link]
- Dashboard: [Link]

## Quick Reference

### Key URLs

| Resource | URL |
|----------|-----|
| Production | [URL] |
| Staging | [URL] |
| Logs | [URL] |
| Metrics | [URL] |
| Alerts | [URL] |

### Key Commands

```bash
# Check service status
[command]

# View logs
[command]

# Restart service
[command]

Escalation Path

LevelContactWhen
L1[On-call engineer]Initial response
L2[Team lead]>30 min unresolved
L3[Engineering manager]P0 incidents

Health Checks

Service Health

Endpoint: GET /health

Expected Response:

json
{
  "status": "healthy",
  "version": "1.2.3",
  "dependencies": {
    "database": "healthy",
    "cache": "healthy"
  }
}

Key Metrics

MetricHealthy RangeAlert Threshold
[Metric 1][Range][Threshold]
[Metric 2][Range][Threshold]

Common Alerts

Alert: [Alert Name]

Severity: [Critical/Warning/Info]

Description: [What this alert means]

Possible Causes:

  1. [Cause 1]
  2. [Cause 2]

Investigation Steps:

  1. [Step 1]
  2. [Step 2]

Resolution:

  1. [Resolution step 1]
  2. [Resolution step 2]

Escalation: [When and how to escalate]

[Repeat for each alert]

Troubleshooting

Symptom: [Description]

Diagnosis:

bash
# Commands to diagnose
[command]

Possible Causes:

  • [Cause 1]
  • [Cause 2]

Resolution:

  1. [Step 1]
  2. [Step 2]

[Repeat for common symptoms]

Operational Procedures

Procedure: [Name]

Purpose: [Why this is done]

When to Use: [Conditions]

Prerequisites:

  • [Prerequisite 1]
  • [Prerequisite 2]

Steps:

  1. [Step 1]
    bash
    [command]
  2. [Step 2]
  3. [Step 3]

Verification:

  • [How to verify success]

Rollback:

  1. [Rollback step 1]
  2. [Rollback step 2]

[Repeat for each procedure]

Emergency Procedures

Service Down - Complete Outage

Immediate Actions:

  1. Check dashboard for related alerts
  2. Verify infrastructure status
  3. Check recent deployments
  4. Communicate in incident channel

Diagnosis:

bash
# Check pod status
kubectl get pods -n [namespace]

# Check recent events
kubectl get events -n [namespace] --sort-by='.lastTimestamp'

# Check logs
kubectl logs -n [namespace] [pod] --tail=100

Recovery Options:

  1. Rollback deployment (if recent deploy)
  2. Scale up (if capacity issue)
  3. Failover (if regional issue)

Database Connection Issues

[Specific procedures]

High Latency

[Specific procedures]

Maintenance Procedures

Scheduled Maintenance Window

Pre-Maintenance Checklist:

  • Notify stakeholders
  • Verify backup
  • Prepare rollback plan
  • Schedule maintenance window

During Maintenance:

  1. [Steps]

Post-Maintenance:

  • Verify service health
  • Monitor metrics
  • Update documentation

Recovery Procedures

Restore from Backup

Prerequisites:

  • Access to backup storage
  • Database credentials

Steps:

  1. [Detailed steps]

Disaster Recovery

RTO: [Time] RPO: [Time]

Steps:

  1. [DR procedure]

Contacts

RoleNameContact
Team Lead[Name][Email/Slack]
DBA[Name][Email/Slack]
Security[Name][Email/Slack]
Vendor Support[Company][Contact]

Appendix

Environment Variables

VariableDescriptionExample
[VAR][Description][Example]

Architecture Diagram

[ASCII diagram]

Changelog

DateAuthorChanges
[Date][Name][Description]

## Complete Example

```markdown
# Runbook: Payment Processing Service

**Version:** 2.3
**Last Updated:** 2025-10-15
**Owner:** Payments Team
**On-Call:** #payments-oncall (PagerDuty rotation)

## Overview

**Service Description:**
The Payment Processing Service handles all payment transactions including credit cards, digital wallets, and bank transfers. It integrates with Stripe, PayPal, and direct bank APIs.

**Criticality:** P0 (Revenue-critical)

**Dependencies:**
- Upstream: Order Service, Customer Service, Auth Service
- Downstream: Fulfillment Service, Notification Service, Analytics

**Documentation:**
- Architecture: [link to design doc]
- API Docs: https://api-docs.internal/payments
- Dashboard: https://datadog.com/dash/payments
- PCI Compliance: [link to compliance docs]

## Quick Reference

### Key URLs

| Resource | URL |
|----------|-----|
| Production API | https://api.example.com/payments |
| Staging API | https://api.staging.example.com/payments |
| Logs (Datadog) | https://datadog.com/logs?service=payment |
| Metrics Dashboard | https://datadog.com/dash/payment-svc |
| Alerts | https://pagerduty.com/services/payments |
| Stripe Dashboard | https://dashboard.stripe.com |

### Key Commands

```bash
# Check service status
kubectl get pods -n payments -l app=payment-service

# View recent logs
kubectl logs -n payments -l app=payment-service --tail=100 -f

# Restart service (rolling)
kubectl rollout restart deployment/payment-service -n payments

# Check Stripe API status
curl -s https://status.stripe.com/api/v2/status.json | jq '.status'

# Force circuit breaker reset
curl -X POST http://localhost:8080/admin/circuit-breaker/reset

Escalation Path

LevelContactWhen
L1On-call engineerInitial response
L2@payments-lead>15 min unresolved or revenue impact
L3@engineering-managerP0 incidents, data breach
ExternalStripe Support (priority line)Provider issues

Health Checks

Service Health

Endpoint: GET /health

Expected Response:

json
{
  "status": "healthy",
  "version": "3.2.1",
  "uptime": "15d 4h 32m",
  "dependencies": {
    "database": "healthy",
    "redis": "healthy",
    "stripe": "healthy",
    "paypal": "degraded"
  }
}

Unhealthy Response Actions:

  • database: unhealthy → Check RDS, see Database section
  • redis: unhealthy → Check ElastiCache, see Cache section
  • stripe: unhealthy → Check Stripe status page
  • paypal: unhealthy → Check PayPal status, may need failover

Key Metrics

MetricHealthy RangeAlert Threshold
payment_success_rate>99%<98%
payment_latency_p99<2s>3s
stripe_api_latency_p99<500ms>1s
active_connections10-50>80
error_rate<0.5%>1%
circuit_breaker_open0>0

Common Alerts

Alert: PaymentSuccessRateLow

Severity: Critical (pages immediately)

Description: Payment success rate has dropped below 98% over 5-minute window.

Possible Causes:

  1. Payment provider outage (Stripe/PayPal)
  2. Database connection issues
  3. Fraud detection blocking legitimate payments
  4. Network issues to payment providers
  5. Bad deployment

Investigation Steps:

  1. Check payment provider status pages

  2. Check error breakdown in Datadog:

    service:payment-service status:error | top 10 @error.type
    
  3. Check recent deployments:

    bash
    kubectl rollout history deployment/payment-service -n payments
  4. Check database health:

    bash
    psql -h $DB_HOST -U $DB_USER -c "SELECT 1"
  5. Check Stripe API specifically:

    bash
    # Look for Stripe errors
    kubectl logs -n payments -l app=payment-service --tail=500 | grep -i stripe

Resolution:

If provider outage:

  1. Acknowledge alert
  2. Post in #incidents with provider status link
  3. Enable fallback provider if available
  4. Monitor until resolved

If database issue:

  1. Check RDS metrics in AWS console
  2. Look for connection exhaustion
  3. Restart connection pool if needed:
    bash
    curl -X POST http://localhost:8080/admin/db/reconnect

If bad deployment:

  1. Rollback immediately:
    bash
    kubectl rollout undo deployment/payment-service -n payments
  2. Verify recovery
  3. Investigate root cause

Escalation:

  • If unresolved after 15 minutes → Page team lead
  • If revenue loss >$10K → Page engineering manager
  • If provider outage → Contact Stripe priority support

Alert: PaymentLatencyHigh

Severity: Warning

Description: Payment API p99 latency exceeds 3 seconds.

Possible Causes:

  1. Stripe API slowness
  2. Database query performance
  3. High traffic/capacity issues
  4. Network latency
  5. Fraud check taking too long

Investigation Steps:

  1. Check which payment provider is slow:

    sql
    SELECT provider, avg(latency_ms), p99(latency_ms)
    FROM payment_metrics
    WHERE timestamp > NOW() - INTERVAL '15 minutes'
    GROUP BY provider;
  2. Check if it's database:

    bash
    # Look at slow queries
    kubectl logs -n payments -l app=payment-service | grep "slow query"
  3. Check pod resource usage:

    bash
    kubectl top pods -n payments

Resolution:

If Stripe slowness:

  • Usually self-resolving
  • Consider increasing timeouts temporarily
  • Check Stripe status page

If database:

  • Check for missing indexes
  • Look for connection pool exhaustion
  • Consider read replica routing

If capacity:

bash
# Scale up pods
kubectl scale deployment/payment-service -n payments --replicas=10

Alert: CircuitBreakerOpen

Severity: Critical

Description: Circuit breaker for external payment provider is open, blocking all requests.

Possible Causes:

  1. Payment provider completely down
  2. Network partition
  3. Provider rate limiting us

Investigation Steps:

  1. Check which circuit is open:

    bash
    curl http://localhost:8080/admin/circuit-breaker/status | jq
  2. Check provider status pages

  3. Check our request rate:

    sql
    SELECT COUNT(*) FROM payment_requests
    WHERE provider = 'stripe' AND timestamp > NOW() - INTERVAL '1 minute';

Resolution:

  1. If provider is recovering, manually reset:

    bash
    curl -X POST http://localhost:8080/admin/circuit-breaker/reset/stripe
  2. If provider is down, enable fallback:

    bash
    curl -X POST http://localhost:8080/admin/failover/enable/paypal
  3. Communicate status to stakeholders

Troubleshooting

Symptom: Payments failing with "Card Declined"

Diagnosis:

bash
# Check decline reasons
kubectl logs -n payments -l app=payment-service --tail=1000 | \
  grep "card_declined" | jq '.decline_code' | sort | uniq -c

Possible Causes:

  • Legitimate fraud detection
  • Stripe Radar rules too aggressive
  • Card network issues
  • Customer's bank issues

Resolution:

  1. Check if specific decline code is spiking
  2. Review Stripe Radar rules if false positives
  3. Advise customer service on messaging

Symptom: Duplicate charges appearing

Diagnosis:

sql
-- Find duplicate transactions
SELECT customer_id, amount, COUNT(*)
FROM payments
WHERE created_at > NOW() - INTERVAL '1 hour'
GROUP BY customer_id, amount, idempotency_key
HAVING COUNT(*) > 1;

Resolution:

  1. Identify affected transactions
  2. Process refunds via Stripe dashboard
  3. Investigate idempotency key generation
  4. Check for retry logic issues

Operational Procedures

Procedure: Manual Payment Refund

Purpose: Process refunds that can't go through normal flow

When to Use: Customer service escalation, system errors

Prerequisites:

  • Stripe dashboard access
  • Original transaction ID
  • Customer verification complete

Steps:

  1. Find transaction in Stripe:

    Dashboard → Payments → Search by ID
    
  2. Verify transaction details match request

  3. Click "Refund" and enter amount

  4. Add metadata:

    json
    {
      "reason": "customer_request",
      "ticket": "CS-12345",
      "processed_by": "your_email"
    }
  5. Update internal database:

    sql
    UPDATE payments
    SET status = 'refunded',
        refund_reason = 'manual - CS-12345',
        updated_at = NOW()
    WHERE stripe_payment_id = 'pi_xxx';

Verification:

  • Refund shows in Stripe dashboard
  • Customer notified
  • Internal records updated

Procedure: Rotate Stripe API Keys

Purpose: Regular security rotation or key compromise

When to Use: Quarterly rotation or security incident

Prerequisites:

  • Stripe admin access
  • AWS Secrets Manager access
  • Deployment access

Steps:

  1. Generate new key in Stripe Dashboard:

    Developers → API Keys → Create restricted key
    
  2. Update secret in AWS:

    bash
    aws secretsmanager update-secret \
      --secret-id payment-service/stripe-api-key \
      --secret-string '{"key": "sk_live_xxx"}'
  3. Trigger secret refresh:

    bash
    kubectl rollout restart deployment/payment-service -n payments
  4. Verify new key is working:

    bash
    kubectl logs -n payments -l app=payment-service | grep "Stripe client initialized"
  5. Revoke old key in Stripe Dashboard

Rollback: If issues occur, restore old key from Secrets Manager version history.

Emergency Procedures

Complete Payment Outage

Immediate Actions (First 5 minutes):

  1. Acknowledge alert in PagerDuty

  2. Join #incident-payments channel

  3. Post initial status:

    🔴 Investigating payment processing issues. Payments may be failing.
    Impact: All payment types affected
    Started: [time]
    
  4. Run quick diagnostics:

    bash
    # Service status
    kubectl get pods -n payments
    
    # Recent errors
    kubectl logs -n payments -l app=payment-service --tail=50 | grep ERROR
    
    # Provider status
    curl -s https://status.stripe.com/api/v2/status.json

Diagnosis Tree:

Payments failing?
├── All providers failing?
│   ├── Yes → Check our infrastructure
│   │   ├── Pods running? → Restart/scale
│   │   ├── DB connected? → Check RDS
│   │   └── Network issue? → Check VPC/SG
│   └── No → Provider-specific issue
│       ├── Stripe only → Check Stripe status
│       └── PayPal only → Check PayPal status

Recovery Actions:

Infrastructure issue:

bash
# Scale up
kubectl scale deployment/payment-service -n payments --replicas=10

# Force restart
kubectl delete pods -n payments -l app=payment-service

# Check and fix database
psql -h $DB_HOST -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction';"

Provider issue:

bash
# Failover to backup provider
kubectl set env deployment/payment-service \
  PRIMARY_PROVIDER=paypal \
  -n payments

Communication:

  • Update #incident-payments every 15 minutes
  • Notify customer service team
  • Update status page if >10 minutes

Suspected Fraud or Data Breach

Immediate Actions:

  1. DO NOT discuss in public channels
  2. Page Security team immediately
  3. Document exact time of discovery
  4. Preserve logs (do not delete anything)

If active breach:

bash
# Block suspicious IPs (if identified)
# Contact Security team for specific commands

Contacts

RoleNameContact
Team LeadSarah Chen@sarah-chen, +1-555-0123
Engineering ManagerMike Johnson@mike-j
DBADavid Kim@david-kim
SecurityLisa Wang@security-oncall
Stripe SupportPriority Linesupport@stripe.com, 1-888-xxx

Appendix

Environment Variables

VariableDescriptionExample
STRIPE_API_KEYStripe secret keysk_live_xxx
STRIPE_WEBHOOK_SECRETWebhook signing secretwhsec_xxx
DB_CONNECTION_STRINGPostgreSQL connectionpostgres://...
REDIS_URLRedis connectionredis://...
FRAUD_CHECK_ENABLEDEnable fraud checkstrue

Changelog

DateAuthorChanges
2025-10-15Sarah ChenAdded PayPal failover procedure
2025-09-01Mike JohnsonUpdated Stripe key rotation
2025-08-15David KimAdded database troubleshooting

## Runbook Best Practices

### 1. Assume Tired, Stressed Reader

- Clear, numbered steps
- One action per step
- Copy-paste ready commands
- No ambiguity

### 2. Keep Current

- Review after every incident
- Update quarterly minimum
- Version control changes

### 3. Test Procedures

- Practice runbooks in staging
- Include in game days
- Verify commands work

### 4. Include Context

- Why, not just what
- Links to deeper documentation
- Escalation paths

---

*A runbook's value is proven at 3 AM during an incident. Write for that moment.*

Want more insights like this?

Join thousands of CTOs and technical leaders getting weekly insights on leadership and system design.

No spam. Unsubscribe anytime.