Incident Postmortem Template

October 15, 2025 | By CTO | 25 min read

A structured template for blameless incident analysis with timeline, root cause, and action items.

Template Type: Documentation

Postmortems (also called retrospectives or post-incident reviews) help teams learn from incidents without blame. They create institutional knowledge, identify systemic improvements, and prevent recurrence.

Why Use Postmortems?

Benefits:

  • Prevents recurrence of similar incidents
  • Builds organizational learning
  • Identifies systemic weaknesses
  • Creates accountability for improvements
  • Develops incident response skills

When to write a postmortem:

  • Any P0 or P1 incident
  • Customer-impacting outages
  • Security incidents
  • Near-misses with learning value
  • When requested by stakeholders

The Template

markdown
# Incident Postmortem: [Title]

**Incident ID:** INC-[YYYY]-[NNN]
**Date:** [YYYY-MM-DD]
**Duration:** [X hours Y minutes]
**Severity:** [P0/P1/P2/P3]
**Status:** [Draft | In Review | Final]

**Author:** [Name]
**Reviewers:** [Names]

## Executive Summary

[2-3 sentence summary: what happened, impact, and current status]

## Impact

### User Impact

- **Users affected:** [Number/percentage]
- **Geographic scope:** [Regions]
- **Duration of impact:** [Time]

### Business Impact

- **Revenue impact:** [If applicable]
- **SLA impact:** [If applicable]
- **Reputational impact:** [Assessment]

### Technical Impact

- **Services affected:** [List]
- **Data impact:** [Any data loss/corruption]
- **Downstream effects:** [Other systems affected]

## Timeline

All times in [UTC/timezone]

| Time | Event |
|------|-------|
| [HH:MM] | [First signs of issue] |
| [HH:MM] | [Alert triggered] |
| [HH:MM] | [On-call acknowledged] |
| [HH:MM] | [Initial investigation began] |
| [HH:MM] | [Root cause identified] |
| [HH:MM] | [Mitigation started] |
| [HH:MM] | [Service restored] |
| [HH:MM] | [Incident resolved] |

## Root Cause Analysis

### What happened

[Detailed technical explanation of the failure]

### Contributing factors

1. **[Factor 1]:** [Explanation]
2. **[Factor 2]:** [Explanation]
3. **[Factor 3]:** [Explanation]

### Why it wasn't caught

[Explanation of why existing safeguards didn't prevent this]

## Detection

- **How was it detected:** [Alert/Customer report/etc]
- **Time to detect:** [Duration from start to detection]
- **Detection gap:** [What could have detected it sooner]

## Response

### What went well

- [Positive aspect 1]
- [Positive aspect 2]

### What could be improved

- [Improvement area 1]
- [Improvement area 2]

### Where we got lucky

- [Lucky factor 1]

## Action Items

| Priority | Action | Owner | Due Date | Status |
|----------|--------|-------|----------|--------|
| P0 | [Immediate fix] | [Name] | [Date] | [Status] |
| P1 | [Short-term improvement] | [Name] | [Date] | [Status] |
| P2 | [Longer-term fix] | [Name] | [Date] | [Status] |

## Lessons Learned

### Technical

- [Technical lesson 1]
- [Technical lesson 2]

### Process

- [Process lesson 1]
- [Process lesson 2]

### Communication

- [Communication lesson]

## Supporting Information

### Related Incidents

- [Link to similar past incidents]

### References

- [Link to dashboards]
- [Link to logs]
- [Link to related documentation]

## Appendix

### Detailed Technical Analysis

[Deep dive into technical details if needed]

### Communication Log

[Key communications during incident]

Complete Example

markdown
# Incident Postmortem: Payment Processing Outage

**Incident ID:** INC-2025-047
**Date:** 2025-10-08
**Duration:** 2 hours 34 minutes
**Severity:** P0
**Status:** Final

**Author:** Sarah Chen
**Reviewers:** Mike Johnson, David Kim, Lisa Wang

## Executive Summary

On October 8, 2025, our payment processing service experienced a complete outage from 14:23 to 16:57 UTC, preventing all customer transactions. The root cause was database connection pool exhaustion, triggered by a slow query introduced in a routine deployment. Approximately 12,000 transactions failed, resulting in an estimated $180,000 in delayed revenue (later recovered) and 340 customer support tickets.

## Impact

### User Impact

- **Users affected:** ~8,500 unique users attempted transactions
- **Transactions failed:** 12,347
- **Geographic scope:** Global (all regions)
- **Duration of impact:** 2 hours 34 minutes

### Business Impact

- **Revenue impact:** $180,000 delayed (recovered via retry)
  - $12,000 permanently lost (cart abandonment)
- **SLA impact:** Breached 99.9% monthly SLA
  - Will provide service credits to enterprise customers
- **Reputational impact:** Moderate
  - 47 negative social media mentions
  - 2 tech blog articles

### Technical Impact

- **Services affected:**
  - Payment Service (complete outage)
  - Order Service (degraded - could create orders but not process payment)
  - Checkout UI (degraded - payment step failed)
- **Data impact:** None - no data loss or corruption
- **Downstream effects:**
  - Fulfillment queue backed up by 2 hours
  - Analytics data gap for incident period

## Timeline

All times in UTC

| Time | Event |
|------|-------|
| 13:45 | Deployment of payment-service v3.2.1 completed |
| 14:15 | Six slow reconciliation queries running concurrently (not yet visible) |
| 14:23 | First customer reports payment failure (support ticket) |
| 14:26 | Automated alert: Payment success rate < 98% |
| 14:28 | On-call engineer (Alex) acknowledges alert |
| 14:32 | Alex begins investigation, checks Stripe status (healthy) |
| 14:38 | Payment success rate drops to 0% |
| 14:40 | Alex escalates to team lead (Sarah) |
| 14:45 | Sarah joins investigation, notices DB connection errors |
| 14:52 | Database team (David) paged |
| 15:05 | David identifies connection pool exhaustion |
| 15:15 | Slow query identified as root cause |
| 15:22 | Decision to rollback deployment |
| 15:28 | Rollback initiated |
| 15:35 | Rollback complete, connections still exhausted |
| 15:42 | Database connection pool forcefully reset |
| 15:48 | Services recovering, success rate at 45% |
| 16:10 | Success rate at 85% |
| 16:45 | Success rate at 99%, monitoring |
| 16:57 | Incident resolved, success rate stable at 99.5% |

## Root Cause Analysis

### What happened

The deployment at 13:45 included a change to the payment reconciliation query that inadvertently removed an index hint. This caused the query to perform a full table scan on the `transactions` table (47M rows) instead of using the `idx_transactions_created_at` index.

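```sql
-- Before (fast, ~50ms): the planner is steered to the created_at index.
-- PostgreSQL has no built-in hints; this form assumes the pg_hint_plan
-- extension, which reads the hint from a comment at the start of the query.
/*+ IndexScan(transactions idx_transactions_created_at) */
SELECT * FROM transactions
WHERE created_at > NOW() - INTERVAL '24 hours'
  AND status = 'pending';

-- After (slow, ~45 seconds): hint removed, planner falls back to a sequential scan
SELECT * FROM transactions
WHERE created_at > NOW() - INTERVAL '24 hours'
  AND status = 'pending';
```
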
The reconciliation job runs every 5 minutes. After deployment:

  1. First execution at 13:50 took 45 seconds instead of 50ms
  2. Second execution at 13:55 started while first was still running
  3. By 14:15, 6 concurrent executions were holding connections
  4. By 14:23, all 100 connections in the pool were consumed
  5. New payment requests couldn't acquire connections and failed
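
The pile-up in the steps above can be modeled with a little arithmetic. The sketch below is a simplified model, not the team's tooling: `active_runs` is a hypothetical helper that assumes runs start on a fixed schedule and each one lasts a fixed time. It shows that runs only stack up when each one outlasts the schedule interval, so six concurrent executions at 14:15 suggest that runs under load were taking far longer than the 45 seconds measured for a single execution (an inference from the figures above, not a measured value).

```python
def active_runs(interval_s: int, duration_s: int, elapsed_s: int) -> int:
    """Count scheduled runs still executing `elapsed_s` seconds after the first start.

    Runs start at 0, interval_s, 2 * interval_s, ...; each lasts duration_s.
    """
    if interval_s <= 0 or duration_s < 0 or elapsed_s < 0:
        raise ValueError("interval must be positive; duration and elapsed non-negative")
    started = elapsed_s // interval_s + 1  # runs started at or before elapsed_s
    # A run started at k * interval_s has finished once k * interval_s + duration_s <= elapsed_s.
    if elapsed_s < duration_s:
        finished = 0
    else:
        finished = (elapsed_s - duration_s) // interval_s + 1
    return started - finished


# 13:50 to 14:15 is 25 minutes (1500 s) on a 5-minute (300 s) schedule.
print(active_runs(300, 45, 1500))    # 1: each run finishes well inside the interval
print(active_runs(300, 1800, 1500))  # 6: hypothetical case where no run has finished yet
```

A guard that skips a run while the previous one is still executing would cap this count at one, at the cost of delayed reconciliation.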

### Contributing factors

1. **Unnoticed index hint removal:** Code review didn't catch the removed hint because the query still returned correct results. There was no performance testing in CI.
2. **No query timeout:** The reconciliation query had no statement timeout, allowing it to run indefinitely and hold connections.
3. **Connection pool sizing:** Pool size of 100 was adequate for normal operations but couldn't absorb runaway queries. There was no circuit breaker for database access.
4. **Monitoring gap:** We alert on payment success rate but not on database connection utilization or query duration anomalies.
5. **Deployment timing:** Deployed at 13:45, just before the reconciliation job's 13:50 execution. An earlier deployment would have surfaced the problem before peak traffic.

### Why it wasn't caught

- **Code review:** Focused on correctness, not performance. The change looked like a code cleanup.
- **Staging environment:** Has only 1M transactions vs 47M in production. Query completed in <1s.
- **Load testing:** Not run for this "minor" change.
- **Canary deployment:** We use canary for traffic, but this was a background job issue.

## Detection

- **How was it detected:** Customer support ticket (14:23), then automated alert (14:26)
- **Time to detect:** 38 minutes from deployment to the first customer report; the automated alert fired 3 minutes after that
- **Detection gap:** Monitoring should have detected:
  - The slow query immediately, via query duration monitoring
  - Connection pool exhaustion before it hit 100%
  - The increase in background job execution time
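
The detection figures can be recomputed directly from the timeline, which is a useful check when filling in this section. The snippet below is a small sketch under the assumption that all timestamps fall on the same UTC day; `minutes_between` is a hypothetical helper, and the four timestamps are copied from the timeline above.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the incident timeline (UTC).
DEPLOYED, FIRST_REPORT, ALERT, RESOLVED = "13:45", "14:23", "14:26", "16:57"

print(minutes_between(DEPLOYED, FIRST_REPORT))  # 38: deployment to first customer report
print(minutes_between(FIRST_REPORT, ALERT))     # 3: customer report to automated alert
print(minutes_between(FIRST_REPORT, RESOLVED))  # 154: total duration, 2 h 34 min
```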

## Response

### What went well

- **Fast escalation:** On-call appropriately escalated when stuck at 14:40
- **Clear communication:** Status updates posted every 15 minutes in #incidents
- **Customer communication:** Status page updated at 14:45, email sent at 15:00
- **Documentation:** Team referenced the existing database troubleshooting runbook
- **Parallel investigation:** Multiple team members divided the work effectively

### What could be improved

- **Initial diagnosis:** Spent about 10 minutes checking external dependencies (Stripe) when the issue was internal
- **Rollback decision:** Took 30 minutes from paging the database team (14:52) to deciding to roll back (15:22); the root cause was confirmed at 15:15
- **Recovery time:** After the rollback, the pool was still exhausted; connections should have been reset immediately
- **Metric visibility:** Had to manually query the database for connection stats

### Where we got lucky

- **No data corruption:** Long-running queries could have caused lock contention and inconsistent state
- **Off-peak timing:** 14:00-17:00 UTC is a lower-traffic window; peak hours would have doubled the impact
- **Recoverable transactions:** Most failed payments could be retried; cart recovery emails brought back 92% of customers

## Action Items

| Priority | Action | Owner | Due Date | Status |
|----------|--------|-------|----------|--------|
| P0 | Add `statement_timeout` to all background jobs | David | Oct 10 | ✅ Done |
| P0 | Add connection pool utilization alerts (>80%) | Sarah | Oct 10 | ✅ Done |
| P0 | Add query duration anomaly detection | Sarah | Oct 15 | ✅ Done |
| P1 | Implement circuit breaker for database access | Alex | Oct 22 | In Progress |
| P1 | Add query plan analysis to CI for changed queries | David | Oct 25 | Not Started |
| P1 | Create runbook for connection pool exhaustion | Sarah | Oct 20 | In Progress |
| P2 | Increase staging data volume to 10M+ rows | Ops Team | Nov 15 | Not Started |
| P2 | Implement connection pool auto-scaling | David | Nov 30 | Not Started |
| P2 | Add deployment time recommendations (avoid peak) | Platform | Dec 15 | Not Started |
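
Two of the P0 monitoring items reduce to simple threshold checks. The following is a rough sketch, not the alerting that was actually deployed: the function names are hypothetical, the 80% threshold comes from the action item, and the "10x the baseline median" rule for duration anomalies is an assumed heuristic that a real system would tune.

```python
from statistics import median


def pool_utilization_alert(active: int, pool_size: int, threshold: float = 0.80) -> bool:
    """True when the share of connections in use exceeds the threshold."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return active / pool_size > threshold


def duration_anomaly(baseline_ms: list[float], sample_ms: float, factor: float = 10.0) -> bool:
    """True when a sample exceeds `factor` times the baseline median (assumed heuristic)."""
    if not baseline_ms:
        raise ValueError("baseline_ms must not be empty")
    return sample_ms > factor * median(baseline_ms)


print(pool_utilization_alert(active=81, pool_size=100))          # True
print(pool_utilization_alert(active=80, pool_size=100))          # False
print(duration_anomaly([48.0, 50.0, 52.0], sample_ms=45_892.4))  # True
```

Either check would have fired well before the pool reached 100/100, which is the detection gap the incident exposed.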

## Lessons Learned

### Technical

- **Query performance is a correctness issue:** A query that's roughly 1000x slower is effectively broken, even if it returns the right data. We need to treat performance regressions as bugs.
- **Connection pools need protection:** Database connections are a limited resource. We need circuit breakers, timeouts, and monitoring to prevent exhaustion.
- **Staging parity matters:** The 47x difference in data volume between staging and production masked the issue completely.
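
To make the circuit breaker idea concrete, here is a minimal sketch of the pattern. It is an illustration rather than the team's implementation: `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are hypothetical, and the class is not thread-safe. After a run of consecutive failures it rejects calls immediately for a cool-down period, then lets one trial call through.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("database circuit open; failing fast")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0
        self._opened_at = None
        return result


def failing_query():
    # Stand-in for a real call that times out waiting for a pooled connection.
    raise TimeoutError("connection acquisition timed out")


breaker = CircuitBreaker(max_failures=2, reset_after_s=30.0)
for _ in range(3):
    try:
        breaker.call(failing_query)
    except CircuitOpenError as exc:
        print("rejected:", exc)
    except TimeoutError as exc:
        print("failed:", exc)
```

The first two calls fail normally and the third is rejected without touching the database, so waiting requests stop queueing behind an exhausted pool.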

### Process

- **Performance testing for "minor" changes:** Any database query change should require performance verification, regardless of perceived risk.
- **Deployment timing awareness:** Background jobs create "hidden" execution windows. We need visibility into job schedules when making deployment decisions.
- **Runbook gaps:** We had a generic database troubleshooting runbook but nothing specific to connection pool exhaustion. Specific runbooks for known failure modes are more useful than generic ones.

### Communication

- **Status page update timing:** We updated the status page at 14:45, 22 minutes after first customer impact. This should be faster.
- **Internal communication worked:** The #incidents channel with regular updates kept everyone informed without disrupting the response team.

## Supporting Information

### Related Incidents

- INC-2024-089: Similar connection pool exhaustion from ORM N+1 queries (different root cause, similar symptoms)
- INC-2025-012: Slow query caused by missing index (caught in staging, no production impact)

### References

## Appendix

### Detailed Technical Analysis

Query execution plan before (EXPLAIN ANALYZE):

Index Scan using idx_transactions_created_at on transactions
  Index Cond: (created_at > (now() - '24:00:00'::interval))
  Filter: (status = 'pending')
  Rows Removed by Filter: 2,341
  Actual time: 12.3..48.7 ms
  Actual rows: 1,247

Query execution plan after:

Seq Scan on transactions
  Filter: ((created_at > (now() - '24:00:00'::interval)) AND (status = 'pending'))
  Rows Removed by Filter: 47,234,891
  Actual time: 3,421.2..45,892.4 ms
  Actual rows: 1,247
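
The difference between these two plans is the kind of regression a CI check on query plans could flag. The sketch below is one possible approach under stated assumptions, not the check the team built: `seq_scanned_tables` and `LARGE_TABLES` are hypothetical, the plan text is abbreviated from the output above, and the regular expression only handles plain (unqualified) table names.

```python
import re

# Matches plan nodes such as "Seq Scan on transactions" or "->  Parallel Seq Scan on orders".
SEQ_SCAN = re.compile(r"^\s*(?:->\s*)?(?:Parallel )?Seq Scan on (\w+)", re.MULTILINE)


def seq_scanned_tables(plan_text: str, large_tables: set[str]) -> set[str]:
    """Return the large tables that the plan reads with a sequential scan."""
    return {m.group(1) for m in SEQ_SCAN.finditer(plan_text)} & large_tables


PLAN_BEFORE = """Index Scan using idx_transactions_created_at on transactions
  Index Cond: (created_at > (now() - '24:00:00'::interval))
  Filter: (status = 'pending')"""

PLAN_AFTER = """Seq Scan on transactions
  Filter: ((created_at > (now() - '24:00:00'::interval)) AND (status = 'pending'))"""

LARGE_TABLES = {"transactions"}  # hypothetical list a team would maintain

print(seq_scanned_tables(PLAN_BEFORE, LARGE_TABLES))  # set()
print(seq_scanned_tables(PLAN_AFTER, LARGE_TABLES))   # {'transactions'}
```

A check like this is only meaningful when the plan comes from a database with production-like data volume, since the planner may legitimately choose a sequential scan on a small table.
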

Connection pool state at peak (15:05):

Active connections: 100/100
Waiting requests: 847
Oldest waiting: 12m 34s
Connection acquisition timeout: 30s

### Communication Log

| Time | Channel | Message |
|------|---------|---------|
| 14:45 | Status Page | "Investigating payment processing issues" |
| 15:00 | Email | Customer notification sent to affected users |
| 15:15 | Status Page | "Identified - Database performance issue" |
| 15:30 | Status Page | "Implementing fix - Rollback in progress" |
| 16:00 | Twitter | Response to customer complaints |
| 17:00 | Status Page | "Resolved - Payments functioning normally" |
| 17:30 | Email | Follow-up to affected customers with details |

## Postmortem Best Practices

### 1. Blameless Culture

- Focus on systems, not individuals
- "How did the system allow this?" not "Who made this mistake?"
- Use "we" language

### 2. Complete the Timeline

- Gather data from logs, chat, and memory while fresh
- Include decision points, not just events
- Note what information was available at each point

### 3. Dig Deep on Root Cause

- Use "5 Whys" technique
- Look for systemic issues, not just proximate causes
- Ask: "What would have prevented this?"

### 4. Actionable Items

- Each item has an owner and due date
- Prioritize ruthlessly
- Follow up and track completion
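
Following up is easier when the action item table can be checked mechanically. The snippet below is a small sketch of such a check, assuming action items are available as structured records; `ActionItem`, `needs_follow_up`, and the sample items are hypothetical and purely illustrative.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    priority: str
    action: str
    owner: Optional[str]
    due: Optional[date]
    done: bool = False


def needs_follow_up(items: list[ActionItem], today: date) -> list[str]:
    """Describe items that lack an owner or due date, or are open past their due date."""
    problems = []
    for item in items:
        if not item.owner:
            problems.append(f"{item.action}: no owner")
        if item.due is None:
            problems.append(f"{item.action}: no due date")
        elif not item.done and item.due < today:
            problems.append(f"{item.action}: overdue since {item.due.isoformat()}")
    return problems


# Illustrative sample data, not taken from a real incident.
items = [
    ActionItem("P0", "Add alert", "Owner A", date(2025, 1, 10), done=True),
    ActionItem("P1", "Write runbook", "Owner B", date(2025, 1, 20)),
    ActionItem("P2", "Resize staging", None, None),
]
for line in needs_follow_up(items, today=date(2025, 1, 25)):
    print(line)
```

Running a check like this before each review meeting keeps unowned or stale items from quietly dropping off the list.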

### 5. Share and Learn

- Publish internally (and externally if appropriate)
- Discuss in team meetings
- Create patterns from repeated issues

---

*The goal of a postmortem is not to assign blame, but to ensure we learn from incidents and continuously improve our systems and processes.*
