# Incident Postmortem Template

Postmortems (also called retrospectives or post-incident reviews) help teams learn from incidents without blame. They create institutional knowledge, identify systemic improvements, and prevent recurrence.

## Why Use Postmortems?

Benefits:

- Prevents recurrence of similar incidents
- Builds organizational learning
- Identifies systemic weaknesses
- Creates accountability for improvements
- Develops incident response skills

When to write a postmortem:

- Any P0 or P1 incident
- Customer-impacting outages
- Security incidents
- Near-misses with learning value
- When requested by stakeholders

## The Template

The template below is Markdown source; copy it and fill in the bracketed placeholders.

# Incident Postmortem: [Title]

**Incident ID:** INC-[YYYY]-[NNN]
**Date:** [YYYY-MM-DD]
**Duration:** [X hours Y minutes]
**Severity:** [P0/P1/P2/P3]
**Status:** [Draft | In Review | Final]

**Author:** [Name]
**Reviewers:** [Names]

## Executive Summary

[2-3 sentence summary: what happened, impact, and current status]

## Impact

### User Impact

- **Users affected:** [Number/percentage]
- **Geographic scope:** [Regions]
- **Duration of impact:** [Time]

### Business Impact

- **Revenue impact:** [If applicable]
- **SLA impact:** [If applicable]
- **Reputational impact:** [Assessment]

### Technical Impact

- **Services affected:** [List]
- **Data impact:** [Any data loss/corruption]
- **Downstream effects:** [Other systems affected]

## Timeline

All times in [UTC/timezone]

| Time | Event |
|------|-------|
| [HH:MM] | [First signs of issue] |
| [HH:MM] | [Alert triggered] |
| [HH:MM] | [On-call acknowledged] |
| [HH:MM] | [Initial investigation began] |
| [HH:MM] | [Root cause identified] |
| [HH:MM] | [Mitigation started] |
| [HH:MM] | [Service restored] |
| [HH:MM] | [Incident resolved] |

## Root Cause Analysis

### What happened

[Detailed technical explanation of the failure]

### Contributing factors

1. **[Factor 1]:** [Explanation]
2. **[Factor 2]:** [Explanation]
3. **[Factor 3]:** [Explanation]

### Why it wasn't caught

[Explanation of why existing safeguards didn't prevent this]

## Detection

- **How was it detected:** [Alert/Customer report/etc]
- **Time to detect:** [Duration from start to detection]
- **Detection gap:** [What could have detected it sooner]

## Response

### What went well

- [Positive aspect 1]
- [Positive aspect 2]

### What could be improved

- [Improvement area 1]
- [Improvement area 2]

### Where we got lucky

- [Lucky factor 1]

## Action Items

| Priority | Action | Owner | Due Date | Status |
|----------|--------|-------|----------|--------|
| P0 | [Immediate fix] | [Name] | [Date] | [Status] |
| P1 | [Short-term improvement] | [Name] | [Date] | [Status] |
| P2 | [Longer-term fix] | [Name] | [Date] | [Status] |

## Lessons Learned

### Technical

- [Technical lesson 1]
- [Technical lesson 2]

### Process

- [Process lesson 1]
- [Process lesson 2]

### Communication

- [Communication lesson]

## Supporting Information

### Related Incidents

- [Link to similar past incidents]

### References

- [Link to dashboards]
- [Link to logs]
- [Link to related documentation]

## Appendix

### Detailed Technical Analysis

[Deep dive into technical details if needed]

### Communication Log

[Key communications during incident]

## Complete Example

The following example shows the template filled in for a payment processing outage.

# Incident Postmortem: Payment Processing Outage

**Incident ID:** INC-2025-047
**Date:** 2025-10-08
**Duration:** 2 hours 34 minutes
**Severity:** P0
**Status:** Final

**Author:** Sarah Chen
**Reviewers:** Mike Johnson, David Kim, Lisa Wang

## Executive Summary

On October 8, 2025, our payment processing service experienced a complete outage from 14:23 to 16:57 UTC, preventing all customer transactions. The root cause was database connection pool exhaustion triggered by a slow query introduced in a routine deployment. Approximately 12,000 transactions failed, resulting in an estimated $180,000 in delayed revenue (later recovered) and 340 customer support tickets.

## Impact

### User Impact

- **Users affected:** ~8,500 unique users attempted transactions
- **Transactions failed:** 12,347
- **Geographic scope:** Global (all regions)
- **Duration of impact:** 2 hours 34 minutes

### Business Impact

- **Revenue impact:** $180,000 delayed (recovered via retry)
  - $12,000 permanently lost (cart abandonment)
- **SLA impact:** Breached 99.9% monthly SLA
  - Will provide service credits to enterprise customers
- **Reputational impact:** Moderate
  - 47 negative social media mentions
  - 2 tech blog articles

### Technical Impact

- **Services affected:**
  - Payment Service (complete outage)
  - Order Service (degraded - could create orders but not process payment)
  - Checkout UI (degraded - payment step failed)
- **Data impact:** None - no data loss or corruption
- **Downstream effects:**
  - Fulfillment queue backed up by 2 hours
  - Analytics data gap for incident period

## Timeline

All times in UTC

| Time | Event |
|------|-------|
| 13:45 | Deployment of payment-service v3.2.1 completed |
| 14:15 | Slow query begins executing (not yet visible) |
| 14:23 | First customer reports payment failure (support ticket) |
| 14:26 | Automated alert: Payment success rate < 98% |
| 14:28 | On-call engineer (Alex) acknowledges alert |
| 14:32 | Alex begins investigation, checks Stripe status (healthy) |
| 14:38 | Payment success rate drops to 0% |
| 14:40 | Alex escalates to team lead (Sarah) |
| 14:45 | Sarah joins investigation, notices DB connection errors |
| 14:52 | Database team (David) paged |
| 15:05 | David identifies connection pool exhaustion |
| 15:15 | Slow query identified as root cause |
| 15:22 | Decision to rollback deployment |
| 15:28 | Rollback initiated |
| 15:35 | Rollback complete, connections still exhausted |
| 15:42 | Database connection pool forcefully reset |
| 15:48 | Services recovering, success rate at 45% |
| 16:10 | Success rate at 85% |
| 16:45 | Success rate at 99%, monitoring |
| 16:57 | Incident resolved, success rate stable at 99.5% |
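
The intervals quoted in this report can be cross-checked against the timeline. The following is a minimal Python sketch, assuming all events fall on the same UTC day; the helper name `minutes_between` is hypothetical and not part of any incident tooling.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps copied from the timeline above.
impact_to_alert = minutes_between("14:23", "14:26")
impact_to_resolution = minutes_between("14:23", "16:57")

print(impact_to_alert)                   # 3
print(divmod(impact_to_resolution, 60))  # (2, 34)
```

Deriving durations from the timeline rather than typing them by hand keeps the summary, impact, and detection sections consistent with each other.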

## Root Cause Analysis

### What happened

The deployment at 13:45 included a change to the payment reconciliation query that inadvertently removed an index hint. This caused the query to perform a full table scan on the `transactions` table (47M rows) instead of using the `idx_transactions_created_at` index.

```sql
-- Before (fast, ~50ms)
SELECT * FROM transactions
WHERE created_at > NOW() - INTERVAL '24 hours'
  AND status = 'pending'
/*+ INDEX(transactions idx_transactions_created_at) */

-- After (slow, ~45 seconds)
SELECT * FROM transactions
WHERE created_at > NOW() - INTERVAL '24 hours'
  AND status = 'pending'
```

The reconciliation job runs every 5 minutes. After deployment:

1. The first execution at 13:50 took 45 seconds instead of 50ms
2. The second execution at 13:55 started while the first was still running
3. By 14:15, 6 concurrent executions were holding connections
4. By 14:23, all 100 connections in the pool were consumed
5. New payment requests couldn't acquire connections and failed

### Contributing factors

1. **Unnoticed index hint removal:** Code review didn't catch the removed hint because the query still returned correct results. There was no performance testing in CI.
2. **No query timeout:** The reconciliation query had no statement timeout, allowing it to run indefinitely and hold connections.
3. **Connection pool sizing:** A pool size of 100 was adequate for normal operations but couldn't absorb runaway queries. There was no circuit breaker for database access.
4. **Monitoring gap:** We alert on payment success rate but not on database connection utilization or query duration anomalies.
5. **Deployment timing:** The deployment completed at 13:45, just before the reconciliation job's 13:50 execution. Deploying earlier would have surfaced the problem before peak traffic.

### Why it wasn't caught

- **Code review:** Focused on correctness, not performance. The change looked like a code cleanup.
- **Staging environment:** Has only 1M transactions vs 47M in production, so the query completed in <1s.
- **Load testing:** Not run for this "minor" change.
- **Canary deployment:** We use canaries for request traffic, but this was a background job issue.

## Detection

- **How was it detected:** Customer support ticket (14:23), then automated alert (14:26)
- **Time to detect:** 38 minutes from deployment to the first customer report, 3 minutes from customer impact to the automated alert
- **Detection gap:** Should have detected:
  - Slow query immediately via query duration monitoring
  - Connection pool exhaustion before it hit 100%
  - Background job execution time increase
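
A pool utilization check of the kind described in the detection gap could look roughly like the sketch below. It is illustrative only: the function name is hypothetical, the 80% threshold mirrors the alert listed in the action items, and how the active connection count is read depends on the pool library in use.

```python
def pool_utilization_alert(active: int, pool_size: int, threshold: float = 0.80) -> bool:
    """Return True when the share of connections in use exceeds the threshold."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return active / pool_size > threshold


print(pool_utilization_alert(active=60, pool_size=100))   # False
print(pool_utilization_alert(active=85, pool_size=100))   # True
print(pool_utilization_alert(active=100, pool_size=100))  # True
```

Alerting on a fraction of the pool rather than on full exhaustion leaves time to react before new requests start failing.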

## Response

### What went well

- **Fast escalation:** On-call escalated appropriately at 14:40 after getting stuck
- **Clear communication:** Status updates posted every 15 minutes in #incidents
- **Customer communication:** Status page updated at 14:45, email sent at 15:00
- **Documentation:** Team referenced the generic database troubleshooting runbook
- **Parallel investigation:** Multiple team members divided the work effectively

### What could be improved

- **Initial diagnosis:** Spent 10 minutes checking external dependencies (Stripe) when the issue was internal
- **Rollback decision:** Took 7 minutes (15:15 to 15:22) from identifying the root cause to deciding to roll back
- **Recovery time:** After the rollback, the pool was still exhausted; we should have reset connections immediately
- **Metric visibility:** Had to manually query the database for connection stats

### Where we got lucky

- **No data corruption:** Long-running queries could have caused lock contention and inconsistent state
- **Off-peak timing:** 14:00-17:00 UTC is lower traffic; peak hours would have doubled the impact
- **Recoverable transactions:** Most failed payments could be retried; cart recovery emails brought back 92% of customers

## Action Items

| Priority | Action | Owner | Due Date | Status |
|----------|--------|-------|----------|--------|
| P0 | Add `statement_timeout` to all background jobs | David | Oct 10 | ✅ Done |
| P0 | Add connection pool utilization alerts (>80%) | Sarah | Oct 10 | ✅ Done |
| P0 | Add query duration anomaly detection | Sarah | Oct 15 | ✅ Done |
| P1 | Implement circuit breaker for database access | Alex | Oct 22 | In Progress |
| P1 | Add query plan analysis to CI for changed queries | David | Oct 25 | Not Started |
| P1 | Create runbook for connection pool exhaustion | Sarah | Oct 20 | In Progress |
| P2 | Increase staging data volume to 10M+ rows | Ops Team | Nov 15 | Not Started |
| P2 | Implement connection pool auto-scaling | David | Nov 30 | Not Started |
| P2 | Add deployment time recommendations (avoid peak) | Platform | Dec 15 | Not Started |
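
As one possible shape for the query duration anomaly detection listed above, the sketch below compares the latest run against the median of recent runs. The function name, the factor of 10, and the sample durations are assumptions chosen for illustration, not the detection logic the team deployed.

```python
from statistics import median


def is_duration_anomaly(baseline_ms: list[float], latest_ms: float, factor: float = 10.0) -> bool:
    """Flag a run that is more than `factor` times the median of recent runs."""
    if not baseline_ms:
        return False  # no history yet, nothing to compare against
    return latest_ms > factor * median(baseline_ms)


recent = [48.0, 52.0, 50.0, 49.0, 51.0]       # hypothetical healthy runs (~50 ms)
print(is_duration_anomaly(recent, 55.0))      # False
print(is_duration_anomaly(recent, 45_000.0))  # True
```

The median is used instead of the mean so that a single earlier slow run does not inflate the baseline.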

## Lessons Learned

### Technical

- **Query performance is a correctness issue:** A query that's roughly 1000x slower is effectively broken, even if it returns the right data. We need to treat performance regressions as bugs.
- **Connection pools need protection:** Database connections are a limited resource. We need circuit breakers, timeouts, and monitoring to prevent exhaustion.
- **Staging parity matters:** The 47x difference in data volume between staging and production masked the issue completely.
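
To make the circuit breaker idea concrete, here is a minimal single-threaded sketch. The class names, thresholds, and the injectable `clock` parameter are all assumptions for illustration; a production breaker would also need thread safety and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("database circuit is open")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def failing_query():
    # Stand-in for a database call that cannot get a connection.
    raise TimeoutError("could not acquire connection")


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
for _ in range(2):
    try:
        breaker.call(failing_query)
    except TimeoutError:
        pass

try:
    breaker.call(failing_query)
except CircuitOpenError as exc:
    print(exc)  # database circuit is open
```

Once open, the breaker rejects calls immediately instead of letting them queue for a connection, which keeps waiting requests from piling up behind an exhausted pool.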

### Process

- **Performance testing for "minor" changes:** Any database query change should require performance verification, regardless of perceived risk.
- **Deployment timing awareness:** Background jobs create "hidden" execution windows. We need visibility into job schedules during deployment decisions.
- **Runbook gaps:** We had a generic database troubleshooting runbook but nothing specific to connection pool exhaustion. Specific runbooks for known failure modes are more useful than generic ones.

### Communication

- **Status page update timing:** We updated the status page at 14:45, 22 minutes after first customer impact. This should be faster.
- **Internal communication worked:** The #incidents channel with regular updates kept everyone informed without disrupting the response team.

## Supporting Information

### Related Incidents

- **INC-2024-089:** Similar connection pool exhaustion from ORM N+1 queries (different root cause, similar symptoms)
- **INC-2025-012:** Slow query caused by missing index (caught in staging, no production impact)

### References

## Appendix

### Detailed Technical Analysis

Query execution plan before (EXPLAIN ANALYZE):

Index Scan using idx_transactions_created_at on transactions
  Index Cond: (created_at > (now() - '24:00:00'::interval))
  Filter: (status = 'pending')
  Rows Removed by Filter: 2,341
  Actual time: 12.3..48.7 ms
  Actual rows: 1,247

Query execution plan after:

Seq Scan on transactions
  Filter: ((created_at > (now() - '24:00:00'::interval)) AND (status = 'pending'))
  Rows Removed by Filter: 47,234,891
  Actual time: 3,421.2..45,892.4 ms
  Actual rows: 1,247
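
The difference between the two plans above is the kind of signal a CI check on changed queries could look for. The sketch below is a rough illustration under stated assumptions: the plan text is obtained elsewhere (for example from `EXPLAIN`), and both function names and the set of "large" tables are hypothetical.

```python
import re


def sequential_scans(plan_text: str) -> list[str]:
    """Return the names of tables read with a sequential scan in a plan."""
    return re.findall(r"Seq Scan on (\w+)", plan_text)


def plan_is_acceptable(plan_text: str, large_tables: set[str]) -> bool:
    """Reject plans that sequentially scan any table listed as large."""
    return not any(table in large_tables for table in sequential_scans(plan_text))


before = "Index Scan using idx_transactions_created_at on transactions"
after = "Seq Scan on transactions"

print(plan_is_acceptable(before, {"transactions"}))  # True
print(plan_is_acceptable(after, {"transactions"}))   # False
```

A text match like this is deliberately crude; it only helps if the environment producing the plan has data volumes close enough to production for the planner to make the same choice.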

Connection pool state at peak (15:05):

Active connections: 100/100
Waiting requests: 847
Oldest waiting: 12m 34s
Connection acquisition timeout: 30s

### Communication Log

| Time | Channel | Message |
|------|---------|---------|
| 14:45 | Status Page | "Investigating payment processing issues" |
| 15:00 | Email | Customer notification sent to affected users |
| 15:15 | Status Page | "Identified - Database performance issue" |
| 15:30 | Status Page | "Implementing fix - Rollback in progress" |
| 16:00 | Twitter | Response to customer complaints |
| 17:00 | Status Page | "Resolved - Payments functioning normally" |
| 17:30 | Email | Follow-up to affected customers with details |


## Postmortem Best Practices

### 1. Blameless Culture

- Focus on systems, not individuals
- "How did the system allow this?" not "Who made this mistake?"
- Use "we" language

### 2. Complete the Timeline

- Gather data from logs, chat, and memory while fresh
- Include decision points, not just events
- Note what information was available at each point

### 3. Dig Deep on Root Cause

- Use "5 Whys" technique
- Look for systemic issues, not just proximate causes
- Ask: "What would have prevented this?"

### 4. Actionable Items

- Each item has an owner and due date
- Prioritize ruthlessly
- Follow up and track completion
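
Tracking can be partly automated. As a small sketch, assuming action items are kept in some structured form, the check below reports items that lack an owner or due date or that are overdue; the `ActionItem` fields and the `problems` helper are hypothetical names, and the sample dates are illustrative.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    priority: str
    action: str
    owner: Optional[str] = None
    due: Optional[date] = None
    done: bool = False


def problems(item: ActionItem, today: date) -> list[str]:
    """List what stops an item from being actionable or on track."""
    found = []
    if not item.owner:
        found.append("missing owner")
    if item.due is None:
        found.append("missing due date")
    elif not item.done and item.due < today:
        found.append("overdue")
    return found


items = [
    ActionItem("P0", "Add statement timeouts", owner="David", due=date(2025, 10, 10), done=True),
    ActionItem("P1", "Implement circuit breaker", owner="Alex", due=date(2025, 10, 22)),
    ActionItem("P2", "Increase staging data volume"),
]
for item in items:
    print(item.action, problems(item, today=date(2025, 10, 25)))
```

Running a check like this on a schedule turns "follow up" from a good intention into a recurring report.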

### 5. Share and Learn

- Publish internally (and externally if appropriate)
- Discuss in team meetings
- Create patterns from repeated issues

---

*The goal of a postmortem is not to assign blame, but to ensure we learn from incidents and continuously improve our systems and processes.*
