Incident Postmortem Template
A structured template for blameless incident analysis with timeline, root cause, and action items.
Table of Contents
Incident Postmortem Template
Postmortems (also called retrospectives or post-incident reviews) help teams learn from incidents without blame. They create institutional knowledge, identify systemic improvements, and prevent recurrence.
Why Use Postmortems?
Benefits:
- Prevents recurrence of similar incidents
- Builds organizational learning
- Identifies systemic weaknesses
- Creates accountability for improvements
- Develops incident response skills
When to write a postmortem:
- Any P0 or P1 incident
- Customer-impacting outages
- Security incidents
- Near-misses with learning value
- When requested by stakeholders
The Template
# Incident Postmortem: [Title]
**Incident ID:** INC-[YYYY]-[NNN]
**Date:** [YYYY-MM-DD]
**Duration:** [X hours Y minutes]
**Severity:** [P0/P1/P2/P3]
**Status:** [Draft | In Review | Final]
**Author:** [Name]
**Reviewers:** [Names]
## Executive Summary
[2-3 sentence summary: what happened, impact, and current status]
## Impact
### User Impact
- **Users affected:** [Number/percentage]
- **Geographic scope:** [Regions]
- **Duration of impact:** [Time]
### Business Impact
- **Revenue impact:** [If applicable]
- **SLA impact:** [If applicable]
- **Reputational impact:** [Assessment]
### Technical Impact
- **Services affected:** [List]
- **Data impact:** [Any data loss/corruption]
- **Downstream effects:** [Other systems affected]
## Timeline
All times in [UTC/timezone]
| Time | Event |
|------|-------|
| [HH:MM] | [First signs of issue] |
| [HH:MM] | [Alert triggered] |
| [HH:MM] | [On-call acknowledged] |
| [HH:MM] | [Initial investigation began] |
| [HH:MM] | [Root cause identified] |
| [HH:MM] | [Mitigation started] |
| [HH:MM] | [Service restored] |
| [HH:MM] | [Incident resolved] |
## Root Cause Analysis
### What happened
[Detailed technical explanation of the failure]
### Contributing factors
1. **[Factor 1]:** [Explanation]
2. **[Factor 2]:** [Explanation]
3. **[Factor 3]:** [Explanation]
### Why it wasn't caught
[Explanation of why existing safeguards didn't prevent this]
## Detection
- **How was it detected:** [Alert/Customer report/etc]
- **Time to detect:** [Duration from start to detection]
- **Detection gap:** [What could have detected it sooner]
## Response
### What went well
- [Positive aspect 1]
- [Positive aspect 2]
### What could be improved
- [Improvement area 1]
- [Improvement area 2]
### Where we got lucky
- [Lucky factor 1]
## Action Items
| Priority | Action | Owner | Due Date | Status |
|----------|--------|-------|----------|--------|
| P0 | [Immediate fix] | [Name] | [Date] | [Status] |
| P1 | [Short-term improvement] | [Name] | [Date] | [Status] |
| P2 | [Longer-term fix] | [Name] | [Date] | [Status] |
## Lessons Learned
### Technical
- [Technical lesson 1]
- [Technical lesson 2]
### Process
- [Process lesson 1]
- [Process lesson 2]
### Communication
- [Communication lesson]
## Supporting Information
### Related Incidents
- [Link to similar past incidents]
### References
- [Link to dashboards]
- [Link to logs]
- [Link to related documentation]
## Appendix
### Detailed Technical Analysis
[Deep dive into technical details if needed]
### Communication Log
[Key communications during incident]Complete Example
# Incident Postmortem: Payment Processing Outage
**Incident ID:** INC-2025-047
**Date:** 2025-10-08
**Duration:** 2 hours 34 minutes
**Severity:** P0
**Status:** Final
**Author:** Sarah Chen
**Reviewers:** Mike Johnson, David Kim, Lisa Wang
## Executive Summary
On October 8, 2025, our payment processing service experienced a complete outage from 14:23 to 16:57 UTC, preventing all customer transactions. The root cause was a database connection pool exhaustion triggered by a slow query introduced in a routine deployment. Approximately 12,000 transactions failed, resulting in an estimated $180,000 in delayed revenue (later recovered) and 340 customer support tickets.
## Impact
### User Impact
- **Users affected:** ~8,500 unique users attempted transactions
- **Transactions failed:** 12,347
- **Geographic scope:** Global (all regions)
- **Duration of impact:** 2 hours 34 minutes
### Business Impact
- **Revenue impact:** $180,000 delayed (recovered via retry)
- $12,000 permanently lost (cart abandonment)
- **SLA impact:** Breached 99.9% monthly SLA
- Will provide service credits to enterprise customers
- **Reputational impact:** Moderate
- 47 negative social media mentions
- 2 tech blog articles
### Technical Impact
- **Services affected:**
- Payment Service (complete outage)
- Order Service (degraded - could create orders but not process payment)
- Checkout UI (degraded - payment step failed)
- **Data impact:** None - no data loss or corruption
- **Downstream effects:**
- Fulfillment queue backed up by 2 hours
- Analytics data gap for incident period
## Timeline
All times in UTC
| Time | Event |
|------|-------|
| 13:45 | Deployment of payment-service v3.2.1 completed |
| 14:15 | Slow query begins executing (not yet visible) |
| 14:23 | First customer reports payment failure (support ticket) |
| 14:26 | Automated alert: Payment success rate < 98% |
| 14:28 | On-call engineer (Alex) acknowledges alert |
| 14:32 | Alex begins investigation, checks Stripe status (healthy) |
| 14:38 | Payment success rate drops to 0% |
| 14:40 | Alex escalates to team lead (Sarah) |
| 14:45 | Sarah joins investigation, notices DB connection errors |
| 14:52 | Database team (David) paged |
| 15:05 | David identifies connection pool exhaustion |
| 15:15 | Slow query identified as root cause |
| 15:22 | Decision to rollback deployment |
| 15:28 | Rollback initiated |
| 15:35 | Rollback complete, connections still exhausted |
| 15:42 | Database connection pool forcefully reset |
| 15:48 | Services recovering, success rate at 45% |
| 16:10 | Success rate at 85% |
| 16:45 | Success rate at 99%, monitoring |
| 16:57 | Incident resolved, success rate stable at 99.5% |
## Root Cause Analysis
### What happened
The deployment at 13:45 included a change to the payment reconciliation query that inadvertently removed an index hint. This caused the query to perform a full table scan on the `transactions` table (47M rows) instead of using the `idx_transactions_created_at` index.
```sql
-- Before (fast, ~50ms)
SELECT * FROM transactions
WHERE created_at > NOW() - INTERVAL '24 hours'
AND status = 'pending'
/*+ INDEX(transactions idx_transactions_created_at) */
-- After (slow, ~45 seconds)
SELECT * FROM transactions
WHERE created_at > NOW() - INTERVAL '24 hours'
AND status = 'pending'The reconciliation job runs every 5 minutes. After deployment:
- First execution at 13:50 took 45 seconds instead of 50ms
- Second execution at 13:55 started while first was still running
- By 14:15, 6 concurrent executions were holding connections
- By 14:23, all 100 connections in the pool were consumed
- New payment requests couldn't acquire connections and failed
Contributing factors
-
Missing index hint removal: Code review didn't catch the removed hint because the query still returned correct results. No performance testing in CI.
-
No query timeout: The reconciliation query had no statement timeout, allowing it to run indefinitely and hold connections.
-
Connection pool sizing: Pool size of 100 was adequate for normal operations but couldn't absorb runaway queries. No circuit breaker for database access.
-
Monitoring gap: We alert on payment success rate but not on database connection utilization or query duration anomalies.
-
Deployment timing: Deployed at 13:45, just before the reconciliation job's 13:50 execution. Earlier deployment would have been caught before peak traffic.
Why it wasn't caught
- Code review: Focused on correctness, not performance. The change looked like a code cleanup.
- Staging environment: Has only 1M transactions vs 47M in production. Query completed in <1s.
- Load testing: Not run for this "minor" change.
- Canary deployment: We use canary for traffic, but this was a background job issue.
Detection
- How was it detected: Customer support ticket (14:23), then automated alert (14:26)
- Time to detect: 38 minutes from deployment, 3 minutes from customer impact
- Detection gap: Should have detected:
- Slow query immediately via query duration monitoring
- Connection pool exhaustion before it hit 100%
- Background job execution time increase
Response
What went well
- Fast escalation: On-call appropriately escalated when stuck at 14:40
- Clear communication: Status updates posted every 15 minutes in #incidents
- Customer communication: Status page updated at 14:45, email sent at 15:00
- Documentation: Team referenced runbook for connection pool issues
- Parallel investigation: Multiple team members effectively divided work
What could be improved
- Initial diagnosis: Spent 10 minutes checking external dependencies (Stripe) when issue was internal
- Rollback decision: Took 30 minutes from identifying root cause to deciding to rollback
- Recovery time: After rollback, pool was still exhausted; should have reset connections immediately
- Metric visibility: Had to manually query database for connection stats
Where we got lucky
- No data corruption: Long-running queries could have caused lock contention and inconsistent state
- Weekend timing: 14:00-17:00 UTC is lower traffic; peak hours would have doubled impact
- Recoverable transactions: Most failed payments could be retried; cart recovery emails brought back 92% of customers
Action Items
| Priority | Action | Owner | Due Date | Status |
|---|---|---|---|---|
| P0 | Add statement_timeout to all background jobs | David | Oct 10 | ✅ Done |
| P0 | Add connection pool utilization alerts (>80%) | Sarah | Oct 10 | ✅ Done |
| P0 | Add query duration anomaly detection | Sarah | Oct 15 | ✅ Done |
| P1 | Implement circuit breaker for database access | Alex | Oct 22 | In Progress |
| P1 | Add query plan analysis to CI for changed queries | David | Oct 25 | Not Started |
| P1 | Create runbook for connection pool exhaustion | Sarah | Oct 20 | In Progress |
| P2 | Increase staging data volume to 10M+ rows | Ops Team | Nov 15 | Not Started |
| P2 | Implement connection pool auto-scaling | David | Nov 30 | Not Started |
| P2 | Add deployment time recommendations (avoid peak) | Platform | Dec 15 | Not Started |
Lessons Learned
Technical
-
Query performance is a correctness issue: A query that's 1000x slower is effectively broken, even if it returns the right data. Need to treat performance regressions as bugs.
-
Connection pools need protection: Database connections are a limited resource. Need circuit breakers, timeouts, and monitoring to prevent exhaustion.
-
Staging parity matters: The 47x difference in data volume between staging and production masked the issue completely.
Process
-
Performance testing for "minor" changes: Any database query change should require performance verification, regardless of perceived risk.
-
Deployment timing awareness: Background jobs create "hidden" execution windows. Need visibility into job schedules during deployment decisions.
-
Runbook gaps: We had a generic database troubleshooting runbook but nothing specific to connection pool exhaustion. Specific runbooks for known failure modes are more useful than generic ones.
Communication
-
Status page update timing: We updated status page at 14:45, 22 minutes after first customer impact. Should be faster.
-
Internal communication worked: The #incidents channel with regular updates kept everyone informed without disrupting the response team.
Supporting Information
Related Incidents
- INC-2024-089: Similar connection pool exhaustion from ORM N+1 queries (different root cause, similar symptoms)
- INC-2025-012: Slow query caused by missing index (caught in staging, no production impact)
References
- Incident Dashboard
- Payment Service Logs
- Database Metrics
- Deployment Pipeline Run
- Original PR with query change
Appendix
Detailed Technical Analysis
Query execution plan before (EXPLAIN ANALYZE):
Index Scan using idx_transactions_created_at on transactions
Index Cond: (created_at > (now() - '24:00:00'::interval))
Filter: (status = 'pending')
Rows Removed by Filter: 2,341
Actual time: 12.3..48.7 ms
Actual rows: 1,247
Query execution plan after:
Seq Scan on transactions
Filter: ((created_at > (now() - '24:00:00'::interval)) AND (status = 'pending'))
Rows Removed by Filter: 47,234,891
Actual time: 3,421.2..45,892.4 ms
Actual rows: 1,247
Connection pool state at peak (15:05):
Active connections: 100/100
Waiting requests: 847
Oldest waiting: 12m 34s
Connection acquisition timeout: 30s
Communication Log
| Time | Channel | Message |
|---|---|---|
| 14:45 | Status Page | "Investigating payment processing issues" |
| 15:00 | Customer notification sent to affected users | |
| 15:15 | Status Page | "Identified - Database performance issue" |
| 15:30 | Status Page | "Implementing fix - Rollback in progress" |
| 16:00 | Response to customer complaints | |
| 17:00 | Status Page | "Resolved - Payments functioning normally" |
| 17:30 | Follow-up to affected customers with details |
## Postmortem Best Practices
### 1. Blameless Culture
- Focus on systems, not individuals
- "How did the system allow this?" not "Who made this mistake?"
- Use "we" language
### 2. Complete the Timeline
- Gather data from logs, chat, and memory while fresh
- Include decision points, not just events
- Note what information was available at each point
### 3. Dig Deep on Root Cause
- Use "5 Whys" technique
- Look for systemic issues, not just proximate causes
- Ask: "What would have prevented this?"
### 4. Actionable Items
- Each item has an owner and due date
- Prioritize ruthlessly
- Follow up and track completion
### 5. Share and Learn
- Publish internally (and externally if appropriate)
- Discuss in team meetings
- Create patterns from repeated issues
---
*The goal of a postmortem is not to assign blame, but to ensure we learn from incidents and continuously improve our systems and processes.*