Mean Time to Recovery (MTTR)
Measure how quickly your team restores service after an incident. A key DORA metric that indicates your organization's resilience.
Overview
Mean Time to Recovery (MTTR), also called Mean Time to Restore, measures how long it takes to restore service after a production incident. It's one of the four key DORA metrics and a critical indicator of your organization's resilience and incident response capabilities.
As a CTO, you're often asked about the resilience of your systems, especially when incidents occur. MTTR directly reflects your team's ability to bounce back from disruptions. It's not just about fixing things quickly; it's about having robust processes in place: effective runbooks, healthy on-call rotations, automated rollbacks, and a culture of learning from past incidents. These are the levers you can pull so that your organization not only recovers swiftly but also grows stronger with each incident.
Why It Matters
- Customer impact: Faster recovery means less downtime
- Revenue protection: Every minute of downtime costs money
- Team stress: Quick recovery reduces firefighting burnout
- Resilience indicator: Shows your ability to handle failures
- Competitive advantage: Reliable systems build customer trust
- Innovation enabler: Fast recovery enables bolder experiments
The MTTR Formula
MTTR = Total Recovery Time ÷ Number of Incidents
Example:
- Incident 1: 30 minutes
- Incident 2: 2 hours
- Incident 3: 45 minutes
- Total: 3 hours 15 minutes (195 minutes)
- MTTR: 195 ÷ 3 = 65 minutes
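The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the function name `mean_time_to_recovery` is illustrative, and the three durations are the ones from the example.

```python
from datetime import timedelta

def mean_time_to_recovery(recovery_times):
    """Total recovery time divided by the number of incidents."""
    if not recovery_times:
        raise ValueError("need at least one incident")
    total = sum(recovery_times, timedelta())
    return total / len(recovery_times)

# The three incidents from the example above.
recovery_times = [
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(minutes=45),
]

mttr = mean_time_to_recovery(recovery_times)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")
```

Working with `timedelta` values rather than raw numbers avoids mixing minutes and hours by accident.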
Incident Lifecycle
Understanding what to measure:
Timeline:
─────────────────────────────────────────────────────────────
Detection  →  Acknowledgment  →  Response       →  Resolution  →  Verification
    ↓               ↓                ↓                 ↓               ↓
  Alert          On-call       Investigation          Fix           Confirm
  fires         responds          begins            deployed        working
│←─── MTTD ───→│←────── MTTR ──────────────────→│
│←───────────── Total Incident Duration ─────────────────→│
Key Timestamps:
- Incident Start: When issue begins (may be before detection)
- Detection: When monitoring/alerts fire (MTTD)
- Acknowledgment: When on-call engineer responds
- Resolution Start: When fix is deployed
- Incident End: When service is fully restored
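Once these timestamps are recorded, the time spent in each phase falls out by subtraction. The sketch below assumes a plain dictionary per incident; the field names and the sample times are hypothetical, not a required schema.

```python
from datetime import datetime

def minutes_between(start, end):
    """Elapsed minutes between two datetimes."""
    return (end - start).total_seconds() / 60

def phase_durations(incident):
    """Split one incident into the phases between its key timestamps."""
    return {
        "detection": minutes_between(incident["started_at"], incident["detected_at"]),
        "acknowledgment": minutes_between(incident["detected_at"], incident["acknowledged_at"]),
        "investigation_and_fix": minutes_between(incident["acknowledged_at"], incident["fix_deployed_at"]),
        "verification": minutes_between(incident["fix_deployed_at"], incident["ended_at"]),
        "total": minutes_between(incident["started_at"], incident["ended_at"]),
    }

# Hypothetical incident used only to exercise the function.
incident = {
    "started_at": datetime(2024, 1, 10, 14, 0),
    "detected_at": datetime(2024, 1, 10, 14, 8),
    "acknowledged_at": datetime(2024, 1, 10, 14, 11),
    "fix_deployed_at": datetime(2024, 1, 10, 14, 36),
    "ended_at": datetime(2024, 1, 10, 14, 42),
}

for phase, minutes in phase_durations(incident).items():
    print(f"{phase}: {minutes:.0f} min")
```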
Recommended Visualizations
1. Histogram (Distribution)
Best for: Understanding recovery time patterns
X-axis: MTTR buckets (0-15m, 15-30m, 30m-1h, 1-4h, 4h+)
Y-axis: Number of incidents
Insight: Reveals if most incidents are quick vs. some very long
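As a rough sketch, the bucket counts behind such a histogram can be computed as follows. The bucket labels come from the X-axis above; sending a value that sits exactly on a boundary (for example 15 minutes) to the higher bucket is an assumption, and the sample data is made up.

```python
from collections import Counter

# Upper bounds in minutes for each bucket; None means "no upper bound".
BUCKETS = [
    ("0-15m", 15),
    ("15-30m", 30),
    ("30m-1h", 60),
    ("1-4h", 240),
    ("4h+", None),
]

def bucket_for(minutes):
    """Return the label of the first bucket whose upper bound exceeds the value."""
    for label, upper in BUCKETS:
        if upper is None or minutes < upper:
            return label

def histogram(recovery_minutes):
    """Count incidents per bucket, keeping empty buckets so the chart has no gaps."""
    counts = Counter(bucket_for(m) for m in recovery_minutes)
    return {label: counts.get(label, 0) for label, _ in BUCKETS}

sample = [12, 25, 28, 45, 90, 150, 300]  # hypothetical recovery times in minutes
print(histogram(sample))
```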
2. Percentile Chart (P50, P75, P95)
Best for: Tracking consistency
Y-axis: MTTR (minutes)
X-axis: Time (weeks/months)
Lines: P50 (median), P75, P95
Target: P95 under 1 hour for elite performance
<MTTRPercentilesChart />
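The percentile lines can be derived from raw recovery times with a small helper. This sketch uses the nearest-rank definition of a percentile, which is one of several common conventions; the sample data is hypothetical and chosen so that a single long incident pulls the mean well above the median.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical recovery times in minutes for one reporting period.
recovery_minutes = [12, 18, 22, 25, 28, 30, 35, 45, 60, 240]

mean = sum(recovery_minutes) / len(recovery_minutes)
print(f"Mean: {mean:.1f} min")
for pct in (50, 75, 95):
    print(f"P{pct}: {percentile(recovery_minutes, pct)} min")
```

Here the mean lands at 51.5 minutes while the median is 28, which is why the percentile view tends to be more informative than a single average.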
3. Gauge (Current Status)
Best for: Executive dashboards
Gauge ranges:
- Elite: under 1 hour (green)
- High: 1 hour to 1 day (blue)
- Medium: 1 day to 1 week (yellow)
- Low: over 1 week (red)
Display: Last 30 days MTTR
<MTTRGauge />
4. Breakdown Chart (Components)
Best for: Identifying bottlenecks
Stacked bar showing:
- Detection time (MTTD)
- Response time (acknowledgment)
- Investigation time
- Fix implementation time
- Verification time
<MTTRBreakdownChart />
Target Ranges (DORA Benchmarks)
| Performance Level | MTTR |
|---|---|
| Elite | Less than one hour |
| High | Less than one day |
| Medium | Between one day and one week |
| Low | More than one week |
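For dashboards or reports, the table above can be turned into a small classifier. This is a sketch only: the function name is illustrative, and the handling of exact boundary values (for instance, exactly one week counting as Medium) is an assumption, since the table does not specify it.

```python
from datetime import timedelta

def dora_level(mttr):
    """Map an MTTR value onto the performance levels in the table above."""
    if mttr < timedelta(hours=1):
        return "Elite"
    if mttr < timedelta(days=1):
        return "High"
    if mttr <= timedelta(weeks=1):
        return "Medium"
    return "Low"

for value in (timedelta(minutes=42), timedelta(minutes=65), timedelta(days=3), timedelta(days=10)):
    print(value, "->", dora_level(value))
```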
By Severity
| Severity | Target MTTR | Acceptable Max |
|---|---|---|
| P0 (Critical) | under 30 minutes | under 1 hour |
| P1 (High) | under 2 hours | under 4 hours |
| P2 (Medium) | under 1 day | under 3 days |
| P3 (Low) | under 1 week | under 2 weeks |
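One way to put the severity table to work is to rate each incident against its target and acceptable maximum. The sketch below copies the thresholds from the table; the three rating labels and the function name are assumptions made for illustration.

```python
from datetime import timedelta

# (target, acceptable max) per severity, taken from the table above.
SEVERITY_TARGETS = {
    "P0": (timedelta(minutes=30), timedelta(hours=1)),
    "P1": (timedelta(hours=2), timedelta(hours=4)),
    "P2": (timedelta(days=1), timedelta(days=3)),
    "P3": (timedelta(weeks=1), timedelta(weeks=2)),
}

def rate_recovery(severity, recovery_time):
    """Compare one incident's recovery time with the thresholds for its severity."""
    target, acceptable = SEVERITY_TARGETS[severity]
    if recovery_time < target:
        return "on target"
    if recovery_time < acceptable:
        return "acceptable"
    return "breach"

print(rate_recovery("P0", timedelta(minutes=18)))
print(rate_recovery("P1", timedelta(hours=2, minutes=30)))
print(rate_recovery("P2", timedelta(days=4)))
```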
By System
- Payment systems: under 15 minutes
- Core API: under 30 minutes
- User-facing features: under 1 hour
- Background jobs: under 4 hours
- Analytics: under 1 day
How to Improve MTTR
1. Improve Detection (Reduce MTTD)
Monitoring:
- Comprehensive metrics (RED: Rate, Errors, Duration)
- Real user monitoring (RUM)
- Synthetic monitoring for critical paths
- Log aggregation and analysis
Alerting:
- Smart alerts (reduce noise)
- Alert on symptoms, not causes
- Escalation policies
- Multiple notification channels
Example Alert:
- alert: HighErrorRate
  # error_rate is a placeholder for your own error-percentage expression
  expr: error_rate > 5
  for: 2m
  labels:
    severity: P1
  annotations:
    description: "{{ $labels.service }} error rate at {{ $value }}%"
    runbook: https://wiki.company.com/runbooks/high-error-rate
2. Improve Response
On-Call:
- Clear on-call schedules
- Enough people in the rotation to avoid burnout
- Fair on-call compensation
- Defined response SLAs (P0: 5 min, P1: 15 min, P2: 1 hour)
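Response SLAs are only useful if they are checked. The following sketch measures acknowledgment time from detection and reports how often the SLA was met; the SLA values are the ones listed above, while the tuple layout and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta

# Acknowledgment SLAs from the list above.
ACK_SLA = {
    "P0": timedelta(minutes=5),
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
}

def ack_within_sla(severity, detected_at, acknowledged_at):
    """True if the on-call engineer acknowledged within the SLA for this severity."""
    return acknowledged_at - detected_at <= ACK_SLA[severity]

def sla_compliance(incidents):
    """Fraction of incidents acknowledged within SLA."""
    met = sum(ack_within_sla(*incident) for incident in incidents)
    return met / len(incidents)

detected = datetime(2024, 1, 10, 14, 8)
# Hypothetical incidents as (severity, detected_at, acknowledged_at).
incidents = [
    ("P0", detected, detected + timedelta(minutes=3)),
    ("P1", detected, detected + timedelta(minutes=20)),
    ("P2", detected, detected + timedelta(minutes=45)),
]

print(f"Ack SLA compliance: {sla_compliance(incidents):.0%}")
```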
Runbooks:
# Runbook: High API Error Rate
## Symptoms
- Error rate above 5%
- Alert: "HighAPIErrorRate"
## Immediate Actions (under 5 min)
1. Check dashboard: https://...
2. Check recent deployments
3. Check dependency status
## Investigation (5-15 min)
1. Review error logs
2. Check database connection pool
3. Verify third-party services
## Resolution Options
1. If recent deployment → rollback
2. If database → increase pool size
3. If third-party → enable circuit breaker
## Escalation
If not resolved in 30 min, page @backend-lead
3. Improve Investigation
Tools:
- Centralized logging (ELK, Splunk, Datadog)
- Distributed tracing (Jaeger, Zipkin, Datadog APM)
- APM tools (New Relic, AppDynamics)
- Database query analyzers
Practices:
- Structured logging
- Correlation IDs
- Service mesh observability
- Feature flags for quick rollback
4. Improve Fix Deployment
Fast Rollback:
# One-command rollback
./rollback.sh production
# Or automated based on error rate
if error_rate > 5% for 5min:
    auto_rollback()
Fast Forward Fix:
- CI/CD pipeline optimized for hotfixes
- Skip non-critical checks in emergency mode
- Separate "hotfix" branch with fast-track deployment
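The automated rollback condition under Fast Rollback (error rate above 5% for 5 minutes) needs a little state to implement, because a single bad sample should not trigger a rollback. The sketch below is one possible shape, assuming error-rate samples arrive with a timestamp in seconds; the class name and the sample values are hypothetical, and the actual rollback call is left to your own tooling.

```python
class SustainedThreshold:
    """Fires once a value has stayed above a threshold for a full window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.breach_started_at = None

    def observe(self, timestamp, value):
        """Record one sample; return True when the breach has lasted the whole window."""
        if value <= self.threshold:
            self.breach_started_at = None  # recovered, so start over
            return False
        if self.breach_started_at is None:
            self.breach_started_at = timestamp
        return timestamp - self.breach_started_at >= self.window_seconds

trigger = SustainedThreshold(threshold=0.05, window_seconds=300)
# Hypothetical (seconds, error rate) samples taken once a minute.
samples = [(0, 0.02), (60, 0.08), (120, 0.09), (180, 0.07), (240, 0.08), (300, 0.09), (360, 0.10)]

for timestamp, error_rate in samples:
    if trigger.observe(timestamp, error_rate):
        print(f"t={timestamp}s: error rate above 5% for 5 minutes, rolling back")
        break
```

Resetting the timer whenever the rate dips back under the threshold keeps brief spikes from causing unnecessary rollbacks.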
5. Improve Verification
Smoke Tests:
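# Automated post-deployment checks (pseudocode: the check_* helpers and
# baseline are placeholders for your own monitoring queries)
assert check_api_health()
assert check_error_rate() < 0.01         # under 1%
assert check_latency() < baseline * 1.1  # within 10% of baseline
assert check_key_metrics()
Gradual Rollout: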
- Deploy fix to canary first
- Monitor for 5 minutes
- Gradually increase traffic
- Full rollout only if metrics healthy
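The gradual rollout steps above can be sketched as a small loop. Everything environment-specific is passed in as a callback: `set_traffic_percent` and `metrics_healthy` are hypothetical hooks into your deployment and monitoring systems, and the stage percentages are assumptions. Only the 5-minute monitoring period comes from the list above.

```python
import time

def gradual_rollout(set_traffic_percent, metrics_healthy,
                    stages=(5, 25, 50, 100), soak_seconds=300, wait=time.sleep):
    """Shift traffic to the fix stage by stage; back out if metrics turn unhealthy."""
    for percent in stages:
        set_traffic_percent(percent)
        wait(soak_seconds)  # let metrics accumulate before judging this stage
        if not metrics_healthy():
            set_traffic_percent(0)  # take the fix back out of rotation
            return False
    return True

# Dry run with stand-in callbacks, so nothing real is touched and nothing sleeps.
applied = []
ok = gradual_rollout(applied.append, lambda: True, wait=lambda seconds: None)
print(ok, applied)
```

Injecting `wait` makes the loop easy to test without actually pausing for five minutes per stage.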
Common Pitfalls
❌ Measuring Only Average
Problem: Average hides long-tail incidents
Solution: Track P50, P75, P95, P99
❌ Starting Timer at Detection
Problem: Misses actual customer impact
Solution: Measure from incident start (when issue began)
❌ Declaring Victory Too Early
Problem: Marking incident resolved before verification
Solution: Include verification time in MTTR
❌ Not Segmenting by Severity
Problem: Slow-to-resolve, low-severity (P3) incidents skew the overall number
Solution: Track MTTR separately by severity
❌ Optimizing for the Metric
Problem: Closing incidents quickly without proper fix
Solution: Track reopen rate and customer satisfaction
❌ Hero Culture
Problem: Relying on specific individuals to resolve incidents
Solution: Document knowledge, distribute expertise
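The reopen rate mentioned under "Optimizing for the Metric" is straightforward to compute alongside MTTR. A minimal sketch, assuming each incident record carries `resolved` and `reopened` flags (hypothetical field names, with made-up sample data):

```python
def reopen_rate(incidents):
    """Fraction of resolved incidents that were later reopened."""
    resolved = [incident for incident in incidents if incident["resolved"]]
    if not resolved:
        return 0.0
    reopened = sum(1 for incident in resolved if incident["reopened"])
    return reopened / len(resolved)

# Hypothetical incident records.
incidents = [
    {"id": "INC-1", "resolved": True, "reopened": False},
    {"id": "INC-2", "resolved": True, "reopened": True},
    {"id": "INC-3", "resolved": True, "reopened": False},
    {"id": "INC-4", "resolved": False, "reopened": False},
]

print(f"Reopen rate: {reopen_rate(incidents):.0%}")
```

A falling MTTR paired with a rising reopen rate is a hint that incidents are being closed before they are really fixed.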
Implementation Guide
Week 1: Instrumentation
# Track incident timestamps
from datetime import datetime, timezone
from uuid import uuid4

incident = {
    'id': str(uuid4()),
    'started_at': datetime.now(timezone.utc),  # When issue began
    'detected_at': None,       # When alert fired
    'acknowledged_at': None,   # When on-call responded
    'resolved_at': None,       # When fix deployed
    'verified_at': None,       # When confirmed working
    'severity': 'P1',
    'service': 'api',
    'root_cause': None
}
Week 2: Baseline
- Track all incidents for 2 weeks
- Calculate current MTTR
- Break down by component (detection, response, fix, verify)
- Identify slowest incidents and why
Week 3: Quick Wins
- Create runbooks for top 5 incident types
- Set up one-command rollback
- Add alerting for missing monitors
- Document on-call procedures
Week 4: Long-Term Improvements
- Implement distributed tracing
- Build automated remediation for common issues
- Create incident dashboard
- Start blameless post-mortems
Dashboard Example
Executive View
┌──────────────────────────────────────────────┐
│ MTTR: 42 minutes                             │
│ ██████████████████░░░░░░░ Elite              │
│                                              │
│ Last 30 Days:                                │
│ • P50: 28 minutes                            │
│ • P95: 2.3 hours                             │
│ • Incidents: 15                              │
│                                              │
│ Trend: ↓ 15% improvement vs. last month      │
└──────────────────────────────────────────────┘
Operations View
Incident Breakdown (Last 30 Days)
─────────────────────────────────────────────────────
Severity   Count   MTTR        Longest     Shortest
─────────────────────────────────────────────────────
P0         2       18 min      25 min      12 min
P1         5       45 min      2.5 hours   15 min
P2         8       1.2 hours   4 hours     30 min
─────────────────────────────────────────────────────
Time Breakdown (Average)
─────────────────────────────────────────────────────
Detection:        8 minutes   (19%)
Acknowledgment:   3 minutes   (7%)
Investigation:    15 minutes  (36%)
Fix Deploy:       10 minutes  (24%)
Verification:     6 minutes   (14%)
─────────────────────────────────────────────────────
Total:            42 minutes
Related Metrics
The Four Horsemen of Reliability:
- MTTR: How fast you recover
- MTTD (Mean Time to Detection): How fast you notice
- MTTA (Mean Time to Acknowledgment): How fast you respond
- MTBF (Mean Time Between Failures): How often you fail
Related DORA metrics:
- Change Failure Rate: Are deployments causing incidents?
- Deployment Frequency: Does high velocity increase incidents?
- Lead Time: Can you deploy fixes quickly?
Business Metrics:
- System Uptime: Overall availability
- Customer Impact: How many users affected?
- Revenue Impact: Money lost during downtime
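MTTR, MTBF, and uptime are linked by simple arithmetic, which the following sketch illustrates. It uses the common definitions MTBF = uptime ÷ number of failures and availability = MTBF ÷ (MTBF + MTTR); the 30-day period is an assumption, and the recovery times reuse the three incidents from the formula example.

```python
def mtbf(period_minutes, recovery_minutes):
    """Mean time between failures: uptime in the period divided by the failure count."""
    uptime = period_minutes - sum(recovery_minutes)
    return uptime / len(recovery_minutes)

def availability(mtbf_minutes, mttr_minutes):
    """Steady-state availability from MTBF and MTTR."""
    return mtbf_minutes / (mtbf_minutes + mttr_minutes)

period = 30 * 24 * 60      # a 30-day window, in minutes
recovery = [30, 120, 45]   # recovery times in minutes

mttr_minutes = sum(recovery) / len(recovery)
mtbf_minutes = mtbf(period, recovery)

print(f"MTTR: {mttr_minutes:.0f} min")
print(f"MTBF: {mtbf_minutes:.0f} min")
print(f"Availability: {availability(mtbf_minutes, mttr_minutes):.3%}")
```

The same availability can be improved either by failing less often (raising MTBF) or by recovering faster (lowering MTTR).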
Tools & Integrations
Incident Management
- PagerDuty: Incident response platform
- Opsgenie: On-call and alert management
- Incident.io: Modern incident management
- FireHydrant: Incident response orchestration
- VictorOps (Splunk): Incident management
Monitoring & Observability
- Datadog: Full-stack monitoring
- New Relic: Application performance
- Prometheus + Grafana: Open-source monitoring
- Sentry: Error tracking
- Honeycomb: Observability platform
Communication
- Slack: Incident channels
- MS Teams: Incident response
- Zoom: Incident war rooms
- Statuspage: Customer communication
DIY Approach
import pandas as pd

# Load incidents (here from a CSV export), parsing the timestamp columns
incidents = pd.read_csv('incidents.csv', parse_dates=['started_at', 'resolved_at'])
# Recovery time per incident, in minutes
incidents['mttr'] = (incidents['resolved_at'] - incidents['started_at']).dt.total_seconds() / 60
# Overall MTTR
print(f"MTTR: {incidents['mttr'].mean():.1f} minutes")
# By severity
print("\nMTTR by Severity:")
print(incidents.groupby('severity')['mttr'].agg(['mean', 'median', 'max']))
# Percentiles
print(f"\nP50: {incidents['mttr'].quantile(0.5):.1f} min")
print(f"P95: {incidents['mttr'].quantile(0.95):.1f} min")
Questions to Ask
For Leadership
- Are we responding to incidents fast enough?
- Do we have adequate on-call coverage?
- Are certain teams or services outliers?
- Do we need better tooling or training?
For Teams
- What slows down our incident response?
- Do we have good runbooks?
- Can we automate common fixes?
- Do we learn from every incident?
For Individuals
- How confident am I responding to incidents?
- Do I know where to look for information?
- Can I deploy a fix quickly?
- Who do I escalate to?
Success Stories
SaaS Company
- Before: 4.5-hour MTTR, manual response
- After: 22-minute MTTR, automated recovery
- Changes:
- Implemented one-click rollback
- Created comprehensive runbooks
- Automated common fixes (restart services, clear cache)
- Added distributed tracing
- Impact: 92% reduction in MTTR, 75% reduction in customer complaints
E-commerce Platform
- Before: 2-hour MTTR, frequent escalations
- After: 35-minute MTTR, self-service recovery
- Changes:
- Improved monitoring (5x more metrics)
- Real-time alerting vs. 5-minute delay
- Automated traffic shifting to healthy regions
- Weekly incident response drills
- Impact: 71% reduction in MTTR, $2M annual savings from reduced downtime
Advanced Topics
Chaos Engineering
Test your recovery time with controlled failures:
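# Randomly kill services to test recovery
import random

def chaos_monkey():
    # kill_random_instance, start_timer and monitor_recovery are
    # placeholders for your own tooling
    if random.random() < 0.01:  # 1% chance
        kill_random_instance()
        start_timer()
        monitor_recovery()
Automated Remediation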
# Auto-heal common issues
rules:
  - trigger: high_memory_usage
    threshold: 90%
    action: restart_service
  - trigger: database_connection_pool_exhausted
    action: scale_up_connections
  - trigger: high_error_rate
    threshold: 5%
    duration: 5min
    action: rollback_deployment
Game Days
Practice incident response:
- Monthly: Simulate incidents
- Rotate: Different team members lead
- Measure: Track MTTR during drills
- Improve: Refine runbooks based on learnings
Balancing MTTR with Other Metrics
Ideal State:
✅ High Deployment Frequency
✅ Low Change Failure Rate
✅ Fast Lead Time
✅ Low MTTR
Common Trade-offs:
⚖️ Fast recovery vs. thorough investigation
⚖️ Quick fix vs. root cause resolution
⚖️ Speed vs. prevention
Best Practice:
- Recover quickly (MTTR)
- Investigate thoroughly after (post-mortem)
- Implement preventive measures (reduce future incidents)
Frequently Asked Questions
What is MTTR?
MTTR, or Mean Time to Recovery, measures the average time it takes to restore service after a production incident. It's a critical metric for assessing the resilience and efficiency of your incident management processes.
How is MTTR calculated?
MTTR is calculated by dividing the total downtime by the number of incidents over a specific period. This provides an average recovery time, helping teams identify trends and areas for improvement.
What is a good MTTR?
A good MTTR varies by industry and system criticality, but elite teams often aim for less than one hour. According to the 2022 State of DevOps Report, high-performing teams typically recover in less than a day.
What is the difference between MTTR, MTTD, MTTA, and MTBF?
MTTR measures recovery time. MTTD is the time to detect an issue. MTTA is the time to acknowledge it. MTBF measures the average time between failures. Together, they provide a comprehensive view of system reliability.
How can a CTO drive MTTR down without burning out the on-call team?
CTOs can reduce MTTR by automating repetitive tasks, ensuring clear documentation and runbooks, maintaining healthy on-call rotations, and fostering a culture of continuous improvement and learning.
Should MTTR include weekend and off-hours incidents the same as business hours?
Including all incidents, regardless of timing, provides a more accurate picture of system resilience. However, it's essential to consider the impact of off-hours incidents on team workload and adjust processes accordingly.
Conclusion
MTTR measures your resilience—not just your reliability. Elite teams recover from incidents in under an hour through comprehensive monitoring, clear runbooks, automated remediation, and regular practice. Focus on reducing each component: faster detection, quicker response, streamlined investigation, easy rollback, and automated verification. Remember: incidents will happen. What matters is how quickly you recover and what you learn from each one.
Start Today:
- Instrument your incident tracking
- Calculate your current MTTR
- Find your biggest bottleneck
- Implement one improvement
- Measure the impact
- Repeat