Tech Tree · Engineering
SRE & Resilience
Advance your reliability practice from reactive break-fix to continuous chaos engineering. Each node represents a concrete capability — monitoring, SLOs, on-call discipline, fault injection — with realistic effort estimates and cross-track dependencies.
Maturity tiers
Reactive
You find out about outages from customers. Engineers are paged ad hoc, with no formal runbooks, and incident learnings rarely make it back into the system.
Proactive
Monitoring, alerting, and a real on-call rotation exist. Incidents follow a process; postmortems are written but inconsistently acted on.
Engineered
Reliability is owned. SLOs and error budgets govern release pace, distributed tracing makes root-causing fast, and chaos experiments validate assumptions before production breaks them.
Autonomous
The system catches and corrects most failures before humans notice. Continuous chaos runs in production, AIOps spots anomalies, and reliability is encoded as policy.
Tracks
Observability
Metrics, logs, traces, and the telemetry pipeline. Answers "what is the system doing right now, and why?"
Reliability
SLIs, SLOs, error budgets, capacity planning, and the architectural patterns (retries, circuit breakers, graceful degradation) that keep the system honest.
Incident
On-call rotations, response process, severity classification, communication, and postmortems. Turns surprises into learning.
Chaos
Deliberate fault injection — disaster-recovery drills, game days, latency/network failures, full chaos engineering — to verify resilience instead of hoping for it.
All capabilities (25)
Reactive
Basic Uptime Monitoring
External pings on production URLs catch hard outages within a few minutes. Notifications land in a chat channel rather than relying on customer reports.
monitoring · uptime · observability
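A minimal sketch of an external uptime probe, assuming a scheduler (cron or similar) calls it every minute or so. The names `http_status`, `is_up`, and `run_checks` are hypothetical, and `notify` stands in for whatever posts to your chat channel.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET request, or 0 if no answer arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, or timeout


def is_up(url, fetch=http_status):
    """Treat any 2xx or 3xx answer as up."""
    return 200 <= fetch(url) < 400


def run_checks(urls, fetch=http_status, notify=print):
    """Probe every URL once and send one notification per URL that is down."""
    down = [url for url in urls if not is_up(url, fetch)]
    for url in down:
        notify(f"DOWN: {url}")
    return down
```

Passing `fetch` and `notify` as parameters keeps the probe testable without network access; in real use the defaults perform an actual HTTP GET and print to standard output.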
Centralised Logging
Every service streams logs to one searchable place. SSHing into machines to tail logs becomes a fallback, not the default.
logging · observability
Informal On-call Rotation
A rota exists. One named engineer is responsible for production at any given time, even if "the process" is just a chat channel and goodwill.
on-call · incident-response
Production Smoke Tests
A short suite of read-only tests runs against production after every deploy and on a schedule, catching gross regressions before users do.
smoke-tests · production-testing
Public Status Page
A status page tells customers when something is wrong without needing them to ask. Owned by SRE/ops, updated during incidents.
status-page · communication
Proactive
Alerting & Routing
Alerts fire on symptoms users feel (latency, error rate, queue depth) — not causes. Routing sends them to the right team without paging everyone.
alerting · pagerduty · observability
Blameless Postmortems
Every Sev1/Sev2 incident produces a postmortem within 5 days. Focus is on systemic causes, not individual mistakes. Action items have owners and due dates.
postmortems · incident-response · culture
Disaster Recovery Drills
You actually test backups and failover, on a schedule. "Restore from backup" is a practised skill, not a hypothetical capability.
disaster-recovery · backup · failover
Health Checks & Graceful Shutdown
Every service exposes /healthz and /readyz. Orchestrators and load balancers (Kubernetes, ECS, Cloudflare) use them to route traffic and drain gracefully.
health-checks · kubernetes · graceful-shutdown
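The sketch below shows one way the two endpoints and a graceful shutdown could fit together, using only the Python standard library. The port, the drain period, and the names `probe_response`, `ProbeHandler`, and `serve` are assumptions for illustration, not a prescribed implementation.

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

shutting_down = threading.Event()  # set once SIGTERM arrives


def probe_response(path, dependencies_ok=True):
    """Map a probe path to (status code, body)."""
    if path == "/healthz":
        return 200, "ok"  # liveness: the process is running
    if path == "/readyz":
        if shutting_down.is_set() or not dependencies_ok:
            return 503, "not ready"  # tell the router to stop sending traffic
        return 200, "ready"
    return 404, "not found"


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        code, body = probe_response(self.path)
        payload = body.encode()
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080, drain_seconds=10.0):
    """Serve probes until SIGTERM, then fail readiness, drain, and stop.

    Must be called from the main thread, because signal handlers can only
    be installed there.
    """
    signal.signal(signal.SIGTERM, lambda signum, frame: shutting_down.set())
    server = ThreadingHTTPServer(("", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    while not shutting_down.wait(timeout=1.0):
        pass  # main thread idles until the signal handler fires
    time.sleep(drain_seconds)  # /readyz now returns 503; give the router time to notice
    server.shutdown()
    server.server_close()
```

Liveness stays green during the drain so the orchestrator does not kill the process early, while readiness turns red so new traffic goes elsewhere.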
Metrics & Dashboards
Application + infra metrics flow into Prometheus / Datadog / Grafana. Every service has a default dashboard covering RED metrics (rate, errors, duration).
metrics · dashboards · prometheus · grafana
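As a rough illustration of what the RED panels on a default dashboard compute, the sketch below derives rate, error ratio, and p95 duration from one window of request samples. `RequestSample` and `red_summary` are hypothetical names; a real setup would normally let the metrics backend do this aggregation.

```python
from dataclasses import dataclass


@dataclass
class RequestSample:
    duration_ms: float
    status: int  # HTTP status code


def red_summary(samples, window_seconds):
    """Summarise one window of requests as rate, error ratio, and p95 duration."""
    count = len(samples)
    if count == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    errors = sum(1 for s in samples if s.status >= 500)
    durations = sorted(s.duration_ms for s in samples)
    # Nearest-rank percentile: ceil(0.95 * count), done in integer arithmetic.
    rank = (95 * count + 99) // 100
    return {
        "rate_per_s": count / window_seconds,
        "error_ratio": errors / count,
        "p95_ms": durations[rank - 1],
    }
```

Counting only 5xx responses as errors is an assumption; some services also treat specific 4xx codes or slow responses as failures.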
Paging Tool & Severity Levels
A paging tool (PagerDuty, Opsgenie, Better Stack) handles escalation and schedules. Incident severities are defined and used consistently.
pagerduty · paging · incident-response
Runbooks per Service
Every service has a runbook covering the top 5-10 alert conditions: what to check, common causes, how to mitigate. Linked directly from the alert.
runbooks · incident-response · documentation
Structured Logging + Correlation
Logs are JSON with a consistent schema; a trace/request ID flows from edge to database. Correlating across services is a single search.
logging · correlation · tracing
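One possible shape for JSON logs that carry a request ID, sketched with the standard `logging` and `contextvars` modules. The field names in the schema and the names `request_id`, `JsonFormatter`, and `configure_logging` are assumptions; the point is only that every line gets the same ID without each call site passing it along.

```python
import contextvars
import json
import logging

# Set once at the edge of each request (for example from an incoming header).
request_id = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of keys."""

    def format(self, record):
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id.get(),
        }
        return json.dumps(entry)


def configure_logging(level=logging.INFO):
    """Route all logs through the JSON formatter on standard error."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
```

A request handler would call `request_id.set(...)` before doing any work, and downstream calls would forward the same value so that one search on `request_id` returns the whole path.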
Engineered
Capacity Planning & Autoscaling
Headroom is measured, growth forecasts drive infra decisions, and autoscaling responds to demand without manual intervention. Cost is part of the conversation.
capacity · autoscaling · planning
Chaos Engineering Experiments
Hypothesised failures (latency, errors, node loss) are injected against staging — and eventually production — under controlled conditions. Every experiment has a hypothesis, blast radius, and abort criteria.
chaos-engineering · chaos-monkey · resilience
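The three required parts of an experiment can be sketched as a small harness. Everything here is hypothetical (`Experiment`, `inject_faults`, `run_experiment`, the `min_samples` guard); it only illustrates how a blast radius limits who receives the fault and how abort criteria stop the run early.

```python
from dataclasses import dataclass


class InjectedFault(Exception):
    """Raised in place of a real dependency call during an experiment."""


@dataclass
class Experiment:
    hypothesis: str
    blast_radius: float       # fraction of dependency calls that receive the fault
    abort_error_ratio: float  # stop once user-visible errors exceed this ratio


def inject_faults(call, blast_radius, rng):
    """Wrap a dependency call so that a fraction of invocations fail."""
    def wrapped(*args, **kwargs):
        if rng.random() < blast_radius:
            raise InjectedFault("injected by chaos experiment")
        return call(*args, **kwargs)
    return wrapped


def run_experiment(exp, service, requests, min_samples=20):
    """Drive requests through the service and abort if the error ratio gets too high."""
    seen = errors = 0
    for request in requests:
        seen += 1
        try:
            service(request)
        except Exception:
            errors += 1  # an error the user would have seen
        if seen >= min_samples and errors / seen > exp.abort_error_ratio:
            return {"aborted": True, "seen": seen, "error_ratio": errors / seen}
    ratio = errors / seen if seen else 0.0
    return {"aborted": False, "seen": seen, "error_ratio": ratio}
```

The hypothesis holds if the run finishes without aborting; a service that lacks a fallback for the faulty dependency trips the abort threshold after `min_samples` requests.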
Distributed Tracing
OpenTelemetry traces flow from the edge through every service. A single trace ID gives you a full request waterfall, latency breakdown, and span attributes.
tracing · opentelemetry · observability
Error Budgets Drive Release Pace
When the error budget is healthy, you ship faster. When it's exhausted, release pace slows and reliability work takes priority. The policy is written down and respected.
error-budget · slo · release-management
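The arithmetic behind such a policy is short. In the sketch below, the function names and the 25% caution threshold are assumptions chosen for illustration; the actual thresholds belong in your written policy.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def release_mode(remaining, caution_below=0.25):
    """Translate the remaining budget into a release posture."""
    if remaining <= 0.0:
        return "freeze"  # budget exhausted: reliability work only
    if remaining < caution_below:
        return "slow"    # budget nearly spent: reduce release pace
    return "normal"
```

For example, a 99.9% SLO over 1,000,000 requests allows about 1,000 failures; with 250 failures so far, roughly three quarters of the budget remains and releases proceed normally.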
Game Days
Scheduled, scenario-driven exercises where a team responds to a simulated incident in real time. Surfaces broken runbooks, communication gaps, and tooling that nobody actually knows how to use.
game-days · chaos-engineering · incident-response
Incident Commander Role
Sev1/Sev2 incidents have a named IC running the response. Comms, technical lead, and scribe roles are explicit. Conference bridges and chat channels are spun up automatically.
incident-commander · incident-response
SLIs & SLOs
Each user-facing service has 2-3 SLIs (latency, availability, correctness) with target SLOs the business has signed off on. Dashboards show SLO burn rate at a glance.
slo · sli · reliability
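Burn rate is the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below assumes a two-window paging rule; the function names and the default threshold of 14.4 (which corresponds to spending 2% of a 30-day budget in one hour) are illustrative choices, not values taken from this page.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than sustainable the error budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold=14.4):
    """Page only when both the long and the short window burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening right now.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

Requiring both windows to agree avoids paging for an incident that has already recovered.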
Autonomous
AI-Assisted Triage
During an incident, an LLM-backed assistant proposes likely causes by correlating recent deploys, related metrics, and prior postmortems. Cuts MTTR for known failure modes.
aiops · incident-response · ai
Anomaly Detection & AIOps
Statistical or ML-based detection spots deviations from normal behaviour before they cross human-set thresholds. Reduces both alert noise and mean-time-to-detect.
aiops · anomaly-detection · observability
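A simple statistical detector of this kind is a trailing-window z-score. The sketch below is one minimal variant, with a hypothetical function name and illustrative defaults for the window size and threshold; production systems usually also account for seasonality and trend.

```python
import statistics
from collections import deque


def detect_anomalies(values, window=30, z_threshold=3.0):
    """Return (index, value) pairs that deviate strongly from the trailing window."""
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(values):
        if len(history) == window:
            mean = statistics.fmean(history)
            stdev = statistics.pstdev(history)
            if stdev > 0 and abs(value - mean) / stdev > z_threshold:
                anomalies.append((index, value))
                continue  # keep the outlier out of the baseline
        history.append(value)
    return anomalies
```

No static threshold appears anywhere: a value is flagged relative to recent behaviour, which is what lets the detector fire before a human-set limit is crossed. A perfectly flat baseline has zero standard deviation, and this sketch deliberately flags nothing in that case.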
Continuous Chaos in Production
Chaos experiments run automatically against production on a low-blast-radius schedule. The system is continuously verified, not just at game-day frequency.
chaos-engineering · continuous-chaos · resilience
Resilience as Code
Reliability policies — circuit-breaker thresholds, retry budgets, SLO targets, deploy gates — live in version control and apply uniformly via policy-as-code (OPA, Kyverno) or a service mesh.
policy-as-code · opa · service-mesh · reliability
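To make the idea concrete, the sketch below drives a circuit breaker entirely from a policy dictionary. In practice that dictionary would be loaded from a reviewed file in version control or enforced by a mesh or policy engine; the `POLICY` keys, class names, and values here are hypothetical, and the breaker is not thread-safe.

```python
import time

# Assumed policy document; the keys and values are illustrative only.
POLICY = {"failure_threshold": 3, "reset_timeout_s": 30.0}


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold, reset_timeout_s, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

Constructing the breaker with `CircuitBreaker(**POLICY)` means a threshold change is a reviewed commit rather than an edit scattered across services.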
Self-Healing Systems
Common failure modes are auto-remediated: failed pods restart, leaked memory triggers a recycle, stale caches refresh, and retries with exponential backoff are the default. Humans see only the rare incidents that automation cannot resolve.
self-healing · automation · resilience
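Retries with exponential backoff, mentioned above as a default, can be sketched as follows. The function name, the default limits, and the choice of full jitter are assumptions; `sleep` and `rng` are parameters only so the behaviour can be tested without waiting.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       retry_on=(Exception,), sleep=time.sleep, rng=random.random):
    """Call fn until it succeeds, backing off exponentially between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
            # The delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random fraction of the ceiling so that
            # many clients do not retry in lockstep.
            sleep(rng() * ceiling)
```

Narrowing `retry_on` to transient errors (timeouts, connection resets) matters in practice, since retrying a request that failed validation only adds load.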