What is a technology tree?

A technology tree (tech tree) is a visual progression map inspired by strategy games like Age of Empires. It shows capabilities organised by domain (columns) and maturity level (rows), with dependency lines showing what must be achieved before advancing. Each node includes effort estimates, actionable steps, and links to relevant tools.

How do I use the tech tree for my organisation?

Select an organisational tree (like Engineering Org Maturity or Security & Compliance), then mark nodes as completed based on your current state. The tree automatically highlights what is available to work on next based on prerequisites. Click any available node to see the concrete steps required to achieve it.
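The "available next" highlighting described above can be sketched as a set comparison: a node is available when it is not yet completed and every prerequisite is. This is a minimal illustration, not the site's implementation; the `Node` class, the `available_nodes` function, and the dependency edges in `TREE` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """One capability in the tree (fields are illustrative)."""
    name: str
    prerequisites: frozenset[str] = field(default_factory=frozenset)


def available_nodes(nodes: list[Node], completed: set[str]) -> list[str]:
    """Return names of nodes that are not done yet but whose prerequisites all are."""
    return sorted(
        node.name
        for node in nodes
        if node.name not in completed and node.prerequisites <= completed
    )


# Hypothetical dependency edges; the real tree defines its own.
TREE = [
    Node("Centralised Logging"),
    Node("Structured Logging + Correlation", frozenset({"Centralised Logging"})),
    Node("Distributed Tracing", frozenset({"Structured Logging + Correlation"})),
]

print(available_nodes(TREE, completed={"Centralised Logging"}))
# ['Structured Logging + Correlation']
```

Marking another node as completed and calling `available_nodes` again yields the next unlocked set, which is all the highlighting needs.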

SRE & Resilience — Tech Tree | The Art of CTO

Maturity tiers

Reactive
You find out about outages from customers. Engineers are paged ad-hoc, with no formal runbooks, and incident learnings rarely make it back into the system.
Proactive
Monitoring, alerting, and a real on-call rotation exist. Incidents follow a process; postmortems are written but inconsistently acted on.
Engineered
Reliability is owned. SLOs and error budgets govern release pace, distributed tracing makes root-causing fast, and chaos experiments validate assumptions before production breaks them.
Autonomous
The system catches and corrects most failures before humans notice. Continuous chaos runs in production, AIOps spots anomalies, and reliability is encoded as policy.

Tracks

Observability
Metrics, logs, traces, and the telemetry pipeline. Answers "what is the system doing right now, and why?"
Reliability
SLIs, SLOs, error budgets, capacity planning, and the architectural patterns (retries, circuit breakers, graceful degradation) that keep the system honest.
Incident
On-call rotations, response process, severity classification, communication, and postmortems. Turns surprises into learning.
Chaos
Deliberate fault injection — disaster-recovery drills, game days, latency/network failures, full chaos engineering — to verify resilience instead of hoping for it.

All capabilities (25)

Reactive

Basic Uptime Monitoring
External pings on production URLs catch hard outages within a few minutes. Notifications land in a chat channel rather than relying on customer reports.
monitoring · uptime · observability
Centralised Logging
Every service streams logs to one searchable place. SSHing into machines to tail logs becomes a fallback, not the default.
logging · observability
Informal On-call Rotation
A rota exists. One named engineer is responsible for production at any given time, even if "the process" is just a chat channel and goodwill.
on-call · incident-response
Production Smoke Tests
A short suite of read-only tests runs against production after every deploy and on a schedule, catching gross regressions before users do.
smoke-tests · production-testing
Public Status Page
A status page tells customers when something is wrong without needing them to ask. Owned by SRE/ops, updated during incidents.
status-page · communication

Proactive

Alerting & Routing
Alerts fire on symptoms users feel (latency, error rate, queue depth) — not causes. Routing sends them to the right team without paging everyone.
alerting · pagerduty · observability
Blameless Postmortems
Every Sev1/Sev2 incident produces a postmortem within 5 days. Focus is on systemic causes, not individual mistakes. Action items have owners and due dates.
postmortems · incident-response · culture
Disaster Recovery Drills
You actually test backups and failover, on a schedule. "Restore from backup" is a practised skill, not a hypothetical capability.
disaster-recovery · backup · failover
Health Checks & Graceful Shutdown
Every service exposes /healthz and /readyz. Orchestrators and edge load balancers (Kubernetes, ECS, Cloudflare) use them to route traffic and drain instances gracefully.
health-checks · kubernetes · graceful-shutdown
Metrics & Dashboards
Application + infra metrics flow into Prometheus / Datadog / Grafana. Every service has a default dashboard covering RED metrics (rate, errors, duration).
metrics · dashboards · prometheus · grafana
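To make the RED metrics concrete, here is a rough sketch that derives rate, error ratio, and a duration percentile from one window of raw request records. In practice Prometheus or Datadog would compute these from counters and histograms; the `Request` type, the `red_metrics` function, and the choice of p95 and of 5xx as "error" are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def red_metrics(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarise one window of requests as rate, error ratio, and p95 duration."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p95 using integer arithmetic: ceil(0.95 * n).
    rank = (95 * len(durations) + 99) // 100
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_ms": durations[rank - 1],
    }


# 20 requests over a 10-second window, one of which failed.
sample = [Request(10.0 * i, 500 if i == 1 else 200) for i in range(1, 21)]
print(red_metrics(sample, window_seconds=10.0))
```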
Paging Tool & Severity Levels
A paging tool (PagerDuty, Opsgenie, Better Stack) handles escalation and schedules. Incident severities are defined and used consistently.
pagerduty · paging · incident-response
Runbooks per Service
Every service has a runbook covering the top 5-10 alert conditions: what to check, common causes, how to mitigate. Linked directly from the alert.
runbooks · incident-response · documentation
Structured Logging + Correlation
Logs are JSON with a consistent schema; a trace/request ID flows from edge to database. Correlating across services is a single search.
logging · correlation · tracing
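One way to get JSON logs that carry a request ID is a custom formatter plus a context variable, using only the Python standard library. The sketch below assumes a schema of five keys and a hypothetical `request_id_var` that middleware would set at the edge; neither comes from the source, and a real service would also forward the ID to downstream calls.

```python
import contextvars
import io
import json
import logging

# Holds the ID of the request being handled in the current context.
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "request_id": request_id_var.get(),
            "message": record.getMessage(),
        })


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger("checkout", buffer)
request_id_var.set("req-123")  # normally set by middleware when the request arrives
log.info("payment authorised")
print(buffer.getvalue().strip())
```

Because every line carries the same `request_id` key, searching the central log store for one ID returns the request's path across services.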

Engineered

Capacity Planning & Autoscaling
Headroom is measured, growth forecasts drive infra decisions, and autoscaling responds to demand without manual intervention. Cost is part of the conversation.
capacity · autoscaling · planning
Chaos Engineering Experiments
Hypothesised failures (latency, errors, node loss) are injected against staging — and eventually production — under controlled conditions. Every experiment has a hypothesis, blast radius, and abort criteria.
chaos-engineering · chaos-monkey · resilience
Distributed Tracing
OpenTelemetry traces flow from the edge through every service. A single trace ID gives you a full request waterfall, latency breakdown, and span attributes.
tracing · opentelemetry · observability
Error Budgets Drive Release Pace
When the error budget is healthy, you ship faster. When it is exhausted, release pace slows and reliability work takes priority. The policy is written down and respected.
error-budget · slo · release-management
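An error-budget policy like the one above can be written down as a small calculation. This sketch assumes a request-based SLO and invents both the function names and the 25% caution threshold; a real policy would use whatever thresholds the team has agreed on.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent in the current window.

    1.0 means untouched, 0.0 means exactly spent, negative means overspent.
    """
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        return 0.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


def release_mode(remaining: float, caution_below: float = 0.25) -> str:
    """Map remaining budget to a release posture (thresholds are assumptions)."""
    if remaining <= 0:
        return "freeze: reliability work only"
    if remaining < caution_below:
        return "slow: extra review on risky changes"
    return "normal"


# 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
remaining = error_budget_remaining(0.999, total=1_000_000, failed=400)
print(round(remaining, 3), release_mode(remaining))
```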
Game Days
Scheduled, scenario-driven exercises where a team responds to a simulated incident in real time. Surfaces broken runbooks, communication gaps, and tooling that nobody actually knows how to use.
game-days · chaos-engineering · incident-response
Incident Commander Role
Sev1/Sev2 incidents have a named IC running the response. Comms, technical lead, and scribe roles are explicit. Conference bridges and chat channels are spun up automatically.
incident-commander · incident-response
SLIs & SLOs
Each user-facing service has 2-3 SLIs (latency, availability, correctness) with target SLOs the business has signed off on. Dashboards show SLO burn rate at a glance.
slo · sli · reliability
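Burn rate is the ratio between the observed error ratio and the error ratio the SLO allows. A rate of 1.0 spends the budget exactly over the SLO window; higher rates exhaust it sooner. The helper names below are illustrative, and the 30-day window is an assumption.

```python
def burn_rate(slo_target: float, error_ratio: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def hours_until_exhausted(window_hours: float, rate: float) -> float:
    """Time for a full budget to run out if the current burn rate holds."""
    return float("inf") if rate <= 0 else window_hours / rate


# 99.9% SLO with 0.5% of requests currently failing: roughly 5x burn,
# so a 30-day (720 h) budget would last about 6 days.
rate = burn_rate(0.999, 0.005)
print(round(rate, 2), round(hours_until_exhausted(720, rate), 1))
```

Alerting on burn rate over a short and a long window together is a common way to page on fast burns without paging on brief blips.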

Autonomous

AI-Assisted Triage
During an incident, an LLM-backed assistant proposes likely causes by correlating recent deploys, related metrics, and prior postmortems. Cuts MTTR for known failure modes.
aiops · incident-response · ai
Anomaly Detection & AIOps
Statistical or ML-based detection spots deviations from normal behaviour before they cross human-set thresholds. Reduces both alert noise and mean-time-to-detect.
aiops · anomaly-detection · observability
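As a simple instance of the statistical detection described above, a rolling z-score flags values that sit far from the recent mean, without a hand-set absolute threshold. This is a deliberately small sketch: the class name, window size, and threshold of 3 standard deviations are assumptions, and production systems usually also account for seasonality.

```python
from collections import deque
from statistics import mean, pstdev


class ZScoreDetector:
    """Flag values that deviate strongly from a rolling window of recent ones."""

    def __init__(self, window: int = 30, threshold: float = 3.0, min_samples: int = 5):
        self.history = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous relative to the current window."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.threshold
            else:
                anomalous = value != mu
        # Learn only from normal points so a single spike does not shift the
        # baseline. Trade-off: a permanent level shift keeps being flagged.
        if not anomalous:
            self.history.append(value)
        return anomalous


detector = ZScoreDetector()
latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97, 250, 101]
flags = [detector.observe(v) for v in latencies_ms]
print(flags)  # only the 250 ms sample is flagged
```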
Continuous Chaos in Production
Chaos experiments run automatically against production on a low-blast-radius schedule. The system is continuously verified, not just at game-day frequency.
chaos-engineering · continuous-chaos · resilience
Resilience as Code
Reliability policies — circuit-breaker thresholds, retry budgets, SLO targets, deploy gates — live in version control and apply uniformly via policy-as-code (OPA, Kyverno) or a service mesh.
policy-as-code · opa · service-mesh · reliability
Self-Healing Systems
Common failure modes are auto-remediated: failed pods restart, memory leaks trigger a recycle, stale caches refresh, and retries with exponential backoff are the default. Humans see only the rare incidents that automation cannot handle.
self-healing · automation · resilience
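Retries with exponential backoff, mentioned under Self-Healing Systems, can be sketched in a few lines. The version below adds random jitter so many clients do not retry in lockstep. The function name and defaults are assumptions, and `sleep` and `rng` are injectable only to make the behaviour easy to test; real code should retry only errors known to be transient rather than every exception.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation until it succeeds or attempts run out.

    Between tries, waits a random time up to base_delay * 2**attempt,
    capped at max_delay ("full jitter").
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)


# Demo: an operation that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []
result = retry_with_backoff(flaky, sleep=delays.append, rng=lambda: 1.0)
print(result, delays)
```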

Other tech trees

Engineering Organisation Maturity

CTO Career Path

Platform Engineering Maturity

Security & Compliance Maturity

Cloud Infrastructure Maturity

Testing & QA Maturity

Engineering Excellence

Cloud Native Journey

Data & Analytics Maturity

Frontend Maturity

CNCF Adoption
