The Art of CTO Technology Tree is an interactive, Age of Empires-style progression map that visualises maturity across engineering domains, from ad-hoc practices to elite capability, with actionable steps, effort estimates, and cross-domain dependencies.

SRE & Resilience

Advance your reliability practice from reactive break-fix to continuous chaos engineering. Each node represents a concrete capability — monitoring, SLOs, on-call discipline, fault injection — with realistic effort estimates and cross-track dependencies.

Maturity tiers
  1. Reactive

    You find out about outages from customers. Engineers are paged ad-hoc, with no formal runbooks, and incident learnings rarely make it back into the system.

  2. Proactive

    Monitoring, alerting, and a real on-call rotation exist. Incidents follow a process; postmortems are written but inconsistently acted on.

  3. Engineered

    Reliability is owned. SLOs and error budgets govern release pace, distributed tracing makes root-cause analysis fast, and chaos experiments validate assumptions before production breaks them.

  4. Autonomous

    The system catches and corrects most failures before humans notice. Continuous chaos runs in production, AIOps spots anomalies, and reliability is encoded as policy.

Tracks

  • Observability

    Metrics, logs, traces, and the telemetry pipeline. Answers "what is the system doing right now, and why?"

  • Reliability

    SLIs, SLOs, error budgets, capacity planning, and the architectural patterns (retries, circuit breakers, graceful degradation) that keep the system honest.

  • Incident

    On-call rotations, response process, severity classification, communication, and postmortems. Turns surprises into learning.

  • Chaos

    Deliberate fault injection — disaster-recovery drills, game days, latency/network failures, full chaos engineering — to verify resilience instead of hoping for it.
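The Reliability track names circuit breakers and graceful degradation as architectural patterns. A minimal sketch of the idea follows, assuming a hypothetical `CircuitBreaker` class, illustrative thresholds, and an injectable clock. It is not taken from any particular library.

```python
import time


class CircuitBreaker:
    """Hypothetical minimal circuit breaker: closed -> open -> half-open."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before tripping
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, fn, fallback):
        # While open, skip the dependency entirely until the cool-down passes.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback()  # graceful degradation instead of an error
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each call to a flaky dependency, for example `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` is a placeholder name. The fallback is what keeps the page rendering while the dependency recovers.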

All capabilities (25)

Reactive

  • Basic Uptime Monitoring

    External pings on production URLs catch hard outages within a few minutes. Notifications land in a chat channel rather than relying on customer reports.

    monitoring · uptime · observability

  • Centralised Logging

    Every service streams logs to one searchable place. SSHing into machines to tail logs becomes a fallback, not the default.

    logging · observability

  • Informal On-call Rotation

    A rota exists. One named engineer is responsible for production at any given time, even if "the process" is just a chat channel and goodwill.

    on-call · incident-response

  • Production Smoke Tests

    A short suite of read-only tests runs against production after every deploy and on a schedule, catching gross regressions before users do.

    smoke-tests · production-testing

  • Public Status Page

    A status page tells customers when something is wrong without needing them to ask. Owned by SRE/ops, updated during incidents.

    status-page · communication
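As a rough illustration of Basic Uptime Monitoring from the list above, the sketch below probes a set of URLs and reports the ones that look down. The function names, the "no response or 5xx means down" rule, and the `notify` callback (standing in for a chat webhook) are all assumptions made for this example.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, and so on


def check_uptime(urls, probe=http_status, notify=print):
    """Probe each URL once and notify about every one that is not healthy."""
    down = []
    for url in urls:
        status = probe(url)
        # Assumption: no response at all, or any 5xx, counts as an outage.
        if status is None or status >= 500:
            down.append(url)
            notify(f"DOWN {url} (status={status})")
    return down
```

Running `check_uptime` from a scheduler outside the production network every minute or so approximates the external ping described above. In practice a hosted uptime service usually does this job.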

Proactive

  • Alerting & Routing

    Alerts fire on symptoms users feel (latency, error rate, queue depth) — not causes. Routing sends them to the right team without paging everyone.

    alerting · pagerduty · observability

  • Blameless Postmortems

    Every Sev1/Sev2 incident produces a postmortem within 5 days. Focus is on systemic causes, not individual mistakes. Action items have owners and due dates.

    postmortems · incident-response · culture

  • Disaster Recovery Drills

    You actually test backups and failover, on a schedule. "Restore from backup" is a practised skill, not a hypothetical capability.

    disaster-recovery · backup · failover

  • Health Checks & Graceful Shutdown

    Every service exposes /healthz and /readyz. Orchestrators and load balancers (Kubernetes, ECS, Cloudflare) use them to route traffic and drain connections gracefully.

    health-checks · kubernetes · graceful-shutdown

  • Metrics & Dashboards

    Application + infra metrics flow into Prometheus / Datadog / Grafana. Every service has a default dashboard covering RED metrics (rate, errors, duration).

    metrics · dashboards · prometheus · grafana

  • Paging Tool & Severity Levels

    A paging tool (PagerDuty, Opsgenie, Better Stack) handles escalation and schedules. Incident severities are defined and used consistently.

    pagerduty · paging · incident-response

  • Runbooks per Service

    Every service has a runbook covering the top 5-10 alert conditions: what to check, common causes, how to mitigate. Linked directly from the alert.

    runbooks · incident-response · documentation

  • Structured Logging + Correlation

    Logs are JSON with a consistent schema; a trace/request ID flows from edge to database. Correlating across services is a single search.

    logging · correlation · tracing
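Structured Logging + Correlation, from the list above, can be sketched with the standard library alone. The field names in the JSON schema, the `request_id` context variable, and the `make_logger` helper are assumptions chosen for illustration, not a prescribed schema.

```python
import contextvars
import json
import logging

# Hypothetical context variable holding the current request's ID.
request_id = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of keys."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "request_id": request_id.get(),  # same ID on every line of a request
            "message": record.getMessage(),
        })


def make_logger(name, stream=None):
    """Build a logger that writes JSON lines (to stderr when stream is None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

At the edge, a middleware would call `request_id.set(...)` with the incoming or newly generated ID and forward it to downstream services in a header. Every log line written while handling that request then carries the same value, which is what makes a single search sufficient.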

Engineered

  • Capacity Planning & Autoscaling

    Headroom is measured, growth forecasts drive infra decisions, and autoscaling responds to demand without manual intervention. Cost is part of the conversation.

    capacity · autoscaling · planning

  • Chaos Engineering Experiments

    Hypothesised failures (latency, errors, node loss) are injected against staging — and eventually production — under controlled conditions. Every experiment has a hypothesis, blast radius, and abort criteria.

    chaos-engineering · chaos-monkey · resilience

  • Distributed Tracing

    OpenTelemetry traces flow from the edge through every service. A single trace ID gives you a full request waterfall, latency breakdown, and span attributes.

    tracing · opentelemetry · observability

  • Error Budgets Drive Release Pace

    When the error budget is healthy, you ship faster. When it's exhausted, release pace slows and reliability work takes priority. The policy is written down and respected.

    error-budget · slo · release-management

  • Game Days

    Scheduled, scenario-driven exercises where a team responds to a simulated incident in real time. Surfaces broken runbooks, communication gaps, and tooling that nobody actually knows how to use.

    game-days · chaos-engineering · incident-response

  • Incident Commander Role

    Sev1/Sev2 incidents have a named IC running the response. Comms, technical lead, and scribe roles are explicit. Conference bridges and chat channels are spun up automatically.

    incident-commander · incident-response

  • SLIs & SLOs

    Each user-facing service has 2-3 SLIs (latency, availability, correctness) with target SLOs the business has signed off on. Dashboards show SLO burn rate at a glance.

    slo · sli · reliability
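The arithmetic behind SLOs, error budgets, and burn rate in the list above is short enough to sketch. The function names and the release rule below are assumptions for illustration. A real policy would be whatever the business has signed off on.

```python
def error_budget(slo_target, total, failed):
    """Summarise error-budget consumption for one SLO window.

    slo_target: fraction of requests that must succeed, e.g. 0.999
    total, failed: request counts observed in the window
    """
    allowed = (1.0 - slo_target) * total  # failures the SLO tolerates
    if allowed > 0:
        consumed = failed / allowed
    else:
        consumed = 0.0 if failed == 0 else float("inf")
    return {
        "allowed_failures": allowed,
        "budget_remaining": 1.0 - consumed,  # below 0 means the SLO is blown
    }


def burn_rate(slo_target, total, failed):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget runs out exactly at the end of the window.
    """
    if total == 0:
        return 0.0
    return (failed / total) / (1.0 - slo_target)


def release_allowed(budget_remaining, threshold=0.0):
    # Hypothetical policy: ship only while some budget is left.
    return budget_remaining > threshold
```

With a 99.9% target over 1,000,000 requests, the SLO tolerates about 1,000 failures, so 250 failures leave roughly 75% of the budget. A burn rate of 2 means the budget would be gone halfway through the window.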

Autonomous

  • AI-Assisted Triage

    During an incident, an LLM-backed assistant proposes likely causes by correlating recent deploys, related metrics, and prior postmortems. Cuts MTTR for known failure modes.

    aiops · incident-response · ai

  • Anomaly Detection & AIOps

    Statistical or ML-based detection spots deviations from normal behaviour before they cross human-set thresholds. Reduces both alert noise and mean-time-to-detect.

    aiops · anomaly-detection · observability

  • Continuous Chaos in Production

    Chaos experiments run automatically against production on a low-blast-radius schedule. The system is continuously verified, not just at game-day frequency.

    chaos-engineering · continuous-chaos · resilience

  • Resilience as Code

    Reliability policies — circuit-breaker thresholds, retry budgets, SLO targets, deploy gates — live in version control and apply uniformly via policy-as-code (OPA, Kyverno) or a service mesh.

    policy-as-code · opa · service-mesh · reliability

  • Self-Healing Systems

    Common failure modes are auto-remediated: failed pods restart, leaked memory triggers a recycle, stale caches refresh, and retries with exponential backoff are the default. Humans see only the rare incidents that automation cannot resolve.

    self-healing · automation · resilience
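Self-Healing Systems, in the list above, mentions retries with exponential backoff as a default. A minimal sketch follows, assuming a hypothetical `retry` helper with capped delays and jitter. The parameter names and defaults are illustrative only.

```python
import random
import time


def retry(fn, attempts=5, base=0.1, cap=5.0, sleep=time.sleep, rng=random.random):
    """Call fn until it succeeds, sleeping with capped exponential backoff.

    The delay before retry n (0-based) is a random fraction of
    min(cap, base * 2**n), so concurrent clients do not retry in lockstep.
    """
    for n in range(attempts):
        try:
            return fn()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(rng() * min(cap, base * 2 ** n))
```

The `sleep` and `rng` parameters exist only so the behaviour can be tested without waiting. Production code would normally also restrict which exception types are retried, since retrying a non-idempotent or permanently failing call makes an incident worse.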

Frequently Asked Questions

What is a technology tree?

A technology tree (tech tree) is a visual progression map inspired by strategy games like Age of Empires. It shows capabilities organised by domain (columns) and maturity level (rows), with dependency lines showing what must be achieved before advancing. Each node includes effort estimates, actionable steps, and links to relevant tools.

How do I use the tech tree for my organisation?

Select an organisational tree (like Engineering Org Maturity or Security & Compliance), then mark nodes as completed based on your current state. The tree automatically highlights what is available to work on next based on prerequisites. Click any available node to see the concrete steps required to achieve it.
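The "available next" highlighting described above amounts to a prerequisite check over the dependency graph. The sketch below shows one way it could work. The `available_nodes` function and the dependencies in `TREE` are hypothetical and do not reflect how the site is actually implemented or which nodes it links.

```python
def available_nodes(prerequisites, completed):
    """Return nodes that are not done yet but whose prerequisites all are.

    prerequisites: dict mapping node name -> set of node names it depends on
    completed: set of node names already marked as done
    """
    return sorted(
        node
        for node, deps in prerequisites.items()
        if node not in completed and deps <= completed
    )


# Hypothetical dependencies between a few capabilities from this page.
TREE = {
    "Centralised Logging": set(),
    "Structured Logging + Correlation": {"Centralised Logging"},
    "Distributed Tracing": {"Structured Logging + Correlation"},
    "Metrics & Dashboards": set(),
    "SLIs & SLOs": {"Metrics & Dashboards"},
}
```

Calling `available_nodes(TREE, {"Centralised Logging"})` would offer the two nodes whose prerequisites are now met, while "Distributed Tracing" stays locked until structured logging is marked complete.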