Operational resilience: SPOFs, testing, FCA, DORA, APRA

Operational resilience for CTOs: find SPOFs, test failure, and meet FCA, DORA, and APRA expectations

17 Jan 2025 is the date DORA started to apply across the EU financial sector, and 31 Mar 2025 is the UK deadline for firms to meet impact tolerances for important business services. APRA CPS 230 lands on 1 Jul 2025.

Those dates matter because they turn resilience from “good engineering” into “show me the evidence, right now” work. CTOs need a repeatable way to find single points of failure, test them safely, and prove the business can keep serving customers.

Operational resilience and regulatory expectations: FCA, DORA, APRA, and the rest

Operational resilience isn’t uptime. It’s the ability to keep delivering a business service through disruption, recover, and then get better because you learned something.

UK regulators define operational resilience as the ability to prevent, adapt and respond to, recover, and learn from operational disruption, across the Bank of England, PRA, and FCA regime (Riskonnect summary of UK basics). APRA uses a plain definition too. It calls it the ability to withstand and recover from shocks, where a shock threatens or disrupts business services (APRA COVID-19 resilience lessons).

DORA tightens the scope to ICT disruption and third party ICT risk, and it applies from 17 Jan 2025 (EIOPA DORA overview). That difference bites global firms. The Institute of International Finance points out that the UK focuses on “important business services” delivered to an external user, while DORA focuses on “critical or important functions” which can include internal functions like regulatory reporting (IIF staff paper, Dec 2024).

If you operate across regions, you end up translating one engineering reality into several regulatory vocabularies. In practice, that translation layer becomes a product you maintain.

Here’s the model I use with peers. I call it the SPOF to Service Evidence Chain.

Business service: What a customer or market participant experiences.
Impact tolerance: The maximum tolerable outage or degradation.
Service map: People, process, tech, data, and third parties.
SPOFs and weak links: The parts that can break the service.
Tests and incidents: Evidence that you can survive disruption.
Fixes and governance: Proof you learn and reduce risk over time.

That chain is what regulators are asking for, even if the labels change.
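
One way to keep the chain honest is to hold it as a record per service and flag the links that have no evidence yet. The sketch below is a minimal illustration under that assumption. The class name `EvidenceChain`, its field names, and the `missing_links` helper are hypothetical and do not come from any regulatory template.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvidenceChain:
    """One record per business service, mirroring the six links above (hypothetical schema)."""
    business_service: str
    impact_tolerance_minutes: Optional[int] = None
    service_map: List[str] = field(default_factory=list)    # people, process, tech, data, third parties
    spofs: List[str] = field(default_factory=list)          # known SPOFs and weak links
    test_evidence: List[str] = field(default_factory=list)  # links to test results and incident reviews
    fixes: List[str] = field(default_factory=list)          # tickets with owners and dates

    def missing_links(self) -> List[str]:
        """Return the names of chain links that have nothing attached yet."""
        gaps = []
        if self.impact_tolerance_minutes is None:
            gaps.append("impact_tolerance")
        for name in ("service_map", "spofs", "test_evidence", "fixes"):
            if not getattr(self, name):
                gaps.append(name)
        return gaps


chain = EvidenceChain(
    "Make a card payment",
    impact_tolerance_minutes=15,
    service_map=["tokenization vendor"],
)
print(chain.missing_links())  # ['spofs', 'test_evidence', 'fixes']
```

An empty `missing_links()` result does not prove resilience. It only shows that each link has at least one artifact behind it.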

How to find SPOFs in modern systems, not just in infrastructure

Most CTOs I talk to can list their “big” SPOFs in 10 minutes. The ones that hurt are the quiet SPOFs, the stuff hiding in identity, data pipelines, vendor contracts, and human workflows.

Here’s a definition your team can actually use.

A SPOF is any component, dependency, or decision path whose failure, on its own, can disrupt an important business service beyond its impact tolerance.

That includes tech and people, and yes, it includes “we only have one person who can fix that.”

The SPOF taxonomy I use in reviews

Use this list as a checklist during service mapping workshops.

Identity SPOFs: One IdP tenant, one MFA provider, one break glass path.
Network SPOFs: One transit gateway, one DNS provider, one IP allowlist.
Compute SPOFs: One region, one cluster, one autoscaling control plane.
Data SPOFs: One primary database, one schema owner, one backup account.
Messaging SPOFs: One Kafka cluster, one topic with no replay plan.
Secrets SPOFs: One KMS key, one Vault cluster, one rotation job.
Observability SPOFs: One metrics backend, one pager routing service.
Deployment SPOFs: One CI runner pool, one artifact registry.
Third party SPOFs: One payment processor, one KYC vendor, one cloud.
People SPOFs: One engineer who knows the batch job, one SRE on call.

A lot of resilience programs die because they stop at “multi AZ” and declare victory. That’s not how outages work, and it’s not where regulators stop asking questions.
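
A first pass over the checklist above can be automated if your service map records how many independent instances back each dependency. This is a rough sketch under that assumption. The `Dependency` record, its fields, and the sample entries are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Dependency:
    category: str   # e.g. "identity", "network", "people"
    name: str
    instances: int  # independent instances that can serve traffic or do the work


def find_spof_candidates(deps: List[Dependency]) -> List[Dependency]:
    """Flag every dependency backed by at most one independent instance."""
    return sorted(
        (d for d in deps if d.instances <= 1),
        key=lambda d: (d.category, d.name),
    )


# Sample entries for illustration only.
service_map = [
    Dependency("identity", "IdP tenant", 1),
    Dependency("compute", "region", 2),
    Dependency("people", "batch job expert", 1),
    Dependency("third party", "payment processor", 1),
]

for dep in find_spof_candidates(service_map):
    print(f"{dep.category}: {dep.name}")
```

A count above one only removes an item from the candidate list. It does not show that the second instance works, which is what the tests later in this post are for.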

A concrete scenario: the “healthy app, dead business service” outage

A retail bank runs its mobile app across two regions. It passes synthetic checks. Then a certificate rotation fails in the outbound proxy used for card tokenization. The app still loads, login still works, and the API health endpoint stays green. But card payments fail for 38 minutes.

From a customer view, the “make a card payment” service is down. From a platform view, the app is up.

This is why the UK “important business services” framing is so useful. It forces you to measure what the user experiences, not what your load balancer reports.
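
The gap in the scenario above can be made visible by combining the platform health signal with a customer outcome signal. The sketch below assumes you already collect counts of attempted and successful payments from a synthetic or real flow. The function name, the status labels, and the 99.5 percent default are illustrative assumptions.

```python
def classify_service(health_ok: bool, attempts: int, successes: int,
                     min_success_rate: float = 0.995) -> str:
    """Combine a platform health signal with a customer-outcome signal.

    min_success_rate is an illustrative threshold, not a recommended value.
    """
    if attempts == 0:
        # No customer-flow data, so the health check alone proves nothing.
        return "unknown"
    success_rate = successes / attempts
    if success_rate >= min_success_rate:
        return "serving" if health_ok else "serving, platform alert"
    return "healthy app, dead service" if health_ok else "down"


# The scenario above: health endpoint green, card payments failing.
print(classify_service(health_ok=True, attempts=200, successes=0))
```

The useful part is the third branch. A green health endpoint with a failing customer flow gets its own label instead of hiding behind "up".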

If you want a clean place to track this, use Command Center (/command-center) as your system of record for services, SLOs, incidents, and known risks. It gives you one view for the board and one view for engineers, without running two parallel spreadsheets.

How to test SPOFs: chaos engineering, resilience testing, and proof

You can’t spreadsheet your way into resilience. You need controlled failure in environments that behave like production.

AWS Well-Architected says this plainly. It recommends running chaos experiments regularly in or near production, and it stresses hypotheses, steady state metrics, and rollback plans (AWS Reliability Pillar REL12-BP04).

Chaos engineering isn’t a stunt. It’s a discipline. You define steady state, inject a fault, and compare outcomes.
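
The loop of steady state, fault, and comparison can be sketched as a small harness. This is not the API of any chaos tool. `run_experiment` and the toy `system` dictionary are hypothetical stand-ins, and a real run would call your own metrics and fault injection code.

```python
from typing import Callable


def run_experiment(steady_state: Callable[[], bool],
                   inject_fault: Callable[[], None],
                   rollback: Callable[[], None]) -> str:
    """Verify steady state, inject a fault, compare, and always roll back."""
    if not steady_state():
        return "aborted: system not in steady state before injection"
    try:
        inject_fault()
        held = steady_state()  # hypothesis: steady state holds under the fault
    finally:
        rollback()             # runs even if injection or measurement raises
    return "pass" if held else "fail"


# Toy system standing in for real infrastructure.
system = {"vendor_up": True, "fallback_enabled": True}


def steady() -> bool:
    return system["vendor_up"] or system["fallback_enabled"]


def block_vendor() -> None:
    system["vendor_up"] = False


def unblock_vendor() -> None:
    system["vendor_up"] = True


print(run_experiment(steady, block_vendor, unblock_vendor))  # pass
```

Putting the rollback in a `finally` block is the design choice that matters here. The fault is reverted even when the measurement itself breaks.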

What should you test first?

Start with the failures that blow through impact tolerances fastest. Ask one question: which dependency can fail and cause customer harm in under 15 minutes? Then answer it with incident data and architecture reality, not opinions.
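
If you have incident data, the 15 minute question reduces to a filter and a sort. The sketch below assumes you can estimate, per dependency, how many minutes pass between its failure and customer harm. The function name and the sample figures are made up for illustration.

```python
from typing import Dict, List


def first_to_test(minutes_to_harm: Dict[str, float],
                  threshold_minutes: float = 15) -> List[str]:
    """Return dependencies that cause customer harm within the threshold, fastest first."""
    fast = [(mins, name) for name, mins in minutes_to_harm.items()
            if mins < threshold_minutes]
    return [name for mins, name in sorted(fast)]


# Illustrative figures only; real values should come from incident data.
observed = {
    "tokenization vendor": 1,
    "primary database": 2,
    "IdP": 5,
    "nightly batch job": 720,
}
print(first_to_test(observed))  # ['tokenization vendor', 'primary database', 'IdP']
```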

I like this ordering.

Region loss: Proves your routing, data replication, and runbooks.
Database failure modes: Primary down, replica lag, read only, lock storms.
Identity and access: IdP outage, token signing key rotation, MFA failure.
Third party timeouts: Payment rails, fraud scoring, KYC, SMS.
Deployment failures: Bad config, bad feature flag, bad schema migration.

The Nagarro guide lists common failure scenarios like accidental data deletion, deployment failure, and datacenter failure, and it frames resilience as something you test continuously, not once (Nagarro resilience testing and chaos engineering).

The “Impact Tolerance Experiment” template

This is the artifact I want every service team to produce. It’s easy to share, and it holds up in audits because it’s specific.

Impact Tolerance Experiment (ITE)

Service: “Card payment authorization”
Impact tolerance: 15 minutes of failed auths, 0.5 percent error budget burn per hour
Steady state metrics: p95 latency under 300 ms, 5xx under 0.01 percent, auth success over 99.5 percent
Fault injected: Block outbound calls to tokenization vendor for 10 minutes
Expected behavior: Fail over to secondary vendor in under 60 seconds, queue retries, show user message
Customer canary: Synthetic payment flow from 3 regions
Rollback: Automated network rule revert, plus manual runbook
Result: Pass or fail, with timestamps
Fixes: Ticket list with owners and dates

AWS gives an example of steady state metrics like less than 0.01 percent increase in 5xx errors, plus a client impact measure and synthetic monitors (AWS REL12-BP04). Use those numbers as a starting point, then tune them to your service and your impact tolerance.
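
To make the "Result" line of the template reproducible, the pass or fail decision can be computed from the recorded measurements. The sketch below reuses the thresholds from the example ITE above. The class names, field names, and the sample observation are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    # Thresholds copied from the example ITE above.
    max_p95_latency_ms: float = 300.0
    max_5xx_percent: float = 0.01
    min_auth_success_percent: float = 99.5


@dataclass(frozen=True)
class Observation:
    p95_latency_ms: float
    error_5xx_percent: float
    auth_success_percent: float
    failover_seconds: float


def evaluate_ite(obs: Observation, steady: SteadyState = SteadyState(),
                 max_failover_seconds: float = 60.0) -> dict:
    """Compare what was observed during the fault with the template's thresholds."""
    checks = {
        "p95_latency": obs.p95_latency_ms < steady.max_p95_latency_ms,
        "5xx_rate": obs.error_5xx_percent < steady.max_5xx_percent,
        "auth_success": obs.auth_success_percent > steady.min_auth_success_percent,
        "failover_time": obs.failover_seconds < max_failover_seconds,
    }
    return {"result": "pass" if all(checks.values()) else "fail", "checks": checks}


# Hypothetical measurements from one run.
run = Observation(p95_latency_ms=280, error_5xx_percent=0.005,
                  auth_success_percent=99.1, failover_seconds=45)
print(evaluate_ite(run)["result"])  # fail, because auth success is below 99.5 percent
```

Keeping the per-check results alongside the verdict gives the fix tickets something concrete to point at.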

Build vs buy for chaos tooling

A lot of teams try to build a fault injection platform. It can work. It’s also a long road, and you’ll be maintaining it forever.

Gremlin estimates 14 to 18 months of focused engineering time to build and maintain a sophisticated fault injection platform, and it cites an average cost of $100,000 per hour of downtime (Gremlin build vs buy, Oct 2024). That’s usually enough to get Finance to stop treating chaos testing like a “nice to have.”

Here’s a simple decision matrix you can reuse.

Decision factor	Build internal tooling	Buy a chaos tool
Time to first experiment	8 to 16 weeks for basic scripts	1 to 4 weeks with onboarding
Safety controls	You must design guardrails	Guardrails ship with the product
Audit evidence	You must build reporting	Reports and logs are built in
Coverage	Starts narrow	Starts broad, grows with vendor
Lock in risk	Low	Medium
Talent cost	2 to 5 engineers part time	1 to 2 engineers for adoption

If you buy, treat it like any other critical vendor. Put it through third party risk review and exit planning.

If you want to tie this to a broader vendor strategy, use our Build vs Buy Matrix (/tools/build-vs-buy-matrix) to document the decision and the trade-offs.

Leadership reality: chaos fails without social safety

Chaos testing creates fear if teams think it’s a trap. You need a simple rule set, and you need to mean it.

The goal is to find system gaps, not blame people.
The team that owns the service owns the experiment.
The on call engineer can stop the test at any time.

Then back it up with a blameless review process. Our incident postmortem guide (/tools/incident-postmortem) gives a structure that works well with regulators too, since it shows learning and follow up.

Operational resilience compliance: what regulators ask for and how to answer

Regulators don’t want your architecture diagram. They want proof you can keep serving customers when things break, and proof you’re closing the gaps you already know about.

Riskonnect’s 2025 roundup calls out a practical starting point. Assess which DORA elements already exist, find gaps, and build an action plan. It also stresses understanding your business continuity, infosec, IT disaster recovery, crisis management, crisis communications, regulatory reporting, and third party risk programs (Riskonnect 2025 deadlines roundup).

APRA’s COVID-19 paper is a good reminder that shocks aren’t only cyber. Lockdowns and border closures forced remote work changes in weeks, not years (APRA COVID-19 resilience lessons). That’s a people and process shock as much as a tech shock.

A quick regulatory map for CTOs

You’ll still need legal and risk partners, but this helps frame the engineering work.

UK FCA, PRA, Bank of England: Focus on important business services, impact tolerances, mapping, scenario testing, and self assessment (Riskonnect UK basics).
EU DORA: Focus on ICT risk management, incident reporting, and ICT third party oversight. It applies from 17 Jan 2025 (EIOPA DORA overview).
Australia APRA CPS 230: Deadline 1 Jul 2025, with a stronger operational risk and service provider lens (Riskonnect 2025 deadlines roundup).
Guernsey GFSC: Often aligns with UK style expectations for governance and continuity, and it expects evidence of control for outsourced services. Treat it as UK plus tighter proportionality arguments.

The IIF paper also calls out a real pain point. Different definitions across jurisdictions create complex regimes for cross border firms, even when the goal is the same (IIF staff paper, Dec 2024).

So you need one internal control set, then a mapping layer per regulator.
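
A mapping layer can start as a plain lookup table that your legal and risk partners review. The sketch below is illustrative only and is not legal advice. The control IDs and the pairing of each control with a regulator term are assumptions, and the terms themselves are the ones quoted earlier in this post.

```python
from typing import Dict, List

# Internal control -> regulator vocabulary (assumed pairings, for illustration).
CONTROL_MAP: Dict[str, Dict[str, str]] = {
    "service-inventory": {
        "UK": "important business services",
        "DORA": "critical or important functions",
    },
    "tolerance-setting": {
        "UK": "impact tolerances",
    },
    "vendor-oversight": {
        "DORA": "ICT third party oversight",
        "APRA CPS 230": "operational risk and service providers",
    },
}


def regulator_view(regulator: str) -> Dict[str, str]:
    """Return {internal control: regulator term} for one regime."""
    return {control: terms[regulator]
            for control, terms in CONTROL_MAP.items() if regulator in terms}


def unmapped_controls(regulator: str) -> List[str]:
    """Controls with no entry for this regime: either not applicable or not mapped yet."""
    return [c for c, terms in CONTROL_MAP.items() if regulator not in terms]


print(regulator_view("DORA"))
print(unmapped_controls("DORA"))  # ['tolerance-setting']
```

The point of the structure is that engineers maintain one set of controls, while the per-regulator wording lives in data that can change without touching the controls.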

Metrics that stand up in board packs and exams

Most compliance frameworks don’t prescribe exact metrics. You still need numbers that show you’re in control and getting better.

TechTarget gives examples of resilience KPIs and points out that resilience failures can hit market value. It cites CrowdStrike stock dropping about 11 percent after a 2024 outage, and Meta losing $50 billion in market value after a 2021 outage (TechTarget resilience metrics and executive impact).

For CTO reporting, I like a small set.

Service level: SLO attainment, error budget burn, p95 latency.
Recovery: RTO and RPO per service, plus actual recovery times.
Testing: Number of ITEs run per quarter, pass rate, mean time to rollback.
Third party: Top 10 vendor dependencies by customer impact, plus exit test status.

Bryghtpath calls out classic metrics like RTO, RPO, and incident response time, and it points to ISO 22316:2017 as a reference for organizational resilience measurement (Bryghtpath operational resilience metrics).
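
Several of these numbers are simple ratios, so it helps to pin down the arithmetic once. The sketch below uses the common definition of burn rate as the observed error rate divided by the error rate the SLO allows. The function names and sample inputs are hypothetical.

```python
from typing import Dict, List


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Error budget burn rate: observed error rate divided by the allowed error rate.

    A value of 1.0 means the budget is being spent exactly as fast as the SLO allows.
    """
    if total == 0:
        return 0.0
    allowed = 1.0 - slo_target
    return (failed / total) / allowed


def ite_pass_rate(results: List[str]) -> float:
    """Share of Impact Tolerance Experiments that passed, as a fraction from 0 to 1."""
    if not results:
        return 0.0
    return sum(1 for r in results if r == "pass") / len(results)


def rto_breaches(rto_minutes: Dict[str, float],
                 actual_minutes: Dict[str, float]) -> List[str]:
    """Services whose measured recovery time exceeded the stated RTO."""
    return sorted(s for s, actual in actual_minutes.items()
                  if actual > rto_minutes[s])


# Sample inputs for illustration: 1 percent errors against a 99.5 percent SLO.
print(round(burn_rate(failed=10, total=1000, slo_target=0.995), 2))
print(ite_pass_rate(["pass", "fail", "pass", "pass"]))
print(rto_breaches({"payments": 15, "login": 30}, {"payments": 38, "login": 12}))
```

In the first call, a 1 percent error rate against a 0.5 percent allowance works out to a burn rate of about 2, meaning the budget would run out in half the SLO window.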

If you want to operationalize this, wire it into an Engineering Metrics Dashboard (/tools/engineering-metrics-dashboard) so resilience work competes fairly with feature work.

CTO action plan: immediate fixes, policy, and architecture principles

This is the part most posts skip. They talk about resilience, but they don’t tell you what to do on Monday.

Immediate actions (next 30 days)

Pick 3 important business services. Choose the ones with the highest customer harm and regulatory exposure.
Set impact tolerances. Put numbers on maximum outage and maximum degraded performance.
Build a service map. Include people, process, tech, data, and third parties.
List top 10 SPOFs per service. Include identity, data, and vendor dependencies.
Run one ITE per service. Keep it small, time boxed, and safe.

Track all of this in one place. Command Center (/command-center) works well for service inventories, risk registers, and incident links.

Policy framework (next 90 days)

Evidence standard: Every control needs an artifact. Use runbooks, test logs, and postmortems.
Third party tiering: Tier vendors by customer impact, not by spend.
Change gates: Require rollback plans for any change that can breach impact tolerance.
Incident reporting: Align severity levels to regulatory reporting triggers.
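
The change gate in the list above is easy to express as a check in a review or deployment pipeline. This is a minimal sketch. The `ChangeRequest` fields and the `gate` function are hypothetical, and it assumes the service owner marks whether a change can breach impact tolerance.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ChangeRequest:
    service: str
    description: str
    can_breach_tolerance: bool           # set by the service owner during review
    rollback_plan: Optional[str] = None  # link to a runbook or automated revert


def gate(change: ChangeRequest) -> Tuple[bool, str]:
    """Return (approved, reason) under the change-gate rule above."""
    if change.can_breach_tolerance and not change.rollback_plan:
        return (False, "rollback plan required: change can breach impact tolerance")
    return (True, "ok")


risky = ChangeRequest("card payment authorization", "schema migration",
                      can_breach_tolerance=True)
print(gate(risky))
```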

This is also where enterprise architecture earns its keep. Use ArchiMate Modeler (/tools/archimate) to keep service maps current, since stale diagrams fail audits.

Architecture principles (next 6 to 12 months)

Design for partial failure: Assume dependencies will hang until they time out rather than fail fast.
Make failover boring: Automate it, then test it quarterly.
Separate control planes: Keep identity, secrets, and observability from sharing one blast radius.
Prove recoverability: Backups aren’t real until you restore in a timed drill.
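
The recoverability principle above comes down to timing a real restore and comparing it with the RTO. The harness below is a sketch under that assumption. `timed_restore_drill` and `fake_restore` are hypothetical, and a real `restore` callable would restore a backup and verify the data before returning `True`.

```python
import time
from typing import Callable


def timed_restore_drill(restore: Callable[[], bool], rto_seconds: float) -> dict:
    """Run a restore, time it, and compare the elapsed time with the RTO."""
    start = time.monotonic()
    restored = restore()  # should return True only if the restored data was verified
    elapsed = time.monotonic() - start
    return {
        "restored": restored,
        "elapsed_seconds": elapsed,
        "within_rto": restored and elapsed <= rto_seconds,
    }


def fake_restore() -> bool:
    """Stand-in for a real restore plus verification step."""
    time.sleep(0.05)
    return True


print(timed_restore_drill(fake_restore, rto_seconds=1.0)["within_rto"])  # True
```

Store the returned record with the drill date. That timestamped result is the artifact the evidence standard asks for.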

If you do only one thing, do this: treat every “we have redundancy” claim as unproven until a test proves it.

Bigger picture: resilience is now a leadership system, not a tech program

Operational resilience is becoming a benchmark for executive performance, not just an SRE concern. The market punishes outages, and regulators punish weak control. That pushes CTOs into a job that’s as much coordination as it is architecture. You’re running a program across engineering, risk, legal, comms, and vendors.

APRA’s COVID-19 lesson is the clearest example. Remote work, border closures, and sudden process changes hit in weeks, not years (APRA COVID-19 resilience lessons). That’s the pattern for the next decade. Shocks come from geopolitics, supply chains, and vendor concentration, not only from bugs.

So here’s the test I use. If your top payment flow degraded for 30 minutes tomorrow, could you show a regulator the service map, the impact tolerance, the last test result, and the fix backlog in under two hours?
