STAMP Framework for Resilience: Operational Resilience Tool

STAMP framework for resilience: an operational resilience assessment tool guide for FCA and DORA

DORA applies from 17 January 2025, and it pulls more tech firms into scope, including some crypto-asset service providers and ICT third-party providers. UK firms also hit a hard milestone on 31 March 2025, when the transition period for the FCA and PRA operational resilience rules ends. Those dates matter because regulators now ask for service maps, impact tolerances, and evidence, not just security controls.

Clyde & Co lays out DORA’s scope expansion and its focus on ICT third-party risk and testing. That’s where a lot of startups feel the squeeze first, because your product might be fine, but your dependency chain isn’t. See Clyde & Co’s DORA overview.

My thesis is simple: CTOs need a fast operational resilience assessment tool that turns regulatory language into engineering work. The STAMP Framework does that by forcing five concrete conversations, then capturing evidence you can defend in a partner review, an audit, or a regulator conversation.

What is the STAMP operational resilience framework?

STAMP is a structured way to assess operational resilience across five dimensions: Services, Tolerances, Architecture, Monitoring, and Proof. It lines up with the core expectations in FCA operational resilience and the EU Digital Operational Resilience Act, including resilience testing and ICT third-party risk.

Here’s a paraphrased definition teams can repeat internally:

STAMP is an operational resilience assessment tool that checks five things: which services matter, how much disruption is acceptable, whether the architecture can meet that bar, how teams detect breaches, and what evidence proves it.

STAMP works because it matches how regulators think about harm. The FCA regime is about avoiding “intolerable harm” and continuing to deliver important business services. It’s not a promise that nothing will ever go down. Covington summarizes the FCA’s framing and the idea that this is ongoing refinement, testing, and governance, not a one-time project. See Covington’s FCA findings one year on.

A practical STAMP assessment breaks into these components:

Services: define important business services and their boundaries.
Tolerances: set impact tolerances that describe “maximum tolerable disruption.”
Architecture: map dependencies and find single points of failure.
Monitoring: detect when a service is nearing or breaching tolerance.
Proof: keep evidence from tests, incidents, and governance decisions.

The framing statement: STAMP turns resilience into a product backlog with receipts.

FCA operational resilience: how to define important business services mapping

Most Series A and B teams start in the wrong place. They start with systems. Regulators start with services delivered to external users.

The Bank of England and PRA guidance makes the expectation explicit: firms should map the people, processes, technology, facilities, and information needed to deliver each important business service. Mapping should help identify vulnerabilities and support testing against impact tolerances. See BoE and PRA impact tolerances paper (PDF).

A service definition that works in a startup

Use this rule:

An important business service is an externally delivered outcome.
It has a clear start and end.
It has a measurable unit.

Examples for a fintech with 40 engineers:

“Card payments authorization for UK retail customers.”
“Faster Payments outbound transfers for UK customers.”
“Customer identity verification and account opening.”

Non-examples:

“Kubernetes cluster uptime.”
“Data platform.”
“On-call.”

PwC’s impact tolerance paper calls out the same distinction. Business services are provided to an external end user, not internal functions like IT or payroll. See PwC on setting and testing impact tolerances (PDF).

The minimum viable service map

A service map doesn’t need 200 boxes. It needs enough detail to find the failure that breaks the tolerance.

For each service, capture:

Customer journey steps: 5 to 12 steps, written in plain language.
Systems: APIs, queues, databases, and third-party services.
Data stores: where state lives, and what “loss” means.
People: which team owns each step, and who is on-call.
Manual workarounds: spreadsheets count as dependencies, and regulators know teams rely on them.
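The capture list above can live as a small typed record per journey step. Here's a minimal Python sketch, not a prescribed schema. The names JourneyStep, ServiceMap, and problems() are hypothetical, and the checks only encode the 5-to-12 step guideline plus the ownership and on-call items from the list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JourneyStep:
    description: str                                      # plain-language step
    owner_team: str                                       # team that owns the step
    on_call: str                                          # rota or person paged for this step
    systems: List[str] = field(default_factory=list)      # APIs, queues, databases, third parties
    data_stores: List[str] = field(default_factory=list)  # where state lives
    manual_workaround: Optional[str] = None               # e.g. "ops spreadsheet"


@dataclass
class ServiceMap:
    service: str
    steps: List[JourneyStep]

    def problems(self) -> List[str]:
        """Return findings; an empty list means the map passes these basic checks."""
        found = []
        if not 5 <= len(self.steps) <= 12:
            found.append(f"expected 5 to 12 journey steps, got {len(self.steps)}")
        for number, step in enumerate(self.steps, start=1):
            if not step.owner_team:
                found.append(f"step {number} has no owning team")
            if not step.on_call:
                found.append(f"step {number} has no on-call contact")
            if not step.systems and not step.manual_workaround:
                found.append(f"step {number} lists neither a system nor a manual workaround")
        return found
```

Running problems() in CI against a checked-in map is one cheap way to keep the map current. That's a design suggestion, not something the regulators prescribe.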

Protecht’s mapping series makes the point that mapping is the bridge between identifying services and doing scenario testing. See Protecht on mapping important business services.

A leadership trap: “We’re too small for this”

Small teams still have complex dependency chains. A 25-person engineering org can depend on AWS, Stripe, Twilio, Datadog, Cloudflare, and a KYC vendor. That’s already a multi-party service.

So the right question is: how small can the map be and still be honest? The answer is “small enough to keep current, big enough to find the break.”

If the org needs a place to store these maps and track owners, this is where Command Center fits. It gives a single view of services, risks, incidents, and migrations. Link: track service health and risk in Command Center.

Impact tolerance assessment: setting numbers teams can test

Impact tolerances define the maximum disruption a firm will accept for an important business service. They are usually expressed in terms of time, volume, and data loss. Regulators expect firms to set tolerances, then test “severe but plausible” scenarios.

The BoE and PRA paper uses that exact idea: firms should remain within impact tolerances in severe but plausible scenarios, and they should map and test to support that. See BoE and PRA impact tolerances paper (PDF).

A simple tolerance template

For each service, set three numbers:

Maximum outage duration: for full loss of service.
Maximum degraded duration: for partial loss, like 30 percent failure rate.
Maximum data loss: in minutes of data, or in transactions.

Write them as statements:

“We will not tolerate more than 2 hours of complete outage for outbound payments.”
“We will not tolerate more than 24 hours of degraded KYC processing with a failure rate above 15 percent.”
“We will not tolerate more than 5 minutes of confirmed data loss for ledger entries.”
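If it helps to make the three numbers testable, they can be stored as data and compared against what a test or incident actually showed. This is a minimal Python sketch under stated assumptions: ImpactTolerance and breaches() are hypothetical names, and the example combines the 2-hour outage figure from the statement above with made-up degraded and data-loss figures for the same service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactTolerance:
    service: str
    max_outage_minutes: int        # full loss of service
    max_degraded_minutes: int      # time spent above the degraded failure-rate threshold
    degraded_failure_rate: float   # e.g. 0.30 for "30 percent failure rate"
    max_data_loss_minutes: int     # confirmed data loss, in minutes of data

    def breaches(self, outage_minutes=0, degraded_minutes=0, data_loss_minutes=0):
        """Return the names of the tolerances that the observed disruption exceeded."""
        observed = {
            "outage": (outage_minutes, self.max_outage_minutes),
            "degraded": (degraded_minutes, self.max_degraded_minutes),
            "data_loss": (data_loss_minutes, self.max_data_loss_minutes),
        }
        # "More than" in the statements means strictly greater than the limit.
        return [name for name, (seen, limit) in observed.items() if seen > limit]


# Hypothetical values: only the 120-minute outage limit comes from the statements above.
payments = ImpactTolerance(
    service="Outbound payments",
    max_outage_minutes=120,
    max_degraded_minutes=480,
    degraded_failure_rate=0.30,
    max_data_loss_minutes=5,
)
```

With these numbers, a 150-minute full outage would come back as a breach of the outage tolerance, while exactly 120 minutes would not.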

The Investment Association’s material shows how firms document impact tolerance statements and rationale, and how they use them to calibrate thresholds across services. See Impact Tolerances: Appetite for Disruption (PDF).

How to pick the number without guessing

If you’re picking tolerances by gut feel, you’ll either over-promise (and create an impossible engineering plan) or under-promise (and fail partner diligence). I’ve had better results using three inputs:

Harm: customer impact, market impact, and firm viability.
Operational reality: current RTO, RPO, and manual fallback.
Regulatory posture: what the board will sign.

Riskonnect’s 2025 deadlines roundup gives a practical sequence that matches this: refine important business services, understand where intolerable harm hits, then validate through exercises, then mature analysis of single points of failure. See Riskonnect’s operational resilience deadlines guide.

A decision matrix: tolerance ambition vs engineering cost

Use this matrix in planning. It stops teams from setting tolerances that quietly imply a full rewrite.

Tolerance target	Typical engineering work	Cost profile for 10-100 engineers	When it fits
24 to 72 hours	Runbooks, manual fallback, vendor escalation paths	Low to medium	Early product, low systemic risk
4 to 12 hours	Multi-region read paths, queue replay, tested restores	Medium	Payments, onboarding, customer access
Under 1 hour	Active-active, automated failover, strict change control	High	Core ledger, market-critical services
Near zero data loss	Synchronous replication, strong consistency trade-offs	High	Ledger and settlement systems

This matrix is useful because it forces a board-level trade. It also gives engineering a clean way to say, “If you want that tolerance, here’s what it costs.”

Teams can pair this with our Cloud Cost Estimator to price the jump from single-region to multi-region. Link: estimate multi-region cost with Cloud Cost Estimator.

STAMP framework for resilience: Architecture and third-party risk under DORA

DORA’s scope and deadlines push third-party risk into the engineering backlog. DORA applies from 17 January 2025, and it includes ICT risk management, incident reporting, resilience testing, and third-party risk management. Riskonnect lists these high-level areas and the same effective date. See Riskonnect on DORA requirements.

The DORA tracking site digital-operational-resilience-act.com, which is a third-party resource rather than an official EU publication, also notes the oversight framework for critical ICT third-party providers and the 30 April 2025 deadline for register of information submissions. See digital-operational-resilience-act.com on RTS and third-party oversight.

Architecture: map the failure, not the diagram

For each important business service, teams need to identify the break points that cause a tolerance breach. This isn’t about drawing a pretty architecture diagram. It’s about answering, “What breaks first, and what happens next?”

Common break points in startups:

Single region dependency: one AWS region, one database primary.
Single queue: one Kafka cluster, one SQS queue without replay.
Shared admin plane: one IAM account, one CI system.
Hidden coupling: shared database tables across services.
Third-party choke point: one KYC vendor, one SMS provider.
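One way to surface break points like these is to list, for each capability a service needs, every dependency that can provide it, then flag the capabilities with only one provider. The Python sketch below assumes that simple model. The function name, the capability names, and the dependency names are all hypothetical, and it won't catch hidden coupling that isn't written into the map.

```python
from collections import defaultdict
from typing import Dict, List


def single_points_of_failure(requirements: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return {dependency: [capabilities lost if it fails]} for sole providers.

    `requirements` maps each capability the service needs to the
    dependencies that can each provide it on their own.
    """
    lost = defaultdict(list)
    for capability, providers in requirements.items():
        unique = set(providers)
        if len(unique) == 1:
            # Exactly one way to deliver this capability: no redundancy.
            lost[next(iter(unique))].append(capability)
    return dict(lost)


# Hypothetical onboarding service.
onboarding = {
    "identity check": ["kyc-vendor-a"],
    "otp delivery": ["sms-provider", "email-otp"],
    "account database": ["aws-eu-west-1"],
    "event queue": ["aws-eu-west-1"],
}
```

In this example the single region shows up as the sole provider for two capabilities, which makes it the first thing to test against the tolerance.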

The BoE and PRA paper calls out mapping across third parties as necessary, even when suppliers resist sharing details. That’s the hard part for startups with low negotiating power. See BoE and PRA impact tolerances paper (PDF).

Third-party risk: what engineering leaders can do this quarter

DORA expects firms to manage ICT third-party risk, including contract terms, due diligence, and testing. Clyde & Co highlights DORA’s focus on ICT third-party service providers and board accountability themes shared with NIS2. See Clyde & Co on DORA and third parties.

If you want something you can ship this quarter, build a third-party register that engineering can actually maintain. For a 60-engineer org, I’d include:

Vendor name and service: “Twilio SMS for OTP.”
Service owner: one internal accountable person.
Failure mode: “OTP delivery delayed or blocked.”
Fallback: email OTP, passkeys, call center.
Contract hooks: incident notification window, audit rights, sub-outsourcing terms.
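A register like this stays maintainable when it is data with a check attached. Here's a minimal Python sketch: VendorEntry, REQUIRED_HOOKS, and register_gaps() are hypothetical names, and the three required hooks are simply the ones listed above, not a complete reading of DORA's contract requirements.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# The contract hooks named in the list above; treat this as a starting set.
REQUIRED_HOOKS = ("incident notification window", "audit rights", "sub-outsourcing terms")


@dataclass
class VendorEntry:
    vendor_service: str                                   # e.g. "Twilio SMS for OTP"
    owner: str                                            # one internal accountable person
    failure_mode: str                                     # e.g. "OTP delivery delayed or blocked"
    fallbacks: List[str] = field(default_factory=list)    # e.g. ["email OTP", "passkeys"]
    contract_hooks: Dict[str, bool] = field(default_factory=dict)


def register_gaps(entries: List[VendorEntry]) -> List[Tuple[str, str]]:
    """Return (vendor_service, gap) pairs for entries that need follow-up."""
    gaps = []
    for entry in entries:
        if not entry.owner:
            gaps.append((entry.vendor_service, "no accountable owner"))
        if not entry.fallbacks:
            gaps.append((entry.vendor_service, "no fallback"))
        for hook in REQUIRED_HOOKS:
            if not entry.contract_hooks.get(hook, False):
                gaps.append((entry.vendor_service, f"contract missing: {hook}"))
    return gaps
```

The output is a plain list of gaps, so it can feed straight into backlog tickets with the vendor's owner assigned.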

If teams need a structure for make-or-buy choices that affect resilience, use our Build vs Buy Matrix. Link: make vendor choices with the Build vs Buy Matrix.

Monitoring and Proof: turning resilience into evidence regulators accept

Monitoring and proof are where many teams fall down. They build decent systems, but they can’t show that those systems meet tolerances. In a regulator or bank partner review, “trust us” doesn’t count.

Covington’s summary of FCA observations points to operational resilience as an ongoing process of testing and governance. That implies evidence and repeatability, not a one-off tabletop. See Covington’s FCA findings one year on.

Monitoring: measure the tolerance, not the server

For each service, define:

Service level indicators tied to harm: payment success rate, authorization latency, onboarding completion time.
Breach conditions tied to tolerance: “complete outage clock starts at 5 minutes of 100 percent failure.”
Escalation path: who gets paged, and when the business gets pulled in.

A common pattern:

Customer-facing SLI: percent of successful payment submissions.
Dependency SLI: queue lag, database replication lag.
Manual SLI: backlog size for manual review.
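To show how a breach condition can be computed from a customer-facing SLI, here's a minimal Python sketch. It assumes per-minute success rates between 0.0 and 1.0, treats a run of failures as a confirmed outage only once it reaches 5 consecutive minutes, and then counts the run from its first failing minute. The function names and the 50 percent "nearing" threshold are my assumptions, not regulatory definitions.

```python
from typing import Sequence


def outage_minutes(success_rates: Sequence[float], confirm_after: int = 5) -> int:
    """Length in minutes of the longest confirmed full outage.

    A run of 100 percent failure counts only if it lasts at least
    `confirm_after` consecutive minutes; shorter blips return 0.
    """
    longest = run = 0
    for rate in success_rates:
        run = run + 1 if rate <= 0.0 else 0
        longest = max(longest, run)
    return longest if longest >= confirm_after else 0


def tolerance_status(success_rates: Sequence[float], max_outage_minutes: int,
                     warn_fraction: float = 0.5) -> str:
    """Classify the window as 'within', 'nearing', or 'breached'."""
    minutes = outage_minutes(success_rates)
    if minutes > max_outage_minutes:
        return "breached"
    if minutes >= warn_fraction * max_outage_minutes:
        return "nearing"
    return "within"
```

The "nearing" state is the useful one in practice: it's the point where the escalation path should pull the business in, before the tolerance is actually breached.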

Teams can track these in an engineering metrics view, then connect them to incident outcomes. Link: track delivery and reliability in Engineering Metrics Dashboard.

Proof: the “resilience evidence pack” checklist

Regulators and auditors ask for a self-assessment and supporting evidence. The early FCA consultation and later guidance describe a self-assessment document that includes services, impact tolerances, mapping, testing, and lessons learned. See The Investment Association consultation summary (PDF).

Use this checklist as the minimum evidence pack per service:

Service definition: one page, with boundaries and owners.
Impact tolerance statement: numbers, rationale, approval date.
Service map: dependencies, third parties, and manual steps.
Scenario tests: at least 2 per year per critical service.
Test results: what broke, time to recover, gaps found.
Remediation plan: backlog items with owners and dates.
Incident links: postmortems that show learning and change.
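The checklist can be turned into a simple completeness check per service. This Python sketch is illustrative only: REQUIRED_ARTIFACTS and evidence_gaps() are hypothetical names, and it reads "at least 2 per year" as two scenario tests in the trailing 365 days, which is one possible interpretation.

```python
from datetime import date
from typing import List, Set

# Artifact names taken from the checklist above; scenario tests are counted separately.
REQUIRED_ARTIFACTS = (
    "service definition",
    "impact tolerance statement",
    "service map",
    "test results",
    "remediation plan",
    "incident links",
)


def evidence_gaps(artifacts: Set[str], scenario_test_dates: List[date],
                  today: date, min_tests_per_year: int = 2) -> List[str]:
    """Return what is missing from one service's evidence pack."""
    gaps = [f"missing: {name}" for name in REQUIRED_ARTIFACTS if name not in artifacts]
    # Count only tests that ran within the last 365 days, ignoring future dates.
    recent = [d for d in scenario_test_dates if 0 <= (today - d).days <= 365]
    if len(recent) < min_tests_per_year:
        gaps.append(f"scenario tests in last 365 days: {len(recent)} of {min_tests_per_year}")
    return gaps
```

An empty list means the pack meets this minimum bar. It says nothing about the quality of the documents themselves.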

For postmortems, use a consistent template so evidence stays readable. Link: run blameless reviews with Incident Postmortem.

Enterprise implications for Series A and B CTOs

Early-stage CTOs often assume resilience rules are “later.” That assumption fails the moment you hit enterprise sales, bank partnerships, or EU expansion.

Sales cycles now include resilience due diligence. A bank partner will ask for important business services mapping and impact tolerance assessment. They’ll also ask how third parties are managed under DORA-style expectations.
Shadow dependencies become board risks. A single vendor outage can breach a tolerance and trigger customer harm. Without a service map, teams can’t even name the dependency with confidence.
Engineering priorities shift from features to proof. Teams need test artifacts, monitoring tied to tolerances, and documented decisions. That changes how staff time gets allocated across quarters, even if nobody wants it to.
Multi-regime scope creates duplicated work. Sidley notes that firms subject to both UK and EU regimes may need separate scoping and implementation projects, since the regimes diverge in various ways. See Sidley on UK rules and DORA overlap.

CTO recommendations: how to run STAMP in a 30-day sprint

STAMP works best as a short, time-boxed program with clear outputs. Treat it like an internal product launch, with a real definition of done.

Immediate actions

Pick 3 services: choose the ones that would cause the most customer harm. Write one-sentence definitions and owners.
Set draft tolerances: pick numbers that match current reality, then mark the gaps.
Map dependencies: capture systems and third parties in a single diagram per service.
Run one severe scenario: simulate a region outage or a vendor outage. Measure time to restore.
Create the evidence pack: store it in one place, with dates and approvals.
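Marking the gap between a draft tolerance and measured recovery is just arithmetic on timestamps from the scenario run. Here's a minimal Python sketch: ScenarioRun and its fields are hypothetical names, and the timestamps in any usage would come from your own exercise log.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ScenarioRun:
    service: str
    scenario: str               # e.g. "region outage" or "vendor outage"
    started: datetime           # when the simulated failure began
    restored: datetime          # when the service was delivering again
    draft_tolerance: timedelta  # the draft maximum outage for this service

    @property
    def time_to_restore(self) -> timedelta:
        return self.restored - self.started

    @property
    def gap(self) -> timedelta:
        """How far recovery overshot the draft tolerance; zero if it stayed within."""
        return max(self.time_to_restore - self.draft_tolerance, timedelta(0))
```

For instance, a run that starts at 10:00 and restores at 13:10 against a 2-hour draft tolerance has a gap of 1 hour 10 minutes, which becomes either a remediation item or a reason to revisit the tolerance.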

Policy framework

Service ownership: assign one accountable leader per important business service.
Change control for critical paths: require risk review for changes that affect tolerance.
Third-party register: keep a living list with owners, failure modes, and fallback.
Testing cadence: schedule scenario tests and restore tests on the calendar.

Architecture principles

Design to the tolerance: pick RTO and RPO targets that match impact tolerances.
Remove single points of failure: start with identity, payments, and data stores.
Prefer simple fallback: manual processing beats a half-built active-active setup.
Instrument the customer journey: measure what users feel, not what servers report.

Teams that want to document architecture and dependencies in a consistent way can use ArchiMate Modeler for service maps and dependency views. Link: model service dependencies with ArchiMate Modeler.

Bigger picture: resilience is becoming a product feature

Operational resilience rules are changing how tech leaders talk about reliability. It’s no longer an SRE-only conversation. Boards care, partners care, and procurement teams are getting sharper questions from their own risk people.

For startups, the win isn’t compliance theater. The win is faster incident response, clearer ownership, and fewer surprise outages that derail fundraising or enterprise deals. The catch is boring: you have to write things down, and you have to keep them current.

One gut-check question I like: if a regulator or a bank partner asked for your top three important business services mapping and impact tolerances, could your team answer in one day?

Use the tool

Use STAMP Operational Resilience to run a structured assessment across Services, Tolerances, Architecture, Monitoring, and Proof. Link: https://theartofcto.com/tools/stamp-framework
