On-call rotation planner: fair on-call scheduling

On-call rotation planner: how to build a fair, sustainable schedule

In a 2026 survey write-up, 46% of SREs said they handled more than 5 incidents in 30 days, and 23% handled 6 to 10. That load breaks teams when it stacks on top of feature work and meetings. The schedule is usually the first place you see the cracks. A fair rotation cuts fatigue, improves response, and keeps good people from walking. This guide lays out an on-call rotation planner mindset, plus how to apply The Art of CTO On-Call Rotation Optimizer to build a schedule people can actually live with.

What an on-call schedule optimizer does, and what it must consider

An on-call schedule optimizer is a planning method plus a set of rules. It produces a rotation that spreads the pain, covers time zones, and avoids back-to-back fatigue. The Art of CTO On-Call Rotation Optimizer does this by weighing coverage windows, rest, and team size limits, then proposing a schedule that stays sustainable.

Most CTOs I talk to treat on-call like a calendar problem. It’s not. It’s system design mixed with human limits.

A workable rotation has a few parts:

Coverage model: 24 by 7, business hours, or follow-the-sun.
Roles: primary, secondary, and sometimes incident commander.
Handoff: a fixed time and a written checklist.
Overrides: PTO, holidays, and launch weeks.
Load limits: a pager budget and a rule for when to stop the line.

Google’s SRE workbook describes a team with a pager budget of two incidents per shift that drifted to five per shift for a year. One third of shifts exceeded the budget, follow-up work collapsed, and engineers left the team. If you’ve run teams in the 10 to 100 engineer range, you’ve seen some version of this. The schedule won’t fix alert noise, but it can stop the noise from landing on the same person every time. See Google SRE workbook on on-call and pager load.

Here’s the framing I use with Series A CTOs: a fair schedule is one of the cheapest reliability investments you can make.

How to design a fair on-call rotation (fair on-call scheduling)

Fairness isn’t “everyone gets the same number of weeks.” Fairness is “everyone gets the same disruption over time.” That includes nights, weekends, and the ugly incidents that ruin sleep.

Use the Fair Pager Load Framework

Here’s a definition teams can adopt.

Fair Pager Load Framework: A rotation is fair when it balances four things across engineers over a rolling 8 to 12 weeks: time on-call, after-hours pages, sleep disruption, and recovery time.

This forces an uncomfortable truth. Two engineers can each do one week on-call, but one gets 12 pages at 2 a.m. and the other gets none. The calendar looks fair. The lived experience isn’t.
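
One way to make the framework measurable is to total the four dimensions per engineer over the rolling window and flag anyone far from the team average. The sketch below is only an illustration under stated assumptions: the field names, the 1.5x threshold, and the sample data are hypothetical, and this is not a description of how the On-Call Rotation Optimizer works internally.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Load:
    """Per-engineer totals over a rolling window (for example 8 to 12 weeks)."""
    hours_on_call: float
    after_hours_pages: int
    sleep_disruptions: int  # pages that landed during sleeping hours
    recovery_hours: float   # time off granted after rough shifts


def overloaded(loads: dict[str, Load], ratio: float = 1.5) -> dict[str, list[str]]:
    """Return, per engineer, the dimensions that look unfair.

    A burden is flagged when it exceeds ratio x the team mean.
    Recovery time is a benefit, so it is flagged when it falls below mean / ratio.
    The 1.5 ratio is an illustrative assumption, not a standard.
    """
    flags: dict[str, list[str]] = {}
    for name_of_field in ("hours_on_call", "after_hours_pages", "sleep_disruptions"):
        avg = mean(getattr(load, name_of_field) for load in loads.values())
        for engineer, load in loads.items():
            if avg > 0 and getattr(load, name_of_field) > ratio * avg:
                flags.setdefault(engineer, []).append(name_of_field)
    avg_recovery = mean(load.recovery_hours for load in loads.values())
    for engineer, load in loads.items():
        if avg_recovery > 0 and load.recovery_hours < avg_recovery / ratio:
            flags.setdefault(engineer, []).append("recovery_hours")
    return flags


# Same calendar time for everyone, very different nights (hypothetical data).
team = {
    "ana": Load(hours_on_call=168, after_hours_pages=12, sleep_disruptions=12, recovery_hours=0),
    "ben": Load(hours_on_call=168, after_hours_pages=0, sleep_disruptions=0, recovery_hours=0),
    "cho": Load(hours_on_call=168, after_hours_pages=3, sleep_disruptions=1, recovery_hours=0),
}
print(overloaded(team))
# {'ana': ['after_hours_pages', 'sleep_disruptions']}
```

Hours on-call are identical here, so a calendar-only view would call this rotation fair. The page and sleep dimensions are what surface the imbalance.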

incident.io calls out the same risk in plain terms. Teams need to watch who’s getting hit at unsociable hours, and make swaps easy when someone’s having a rough stretch. That’s fairness in practice, not fairness on paper. See incident.io’s guide to on-call schedules.

Set a pager budget, then treat it like an SLO

Pager budgets work because they create a shared limit. Google’s example uses “two paging incidents per shift” as a budget, then shows what happens when the team runs at five. That gap becomes the forcing function for engineering work.

A simple pager budget for a Series A product team:

Target: 0 to 2 pages per 12-hour night window.
Yellow: 3 to 4 pages, require a next-day review.
Red: 5 or more pages, trigger a stop-the-line rule.
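
The three bands above are simple enough to encode directly, which makes it easy to run them against page counts exported from whatever paging tool you use. This is a minimal sketch; the function name and the idea of feeding it one count per night window are assumptions.

```python
def budget_status(pages_in_window: int) -> str:
    """Classify one 12-hour night window against the pager budget."""
    if pages_in_window < 0:
        raise ValueError("page count cannot be negative")
    if pages_in_window <= 2:
        return "target"
    if pages_in_window <= 4:
        return "yellow"  # require a next-day review
    return "red"         # trigger the stop-the-line rule


# One count per night window for a week (hypothetical numbers).
week = [0, 1, 3, 2, 6, 0, 4]
print([budget_status(n) for n in week])
# ['target', 'target', 'yellow', 'target', 'red', 'target', 'yellow']
```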

Stop-the-line means:

Freeze non-critical deploys for 24 hours.
Assign two engineers to fix the top paging source.
Add a temporary secondary for the next shift.

This is also where internal tooling pays off. Use our Engineering Metrics Dashboard to track response time and deployment health, then correlate spikes with pages. Link: engineering metrics dashboard for DORA and performance.

Rotate weekends and holidays as a separate fairness pool

Weekends and holidays create the strongest “this is unfair” stories. Treat them as their own pool instead of pretending they’re the same as a random Tuesday.

Rules that work in practice:

Weekend parity: each engineer gets the same count per quarter.
Holiday parity: each engineer gets the same count per year.
No double hit: no weekend immediately after a holiday shift.
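
These three rules can be checked mechanically before a schedule is published. The sketch below assumes weekend shifts are keyed by their Saturday date and reads "immediately after" as a weekend starting within 7 days of the holiday shift. Both are assumptions you should adjust to your own definitions, and the function names are hypothetical.

```python
from collections import Counter
from datetime import date


def parity_gap(assignments: list[str], engineers: list[str]) -> int:
    """Difference between the most- and least-loaded engineer in one pool.

    Use it once for the weekend pool (per quarter) and once for the
    holiday pool (per year). A gap of 0 or 1 is as even as the math allows.
    """
    counts = Counter(assignments)
    per_person = [counts.get(engineer, 0) for engineer in engineers]
    return max(per_person) - min(per_person)


def double_hits(holidays: dict[date, str], weekends: dict[date, str]) -> list[tuple[str, date]]:
    """Find weekends assigned to the engineer who just worked a holiday.

    `weekends` maps each weekend's Saturday to its owner. A weekend counts as
    a double hit when it starts 1 to 7 days after that engineer's holiday shift.
    """
    hits = []
    for holiday, person in holidays.items():
        for saturday, owner in weekends.items():
            gap_days = (saturday - holiday).days
            if owner == person and 0 < gap_days <= 7:
                hits.append((person, saturday))
    return hits


engineers = ["ana", "ben", "cho"]
weekend_pool = ["ana", "ben", "ana", "cho"]
print(parity_gap(weekend_pool, engineers))  # 1

holidays = {date(2025, 12, 25): "ana"}
weekends = {date(2025, 12, 27): "ana", date(2026, 1, 3): "ben"}
print(double_hits(holidays, weekends))
# [('ana', datetime.date(2025, 12, 27))]
```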

SolarWinds notes that global teams need swap options for local holidays like Diwali or Thanksgiving, and that globally common holidays should be treated like weekend coverage with relaxed response expectations. See SolarWinds on on-call rotation best practices and holidays.

Make the schedule predictable, not just fair

Predictability reduces stress even when the load stays the same. Posting the schedule 6 to 8 weeks ahead changes behavior. People plan travel, childcare, and sleep. They also stop feeling like the company owns their evenings by default.

A scheduling study cited by Celayix points to the harm of short notice. The Shift Project found nearly 75% of retail and food-service workers get less than two weeks’ notice, and around 25% get assigned on-call shifts. That unpredictability links to higher burnout. Software teams aren’t retail, but the brain doesn’t care. See Celayix on creating and managing on-call rotations.
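
Publishing weeks ahead is easier when the rotation is deterministic. The sketch below generates a weekly primary and secondary round-robin for the next 8 weeks. It is an illustration rather than the Optimizer's algorithm: the function name is hypothetical, and offsetting the secondary by half the team is one assumed way to avoid the same person carrying a pager two weeks in a row.

```python
from datetime import date, timedelta


def weekly_rotation(engineers: list[str], start: date, weeks: int = 8) -> list[dict]:
    """Build a round-robin schedule with one primary and one secondary per week."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    n = len(engineers)
    # Assumption: pick the secondary from roughly the other side of the list,
    # so primary and secondary duty are spread apart for each person.
    offset = max(1, n // 2)
    schedule = []
    for week in range(weeks):
        schedule.append({
            "starts": start + timedelta(weeks=week),
            "primary": engineers[week % n],
            "secondary": engineers[(week + offset) % n],
        })
    return schedule


for shift in weekly_rotation(["ana", "ben", "cho", "dev"], date(2026, 1, 5)):
    print(shift["starts"], shift["primary"], shift["secondary"])
```

PTO and holiday overrides still need a human pass. The point is that the baseline is known far enough ahead for people to plan around it.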

Minimum team size for sustainable on-call (and what to do when you’re under it)

A sustainable primary rotation needs 4 to 5 engineers. Below that, each person is on-call every third week or more often, and fatigue stacks fast. Uptime Labs calls out “fewer than four engineers” as a structural failure mode for on-call burnout, and it also gives a follow-the-sun minimum of 9 to 15 engineers across three locations. See Uptime Labs on reducing on-call burnout.
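
The arithmetic behind that threshold is worth seeing once. Assuming one primary per week, a plain round-robin, and a 52-week year (all simplifying assumptions), each engineer's share looks like this:

```python
def primary_weeks_per_year(team_size: int, weeks_in_year: int = 52) -> float:
    """Weeks per year each engineer spends as primary in an even weekly round-robin."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return weeks_in_year / team_size


for size in (2, 3, 4, 5, 8):
    print(f"{size} engineers: primary every {size} weeks, "
          f"about {primary_weeks_per_year(size):.1f} weeks per year")
```

At 3 engineers each person carries the pager for roughly a third of the year, before PTO or sick days shift more weeks onto the others.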

A decision matrix for small teams

Use this matrix when the team is under 5.

Team reality	Rotation risk	Best scheduling move	Org move that fixes it
2 engineers own a service	Extreme. No real backup.	Alternate days only as a short bridge	Merge ownership with another team or buy a managed service
3 engineers	High. PTO breaks coverage.	Weekly primary plus a manager as backup	Reduce alert volume, add runbooks, hire 1 more
4 engineers	Medium. Still tight in holidays.	Weekly primary, weekly secondary, strict handoff	Add a shadow rotation for training, then expand
5 to 8 engineers	Manageable	Weekly primary, weekly secondary, weekend pool	Split services by domain, not by org chart

Atlassian notes that teams of three or more often do well with weekly rotations, and it also stresses the need for backups because people miss pages. See Atlassian on better on-call scheduling.

What to do when you have 3 engineers and 24 by 7 expectations

This is the common Series A trap. Sales closes a customer with uptime clauses, and the team tries to “be mature” by adding 24 by 7 on-call.

Three moves work better than heroics:

Shared on-call with an adjacent team: split by service boundary, not by title.
Reduce scope: 24 by 7 for the customer-facing API, business hours for internal tools.
Pay down paging sources: treat the top three alerts as a sprint goal.

Rootly recommends dedicated coverage plans for predictable spikes like launches, Black Friday, and large migrations. Same advice applies to small teams. If a migration week is coming, schedule extra coverage and cut other commitments. See Rootly on schedules and rotations.

Engineering on-call best practices that make the schedule survivable

A schedule is only half the system. The other half is how incidents get handled, escalated, and learned from.

Use primary and secondary, but define paging rules

A primary and secondary pairing reduces missed pages and spreads cognitive load. It also creates a new failure mode: teams start paging the secondary for everything, and now you’ve doubled the blast radius of every alert.

Paging rules that keep the secondary useful:

Primary owns ack and triage.
Secondary only gets paged after 5 minutes with no ack, or when the runbook says “needs two people.”
Manager on-call is an escalation path for customer comms, not for debugging.
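
Written as a decision function, the rules above look like the sketch below. Real paging tools express this as escalation policy configuration rather than code, so treat the function and its parameter names as hypothetical and the logic as the part that matters.

```python
def who_to_page(minutes_since_page: float, acked: bool,
                needs_two_people: bool = False,
                customer_comms_needed: bool = False) -> list[str]:
    """Return the roles that should be paged for one alert."""
    targets = ["primary"]  # primary always owns ack and triage
    # Secondary joins only on a missed ack or when the runbook asks for two people.
    if needs_two_people or (not acked and minutes_since_page >= 5):
        targets.append("secondary")
    # Manager handles customer communication, never debugging.
    if customer_comms_needed:
        targets.append("manager")
    return targets


print(who_to_page(2, acked=False))                         # ['primary']
print(who_to_page(6, acked=False))                         # ['primary', 'secondary']
print(who_to_page(1, acked=True, needs_two_people=True))   # ['primary', 'secondary']
```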

TaskCall’s examples show how teams model layers and escalation policies, including primary to secondary patterns and different weekday versus weekend schedules. See TaskCall on schedule examples and rotation templates.

Build a handoff checklist that takes 10 minutes

Handoffs fail when they feel like paperwork. Keep it short, keep it consistent, and make it something people can do even when they’re tired.

Handoff checklist:

Open incidents: links, owners, next action.
Known risks: noisy alert, degraded dependency, planned job.
Deploy status: last deploy time, rollback plan.
Customer watchlist: top accounts with active issues.
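
If handoffs live in a tool rather than a doc, the checklist maps onto a small record. The sketch below is one possible shape, with hypothetical field names. The design choice worth copying is the difference between "not filled in" and "nothing to report", so a quiet shift still produces a complete handoff.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Handoff:
    """One shift handoff. None means not filled in; an empty list means nothing to report."""
    open_incidents: Optional[list[str]] = None      # links, owners, next action
    known_risks: Optional[list[str]] = None         # noisy alert, degraded dependency, planned job
    deploy_status: Optional[str] = None             # last deploy time, rollback plan
    customer_watchlist: Optional[list[str]] = None  # top accounts with active issues

    def missing(self) -> list[str]:
        """Names of checklist sections the outgoing engineer skipped."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


note = Handoff(open_incidents=[], deploy_status="last deploy 16:40, rollback is previous release")
print(note.missing())
# ['known_risks', 'customer_watchlist']
```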

SolarWinds describes the start-of-shift routine as a summary of main incidents and things to observe. That’s the right shape. See SolarWinds on shift composition and handover.

Protect recovery time after a bad night

If someone gets paged at 2 a.m., they should not lead a design review at 10 a.m. That’s how you get bad decisions, brittle code, and a second incident.

A simple recovery policy:

Two or more after-hours pages mean a late start the next day.
Any page between 1 a.m. and 5 a.m. means no meetings before noon.
Red pager budget shift means a comp day within 7 days.
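
A policy like this is easier to honor when it is computed rather than negotiated the morning after. The sketch below assumes the caller passes only after-hours pages and a budget status string of "target", "yellow", or "red" for the shift. The function name and the exact boundary handling (1:00 inclusive, 5:00 exclusive) are assumptions.

```python
from datetime import datetime, time


def recovery_actions(after_hours_pages: list[datetime], budget_status: str) -> list[str]:
    """Map one shift's after-hours pages to the recovery steps owed to the engineer."""
    actions = []
    if len(after_hours_pages) >= 2:
        actions.append("late start next day")
    # Any page in the 1 a.m. to 5 a.m. window protects the following morning.
    if any(time(1, 0) <= page.time() < time(5, 0) for page in after_hours_pages):
        actions.append("no meetings before noon")
    if budget_status == "red":
        actions.append("comp day within 7 days")
    return actions


print(recovery_actions([datetime(2026, 3, 3, 2, 15)], "target"))
# ['no meetings before noon']
print(recovery_actions([datetime(2026, 3, 3, 22, 0), datetime(2026, 3, 3, 23, 30)], "red"))
# ['late start next day', 'comp day within 7 days']
```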

This is a leadership move, not a scheduling move. You’re telling the team that on-call counts as real work.

Run blameless postmortems, but keep them short

Postmortems reduce repeat pages. They also die when they turn into essays that nobody reads.

A good Series A standard:

30 minutes within 48 hours.
One owner for follow-ups.
Three follow-ups max, each with a due date.

Use our template to keep it consistent: incident postmortem guide and template.

CTO recommendations: how to use an SRE on-call rotation tool in a 10 to 100 engineer company

This is where the On-Call Rotation Optimizer fits. It helps teams design a schedule that respects time zones, fatigue limits, and team size constraints. It also gives CTOs a neutral way to talk about fairness without making it personal.

Immediate actions (this week)

Inventory coverage needs. List systems that need 24 by 7, and cut the rest.
Pick a rotation unit. Weekly rotations work for most teams of 3 or more, per Atlassian’s guidance.
Add a secondary. Define the paging rules in writing.
Set a pager budget. Start with two pages per night window, then measure.
Publish 8 weeks ahead. Treat schedule changes like production changes.

Track the work in one place. Command Center is built for this kind of portfolio view across incidents, risks, and capacity. Link: Command Center for incidents, risks, and capacity.

Policy framework (what to write down)

Fairness rules. Weekend and holiday parity, plus a no double hit rule.
Recovery rules. Late start, meeting protection, and comp time triggers.
Escalation rules. When to page secondary, when to page a manager.
Change control rules. No-deploy windows for holidays, and extra coverage for launches.

Squadcast calls out “no deploy” periods during holidays and weekends as a practical way to reduce on-call pressure, along with runbooks and better tooling. See Squadcast on strategies to reduce on-call burnout.

Architecture principles (so the schedule stays stable)

Reduce blast radius. Smaller services and safer deploys mean fewer 2 a.m. incidents.
Make alerts actionable. Every page needs a runbook step and an owner.
Automate the easy fixes. Auto-remediation for disk full and stuck queues pays back fast.

For build vs buy decisions around incident tooling, use our matrix: Build vs Buy Matrix for vendor decisions.

For cost trade-offs, use: cloud cost estimator for reliability spend.

Bigger picture: on-call is a retention strategy, not just an incident strategy

Early-stage companies compete on speed, but they lose on-call engineers to sleep debt. Uptime Labs ties burnout to structural failures like small rotations, high alert noise, and lack of training space for juniors. That matches what I’ve seen. The schedule is the visible part of the system, so it becomes the lightning rod.

Rootly frames schedule design as shaping team health and burnout risk through fairness, predictability, and balanced coverage. That’s the right mental model for leadership. A fair schedule also changes engineering behavior. Teams stop shipping noisy alerts because they know they’ll be the ones carrying them.

Ask yourself one question: if your best engineer quit after three months of pager pain, would the company still hit its roadmap?

Use the tool: On-Call Rotation Optimizer
