On-Call Rotation Planner Guide: How to Build a Fair, Sustainable Schedule
In a 2026 survey write-up, 46% of SREs said they handled more than 5 incidents in 30 days, and 23% handled 6 to 10. That load breaks teams when it stacks on top of feature work and meetings. The schedule is usually the first place you see the cracks. A fair rotation cuts fatigue, improves response, and keeps good people from walking. This guide lays out an on-call rotation planner mindset, plus how to apply The Art of CTO On-Call Rotation Optimizer to build a schedule people can actually live with.
What an on-call schedule optimizer does, and what it must consider
An on-call schedule optimizer is a planning method plus a set of rules. It produces a rotation that spreads the pain, covers time zones, and avoids back-to-back fatigue. The Art of CTO On-Call Rotation Optimizer does this by weighing coverage windows, rest, and team size limits, then proposing a schedule that stays sustainable.
Most CTOs I talk to treat on-call like a calendar problem. It's not. It's system design mixed with human limits.
A workable rotation has a few parts:
- Coverage model: 24 by 7, business hours, or follow-the-sun.
- Roles: primary, secondary, and sometimes incident commander.
- Handoff: a fixed time and a written checklist.
- Overrides: PTO, holidays, and launch weeks.
- Load limits: a pager budget and a rule for when to stop the line.
Google's SRE workbook describes a team with a pager budget of two incidents per shift that drifted to five per shift for a year. One third of shifts exceeded the budget, follow-up work collapsed, and engineers left the team. If you've run teams in the 10 to 100 engineer range, you've seen some version of this. The schedule won't fix alert noise, but it can stop the noise from landing on the same person every time. See Google SRE workbook on on-call and pager load.
Here's the framing I use with Series A CTOs: a fair schedule is one of the cheapest reliability investments you can make.
How to design a fair on-call rotation (fair on-call scheduling)
Fairness isn't "everyone gets the same number of weeks." Fairness is "everyone gets the same disruption over time." That includes nights, weekends, and the ugly incidents that ruin sleep.
Use the Fair Pager Load Framework
Here's a definition teams can adopt.
Fair Pager Load Framework: A rotation is fair when it balances four things across engineers over a rolling 8 to 12 weeks: time on-call, after-hours pages, sleep disruption, and recovery time.
This forces an uncomfortable truth. Two engineers can each do one week on-call, but one gets 12 pages at 2 a.m. and the other gets none. The calendar looks fair. The lived experience isn't.
incident.io calls out the same risk in plain terms. Teams need to watch who's getting hit at unsociable hours, and make swaps easy when someone's having a rough stretch. That's fairness in practice, not fairness on paper. See incident.io's guide to on-call schedules.
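One way to make that gap visible is a per-engineer tally over the rolling window. The sketch below is only an illustration, not the method of any tool named here. The `Page` record, the 22:00 to 07:00 after-hours window, and the 01:00 to 05:00 sleep window are assumptions to replace with your own definitions, and it covers only the page-based parts of the framework (time on-call and recovery time would need shift data).

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Page:
    engineer: str
    at: datetime  # assumed to be the paged engineer's local time


def is_after_hours(t: datetime) -> bool:
    # Assumed window: 22:00 to 07:00 local time, plus all of Saturday and Sunday.
    return t.hour >= 22 or t.hour < 7 or t.weekday() >= 5


def is_sleep_disrupting(t: datetime) -> bool:
    # Assumed window: 01:00 to 05:00 local time.
    return 1 <= t.hour < 5


def load_by_engineer(pages: list[Page]) -> dict[str, dict[str, int]]:
    """Count total, after-hours, and sleep-disrupting pages per engineer."""
    totals: dict[str, dict[str, int]] = defaultdict(
        lambda: {"pages": 0, "after_hours": 0, "sleep_disrupting": 0}
    )
    for p in pages:
        row = totals[p.engineer]
        row["pages"] += 1
        row["after_hours"] += is_after_hours(p.at)
        row["sleep_disrupting"] += is_sleep_disrupting(p.at)
    return dict(totals)


# The scenario from the text: same week on-call, very different nights.
pages = [Page("ana", datetime(2025, 3, 4, 2, 15)) for _ in range(12)]
pages.append(Page("ben", datetime(2025, 3, 11, 14, 0)))
report = load_by_engineer(pages)
```

Both hypothetical engineers did one week each, but `report` shows twelve sleep-disrupting pages for one and none for the other, which is the number to review when planning swaps.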
Set a pager budget, then treat it like an SLO
Pager budgets work because they create a shared limit. Google's example uses "two paging incidents per shift" as a budget, then shows what happens when the team runs at five. That gap becomes the forcing function for engineering work.
A simple pager budget for a Series A product team:
- Target: 0 to 2 pages per 12-hour night window.
- Yellow: 3 to 4 pages, require a next-day review.
- Red: 5 or more pages, trigger a stop-the-line rule.
Stop-the-line means:
- Freeze non-critical deploys for 24 hours.
- Assign two engineers to fix the top paging source.
- Add a temporary secondary for the next shift.
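The tiers and the stop-the-line rule above are simple enough to encode, which makes them easy to wire into a nightly report. This is a minimal sketch; the function names and status strings are made up for illustration.

```python
STOP_THE_LINE = (
    "freeze non-critical deploys for 24 hours",
    "assign two engineers to fix the top paging source",
    "add a temporary secondary for the next shift",
)


def budget_status(pages_in_night_window: int) -> str:
    """Map one night window's page count to the budget tiers."""
    if pages_in_night_window < 0:
        raise ValueError("page count cannot be negative")
    if pages_in_night_window <= 2:
        return "target"
    if pages_in_night_window <= 4:
        return "yellow"
    return "red"


def required_actions(pages_in_night_window: int) -> tuple[str, ...]:
    """Return the follow-up the policy requires for that window."""
    status = budget_status(pages_in_night_window)
    if status == "yellow":
        return ("next-day review",)
    if status == "red":
        return STOP_THE_LINE
    return ()
```

For example, `budget_status(3)` returns `"yellow"` and `required_actions(5)` returns the three stop-the-line steps.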
This is also where internal tooling pays off. Use our Engineering Metrics Dashboard to track response time and deployment health, then correlate spikes with pages. Link: engineering metrics dashboard for DORA and performance.
Rotate weekends and holidays as a separate fairness pool
Weekends and holidays create the strongest "this is unfair" stories. Treat them as their own pool instead of pretending they're the same as a random Tuesday.
Rules that work in practice:
- Weekend parity: each engineer gets the same count per quarter.
- Holiday parity: each engineer gets the same count per year.
- No double hit: no weekend immediately after a holiday shift.
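These three rules can be checked mechanically against a draft schedule. The sketch below assumes a hypothetical `Shift` record and reads "immediately after" as "a weekend shift that starts within 7 days of the same engineer's holiday shift", which is one interpretation, not a definition from the sources.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Shift:
    engineer: str
    start: date
    kind: str  # "weekend" or "holiday" (assumed labels)


def parity_gap(shifts: list[Shift], kind: str, engineers: list[str]) -> int:
    """Difference between the most and least loaded engineer; 0 means parity."""
    counts = Counter(s.engineer for s in shifts if s.kind == kind)
    per_engineer = [counts.get(e, 0) for e in engineers]  # engineers must be non-empty
    return max(per_engineer) - min(per_engineer)


def double_hits(shifts: list[Shift], within_days: int = 7) -> list[tuple[Shift, Shift]]:
    """Pairs where an engineer works a weekend right after a holiday shift."""
    holidays = [s for s in shifts if s.kind == "holiday"]
    weekends = [s for s in shifts if s.kind == "weekend"]
    return [
        (h, w)
        for h in holidays
        for w in weekends
        if h.engineer == w.engineer
        and timedelta(0) <= w.start - h.start <= timedelta(days=within_days)
    ]


example_shifts = [
    Shift("ana", date(2025, 12, 25), "holiday"),
    Shift("ana", date(2025, 12, 27), "weekend"),
    Shift("ben", date(2026, 1, 3), "weekend"),
]
```

With `example_shifts`, weekend parity holds between the two engineers, holiday parity is off by one, and `double_hits` flags the holiday-then-weekend pair. Pass only the quarter's shifts when checking weekends and the year's shifts when checking holidays.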
SolarWinds notes that global teams need swap options for local holidays like Diwali or Thanksgiving, and that globally common holidays should be treated like weekend coverage with relaxed response expectations. See SolarWinds on on-call rotation best practices and holidays.
Make the schedule predictable, not just fair
Predictability reduces stress even when the load stays the same. Posting the schedule 6 to 8 weeks ahead changes behavior. People plan travel, childcare, and sleep. They also stop feeling like the company owns their evenings by default.
A scheduling study cited by Celayix points to the harm of short notice. The Shift Project found nearly 75% of retail and food-service workers get less than two weeks' notice, and around 25% get assigned on-call shifts. That unpredictability links to higher burnout. Software teams aren't retail, but the brain doesn't care. See Celayix on creating and managing on-call rotations.
Minimum team size for sustainable on-call (and what to do when you're under it)
A sustainable primary rotation needs 4 to 5 engineers. Below that, each person is on-call every 3 weeks or more often, and fatigue stacks fast. Uptime Labs calls out "fewer than four engineers" as a structural failure mode for on-call burnout, and it also gives a follow-the-sun minimum of 9 to 15 engineers across three locations. See Uptime Labs on reducing on-call burnout.
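The arithmetic behind that threshold is worth seeing once. The sketch below assumes a single-layer rotation with equal-length shifts and no swaps; `rotation_stats` is a made-up helper, not something from Uptime Labs.

```python
def rotation_stats(engineers: int, shift_days: int = 7) -> dict[str, float]:
    """Basic load figures for a single-layer rotation with equal shifts."""
    if engineers < 1 or shift_days < 1:
        raise ValueError("engineers and shift_days must be positive")
    cycle_days = engineers * shift_days
    return {
        "days_between_shift_starts": cycle_days,
        "share_of_time_on_call": 1 / engineers,
        "shifts_per_year": round(365 / cycle_days, 1),
    }


three = rotation_stats(3)  # each person starts a shift every 21 days
five = rotation_stats(5)   # each person starts a shift every 35 days
```

With three engineers on weekly shifts, each person carries the pager a third of the time. With five, that drops to a fifth. The follow-the-sun figure follows the same logic: three locations with 3 to 5 engineers each gives 9 to 15.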
A decision matrix for small teams
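Use this matrix to place your team. The first three rows cover teams under 5 engineers; the last row shows what a sustainable size makes possible.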
| Team reality | Rotation risk | Best scheduling move | Org move that fixes it |
|---|---|---|---|
| 2 engineers own a service | Extreme. No real backup. | Alternate days only as a short bridge | Merge ownership with another team or buy a managed service |
| 3 engineers | High. PTO breaks coverage. | Weekly primary plus a manager as backup | Reduce alert volume, add runbooks, hire 1 more |
| 4 engineers | Medium. Still tight in holidays. | Weekly primary, weekly secondary, strict handoff | Add a shadow rotation for training, then expand |
| 5 to 8 engineers | Manageable | Weekly primary, weekly secondary, weekend pool | Split services by domain, not by org chart |
Atlassian notes that teams of three or more often do well with weekly rotations, and it also stresses the need for backups because people miss pages. See Atlassian on better on-call scheduling.
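A weekly primary plus secondary rotation is easy to generate once the order is agreed. The following is a minimal round-robin sketch under stated assumptions: the names are placeholders, and making the secondary next week's primary is a design choice of this example, not Atlassian's recommendation.

```python
from datetime import date, timedelta


def weekly_rotation(
    engineers: list[str], first_handoff: date, weeks: int = 8
) -> list[dict[str, object]]:
    """Round-robin weekly schedule with a primary and a distinct secondary."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary plus secondary")
    schedule = []
    for week in range(weeks):
        schedule.append(
            {
                "week_start": first_handoff + timedelta(weeks=week),
                "primary": engineers[week % len(engineers)],
                # Assumption: the secondary is whoever is primary next week.
                "secondary": engineers[(week + 1) % len(engineers)],
            }
        )
    return schedule


plan = weekly_rotation(["ana", "ben", "cy", "dee"], date(2025, 1, 6), weeks=8)
```

Pairing the secondary with next week's primary gives them context before their own shift, at the cost of two consecutive weeks of some duty. If that is too heavy for your team, change the offset so the secondary sits further from their primary week.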
What to do when you have 3 engineers and 24 by 7 expectations
This is the common Series A trap. Sales closes a customer with uptime clauses, and the team tries to "be mature" by adding 24 by 7 on-call.
Three moves work better than heroics:
- Shared on-call with an adjacent team: split by service boundary, not by title.
- Reduce scope: 24 by 7 for the customer-facing API, business hours for internal tools.
- Pay down paging sources: treat the top three alerts as a sprint goal.
Rootly recommends dedicated coverage plans for predictable spikes like launches, Black Friday, and large migrations. Same advice applies to small teams. If a migration week is coming, schedule extra coverage and cut other commitments. See Rootly on schedules and rotations.
Engineering on-call best practices that make the schedule survivable
A schedule is only half the system. The other half is how incidents get handled, escalated, and learned from.
Use primary and secondary, but define paging rules
A primary and secondary pair reduces missed pages and spreads cognitive load. It also creates a new failure mode: teams start paging the secondary for everything, and now you've doubled the blast radius of every alert.
Paging rules that keep the secondary useful:
- Primary owns ack and triage.
- Secondary only gets paged after 5 minutes with no ack, or when the runbook says "needs two people."
- Manager on-call is an escalation path for customer comms, not for debugging.
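Written as a decision function, the rules above look like this. It is a sketch of the policy only; the parameter names are assumptions, and a real paging tool would express the same thing as an escalation policy in its own configuration.

```python
from datetime import timedelta

ACK_TIMEOUT = timedelta(minutes=5)


def who_to_page(
    elapsed_since_page: timedelta,
    acked: bool,
    runbook_needs_two_people: bool = False,
    needs_customer_comms: bool = False,
) -> list[str]:
    """Return the roles that should be paged under the rules above."""
    targets = ["primary"]  # primary always owns ack and triage
    unacked_too_long = not acked and elapsed_since_page >= ACK_TIMEOUT
    if runbook_needs_two_people or unacked_too_long:
        targets.append("secondary")
    if needs_customer_comms:
        targets.append("manager")  # for customer comms, not debugging
    return targets
```

An unacked page at 6 minutes returns `["primary", "secondary"]`. The same page acked at 2 minutes stays with the primary alone.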
TaskCall's examples show how teams model layers and escalation policies, including primary to secondary patterns and different weekday versus weekend schedules. See TaskCall on schedule examples and rotation templates.
Build a handoff checklist that takes 10 minutes
Handoffs fail when they feel like paperwork. Keep it short, keep it consistent, and make it something people can do even when they're tired.
Handoff checklist:
- Open incidents: links, owners, next action.
- Known risks: noisy alert, degraded dependency, planned job.
- Deploy status: last deploy time, rollback plan.
- Customer watchlist: top accounts with active issues.
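If handoff notes live in a form or a chat bot, a small completeness check keeps the four sections from being skipped. This is a hypothetical sketch: the section keys and the idea of storing a note as a dictionary are assumptions.

```python
REQUIRED_SECTIONS = (
    "open_incidents",
    "known_risks",
    "deploy_status",
    "customer_watchlist",
)


def missing_sections(note: dict[str, str]) -> list[str]:
    """Return required sections that are absent or blank in a handoff note."""
    return [s for s in REQUIRED_SECTIONS if not note.get(s, "").strip()]


draft = {
    "open_incidents": "none",
    "known_risks": "  ",
    "deploy_status": "last deploy 14:05, rollback is the previous release",
}
gaps = missing_sections(draft)
```

Here `gaps` lists `known_risks` and `customer_watchlist`. Writing "none" counts as an answer, so the outgoing engineer has to look at each section rather than leave it empty.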
SolarWinds describes the start-of-shift routine as a summary of main incidents and things to observe. That's the right shape. See SolarWinds on shift composition and handover.
Protect recovery time after a bad night
If someone gets paged at 2 a.m., they should not lead a design review at 10 a.m. That's how you get bad decisions, brittle code, and a second incident.
A simple recovery policy:
- Two or more after-hours pages means a late start the next day.
- Any page between 1 a.m. and 5 a.m. means no meetings before noon.
- Red pager budget shift means a comp day within 7 days.
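The policy above can be turned into a morning-after check. The sketch assumes the caller passes only after-hours page times in the engineer's local time, plus the shift's budget status as a plain string; both conventions are invented for this example.

```python
from datetime import datetime


def recovery_actions(after_hours_pages: list[datetime], budget_status: str) -> list[str]:
    """Return the recovery entitlements triggered by one on-call night."""
    actions = []
    if len(after_hours_pages) >= 2:
        actions.append("late start next day")
    # "Between 1 a.m. and 5 a.m." is read here as 01:00 inclusive to 05:00 exclusive.
    if any(1 <= t.hour < 5 for t in after_hours_pages):
        actions.append("no meetings before noon")
    if budget_status == "red":
        actions.append("comp day within 7 days")
    return actions


rough_night = [datetime(2025, 3, 4, 23, 40), datetime(2025, 3, 5, 2, 10)]
owed = recovery_actions(rough_night, budget_status="yellow")
```

For `rough_night`, `owed` contains the late start and the meeting protection. Posting that list automatically removes the need for the engineer to ask for recovery time.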
This is a leadership move, not a scheduling move. You're telling the team that on-call counts as real work.
Run blameless postmortems, but keep them short
Postmortems reduce repeat pages. They also die when they turn into essays that nobody reads.
A good Series A standard:
- 30 minutes within 48 hours.
- One owner for follow-ups.
- Three follow-ups max, each with a due date.
Use our template to keep it consistent: incident postmortem guide and template.
CTO recommendations: how to use an SRE on-call rotation tool in a 10 to 100 engineer company
This is where the On-Call Rotation Optimizer fits. It helps teams design a schedule that respects time zones, fatigue limits, and team size constraints. It also gives CTOs a neutral way to talk about fairness without making it personal.
Immediate actions (this week)
- Inventory coverage needs. List systems that need 24 by 7, and cut the rest.
- Pick a rotation unit. Weekly rotations work for most teams of three or more, per Atlassian's guidance.
- Add a secondary. Define the paging rules in writing.
- Set a pager budget. Start with two pages per night window, then measure.
- Publish 8 weeks ahead. Treat schedule changes like production changes.
Track the work in one place. Command Center is built for this kind of portfolio view across incidents, risks, and capacity. Link: Command Center for incidents, risks, and capacity.
Policy framework (what to write down)
- Fairness rules. Weekend and holiday parity, plus a no double hit rule.
- Recovery rules. Late start, meeting protection, and comp time triggers.
- Escalation rules. When to page secondary, when to page a manager.
- Change control rules. No-deploy windows for holidays, and extra coverage for launches.
Squadcast calls out "no deploy" periods during holidays and weekends as a practical way to reduce on-call pressure, along with runbooks and better tooling. See Squadcast on strategies to reduce on-call burnout.
Architecture principles (so the schedule stays stable)
- Reduce blast radius. Smaller services and safer deploys mean fewer 2 a.m. incidents.
- Make alerts actionable. Every page needs a runbook step and an owner.
- Automate the easy fixes. Auto-remediation for disk full and stuck queues pays back fast.
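The "runbook step and an owner" rule is easy to enforce with a lint over alert definitions. The sketch below assumes alerts are available as dictionaries with `name`, `runbook`, and `owner` keys. That shape is hypothetical, so adapt it to however your monitoring tool exports rules.

```python
def unactionable_alerts(alerts: list[dict[str, str]]) -> list[str]:
    """Names of paging alerts that lack a runbook link or an owner."""
    return [
        a.get("name", "<unnamed>")
        for a in alerts
        if not a.get("runbook") or not a.get("owner")
    ]


alert_rules = [
    {"name": "api-5xx-rate", "runbook": "runbooks/api-5xx", "owner": "platform"},
    {"name": "disk-full", "runbook": "", "owner": "platform"},
    {"name": "queue-stuck", "runbook": "runbooks/queue-stuck"},
]
flagged = unactionable_alerts(alert_rules)
```

Here `flagged` contains `disk-full` and `queue-stuck`. Running a check like this in CI stops a new page from shipping without someone responsible for it.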
For build vs buy decisions around incident tooling, use our matrix: Build vs Buy Matrix for vendor decisions.
For cost trade-offs, use: cloud cost estimator for reliability spend.
Bigger picture: on-call is a retention strategy, not just an incident strategy
Early-stage companies compete on speed, but they lose on-call engineers on sleep debt. Uptime Labs ties burnout to structural failures like small rotations, high alert noise, and lack of training space for juniors. That matches what I've seen. The schedule is the visible part of the system, so it becomes the lightning rod.
Rootly frames schedule design as shaping team health and burnout risk through fairness, predictability, and balanced coverage. That's the right mental model for leadership. A fair schedule also changes engineering behavior. Teams stop shipping noisy alerts because they know they'll be the ones carrying them.
Ask yourself one question: if your best engineer quit after three months of pager pain, would the company still hit its roadmap?
Use the tool: On-Call Rotation Optimizer
Sources
- Uptime Labs, How to Reduce On-Call Burnout in SRE Teams: 8 Structural Fixes
- Rootly, On-call Software: Schedules and Rotations
- incident.io, The ultimate guide to on-call schedules
- Atlassian, A better approach to on-call scheduling
- Google SRE Workbook, What it means being on-call
- SolarWinds, On-Call Rotation: Tutorial and Best Practices
- Celayix, How to Create and Manage an On Call Rotation Schedule
- TaskCall, On-Call Schedule Examples and Rotation Templates
- Squadcast, Conquering On-Call Burnout: Essential Strategies for Tech Teams