Hangfire in production: how CTOs run background jobs without turning the database into a queue
Hangfire has over 4,397 commits in its main repo, and it ships a built-in dashboard that a lot of teams expose on day one. That convenience is the trap. Background jobs touch billing, email, data pipelines, and customer trust. A production stance has to start in sprint one, not after the first “why didn’t invoices go out?” escalation.
The thesis is simple: treat Hangfire like a core system, not a helper library. That mindset avoids the two classic failures I see over and over: silent job loss and runaway retries.
Hangfire background jobs: what Hangfire is, and what it is not
The primary search query for this piece is “Hangfire background jobs”. Most pages show how to enqueue a job. The missing part is how to run Hangfire as a service with clear ownership, scaling rules, and failure budgets.
Hangfire is a .NET background job system that persists job state in a shared store. Workers pull jobs, execute them, and record outcomes. Hangfire supports fire-and-forget, delayed, recurring, and continuation jobs, and it can run inside a web app or in a separate worker process. The official project describes it as “an easy way to perform background job processing in .NET and .NET Core applications” with no Windows Service required, backed by persistent storage like SQL Server or Redis, and with a built-in dashboard for monitoring and control (Hangfire GitHub repo).
Hangfire’s core building blocks look like this:
- Storage: SQL Server, Redis, and other backends. Job metadata lives in storage, not in memory (DEV Community overview of storage).
- Client API: `BackgroundJob.Enqueue`, `BackgroundJob.Schedule`, and `RecurringJob.AddOrUpdate` for job creation (DEV Community example).
- Server: worker threads that poll storage, lock jobs, and execute them.
- Dashboard: a UI to view queues, failures, retries, and job history (Hangfire overview).
- Reliability model: “at least once” execution, with automatic retries after failures (Hangfire site reliability note).
Hangfire is not a streaming system. Hangfire is not a low-latency scheduler. Hangfire is not a replacement for a message broker when you need strict ordering, exactly-once semantics, or cross-language consumers.
Hangfire works best as a durable job runner for business workflows that need retries, visibility, and operator control.
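As a rough sketch of the client API calls listed above, the four job types look something like this. It assumes the Hangfire NuGet package (1.7 or later for `ContinueJobWith`) and a storage backend already configured at startup. The job ids, messages, and class name are illustrative placeholders.

```csharp
using System;
using Hangfire;

public static class JobExamples
{
    // Assumes job storage was configured earlier (for example via AddHangfire).
    public static void CreateJobs()
    {
        // Fire-and-forget: runs once, as soon as a worker picks it up.
        string parentId = BackgroundJob.Enqueue(() => Console.WriteLine("Send welcome email"));

        // Delayed: runs once, after the given delay has passed.
        BackgroundJob.Schedule(() => Console.WriteLine("Send reminder"), TimeSpan.FromHours(24));

        // Continuation: runs after the parent job has completed.
        BackgroundJob.ContinueJobWith(parentId, () => Console.WriteLine("Log delivery"));

        // Recurring: runs on a cron schedule. Reusing the same id updates
        // the existing definition instead of creating a second one.
        RecurringJob.AddOrUpdate("nightly-report", () => Console.WriteLine("Build report"), Cron.Daily());
    }
}
```

Each call only writes a job record to storage. The work itself happens later on whichever Hangfire server picks the job up.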
How Hangfire works under load: retries, polling, and the “at least once” contract
Most CTOs I talk to struggle with one thing: teams treat background jobs as “out of band,” then act surprised when jobs become the product.
Automatic retries change your failure mode
Hangfire retries jobs after exceptions, shutdowns, and process crashes (Hangfire Core overview on retries). Retries are great right up until they turn a transient bug into repeated side effects.
Hangfire also states an “at least once” processing model once a job is created successfully (Hangfire reliability statement). “At least once” means duplicates happen. Duplicates show up during worker crashes, timeouts, and deploy restarts.
CTO rule: every job that touches money, inventory, or customer messaging must be idempotent.
A practical pattern:
- Idempotency key: store a unique key per business action, like `invoiceId:2026-06-28:send`.
- Write-ahead record: insert a row that marks intent, then do the side effect, then mark done.
- Dedup check: exit early if the row already shows done.
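The three steps above can be sketched as follows. This is a minimal illustration, not a production implementation: `InMemoryLedger` is a hypothetical stand-in for a database table with a unique constraint on the key, and the email sender is injected as a plain delegate so the example stays self-contained.

```csharp
using System;
using System.Collections.Concurrent;

public enum LedgerState { Started, Done }

// Hypothetical stand-in for a table keyed by idempotency key.
public sealed class InMemoryLedger
{
    private readonly ConcurrentDictionary<string, LedgerState> _rows =
        new ConcurrentDictionary<string, LedgerState>();

    // Write-ahead record: insert the intent row unless it already exists.
    public void RecordIntent(string key) => _rows.TryAdd(key, LedgerState.Started);

    public bool IsDone(string key) =>
        _rows.TryGetValue(key, out var state) && state == LedgerState.Done;

    public void MarkDone(string key) => _rows[key] = LedgerState.Done;
}

public sealed class SendInvoiceJob
{
    private readonly InMemoryLedger _ledger;
    private readonly Action<string> _sendEmail; // the external side effect

    public SendInvoiceJob(InMemoryLedger ledger, Action<string> sendEmail)
    {
        _ledger = ledger;
        _sendEmail = sendEmail;
    }

    // idempotencyKey example: "invoiceId:2026-06-28:send"
    public void Run(string invoiceId, string idempotencyKey)
    {
        // Dedup check: a retry or duplicate delivery exits early.
        if (_ledger.IsDone(idempotencyKey)) return;

        _ledger.RecordIntent(idempotencyKey); // 1. mark intent
        _sendEmail(invoiceId);                // 2. perform the side effect
        _ledger.MarkDone(idempotencyKey);     // 3. mark done
    }
}
```

A crash between steps 2 and 3 still leaves a row in the started state, and a retry will repeat the side effect. Where the downstream provider accepts an idempotency key, passing the same key along closes that gap. Otherwise, rows stuck in the started state need a reconciliation step.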
Want a quick team exercise? Run a chaos drill. Kill the worker process mid-job and watch what duplicates you get.
Our guide to incident postmortems (and how to stop repeat failures), at /tools/incident-postmortem, fits well here, because background job failures often hide for days.
Polling frequency and the 1-minute floor
Hangfire recurring jobs have a practical scheduling floor. A Hangfire maintainer noted that Hangfire “cannot run a job every 10 seconds, the minimal resolution is 1 minute,” and that a finer resolution would increase overhead because of constant database polling (Hangfire discussion on 1-minute resolution).
That detail should drive architecture decisions:
- Work that runs every 10 seconds belongs in a long-running loop, a hosted service, or a stream processor.
- Work that runs every minute fits Hangfire recurring jobs.
Product will still ask for “near real time” updates. Near real time usually needs event-driven design, not tighter cron.
BoldSign’s architecture guide frames the trade-offs clearly: hosted services for continuous loops, Hangfire for user-triggered workflows with dashboards and retries, and Quartz.NET for strict calendar scheduling and clustering (BoldSign background jobs comparison).
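To make the split concrete, here is a hedged sketch of both sides: a hosted service loop for the 10-second case and a Hangfire recurring job for the 1-minute case. It assumes .NET 6 or later (for `PeriodicTimer`) and the Hangfire package. `PriceRefreshLoop`, `RefreshPricesAsync`, and the `sync-crm` job id are hypothetical names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Hangfire;
using Microsoft.Extensions.Hosting;

// Sub-minute work: a hosted service loop that never touches job storage.
public sealed class PriceRefreshLoop : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(TimeSpan.FromSeconds(10));
        try
        {
            // Waits for each tick; throws when the host requests shutdown.
            while (await timer.WaitForNextTickAsync(stoppingToken))
            {
                await RefreshPricesAsync(stoppingToken);
            }
        }
        catch (OperationCanceledException)
        {
            // Host is shutting down; exit quietly.
        }
    }

    // Placeholder for the real work.
    private static Task RefreshPricesAsync(CancellationToken ct) => Task.CompletedTask;
}

public static class Schedules
{
    // Minute-level work: a Hangfire recurring job with retries and dashboard visibility.
    public static void Register()
    {
        RecurringJob.AddOrUpdate("sync-crm", () => Console.WriteLine("Sync CRM"), Cron.Minutely());
    }
}
```

The loop would be registered with `services.AddHostedService<PriceRefreshLoop>()`. It gets no persistence or retries, which is the trade: the hosted service is cheap and fast, the recurring job is durable and visible.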
Worker count is a scaling knob, and a foot-gun
Hangfire servers run a pool of worker threads. A lot of teams crank worker count up, then blame Hangfire when SQL Server starts smoking.
A common tuning snippet sets worker count based on CPU cores, like Environment.ProcessorCount * 5 (DEV Community .NET 9 guide snippet). That can work for short CPU-light jobs. That same setting will crush your database if each job runs heavy queries.
CTO rule: scale workers based on downstream capacity, not on CPU.
A simple starting point for a single service:
- Default queue: 10 to 20 workers.
- Email queue: 5 workers, rate-limited.
- Billing queue: 2 to 5 workers, strict idempotency.
Then measure. Guessing is how you end up with a retry storm at 2 a.m.
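One way to express those starting numbers is a Hangfire server per queue, each with its own worker count. This is a sketch under assumptions: it relies on `AddHangfireServer` from Hangfire.AspNetCore being called more than once in a process, which is worth verifying against the Hangfire version you run. The helper names and the example job are hypothetical.

```csharp
using Hangfire;
using Microsoft.Extensions.DependencyInjection;

public static class QueueWorkerRegistration
{
    public static IServiceCollection AddQueueWorkers(this IServiceCollection services)
    {
        AddServer(services, "default", workerCount: 10);
        AddServer(services, "email", workerCount: 5);
        AddServer(services, "billing", workerCount: 2);
        return services;
    }

    // Registers one Hangfire server that listens to a single queue.
    private static void AddServer(IServiceCollection services, string queue, int workerCount)
    {
        services.AddHangfireServer(options =>
        {
            options.ServerName = $"{queue}-workers";
            options.Queues = new[] { queue };
            options.WorkerCount = workerCount; // caps concurrency for this queue
        });
    }
}

public sealed class BillingJobs
{
    // Routes this job to the billing queue instead of "default".
    [Queue("billing")]
    public void ChargeInvoice(string invoiceId)
    {
        // Charge logic goes here.
    }
}
```

Worker count caps concurrency, not request rate. The rate limit on the email queue still has to live in the job code or in the email client.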
Our Engineering Metrics Dashboard for DORA metrics and delivery health (/tools/engineering-metrics-dashboard) can help you track deploy frequency and lead time, but you also need job metrics.
Hangfire storage choices: SQL Server vs Redis, and what your SRE team will feel
The secondary query many leaders search is “Hangfire SQL Server vs Redis”. The answer isn’t just speed. The answer is what you’re signing up to run, debug, and page on.
SQL Server storage is the default, and it’s easy to overload
SQL Server storage works well for moderate throughput and teams that already run SQL Server as a core dependency. Most orgs start here because the app already has a SQL database.
The failure mode is predictable. Teams use the same SQL Server for OLTP and job storage, then a retry storm turns into lock contention and slow customer requests.
If you keep SQL Server storage:
- Put Hangfire tables in a dedicated database on the same instance, or a separate instance.
- Set clear limits on retries for high-impact jobs.
- Add dashboards and alerts for queue depth and failure rate.
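A minimal configuration sketch for the first point, assuming the Hangfire.SqlServer package: it points Hangfire at its own connection string rather than the application database. `HangfireDb` is a hypothetical connection string name, and the poll interval is a placeholder to tune, not a recommendation.

```csharp
using System;
using Hangfire;
using Hangfire.SqlServer;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

public static class HangfireStorageSetup
{
    public static IServiceCollection AddHangfireSqlStorage(
        this IServiceCollection services, IConfiguration configuration)
    {
        // Hypothetical name: a connection string for a database that holds
        // only Hangfire tables, separate from the OLTP database.
        string connectionString = configuration.GetConnectionString("HangfireDb");

        services.AddHangfire(config => config
            .UseSimpleAssemblyNameTypeSerializer()
            .UseRecommendedSerializerSettings()
            .UseSqlServerStorage(connectionString, new SqlServerStorageOptions
            {
                // Longer interval: fewer polling queries, slower job pickup.
                QueuePollInterval = TimeSpan.FromSeconds(15)
            }));

        return services;
    }
}
```

A separate database on the same instance isolates schema and backups, but it still shares CPU, memory, and I/O with OLTP. Only a separate instance isolates load.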
Redis storage can cut latency and raise throughput
Hangfire’s Redis documentation states that Redis storage processed jobs “more than 4x” faster than SQL Server storage on the author’s development machine for empty jobs (Hangfire Redis docs). The same page notes Redis storage uses BRPOPLPUSH to keep job fetch latency low.
That “empty jobs” benchmark isn’t your workload. The benchmark still tells you something useful: SQL polling overhead adds up.
Redis trade-offs CTOs need to say out loud in the design review:
- Redis adds a new tier to operate.
- Redis persistence settings matter, or you risk job loss during node failure.
- Redis clustering and failover need testing, not hope.
Hangfire Pro also sells Hangfire.Pro.Redis as a high-performance storage option, and the Pro page positions it as “super-fast Redis as job storage” (Hangfire Pro overview).
A decision matrix you can reuse with your team
Here’s a matrix you can paste into an RFC.
The CTO Hangfire Storage Decision Matrix
| Requirement | SQL Server storage | Redis storage |
|---|---|---|
| Team already runs SQL Server HA | Strong fit | Adds a new system |
| High job throughput, short jobs | DB polling becomes a limit | Strong fit, lower latency |
| Strict audit and reporting needs | Easy with SQL queries | Needs extra export |
| Ops team comfort | Often high | Varies |
| Risk of retry storms impacting OLTP | Higher if shared instance | Lower if isolated |
A clean rule:
- Pick SQL Server for low to medium throughput and simple ops.
- Pick Redis when queue depth grows, latency matters, or SQL becomes the bottleneck.
Our Cloud Cost Estimator for FinOps planning (/tools/cloud-cost-estimator) can help you model the infra delta between “one bigger SQL box” and “SQL plus Redis.”
Hangfire architecture patterns CTOs should standardize: web app, worker service, and multi-service setups
The next common query is “Hangfire architecture”. Hangfire’s own overview says you can start in-process, then split into separate processes or servers later, with distributed locks for coordination (Hangfire scale-out overview). That statement is true. The migration still needs a plan, because the hard part is ownership and blast radius.
Pattern 1: In-process Hangfire in the web app
In-process is fine for:
- Teams under 10 engineers.
- Low job volume, like a few thousand jobs per day.
- Jobs that can tolerate deploy pauses.
In-process fails when:
- Jobs compete with request threads for CPU and memory.
- Deploys interrupt job execution.
- A memory leak in a job takes down the API.
A Hangfire maintainer also noted that memory use depends on your job code, not on IIS vs a separate process (Hangfire discussion on memory). That’s the right framing. A separate process buys isolation and independent deploy cadence. A separate process does not fix bad job behavior.
Pattern 2: Dedicated worker service per domain
A dedicated worker service is the default “grown up” pattern.
- API enqueues jobs.
- Worker service runs Hangfire server.
- Storage sits behind both.
Benefits:
- You can deploy API and workers separately.
- You can scale workers without scaling the API.
- You can set queue-specific worker counts.
The leadership move is ownership. Assign a single team to own the worker service, the queues, and the on-call rotation. If nobody owns the queue, the queue owns you.
Our Command Center tool for tracking incidents, risks, and migrations (/command-center) maps well here. You can track queue health as a first-class risk and capacity item.
Pattern 3: Multi-service, shared storage, and a single dashboard
Many orgs end up with multiple services producing jobs, and multiple worker pools consuming them.
Hangfire’s dashboard only needs a connection to the Hangfire database, but job display and manual triggers work best when the dashboard app references the job interfaces. A Hangfire discussion suggests putting job interfaces in a shared assembly, so clients can enqueue and servers can execute cleanly (Hangfire discussion on single dashboard).
That shared assembly becomes a coupling point.
Another Hangfire thread calls out the scaling pain of a central job library that every project depends on, and suggests an evolution: dashboards per service, and service-to-service calls to kick off work across boundaries (Hangfire discussion on separation of concerns).
I agree with the direction. Shared job libraries feel clean at 3 services. Shared job libraries feel like a monolith at 12 services.
A pragmatic compromise:
- Keep a shared package for cross-cutting primitives only, like idempotency helpers and job base classes.
- Keep job definitions close to the domain service.
- Use queues to isolate domains, like `billing`, `email`, `exports`.
Cross-service workflows deserve explicit integration:
- Service A finishes Step 1.
- Service A calls Service B’s API to enqueue Step 2.
That pattern makes ownership obvious and reduces “mystery jobs” that run in the wrong place.
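The two-step handoff above could look like this on Service B's side. It is a sketch assuming ASP.NET Core minimal APIs (.NET 6 or later) and Hangfire registered through `AddHangfire`. The route, `ExportJob`, and the `exports` queue are hypothetical.

```csharp
using System;
using Hangfire;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.AspNetCore.Routing;

// Step 2 is defined and executed inside Service B.
public sealed class ExportJob
{
    [Queue("exports")]
    public void Run(Guid orderId)
    {
        // Export logic goes here.
    }
}

public static class ExportEndpoints
{
    // Service A calls this endpoint once Step 1 has finished.
    public static void MapExportEndpoints(this IEndpointRouteBuilder app)
    {
        app.MapPost("/internal/exports/{orderId:guid}",
            (Guid orderId, [FromServices] IBackgroundJobClient jobs) =>
            {
                // Enqueue into Service B's own storage and return the job id.
                string jobId = jobs.Enqueue<ExportJob>(job => job.Run(orderId));
                return Results.Accepted(value: new { jobId });
            });
    }
}
```

Service A never references `ExportJob`, so no shared job assembly is needed. The HTTP contract is the only coupling point.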
Enterprise implications: why Hangfire becomes a CTO problem
Background jobs look like an implementation detail until they break. Then the CEO asks why invoices did not send.
- Retry storms can become customer incidents. A single bad deploy can throw exceptions, trigger retries, and flood downstream systems. Hangfire’s automatic retries are a feature, but the feature amplifies bugs (Hangfire retries overview).
- Shared storage becomes a hidden coupling. A shared SQL Server for OLTP and Hangfire turns job spikes into API latency. A shared Redis cluster can do the same.
- Dashboards change your security posture. The Hangfire dashboard is powerful. The same DEV guide shows securing the dashboard with an authorization filter and role checks (DEV Community dashboard security example). A public dashboard is an incident waiting to happen.
- Org design decides reliability. Teams treat jobs as “backend chores,” so nobody owns queue depth, failure rates, or job SLAs. The result is slow recovery and finger-pointing.
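For the dashboard point, a minimal sketch of an authorization filter with a role check, assuming the Hangfire.AspNetCore package and that authentication middleware runs before the dashboard. The `Admin` role name and the class names are hypothetical.

```csharp
using Hangfire;
using Hangfire.Dashboard;
using Microsoft.AspNetCore.Builder;

public sealed class AdminOnlyDashboardFilter : IDashboardAuthorizationFilter
{
    public bool Authorize(DashboardContext context)
    {
        var user = context.GetHttpContext().User;

        // Allow only authenticated users in the (hypothetical) Admin role.
        return user.Identity?.IsAuthenticated == true && user.IsInRole("Admin");
    }
}

public static class DashboardSetup
{
    // Call after UseAuthentication/UseAuthorization so the user is populated.
    public static void UseSecuredHangfireDashboard(this IApplicationBuilder app)
    {
        app.UseHangfireDashboard("/hangfire", new DashboardOptions
        {
            Authorization = new[] { new AdminOnlyDashboardFilter() }
        });
    }
}
```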
CTO recommendations for Hangfire: a production checklist, policy, and architecture principles
Here’s the part you can act on next week.
Immediate actions (next 7 days)
- Inventory: list every recurring job, its schedule, and its owner. Put the list in a repo.
- Secure the dashboard: require auth and restrict to admin roles, using a dashboard authorization filter pattern (DEV Community dashboard security example).
- Set retry caps: set explicit retry attempts for high-impact jobs, and fail fast on non-transient errors.
- Add queue depth alerts: alert on queue depth, failure rate, and oldest job age. Start with one alert per queue.
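Two of these actions map onto small pieces of code. The sketch below caps retries with Hangfire's `AutomaticRetry` attribute and reads queue depth through the monitoring API. The job, the queue name, and the threshold are hypothetical, and wiring the check into an alerting system is left out.

```csharp
using Hangfire;

public sealed class InvoiceJobs
{
    // Retry at most 3 times, then leave the job in the Failed state
    // so an operator can inspect it in the dashboard.
    [AutomaticRetry(Attempts = 3, OnAttemptsExceeded = AttemptsExceededAction.Fail)]
    [Queue("billing")]
    public void SendInvoice(string invoiceId)
    {
        // Sending logic goes here.
    }
}

public static class QueueHealth
{
    // Returns true when the queue holds more enqueued jobs than the threshold.
    public static bool IsQueueTooDeep(string queue, long maxDepth)
    {
        var monitor = JobStorage.Current.GetMonitoringApi();
        long depth = monitor.EnqueuedCount(queue);
        return depth > maxDepth;
    }
}
```

A scheduled health check could call `QueueHealth.IsQueueTooDeep("billing", 100)` and raise an alert when it returns true. The threshold of 100 is only an example.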
Policy framework (what you standardize across teams)
- Idempotency rule: every job that triggers an external side effect must accept an idempotency key.
- Queue taxonomy: define 3 to 6 queues by domain and risk, like `billing`, `email`, `etl`, `exports`.
- Ownership and on-call: assign each queue to a team, with a named escalation path.
A quotable definition you can use in your engineering handbook:
A background job is production traffic that arrives late.
Architecture principles (how you keep it sane at 50 to 200 engineers)
- Isolation: run Hangfire servers in a worker service for critical domains.
- Storage separation: don’t share the OLTP database with Hangfire tables once job volume grows.
- Rate limits: cap concurrency per queue based on downstream limits, not CPU.
- Cross-service workflows: use service APIs or messaging for cross-domain steps, not a shared mega job library (Hangfire separation of concerns discussion).
If you want a tool to support governance, model the worker services and dependencies in an architecture diagram. Our ArchiMate Modeler for architecture documentation (/tools/archimate) helps teams keep those maps current.
And if the question becomes “should we build our own job runner,” use a formal decision record. Our Build vs Buy Matrix for make-or-buy decisions (/tools/build-vs-buy-matrix) gives you a structure for that trade-off.
Bigger picture: background jobs are where systems and people meet
Hangfire sits at the intersection of system design and org design. The system side is storage, retries, and worker pools. The people side is ownership, on-call, and the discipline to make jobs idempotent.
Here’s what I’ve seen in growing orgs: as teams split into more services, background work grows faster than request traffic. Product adds exports, reports, notifications, and compliance tasks. Each one becomes a job, and each one needs an owner.
So the question to ask is simple: which team owns “jobs as a product” in your org, and what SLO do they publish for queue delay?
Sources
- Hangfire official site
- Background Job Scheduling in .NET using Hangfire, DEV Community
- Hangfire GitHub repository
- Hangfire discussion: guidelines around running background jobs from ASP.NET
- Telerik blog: Creating Background Routines with Hangfire in ASP.NET Core
- Hangfire Core overview
- Hangfire discussion: single dashboard, multiple projects
- Hangfire discussion: separation of concerns
- BoldSign: ASP.NET Core Background Jobs architecture and scheduling
- Mastering Hangfire in .NET 9, DEV Community
- Hangfire docs: Using Redis
- Hangfire Pro overview