June 14, 2026 · 3 min read

Background jobs that survive production

  • backend
  • queues
  • background-jobs
  • reliability
  • nestjs

The fastest way to make a product feel slow — and fragile — is to do everything inside the request. The user clicks a button, and somewhere behind that click you're charging a card, emailing a receipt, calling a shipping API, and updating an analytics record. Any one of those can be slow or flaky, and the user is left watching a spinner, hostage to the slowest dependency you have.

The fix is old and boring and it works: get that work off the request path and into background jobs. Here's how I build them so they survive contact with production.

Keep the request path fast

A request should do the smallest amount of work needed to answer the user, then hand everything else to a queue. Accept the order, write the record, enqueue "fulfil this order," and return. The worker picks it up and does the slow, integration-heavy part on its own time.
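Here's a minimal sketch of that shape in TypeScript. Everything in it is a hypothetical stand-in: the `orders` map plays the database, the `jobQueue` array plays a real queue, and `acceptOrder` and `runNextJob` are names I made up for illustration, not any framework's API.

```typescript
// Hypothetical in-memory stand-ins for a database table and a job queue.
type Order = { id: string; sku: string; status: "accepted" | "fulfilled" };
type FulfilJob = { name: "fulfil-order"; orderId: string };

const orders = new Map<string, Order>();
const jobQueue: FulfilJob[] = [];

// Request path: write the record, enqueue the slow part, return right away.
function acceptOrder(id: string, sku: string): { status: number; orderId: string } {
  orders.set(id, { id, sku, status: "accepted" });
  jobQueue.push({ name: "fulfil-order", orderId: id });
  // 202 Accepted: the order is recorded, fulfilment continues in the background.
  return { status: 202, orderId: id };
}

// Worker path: runs outside the request, on its own schedule.
function runNextJob(): boolean {
  const job = jobQueue.shift();
  if (!job) return false; // nothing waiting
  const order = orders.get(job.orderId);
  if (order) {
    // The slow vendor and shipping calls would happen here.
    order.status = "fulfilled";
  }
  return true;
}
```

The request handler never touches a third party. Its cost is one write and one enqueue, whatever the vendor API is doing that day.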

This is the pattern behind the backend work on Trove Gifting: user-facing actions stay quick, while longer-running fulfilment and third-party calls run as background work where a slow vendor API can't stall the person clicking the button.

A job will run more than once — plan for it

Here's the rule that trips people up: at-least-once delivery is the norm. A worker crashes mid-job and the message is redelivered. A retry fires after a timeout even though the original actually succeeded. If your handler isn't safe to run twice, you'll double-charge, double-email, or double-ship.

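The answer is idempotency. Give each job a stable key, then make the handler either check that key before it acts or upsert instead of insert. Back the check with a unique constraint on the key, so two concurrent deliveries can't both pass it: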

  • Before doing the work, look for a record that says "already done for this key."
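  • Do the work and write that record in the same transaction, so a crash can't leave you half-done. For side effects that live outside your database, such as a card charge, pass the same key to the provider if it supports idempotency keys.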
  • If the key's already there, acknowledge the message and stop.

Now a redelivery is harmless.
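The steps above can be sketched like this. It's a simplified illustration, not a production handler: `completedKeys` and `chargeLedger` are hypothetical in-memory stand-ins for database tables, and `handleChargeJob` is a name I picked for the example. In a real store the two writes would share one transaction.

```typescript
type ChargeJob = { idempotencyKey: string; orderId: string; amountCents: number };

// Hypothetical stand-ins for two database tables.
const completedKeys = new Set<string>();
const chargeLedger: { orderId: string; amountCents: number }[] = [];

function handleChargeJob(job: ChargeJob): "done" | "skipped" {
  // 1. Already done for this key? Acknowledge the message and stop.
  if (completedKeys.has(job.idempotencyKey)) {
    return "skipped";
  }

  // 2. Do the work and record the key together.
  //    Against a real database, these two writes belong in one transaction.
  chargeLedger.push({ orderId: job.orderId, amountCents: job.amountCents });
  completedKeys.add(job.idempotencyKey);
  return "done";
}

// Simulate a redelivery: the same message arrives twice.
const chargeJob: ChargeJob = { idempotencyKey: "charge:o-1", orderId: "o-1", amountCents: 2500 };
const firstDelivery = handleChargeJob(chargeJob);
const secondDelivery = handleChargeJob(chargeJob);
```

The second delivery returns `"skipped"` and the ledger still holds a single charge.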

Retries, backoff, and a dead-letter queue

Transient failures (a timed-out API, a brief network blip) should retry — but with exponential backoff, so you don't hammer a struggling dependency. Permanent failures (malformed data, a deleted record) should not retry forever; after N attempts they belong in a dead-letter queue where they're parked for inspection instead of clogging the pipeline.

That split — retry the transient, quarantine the poison — is what keeps a queue flowing under real-world failure.
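As a rough sketch, the decision a worker makes after each attempt might look like the code below. The names (`JobResult`, `decide`, `backoffDelayMs`) and the defaults (1 second base delay, 60 second cap, 5 attempts) are assumptions for illustration. Most queue libraries ship their own retry settings, so treat this as the logic, not an API.

```typescript
type JobResult =
  | { ok: true }
  | { ok: false; kind: "transient" | "permanent"; reason: string };

type Decision =
  | { action: "ack" }
  | { action: "retry"; delayMs: number }
  | { action: "dead-letter"; reason: string };

// Exponential backoff: the delay doubles with each attempt, up to a cap.
// attempt is 1-based, so attempts 1, 2, 3 wait 1s, 2s, 4s with the defaults.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
}

function decide(result: JobResult, attempt: number, maxAttempts = 5): Decision {
  if (result.ok) {
    return { action: "ack" };
  }
  // Permanent failures, and transient ones that have used up their attempts,
  // are parked in the dead-letter queue instead of retried.
  if (result.kind === "permanent" || attempt >= maxAttempts) {
    return { action: "dead-letter", reason: result.reason };
  }
  return { action: "retry", delayMs: backoffDelayMs(attempt) };
}
```

A timed-out API call on attempt 1 gets a retry after one second. Malformed data goes straight to the dead-letter queue without burning the remaining attempts.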

You can't fix what you can't see

A background system that fails silently is worse than no system, because you find out from a customer instead of a dashboard. So every queue needs observability: structured logs per job, a metric for backlog depth, and an alert when the failure rate or the dead-letter count climbs. The goal is simple — the system tells you something's wrong before a user does.
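To make that concrete, here's a small sketch of the two pieces: one structured log line per job, and a check that turns queue numbers into alerts. The field names and the thresholds (a backlog of 1000, a 5% failure rate, any dead-lettered message) are hypothetical. Pick values that fit your own traffic.

```typescript
type JobLogEntry = {
  jobId: string;
  queue: string;
  status: "ok" | "failed";
  attempt: number;
  durationMs: number;
};

// One JSON line per job, so logs can be filtered by jobId, queue, or status.
function formatJobLog(entry: JobLogEntry): string {
  return JSON.stringify({ ts: new Date().toISOString(), ...entry });
}

type QueueStats = {
  backlogDepth: number;
  processed: number;
  failed: number;
  deadLettered: number;
};

// Hypothetical thresholds, for illustration only.
const defaultLimits = { maxBacklog: 1000, maxFailureRate: 0.05, maxDeadLettered: 0 };

// Returns the name of every limit the queue is currently over.
function alertsFor(stats: QueueStats, limits = defaultLimits): string[] {
  const alerts: string[] = [];
  const failureRate = stats.processed === 0 ? 0 : stats.failed / stats.processed;
  if (stats.backlogDepth > limits.maxBacklog) alerts.push("backlog");
  if (failureRate > limits.maxFailureRate) alerts.push("failure-rate");
  if (stats.deadLettered > limits.maxDeadLettered) alerts.push("dead-letter");
  return alerts;
}
```

A worker would write each `formatJobLog` line to its log stream, and a scheduled check would pass the current stats to `alertsFor` and page someone when the list isn't empty.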

The payoff

Move slow work off the request path, make every handler idempotent, retry the transient and quarantine the poison, and watch the whole thing. You get a product that stays fast for users and a backend that degrades gracefully instead of falling over.

That's the bar I build backend systems and APIs to — reliable under real load, not just on the happy path.

Have work that needs to run reliably in the background? Start a project brief and tell me what you're building.

FAQ

When should work move to a background job?
Anything slow, anything that calls a flaky third party, and anything that doesn't need to finish before the user gets a response: sending email, payment side-effects, report generation, syncing an integration. Keep the request fast; do the rest in a worker.

How do you stop a retried job from running twice?
Idempotency. Give each job a stable key and make the handler safe to run more than once (check-then-act against a record, or upsert instead of insert) so a retry has no extra effect.

What makes a queue production-grade?
Retries with backoff, a dead-letter queue for poison messages, idempotent handlers, and alerting when the backlog or failure rate climbs. If it can fail silently, eventually it will.

Written by

Arvin Kent Lazaga

AI-Native Adaptive Full-Stack Software Engineer · Remote from the Philippines. I build production web, mobile, and backend systems across different stacks using Claude Code, OpenAI Codex, and disciplined planning, review, and testing.
