You're Still Writing Retry Logic in 2026. Netflix Stopped Years Ago.


Durable execution is replacing your retry logic, saga patterns, and dead-letter queues. How Temporal became critical infrastructure for Netflix, Stripe,...

Human-architected research synthesized with the assistance of AI personas.
15 min read

💡 TL;DR (Too Long; Didn't Read)

Key takeaways in 60 seconds:

  1. The Reliability Boilerplate Trap is real: Most backend teams spend 20–40% of their engineering time writing retry logic, state machines, saga coordinators, and dead-letter queue handlers. This code is never the product (it's plumbing), and it's almost never truly reliable.
  2. Durable execution is the paradigm shift: Instead of scattering failure-handling across dozens of services, you write workflow code that looks like normal functions. The runtime guarantees your code will complete, even across crashes, deploys, and network partitions.
  3. Temporal is the frontrunner, and the numbers prove it: 2,500+ companies, 7 million deployed clusters, and mission-critical adoption at Netflix, Stripe, Coinbase, Snap, and Twilio. Netflix reduced deployment failures from 4% to 0.0001%.
  4. The learning curve is brutal but the payoff is structural: Temporal requires a mental model shift away from task-based thinking. Teams report 10x development speed improvements: features that took 20 weeks now ship in 2.

The Lie You Tell Yourself at 2 AM

Every backend engineer has written this code. You've wired up a retry loop with exponential backoff. You've built a state machine to track which step of your multi-service transaction succeeded and which failed. You've added a dead-letter queue for the messages that fell through the cracks, then written another service to process that dead-letter queue, and then, inevitably, discovered that the dead-letter processor itself can fail.

You've built a saga coordinator. Or maybe you've hand-rolled compensating transactions with a PostgreSQL table that tracks order_status through twelve possible states, six of which are transient error conditions that nobody fully documented.

Six months later, you have a bespoke orchestration layer that nobody fully understands, with edge cases that only surface in production at 2 AM.

Verified Source: Temporal Engineering Blog, 2024–2026

The "reliability boilerplate trap" pattern, where orchestration maintenance exceeds product development time, is well-documented across distributed systems literature and Temporal's engineering blog.

I call this the Reliability Boilerplate Trap. And it's eating your engineering capacity alive.

What Durable Execution Actually Means

The concept is deceptively simple. Instead of writing defensive code that anticipates every possible failure, you write your business logic as a straightforward function, and the runtime guarantees it will run to completion.

If a server crashes mid-execution, the workflow resumes on another machine from exactly where it left off. If an API call times out, the activity retries automatically. If a Kubernetes pod gets evicted during a deployment, no state is lost.

This is not magic. Under the hood, durable execution engines use event sourcing: every step of your workflow generates an event that gets persisted to a durable store. On recovery, the engine replays the event history to reconstruct state without re-executing side effects. Activities (the units of work that interact with the outside world) are not re-run once they have completed; their results are memoized in the event log and read back during replay. An activity that fails or times out before its result is recorded can run again, which is why activities should be idempotent.
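To make the replay idea concrete, here is a minimal sketch in plain TypeScript. It is a toy model, not Temporal's implementation: `ToyReplayRuntime`, `runActivity`, and `toyOrderWorkflow` are hypothetical names invented for illustration, and an in-memory array stands in for the durable event store.

```typescript
// Toy model of replay-based recovery. Hypothetical names throughout;
// this is not how the Temporal SDK is implemented or used.
type ToyActivityEvent = { name: string; result: string };

class ToyReplayRuntime {
  history: ToyActivityEvent[];
  cursor = 0;
  executed: string[] = []; // activities that actually ran in this process

  constructor(history: ToyActivityEvent[]) {
    this.history = history;
  }

  // Return the recorded result when replaying; otherwise run and record.
  runActivity(name: string, fn: () => string): string {
    const recorded = this.history[this.cursor];
    this.cursor += 1;
    if (recorded !== undefined) {
      if (recorded.name !== name) {
        throw new Error(`replay mismatch: expected ${recorded.name}, got ${name}`);
      }
      return recorded.result; // memoized: the side effect is not repeated
    }
    const result = fn();
    this.executed.push(name);
    this.history.push({ name, result });
    return result;
  }
}

function toyOrderWorkflow(rt: ToyReplayRuntime, crashAfterCharge: boolean): string {
  const paymentId = rt.runActivity('charge', () => 'pay_123');
  if (crashAfterCharge) {
    throw new Error('simulated worker crash');
  }
  const reservationId = rt.runActivity('reserve', () => 'res_456');
  return `${paymentId}/${reservationId}`;
}

// The array stands in for a durable store that outlives any one worker.
const toyDurableHistory: ToyActivityEvent[] = [];

// First attempt: the charge is recorded, then the worker "crashes".
const firstAttempt = new ToyReplayRuntime(toyDurableHistory);
try {
  toyOrderWorkflow(firstAttempt, true);
} catch (err) {
  // A real engine would reschedule the workflow on another worker.
}

// Recovery: a fresh runtime replays the same history from the top.
const recovery = new ToyReplayRuntime(toyDurableHistory);
const toyOutcome = toyOrderWorkflow(recovery, false);
// recovery.executed is ['reserve']: the charge was replayed, not re-run.
```

The workflow function runs from the top both times. What changes is whether each step executes or simply returns what the history already holds.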

The critical insight: your workflow function must be deterministic. Given the same event history, it always produces the same sequence of commands. The Temporal server doesn't execute your code; your workers do. The server is a state machine that orchestrates, persists, and recovers.

This is a fundamentally different programming model from task queues, cron jobs, or message-driven architectures. You're not building around failure; you're writing code as if failure doesn't exist, and the platform handles the rest.

The Three Concepts You Need

Temporal's model reduces to three abstractions. If you understand these, you understand 90% of the platform.

Workflows are deterministic functions that define your business logic. Think of them as orchestrators. A workflow says: "First charge the customer, then reserve inventory, then send the confirmation email, then schedule shipping." Workflows can run for seconds or for months. They can sleep, wait for external signals, spawn child workflows, and coordinate across services.

Activities are the functions that interact with the outside world: API calls, database writes, file uploads. Activities are not deterministic; they can fail, time out, and be retried. Temporal wraps them with configurable retry policies, timeouts, and heartbeat monitoring.
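As a rough sketch of what a retry policy saves you from writing, the following plain TypeScript computes a capped exponential backoff schedule and applies it to a flaky function. The field names echo Temporal's retry policy options (initial interval, backoff coefficient, maximum interval, maximum attempts), but `ToyRetryPolicy`, `retryDelaysMs`, and `runWithRetries` are hypothetical helpers, and the sketch only tallies the waits instead of sleeping.

```typescript
// Hypothetical helper types; not part of any Temporal SDK.
interface ToyRetryPolicy {
  initialIntervalMs: number;
  backoffCoefficient: number;
  maximumIntervalMs: number;
  maximumAttempts: number;
}

// Delay before each retry: initial * coefficient^n, capped at the maximum.
function retryDelaysMs(policy: ToyRetryPolicy): number[] {
  const delays: number[] = [];
  // N attempts leave room for at most N - 1 waits between them.
  for (let retry = 0; retry < policy.maximumAttempts - 1; retry++) {
    const uncapped = policy.initialIntervalMs * Math.pow(policy.backoffCoefficient, retry);
    delays.push(Math.min(uncapped, policy.maximumIntervalMs));
  }
  return delays;
}

// Call `activity` until it succeeds or the attempts run out.
function runWithRetries<T>(
  activity: (attempt: number) => T,
  policy: ToyRetryPolicy,
): { result: T; attempts: number; waitedMs: number } {
  const delays = retryDelaysMs(policy);
  let waitedMs = 0;
  for (let attempt = 1; attempt <= policy.maximumAttempts; attempt++) {
    try {
      return { result: activity(attempt), attempts: attempt, waitedMs };
    } catch (err) {
      if (attempt === policy.maximumAttempts) {
        throw err; // out of attempts: surface the last failure
      }
      waitedMs += delays[attempt - 1] ?? 0; // a real runtime would wait here
    }
  }
  throw new Error('maximumAttempts must be at least 1');
}

const toyPolicy: ToyRetryPolicy = {
  initialIntervalMs: 1000,
  backoffCoefficient: 2,
  maximumIntervalMs: 5000,
  maximumAttempts: 5,
};

const schedule = retryDelaysMs(toyPolicy); // [1000, 2000, 4000, 5000]

// A flaky call that fails twice with a simulated timeout, then succeeds.
const flakyOutcome = runWithRetries((attempt) => {
  if (attempt < 3) {
    throw new Error('simulated timeout');
  }
  return 'charged';
}, toyPolicy);
// flakyOutcome: { result: 'charged', attempts: 3, waitedMs: 3000 }
```

In a durable execution engine this loop lives in the platform, configured per activity, so it stops being code you maintain.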

Workers are your application processes that poll the Temporal server for tasks and execute workflow and activity code. You run as many workers as you need for throughput and redundancy. They're stateless and horizontally scalable.

typescript
// This is a real Temporal workflow in TypeScript.
// It looks like normal async code because it IS normal async code.
import { proxyActivities, sleep } from '@temporalio/workflow';
import type * as activities from './activities';

const { chargePayment, reserveInventory, sendConfirmation, scheduleShipping } =
  proxyActivities<typeof activities>({
    startToCloseTimeout: '30s',
    retry: { maximumAttempts: 5 },
  });

export async function processOrder(orderId: string, amount: number): Promise<void> {
  // Step 1: Charge the customer
  const paymentId = await chargePayment(orderId, amount);

  // Step 2: Reserve inventory
  await reserveInventory(orderId);

  // Step 3: Send confirmation
  await sendConfirmation(orderId, paymentId);

  // Step 4: Wait 30 days, then trigger shipping survey
  await sleep('30 days'); // Yes, really. Survives server restarts.
  await scheduleShipping(orderId);
}

Notice what's missing: no retry logic, no state machine, no dead-letter queue, no database status column, no saga coordinator. The sleep('30 days') call will survive server restarts, deployments, and even infrastructure migrations. The Temporal server tracks the timer and wakes the workflow after exactly 30 days.
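One way to picture why a 30-day sleep can survive a restart: the timer is stored as an absolute wake-up time in durable storage, not as a countdown in a process's memory. The sketch below is a toy model under that assumption. `ToyTimerStore` and its methods are hypothetical, and a JSON string stands in for the database.

```typescript
// Toy model of a durable timer. Hypothetical names; not Temporal's internals.
interface ToyTimer {
  workflowId: string;
  fireAtMs: number; // absolute wake-up time, not a remaining duration
}

class ToyTimerStore {
  timers: ToyTimer[];

  constructor(timers: ToyTimer[]) {
    this.timers = timers;
  }

  // Record when the workflow should wake up.
  schedule(workflowId: string, nowMs: number, durationMs: number): void {
    this.timers.push({ workflowId, fireAtMs: nowMs + durationMs });
  }

  // Return and remove the workflows whose timers have fired.
  takeDue(nowMs: number): string[] {
    const due = this.timers.filter((t) => t.fireAtMs <= nowMs);
    this.timers = this.timers.filter((t) => t.fireAtMs > nowMs);
    return due.map((t) => t.workflowId);
  }

  // Serialize the state, as a stand-in for writing to a durable database.
  persist(): string {
    return JSON.stringify(this.timers);
  }

  static restore(snapshot: string): ToyTimerStore {
    return new ToyTimerStore(JSON.parse(snapshot) as ToyTimer[]);
  }
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Day 0: the workflow asks to sleep for 30 days.
const beforeRestart = new ToyTimerStore([]);
beforeRestart.schedule('order-42', 0, 30 * DAY_MS);
const timerSnapshot = beforeRestart.persist();

// The process restarts; only the persisted snapshot survives.
const afterRestart = ToyTimerStore.restore(timerSnapshot);
const dueOnDay29 = afterRestart.takeDue(29 * DAY_MS); // []
const dueOnDay30 = afterRestart.takeDue(30 * DAY_MS); // ['order-42']
```

Because the wake-up time is data rather than process state, any worker that comes up later can pick the workflow back up when the timer fires.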

Verified Source: Temporal Official Documentation

Temporal SDK examples and programming model based on official documentation. TypeScript SDK is production-ready per Temporal's official SDK support matrix.

Netflix: From 4% Failure to 0.0001%

Netflix's adoption story is the canonical proof that durable execution works at extreme scale.

Reported: byteiota.com, 2026-01-23

Netflix's Temporal adoption and failure rate reduction from 4% to near-zero documented in engineering case studies.

Netflix uses Spinnaker for the vast majority of its software deployments. Before Temporal, approximately 4% of deployments failed due to transient cloud operation failures. That number sounds small, but at Netflix's scale (millions of deployments across global infrastructure), 4% meant complex pipelines that took days to complete could fail mid-flight, requiring engineers to re-run entire pipelines from scratch.

The engineering team described the impact as "detrimental to engineering productivity in a non-trivial way." Teams with long, complex deployment pipelines were disproportionately affected.

After migrating to Temporal, Netflix reported that deployment failures due to transient infrastructure issues were "virtually eliminated." The platform allowed them to remove years of accumulated homegrown orchestration and retry logic. Temporal has since become "increasingly critical" to Netflix's infrastructure, used by teams from their Open Connect CDN operators to their Live reliability engineering group.

The pattern is consistent across the industry:

Stripe processes payments through Temporal workflows. Coinbase migrated their entire transaction pipeline and was confident enough to build their own Ruby SDK. Snap routes every Story through Temporal. Twilio delivers every message via Temporal-orchestrated workflows.

Verified Source: Temporal Enterprise Case Studies

Enterprise adoption claims (Stripe, Coinbase, Snap, Twilio) sourced from Temporal's official case studies and customer testimonials.

These aren't experiments. They're mission-critical production workloads handling billions of operations daily.

The Honest Assessment: When Temporal Hurts

Athena is not here to sell you a product. Let me be direct about where durable execution falls short, or where the cost outweighs the benefit.

The learning curve is brutal. Temporal requires what community forums honestly describe as a "complete mental model shift" from task-based systems like Celery, Sidekiq, or traditional message queues. Your workflow code must be deterministic: no random numbers, no reading the current time directly, no non-deterministic library calls inside workflows. This trips up every team in the first weeks.
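The determinism rule is easier to see in a toy replay. In the sketch below (plain TypeScript with hypothetical names, not the Temporal SDK), a workflow that reads a random source directly makes a different decision on replay than it did on the first run, while one that routes the value through a recorded step repeats its original decision.

```typescript
// Stand-in for a random source: each call returns the next value in the
// list, so a "first run" and a "replay" see different numbers.
function makeToyRandom(values: number[]): () => number {
  let index = 0;
  return () => {
    const value = values[index % values.length] ?? 0;
    index += 1;
    return value;
  };
}

// Hypothetical recorder: produce and store on first run, read back on replay.
class ToyRecorder {
  history: number[];
  cursor = 0;

  constructor(history: number[]) {
    this.history = history;
  }

  recordOnce(produce: () => number): number {
    if (this.cursor < this.history.length) {
      const recorded = this.history[this.cursor] ?? 0;
      this.cursor += 1;
      return recorded; // replay: the producer is never called
    }
    const value = produce();
    this.history.push(value);
    this.cursor += 1;
    return value;
  }
}

// Broken: branches on a value that changes between run and replay.
function unsafeDiscountWorkflow(random: () => number): string {
  return random() < 0.5 ? 'apply-discount' : 'full-price';
}

// Safer: the value is recorded once and read back on replay.
function recordedDiscountWorkflow(rec: ToyRecorder, random: () => number): string {
  return rec.recordOnce(random) < 0.5 ? 'apply-discount' : 'full-price';
}

const unsafeRandom = makeToyRandom([0.2, 0.9]);
const unsafeFirstRun = unsafeDiscountWorkflow(unsafeRandom); // 'apply-discount'
const unsafeReplay = unsafeDiscountWorkflow(unsafeRandom); // 'full-price'

const safeRandom = makeToyRandom([0.2, 0.9]);
const recordedValues: number[] = []; // stands in for the event history
const safeFirstRun = recordedDiscountWorkflow(new ToyRecorder(recordedValues), safeRandom);
const safeReplay = recordedDiscountWorkflow(new ToyRecorder(recordedValues), safeRandom);
// Both safe runs return 'apply-discount': the replay reads the recorded 0.2.
```

A replay that diverges from the recorded history is exactly what a durable execution engine cannot tolerate, which is why anything non-deterministic has to be captured as a recorded step or moved into an activity.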

It's not a database replacement. Temporal persists workflow state, not your application data. You still need your PostgreSQL, your Redis, your domain models. Temporal orchestrates the process, not the data.

It doesn't replace Kafka. This is the most common misconception. Kafka handles event streaming and real-time data flow. Temporal orchestrates workflows that consume those events and coordinate multi-step business processes. Companies like Netflix and Coinbase run both, using each where it fits.

Small teams may not need it. If your backend is a monolith with a handful of background jobs, Temporal's operational complexity (running the Temporal Server cluster, managing event histories, versioning workflows) may be overkill. The honest rule of thumb: if you're spending less than 20% of your engineering time on reliability plumbing, you probably don't need Temporal yet.

The operational cost is real. Self-hosting Temporal requires running a multi-component server (Frontend, History, Matching, and Worker services) backed by a durable database (Cassandra or MySQL/PostgreSQL). Temporal Cloud exists as a managed alternative, but it's not cheap for high-volume workloads.

The Landscape: Temporal vs. The Field

Temporal is not the only durable execution engine, and pretending otherwise would be intellectually dishonest. Here's how the landscape looks in March 2026:

AWS Step Functions are the default choice for teams already deep in the AWS ecosystem. They're serverless, managed, and well-integrated with Lambda. But they use a JSON-based state machine definition (Amazon States Language) that becomes unwieldy for complex workflows, and they're locked to AWS.

Azure Durable Functions offer a code-first approach similar to Temporal but within the Azure ecosystem. The programming model is solid, but portability is limited.

Restate is the most interesting newcomer. It takes a different architectural approach: instead of a central server, Restate acts as a lightweight proxy that intercepts function calls and provides durable execution guarantees with lower operational overhead than Temporal. It's worth evaluating for teams that want durable execution without running a full Temporal cluster.

Inngest targets the serverless and event-driven workflow space with a developer-friendly SDK and managed infrastructure. It's simpler than Temporal but less flexible for complex orchestration.

Temporal's advantage remains its polyglot support (Go, Java, TypeScript, Python, .NET, PHP), its battle-tested maturity at extreme scale, and the fact that it's MIT-licensed and can be self-hosted without vendor lock-in.

Verified Source: Temporal GitHub & Official Documentation

Temporal's MIT license, polyglot SDK support (Go, Java, TypeScript, Python, .NET, PHP), and self-hosting capability confirmed via official repository.

The Convergence Nobody's Talking About

Here's the thing that makes this article relevant beyond "just another infrastructure tool," and why I believe durable execution is the next standard infrastructure primitive.

AI agent orchestration is the same problem. Long-running agent workflows (where an agent calls multiple APIs, waits for human approval, retries on model failures, and maintains state across sessions) are architecturally identical to the distributed workflow problem Temporal was designed for. Temporal's own site now lists "Develop agents that survive real-world chaos" as a primary use case.

This isn't hypothetical. Temporal already supports MCP (Model Context Protocol) integration for tool orchestration. The convergence of agentic AI and durable execution is happening now, and teams that have already adopted durable execution for their traditional backends will have a structural advantage when they need to orchestrate multi-step AI agent workflows.

We covered the agentic orchestration problem extensively in our OWASP Agentic Top 10 series, specifically ASI07 (Multi-Agent Exploitation) and ASI08 (Cascading Failures). One of the core failure modes we documented is that agent orchestration systems lack the durability guarantees that Temporal provides by default. Agents crash, model calls time out, external tools become unavailable, and without durable execution the entire workflow falls apart.

Similarly, the Trivy Cascade incident demonstrated how complex multi-step processes (build → scan → deploy) are vulnerable precisely because their orchestration assumes success. A durable execution model would have provided checkpointing, replay, and auditable event histories at every stage.

The prediction (and I'll put this on our Evidence Wall): By 2028, "durable execution engine" will be as standard a part of production infrastructure as "message queue" or "cache layer" is today. Gartner already projects 80% of large software engineering organizations will have platform teams by 2026. Those platform teams will standardize on durable execution as a core primitive, not because it's trendy, but because the alternative is every team building (and failing to maintain) their own orchestration layer.

Getting Started Without Boiling the Ocean

If you're convinced that durable execution deserves evaluation, here's Athena's recommended adoption path:

Week 1: Local setup. Run Temporal locally, in Docker Compose or with the CLI. The development server ships in the single temporal CLI binary, so you can have a local cluster running in under 5 minutes. Write a "Hello World" workflow and an activity that calls an external API.

bash
# Temporal CLI dev server: zero configuration
temporal server start-dev
# Opens UI at http://localhost:8233

Week 2–3: Migrate one workflow. Pick the ugliest retry-heavy process in your codebase: the one with the state machine table and the dead-letter queue. Rewrite it as a Temporal workflow. This is where the mental model shift happens.

Month 2: Evaluate operations. Decide between self-hosting and Temporal Cloud based on your team's operational capacity and volume requirements. Run load tests. Measure the before/after on engineering time spent on reliability plumbing.

Month 3+: Expand or retreat. If the pilot worked, expand to more workflows. If it didn't, you've invested six weeks and learned something valuable about your team's actual distributed systems needs.

The key is not to attempt a full migration on day one. Temporal integrates incrementally: your workers poll the server for tasks, so existing services don't need to be rewritten. You can adopt it one workflow at a time.

The Bigger Picture

Twenty years ago, every team wrote their own job scheduling system. Then cron became standard, and later tools like Sidekiq and Celery. Ten years ago, every team managed their own message queues. Then managed services like SQS and Kafka became table stakes.

We are at the same inflection point for workflow orchestration. The homegrown state machines, the artisanal saga coordinators, the handcrafted retry loops: they're the equivalent of managing your own SMTP server in 2026. Technically possible. Strategically indefensible.

Durable execution isn't a silver bullet. It won't fix your domain model, it won't replace your database, and it won't magically make your API contracts coherent. But it will eliminate an entire category of engineering work: the plumbing that nobody wants to build, nobody wants to maintain, and everybody discovers is broken at 2 AM.

The tools exist. The patterns are proven. The question is no longer whether durable execution works. It's how long you'll keep paying the reliability tax before you adopt it.

