Reliability Without the Rube Goldberg Machine: Why Finch Moved to Temporal
September 10, 2025
10 min
Our traditional task queue stack (Celery and Django) was struggling to model the long-running, complex reality of our business in personal injury law. State management was brittle, observability was a constant struggle, and reasoning about the system was becoming a nightmare. We moved to a workflow engine, Temporal, to get durable, observable, and stateful primitives. This allowed us to model our business logic as straightforward code, resulting in simpler infrastructure, more reliable systems, and higher developer velocity.
Problem: A State Machine Held Together with Hope
At Finch, we automate the operational side of personal injury law. This isn't about simple, fire-and-forget background jobs. Our processes mirror the real world, which is slow, asynchronous, and messy. A single case might require us to:
Request medical records and wait up to 30 days for a response.
Follow up with an insurance carrier every 72 hours until a claim is acknowledged.
Pause a process entirely until a paralegal provides manual approval in our UI.
Trying to build this on Celery felt like building a house of cards. The "state" of a process wasn't an explicit field in a database; it was an implicit result of brittle task chains, database flags, and cron jobs. When a paralegal asked, "What's the status of the Johnson claim?", an engineer had to become a detective, mentally replaying scattered logs and task logic to reconstruct a timeline.
We needed a system built for orchestration, not just execution.
Solution: A New Primitive for Orchestration
Our search led us to adopt a new architectural primitive: the workflow. We chose Temporal as our workflow engine because it treats long-running, stateful processes as first-class citizens. It provides four core concepts that directly address our pain points.
Workflows: A workflow is the stateful brain. It’s defined in code and encapsulates the entire business process, including its state. Temporal automatically persists this state, allowing a workflow to run for days, weeks, or months, surviving server restarts and code deploys as if nothing happened.
Activities: These are the connection to the outside world. An activity is a simple function where you put all your side effects: database calls, API requests, sending emails. Temporal executes these with built-in, configurable retries, making our external interactions robust by default.
Signals: These provide an API for reality. A signal is an external, asynchronous message you can send to a running workflow. This is how we let a workflow pause and wait for a human action or a webhook, eliminating the need for polling or complex database state.
Queries: These are read-only functions for safely inspecting a workflow's state at any time. This allows us to surface the exact status of any process in our UI, finally answering the "what's the status?" question straight from the workflow's own state instead of from reconstructed logs.
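To make the signal and query ideas concrete, here is a minimal sketch in plain asyncio. It is a toy model, not Temporal SDK code: the class ApprovalWorkflow and its methods approve and status are hypothetical names invented for illustration, and nothing here is persisted. It only shows the shape of the interaction: the workflow blocks until a signal arrives, and a query reads state without changing it.

```python
import asyncio
from typing import Optional


class ApprovalWorkflow:
    """Toy stand-in for a workflow that pauses until a human approves (hypothetical)."""

    def __init__(self) -> None:
        self._status = "waiting_for_approval"
        self._approved = asyncio.Event()
        self._approver: Optional[str] = None

    async def run(self) -> str:
        # Block until the approval signal arrives: no polling, no database flags.
        await self._approved.wait()
        self._status = "approved"
        return f"approved by {self._approver}"

    def approve(self, approver: str) -> None:
        """Signal: an external, asynchronous message that changes workflow state."""
        self._approver = approver
        self._approved.set()

    def status(self) -> str:
        """Query: read-only inspection of the current state."""
        return self._status


async def main():
    wf = ApprovalWorkflow()
    task = asyncio.create_task(wf.run())
    await asyncio.sleep(0)  # let the workflow start and reach its wait
    before = wf.status()
    wf.approve("paralegal@example.com")
    result = await task
    return before, result, wf.status()


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In a real workflow engine the waiting state would survive process restarts; this in-memory version would not.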
Putting It All Together: A Real-World Example
With these primitives, we could model our "Request Medical Records" process cleanly. The entire lifecycle, which can take over a month, is now a single workflow.
At first glance, the workflow might sound simple: look up the provider’s contact information, send a records request, and wait up to 30 days for a response. In reality, Finch’s Temporal workflow is an active orchestration that supervises the request end-to-end, periodically checks progress, and escalates to a human when milestones aren’t met.
How it works:
Find & Validate Provider Contact Info: The workflow calls an activity to search and validate the provider’s details (address, fax, email, portal) and preferred channel.
Send the Request: It generates the request packet (with the correct authorizations/attachments) and transmits via the validated channel, logging method and timestamp.
Confirm Receipt: It looks for delivery confirmation (fax OK, email delivery/read, portal ack). If none arrives within a short window, it retries or switches channel when appropriate.
Periodic Status Checks: On a cadence (e.g., every 3–5 business days), the workflow wakes to check for status: received, in processing, records ready, or additional info needed. Each check is idempotent and recorded.
Milestone-Driven Escalation: If key milestones aren’t met (e.g., no receipt by Day 7, no progress by Day 21, hard rejection), the workflow escalates to a paralegal with full context and next-best actions (e.g., call script, alternate fax, portal message). Escalations surface in Slack and the case dashboard.
The result is a living, stateful loop that automates the boring parts while ensuring a human steps in precisely when judgment or persuasion is needed.
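The milestone rules above can be sketched as a small pure function. This is an illustrative model, not our production workflow: RequestState, escalation_reason, and supervise are hypothetical names, the Day 7 and Day 21 thresholds come from the examples above, and the 4-day cadence in calendar days is an assumption standing in for "every 3–5 business days".

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class RequestState:
    days_elapsed: int
    receipt_confirmed: bool
    progress_seen: bool
    hard_rejection: bool = False


def escalation_reason(state: RequestState) -> Optional[str]:
    """Return why a paralegal should step in, or None if no milestone was missed."""
    if state.hard_rejection:
        return "hard rejection"
    if state.days_elapsed >= 7 and not state.receipt_confirmed:
        return "no receipt by day 7"
    if state.days_elapsed >= 21 and not state.progress_seen:
        return "no progress by day 21"
    return None


# observe(day) -> (receipt_confirmed, progress_seen, hard_rejection)
Observation = Callable[[int], Tuple[bool, bool, bool]]


def supervise(observe: Observation, cadence_days: int = 4,
              deadline_days: int = 30) -> Optional[Tuple[int, str]]:
    """Walk simulated check-ins; return (day, reason) for the first escalation."""
    for day in range(cadence_days, deadline_days + 1, cadence_days):
        receipt, progress, rejected = observe(day)
        reason = escalation_reason(RequestState(day, receipt, progress, rejected))
        if reason is not None:
            return day, reason
    return None


if __name__ == "__main__":
    # The provider never confirms receipt, so the first check past day 7 escalates.
    print(supervise(lambda day: (False, False, False)))
```

Keeping the decision logic free of I/O makes each periodic check easy to test in isolation; the sleeping between checks is what the workflow engine provides.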
The true power is in coordinating parallel work. To get the provider's information, the workflow kicks off two processes at once: it launches an automated agent to search online while simultaneously creating a task for a human to enter the information in our UI. The workflow then simply waits for a signal indicating the information has been found, regardless of which process finds it first. Once the signal arrives, the workflow cancels the unnecessary process and moves on. This elegant coordination of automated and human work was previously unthinkable.
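The race between the automated agent and the human task can be approximated with plain asyncio. This is a sketch under simplifying assumptions: automated_search, human_entry, and find_provider_info are hypothetical names, the delays simulate real work, and the code waits for whichever task completes first rather than for a Temporal signal as our workflow does.

```python
import asyncio
from typing import Tuple


async def automated_search(delay: float) -> str:
    """Simulated online search by an automated agent."""
    await asyncio.sleep(delay)
    return "automated agent"


async def human_entry(delay: float) -> str:
    """Simulated paralegal entering the information in the UI."""
    await asyncio.sleep(delay)
    return "paralegal"


async def find_provider_info(agent_delay: float, human_delay: float) -> Tuple[str, bool]:
    """Start both paths, keep the first result, and cancel the slower path."""
    tasks = [
        asyncio.create_task(automated_search(agent_delay)),
        asyncio.create_task(human_entry(human_delay)),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # the other path is no longer needed
    # Let the cancellations finish before reporting back.
    await asyncio.gather(*pending, return_exceptions=True)
    winner = next(iter(done)).result()
    loser_cancelled = all(task.cancelled() for task in pending)
    return winner, loser_cancelled


if __name__ == "__main__":
    print(asyncio.run(find_provider_info(agent_delay=0.01, human_delay=0.5)))
```

The important property is that the caller does not care which path wins; it only reacts to the first result and cleans up the rest.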
The Benefits: What We Unlocked
Adopting this model provided four clear benefits:
Extreme Reliability: Workflows are durable by default. The ability to sleep for 30 days and know it will resume correctly, even if the entire worker fleet restarts, is a superpower. The built-in retry logic for activities means we no longer worry about transient network failures.
First-Class Observability: The Temporal UI gives us a searchable, visual history of every single workflow execution. We can see the inputs, outputs, and stack traces for every activity. The "detective work" is gone. Debugging is no longer about parsing logs; it's about reading a story.
Increased Developer Velocity: Because the hard parts of distributed systems (state management, retries, timers) are handled by the platform, developers can focus on writing business logic. This new backbone allowed us to easily implement powerful patterns like Sagas and Human-in-the-Loop. We even built a lightweight, durable Pub/Sub system using workflows and signals, saving us from adding and managing new infrastructure like Kafka or RabbitMQ.
Effortless Hotfixes and Backfills: In a traditional system, if a long-running process fails on step 7 of 10 due to a bug, recovery is often a painful, manual ordeal. Because Temporal durably records the state of every successful step, we can deploy a hotfix to our code, find the failed workflow, and simply restart it from the exact point of failure. This turns complex operational emergencies into trivial tasks.
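The idea behind resuming from the point of failure can be shown with a toy step runner. This is a conceptual sketch, not how Temporal is implemented: run_steps and the step functions are hypothetical, and the history dictionary lives in memory where a real engine would persist it durably.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], str]]


def run_steps(steps: List[Step], history: Dict[str, str]) -> Dict[str, str]:
    """Run steps in order, skipping any whose result is already recorded."""
    for name, fn in steps:
        if name in history:
            continue  # completed in an earlier attempt, do not repeat it
        history[name] = fn()  # recorded only after the step succeeds
    return history


def demo() -> Tuple[List[str], Dict[str, str]]:
    calls: List[str] = []

    def fetch_contact() -> str:
        calls.append("fetch_contact")
        return "fax: 555-0100"

    def buggy_send() -> str:
        calls.append("send_request")
        raise RuntimeError("bug in request formatting")

    def fixed_send() -> str:
        calls.append("send_request")
        return "sent"

    history: Dict[str, str] = {}
    try:
        run_steps([("fetch_contact", fetch_contact), ("send_request", buggy_send)], history)
    except RuntimeError:
        pass  # first attempt fails on the second step

    # Deploy the hotfix and re-run: the first step is skipped, not repeated.
    run_steps([("fetch_contact", fetch_contact), ("send_request", fixed_send)], history)
    return calls, history


if __name__ == "__main__":
    print(demo())
```

Because completed steps are keyed by name, the second run only executes the step that failed, which is the behavior that makes hotfixes cheap.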
The Tradeoffs: What to Know Before You Jump
The journey required a mindset shift and came with important considerations.
The Hurdle of Determinism: Because Temporal revives state by replaying history, workflow code must be deterministic (i.e., no direct I/O, random numbers, or calls to datetime.now()). All side effects must be in activities. This constraint forces good habits but takes time to internalize. It also means that changes to workflow logic must be carefully versioned to avoid breaking long-running processes.
The One-Way Door: Adopting a workflow engine like Temporal is a deep architectural commitment. Its SDKs and patterns become tightly coupled with your code. Migrating away would be a significant undertaking, so it's a decision that warrants careful consideration.
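The determinism constraint is easier to see with a toy replay model. The sketch below is not Temporal's implementation: Context, side_effect, and wall_clock are hypothetical names, and wall_clock is a counter standing in for datetime.now() so the behavior is repeatable. Values obtained through the context are recorded on the first run and reused on replay, while a direct call gives a different answer each time.

```python
import itertools
from typing import Callable, List, Optional

_ticks = itertools.count(100)


def wall_clock() -> int:
    """Stand-in for datetime.now(): returns a different value on every call."""
    return next(_ticks)


class Context:
    """Records non-deterministic values on first run and replays them afterwards."""

    def __init__(self, history: Optional[List[int]] = None) -> None:
        self.history: List[int] = list(history or [])
        self._cursor = 0

    def side_effect(self, fn: Callable[[], int]) -> int:
        if self._cursor < len(self.history):
            value = self.history[self._cursor]  # replay: reuse the recorded value
        else:
            value = fn()  # first run: execute and record
            self.history.append(value)
        self._cursor += 1
        return value


def good_workflow(ctx: Context) -> str:
    # The clock is read through the context, so replay sees the same value.
    now = ctx.side_effect(wall_clock)
    return "escalate" if now % 2 == 0 else "wait"


def bad_workflow() -> str:
    # The clock is read directly, so every execution can take a different branch.
    now = wall_clock()
    return "escalate" if now % 2 == 0 else "wait"


if __name__ == "__main__":
    first = Context()
    decision = good_workflow(first)
    replayed = good_workflow(Context(first.history))
    print(decision == replayed)              # the replay agrees with the first run
    print(bad_workflow() == bad_workflow())  # consecutive direct reads disagree
```

In Temporal's Python SDK the same role is played by moving side effects into activities and by using replay-safe helpers such as workflow.now() in place of datetime.now().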
Conclusion
Moving to Temporal was more than a technology swap; it was a shift in our architectural mindset. By embracing the workflow as a central primitive, we're no longer wrestling with brittle infrastructure. Instead, we're building a resilient, observable foundation that allows us to automate ever more complex aspects of law, faster than we ever thought possible.