Building Resilient APIs: Lessons from Flock's Insurance Platform
Alex Chen, Senior Engineer
When you're building insurance software, reliability isn't a nice-to-have — it's the product. A drone operator standing on a rooftop, waiting for their pay-per-flight policy to activate before they can legally fly, has exactly zero tolerance for a 502 error. At Flock, our APIs sit at the heart of a regulatory and financial system where downtime has real-world consequences.
In this post, I'll walk through the resilience patterns we've adopted across our platform, the failures that forced us to adopt them, and the TypeScript patterns we now consider non-negotiable for any service we ship.
Flock's backend orchestrates a surprisingly complex network of external dependencies:
Underwriter APIs — legacy SOAP services that respond in 800ms on a good day
Payment processors — Stripe, GoCardless, and insurer-specific billing rails
Aviation data providers — real-time airspace, NOTAMs, weather
Regulatory databases — CAA, FAA, EASA
Our own 120+ internal microservices
Any of these can be slow, return errors, or go completely dark. The question isn't if they'll fail — it's whether our service will fail gracefully or cascade into a full outage.
Pattern 1: Circuit Breakers

The circuit breaker is the foundational resilience primitive. The idea is simple: if a downstream service starts failing repeatedly, stop hammering it and fail fast instead.
We use a state machine with three states: Closed (normal operation), Open (fail fast), and Half-Open (testing recovery). Here's our production implementation:
```typescript
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN'

interface CircuitBreakerOptions {
  failureThreshold: number
  successThreshold: number
  timeout: number // ms to wait before trying HALF_OPEN
  volumeThreshold: number // min requests before tripping
}

export class CircuitOpenError extends Error {}

export class CircuitBreaker<T> {
  private state: CircuitState = 'CLOSED'
  private failureCount = 0
  private successCount = 0
  private lastFailureTime: number | null = null
  private requestCount = 0

  constructor(
    private readonly fn: (...args: unknown[]) => Promise<T>,
    private readonly options: CircuitBreakerOptions,
  ) {}

  async call(...args: unknown[]): Promise<T> {
    if (this.state === 'OPEN') {
      if (this.shouldAttemptReset()) {
        this.state = 'HALF_OPEN'
      } else {
        throw new CircuitOpenError(
          `Circuit breaker open. Last failure: ${this.lastFailureTime}`,
        )
      }
    }

    this.requestCount++

    try {
      const result = await this.fn(...args)
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      throw error
    }
  }

  private onSuccess(): void {
    this.failureCount = 0
    if (this.state === 'HALF_OPEN') {
      this.successCount++
      if (this.successCount >= this.options.successThreshold) {
        this.state = 'CLOSED'
        this.successCount = 0
      }
    }
  }

  private onFailure(): void {
    this.failureCount++
    this.lastFailureTime = Date.now()

    if (this.state === 'HALF_OPEN') {
      // A single failure while probing recovery sends us straight back to OPEN
      this.state = 'OPEN'
      this.successCount = 0
      return
    }

    if (
      this.requestCount >= this.options.volumeThreshold &&
      this.failureCount >= this.options.failureThreshold
    ) {
      this.state = 'OPEN'
    }
  }

  private shouldAttemptReset(): boolean {
    return (
      this.lastFailureTime !== null &&
      Date.now() - this.lastFailureTime >= this.options.timeout
    )
  }

  getState(): CircuitState {
    return this.state
  }
}
```
💡
Tip: Tune thresholds per service
Don't use the same circuit breaker settings for every integration. A payment processor might warrant a higher failure threshold before tripping (false positives are expensive) while a non-critical data enrichment service should trip quickly and have a graceful fallback.
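As a concrete sketch, per-integration settings might look like this (the values and integration names are illustrative, not our production numbers):

```typescript
// Illustrative per-integration circuit breaker settings. A payment
// processor tolerates more failures before tripping and waits longer
// before probing recovery; a data enrichment service trips fast because
// we have a graceful fallback for it.
const breakerOptions = {
  payments: {
    failureThreshold: 10, // expensive to trip on a false positive
    successThreshold: 3,
    timeout: 60_000, // wait a full minute before HALF_OPEN
    volumeThreshold: 20,
  },
  enrichment: {
    failureThreshold: 3, // trip quickly; fallback exists
    successThreshold: 1,
    timeout: 10_000,
    volumeThreshold: 5,
  },
} as const
```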
Pattern 2: Retry with Exponential Backoff + Jitter
Retries seem obvious — but naive retries can make things worse. If 1000 requests all fail at t=0 and all retry at t=100ms, you've just created a thundering herd. Adding jitter spreads the load:
```typescript
interface RetryOptions {
  maxAttempts: number
  baseDelayMs: number
  maxDelayMs: number
  retryOn?: (error: unknown) => boolean
}

export async function withRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const { maxAttempts, baseDelayMs, maxDelayMs, retryOn } = options

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (error) {
      const isLastAttempt = attempt === maxAttempts
      const shouldRetry = retryOn ? retryOn(error) : isRetryableError(error)

      if (isLastAttempt || !shouldRetry) {
        throw error
      }

      // Exponential backoff with full jitter
      const exponentialDelay = Math.min(
        baseDelayMs * Math.pow(2, attempt - 1),
        maxDelayMs,
      )
      const delay = Math.floor(Math.random() * exponentialDelay)
      await sleep(delay)
    }
  }

  // TypeScript needs this — unreachable in practice
  throw new Error('Retry loop exited unexpectedly')
}

function isRetryableError(error: unknown): boolean {
  if (error instanceof Error) {
    // Retry on 5xx (except 501 Not Implemented), never on 4xx
    if ('statusCode' in error) {
      const status = (error as { statusCode: number }).statusCode
      return status >= 500 && status !== 501
    }
    // Network-level errors
    return ['ECONNRESET', 'ENOTFOUND', 'ETIMEDOUT'].includes(
      (error as NodeJS.ErrnoException).code ?? '',
    )
  }
  return false
}

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))
```
Pattern 3: Bulkheads

Named after the watertight compartments in a ship's hull, bulkheads prevent one slow dependency from exhausting your entire connection pool. We implement this using separate connection pools and worker queues per integration:
```typescript
import PQueue from 'p-queue'

// Separate queues per downstream service
export const queues = {
  underwriter: new PQueue({ concurrency: 10, timeout: 5000 }),
  aviation: new PQueue({ concurrency: 50, timeout: 2000 }),
  payments: new PQueue({ concurrency: 5, timeout: 30000 }),
} as const

// Usage
export async function getUnderwriterQuote(params: QuoteParams) {
  return queues.underwriter.add(async () => {
    return underwriterClient.getQuote(params)
  })
}
```
This means a wave of slow underwriter requests can't starve our real-time aviation data lookups. Each integration has its own blast radius.
Observability

None of these patterns are useful if you can't observe them. Every circuit breaker state transition, every retry, and every timeout gets emitted as a metric to Datadog.
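As a sketch of the idea, here is a minimal in-process event bus for resilience metrics. This is illustrative only — in production the events are forwarded to a Datadog client, and the metric names and helper functions here are assumptions, not our actual wiring:

```typescript
import { EventEmitter } from 'node:events'

// A minimal metrics bus. Resilience primitives emit structured events
// here; a single subscriber forwards them to the metrics backend, so
// instrumentation stays decoupled from the transport.
export const metrics = new EventEmitter()

interface MetricEvent {
  name: string
  tags: Record<string, string>
  timestamp: number
}

export function recordStateTransition(
  service: string,
  from: string,
  to: string,
): void {
  metrics.emit('metric', {
    name: 'circuit_breaker.state_change',
    tags: { service, from, to },
    timestamp: Date.now(),
  } satisfies MetricEvent)
}

export function recordRetry(service: string, attempt: number): void {
  metrics.emit('metric', {
    name: 'retry.attempt',
    tags: { service, attempt: String(attempt) },
    timestamp: Date.now(),
  } satisfies MetricEvent)
}
```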
This talk by Ines Montani on building robust ML systems in production shaped much of how we think about fault tolerance at the service level.
⚠️
Watch out for partial failures
The hardest category of failure to handle is when a request succeeds on the remote side but the response never makes it back — the payment is debited but you never got the confirmation. Always design for idempotency at every layer. Include a client-generated idempotency-key header and make your handlers safe to call twice.
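A minimal sketch of that idea, using an in-memory store — in production the store would be Redis or a database row with a TTL, and the names here are illustrative:

```typescript
// Keyed on a client-generated idempotency key, we cache the promise for
// each operation. Concurrent duplicates and retries of the same key
// reuse the original execution instead of re-running the side effect
// (e.g. debiting a payment twice).
const inFlight = new Map<string, Promise<unknown>>()

export function idempotent<T>(
  key: string,
  handler: () => Promise<T>,
): Promise<T> {
  const existing = inFlight.get(key)
  if (existing) {
    return existing as Promise<T>
  }
  // NOTE: a production version would evict rejected promises so a
  // genuine failure can be retried under the same key.
  const result = handler()
  inFlight.set(key, result)
  return result
}
```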
Resilience engineering isn't glamorous — it's the unglamorous discipline of assuming everything will fail and designing accordingly. At Flock, this mindset has let us maintain 99.97% uptime on our policy API even as our external dependency count has grown to over 30 integrations.
The patterns in this post — circuit breakers, retries with jitter, bulkheads, and observability-first design — form the foundation of our reliability story. None of them are complex in isolation. The discipline is in applying them consistently, instrumenting them thoroughly, and iterating when the data shows something isn't working.
If you're building insurance infrastructure and want to trade notes, reach out — or consider joining us.