Feb 15, 2025 · Engineering Blog / APIs · 12 min read

Building Resilient APIs: Lessons from Flock's Insurance Platform

Alex Chen, Senior Engineer

When you're building insurance software, reliability isn't a nice-to-have — it's the product. A drone operator standing on a rooftop, waiting for their pay-per-flight policy to activate before they can legally fly, has exactly zero tolerance for a 502 error. At Flock, our APIs sit at the heart of a regulatory and financial system where downtime has real-world consequences.

In this post, I'll walk through the resilience patterns we've adopted across our platform, the failures that forced us to adopt them, and the TypeScript patterns we now consider non-negotiable for any service we ship.

The Problem Space

Flock's backend orchestrates a surprisingly complex network of external dependencies:

  • Underwriter APIs — legacy SOAP services that respond in 800ms on a good day
  • Payment processors — Stripe, GoCardless, and insurer-specific billing rails
  • Aviation data providers — real-time airspace, NOTAMs, weather
  • Regulatory databases — CAA, FAA, EASA
  • Our own 120+ internal microservices

Any of these can be slow, return errors, or go completely dark. The question isn't if they'll fail — it's whether our service will fail gracefully or cascade into a full outage.


Pattern 1: The Circuit Breaker

The circuit breaker is the foundational resilience primitive. The idea is simple: if a downstream service starts failing repeatedly, stop hammering it and fail fast instead.

We use a state machine with three states: Closed (normal operation), Open (fail fast), and Half-Open (testing recovery). Here's our production implementation:

typescript
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN'
 
export class CircuitOpenError extends Error {}
 
interface CircuitBreakerOptions {
  failureThreshold: number
  successThreshold: number
  timeout: number // ms to wait before trying HALF_OPEN
  volumeThreshold: number // min requests before tripping
}
 
export class CircuitBreaker<T, Args extends unknown[] = unknown[]> {
  private state: CircuitState = 'CLOSED'
  private failureCount = 0
  private successCount = 0
  private lastFailureTime: number | null = null
  private requestCount = 0
 
  constructor(
    private readonly fn: (...args: Args) => Promise<T>,
    private readonly options: CircuitBreakerOptions,
  ) {}
 
  async call(...args: Args): Promise<T> {
    if (this.state === 'OPEN') {
      if (this.shouldAttemptReset()) {
        this.state = 'HALF_OPEN'
      } else {
        throw new CircuitOpenError(
          `Circuit breaker open. Last failure: ${this.lastFailureTime}`,
        )
      }
    }
 
    this.requestCount++
 
    try {
      const result = await this.fn(...args)
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      throw error
    }
  }
 
  private onSuccess(): void {
    this.failureCount = 0
    if (this.state === 'HALF_OPEN') {
      this.successCount++
      if (this.successCount >= this.options.successThreshold) {
        this.state = 'CLOSED'
        this.successCount = 0
      }
    }
  }
 
  private onFailure(): void {
    this.failureCount++
    this.lastFailureTime = Date.now()
 
    // Any failure while probing in HALF_OPEN re-opens the circuit immediately
    if (this.state === 'HALF_OPEN') {
      this.successCount = 0
      this.state = 'OPEN'
      return
    }
 
    if (
      this.requestCount >= this.options.volumeThreshold &&
      this.failureCount >= this.options.failureThreshold
    ) {
      this.state = 'OPEN'
    }
  }
 
  private shouldAttemptReset(): boolean {
    return (
      this.lastFailureTime !== null &&
      Date.now() - this.lastFailureTime >= this.options.timeout
    )
  }
 
  getState(): CircuitState {
    return this.state
  }
}

Tip: Tune thresholds per service

Don't use the same circuit breaker settings for every integration. A payment processor might warrant a higher failure threshold before tripping (false positives are expensive), while a non-critical data enrichment service should trip quickly and fall back gracefully.
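As a sketch, per-integration settings can live in a single config map so the trade-offs are visible in one place. The service names and numbers below are illustrative, not our production values:

```typescript
interface CircuitBreakerOptions {
  failureThreshold: number
  successThreshold: number
  timeout: number // ms to wait before trying HALF_OPEN
  volumeThreshold: number // min requests before tripping
}

// Illustrative only: a critical, expensive-to-trip integration tolerates
// more failures before opening; a cheap enrichment call trips fast and
// recovers fast.
const breakerConfigs = {
  payments: {
    failureThreshold: 10,
    successThreshold: 3,
    timeout: 60_000,
    volumeThreshold: 20,
  },
  enrichment: {
    failureThreshold: 3,
    successThreshold: 1,
    timeout: 10_000,
    volumeThreshold: 5,
  },
} satisfies Record<string, CircuitBreakerOptions>
```

Keeping all the thresholds in one map also makes it easy to review them together when an incident shows a breaker tripping too eagerly or too late.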

Pattern 2: Retry with Exponential Backoff + Jitter

Retries seem obvious — but naive retries can make things worse. If 1000 requests all fail at t=0 and all retry at t=100ms, you've just created a thundering herd. Adding jitter spreads the load:

typescript
interface RetryOptions {
  maxAttempts: number
  baseDelayMs: number
  maxDelayMs: number
  retryOn?: (error: unknown) => boolean
}
 
export async function withRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const { maxAttempts, baseDelayMs, maxDelayMs, retryOn } = options
 
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (error) {
      const isLastAttempt = attempt === maxAttempts
      const shouldRetry = retryOn ? retryOn(error) : isRetryableError(error)
 
      if (isLastAttempt || !shouldRetry) {
        throw error
      }
 
      // Exponential backoff with full jitter
      const exponentialDelay = Math.min(
        baseDelayMs * Math.pow(2, attempt - 1),
        maxDelayMs,
      )
      const jitter = Math.random() * exponentialDelay
      const delay = Math.floor(jitter)
 
      await sleep(delay)
    }
  }
 
  // TypeScript needs this — unreachable in practice
  throw new Error('Retry loop exited unexpectedly')
}
 
function isRetryableError(error: unknown): boolean {
  if (error instanceof Error) {
    // Retry on network errors and 5xx, never on 4xx
    if ('statusCode' in error) {
      const status = (error as { statusCode: number }).statusCode
      return status >= 500 && status !== 501
    }
    // Network-level errors
    return ['ECONNRESET', 'ENOTFOUND', 'ETIMEDOUT'].includes(
      (error as NodeJS.ErrnoException).code ?? '',
    )
  }
  return false
}
 
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))
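To make the backoff arithmetic concrete, here's a small helper (not part of our codebase) that computes the capped pre-jitter ceiling for each attempt. With baseDelayMs = 100 and maxDelayMs = 1000, attempts 1 through 5 yield ceilings of 100, 200, 400, 800, and 1000 ms; full jitter then draws a uniform value below each ceiling:

```typescript
// Pre-jitter delay ceiling per attempt: base * 2^(attempt-1), capped at max.
// The actual sleep in withRetry is Math.random() * ceiling ("full jitter").
function backoffSchedule(
  baseDelayMs: number,
  maxDelayMs: number,
  attempts: number,
): number[] {
  return Array.from({ length: attempts }, (_, i) =>
    Math.min(baseDelayMs * Math.pow(2, i), maxDelayMs),
  )
}
```

Note that full jitter trades predictability for decorrelation: individual clients may retry almost immediately, but across a fleet the retries spread evenly over the window, which is exactly what you want when a shared dependency recovers.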

The Architecture Overview

Here's how these patterns compose in our policy issuance flow:

[Diagram: policy issuance flow]

Pattern 3: Bulkheads

Named after the watertight compartments in a ship's hull, bulkheads prevent one slow dependency from exhausting your entire connection pool. We implement this using separate connection pools and worker queues per integration:

typescript
import PQueue from 'p-queue'
 
// Separate queues per downstream service. Note: p-queue's `timeout`
// only rejects the queued task if `throwOnTimeout` is set.
export const queues = {
  underwriter: new PQueue({ concurrency: 10, timeout: 5000, throwOnTimeout: true }),
  aviation: new PQueue({ concurrency: 50, timeout: 2000, throwOnTimeout: true }),
  payments: new PQueue({ concurrency: 5, timeout: 30000, throwOnTimeout: true }),
} as const
 
// Usage
export async function getUnderwriterQuote(params: QuoteParams) {
  return queues.underwriter.add(async () => {
    return underwriterClient.getQuote(params)
  })
}

This means a wave of slow underwriter requests can't starve our real-time aviation data lookups. Each integration has its own blast radius.
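If you want to see the core mechanism without the library, a bulkhead reduces to a counting semaphore that caps in-flight calls. This is a minimal sketch, not our production code — p-queue adds queue-length limits, timeouts, and priorities on top:

```typescript
// Minimal bulkhead: at most `limit` calls in flight; excess callers wait.
class Bulkhead {
  private active = 0
  private waiters: Array<() => void> = []

  constructor(private readonly limit: number) {}

  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      // Park until a running call finishes and hands us its slot
      await new Promise<void>((resolve) => this.waiters.push(resolve))
    } else {
      this.active++
    }
    try {
      return await fn()
    } finally {
      // Hand the slot directly to the next waiter, or free it
      const next = this.waiters.shift()
      if (next) next()
      else this.active--
    }
  }
}
```

Handing the slot directly to the next waiter (rather than decrementing and letting waiters race) keeps the in-flight count from ever exceeding the limit.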

Observability is Part of Resilience

None of these patterns are useful if you can't observe them. Every circuit breaker state transition, every retry, every timeout gets emitted as a metric to Datadog:

typescript
export function instrumentedCircuitBreaker<T, Args extends unknown[]>(
  name: string,
  fn: (...args: Args) => Promise<T>,
  options: CircuitBreakerOptions,
): (...args: Args) => Promise<T> {
  const breaker = new CircuitBreaker(fn, options)
 
  return async (...args: Args) => {
    const startTime = Date.now()
 
    try {
      const result = await breaker.call(...args)
      const duration = Date.now() - startTime
 
      metrics.timing(`circuit_breaker.call.duration`, duration, {
        name,
        state: breaker.getState(),
        outcome: 'success',
      })
 
      return result
    } catch (error) {
      const duration = Date.now() - startTime
      const isOpen = error instanceof CircuitOpenError
 
      metrics.timing(`circuit_breaker.call.duration`, duration, {
        name,
        state: breaker.getState(),
        outcome: isOpen ? 'circuit_open' : 'error',
      })
 
      metrics.increment(`circuit_breaker.failures`, 1, { name })
 
      throw error
    }
  }
}

A Talk We Found Invaluable

This talk by Ines Montani on building robust ML systems in production shaped much of how we think about fault tolerance at the service level.

Watch out for partial failures

The hardest category of failure to handle is when a request succeeds on the remote side but the response never makes it back — the payment is debited but you never receive the confirmation. Always design for idempotency at every layer: include a client-generated Idempotency-Key header and make your handlers safe to call twice.
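Here's a minimal sketch of the server side, assuming an in-memory key-value store for simplicity. Production code would need a durable store and locking (or a conditional write) to handle concurrent duplicates; this shows only the basic replay-safe shape:

```typescript
type Handler<Req, Res> = (req: Req) => Promise<Res>

// Wrap a handler so that repeated calls with the same idempotency key
// return the stored result instead of re-running the side effect.
function idempotent<Req, Res>(
  store: Map<string, Res>,
  handler: Handler<Req, Res>,
): (key: string, req: Req) => Promise<Res> {
  return async (key, req) => {
    const cached = store.get(key)
    if (cached !== undefined) {
      return cached // replay: return the previously stored result
    }
    const result = await handler(req)
    store.set(key, result)
    return result
  }
}
```

A client that times out and retries with the same key gets the original result back, and the underlying side effect (the debit, the policy creation) runs exactly once.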

Composing the Patterns

In production, we rarely use these patterns in isolation. Here's a real example from our policy creation endpoint:

typescript
// policy-service/src/integrations/underwriter.ts
 
const underwriterBreaker = instrumentedCircuitBreaker(
  'underwriter-quote',
  async (params: QuoteParams) => {
    return withRetry(
      () => httpClient.post('/quotes', params, { timeout: 4000 }),
      {
        maxAttempts: 3,
        baseDelayMs: 100,
        maxDelayMs: 1000,
        retryOn: isRetryableError,
      },
    )
  },
  {
    failureThreshold: 5,
    successThreshold: 2,
    timeout: 30_000,
    volumeThreshold: 10,
  },
)
 
export async function getQuote(params: QuoteParams): Promise<Quote> {
  return queues.underwriter.add(async () => {
    try {
      const raw = await underwriterBreaker(params)
      return transformQuote(raw)
    } catch (error) {
      if (error instanceof CircuitOpenError) {
        // Return a cached/stale quote while circuit is open
        const cached = await quoteCache.get(params)
        if (cached) {
          logger.warn('Returning stale quote — underwriter circuit open', { params })
          return { ...cached, isStale: true }
        }
      }
      throw error
    }
  })
}

Results

After rolling out these patterns across our critical paths:

Metric                         Before     After
P99 API latency                4,200ms    380ms
5xx error rate                 0.8%       0.04%
Mean time to recovery          12 min     <90s
Underwriter timeout cascades   Weekly     Never

Conclusion

Resilience engineering isn't glamorous — it's the unglamorous discipline of assuming everything will fail and designing accordingly. At Flock, this mindset has let us maintain 99.97% uptime on our policy API even as our external dependency count has grown to over 30 integrations.

The patterns in this post — circuit breakers, retries with jitter, bulkheads, and observability-first design — form the foundation of our reliability story. None of them are complex in isolation. The discipline is in applying them consistently, instrumenting them thoroughly, and iterating when the data shows something isn't working.

If you're building insurance infrastructure and want to trade notes, reach out — or consider joining us.

About the author

Alex Chen

Senior Engineer

We're hiring

Want to work on problems like these?

We're building the technology that powers fleet insurance — from risk models to telemetry processing pipelines. Come build it with us.