Platform Engineering at Scale: How We Built Flock's Developer Platform

Sarah Mitchell
Platform Engineer

Eighteen months ago, Flock's engineering team was spending roughly 40% of their time on infrastructure — provisioning environments, debugging CI pipelines, wrangling Kubernetes YAML, and waiting on manual deployment approvals. Feature velocity was suffering. Engineers were frustrated. Onboarding new hires took weeks.

Today, a new service goes from npx create-flock-service to production in under 4 hours — with monitoring, alerting, autoscaling, and security policies configured automatically. This is the story of how we got there.

The Starting Point

Before building a platform, we had to understand what was actually painful. We ran a developer experience survey and a series of friction log sessions, asking engineers to narrate everything that annoyed them while deploying a change. The top five complaints were:

  1. Environment setup — no paved path for local development
  2. Service scaffolding — everyone reinvented their own Helm chart
  3. Deployment complexity — ArgoCD was powerful but intimidating
  4. Observability gaps — each team instrumented metrics differently
  5. Secret management — passing secrets around via Slack (yikes)

These formed our initial backlog. We prioritised ruthlessly: only build things that save time for every engineer, every week.

The Platform Architecture

(Diagram: Flock's platform architecture)

The Service Scaffold: create-flock-service

The single highest-leverage investment was our service scaffold CLI. Instead of copying a service template and manually updating 47 files, engineers run one command:

bash
npx create-flock-service my-new-service \
  --type api \
  --language typescript \
  --team payments \
  --on-call sarah@flockcover.com

This generates:

  • A fully configured TypeScript project with our standard middleware stack
  • A Dockerfile with multi-stage builds and distroless runtime
  • Helm chart scaffolding pre-wired to our chart library
  • GitHub Actions workflows for CI, preview environments, and production deploys
  • Datadog monitors for latency, error rate, and saturation
  • A Backstage entity YAML for the service catalog

The key insight is that the scaffold is opinionated by default but escapable. We don't use a locked-down template — we use a template that wires into platform abstractions. If you need to change something, you can. But most of the time, you don't need to.
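To give a flavour of what "wires into platform abstractions" means, here's a minimal sketch of the kind of middleware a scaffolded service composes. The function names here are hypothetical illustrations, not the actual exports of @flock/platform-sdk:

```typescript
import http from 'node:http'
import { randomUUID } from 'node:crypto'

type Handler = (req: http.IncomingMessage, res: http.ServerResponse) => void

// Attach a request ID so logs and traces can be correlated downstream.
export function withRequestId(next: Handler): Handler {
  return (req, res) => {
    const id = (req.headers['x-request-id'] as string) ?? randomUUID()
    res.setHeader('x-request-id', id)
    next(req, res)
  }
}

// Standard health endpoint, so every service looks the same to Kubernetes probes.
export function withHealthCheck(next: Handler): Handler {
  return (req, res) => {
    if (req.url === '/healthz') {
      res.writeHead(200, { 'content-type': 'application/json' })
      res.end(JSON.stringify({ ok: true }))
      return
    }
    next(req, res)
  }
}
```

Because these are ordinary composable functions rather than a sealed framework, a team that needs different behaviour can swap a layer out without forking the whole template.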

Don't build a platform nobody asked for

We made this mistake early on. We built a beautiful self-service UI for secret management, ran a company demo, and... nobody used it. Engineers had already worked around the problem with their own scripts. Always validate demand before building. Talk to your users — they're just down the hall.

GitOps with ArgoCD: The App-of-Apps Pattern

Our deployment model is fully GitOps. No human ever runs kubectl apply in production. Every change goes through Git, gets reviewed, and ArgoCD reconciles the cluster state to match.

We use the app-of-apps pattern to manage our 120+ services without a monolithic ArgoCD application:

yaml
# platform/argocd/apps/production.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: flock-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/flockcover/platform
    targetRevision: main
    path: deployments/production
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Each service gets its own ArgoCD application definition, which in turn references our shared Helm chart library. The deployments/production/ directory contains one YAML file per service — that's the only file an engineer needs to modify for a new service deployment.
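A per-service file might look something like this. The chart path and values are illustrative, not our exact repository layout:

```yaml
# deployments/production/my-new-service.yaml — the one file an engineer edits
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-new-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/flockcover/platform
    targetRevision: main
    path: charts/service        # shared Helm chart library (illustrative path)
    helm:
      values: |
        image:
          tag: "1.4.2"
        team: payments
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Because the parent app-of-apps recurses over this directory, committing the file is all it takes for ArgoCD to pick up and deploy the new service.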

Secret Management with Vault

The Slack-secrets problem was solved with HashiCorp Vault and the External Secrets Operator. Secrets live in Vault, synced automatically to Kubernetes secrets. Engineers never touch raw credentials:

typescript
// No more process.env.MY_SECRET_PLEASE_DM_SOMEONE_FOR_THIS
// Secrets are mounted automatically at pod start via ESO
 
import { createVaultClient } from '@flock/platform-sdk'
 
const vault = createVaultClient({
  role: process.env.VAULT_ROLE!, // injected by platform, not a secret itself
  mount: 'secret',
  path: `services/${process.env.SERVICE_NAME}`,
})
 
export async function getPaymentApiKey(): Promise<string> {
  const { data } = await vault.read('payment-api-key')
  return data.value
}

yaml
# k8s/external-secret.yaml — generated by scaffold, not hand-written
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: my-new-service-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: my-new-service-secrets
    creationPolicy: Owner
  data:
    - secretKey: PAYMENT_API_KEY
      remoteRef:
        key: services/my-new-service
        property: payment-api-key
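The synced Kubernetes Secret is then exposed to the pod as environment variables. A sketch of the Deployment wiring, which in practice our shared chart library generates rather than anyone writing it by hand:

```yaml
# Sketch: consuming the synced secret in the Deployment spec
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-new-service
spec:
  template:
    spec:
      containers:
        - name: my-new-service
          envFrom:
            - secretRef:
                name: my-new-service-secrets  # created by the ExternalSecret
```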

Developer Environments with Tilt

Local development used to mean either running everything locally (memory-hungry) or pointing at a shared staging environment (broken by colleagues). We solved this with Tilt + our own dev cluster:

python
# Tiltfile
load('ext://namespace', 'namespace_create')
 
# Each engineer gets their own namespace
namespace_create('dev-sarah')
 
# Hot reload for TypeScript services
local_resource(
  'my-service-watch',
  serve_cmd='npm run dev',
  serve_dir='services/my-service',
  deps=['services/my-service/src'],
  labels=['dev'],
)
 
# Other services run as lightweight stubs
k8s_yaml(blob("""
apiVersion: v1
kind: ConfigMap
metadata:
  name: dev-overrides
data:
  UNDERWRITER_API_URL: "http://underwriter-stub:3000"
"""))
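The stubs themselves can be trivially small. Here's a minimal sketch of what something like underwriter-stub could look like; the endpoints and response shapes are hypothetical:

```typescript
import http from 'node:http'

// Canned responses keyed by path. Kept as a pure function so it is easy to test.
export function stubResponse(path: string): { status: number; body: unknown } {
  if (path === '/health') return { status: 200, body: { ok: true } }
  // Default: echo back that this is a stub, so misrouted calls are obvious in logs.
  return { status: 200, body: { stub: true, path } }
}

export function createStubServer(): http.Server {
  return http.createServer((req, res) => {
    const { status, body } = stubResponse(req.url ?? '/')
    res.writeHead(status, { 'content-type': 'application/json' })
    res.end(JSON.stringify(body))
  })
}

// In the dev cluster: createStubServer().listen(3000)
```

A stub like this uses a few megabytes of memory, so an engineer can run one real service under hot reload while a dozen stubbed dependencies idle in the same namespace.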

Measuring Platform Success

The platform team runs on two metrics: DORA metrics and developer NPS.

After 12 months:

DORA metric           | Before     | After
Deployment frequency  | 2.1/day    | 18.4/day
Lead time for changes | 3.2 days   | 2.1 hours
Change failure rate   | 8.3%       | 1.2%
MTTR                  | 47 minutes | 8 minutes

Developer NPS went from +12 to +61.

DORA metrics aren't the goal

It's tempting to optimise DORA metrics directly — but they're a proxy for engineering effectiveness, not the thing itself. We use them as leading indicators and regularly sanity-check that the underlying quality of life for engineers matches what the numbers suggest. Run regular friction log sessions. The metrics won't tell you that onboarding still takes 3 days.

What's Next

Our current roadmap focuses on three areas:

1. Preview environments — right now, PRs deploy to a shared staging namespace. We're moving to per-PR ephemeral environments using Argo Workflows to spin them up and down.

2. Chaos engineering — we've been running Netflix's Chaos Monkey manually on an ad-hoc basis. We're integrating LitmusChaos into our CI pipeline so every service gets a basic resilience test before it ships.

3. AI-assisted runbooks — we're experimenting with connecting Datadog alerts to an LLM-powered runbook assistant that can suggest remediation steps based on the alert context and the service's historical incident data.

Watch: How Spotify Builds Internal Platforms

This talk from Spotify's engineering team deeply influenced how we apply product thinking to platform teams.

Conclusion

Building a developer platform is fundamentally a product problem, not an infrastructure problem. The best technology choices are worthless if engineers don't adopt them. Our success came from:

  1. Listening obsessively — friction logs over surveys
  2. Shipping early — a rough CLI beats a polished roadmap
  3. Measuring relentlessly — DORA metrics every sprint
  4. Staying humble — we killed three features that nobody used

The platform is never done. As Flock grows, the team's needs will change, and the platform will need to evolve with them. That's not a problem — that's the job.

If platform engineering is the kind of work that excites you, we're hiring.

About the author

Sarah Mitchell

Platform Engineer

We're hiring

Want to work on problems like these?

We're building the technology that powers fleet insurance — from risk models to telemetry processing pipelines. Come build it with us.