Eighteen months ago, Flock's engineering team was spending roughly 40% of their time on infrastructure — provisioning environments, debugging CI pipelines, wrangling Kubernetes YAML, and waiting on manual deployment approvals. Feature velocity was suffering. Engineers were frustrated. Onboarding new hires took weeks.
Today, a new service goes from `npx create-flock-service` to production in under 4 hours — with monitoring, alerting, autoscaling, and security policies configured automatically. This is the story of how we got there.
## The Starting Point
Before building a platform, we had to understand what was actually painful. We ran a developer experience survey and a series of friction log sessions, asking engineers to narrate everything that annoyed them while deploying a change. The top five complaints were:
- Environment setup — no paved path for local development
- Service scaffolding — everyone reinvented their own Helm chart
- Deployment complexity — ArgoCD was powerful but intimidating
- Observability gaps — each team instrumented metrics differently
- Secret management — passing secrets around via Slack (yikes)
These formed our initial backlog. We prioritised ruthlessly: only build things that save time for every engineer, every week.
## The Platform Architecture

### The Service Scaffold: `create-flock-service`
The single highest-leverage investment was our service scaffold CLI. Instead of copying a service template and manually updating 47 files, engineers run one command:
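For example (the service name here is hypothetical):

```bash
npx create-flock-service orders-api
```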
This generates:
- A fully configured TypeScript project with our standard middleware stack
- A `Dockerfile` with multi-stage builds and distroless runtime
- Helm chart scaffolding pre-wired to our chart library
- GitHub Actions workflows for CI, preview environments, and production deploys
- Datadog monitors for latency, error rate, and saturation
- A Backstage entity YAML for the service catalog
The key insight is that the scaffold is opinionated by default but escapable. We don't use a locked-down template — we use a template that wires into platform abstractions. If you need to change something, you can. But most of the time, you don't need to.
> **Don't build a platform nobody asked for**
>
> We made this mistake early on. We built a beautiful self-service UI for secret management, ran a company demo, and... nobody used it. Engineers had already worked around the problem with their own scripts. Always validate demand before building. Talk to your users — they're just down the hall.
### GitOps with ArgoCD: The App-of-Apps Pattern
Our deployment model is fully GitOps. No human ever runs `kubectl apply` in production. Every change goes through Git, gets reviewed, and ArgoCD reconciles the cluster state to match.
We use the app-of-apps pattern to manage our 120+ services without a monolithic ArgoCD application:
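A sketch of the root Application — one ArgoCD Application whose source directory contains the per-service Application manifests. The repo URL and paths are illustrative, not our actual layout:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/flock/deployments.git
    targetRevision: main
    # Every manifest in this directory is itself an ArgoCD Application
    path: deployments/production
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```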
Each service gets its own ArgoCD application definition, which in turn references our shared Helm chart library. The `deployments/production/` directory contains one YAML file per service — that's the only file an engineer needs to modify for a new service deployment.
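A per-service file might look roughly like this — the chart repo URL, chart version, and values are invented for illustration; the shared chart library does the heavy lifting:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-api
  namespace: argocd
spec:
  project: services
  source:
    repoURL: https://charts.flock.example   # shared Helm chart repo (assumed)
    chart: flock-service
    targetRevision: 1.4.0
    helm:
      valuesObject:
        image:
          tag: "sha-abc123"
        replicas: 3
  destination:
    server: https://kubernetes.default.svc
    namespace: orders
```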
### Secret Management with Vault
The Slack-secrets problem was solved with HashiCorp Vault and the External Secrets Operator. Secrets live in Vault, synced automatically to Kubernetes secrets. Engineers never touch raw credentials:
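A sketch of how this looks in practice: an `ExternalSecret` tells the External Secrets Operator to read from Vault and materialise an ordinary Kubernetes Secret. The store name and Vault paths here are illustrative:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: orders-api-db
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend        # ClusterSecretStore pointing at Vault (assumed name)
    kind: ClusterSecretStore
  target:
    name: orders-api-db        # the Kubernetes Secret that gets created
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: services/orders-api
        property: database_url
```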
### Developer Environments with Tilt
Local development used to mean either running everything locally (memory-hungry) or pointing at a shared staging environment (broken by colleagues). We solved this with Tilt + our own dev cluster:
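A minimal `Tiltfile` sketch (Starlark) for this setup — image name, chart path, and the dev-cluster context are assumptions:

```python
# Tilt only touches contexts it's explicitly allowed to (assumed context name)
allow_k8s_contexts('flock-dev')

# Rebuild the image on source changes and deploy via the service's Helm chart
docker_build('registry.flock.example/orders-api', '.')
k8s_yaml(helm('./chart', name='orders-api', values=['./dev-values.yaml']))

# Forward the service port so it's reachable on localhost
k8s_resource('orders-api', port_forwards=8080)
```

With this, `tilt up` gives each engineer a live-reloading copy of their service on the shared dev cluster without running the whole stack locally.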
## Measuring Platform Success
The platform team runs on two metrics: DORA metrics and developer NPS.
After 12 months:
| DORA Metric | Before | After |
|---|---|---|
| Deployment frequency | 2.1/day | 18.4/day |
| Lead time for changes | 3.2 days | 2.1 hours |
| Change failure rate | 8.3% | 1.2% |
| MTTR | 47 minutes | 8 minutes |
Developer NPS went from +12 to +61.
> **DORA metrics aren't the goal**
>
> It's tempting to optimise DORA metrics directly — but they're a proxy for engineering effectiveness, not the thing itself. We use them as leading indicators and regularly sanity-check that the underlying quality of life for engineers matches what the numbers suggest. Run regular friction log sessions. The metrics won't tell you that onboarding still takes 3 days.
## What's Next
Our current roadmap focuses on three areas:
1. Preview environments — right now, PRs deploy to a shared staging namespace. We're moving to per-PR ephemeral environments using Argo Workflows to spin them up and down.
2. Chaos engineering — we've been running Netflix's Chaos Monkey manually on an ad-hoc basis. We're integrating LitmusChaos into our CI pipeline so every service gets a basic resilience test before it ships.
3. AI-assisted runbooks — we're experimenting with connecting Datadog alerts to an LLM-powered runbook assistant that can suggest remediation steps based on the alert context and the service's historical incident data.
### Watch: How Spotify Builds Internal Platforms

This talk from Spotify's engineering team deeply influenced how we apply product thinking to platform teams.
## Conclusion
Building a developer platform is fundamentally a product problem, not an infrastructure problem. The best technology choices are worthless if engineers don't adopt them. Our success came from:
- Listening obsessively — friction logs over surveys
- Shipping early — a rough CLI beats a polished roadmap
- Measuring relentlessly — DORA metrics every sprint
- Staying humble — we killed three features that nobody used
The platform is never done. As Flock grows, the team's needs will change, and the platform will need to evolve with them. That's not a problem — that's the job.
If platform engineering is the kind of work that excites you, we're hiring.