Mar 26, 2026 · Engineering Blog / Data · 12 min read

Your data should outlast your stack

Jacob Holland, Principal Data Engineer

1. The problem with Snowflake-as-everything

For a long time, Flock's data platform was Snowflake. Ingestion landed in Snowflake, dbt & dbt-cloud ran workloads against Snowflake, Looker sat on top of Snowflake. It worked — until it didn't.

The real issue isn't Snowflake itself. As the platform's needs grew more complex, our solutions were either hacky (dbt models pressed into service as event factories) or not fit for purpose (landing data via Fivetran is costly, and we have no control over how the data is landed). By building open-source-based solutions for each step of the data flow we:

  • minimise cost and mitigate risk by avoiding vendor lock-in
  • democratise data access for users & services (more on this below)
  • allow ourselves to use best-fit tools for present & future problems

We also had a harder architectural problem: we needed to serve multiple compute engines. Actuarial models run in Python. BI runs in Looker. Claims analytics runs ad-hoc in SQL. The telemetry pipeline runs in Databricks. Routing all of that through a single proprietary warehouse creates lock-in and query contention. We wanted open formats — tables that any engine could read without a Snowflake connector.

The answer was a lakehouse.

2. The Architecture

At a high level, data flows from sources through ingestion, gets orchestrated and transformed through a medallion layering model, and is served to consumers including BI tools, ML platforms, and AI agents.

The storage layer is S3 + Apache Iceberg, with AWS Glue Data Catalog as the metastore. Every table is an Iceberg table — open format, ACID compliant, time-travel capable, readable by Athena, Spark (Glue and Databricks), DuckDB, and Snowflake.
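
Time travel, for example, comes free with Iceberg's snapshot model, and any engine that speaks Iceberg can use it. A hedged sketch of what this looks like in Athena SQL (the table name and timestamp are illustrative, not from our actual schema):

```sql
-- Query the table as it existed at a point in time (Athena engine v3 syntax
-- for Iceberg tables; also works in Spark SQL)
SELECT *
FROM example_dataset.my_api_results
FOR TIMESTAMP AS OF TIMESTAMP '2026-03-01 00:00:00 UTC'
LIMIT 10;

-- Or pin to an exact snapshot id taken from the table's snapshot history
SELECT *
FROM example_dataset.my_api_results
FOR VERSION AS OF 1234567890123456789;
```

This is handy for debugging ("what did this table look like before last night's run?") without restoring anything.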

We chose Apache Iceberg without seriously trialling Delta Lake or Hudi — by the time we were making this decision, Iceberg had effectively won the open table format wars. Native support landed in Athena, Glue, Databricks, Snowflake, and DuckDB in quick succession; the major cloud vendors had converged on it; and the organisational tension between those who wanted Snowflake and those who wanted Databricks dissolved into a single answer: one storage format, readable by both but native to neither. The decision was less "which format should we evaluate?" and more "Iceberg is the last fighter standing in the ring", which made the choice obvious.


3. Ingestion: DLT on ECS Fargate

We use DLT (data load tool) for ingestion. DLT handles the mechanics of incremental loading, schema evolution, and state management so we don't have to. It runs as containerised tasks on ECS Fargate — no always-on servers, no Glue jobs; just ephemeral tasks, each sized to the workload it runs.

Each source gets a DLT pipeline. Postgres tables replicate incrementally via cursor-based loading or CDC. HubSpot loads via the DLT HubSpot source. External partner files drop into S3 and trigger pipelines via EventBridge. The real strength here is scalability: once a pipeline template exists for a source type, adding a new source is a configuration change, not an engineering project. A new API endpoint, a new Postgres table, a new partner file feed — each maps to a JSON config file that points the same containerised task at the new source. No new code, no new infrastructure.

A typical pipeline looks roughly like this:

python
import logging
import os
import sys
 
from dlt_pipeline.clients import DLTClient
from dlt_pipeline.config import ConfigLoader, Environment
from dlt_pipeline.services import PipelineRunner
 
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)
 
def main() -> int:
    """Main entry point for the pipeline."""
    try:
        env_str = os.getenv("ENVIRONMENT", "local")
        try:
            environment = Environment(env_str)
        except ValueError:
            logger.warning(f"Invalid ENVIRONMENT value '{env_str}', defaulting to 'local'")
            environment = Environment.LOCAL
 
        endpoint_url = os.getenv("ENDPOINT_URL")
 
        logger.info(f"Starting DLT API Pipeline (Environment: {environment.value})")
 
        config_loader = ConfigLoader(environment=environment, endpoint_url=endpoint_url)
        config = config_loader.load_from_secrets_manager()
 
        dlt_client = DLTClient(
            config=config,
            environment=environment,
            endpoint_url=endpoint_url,
        )
 
        pipeline_runner = PipelineRunner(
            config=config,
            dlt_client=dlt_client,
        )
 
        load_info = pipeline_runner.run()
 
        logger.info("Pipeline completed successfully!")
        logger.info(f"Load summary: {load_info}")
 
        return 0
 
    except KeyboardInterrupt:
        logger.warning("Pipeline interrupted by user")
        return 130
 
    except Exception as e:
        logger.error(f"Pipeline failed with error: {e}", exc_info=True)
        return 1
 
if __name__ == "__main__":
    sys.exit(main())

This pattern scales for the future: a new pipeline needs only new orchestration and configuration, which can be set up in minutes to reliably replicate a new source.

One detail worth highlighting: the pipeline respects an ENVIRONMENT variable that lets you run it locally against a test endpoint before deploying to ECS. This is something Fivetran simply doesn't offer — your connector either works in production or it doesn't. Having a proper local dev loop means pipelines can be written, tested, and iterated on like any other piece of software before they touch live data.

The honest tradeoff is upfront investment. Fivetran can have a new source ingesting data in minutes — a DLT pipeline requires writing and testing code, building container infrastructure, and wiring orchestration. For a small team with limited sources, that cost may not pay off. For us, it did: as the number of sources grew and Fivetran costs scaled with them, the control and economics shifted in DLT's favour. New connectors now cost engineering minutes up-front, not recurring licence fees per source.

4. Orchestration: Airflow on MWAA

Apache Airflow on MWAA orchestrates everything — DLT pipelines, dbt run sequences, data quality checks, Databricks jobs, and anything else that needs scheduling. MWAA gives us managed Airflow without the operational overhead of running our own cluster.

One architectural decision we've standardised on: nothing runs directly in MWAA. Every workload runs as a containerised ECS Fargate task invoked by Airflow. This keeps MWAA lightweight (no heavy Python dependencies) and lets us scale workloads independently.

python
import logging
import sys
 
from airflow.decorators import dag
from airflow.utils.dates import days_ago
from dlt_runner.operators import DLTECSRunner
 
logging.basicConfig(
    level="DEBUG",
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)
logger = logging.getLogger(__name__)
 
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
}
 
@dag(
    dag_id="dlt_api_pipeline_connections",
    default_args=default_args,
    description="Daily API ingestion to Iceberg using DLT",
    schedule_interval=None,
    start_date=days_ago(1),
    catchup=False,
    tags=["dlt", "api", "iceberg"],
)
def dlt_api_pipeline_connections():
 
    runner = DLTECSRunner(
        task_definition="dlt-api-s3-task-definition",
        container_name="dlt-api-s3",
        security_group_tags={"Product": "dlt", "Workload": "api"},
    )
 
    execute_ecs_task = runner.create_task(
        task_id="execute_dlt_api_pipeline",
        pipeline_configuration="data/dlt/api-config",
        wait_for_completion=True,
    )
 
    execute_ecs_task
 
dag = dlt_api_pipeline_connections()

The configuration files are kept as secrets and read by the DAG above. They are simple JSON files used to point the application at the correct source, authenticate appropriately, and manage write-behaviour to Iceberg.

json
{
  "pipeline_name": "example_api",
  "dataset_name": "example_dataset",
  "table_name": "my_api_results",
  "base_url": "https://example-api.com",
  "bearer_token": "<your-token-here>",
  "write_disposition": "replace",
  "primary_key": "id",
  "buffer_max_items": 100,
  "file_max_items": 5000,
  "file_max_bytes": 5000000
}
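
Under the hood, pointing the same task definition at a different config is just an ECS `RunTask` override. A hedged sketch of the payload our runner might build (the environment variable name is hypothetical — the real plumbing is inside `DLTECSRunner`):

```python
def build_container_overrides(container_name: str, config_path: str) -> dict:
    """Build the ECS RunTask `overrides` payload: one shared task definition,
    pointed at a different pipeline configuration at runtime.

    The container reads the injected variable to locate its JSON config in
    Secrets Manager; the variable name here is illustrative.
    """
    return {
        "containerOverrides": [
            {
                "name": container_name,
                "environment": [
                    {"name": "PIPELINE_CONFIGURATION", "value": config_path},
                ],
            }
        ]
    }

if __name__ == "__main__":
    print(build_container_overrides("dlt-api-s3", "data/dlt/api-config"))
```

This is what makes "a new source is a config change" literally true: no new task definition, no new image, just a different value passed at invocation time.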

5. Transformation: dbt-glue and the medallion layers

All transformation logic lives in dbt, using the dbt-glue adapter. This means our dbt models compile to Spark SQL and run against Glue, writing back to Iceberg tables on S3.

We follow a four-layer medallion model:

  • Raw (src_*): Landing zone. One table per source entity, minimal transformation, schema preserved as-is from the source.
  • Staging (stg_*): Light cleaning — column renaming, type casting, null handling. One staging model per source table.
  • Integration (int_*): Business logic joins and enrichment. Where we reconcile data from multiple sources, apply taxonomies, and build intermediate fact structures.
  • Business / Warehouse (wh_*): The serving layer. SCD Type 2 history tables, pre-aggregated metrics, and domain-specific views consumed by BI and the MCP server.

Example Staging Model

sql
{{
    config(
        enabled=true,
        tags=['api', 'staging'],
        incremental_strategy='insert_overwrite'
    )
}}
 
with source_data as (
 
    select
      id
    , upper(status) as status
    , _dlt_load_id as _loaded_at_ts
    , 'example_dataset.my_api_results' as _source
 
    from {{ source('example_dataset', 'my_api_results') }}
 
)
 
select
  id
, status
, _loaded_at_ts
, _source
, {{ dbt_utils.generate_surrogate_key([
    'id'
  , 'status'
]) }} as _hash_key
 
from source_data

Example Integration Model

The integration layer unions data from multiple staging models — in this case, two API sources that share the same event schema — into a single normalised table. This is where we reconcile naming conventions across sources and add a provider discriminator so downstream models know the origin of each record.

sql
{{
    config(
        enabled=true,
        tags=['api', 'integration'],
        incremental_strategy='insert_overwrite'
    )
}}
 
with provider_a as (
 
    select
      id
    , status
    , _loaded_at_ts
    , _source
    , _hash_key
    , 'provider_a' as provider
 
    from {{ ref('stg_example_dataset__provider_a') }}
 
),
 
provider_b as (
 
    select
      id
    , status
    , _loaded_at_ts
    , _source
    , _hash_key
    , 'provider_b' as provider
 
    from {{ ref('stg_example_dataset__provider_b') }}
 
),
 
unioned as (
 
    select * from provider_a
    union all
    select * from provider_b
 
)
 
select * from unioned

Example Warehouse Model

The warehouse layer reads from the integration model, deduplicates on the business key, and presents a clean curated table for downstream consumers. This is the only layer that Looker, Athena ad-hoc queries, and the MCP server should ever touch — never the raw or staging tables directly.

sql
{{
    config(
        enabled=true,
        tags=['api', 'warehouse'],
        incremental_strategy='insert_overwrite'
    )
}}
 
with integration as (
 
    select
      id
    , status
    , provider
    , _loaded_at_ts
    , _source
    , _hash_key
 
    from {{ ref('int_example_dataset__events') }}
 
),
 
deduplicated as (
 
    select
      *
    , row_number() over (
        partition by id
        order by _loaded_at_ts desc
    ) as _row_num
 
    from integration
 
),
 
final as (
 
    select
      id
    , status
    , provider
    , _loaded_at_ts
    , _source
    , _hash_key
 
    from deduplicated
    where _row_num = 1
 
)
 
select * from final

6. Making it queryable…

…by AI: the MCP server

This is the part that surprised us most in terms of leverage.

Once the lakehouse was in good shape — clean models, consistent schemas, domain-organised Glue databases — we built an MCP (Model Context Protocol) server on top of it, originally meant only to serve queries on the telemetry domain.

Our MCP server exposes the lakehouse to Claude via a set of typed tools: domain manifests, schema introspection, sample data, and Athena query execution. The server runs on AWS Lambda behind a streaming HTTP endpoint.


The result: an engineer can open a chat interface and ask "show me the claims with the highest incurred cost that are still open" and get a real, live answer from the datalake — no SQL client, no Athena console, no knowing which table to query.

This has changed how the data platform gets used. The MCP server now serves not just internal engineers but also non-technical stakeholders, who can run complex queries across domains and get answers about the business.

…by DuckDB

For engineers who want a local SQL client over the lakehouse, DuckDB can attach directly to the Glue Iceberg REST catalog via your existing AWS SSO session — giving you a browsable, queryable view of the entire lakehouse from a single connection, no Athena console required.
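
Assuming a recent DuckDB with the `iceberg` extension and an active SSO session, the connection looks roughly like this (the account id and alias are illustrative):

```sql
INSTALL iceberg;
LOAD iceberg;

-- Pick up credentials from the local AWS SSO session / credential chain
CREATE SECRET glue_creds (
    TYPE s3,
    PROVIDER credential_chain
);

-- Attach the Glue Iceberg REST catalog for the account (id is illustrative)
ATTACH '111122223333' AS lakehouse (TYPE iceberg, ENDPOINT_TYPE glue);

SHOW ALL TABLES;
SELECT * FROM lakehouse.example_dataset.my_api_results LIMIT 10;
```

From there every Glue database is browsable as a schema, and queries run locally against the Iceberg tables on S3.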

7. What we learned

Open formats are worth the upfront investment. Iceberg on S3 with Glue Catalog is more complex to set up than Snowflake tables. But the payoff — any compute engine, no vendor lock-in, time travel for free — compounds over time. We can now run the same query from Athena, Databricks, DuckDB, and Snowflake external tables without any data movement.

Scalability comes from configuration, not code. The pattern of config-driven tasks on ECS Fargate means that adding a new data source, a new dbt project, or a new transformation workload is a configuration change rather than an infrastructure change. The same containerised task definitions handle ingestion and transformation — behaviour is controlled entirely by what gets passed in at runtime. Decoupling compute from orchestration is a secondary benefit here: MWAA stays lightweight, workloads are right-sized per task, and there's no resource contention between a heavy dbt run and a lightweight API poll.

DLT handles the hard parts of ingestion. Schema evolution, incremental state, write disposition logic — these are all solved problems in DLT. Writing your own incremental loading logic is a maintenance trap.
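
To see why hand-rolling this is a trap, here's a deliberately simplified sketch of the cursor bookkeeping DLT does per resource (ours lives inside DLT's incremental helpers, not code like this — and this toy version already ignores late-arriving rows, deletes, and concurrent runs):

```python
def select_new_rows(rows: list[dict], state: dict, cursor_field: str = "updated_at"):
    """Return rows newer than the stored cursor, advancing the cursor.

    `state` stands in for the persisted per-resource state DLT manages, so a
    rerun only loads records that arrived since the last successful load.
    """
    last = state.get(cursor_field)
    new = [r for r in rows if last is None or r[cursor_field] > last]
    if new:
        state[cursor_field] = max(r[cursor_field] for r in new)
    return new, state
```

Every edge case this sketch ignores is a bug you would otherwise ship; DLT handles them, plus schema drift and write dispositions, out of the box.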

The MCP layer unlocks non-obvious use cases. We built the MCP server primarily for internal engineering use, specific to our telemetry domain. It quickly became the obvious surface for stakeholders to query anything in the datalake, and as the datalake accumulates more data and domain knowledge, the value of the MCP layer only increases.

Standardise your ingestion pattern earlier than feels necessary. We started DLT with bespoke connectors per source — each one slightly different, each one its own maintenance surface. We eventually landed on a much better pattern: a single base Docker image extended per workload, with behaviour driven entirely by configuration passed at runtime. The result is that adding a new source is genuinely just a config change. But getting there took iterative refactoring that, in hindsight, we could have front-loaded. If you're starting from scratch, design for the standardised pattern from day one, even when you only have two sources and the bespoke approach feels simpler.
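
The base-image pattern can be sketched as a Dockerfile like this (file names and the registry path are illustrative, not our actual build):

```dockerfile
# Shared base image: common runtime for every ingestion workload
FROM python:3.12-slim AS dlt-base
WORKDIR /app

# Shared dependencies only (dlt, boto3, etc.); workload extras come later
COPY requirements-base.txt .
RUN pip install --no-cache-dir -r requirements-base.txt

# The generic pipeline entrypoint — behaviour is driven entirely by the
# configuration passed in at runtime, not by code baked into the image
COPY dlt_pipeline/ ./dlt_pipeline/
ENTRYPOINT ["python", "-m", "dlt_pipeline"]

# A per-workload image then adds only its own extras on top:
#   FROM <registry>/dlt-base:latest
#   RUN pip install --no-cache-dir "dlt[sql_database]"
```

The point is that per-source variation lives in config and, at most, a one-line dependency layer — never in a forked codebase.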

8. The real point

The most important property of this architecture isn't what it does today — it's what it lets you change tomorrow.

Snowflake could implode tomorrow (or more likely become prohibitively expensive). Databricks could decide to pivot towards selling baked goods (or ship a feature that makes it the obvious choice for half your workloads). A new query engine could emerge that none of us have heard of yet. In any of those scenarios, the architecture described here lets you respond by turning a dial rather than re-engineering your data platform.

That's the real payoff of open formats and engine-agnostic storage: your data accumulates value independently of the tools sitting on top of it. The Iceberg tables on S3 don't care whether the next query comes from Athena, Databricks, DuckDB, Snowflake, or something that doesn't exist yet. You can scale any of those up, scale them down, or swap them out — without touching the data layer.

But the more interesting question is what you build on top of that foundation. A few things we're already seeing:

Natural language is already the query interface. The MCP layer isn't a prototype — it's in daily use by engineers and non-technical stakeholders alike. The domain definitions, the Glue catalog, the medallion model — all of it exists to make the lakehouse queryable by anyone who can describe what they want. The SQL is just an implementation detail.

Domain teams are becoming data owners. We're rolling out dedicated dbt projects to teams across the business, so the data engineering team is no longer the bottleneck for new models. The platform provides the infrastructure; the domain teams provide the knowledge.

Agents are the next layer. We're trialling data quality agents in the different domains — letting agents decide if the latest data is fit for ingestion and normalisation. Letting agents do spot-checking and surfacing problems before they become problems is a massive value-add and shows promise.

If you're building a data platform today, the question worth getting right isn't which tools you pick — it's whether your architecture frees you to make best-fit tooling choices for present & future problems.

About the author

Jacob Holland

Principal Data Engineer

We're hiring

Want to work on problems like these?

We're building the technology that powers fleet insurance — from risk models to telemetry processing pipelines. Come build it with us.