From Telemetry to aFNOL: How We Built Automated Crash Detection

Flock insures commercial vehicle fleets. Every vehicle on a Flock policy has a telematics device that streams telemetry (accelerometer readings, GPS positions, speed) back to our platform. When that device detects a sudden impact, it writes a crash file to S3.

The problem: most of those crash files aren't real crashes. A pothole, a speed bump, a door slam: all of them produce g-force signatures that the device dutifully reports. For every genuine collision, there are dozens of false positives. But when a real crash happens, the business needs to know immediately. In insurance, the first notification of loss (FNOL) triggers the entire claims workflow: adjusters are assigned, liability is assessed, and repair networks are mobilised.

We needed a system that could take raw crash telemetry, separate real collisions from noise, verify the result, and broadcast a structured automated FNOL (aFNOL) event, all with minimal human intervention. This post is the story of how we built it.

System Overview

The pipeline has three stages: classification, verification, and broadcast. Each stage is decoupled from the next via queues, so they can fail, retry, and scale independently.


Classification. An ECS Fargate service polls the ML queue, fetches the crash file from S3, engineers 42 features from the telemetry, and runs an XGBoost model. Every prediction is archived to Firehose. High-confidence genuine crashes are forwarded to the agents queue.
Verification. A Lambda function picks up the message, runs a two-agent swarm on Bedrock (Crash Analyst → Judge), cross-references the prediction against telemetry quality and historical claims data in our datalake, and writes the enriched result to a second Firehose stream.
Broadcast. Agent results land in S3, trigger an EventBridge rule, and a small Lambda function publishes the verified crash event for downstream consumers: claims handlers, notification systems, dashboards.

The dual-Firehose design is deliberate. Every ML prediction is archived, including false positives. This gives us a full audit trail and a growing dataset for retraining. Only verified genuine crashes make it downstream.

The ML Model

What a crash file looks like

When a telematics device detects a sudden impact, it writes a crash file to S3. That file contains two streams: a high-frequency accelerometer trace sampled at approximately 100 Hz, and a GPS trace with speed, position, and fix quality at each point, as well as some device metadata.

The accelerometer stream is the signal. Hundreds of readings per second, each one an X/Y/Z measurement in raw g-units.

The first thing the feature extractor does is find crash time zero: the timestamp of maximum computed G-force magnitude across the entire trace. Everything else — all 42 features — is computed relative to that instant. A ±5-second window for G-force features, ±10 seconds for speed. The model never sees raw waveforms. It sees a compressed description of what happened around the peak.
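As a rough illustration of the crash-time-zero step, the sketch below finds the peak-magnitude sample and cuts a window around it. The tuple layout `(timestamp, x, y, z)` and the function names are assumptions made for this example, not the production extractor.

```python
import math

def find_crash_time_zero(samples):
    """Return (timestamp, magnitude) of the sample with the largest G-force magnitude.

    `samples` is a list of (timestamp_seconds, x, y, z) tuples in g-units.
    This layout is an assumption for illustration only.
    """
    def magnitude(sample):
        _, x, y, z = sample
        return math.sqrt(x * x + y * y + z * z)

    peak = max(samples, key=magnitude)
    return peak[0], magnitude(peak)

def window(samples, t0, half_width_s):
    """Keep only samples whose timestamp is within +/- half_width_s of t0."""
    return [s for s in samples if abs(s[0] - t0) <= half_width_s]

# Tiny synthetic trace: quiet, a spike at t=12.0, then quiet again.
trace = [
    (0.0, 0.0, 0.0, 1.0),
    (6.0, 0.1, 0.0, 1.0),
    (12.0, 3.0, 4.0, 0.0),
    (16.0, 0.0, 0.1, 1.0),
    (30.0, 0.0, 0.0, 1.0),
]
t0, peak_g = find_crash_time_zero(trace)
narrow = window(trace, t0, 5.0)   # the +/- 5 s window used for G-force features
wide = window(trace, t0, 10.0)    # the same helper at +/- 10 s, as used for speed
```

In a real crash file the wider window would be applied to the GPS trace rather than the accelerometer trace; the same helper is reused here only to keep the example short.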

Engineering 42 features from a crash file

Forty-two features sounds like a lot. In practice they fall into six groups, each one capturing a different aspect of the event.

G-force magnitude and shape. Peak G across each axis and as a vector magnitude, whether the peak qualifies as a spike or an extreme reading, how many samples crossed 3g, and how long the high-G period was sustained. These are the most direct signal: a real crash typically produces a large, brief, concentrated G-force event.
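A minimal sketch of how the shape features in this group could be derived from a list of magnitudes. The 3g threshold and the roughly 100 Hz sample rate come from the text above; the function name, the dictionary keys, and the "longest consecutive run" definition of sustained high-G are assumptions.

```python
def high_g_features(magnitudes, sample_rate_hz=100.0, threshold_g=3.0):
    """Summarise the size and shape of the high-G part of a trace."""
    over = [m > threshold_g for m in magnitudes]

    # Longest consecutive run of samples above the threshold.
    longest = run = 0
    for flag in over:
        run = run + 1 if flag else 0
        longest = max(longest, run)

    return {
        "peak_g": max(magnitudes),
        "samples_over_threshold": sum(over),
        "sustained_high_g_s": longest / sample_rate_hz,
    }

shape = high_g_features([1.0, 3.5, 4.2, 3.1, 1.2, 3.3, 1.0])
```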

Speed and energy. Average speed before and after the crash window, the delta between them, deceleration rate, and whether the vehicle came to a stop. A vehicle doing 60 mph that drops to 0 in two seconds is a very different event from one that barely slows down. energy_loss_rate captures the rate at which kinetic energy dissipates.
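The sketch below shows one plausible way to compute this group. Only the name energy_loss_rate appears in the post; the other keys, the stop threshold, and the per-unit-mass kinetic energy formula (½·v² divided by the time over which the speed changed) are assumptions. Speeds are in metres per second.

```python
from statistics import mean

def speed_energy_features(speeds_before, speeds_after, duration_s, stop_speed=0.5):
    """Compare average speed before and after the crash window."""
    v_before = mean(speeds_before)
    v_after = mean(speeds_after)
    return {
        "avg_speed_before": v_before,
        "avg_speed_after": v_after,
        "speed_delta": v_after - v_before,
        "deceleration_rate": (v_before - v_after) / duration_s,
        "came_to_stop": v_after <= stop_speed,
        # Kinetic energy per unit mass is 0.5 * v^2; this is its loss per second.
        "energy_loss_rate": 0.5 * (v_before ** 2 - v_after ** 2) / duration_s,
    }

hard_stop = speed_energy_features([20.0, 20.0], [0.0, 0.0], duration_s=2.0)
glancing = speed_energy_features([20.0, 20.0], [19.0, 19.0], duration_s=2.0)
```

The two calls mirror the contrast in the text: a vehicle that stops dead sheds energy roughly ten times faster than one that barely slows down.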

Temporal. Hour of day, day of week, whether it's night, whether it's a weekend. Real crashes follow temporal patterns, and severity may correlate with time of day.

Data quality. This is the group that's easy to skip past and shouldn't be. data_quality_score is an integer from 0 to 4, counting how many of four conditions are met: accelerometer data is present near the crash peak, the speed stream is present, the accelerometer stream is not sparse, and the speed stream is not sparse. Files with poor coverage are a meaningful signal in themselves: a device with loose wiring that produces a false trigger often produces degraded telemetry too.
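To make the scoring concrete, here is a hedged sketch of a 0 to 4 quality score. The four checks follow the description above, but the specific cut-offs (`near_s`, `min_accel`, `min_speed`) are invented for illustration and are not the production values.

```python
def data_quality_score(accel_times, speed_times, t0,
                       near_s=1.0, min_accel=100, min_speed=5):
    """Count how many of four coverage conditions hold (0 to 4)."""
    checks = [
        any(abs(t - t0) <= near_s for t in accel_times),  # accelerometer data near the peak
        len(speed_times) > 0,                             # speed stream present
        len(accel_times) >= min_accel,                    # accelerometer stream not sparse
        len(speed_times) >= min_speed,                    # speed stream not sparse
    ]
    return sum(checks)

# 100 Hz accelerometer trace covering 0-20 s, 1 Hz speed trace.
accel = [i / 100.0 for i in range(2000)]
speed = [float(i) for i in range(20)]

good = data_quality_score(accel, speed, t0=10.0)
no_speed = data_quality_score(accel, [], t0=10.0)
```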

Composite heuristics. Two rule-based features, packaged as binary numerics so the model can use them like any other. crash_signature fires when peak G exceeds 2g, there's a significant speed drop, and the vehicle stopped afterwards — a textbook crash pattern. high_g_false_trigger_signal fires when there's a big G spike but the vehicle continues at speed with no meaningful velocity change — a pothole, a speed bump, a driver who hit a kerb.

These two features are the most deliberate ML-engineering choice in the whole system. They encode operational domain knowledge — things we already know about what crashes look like versus what they don't — directly into the feature space.
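The two composite features can be sketched as plain boolean rules cast to integers. The 2g threshold for crash_signature is from the text; reusing it for the false-trigger rule, and every speed cut-off below, are assumptions chosen only to make the example run.

```python
def crash_signature(peak_g, speed_drop, stopped, min_speed_drop=5.0):
    """1 when a large G peak coincides with a big speed drop and a stop."""
    return int(peak_g > 2.0 and speed_drop >= min_speed_drop and stopped)

def high_g_false_trigger_signal(peak_g, speed_drop, speed_after,
                                max_speed_drop=1.5, min_speed_after=5.0):
    """1 when a large G peak occurs but the vehicle carries on at speed."""
    return int(
        peak_g > 2.0
        and speed_drop < max_speed_drop
        and speed_after >= min_speed_after
    )

# A collision: 4g peak, 25 m/s lost, vehicle stationary afterwards.
collision = crash_signature(peak_g=4.0, speed_drop=25.0, stopped=True)
# A pothole: same 4g peak, but the vehicle keeps going at 24 m/s.
pothole = high_g_false_trigger_signal(peak_g=4.0, speed_drop=1.0, speed_after=24.0)
```

Because both outputs are 0 or 1, they slot into the feature matrix alongside the continuous features without any special handling.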

Training

The model is XGBoost, a gradient-boosted tree ensemble. The choice is deliberate: tabular data with engineered features, where interpretability and training speed matter more than representation learning.

Real crashes are rare. In our dataset, roughly one in forty events that look like a crash actually is one. Handling that imbalance without distorting the feature distribution was a design constraint. We use scale_pos_weight, a parameter that tells XGBoost to up-weight the minority class in the loss function, set to the ratio of negatives to positives in the training set (around 40). No synthetic oversampling, no SMOTE. SMOTE generates minority-class samples by interpolating between existing ones, which can introduce artefacts when the decision boundary is irregular. scale_pos_weight achieves a comparable rebalancing effect purely through loss weighting, without adding synthetic rows.
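The weight itself is a one-line calculation. The sketch below derives it from synthetic labels with a one-in-forty positive rate and places it in a parameter dictionary of the kind XGBoost accepts; the label data and the rest of the dictionary are illustrative, not our training configuration.

```python
from collections import Counter

def scale_pos_weight(labels):
    """Ratio of negative to positive examples, as XGBoost expects."""
    counts = Counter(labels)
    return counts[0] / counts[1]

# Synthetic labels: 25 genuine crashes in 1,000 events (one in forty).
labels = [1] * 25 + [0] * 975
weight = scale_pos_weight(labels)

# `objective` and `scale_pos_weight` are real XGBoost parameters;
# any other settings would come from tuning.
xgb_params = {"objective": "binary:logistic", "scale_pos_weight": weight}
```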

Training runs as 5-fold stratified cross-validation over an 80% development set, with a 20% hold-out kept strictly separate for final evaluation. The stratification ensures each fold contains a proportional representation of genuine crashes — important when confirmed positive examples are scarce.
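The split can be illustrated with a dependency-free sketch. In practice a library helper such as scikit-learn's StratifiedKFold would normally do this job; the post does not say which tooling was used, so the functions below are a stand-in that shows the idea on synthetic labels.

```python
import random

def stratified_split(labels, holdout_fraction=0.2, seed=0):
    """Split indices into (development, hold-out), preserving class ratios."""
    rng = random.Random(seed)
    dev, holdout = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * holdout_fraction)
        holdout += idx[:cut]
        dev += idx[cut:]
    return dev, holdout

def stratified_folds(indices, labels, k=5, seed=0):
    """Deal each class round-robin into k folds so every fold gets its share."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels[i] for i in indices)):
        idx = [i for i in indices if labels[i] == cls]
        rng.shuffle(idx)
        for n, i in enumerate(idx):
            folds[n % k].append(i)
    return folds

labels = [1] * 25 + [0] * 975          # synthetic, one genuine crash in forty
dev, holdout = stratified_split(labels)
folds = stratified_folds(dev, labels)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

With 20 positives left in the development set, each of the five folds receives exactly four, which is the property stratification is there to guarantee.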

The precision decision

The model outputs a probability. Turning that into a routing decision requires a threshold. We set ours at 0.80.

That's a deliberate choice to optimise for precision. Every event that crosses the threshold gets forwarded to the agent swarm, which invokes Claude on Bedrock, so every false positive that gets through costs LLM inference. The ML model's job is to be a reliable filter, not an exhaustive detector.

The consequence is that recall is lower than it could be. Events below 0.80 are archived to Firehose and dropped from the live workflow. Some genuine crashes will be in that bin. We accept that trade-off. The alternative (a lower threshold, higher recall, and more false positives flooding the agents) creates a different and worse problem.

The production worker also reads a separate routing threshold from an environment variable, independent of the model's decision boundary. This lets us tune the forwarding behaviour without retraining.
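A minimal sketch of that separation, assuming a hypothetical variable name `ROUTING_THRESHOLD` (the real name is not given in this post). The model decides whether an event is genuine; the environment decides how confident it must be before it is forwarded.

```python
import os

def routing_threshold(env=os.environ, name="ROUTING_THRESHOLD", default=0.80):
    """Read the forwarding threshold from the environment, with validation."""
    value = float(env.get(name, default))
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return value

def should_forward(is_genuine, confidence, threshold):
    """Forward only genuine predictions at or above the routing threshold."""
    return is_genuine and confidence >= threshold

default_threshold = routing_threshold(env={})
strict_threshold = routing_threshold(env={"ROUTING_THRESHOLD": "0.9"})
```

Changing the variable and restarting the worker alters what gets forwarded, while the model artefact stays untouched.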

Performance

Cross-validation over five folds gives the most honest picture, because the test set positive count is small enough that test-set metrics should be read with caution.

Precision is high — at the 0.80 threshold, the events the model forwards are overwhelmingly genuine. Recall is lower, which is a deliberate consequence of the precision-first design. ROC AUC is strong, indicating the model separates real crashes from noise well across the full range of thresholds.

The standard deviations across folds are wide. That's a function of having a limited number of confirmed positives spread across five folds. The model is doing well with limited signal, but the variance reflects the reality of rare-event classification.

Deploying the ML Service

The XGBoost model runs as a long-lived ECS Fargate service. Not a Lambda, not a batch job, but a continuously-running queue worker that polls for crash files, processes them, and publishes results.

Container architecture

The production image is deliberately minimal. We use a multi-stage Docker build: a slim Python base image, uv as the package manager, and only the inference code copied into the final image. Training dependencies (matplotlib, SMOTE, the full training pipeline) are excluded. The only ML dependency that matters at runtime is XGBoost, which needs libgomp1 for OpenMP support.

python

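# The core loop in sqs_worker.py (simplified)
import json
import boto3

sqs, s3, firehose = boto3.client("sqs"), boto3.client("s3"), boto3.client("firehose")

while True:
    # Long-poll the input queue, up to 10 messages per call
    messages = sqs.receive_message(
        QueueUrl=INPUT_QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    for message in messages.get("Messages", []):
        # Assumption: the message body is JSON carrying the crash file's S3 location
        location = json.loads(message["Body"])
        crash_data = s3.get_object(Bucket=location["bucket"], Key=location["key"])["Body"].read()
        features = extract_features(crash_data)
        prediction = model.predict(features)
        # Assumption: the prediction object exposes to_dict() for serialisation
        payload = json.dumps(prediction.to_dict())

        # Every prediction goes to Firehose (audit trail)
        firehose.put_record(DeliveryStreamName=ML_FIREHOSE, Record={"Data": payload.encode()})

        # Only high-confidence genuine crashes go to agents
        if prediction.is_genuine and prediction.confidence >= THRESHOLD:
            sqs.send_message(QueueUrl=AGENTS_QUEUE_URL, MessageBody=payload)

        # Delete only after both outputs succeed, so a failure leads to a retry
        sqs.delete_message(QueueUrl=INPUT_QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])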

Right-sizing

The task definition is intentionally small. The workload is I/O-bound: most time is spent fetching crash files from S3 and publishing results to queues and Firehose. Feature extraction is single-threaded Python; XGBoost inference on 42 features is near-instant. We measured utilisation before committing to this sizing, and it leaves comfortable headroom. If queue depth grows, horizontal scaling (more tasks) is the right lever, not bigger tasks.

Dual output

This is the design choice worth highlighting. The ML service writes to two destinations:

Firehose (every prediction). Archived to S3, GZIP-compressed, buffered in short intervals. This is the audit trail. It captures false positives, low-confidence predictions, and everything else the model sees. It's also the dataset we'll use to retrain the next model version.
Queue (high-confidence genuine crashes only). Forwarded to the agents queue for verification. The confidence threshold differs between production and development; each value was arrived at iteratively to find the sweet spot between signal and noise.

The production threshold filters aggressively. Most predictions don't cross it. That's intentional. The ML service is a coarse filter; the agent swarm is the fine one.

The Agent Swarm

The ML model's job is to filter. The swarm's job is to verify. By the time an event reaches the agents, it has already crossed a 0.80 confidence threshold. The agents add the context the model can't see: how reliable is the telemetry in this specific file, and does this vehicle's history support the classification?

The swarm is built on the strands-agents framework and runs on AWS Bedrock using Claude Haiku. Two agents, a fixed sequence: Crash Analyst, then Judge.

The Crash Analyst has two jobs. First, it fetches the crash file from S3 and scores the data quality: how complete and consistent the 20-second window around the crash peak is. Second, it queries the datalake for this vehicle's history: its crash-to-claim conversion rate (of all past detections with a similar g-force profile, how many actually led to a claim?) and how many crash events the vehicle has triggered today. The Crash Analyst doesn't make a verdict. It compiles its findings and hands off to the Judge.

The Judge applies a simple framework. Poor data quality means the telemetry is too unreliable to act on — NO_ACTION. Three or more detections from the same vehicle in a single day points to a device fault rather than a genuine crash cluster — NO_ACTION. For everything else, the crash history conversion rate is the primary signal: a high rate confirms the ML prediction, a low rate contradicts it.

History takes precedence over ML confidence by design. Every event that reaches the swarm already has a model confidence above 0.80, so at this stage that score is effectively a constant: it can't differentiate between cases. What can differentiate them is whether this vehicle, with this kind of impact, has a track record of generating real claims or generating noise.
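The Judge is an LLM agent steered by a prompt, not a Python function, but its decision order can be written down as plain rules to make the precedence explicit. In the sketch below only the NO_ACTION verdict and the three-detections rule come from the text; the `CONFIRMED` label, the numeric cut-offs, and the mapping of a low conversion rate to NO_ACTION are assumptions.

```python
def judge(data_quality, detections_today, conversion_rate,
          min_quality=2, high_conversion=0.5):
    """Apply the Judge's framework in priority order and return a verdict."""
    # 1. Telemetry too unreliable to act on.
    if data_quality < min_quality:
        return "NO_ACTION"
    # 2. Repeated triggers in one day point to a device fault.
    if detections_today >= 3:
        return "NO_ACTION"
    # 3. Otherwise the vehicle's crash-to-claim history decides.
    return "CONFIRMED" if conversion_rate >= high_conversion else "NO_ACTION"

verdicts = [
    judge(data_quality=1, detections_today=1, conversion_rate=0.9),  # poor data
    judge(data_quality=4, detections_today=3, conversion_rate=0.9),  # noisy device
    judge(data_quality=4, detections_today=1, conversion_rate=0.9),  # history agrees
    judge(data_quality=4, detections_today=1, conversion_rate=0.1),  # history disagrees
]
```

Note that ML confidence does not appear as an input at all, which is the point made above: it no longer separates one case from another at this stage.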

The handoff between agents is explicit in the prompts: the Crash Analyst's system prompt specifies exactly which fields to output, and the Judge's prompt uses the same names. Swarms fail in subtle ways when field names drift between agents; treating the contract as part of the prompt rather than an emergent behaviour keeps things predictable.

After the Judge reaches a decision, it archives the full enriched event to Firehose — including NO_ACTION verdicts. False positives with reasoning attached are signal. Discarding them would mean losing the audit trail exactly where things went wrong.

Wiring It Together

All of this lives in a single Terraform module. Here's how the pieces connect.


Queues as the pressure valve

Every boundary between stages is a queue. This is the single most important infrastructure decision in the system. If the ML service produces a burst of predictions, messages queue up and the agents process at their own pace. If an agent invocation fails (Bedrock throttle, datalake query timeout), the message stays in the queue for retry. After repeated failures, it moves to a dead-letter queue for investigation. No crash event silently disappears.

The visibility timeout on both queues is tuned to exceed the processing timeout, preventing a message from becoming visible to another consumer while the current one is still processing it.

Lambda configuration

The agents Lambda runs as a Docker container, the same image used in development, pushed to a container registry on each deploy. The queue delivers messages in batches with partial batch failure reporting, so only the individual failures return to the queue for retry.
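Partial batch failure reporting relies on the handler returning the IDs of the messages that failed, in the `batchItemFailures` shape that Lambda defines for SQS event sources (the event source mapping must have `ReportBatchItemFailures` enabled). The sketch below shows that shape; `verify_crash` is a hypothetical stand-in for the swarm invocation.

```python
import json

def verify_crash(payload):
    """Hypothetical stand-in for the agent swarm; raises when processing fails."""
    if "crash_id" not in payload:
        raise ValueError("malformed message")

def handler(event, context):
    """Process an SQS batch and report only the records that failed."""
    failures = []
    for record in event["Records"]:
        try:
            verify_crash(json.loads(record["body"]))
        except Exception:
            # Only this message returns to the queue; the rest are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}

sample_event = {
    "Records": [
        {"messageId": "m-1", "body": json.dumps({"crash_id": "abc"})},
        {"messageId": "m-2", "body": json.dumps({"unexpected": True})},
    ]
}
result = handler(sample_event, context=None)
```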

The Lambda runs on private subnets with no public IP. All outbound calls (S3, datalake, Bedrock, Firehose) route through VPC endpoints or the NAT gateway.

The blocklist

One pragmatic feature: some telematics devices are installed incorrectly (bad mount angle, loose wiring) and produce constant false crash files. A single badly-fitted device can generate dozens of events per day. Rather than let these flood the pipeline and burn LLM inference costs, we maintain a blocklist of vehicle registrations as an environment variable on the Lambda. The check happens in Python before the agent swarm is invoked. Deterministic, zero cost, instant rejection. It's not elegant, but it's effective.
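A blocklist check of this kind takes only a few lines. The variable name `BLOCKED_REGISTRATIONS`, the comma-separated format, and the normalisation step are assumptions for illustration, and the registrations are made up.

```python
import os

def normalise(registration):
    """Strip spaces and upper-case so 'ab12 cde' matches 'AB12CDE'."""
    return registration.replace(" ", "").upper()

def load_blocklist(env=os.environ, name="BLOCKED_REGISTRATIONS"):
    """Parse a comma-separated list of registrations from the environment."""
    raw = env.get(name, "")
    return {normalise(item) for item in raw.split(",") if item.strip()}

def is_blocked(registration, blocklist):
    """Deterministic check, run before any agent is invoked."""
    return normalise(registration) in blocklist

blocklist = load_blocklist(env={"BLOCKED_REGISTRATIONS": "AB12 CDE, xy34zzz"})
```

Loading the set once at cold start and checking membership per message keeps the rejection path free of network calls.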

Broadcasting Verified Crashes

The final stage gets verified crash events out to the rest of the business. Firehose writes agent results to S3, not directly to EventBridge, so we need one more hop.


An EventBridge rule watches for new objects landing in the agent results prefix on S3 and routes them to a fanout queue. A small Lambda function reads the message, fetches the corresponding JSON from S3, and publishes it to an EventBridge bus that other teams subscribe to.
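The fanout Lambda's two pure steps can be sketched without touching AWS: pulling the S3 location out of the queued notification, and shaping the entry to publish. The `detail.bucket.name` and `detail.object.key` fields follow the S3 "Object Created" event format delivered through EventBridge; the `Source`, `DetailType`, and bus name values are hypothetical.

```python
import json

def s3_location(sqs_record):
    """Extract (bucket, key) from an S3 event that EventBridge routed via SQS."""
    event = json.loads(sqs_record["body"])
    detail = event["detail"]
    return detail["bucket"]["name"], detail["object"]["key"]

def build_entry(crash, bus_name):
    """Shape a verified crash as a PutEvents entry for the shared bus."""
    return {
        "EventBusName": bus_name,
        "Source": "crash-detection",    # hypothetical source name
        "DetailType": "VerifiedCrash",  # hypothetical detail type
        "Detail": json.dumps(crash),
    }

record = {"body": json.dumps({
    "detail-type": "Object Created",
    "detail": {"bucket": {"name": "agent-results"},
               "object": {"key": "2024/01/01/crash.json"}},
})}
bucket, key = s3_location(record)
entry = build_entry({"crash_id": "abc", "verdict": "CONFIRMED"}, "verified-crashes")
```

In the Lambda itself, the JSON fetched from `bucket` and `key` would be passed to `build_entry`, and the result sent with the boto3 EventBridge client's `put_events(Entries=[entry])`.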

From there, claims handlers, notification systems, and dashboards can consume verified crash events without knowing anything about the detection pipeline that produced them. The event is the contract. The rest is an implementation detail.

Evaluating the System

The dual-Firehose design isn't just an audit trail — it's the evaluation mechanism. Every ML prediction is archived to S3, timestamped and labelled with its confidence score. Over time, those predictions can be joined against confirmed claims in the datalake to compute real-world precision and recall across live traffic. Evaluation isn't a separate step bolted on after the fact; it's a consequence of how the system was wired.

That matters because ground truth arrives late in insurance. A crash happens today; the claim might be filed days later. Retrospective analysis over the Firehose archive lets us measure how the model and agents are actually performing against confirmed outcomes, not just against a held-out test set from training time.
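The retrospective join can be sketched as follows, assuming each archived prediction carries a crash ID and a confidence, and that confirmed claims can be reduced to a set of crash IDs. Both shapes are assumptions about the archive, and the data is synthetic.

```python
def live_metrics(predictions, claimed_ids, threshold=0.80):
    """Compute precision and recall of forwarded events against confirmed claims."""
    tp = fp = fn = 0
    for crash_id, confidence in predictions:
        forwarded = confidence >= threshold
        claimed = crash_id in claimed_ids
        if forwarded and claimed:
            tp += 1
        elif forwarded:
            fp += 1
        elif claimed:
            fn += 1   # a genuine crash the filter let through the floor
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return {"precision": precision, "recall": recall}

archive = [("a", 0.95), ("b", 0.90), ("c", 0.40), ("d", 0.10)]
claims = {"a", "c"}
metrics = live_metrics(archive, claims)
```

Because low-confidence predictions are archived too, the false negatives (event "c" here) are measurable, which would not be possible if only forwarded events were kept.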

What We Learned

Right-size first, scale later. Neither the ML service nor the agents Lambda is under resource pressure. We started small, measured utilisation, and confirmed the sizing was sufficient rather than over-provisioning and optimising later. If queue depth grows, horizontal scaling is the right move.

Archive everything, not just the interesting stuff. The dual-Firehose design means every ML prediction is preserved, not just genuine crashes. False positives are training data for the next model version. Low-confidence predictions help calibrate the threshold. The cost of GZIP-compressed JSON on S3 is negligible compared to the cost of losing signal.

Queues between every stage are non-negotiable. Decoupling the ML service from the agent swarm via queues means either side can fail without taking down the other. Messages retry automatically. Dead-letter queues catch persistent failures. The visibility timeout keeps a message from being handed to a second consumer while the first is still working on it. It's not exciting infrastructure, but it's the reason the system runs reliably without on-call intervention.

Blocklists are a valid engineering tool. A handful of vehicle registrations are maintained in an environment variable. It's crude. It also saved us from wasting LLM inference on devices that produce hundreds of false crash events per week due to bad installations. Sometimes the right answer is a list in an env var, not a feature flag service.

What's Next

The system works, but ML and agentic services are iterative by nature, so we're not done. Auto-scaling the ECS ML service based on queue depth would handle traffic spikes without manual intervention; it's what we'd implement if resource requirements ever spike. The ML model also produces SHAP values and uncertainty estimates that the agent swarm doesn't currently use. Surfacing those would give the agents richer signal for borderline cases.

More broadly, this pipeline is the foundation for automating the entire first notification of loss workflow. The verified crash event is a starting point. What gets built on top of it is where the real business value will compound.
