
Vehicle Inspection Workflow Platform

Distributed .NET Core platform processing 300K+ daily inspections and 3.5M+ events/day at Openlane

.NET Core · Apache Pulsar · Redis · PostgreSQL · Terraform · Honeycomb

The Problem

Openlane runs one of the largest automotive remarketing platforms in North America. Every vehicle moving through the marketplace — listed for inspection, validated, routed for auction — is tracked by a workflow system that coordinates state across inspection, inventory, VIN intelligence, and a constellation of downstream services.

By the time I took ownership of core services in this platform, the numbers had grown to a point where the system's weak spots were visible: 300K+ vehicle inspections a day, 3.5M+ events flowing through validation, integration, audit, and observability pipelines. Synchronous processing steps that were fine at lower volume were now contention points. Event consumers weren't replay-safe. The deployment process involved too much manual coordination for a system this critical.

The job was to make the platform reliable, observable, and deployable without ceremony — and to add ML-assisted signal to inspection decisions that had previously been purely rule-based.

Architecture

The system is a set of .NET Core services communicating over Apache Pulsar. The design rests on a few principles that shaped everything else.

Event propagation with replay safety. Every event carries a correlation ID, a schema-versioned payload, and enough metadata to replay it correctly if a consumer restarts mid-processing. Consumers checkpoint progress explicitly rather than relying on broker-side acknowledgment state. Failure-isolated retry paths mean one broken consumer doesn't block the others.
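A minimal sketch of that discipline, with simplified stand-ins for the production types (`WorkflowEvent`, `IEventSource`, `ICheckpointStore`, and `IEventHandler` below are illustrative, not the actual API):

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Illustrative event shape: correlation ID, schema version, and a dedup key
// give a restarted consumer enough context to replay safely.
public sealed record WorkflowEvent(
    string CorrelationId, int SchemaVersion, string DedupKey, string Payload);

public interface IEventSource
{
    // Reads from an explicit position rather than broker-tracked ack state.
    IAsyncEnumerable<(long Position, WorkflowEvent Event)> ReadFromAsync(
        long from, CancellationToken ct);
}

public interface ICheckpointStore
{
    Task<long> LoadAsync(CancellationToken ct);
    Task SaveAsync(long position, CancellationToken ct);
}

public interface IEventHandler
{
    Task HandleAsync(WorkflowEvent evt, CancellationToken ct);
}

public sealed class ReplaySafeConsumer
{
    private readonly ICheckpointStore _checkpoints;
    private readonly IEventHandler _handler;

    public ReplaySafeConsumer(ICheckpointStore checkpoints, IEventHandler handler)
        => (_checkpoints, _handler) = (checkpoints, handler);

    public async Task ConsumeAsync(IEventSource source, CancellationToken ct)
    {
        // Resume from the last durable checkpoint; a crash mid-batch means
        // the unfinished events are simply replayed.
        var resumeFrom = await _checkpoints.LoadAsync(ct);

        await foreach (var (position, evt) in source.ReadFromAsync(resumeFrom, ct))
        {
            await _handler.HandleAsync(evt, ct);        // must be idempotent
            await _checkpoints.SaveAsync(position, ct); // checkpoint only after success
        }
    }
}
```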

ML-assisted decision pipelines. The most interesting addition was bringing ML signal into inspection decisions. Workflow events are enriched with VIN intelligence (vehicle specs, recall history, known defect patterns), historical state-transition data, and real-time inspection signals. An ML model scores each vehicle for anomalous transitions, duplicate processing signals, and inconsistent attributes. High-risk vehicles are routed to manual-review queues; low-risk vehicles proceed through automated paths. This replaced a fragile set of hard-coded rules that required engineering involvement to update.
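In spirit, the routing step looks something like the sketch below. The model interface, queue names, and threshold are all illustrative, and `WorkflowEvent` is the record from the previous sketch:

```csharp
using System.Threading.Tasks;

public sealed record VehicleFeatures; // specs, recall history, transition stats, ...

public interface IRiskModel { Task<double> ScoreAsync(VehicleFeatures features); }
public interface IVinIntelligence { Task<VehicleFeatures> EnrichAsync(WorkflowEvent evt); }
public interface IQueuePublisher { Task PublishAsync(string topic, WorkflowEvent evt, double score); }

public sealed class InspectionRouter
{
    private const double ReviewThreshold = 0.8; // illustrative; tuned in practice

    private readonly IRiskModel _model;
    private readonly IVinIntelligence _vin;
    private readonly IQueuePublisher _queues;

    public InspectionRouter(IRiskModel model, IVinIntelligence vin, IQueuePublisher queues)
        => (_model, _vin, _queues) = (model, vin, queues);

    public async Task RouteAsync(WorkflowEvent evt)
    {
        // Enrich with VIN intelligence and historical signals before scoring.
        var features = await _vin.EnrichAsync(evt);
        var score = await _model.ScoreAsync(features); // 0.0 routine .. 1.0 anomalous

        // The model never approves or rejects; it only picks the path.
        var topic = score >= ReviewThreshold
            ? "workflow/review/manual"   // humans decide high-risk cases
            : "workflow/auto/continue";  // automated path for routine vehicles

        await _queues.PublishAsync(topic, evt, score);
    }
}
```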

Idempotent consumers with failure isolation. Every consumer is idempotent: events carry deduplication keys, processing is transactional where it needs to be, and dead-letter handling is explicit. Vendor failures are isolated at the consumer level — a degraded third-party integration doesn't cascade into invalid inspection state transitions.
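One way to implement the dedup check, sketched against the platform's PostgreSQL (the `processed_events` table and class names here are hypothetical, not the production schema):

```csharp
using Npgsql;
using System.Threading.Tasks;

public sealed class IdempotentHandler
{
    private readonly NpgsqlDataSource _db;
    public IdempotentHandler(NpgsqlDataSource db) => _db = db;

    public async Task<bool> TryHandleAsync(WorkflowEvent evt)
    {
        await using var conn = await _db.OpenConnectionAsync();
        await using var tx = await conn.BeginTransactionAsync();

        // INSERT ... ON CONFLICT DO NOTHING claims the dedup key atomically.
        // Zero rows inserted means this event was already processed.
        await using var claim = new NpgsqlCommand(
            "INSERT INTO processed_events (dedup_key) VALUES (@k) ON CONFLICT DO NOTHING",
            conn, tx);
        claim.Parameters.AddWithValue("k", evt.DedupKey);
        if (await claim.ExecuteNonQueryAsync() == 0)
            return false; // duplicate delivery: acknowledge and move on

        // ... apply the state transition inside the same transaction ...

        await tx.CommitAsync(); // dedup claim and side effects commit together
        return true;
    }
}
```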

Terraform-managed canary rollouts. New logic ships behind Terraform-managed feature flags. Traffic is promoted through stages (1% → 10% → 50% → 100%) with Honeycomb-based health checks at each gate. If error rates or latency deviate from baseline during a stage, rollback is automatic.
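The gate logic, reduced to its skeleton. Both interfaces below are stand-ins: flag state actually changes through a Terraform plan/apply, and the health check wraps Honeycomb queries:

```csharp
using System;
using System.Threading.Tasks;

public interface IFlagStore { Task SetTrafficAsync(string flag, int percent); }
public interface IHealthCheck { Task<bool> WithinBaselineAsync(string flag); }

public sealed class CanaryPromoter
{
    private static readonly int[] Stages = { 1, 10, 50, 100 }; // % of traffic

    private readonly IFlagStore _flags;
    private readonly IHealthCheck _health;

    public CanaryPromoter(IFlagStore flags, IHealthCheck health)
        => (_flags, _health) = (flags, health);

    public async Task<bool> PromoteAsync(string flag, TimeSpan soak)
    {
        foreach (var percent in Stages)
        {
            await _flags.SetTrafficAsync(flag, percent);
            await Task.Delay(soak); // let each stage accumulate real traffic

            // Compare error rate and latency against the pre-rollout baseline.
            if (!await _health.WithinBaselineAsync(flag))
            {
                await _flags.SetTrafficAsync(flag, 0); // automatic rollback
                return false;
            }
        }
        return true; // flag at 100%
    }
}
```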

Key Decisions

Apache Pulsar over Kafka. Pulsar was already Openlane's standard messaging layer, which made the choice straightforward operationally. That said, Pulsar's multi-tenancy model and built-in tiered storage are genuinely useful at this scale: topics can be namespaced by workflow domain, and event history is retained without managing separate archival infrastructure.

ML for anomaly detection, not classification. The ML layer doesn't try to approve or reject inspections — it generates risk scores that feed rule engines and review queues. Keeping humans in the loop for high-risk decisions made the system easier to audit, easier to tune, and easier to get organizational buy-in for.

Idempotency as a first-class constraint. At 3.5M events/day, the question isn't whether duplicates will happen — Pulsar redelivers, consumers crash, deploys reprocess. The question is whether your system handles them correctly. Every consumer was designed to be idempotent from the start, not retrofitted.

Results

  • 300K+ daily vehicle inspections processed reliably with no SLA misses
  • 3.5M+ events/day flowing through validation, integration, audit, and observability pipelines
  • 35% reduction in workflow latency after moving synchronous steps to Pulsar-backed async processing
  • 70% improvement in VIN decode p95 via Redis caching with TTL-based invalidation and request coalescing (sketched below the list)
  • 45% reduction in MTTD after rebuilding observability around Honeycomb distributed tracing and structured events
  • ML-assisted risk scoring routed anomalous vehicles to review queues, replacing a brittle set of hand-maintained rules
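The VIN decode caching pattern, roughly, assuming StackExchange.Redis; the 24-hour TTL, class name, and `decode` delegate are illustrative:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using StackExchange.Redis;

public sealed class VinDecodeCache
{
    private static readonly TimeSpan Ttl = TimeSpan.FromHours(24); // illustrative

    private readonly IDatabase _redis;
    private readonly Func<string, Task<string>> _decode; // upstream VIN decode call
    private readonly ConcurrentDictionary<string, Lazy<Task<string>>> _inFlight = new();

    public VinDecodeCache(IDatabase redis, Func<string, Task<string>> decode)
        => (_redis, _decode) = (redis, decode);

    public async Task<string> GetAsync(string vin)
    {
        var cached = await _redis.StringGetAsync(vin);
        if (cached.HasValue) return cached.ToString();

        // Coalesce: all concurrent misses for this VIN share one upstream call.
        var lazy = _inFlight.GetOrAdd(vin,
            k => new Lazy<Task<string>>(() => DecodeAndCacheAsync(k)));
        try { return await lazy.Value; }
        finally { _inFlight.TryRemove(vin, out _); }
    }

    private async Task<string> DecodeAndCacheAsync(string vin)
    {
        var decoded = await _decode(vin);
        await _redis.StringSetAsync(vin, decoded, Ttl); // TTL-based invalidation
        return decoded;
    }
}
```

The `Lazy<Task>` wrapper is what does the coalescing: concurrent misses for the same VIN all await a single upstream decode instead of stampeding the decoder.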

What I'd Do Differently

The ML feature pipeline was built incrementally, which meant the feature engineering logic ended up scattered across multiple services. A centralized feature store — even a lightweight one — would have made it easier to add new signals and to reproduce training data for model retraining.

On the Pulsar topology: I'd design the partition counts and subscription structure upfront based on projected throughput. We ended up resizing partitions reactively, which required coordination across teams and a maintenance window.