Hi, I'm

Srividya Bandari

Software Engineer — Backend & Distributed Systems

Building Distributed Data Processing Pipelines at Scale

I build scalable backend systems and distributed data processing pipelines that handle millions of events a day — with a focus on throughput, correctness, and cost.

What I Build

Backend Systems

APIs, service architectures, and the data layers behind them. I care about correctness, latency, and cost — not in that order, but all three at once.

  • RESTful & event-driven APIs
  • Microservice decomposition
  • Idempotency & exactly-once semantics
  • PostgreSQL, Redis, DynamoDB

Data Infrastructure

Pipelines that move and transform data at volume, without drift, duplication, or the silent failures that only surface downstream in a dashboard.

  • Kafka-backed streaming pipelines
  • Deduplication & event sourcing
  • Batch + real-time hybrid processing
  • Observability from day one

AI + Developer Tools

Backend infrastructure for LLM-powered products — retrieval pipelines, API wrappers, evaluation loops, and the boring glue that makes them reliable.

  • RAG pipelines & vector search
  • ML-assisted decision pipelines
  • Retrieval quality & evaluation tooling
  • Developer-facing internal APIs
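
The retrieval step behind a RAG pipeline reduces to nearest-neighbor search over embedding vectors. A minimal brute-force sketch of that step, using NumPy as a stand-in for what a vector index like FAISS does at scale (the toy 3-dimensional "embeddings" are purely illustrative):

```python
import numpy as np

def top_k_cosine(query: np.ndarray, corpus: np.ndarray, k: int = 3) -> list:
    """Return indices of the k corpus rows most similar to the query vector."""
    # Normalize so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    # argsort is ascending: take the last k, reverse for descending similarity.
    return list(np.argsort(scores)[-k:][::-1])

# Toy 4-document "index" with 3-dimensional embeddings.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
print(top_k_cosine(np.array([1.0, 0.05, 0.0]), docs, k=2))  # → [0, 1]
```

A real index replaces the exhaustive scan with approximate search, but the contract is the same: embeddings in, top-k document ids out.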

About

I'm a software engineer with 4+ years of experience building distributed backend services, event-driven workflows, real-time streaming systems, and ML-assisted decision pipelines. My work spans the full production stack — from event propagation and state management to caching layers, streaming infrastructure, and the observability tooling that keeps it all running.

I currently work at Openlane, where I own core .NET Core services in a distributed inspection workflow platform processing 300K+ daily vehicle inspections and 3.5M+ events/day. I've designed Apache Pulsar-based event systems with replay-safe consumers, built ML-assisted decision pipelines that generate risk scores for rule engines and review queues, and led production rollouts using Terraform-managed canary deployments that reduced MTTD by 45%.

Before Openlane, I interned at Mayo Clinic as a Research Software Engineer, building multi-view CNN training pipelines in PyTorch for medical imaging — raising model accuracy from 87% to 95% and cutting the false-negative rate by more than half. Earlier at Infosys, I built a real-time DDoS detection pipeline on Kafka and Flink that ingested 2M+ network flow records per second and cut detect-to-mitigate latency from 25s to under 8s.

I care about boring, well-instrumented systems. Observability before optimization. Idempotency before clever retry logic. The systems I'm proudest of are the ones that don't page anyone at 3 a.m.

Languages

C# · Java · Python · Go · SQL · TypeScript

Backend

ASP.NET Core · Spring Boot · gRPC · REST APIs · Microservices · Distributed Systems

Streaming & Messaging

Apache Pulsar · Kafka · Flink · RabbitMQ · Event-driven Architecture

ML & AI

PyTorch · TensorFlow · SageMaker · FAISS · RAG Pipelines

Databases & Caching

PostgreSQL · Redis · MongoDB · SQL Server · OpenSearch

Cloud & DevOps

AWS · Azure · Docker · Kubernetes · Terraform · GitHub Actions

Observability

Honeycomb · Datadog · Grafana · Prometheus · OpenTelemetry · Splunk

Experience

Software Engineer · Openlane
July 2023 – Present  ·  Austin, TX
  • Owned core services in a distributed .NET Core inspection workflow platform processing 300K+ daily vehicle inspections and 3.5M+ validation, integration, audit, and observability events/day across inspection, inventory, VIN intelligence, and downstream systems.
  • Designed Apache Pulsar-based event propagation with replay-safe consumers, correlation IDs, schema-versioned payloads, and failure-isolated retry paths, reducing workflow latency by 35% while improving real-time state consistency.
  • Built ML-assisted inspection decision pipelines by enriching workflow events with VIN intelligence, vehicle metadata, inspection signals, and historical state-transition features, generating risk scores used by rule engines and manual-review queues to detect anomalous transitions, duplicate processing signals, and inconsistent vehicle attributes.
  • Implemented idempotent event consumers with deduplication keys, exponential backoff, dead-letter handling, and vendor-failure isolation, reducing duplicate processing risk and preventing invalid inspection state transitions during downstream degradation.
  • Optimized high-volume VIN decoding and reference-data lookups using Redis-backed caching, TTL-based invalidation, and request coalescing, improving p95 latency by 70% while reducing repeated third-party vendor calls.
  • Led production rollout with Terraform-managed feature flags, staged enablement, canary deployments, rollback paths, and Honeycomb distributed tracing, reducing MTTD by 45% and improving release safety across critical inspection workflows.
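
The idempotent-consumer pattern above (deduplication keys, exponential backoff, dead-letter handling) can be sketched in a few lines. This is a toy Python version with in-memory stand-ins for the Pulsar subscription and the shared dedup-key store; class name, retry settings, and message shape are illustrative, not the production code:

```python
import time

class IdempotentConsumer:
    """Toy idempotent consumer: a dedup-key store, bounded retries with
    exponential backoff, and a dead-letter queue for poison messages.
    In-memory stand-ins for what would be a broker subscription plus a
    shared key store (e.g. Redis) in a real deployment."""

    def __init__(self, handler, max_retries=3, base_delay=0.01):
        self.handler = handler
        self.seen = set()          # dedup keys already processed
        self.dead_letters = []     # messages that exhausted their retries
        self.max_retries = max_retries
        self.base_delay = base_delay

    def consume(self, message: dict) -> str:
        key = message["dedup_key"]
        if key in self.seen:
            return "duplicate"      # already handled: ack with no side effects
        for attempt in range(self.max_retries):
            try:
                self.handler(message)
                self.seen.add(key)  # mark done only after a successful handle
                return "processed"
            except Exception:
                time.sleep(self.base_delay * 2 ** attempt)  # exponential backoff
        self.dead_letters.append(message)   # isolate the failure, keep consuming
        return "dead-lettered"
```

The key ordering choice: the dedup key is recorded only after the handler succeeds, so a crash mid-handle leads to a retry rather than a dropped message, and a redelivery after success is acknowledged without re-running side effects.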

Research Software Engineer Intern · Mayo Clinic
Jan 2022 – July 2022  ·  Phoenix, AZ
  • Designed a modular multi-view CNN training and inference pipeline in PyTorch, fusing image embeddings with structured metadata features to support reproducible model development across medical-imaging datasets.
  • Improved model performance through class-imbalance handling, loss-function tuning, threshold calibration, and 1K+ hard-case error analysis, raising accuracy from 87% to 95% while reducing false-negative rate from 13% to 5%.
  • Added PyTorch hooks, tensor-level tracing, and Datadog dashboards for feature extraction, inference latency, and failure diagnostics, reducing MTTR from 2 days to 4 hours and increasing trace-event coverage from 40% to 100%.

Software Engineer · Infosys
Jan 2020 – Aug 2021  ·  Hyderabad, India
  • Built a real-time DDoS detection pipeline using Kafka, stateful Flink, and Java on Kubernetes, ingesting 2M+ network flow records/sec and cutting detect-to-mitigate latency from 25s to under 8s.
  • Implemented a Go/Java mitigation orchestration service over gRPC to push staged BGP Flowspec, RTBH, and ACL actions with audit logging, operator approval, and rollback safeguards, automating 90%+ of approved mitigations.
  • Developed adaptive baselining and heavy-hitter detection with Count-Min Sketch, EWMA, RocksDB-backed Flink state, and Redis caching, reducing compute and memory footprint by 40% with no loss in precision or recall.
  • Drove observability and reliability with Prometheus, Grafana, OpenTelemetry, Splunk, SLO dashboards, and chaos drills, achieving 99.97% availability and reducing mean time to mitigation from 11s to 5s.
  • Implemented secure multi-tenant rollout workflows on AWS using EKS, S3, CloudWatch, IAM, KMS, Terraform, Argo Rollouts, and feature flags, enabling tenant-level isolation, staged releases, and safer rollback across client environments.
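
The heavy-hitter detection above leans on Count-Min Sketch, which estimates per-key counts in sub-linear space and never undercounts. A minimal Python sketch of the data structure; the hashing scheme and the width/depth sizing here are illustrative, not the production configuration:

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: approximate counts over a large key space.
    Estimates can only overcount (hash collisions), never undercount;
    width and depth trade memory for error bounds."""

    def __init__(self, width: int = 2048, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Derive one independent-ish hash per row by salting with the row number.
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Every row overestimates under collisions; the minimum is tightest.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```

For flow records, the item would be a source IP or flow key; any key whose estimate crosses an adaptive baseline (e.g. an EWMA of its historical rate) is flagged as a heavy hitter.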

Let's build something together.

I'm open to backend engineering roles, infrastructure work, and interesting distributed systems problems. If you've got something worth building, I'd love to hear about it.