
Real-Time DDoS Detection Pipeline

Kafka + Flink pipeline ingesting 2M+ network flow records/sec, cutting detect-to-mitigate latency from 25s to under 8s

Kafka · Apache Flink · Java · Go · gRPC · Redis · RocksDB · AWS

The Problem

The system I joined at Infosys was protecting a set of large enterprise clients from volumetric network attacks. Detection was happening, but slowly: the pipeline that ingested raw network flow records, classified traffic, and triggered mitigations had end-to-end latency in the 25-second range. At volumetric attack scales, 25 seconds is a long time. Client SLAs required mitigations to be in place within 8 seconds of detection.

There were two separate problems. First, the detection pipeline itself wasn't built for the throughput we were seeing: 2M+ network flow records per second during attack peaks, with bursty spikes significantly higher. The existing architecture processed records in micro-batches with too much I/O on the hot path. Second, the mitigation system was manual — engineers received alerts and pushed BGP and ACL changes by hand, which added human latency and room for error.

The work was to fix both: rebuild the detection pipeline for real throughput, and build an orchestration service that automated the mitigation path with appropriate safeguards.

Architecture

Detection pipeline: Kafka + stateful Flink. Network flow records are ingested from collection agents onto Kafka topics partitioned by source IP prefix. A Flink streaming job reads from these topics and maintains stateful per-IP-prefix windows using RocksDB-backed state. For each window, the job computes adaptive baselines using EWMA and runs heavy-hitter detection using Count-Min Sketch, a probabilistic data structure that gives approximate frequency counts with bounded memory and a tunable error bound.

When a prefix's traffic crosses an anomaly threshold relative to its baseline, the Flink job emits a detection event downstream. The window sizes are tunable per-client based on their attack history and traffic profile.
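
To make the shape of the Flink job concrete, here is a minimal sketch of the per-prefix detection function, assuming the stream has already been keyed by source IP prefix: RocksDB-backed keyed state holds the running window count and the EWMA baseline, a processing-time timer closes each window, and a detection event is emitted when the closed window sits well above baseline. Class names, the window size, the smoothing factor, and the threshold multiplier are illustrative placeholders, not the production values.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** Per-prefix detection: EWMA baseline over fixed windows, relative threshold. */
public class PrefixAnomalyDetector extends KeyedProcessFunction<String, FlowRecord, DetectionEvent> {

    private static final long WINDOW_MS = 5_000;        // illustrative window size
    private static final double ALPHA = 0.2;            // EWMA smoothing factor (placeholder)
    private static final double THRESHOLD_FACTOR = 4.0; // fire when window > 4x baseline (placeholder)

    // RocksDB-backed keyed state: one slot per source IP prefix.
    private transient ValueState<Long> windowBytes;   // bytes accumulated in the open window
    private transient ValueState<Double> baseline;    // EWMA of per-window byte counts
    private transient ValueState<Long> windowEnd;     // end timestamp of the open window, if any

    @Override
    public void open(Configuration parameters) {
        windowBytes = getRuntimeContext().getState(new ValueStateDescriptor<>("windowBytes", Long.class));
        baseline = getRuntimeContext().getState(new ValueStateDescriptor<>("baseline", Double.class));
        windowEnd = getRuntimeContext().getState(new ValueStateDescriptor<>("windowEnd", Long.class));
    }

    @Override
    public void processElement(FlowRecord record, Context ctx, Collector<DetectionEvent> out) throws Exception {
        long soFar = windowBytes.value() == null ? 0L : windowBytes.value();
        windowBytes.update(soFar + record.bytes);

        // One processing-time timer per window; the timer closes the window and runs the check.
        if (windowEnd.value() == null) {
            long end = ctx.timerService().currentProcessingTime() + WINDOW_MS;
            ctx.timerService().registerProcessingTimeTimer(end);
            windowEnd.update(end);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<DetectionEvent> out) throws Exception {
        double observed = windowBytes.value() == null ? 0.0 : windowBytes.value();
        Double expected = baseline.value();

        // Detection fires when the closed window sits well above the adaptive baseline.
        if (expected != null && expected > 0 && observed > THRESHOLD_FACTOR * expected) {
            out.collect(new DetectionEvent(ctx.getCurrentKey(), observed, expected, timestamp));
        }

        // EWMA update: baseline_new = alpha * observed + (1 - alpha) * baseline_old.
        baseline.update(expected == null ? observed : ALPHA * observed + (1 - ALPHA) * expected);
        windowBytes.clear();
        windowEnd.clear();
    }
}

/** Minimal stand-ins for the real record types; the job keys by FlowRecord.sourcePrefix upstream. */
class FlowRecord {
    String sourcePrefix;
    long bytes;
}

class DetectionEvent {
    final String prefix;
    final double observedBytes;
    final double expectedBytes;
    final long windowEndMs;

    DetectionEvent(String prefix, double observedBytes, double expectedBytes, long windowEndMs) {
        this.prefix = prefix;
        this.observedBytes = observedBytes;
        this.expectedBytes = expectedBytes;
        this.windowEndMs = windowEndMs;
    }
}
```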

Mitigation orchestration: Go + Java over gRPC. Detection events flow to a mitigation orchestration service that translates them into network-level actions: BGP Flowspec rules (for traffic shaping and black-holing at upstream providers), RTBH (Remotely Triggered Black Hole) routes, and ACL updates on edge devices. The service is implemented as a gRPC server with client implementations in both Go (for the control plane) and Java (for integration with existing internal tooling).
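
The RPC contract itself isn't reproduced here. As a rough sketch of its shape, this is the kind of surface the orchestration service exposes, written as a plain Java interface; in the real system this is a .proto service definition with generated Go and Java stubs, and every name below is hypothetical.

```java
import java.time.Instant;
import java.util.List;

/**
 * Hypothetical contract for the mitigation orchestration service.
 * In the real system this would be a .proto service compiled to Go and Java
 * gRPC stubs; the shape here is illustrative only.
 */
public interface MitigationOrchestrator {

    /** Kinds of network-level actions the orchestrator can dispatch. */
    enum ActionType { BGP_FLOWSPEC, BGP_COMMUNITY, RTBH, ACL_UPDATE }

    /** A detection event as emitted by the Flink pipeline. */
    record DetectionEvent(String prefix, double observedBps, double baselineBps, Instant detectedAt) {}

    /** One concrete mitigation action plus its pre-computed rollback handle. */
    record MitigationAction(String id, ActionType type, String target, String rollbackToken) {}

    /** Plan the staged response for a detection; does not touch the network yet. */
    List<MitigationAction> plan(DetectionEvent event);

    /** Dispatch an action; high-impact actions may be held pending operator approval. */
    void dispatch(MitigationAction action);

    /** Roll back a previously dispatched action using its pre-computed rollback path. */
    void rollback(String rollbackToken);
}
```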

Mitigations are staged: narrowly scoped BGP community actions first, then RTBH if the attack persists, then ACLs as a last resort. Each stage requires an audit log entry. High-impact mitigations (full RTBH on a large prefix) require operator approval before execution. Every mitigation is reversible: rollback paths are pre-computed at dispatch time and can be triggered with a single API call.
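
A hedged sketch of how that escalation ladder and approval gate can be expressed in code. The stage ordering follows the description above, but the high-impact rule (RTBH on anything broader than a /24) and all names are assumptions rather than the production policy.

```java
import java.util.Optional;

/**
 * Illustrative staged-escalation policy: scoped BGP community actions first,
 * RTBH if the attack persists, ACL updates as a last resort, with an operator
 * approval gate on high-impact RTBH. Stage names and the high-impact rule are
 * assumptions, not the production policy.
 */
public final class EscalationPolicy {

    public enum Stage { BGP_COMMUNITY, RTBH, ACL }

    public record Decision(Stage stage, boolean requiresOperatorApproval) {}

    /**
     * Pick the next stage for a prefix. Escalation only proceeds while the
     * attack is still observed; the caller writes the audit log entry before
     * dispatching the decided action.
     */
    public Optional<Decision> next(Optional<Stage> current, String prefix, boolean attackPersists) {
        if (current.isEmpty()) {
            return Optional.of(new Decision(Stage.BGP_COMMUNITY, false));
        }
        if (!attackPersists) {
            return Optional.empty(); // traffic back to baseline: stop escalating
        }
        return switch (current.get()) {
            case BGP_COMMUNITY -> Optional.of(new Decision(Stage.RTBH, isHighImpact(prefix)));
            case RTBH -> Optional.of(new Decision(Stage.ACL, false));
            case ACL -> Optional.empty(); // already at the last resort
        };
    }

    /** Assumed rule: full RTBH on anything broader than a /24 needs approval. */
    private static boolean isHighImpact(String cidr) {
        int slash = cidr.indexOf('/');
        int length = slash < 0 ? 32 : Integer.parseInt(cidr.substring(slash + 1));
        return length < 24;
    }
}
```

Keeping the escalation decision separate from dispatch keeps the approval gate and the audit trail in one place.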

Observability: the full pipeline is instrumented with Prometheus metrics, Grafana dashboards, OpenTelemetry traces, and Splunk log aggregation. SLO dashboards track detect-to-mitigate latency percentiles, false-positive rate, and pipeline availability. We ran regular chaos drills (killing Flink task managers, Kafka partition leaders, and orchestration service nodes) to validate that recovery paths worked and that latency SLOs held under component failures.
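
As an illustration of the SLO instrumentation, this is roughly what the Prometheus side can look like in Java. The metric names, labels, and bucket boundaries are assumptions; the buckets are placed around the 8-second SLA so a dashboard can read the in-SLO fraction straight off the histogram.

```java
import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;

/**
 * Illustrative Prometheus instrumentation for the SLO dashboards: a latency
 * histogram for detect-to-mitigate and a counter for dispatched mitigations.
 * Metric names, labels, and bucket boundaries are placeholders.
 */
public final class PipelineMetrics {

    // Buckets chosen around the 8-second SLA so the dashboard can show the
    // fraction of mitigations landing inside it.
    static final Histogram DETECT_TO_MITIGATE_SECONDS = Histogram.build()
            .name("detect_to_mitigate_seconds")
            .help("Latency from detection event to mitigation dispatch")
            .buckets(1, 2, 4, 6, 8, 12, 16, 25)
            .register();

    static final Counter MITIGATIONS_TOTAL = Counter.build()
            .name("mitigations_dispatched_total")
            .help("Mitigations dispatched, by stage and whether approval was required")
            .labelNames("stage", "approved_by_operator")
            .register();

    /** Record one completed detect-to-mitigate cycle. */
    public static void record(long detectedAtMs, long mitigatedAtMs, String stage, boolean operatorApproved) {
        DETECT_TO_MITIGATE_SECONDS.observe((mitigatedAtMs - detectedAtMs) / 1000.0);
        MITIGATIONS_TOTAL.labels(stage, Boolean.toString(operatorApproved)).inc();
    }
}
```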

Key Decisions

Count-Min Sketch for heavy-hitter detection. A naive approach would maintain exact per-IP counters, but at 2M+ records/sec with millions of distinct source IPs during attacks, exact counting needs memory that grows with the number of distinct sources. Count-Min Sketch gives frequency estimates with bounded memory and a configurable error bound. Because the sketch only ever over-estimates, its error shows up as a few borderline prefixes crossing the threshold, which threshold tuning absorbs; it never causes a true heavy hitter to be missed.
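
For reference, a compact Java sketch of the data structure. The sizing and hashing are illustrative rather than the production implementation, but the key property is visible: the estimate is the minimum over several hashed rows, so it can over-count but never under-count.

```java
import java.util.Random;

/**
 * Minimal Count-Min Sketch: d hash rows of w counters. Estimates are upper
 * bounds on the true count, with error at most epsilon * total (probability
 * 1 - delta). Sizing and hashing here are illustrative only.
 */
public final class CountMinSketch {

    private final int width;        // w = ceil(e / epsilon)
    private final int depth;        // d = ceil(ln(1 / delta))
    private final long[][] counts;
    private final int[] seeds;      // one hash seed per row

    public CountMinSketch(double epsilon, double delta) {
        this.width = (int) Math.ceil(Math.E / epsilon);
        this.depth = (int) Math.ceil(Math.log(1.0 / delta));
        this.counts = new long[depth][width];
        this.seeds = new Random(42).ints(depth).toArray();
    }

    /** Add `amount` (e.g. bytes or packets) for a key such as a source prefix. */
    public void add(String key, long amount) {
        for (int row = 0; row < depth; row++) {
            counts[row][bucket(key, row)] += amount;
        }
    }

    /** Estimated count: the minimum across rows, which never under-counts. */
    public long estimate(String key) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row][bucket(key, row)]);
        }
        return min;
    }

    private int bucket(String key, int row) {
        int h = key.hashCode() ^ seeds[row];
        h ^= (h >>> 16); // spread the bits before mapping into [0, width)
        return Math.floorMod(h, width);
    }
}
```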

EWMA baselines, not static thresholds. Static thresholds don't work when client traffic profiles have daily, weekly, and seasonal patterns. EWMA-based baselines adapt to normal traffic variation, which means detection fires on anomalies relative to expected traffic rather than absolute volume. This substantially reduced false positive rates compared to the previous static-threshold approach.
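
Isolated from the Flink plumbing, the baseline arithmetic is small; the sketch below uses a relative multiplier as the anomaly test. Alpha and the multiplier shown are placeholders, not the tuned values.

```java
/**
 * Illustrative EWMA baseline for one traffic key: the baseline adapts to
 * normal variation, and anomalies are judged relative to it rather than
 * against a fixed absolute rate. Alpha and the multiplier are placeholders.
 */
public final class EwmaBaseline {

    private final double alpha;      // smoothing factor: higher = adapts faster
    private final double multiplier; // fire when observed > multiplier * baseline
    private double baseline = -1;    // negative means "not initialized yet"

    public EwmaBaseline(double alpha, double multiplier) {
        this.alpha = alpha;
        this.multiplier = multiplier;
    }

    /** True if this window's rate is anomalous relative to the current baseline. */
    public boolean isAnomalous(double observedRate) {
        return baseline > 0 && observedRate > multiplier * baseline;
    }

    /** EWMA update: baseline_new = alpha * observed + (1 - alpha) * baseline_old. */
    public void update(double observedRate) {
        baseline = baseline < 0 ? observedRate : alpha * observedRate + (1 - alpha) * baseline;
    }

    public double baseline() {
        return baseline;
    }
}
```

Because the baseline keeps moving with observed traffic, a prefix whose evening peak is ten times its overnight rate does not trip the detector the way it would under a single static threshold.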

Staged mitigations with operator approval for high-impact actions. Full RTBH on a large IP prefix can affect legitimate traffic — the blast radius is real. Requiring operator approval for those actions adds latency to the worst-case path, but the tradeoff was worth it. The automation handles 90%+ of mitigations without human involvement; the approval gate applies only to the small fraction where the risk of collateral damage is high.

Results

  • 2M+ network flow records/sec ingested and classified in steady state; sustained through attack-peak burst traffic
  • Detect-to-mitigate latency cut from 25s to under 8s — well within client SLA requirements
  • 90%+ of approved mitigations automated end-to-end, eliminating manual BGP/ACL changes for the common case
  • 40% reduction in compute and memory footprint from Count-Min Sketch + EWMA vs. the previous exact-count approach
  • 99.97% pipeline availability sustained over the measurement period, validated through regular chaos drills
  • Mean time to mitigation reduced from 11s to 5s after operator tooling improvements and automation

What I'd Do Differently

The RocksDB state backend worked well but required careful tuning — compaction settings, block cache sizes, write buffer configuration — that wasn't obvious upfront and had meaningful performance impact. I'd budget more time for state backend performance testing before production deployment.
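
For context, this is where those knobs live when the job uses Flink's EmbeddedRocksDBStateBackend (Flink 1.13+ class names; the options-factory API and some option names shift between Flink and RocksDB versions). The values are placeholders rather than what we shipped; the point is that write buffers, block cache, and compaction settings all have to be set deliberately for a state-heavy job like this.

```java
import java.util.Collection;

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

/** Sketch of RocksDB tuning hooks for a Flink job; all values are placeholders. */
public final class TunedStateBackend {

    public static void apply(StreamExecutionEnvironment env) {
        EmbeddedRocksDBStateBackend backend = new EmbeddedRocksDBStateBackend(true); // incremental checkpoints
        backend.setRocksDBOptions(new RocksDBOptionsFactory() {

            @Override
            public DBOptions createDBOptions(DBOptions current, Collection<AutoCloseable> handlesToClose) {
                // More background threads for flush + compaction under heavy write load.
                return current.setMaxBackgroundJobs(4);
            }

            @Override
            public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions current,
                                                           Collection<AutoCloseable> handlesToClose) {
                return current
                        .setWriteBufferSize(64 * 1024 * 1024)   // memtable size before flush
                        .setMaxWriteBufferNumber(4)             // memtables allowed before writes stall
                        .setTableFormatConfig(new BlockBasedTableConfig()
                                .setBlockCacheSize(256 * 1024 * 1024)); // read cache for hot prefixes
            }
        });
        env.setStateBackend(backend);
    }
}
```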

The operator approval flow was bolted onto the orchestration service late in the project. It worked, but the UX was rough — approvals happened through a CLI that required knowing the right flags. A simple web UI for the approval workflow would have reduced operator errors and made the audit trail easier to navigate.