Context
NetVista, a network observability company, collects BGP update messages from routers and public collectors to power near-real-time routing analytics for enterprise customers. Today, route data is ingested as hourly flat files and processed in batch, which makes it difficult to detect route leaks, prefix hijacks, and flapping events quickly.
You need to design a data pipeline that ingests BGP protocol telemetry, validates and enriches it, and serves both real-time operational dashboards and historical analytics.
Scale Requirements
- Sources: 2,500 routers and 40 external BGP collectors
- Peak throughput: 180K BGP UPDATE messages/sec, 25K withdrawals/sec
- Message size: 0.8-2.5 KB per event after normalization
- Latency target: < 30 seconds from receipt to queryable analytics tables
- Daily volume: ~12 TB raw JSON/Avro, 4 TB compressed Parquet
- Retention: 30 days raw, 13 months aggregated route metrics
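These figures imply a roughly 2x gap between peak and average ingest rates, which drives broker and consumer sizing. A quick back-of-envelope check, assuming the midpoint of the stated 0.8-2.5 KB event size:

```python
# Back-of-envelope capacity check from the stated scale numbers.
# The average event size (midpoint of 0.8-2.5 KB) is an assumption.
PEAK_EVENTS_PER_SEC = 180_000 + 25_000      # announcements + withdrawals
AVG_EVENT_KB = (0.8 + 2.5) / 2              # midpoint assumption

peak_mb_per_sec = PEAK_EVENTS_PER_SEC * AVG_EVENT_KB / 1024
avg_mb_per_sec = 12 * 1024 * 1024 / 86_400  # 12 TB/day raw, spread evenly
raw_retention_tb = 12 * 30                  # 12 TB/day for 30 days

print(f"peak ingest:   ~{peak_mb_per_sec:.0f} MB/s")   # ~330 MB/s
print(f"daily average: ~{avg_mb_per_sec:.0f} MB/s")    # ~146 MB/s
print(f"raw 30-day footprint: {raw_retention_tb} TB")  # 360 TB
```

The peak-to-average ratio (~330 vs ~146 MB/s) suggests provisioning MSK and stream consumers for bursts well above the daily mean.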
Requirements
- Ingest BGP announcements, withdrawals, and session-state events from multiple regions with ordered processing per peer.
- Normalize protocol fields such as peer_asn, prefix, next_hop, as_path, med, local_pref, and community into a canonical schema.
- Detect duplicates, malformed prefixes, invalid ASN values, and out-of-order events.
- Produce real-time derived datasets for route changes, prefix reachability, peer instability, and AS-path changes.
- Support replay/backfill for a missed collector window without duplicating downstream records.
- Expose curated tables for analysts in a warehouse and low-latency aggregates for operational dashboards.
- Define monitoring, alerting, and failure recovery for ingestion, processing, and warehouse loads.
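For the ordered-processing-per-peer requirement, one common approach is to key every event by its peer so that Kafka's per-partition ordering guarantee applies per peer. A minimal sketch of the idea (the key format and partition count are assumptions, and the hash shown is illustrative rather than Kafka's murmur2):

```python
# Sketch: per-peer ordering via partition keys. Kafka/MSK preserves
# order only within a partition, so keying each event by its peer
# keeps per-peer order while spreading peers across partitions.
import hashlib

NUM_PARTITIONS = 96  # assumed MSK topic sizing


def partition_key(collector_id: str, peer_ip: str) -> bytes:
    # Assumed key format; any stable, unique per-peer identifier works.
    return f"{collector_id}|{peer_ip}".encode()


def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stable hash: identical keys always map to the same partition,
    # so all events for one peer land in one ordered partition.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

In practice the producer would simply set this key on each record and let the client library's default partitioner do the hashing; the point is that ordering holds per peer, not globally.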
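The normalization and validation requirements can be combined: a canonical record with the listed fields, plus checks for malformed prefixes and invalid ASNs. A sketch, where the field names come from the spec but the exact validation rules (4-byte ASN range, strict prefix parsing) are assumptions:

```python
# Illustrative canonical record and validation for normalized BGP
# UPDATE events. Field names follow the spec; validation thresholds
# are assumptions.
from __future__ import annotations
from dataclasses import dataclass, field
import ipaddress


@dataclass
class BgpUpdate:
    peer_asn: int
    prefix: str              # e.g. "203.0.113.0/24"
    next_hop: str
    as_path: list[int]
    med: int | None = None
    local_pref: int | None = None
    community: list[str] = field(default_factory=list)


def validation_errors(ev: BgpUpdate) -> list[str]:
    errors = []
    # 4-byte ASNs span 1..4294967295; 0 and larger values are invalid.
    if not (1 <= ev.peer_asn <= 4_294_967_295):
        errors.append("invalid peer_asn")
    try:
        # strict=True rejects prefixes with host bits set, e.g. 203.0.113.5/24.
        ipaddress.ip_network(ev.prefix, strict=True)
    except ValueError:
        errors.append("malformed prefix")
    if any(not (1 <= asn <= 4_294_967_295) for asn in ev.as_path):
        errors.append("invalid ASN in as_path")
    return errors
```

Records that fail validation would typically be routed to a dead-letter topic rather than dropped, so malformed input remains auditable.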
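Peer-instability detection from the derived-datasets requirement reduces to counting session-state changes per peer over a sliding window. A minimal sketch; the 300-second window and threshold of 5 are illustrative values, not figures from the spec:

```python
# Sketch: flag peer instability ("flapping") with a sliding-window
# count of session-state changes. Window and threshold are assumed.
from collections import defaultdict, deque

WINDOW_SEC = 300
FLAP_THRESHOLD = 5


class FlapDetector:
    def __init__(self):
        # peer -> timestamps of recent session-state changes
        self._events = defaultdict(deque)

    def observe(self, peer: str, ts: float) -> bool:
        """Record a session-state change; return True if the peer is flapping."""
        q = self._events[peer]
        q.append(ts)
        # Evict events that have aged out of the window.
        while q and q[0] < ts - WINDOW_SEC:
            q.popleft()
        return len(q) >= FLAP_THRESHOLD
```

In a streaming job this state would live in the framework's keyed state store rather than process memory, but the windowing logic is the same.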
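The replay/backfill requirement is usually met by assigning each event a deterministic ID so that re-ingesting a missed collector window produces identical keys and downstream loads can upsert instead of append. A sketch, where the set of identity fields is an assumption and must uniquely identify an event:

```python
# Sketch: deterministic event IDs make replays idempotent. Downstream
# loads (e.g. a Snowflake MERGE keyed on event_id) then dedupe
# replayed records automatically. The identity fields are assumptions.
import hashlib
import json


def event_id(ev: dict) -> str:
    # Hash only the fields that define event identity; ignore
    # enrichment fields that may differ between original and replay.
    identity = {k: ev[k] for k in
                ("collector_id", "peer_ip", "timestamp", "type", "prefix")}
    blob = json.dumps(identity, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()
```

With this in place, the warehouse load can use `MERGE ... ON target.event_id = source.event_id`, so replaying a window inserts only rows that were actually missed.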
Constraints
- Primary cloud is AWS; existing platform uses Amazon MSK, S3, Airflow, and Snowflake.
- Team size is 3 data engineers and 1 SRE; operational complexity should stay moderate.
- Budget cap is $35K/month incremental spend.
- Route telemetry may contain customer-identifiable IP allocations, so access controls and audit logging are required.