Context
NetOpsCloud manages configuration backups and compliance checks for 8,000 enterprise network devices across Cisco, Juniper, and Arista environments. Today, engineers run ad hoc Python scripts from laptops to pull configs over SSH, parse command output, and upload files to S3. The scripts are hard to read, difficult to test, and frequently fail during retries.
You need to design a production-grade data pipeline that turns network automation scripts into a readable, testable, and observable batch ETL system. The pipeline should collect device state, normalize vendor-specific outputs, store raw and curated records, and support safe re-runs without duplicate data.
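One common way to get safe re-runs without duplicate data is to derive the storage key from the device and the scheduled run time rather than from the wall-clock time of the attempt. The sketch below illustrates that idea only; the key layout, the function names `raw_snapshot_key` and `write_snapshot`, and the in-memory `dict` standing in for S3 are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def raw_snapshot_key(device_id: str, scheduled_at: datetime) -> str:
    """Build a deterministic key from the device and the scheduled run time.

    scheduled_at must be timezone-aware. The key layout is hypothetical.
    """
    utc = scheduled_at.astimezone(timezone.utc)
    day = utc.strftime("%Y-%m-%d")
    run = utc.strftime("%Y%m%dT%H%M%SZ")
    return f"raw/dt={day}/device={device_id}/run={run}.txt"


def write_snapshot(store: dict, device_id: str, scheduled_at: datetime, payload: str) -> str:
    """Write a snapshot; a retry of the same run overwrites the same key."""
    key = raw_snapshot_key(device_id, scheduled_at)
    store[key] = payload  # in-memory stand-in for an object-store put
    return key


# A retry or duplicate trigger for the same scheduled run lands on one key.
store: dict = {}
run_time = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
first_key = write_snapshot(store, "edge-01", run_time, "attempt 1 output")
retry_key = write_snapshot(store, "edge-01", run_time, "attempt 2 output")
```

Because the key depends only on the device and the scheduled time, a backfill for a past window would target the same keys as the original run instead of creating new objects.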
Scale Requirements
- Devices: 8,000 active devices, growing to 20,000 in 12 months
- Collection frequency: Every 15 minutes for critical devices, hourly for standard devices
- Payload size: 200 KB to 2 MB of raw text per device snapshot
- Daily volume: ~1.5-3 TB raw command output
- Latency target: Snapshots available in curated warehouse tables within 10 minutes of the scheduled run
- Retention: 90 days raw snapshots, 2 years normalized inventory/compliance history
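The figures above can be sanity-checked with a small back-of-the-envelope calculation. The source does not say what share of devices is critical, so `critical_fraction` below is an assumption, as are the decimal unit definitions and the function names.

```python
# Decimal units are an assumption; the source does not say KB vs KiB.
MB = 1_000_000
TB = 1_000_000_000_000

CRITICAL_RUNS_PER_DAY = 24 * 4  # every 15 minutes
STANDARD_RUNS_PER_DAY = 24      # hourly


def snapshots_per_day(devices: int, critical_fraction: float) -> float:
    """Daily snapshot count for a fleet with the given share of critical devices."""
    critical = devices * critical_fraction
    standard = devices - critical
    return critical * CRITICAL_RUNS_PER_DAY + standard * STANDARD_RUNS_PER_DAY


def daily_raw_bytes(devices: int, critical_fraction: float, payload_bytes: int) -> float:
    """Daily raw volume assuming every snapshot has the same payload size."""
    return snapshots_per_day(devices, critical_fraction) * payload_bytes


# Hypothetical scenarios: today's fleet fully on the 15-minute schedule,
# and the 12-month target with three quarters of devices critical.
today_tb = daily_raw_bytes(8_000, 1.0, 2 * MB) / TB
target_tb = daily_raw_bytes(20_000, 0.75, 2 * MB) / TB
```

Under these assumptions the first scenario comes to about 1.5 TB per day and the second to about 3.1 TB per day, which suggests the stated daily volume corresponds to payloads near the 2 MB upper bound and a mostly critical fleet. A smaller critical share or smaller payloads would give a much lower figure.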
Requirements
- Design an orchestrated batch pipeline to collect configs and operational state from devices over SSH/API.
- Make the extraction and parsing code modular, readable, and unit-testable across vendors.
- Ensure idempotent re-runs for partial failures, duplicate scheduler triggers, and backfills.
- Store both raw command output and normalized tables for interfaces, routes, software versions, and compliance findings.
- Add automated data quality checks for missing devices, parse failures, schema drift, and stale snapshots.
- Describe CI/CD, test strategy, and how you would separate pure parsing logic from side effects such as network calls and storage writes.
- Provide monitoring, alerting, and recovery procedures for device timeouts, parser regressions, and warehouse load failures.
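The requirement to separate pure parsing logic from side effects can be sketched as follows. This is a minimal illustration, not a prescribed design: the regular expressions, the `SoftwareVersion` record, and the `collect_and_store` wrapper are hypothetical, and the sample strings in the usage lines are simplified illustrations rather than captured device output.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SoftwareVersion:
    vendor: str
    version: str


# Hypothetical per-vendor patterns; real command output needs richer parsing.
_VERSION_PATTERNS = {
    "cisco": re.compile(r"Version\s+([^\s,]+)"),
    "juniper": re.compile(r"Junos:\s+(\S+)"),
    "arista": re.compile(r"Software image version:\s+(\S+)"),
}


def parse_version(vendor: str, raw_text: str) -> Optional[SoftwareVersion]:
    """Pure function: text in, record out. No network or storage access."""
    pattern = _VERSION_PATTERNS.get(vendor)
    if pattern is None:
        raise ValueError(f"unsupported vendor: {vendor}")
    match = pattern.search(raw_text)
    return SoftwareVersion(vendor, match.group(1)) if match else None


def collect_and_store(
    device_id: str,
    vendor: str,
    fetch: Callable[[str], str],
    save: Callable[[str, SoftwareVersion], None],
) -> bool:
    """Thin shell: side effects are injected, so tests can pass in fakes."""
    parsed = parse_version(vendor, fetch(device_id))
    if parsed is None:
        return False  # a parse failure the caller can count and alert on
    save(device_id, parsed)
    return True


# Usage with fakes in place of SSH and warehouse clients.
saved = {}
ok = collect_and_store(
    "edge-01",
    "arista",
    fetch=lambda _device: "Software image version: 4.28.3M",
    save=lambda device, record: saved.update({device: record}),
)
```

Keeping `parse_version` free of I/O means parser regressions can be caught in CI with stored sample outputs, while the SSH and storage clients are exercised separately in integration tests.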
Constraints
- Infrastructure must run on AWS using managed services where practical.
- Security requires encrypted secrets, audit logging, and no plaintext credentials in code.
- Budget allows moderate batch compute but not a 24/7 large streaming cluster.
- Some devices are rate-limited and can only be queried once per collection window.
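The once-per-window constraint on rate-limited devices can be enforced with a claim keyed on the device and the start of its collection window. The sketch below uses an in-memory set purely for illustration; the names `window_start` and `WindowGuard` are hypothetical, and a real deployment would need a durable, atomic claim (for example, a conditional write in a managed key-value store), which is an assumption and not something the source specifies.

```python
from datetime import datetime, timezone


def window_start(ts: datetime, interval_minutes: int) -> datetime:
    """Floor a timezone-aware timestamp to the start of its UTC window.

    Assumes interval_minutes divides a day evenly (e.g. 15 or 60).
    """
    utc = ts.astimezone(timezone.utc)
    minutes = utc.hour * 60 + utc.minute
    floored = minutes - minutes % interval_minutes
    return utc.replace(hour=floored // 60, minute=floored % 60, second=0, microsecond=0)


class WindowGuard:
    """Allows at most one query per device per collection window."""

    def __init__(self) -> None:
        self._claimed = set()

    def try_claim(self, device_id: str, ts: datetime, interval_minutes: int) -> bool:
        key = (device_id, window_start(ts, interval_minutes))
        if key in self._claimed:
            return False  # already queried in this window; skip the device
        self._claimed.add(key)
        return True


guard = WindowGuard()
first = guard.try_claim("core-01", datetime(2024, 1, 1, 10, 7, tzinfo=timezone.utc), 15)
duplicate = guard.try_claim("core-01", datetime(2024, 1, 1, 10, 14, tzinfo=timezone.utc), 15)
next_window = guard.try_claim("core-01", datetime(2024, 1, 1, 10, 15, tzinfo=timezone.utc), 15)
```

With this guard in front of the collector, a duplicate scheduler trigger inside the same window is skipped instead of hitting the device a second time, and the retry path can fall back to the raw snapshot already stored for that window.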