Context
You’re on the Data Platform team at PayWave, a fintech payments processor that settles card and ACH transactions for ~120K merchants. PayWave runs a mixed workload of batch and streaming data pipelines that power:
- Fraud detection (near-real-time features; regulatory and loss exposure)
- Disputes/chargebacks (daily reporting; strict reconciliation)
- Finance & revenue recognition (SOX controls; auditability)
The platform operates ~1,000 compute nodes (Kubernetes worker nodes plus a smaller pool of VM-based Spark executors) across 3 AWS regions. These nodes run Kafka consumers, Spark Structured Streaming jobs, Airflow task workers, dbt jobs, and custom Python services. Over the last quarter, the team has experienced repeated incidents where “the same pipeline code” behaves differently depending on which node it lands on.
Root cause analysis shows configuration drift across the fleet: different JVM versions, inconsistent Spark configs, mismatched Python dependency sets, varying environment variables, and ad-hoc hotfixes applied during incidents. Drift has caused:
- Silent data correctness issues (e.g., timezone parsing differences; see the sketch after this list) impacting settlement reports
- Increased duplicate loads into Snowflake due to inconsistent retry/idempotency settings
- Streaming lag spikes when a subset of nodes runs with lower Kafka fetch sizes
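To make the first symptom concrete, here is a minimal, illustrative PySpark sketch (not PayWave code) of how drift in spark.sql.session.timeZone silently changes results: the same input string parses to instants five hours apart, so settlement rows can land in different reporting days with no error raised.

```python
# Illustrative only: identical code and input, different results, driven
# purely by a drifted session-timezone config. Column names are made up.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2024-01-15 01:30:00",)], ["settled_at_str"])

# Node A runs the fleet default:
spark.conf.set("spark.sql.session.timeZone", "UTC")
epoch_utc = df.select(F.unix_timestamp("settled_at_str").alias("e")).first().e

# Node B was hotfixed during an incident and never reverted:
spark.conf.set("spark.sql.session.timeZone", "America/New_York")
epoch_ny = df.select(F.unix_timestamp("settled_at_str").alias("e")).first().e

# Same string, two different instants, five hours apart, and no error raised.
print(epoch_ny - epoch_utc)  # 18000 seconds
```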
Leadership asks you to design a data engineering–owned drift management system that continuously detects drift, quantifies impact, and can safely remediate or quarantine nodes without taking the platform down.
Current Architecture (as-is)
| Layer | Tech | Notes / Pain Points |
|---|---|---|
| Orchestration | Apache Airflow 2.x (EKS) | 20K DAG runs/day; workers autoscale; tasks land on any node |
| Streaming | Kafka (MSK) + Spark Structured Streaming | ~250K events/sec peak across topics |
| Batch compute | Spark on EKS + a legacy EMR cluster | Mixed runtimes; inconsistent Spark defaults |
| Transform | dbt Core | Runs in ephemeral pods; dependency drift across images |
| Warehouse | Snowflake | Raw + curated; strict reconciliation requirements |
| Observability | CloudWatch + Datadog | Metrics exist but not tied to “config state” |
Scale Requirements
- Fleet size: 1,000 nodes (800 k8s workers, 200 VM/EMR nodes)
- Pipelines: ~600 batch jobs/day, ~40 streaming jobs, ~20K Airflow task instances/day
- Throughput: 250K events/sec peak streaming ingest; ~60 TB/day raw landing
- Freshness SLAs:
- Streaming feature tables: P95 < 2 minutes end-to-end
- Batch finance models: complete by 6:00 AM local (3 regions)
- Drift detection latency: detect material drift within 10 minutes
- Remediation safety: no more than 1% of tasks/jobs can be disrupted during remediation windows
- Auditability: retain drift history and remediation actions for 1 year (SOX)
Data Characteristics (what you must manage)
You need to treat “configuration state” as first-class data; a snapshot sketch follows the list below.
Example configuration dimensions
- Runtime: JVM version, Python version, OS packages
- Spark: spark.sql.session.timeZone, spark.sql.shuffle.partitions, spark.streaming.backpressure.enabled
- Kafka consumer: max.poll.interval.ms, fetch.min.bytes, enable.auto.commit
- Airflow: worker concurrency, retry settings, task timeouts
- dbt: adapter versions, macro packages, profiles
- Secrets/env: env vars, feature flags, region-specific endpoints
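One way to make “config state as data” concrete is to capture each node’s state as an immutable snapshot carrying a content fingerprint, so identical states dedupe and drift reduces to a fingerprint mismatch. A minimal sketch, assuming a Python collection agent; the field names and workload-type vocabulary are assumptions, not an existing API:

```python
# Illustrative sketch of a per-node configuration snapshot treated as data.
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class ConfigSnapshot:
    node_id: str
    workload_type: str   # e.g. "airflow_worker", "spark_streaming", "dbt_runner"
    region: str
    collected_at: str    # ISO-8601 UTC
    dimensions: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        # Hash only the config content, not node identity or timestamp, so
        # identical states dedupe across nodes and repeated collections.
        canonical = json.dumps(self.dimensions, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

snap = ConfigSnapshot(
    node_id="ip-10-2-3-4.us-east-1",
    workload_type="spark_streaming",
    region="us-east-1",
    collected_at=datetime.now(timezone.utc).isoformat(),
    dimensions={
        "runtime": {"jvm": "17.0.9", "python": "3.11.6"},
        "spark": {"spark.sql.session.timeZone": "UTC",
                  "spark.sql.shuffle.partitions": "200"},
        "kafka": {"max.poll.interval.ms": "300000", "fetch.min.bytes": "1048576"},
    },
)
print(snap.fingerprint())
```

Because the hash covers only the dimensions, a 1,000-node fleet that is fully in spec collapses to a handful of distinct fingerprints per workload type, which keeps the append-only history cheap to store and query.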
Quality issues you must anticipate
- Partial reporting (node unreachable, agent down); see the staleness sketch after this list
- Rapid churn (autoscaling nodes, ephemeral pods)
- Non-deterministic drift (hotfix applied then reverted)
- “Acceptable divergence” (region-specific endpoints, instance-type-specific tuning)
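For the partial-reporting and churn cases above, a useful rule is to never treat silence as compliance: a node whose agent goes quiet becomes “stale” (alertable), not “in spec.” A hedged sketch, with the staleness limit tied to the 10-minute detection SLA; the names are assumptions:

```python
# Hedged sketch: evaluate nodes by last-seen snapshot age rather than
# assuming unreported nodes are compliant. Names are illustrative.
from collections import namedtuple
from datetime import datetime, timedelta, timezone

Snapshot = namedtuple("Snapshot", ["node_id", "collected_at"])
STALENESS_LIMIT = timedelta(minutes=10)  # matches the 10-minute detection SLA

def node_status(last_snapshot, now=None):
    now = now or datetime.now(timezone.utc)
    if last_snapshot is None:
        return "unknown"   # brand-new or ephemeral node that never reported
    if now - last_snapshot.collected_at > STALENESS_LIMIT:
        return "stale"     # agent down or node unreachable: alert, don't assume
    return "reported"

# Autoscaler churn: nodes that terminate while "stale" are closed out by a
# reconciliation pass against the k8s/EMR node inventory, not by the agent.
recent = Snapshot("ip-10-2-3-4", datetime.now(timezone.utc) - timedelta(minutes=2))
print(node_status(recent))  # "reported"
```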
Requirements
Functional
- Define a canonical desired state for pipeline runtimes and key configs (by workload type: Airflow worker, Spark streaming, Spark batch, dbt runner).
- Continuously collect actual state from all nodes and persist it as an append-only history.
- Detect drift and classify severity (e.g., informational vs. SLA-risk vs. correctness-risk); a classification sketch follows this list.
- Correlate drift with data symptoms (consumer lag, duplicate loads, DQ failures, query anomalies) to prioritize remediation.
- Automate remediation for safe classes of drift (e.g., rolling restart, cordon/drain node, redeploy daemonset, rebuild image) with guardrails.
- Quarantine nodes with high-risk drift so new tasks don’t schedule there (k8s taints/labels; EMR node decommission).
- Provide dashboards and audit logs: who/what changed, when, impact, and resolution.
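As referenced above, the detection-and-classification pass can be as simple as a flat diff of desired vs. actual keys with a per-key severity policy. A sketch under assumed policy; the key-to-severity mapping is an example, not a standard:

```python
# Hedged sketch of per-node drift detection + severity classification.
# The key-to-severity mapping below is illustrative policy only.
CORRECTNESS_KEYS = {"spark.sql.session.timeZone", "enable.auto.commit"}
SLA_KEYS = {"fetch.min.bytes", "max.poll.interval.ms",
            "spark.sql.shuffle.partitions"}

def detect_drift(desired: dict, actual: dict) -> list[dict]:
    """Flat key/value diff of a workload-type baseline vs one node's state."""
    events = []
    for key, want in desired.items():
        got = actual.get(key)
        if got == want:
            continue
        if key in CORRECTNESS_KEYS:
            severity = "correctness-risk"  # quarantine candidate
        elif key in SLA_KEYS:
            severity = "sla-risk"          # rolling remediation
        else:
            severity = "informational"     # record, don't act
        events.append({"key": key, "desired": want, "actual": got,
                       "severity": severity})
    return events

print(detect_drift(
    desired={"spark.sql.session.timeZone": "UTC", "fetch.min.bytes": "1048576"},
    actual={"spark.sql.session.timeZone": "America/New_York",
            "fetch.min.bytes": "1"},
))
```

Severity then selects the action path: correctness-risk drift feeds quarantine (taint/label the node so nothing new schedules there), while SLA-risk drift queues rolling remediation inside the 1% disruption budget.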
Non-functional
- Idempotent drift ingestion and remediation actions (safe retries); see the ID sketch after this list.
- Low overhead: drift agents must use <1% CPU and minimal network.
- Security: least privilege; no broad SSH; secrets never logged.
- Multi-region: must work independently per region but roll up globally.
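For the idempotency requirement, one approach is deterministic IDs derived from content, so duplicate deliveries and retried actions collapse to no-ops. A sketch; the ID scheme and names are assumptions, not an existing API:

```python
# Hedged sketch: content-derived IDs make drift ingestion and remediation
# retries safe.
import hashlib

def drift_event_id(node_id: str, config_key: str,
                   actual_fingerprint: str, baseline_version: str) -> str:
    """The same drift observed twice (agent retry, duplicate delivery)
    yields the same ID, so an upsert/MERGE keyed on it is a no-op."""
    raw = "|".join([node_id, config_key, actual_fingerprint, baseline_version])
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

# A remediation action can reuse the same ID as its idempotency key: before
# "cordon node X for event E" runs, the actuator checks whether action
# (E, "cordon") already succeeded, so retried runs do not double-disrupt.
print(drift_event_id("ip-10-2-3-4", "spark.sql.session.timeZone",
                     "9f2c", "baseline-v42"))
```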
Constraints
- PayWave is standardized on AWS + EKS + MSK + Snowflake.
- Team skills: strong in Airflow, Spark, dbt, Snowflake; moderate Kubernetes; limited Terraform bandwidth.
- Budget: incremental infra spend capped at $15K/month.
- Compliance: SOX—must be able to demonstrate controlled changes and drift remediation.
What to Produce in Your Answer
- A target architecture for drift detection + remediation as a data pipeline.
- A data model for storing desired vs. actual state, drift events, and remediation actions.
- How you’d implement idempotency and handle late/partial reports and autoscaling churn.
- Monitoring/alerting and SLOs.
- Failure modes and safe rollback strategies.
- How you’d roll this out incrementally without disrupting production.