Context
You’re joining the Data Platform team at PayWave, a fintech that processes card payments and fraud signals for 25M monthly active cardholders across the US and EU. PayWave’s analytics and risk models depend on a lakehouse-style platform: raw and curated data lands in Amazon S3, and business-facing datasets are served from Snowflake. The company is under pressure from Finance to stop “surprise” cloud spend spikes, and from Risk/Compliance to prove that retention and deletion policies are consistently enforced.
Today, PayWave ingests: (1) payment authorization events from microservices, (2) device telemetry from mobile SDKs, (3) chargeback/dispute updates from external card networks, and (4) reference data (merchant catalog, BIN tables). The platform currently runs a mix of Kafka → Spark Structured Streaming for near-real-time ingestion and Airflow → Spark batch for daily backfills and compaction. Storage costs have grown 2.5× in the last year due to schema evolution, duplicate events, and keeping “just in case” copies in multiple layers.
Leadership asks you to create an 18-month capacity plan and propose guardrails so the platform can scale without breaking SLAs or budget. You must account for the realities of production pipelines: late-arriving data, reprocessing/backfills, compaction, multiple table layers, and compliance retention.
Current Architecture (Simplified)
| Layer | Location | Format | Notes |
|---|---|---|---|
| Raw landing | S3 s3://paywave-raw/ | JSON | Immutable, partitioned by event_date |
| Bronze | S3 s3://paywave-bronze/ | Parquet (Snappy) | Deduped + schema-normalized |
| Silver | S3 s3://paywave-silver/ | Parquet | Enriched, joins to reference data |
| Gold | Snowflake | Tables | Aggregates for dashboards + ML features |
| Orchestration | Airflow 2.x | DAGs | Backfills, compaction, dbt runs |
Scale Requirements & Data Characteristics
Ingestion volumes
- Auth events: avg 70K events/sec, peak 180K/sec (holiday spikes)
- Device telemetry: avg 25K events/sec, peak 60K/sec
- Disputes/chargebacks: 8M records/day, bursty, updates over time
- Reference data: ~30 GB/day of changes
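
Payload sizes are not given above, so any daily-volume estimate has to state them explicitly. A minimal back-of-envelope sketch in Python, assuming roughly 1 KB per auth event, 600 B per telemetry event, and 2 KB per dispute record (all three are illustrative assumptions, not requirements):

```python
# Back-of-envelope daily raw ingest. The per-event payload sizes are
# assumptions and should be replaced with measured averages.
SECONDS_PER_DAY = 86_400

auth_bytes      = 70_000 * SECONDS_PER_DAY * 1_000   # 70K events/sec avg, ~1 KB each (assumed)
telemetry_bytes = 25_000 * SECONDS_PER_DAY * 600     # 25K events/sec avg, ~600 B each (assumed)
disputes_bytes  = 8_000_000 * 2_000                  # 8M records/day, ~2 KB each (assumed)
reference_bytes = 30 * 1024**3                       # ~30 GB/day of reference changes

total_tib = (auth_bytes + telemetry_bytes + disputes_bytes + reference_bytes) / 1024**4
print(f"~{total_tib:.1f} TiB/day of raw JSON")       # ≈ 6.7 TiB/day under these assumptions
```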
Storage and retention targets
- Raw retention: 180 days (audit requirement)
- Curated (bronze/silver): 2 years for analytics
- Gold in Snowflake: “indefinite”, but must support a tiering/archival strategy
- GDPR/CCPA deletion: delete a user’s data within 72 hours of request
Data quality realities
- Duplicates: ~0.4% of auth events (retries, at-least-once delivery)
- Late arrivals: 2% arrive > 2 hours late; 0.2% arrive > 3 days late (network delays)
- Schema evolution: new optional fields weekly; occasional breaking changes from upstream teams
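
The duplicate and late-arrival rates above directly size streaming state and late partitions. A minimal PySpark Structured Streaming sketch of the usual watermark + dedup pattern, assuming a Kafka topic named auth-events and fields event_id/event_ts (both names are illustrative); events arriving later than the watermark would be reconciled by the batch/backfill path:

```python
# Sketch only: dedup auth events within a 2-hour watermark; events later than
# the watermark (the ~0.2% > 3 days late tail) are left to batch reconciliation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("auth-bronze").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "BROKER:9092")   # placeholder broker
    .option("subscribe", "auth-events")                  # assumed topic name
    .load()
)

bronze = (
    raw.select(F.from_json(F.col("value").cast("string"),
                           "event_id STRING, event_ts TIMESTAMP, amount DOUBLE").alias("e"))
    .select("e.*")
    .withColumn("event_date", F.to_date("event_ts"))
    .withWatermark("event_ts", "2 hours")                 # bounds dedup state
    .dropDuplicates(["event_id", "event_ts"])             # assumes retries carry the same event_ts
)

query = (
    bronze.writeStream.format("parquet")
    .option("path", "s3://paywave-bronze/auth_events/")
    .option("checkpointLocation", "s3://paywave-bronze/_checkpoints/auth_events/")
    .partitionBy("event_date")
    .trigger(processingTime="1 minute")
    .start()
)
```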
Pipeline behaviors that affect capacity
- Streaming writes create many small files; daily compaction rewrites partitions.
- Backfills happen ~2×/month and can rewrite up to 14 days of data.
- Bronze and Silver layers currently store overlapping columns; some teams also keep “debug copies.”
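
As one worked illustration of why compaction and backfills matter for capacity, the calculation below uses assumed figures; the structure of the math, not the numbers, is the point.

```python
# Rough monthly write amplification from daily compaction plus ~2 backfills/month
# that each rewrite up to 14 days of data. All inputs are assumptions.
daily_bronze_tb   = 1.5     # assumed post-compression Parquet volume per day
compaction_factor = 1.0     # daily compaction rewrites roughly one day's partitions
backfills_per_mo  = 2
backfill_days     = 14

bytes_written = 30 * daily_bronze_tb * (1 + compaction_factor) \
                + backfills_per_mo * backfill_days * daily_bronze_tb
net_new       = 30 * daily_bronze_tb
print(f"write amplification ≈ {bytes_written / net_new:.1f}x")   # ≈ 2.9x under these assumptions
```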
Your Task
Design a capacity planning approach and produce a plan that Finance and Engineering can sign off on.
1) Quantitative capacity plan (must show your math)
Provide an 18-month forecast for storage growth and monthly cost drivers. At minimum, include:
- Daily ingest volume estimate (raw) using the provided event rates and clearly stated payload-size assumptions (a worked model sketch follows this list).
- Expected compression ratios when converting JSON → Parquet/Snappy (state assumptions).
- Overhead from:
  - duplicates
  - late-arriving data handling (watermarks + late partitions)
  - compaction rewrites (temporary storage + rewrite amplification)
  - backfills (rewrite amplification)
  - multi-layer storage (raw/bronze/silver/gold)
- Retention policy impact (rolling windows) and how it changes steady-state storage.
- A buffer strategy (e.g., 20–30%) and how you justify it.
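
To make the required math concrete, a toy model sketch in Python: every constant is an assumption to be replaced with measured values, and it deliberately simplifies by applying the month-m daily rate across the whole retention window.

```python
# Toy steady-state storage model per layer. All constants are assumptions;
# the structure (growth x overheads x retention window x buffer) is the point.
def layer_storage_tb(daily_raw_tb: float, month: int) -> dict:
    GROWTH_MO       = 0.03            # assumed 3% month-over-month volume growth
    JSON_TO_PARQUET = 1 / 4.5         # assumed ~4.5x JSON -> Parquet/Snappy compression
    DUP_OVERHEAD    = 1.004           # ~0.4% duplicate auth events
    LATE_OVERHEAD   = 1.02            # extra late partitions / reprocessed data
    BUFFER          = 1.25            # 25% buffer: compaction temp space, debug copies, forecast error
    LAYER_RATIO     = {"raw": 1.0, "bronze": JSON_TO_PARQUET, "silver": JSON_TO_PARQUET * 1.3}
    RETENTION_DAYS  = {"raw": 180, "bronze": 730, "silver": 730}

    daily = daily_raw_tb * (1 + GROWTH_MO) ** month * DUP_OVERHEAD * LATE_OVERHEAD
    return {
        layer: round(daily * ratio * min(RETENTION_DAYS[layer], 30 * month) * BUFFER, 1)
        for layer, ratio in LAYER_RATIO.items()
    }

# Month 18, using the ~6.7 TiB/day estimate from the ingest sketch above
print(layer_storage_tb(6.7, 18))
```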
2) Storage architecture decisions
Explain what you would change (if anything) to reduce growth without sacrificing reliability:
- Partitioning strategy (e.g., event_date, event_hour, merchant_id, region) and trade-offs.
- File sizing targets (e.g., 128–512 MB Parquet) and compaction cadence.
- Whether you would introduce Iceberg/Delta/Hudi for table management, and why.
- How you would avoid storing the same data 3 times across layers (or justify why it’s necessary).
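
If an open table format were adopted (Iceberg is used here purely as an example; Delta or Hudi could fill the same role), the partitioning, file-size targets, and compaction discussed above become table-level configuration rather than bespoke jobs. A hedged sketch, assuming an Iceberg catalog named lake and an illustrative table bronze.auth_events:

```python
# Sketch only; requires the Iceberg Spark runtime and a configured catalog
# ("lake" below is an assumed catalog name).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.bronze.auth_events (
        event_id    STRING,
        event_ts    TIMESTAMP,
        merchant_id STRING,
        amount      DOUBLE
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))                               -- coarse, low-cardinality partitioning
    TBLPROPERTIES ('write.target-file-size-bytes' = '268435456')  -- ~256 MB target files
""")

# Periodically compact the small files produced by streaming writes
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table   => 'bronze.auth_events',
        options => map('min-input-files', '32')
    )
""")
```

Partitioning only by day keeps partition counts manageable; merchant_id or region would be better handled via sort order/clustering than partitioning, to avoid an explosion of small files.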
3) Operational guardrails
Propose concrete guardrails to prevent runaway storage:
- dataset-level quotas and alerts
- lifecycle policies (S3 Intelligent-Tiering / Glacier)
- schema governance (compatibility rules, deprecation)
- “reprocessing budget” (limits on backfills/rewrites)
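
As one concrete lifecycle guardrail, a hedged boto3 sketch that tiers the raw bucket to Intelligent-Tiering and expires objects at the 180-day audit boundary (the 30-day transition and 7-day multipart-upload cleanup are assumptions):

```python
# Illustrative lifecycle rules for the raw landing bucket. Day thresholds are
# assumptions; the expiration aligns with the 180-day raw retention target.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="paywave-raw",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-tier-and-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}
                ],
                "Expiration": {"Days": 180},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```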
4) Compliance and deletion
Describe how retention and deletion affect capacity planning:
- How do you enforce retention in S3 and Snowflake?
- How do you handle deletion in Parquet-based lakes (rewrites) without exploding costs?
- What audit evidence would you produce for regulators?
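
If a table format such as Iceberg were in place (see section 2), a deletion request becomes a metadata-tracked DELETE followed by snapshot expiry rather than a hand-rolled partition rewrite; both steps must complete within the 72-hour window. A hedged sketch with illustrative table and column names:

```python
# Sketch only: GDPR/CCPA delete against an assumed Iceberg table. In production
# the user_id would come from a deletion-request queue and the SQL parameterized.
from datetime import datetime, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
user_id = "REDACTED-USER-ID"   # placeholder

spark.sql(
    f"DELETE FROM lake.silver.auth_events_enriched WHERE user_id = '{user_id}'"
)

# Expire old snapshots so the data files holding the deleted rows are physically removed
cutoff = (datetime.utcnow() - timedelta(hours=24)).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"""
    CALL lake.system.expire_snapshots(
        table       => 'silver.auth_events_enriched',
        older_than  => TIMESTAMP '{cutoff}',
        retain_last => 1
    )
""")
```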
5) Monitoring and reporting
Define the metrics and dashboards you’d build so the plan stays accurate over time:
- storage by dataset/layer
- rewrite amplification
- late-arrival rates
- cost per TB ingested and cost per query (Snowflake)
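
For the storage-by-dataset/layer metric, per-bucket totals can come straight from the daily BucketSizeBytes metric S3 publishes to CloudWatch; a hedged boto3 sketch is below (per-prefix/dataset breakdown would come from S3 Inventory, which is not shown).

```python
# Pull per-bucket storage from CloudWatch's daily S3 storage metric.
from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch")

def bucket_size_gb(bucket: str) -> float:
    resp = cw.get_metric_statistics(
        Namespace="AWS/S3",
        MetricName="BucketSizeBytes",
        Dimensions=[
            {"Name": "BucketName", "Value": bucket},
            {"Name": "StorageType", "Value": "StandardStorage"},
        ],
        StartTime=datetime.utcnow() - timedelta(days=2),
        EndTime=datetime.utcnow(),
        Period=86_400,
        Statistics=["Average"],
    )
    points = resp["Datapoints"]
    return max(p["Average"] for p in points) / 1024**3 if points else 0.0

for b in ("paywave-raw", "paywave-bronze", "paywave-silver"):
    print(b, round(bucket_size_gb(b), 1), "GB")
```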
Constraints
- Cloud: AWS + Snowflake (no migration off Snowflake this year).
- Team: 5 data engineers; moderate Spark expertise; limited on-call bandwidth.
- Budget: storage cost growth must be < 15% QoQ; avoid large one-time platform rewrites.
- Availability: ingestion must tolerate an AZ outage; no data loss.
Deliverables (what you’d present in the interview)
- A capacity model (assumptions + formulas) and an 18-month forecast.
- A recommended storage/retention/compaction strategy with trade-offs.
- A monitoring + alerting plan and failure modes.
Your answer will be evaluated on correctness of the capacity model, realism of assumptions, and whether the plan holds up under late data, backfills, and compliance requirements.