Context
You’re joining the Data Platform team at PayWave, a fintech that processes card payments and fraud signals for 25M monthly active cardholders across the US and EU. PayWave’s analytics and risk models depend on a lakehouse-style platform: raw and curated data lands in Amazon S3, and business-facing datasets are served from Snowflake. The company is under pressure from Finance to stop “surprise” cloud spend spikes, and from Risk/Compliance to prove that retention and deletion policies are consistently enforced.
Today, PayWave ingests: (1) payment authorization events from microservices, (2) device telemetry from mobile SDKs, (3) chargeback/dispute updates from external card networks, and (4) reference data (merchant catalog, BIN tables). The platform currently runs a mix of Kafka → Spark Structured Streaming for near-real-time ingestion and Airflow → Spark batch for daily backfills and compaction. Storage costs have grown 2.5× in the last year due to schema evolution, duplicate events, and keeping “just in case” copies in multiple layers.
Leadership asks you to create an 18-month capacity plan and propose guardrails so the platform can scale without breaking SLAs or budget. You must account for the realities of production pipelines: late-arriving data, reprocessing/backfills, compaction, multiple table layers, and compliance retention.
Current Architecture (Simplified)
| Layer | Location | Format | Notes |
|---|---|---|---|
| Raw landing | S3 s3://paywave-raw/ | JSON | Immutable, partitioned by event_date |
| Bronze | S3 s3://paywave-bronze/ | Parquet (Snappy) | Deduped + schema-normalized |
| Silver | S3 s3://paywave-silver/ | Parquet | Enriched, joins to reference data |
| Gold | Snowflake | Tables | Aggregates for dashboards + ML features |
| Orchestration | Airflow 2.x | DAGs | Backfills, compaction, dbt runs |
Scale Requirements & Data Characteristics
Ingestion volumes
- Auth events: avg 70K events/sec, peak 180K/sec (holiday spikes)
- Device telemetry: avg 25K events/sec, peak 60K/sec
- Disputes/chargebacks: 8M records/day, bursty, updates over time
- Reference data: ~30 GB/day of changes
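
Payload sizes are not given above, so any daily-volume estimate has to state them explicitly. A minimal back-of-envelope sketch in Python, assuming roughly 1 KB per auth event, 600 B per telemetry event, and 2 KB per dispute record (all three are illustrative assumptions, not requirements):

```python
# Back-of-envelope daily raw ingest. The per-event payload sizes are
# assumptions and should be replaced with measured averages.
SECONDS_PER_DAY = 86_400

auth_bytes      = 70_000 * SECONDS_PER_DAY * 1_000   # 70K events/sec avg, ~1 KB each (assumed)
telemetry_bytes = 25_000 * SECONDS_PER_DAY * 600     # 25K events/sec avg, ~600 B each (assumed)
disputes_bytes  = 8_000_000 * 2_000                  # 8M records/day, ~2 KB each (assumed)
reference_bytes = 30 * 1024**3                       # ~30 GB/day of reference changes

total_tib = (auth_bytes + telemetry_bytes + disputes_bytes + reference_bytes) / 1024**4
print(f"~{total_tib:.1f} TiB/day of raw JSON")       # ≈ 6.7 TiB/day under these assumptions
```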
Storage and retention targets
- Raw retention: 180 days (audit requirement)
- Curated (bronze/silver): 2 years for analytics
- Gold in Snowflake: “indefinite”, but must support a tiering/archival strategy
- GDPR/CCPA deletion: delete a user’s data within 72 hours of request
Data quality realities
- Duplicates: ~0.4% of auth events (retries, at-least-once delivery)
- Late arrivals: 2% arrive > 2 hours late; 0.2% arrive > 3 days late (network delays)
- Schema evolution: new optional fields weekly; occasional breaking changes from upstream teams
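
The duplicate and late-arrival rates above directly size streaming state and late partitions. A minimal PySpark Structured Streaming sketch of the usual watermark + dedup pattern, assuming a Kafka topic named auth-events and fields event_id/event_ts (both names are illustrative); events arriving later than the watermark would be reconciled by the batch/backfill path:

```python
# Sketch only: dedup auth events within a 2-hour watermark; events later than
# the watermark (the ~0.2% > 3 days late tail) are left to batch reconciliation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("auth-bronze").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "BROKER:9092")   # placeholder broker
    .option("subscribe", "auth-events")                  # assumed topic name
    .load()
)

bronze = (
    raw.select(F.from_json(F.col("value").cast("string"),
                           "event_id STRING, event_ts TIMESTAMP, amount DOUBLE").alias("e"))
    .select("e.*")
    .withColumn("event_date", F.to_date("event_ts"))
    .withWatermark("event_ts", "2 hours")                 # bounds dedup state
    .dropDuplicates(["event_id", "event_ts"])             # assumes retries carry the same event_ts
)

query = (
    bronze.writeStream.format("parquet")
    .option("path", "s3://paywave-bronze/auth_events/")
    .option("checkpointLocation", "s3://paywave-bronze/_checkpoints/auth_events/")
    .partitionBy("event_date")
    .trigger(processingTime="1 minute")
    .start()
)
```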
Pipeline behaviors that affect capacity
- Streaming writes create many small files; daily compaction rewrites partitions.
- Backfills happen ~2×/month and can rewrite up to 14 days of data.
- Bronze and Silver layers currently store overlapping columns; some teams also keep “debug copies.”
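
As one worked illustration of why compaction and backfills matter for capacity, the calculation below uses assumed figures; the structure of the math, not the numbers, is the point.

```python
# Rough monthly write amplification from daily compaction plus ~2 backfills/month
# that each rewrite up to 14 days of data. All inputs are assumptions.
daily_bronze_tb   = 1.5     # assumed post-compression Parquet volume per day
compaction_factor = 1.0     # daily compaction rewrites roughly one day's partitions
backfills_per_mo  = 2
backfill_days     = 14

bytes_written = 30 * daily_bronze_tb * (1 + compaction_factor) \
                + backfills_per_mo * backfill_days * daily_bronze_tb
net_new       = 30 * daily_bronze_tb
print(f"write amplification ≈ {bytes_written / net_new:.1f}x")   # ≈ 2.9x under these assumptions
```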
Your Task
Design a capacity planning approach and produce a plan that Finance and Engineering can sign off on.
1) Quantitative capacity plan (must show your math)
Provide an 18-month forecast for storage growth and monthly cost drivers. At minimum, include:
- Daily ingest volume estimate (raw) using the provided event rates and clearly stated payload-size assumptions (a worked model sketch follows this list).
- Expected compression ratios when converting JSON → Parquet/Snappy (state assumptions).
- Overhead from:
  - duplicates
  - late-arriving data handling (watermarks + late partitions)
  - compaction rewrites (temporary storage + rewrite amplification)
  - backfills (rewrite amplification)
  - multi-layer storage (raw/bronze/silver/gold)
- Retention policy impact (rolling windows) and how it changes steady-state storage.
- A buffer strategy (e.g., 20–30%) and how you justify it.
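
To make the required math concrete, a toy model sketch in Python: every constant is an assumption to be replaced with measured values, and it deliberately simplifies by applying the month-m daily rate across the whole retention window.

```python
# Toy steady-state storage model per layer. All constants are assumptions;
# the structure (growth x overheads x retention window x buffer) is the point.
def layer_storage_tb(daily_raw_tb: float, month: int) -> dict:
    GROWTH_MO       = 0.03            # assumed 3% month-over-month volume growth
    JSON_TO_PARQUET = 1 / 4.5         # assumed ~4.5x JSON -> Parquet/Snappy compression
    DUP_OVERHEAD    = 1.004           # ~0.4% duplicate auth events
    LATE_OVERHEAD   = 1.02            # extra late partitions / reprocessed data
    BUFFER          = 1.25            # 25% buffer: compaction temp space, debug copies, forecast error
    LAYER_RATIO     = {"raw": 1.0, "bronze": JSON_TO_PARQUET, "silver": JSON_TO_PARQUET * 1.3}
    RETENTION_DAYS  = {"raw": 180, "bronze": 730, "silver": 730}

    daily = daily_raw_tb * (1 + GROWTH_MO) ** month * DUP_OVERHEAD * LATE_OVERHEAD
    return {
        layer: round(daily * ratio * min(RETENTION_DAYS[layer], 30 * month) * BUFFER, 1)
        for layer, ratio in LAYER_RATIO.items()
    }

# Month 18, using the ~6.7 TiB/day estimate from the ingest sketch above
print(layer_storage_tb(6.7, 18))
```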
2) Storage architecture decisions
Explain what you would change (if anything) to reduce growth without sacrificing reliability:
- Partitioning strategy (e.g., event_date, event_hour, merchant_id, region) and trade-offs.
- File sizing targets (e.g., 128–512 MB Parquet) and compaction cadence.
- Whether you would introduce Iceberg/Delta/Hudi for table management, and why.
- How you would avoid storing the same data 3 times across layers (or justify why it’s necessary).
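
If an open table format were adopted (Iceberg is used here purely as an example; Delta or Hudi could fill the same role), the partitioning, file-size targets, and compaction discussed above become table-level configuration rather than bespoke jobs. A hedged sketch, assuming an Iceberg catalog named lake and an illustrative table bronze.auth_events:

```python
# Sketch only; requires the Iceberg Spark runtime and a configured catalog
# ("lake" below is an assumed catalog name).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.bronze.auth_events (
        event_id    STRING,
        event_ts    TIMESTAMP,
        merchant_id STRING,
        amount      DOUBLE
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))                               -- coarse, low-cardinality partitioning
    TBLPROPERTIES ('write.target-file-size-bytes' = '268435456')  -- ~256 MB target files
""")

# Periodically compact the small files produced by streaming writes
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table   => 'bronze.auth_events',
        options => map('min-input-files', '32')
    )
""")
```

Partitioning only by day keeps partition counts manageable; merchant_id or region would be better handled via sort order/clustering than partitioning, to avoid an explosion of small files.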
3) Operational guardrails
Propose concrete guardrails to prevent runaway storage:
- dataset-level quotas and alerts
- lifecycle policies (S3 Intelligent-Tiering / Glacier)
- schema governance (compatibility rules, deprecation)
- “reprocessing budget” (limits on backfills/rewrites)
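
As one concrete lifecycle guardrail, a hedged boto3 sketch that tiers the raw bucket to Intelligent-Tiering and expires objects at the 180-day audit boundary (the 30-day transition and 7-day multipart-upload cleanup are assumptions):

```python
# Illustrative lifecycle rules for the raw landing bucket. Day thresholds are
# assumptions; the expiration aligns with the 180-day raw retention target.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="paywave-raw",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-tier-and-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}
                ],
                "Expiration": {"Days": 180},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```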
4) Compliance and deletion
Describe how retention and deletion affect capacity planning:
- How do you enforce retention in S3 and Snowflake?
- How do you handle deletion in Parquet-based lakes (rewrites) without exploding costs?
- What audit evidence would you produce for regulators?
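
If a table format such as Iceberg were in place (see section 2), a deletion request becomes a metadata-tracked DELETE followed by snapshot expiry rather than a hand-rolled partition rewrite; both steps must complete within the 72-hour window. A hedged sketch with illustrative table and column names:

```python
# Sketch only: GDPR/CCPA delete against an assumed Iceberg table. In production
# the user_id would come from a deletion-request queue and the SQL parameterized.
from datetime import datetime, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
user_id = "REDACTED-USER-ID"   # placeholder

spark.sql(
    f"DELETE FROM lake.silver.auth_events_enriched WHERE user_id = '{user_id}'"
)

# Expire old snapshots so the data files holding the deleted rows are physically removed
cutoff = (datetime.utcnow() - timedelta(hours=24)).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"""
    CALL lake.system.expire_snapshots(
        table       => 'silver.auth_events_enriched',
        older_than  => TIMESTAMP '{cutoff}',
        retain_last => 1
    )
""")
```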
5) Monitoring and reporting
Define the metrics and dashboards you’d build so the plan stays accurate over time:
- storage by dataset/layer
- rewrite amplification
- late-arrival rates
- cost per TB ingested and cost per query (Snowflake)
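
For the storage-by-dataset/layer metric, per-bucket totals can come straight from the daily BucketSizeBytes metric S3 publishes to CloudWatch; a hedged boto3 sketch is below (per-prefix/dataset breakdown would come from S3 Inventory, which is not shown).

```python
# Pull per-bucket storage from CloudWatch's daily S3 storage metric.
from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch")

def bucket_size_gb(bucket: str) -> float:
    resp = cw.get_metric_statistics(
        Namespace="AWS/S3",
        MetricName="BucketSizeBytes",
        Dimensions=[
            {"Name": "BucketName", "Value": bucket},
            {"Name": "StorageType", "Value": "StandardStorage"},
        ],
        StartTime=datetime.utcnow() - timedelta(days=2),
        EndTime=datetime.utcnow(),
        Period=86_400,
        Statistics=["Average"],
    )
    points = resp["Datapoints"]
    return max(p["Average"] for p in points) / 1024**3 if points else 0.0

for b in ("paywave-raw", "paywave-bronze", "paywave-silver"):
    print(b, round(bucket_size_gb(b), 1), "GB")
```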
Constraints
- Cloud: AWS + Snowflake (no migration off Snowflake this year).
- Team: 5 data engineers; moderate Spark expertise; limited on-call bandwidth.
- Budget: storage cost growth must be < 15% QoQ; avoid large one-time platform rewrites.
- Availability: ingestion must tolerate an AZ outage; no data loss.
Deliverables (what you’d present in the interview)
- A capacity model (assumptions + formulas) and an 18-month forecast.
- A recommended storage/retention/compaction strategy with trade-offs.
- A monitoring + alerting plan and failure modes.
Your answer will be evaluated on correctness of the capacity model, realism of assumptions, and whether the plan holds up under late data, backfills, and compliance requirements.