You’re interviewing with the Data Platform team at a global fintech that offers consumer credit and payment products in the US and EU. The company has 18M monthly active users and processes $4B/month in card spend. Fraud, risk, and finance teams rely on analytics that must be explainable and auditable (SOX) and, for EU users, compliant with GDPR (right-to-erasure, retention controls).
Today, the platform ingests data from (1) card authorization events, (2) ledger postings, (3) customer profile updates, and (4) third-party KYC providers. The current architecture is a patchwork: some teams land files in Azure Blob Storage, others use Azure Data Lake Storage Gen2 (ADLS Gen2), and several operational reporting workflows write directly into a traditional SQL database (Azure SQL / SQL Server). As a result, pipelines are inconsistent, costs are hard to predict, and analysts complain about stale or contradictory metrics.
Leadership wants a standardized pipeline pattern that clearly defines where data should live at each stage (raw → curated → serving), how it should be secured, and how it should be processed (ETL vs ELT). You are asked to propose a design and explicitly explain the differences and trade-offs between Blob Storage, ADLS Gen2, and traditional SQL databases in the context of production pipelines.
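One way to make the raw → curated → serving stages concrete is a shared path convention on ADLS Gen2, whose hierarchical namespace makes directory-per-zone, date-partitioned layouts natural. The sketch below is illustrative only: the `lakehouse` storage account name, the zone-per-container layout, and the `year=/month=/day=` partitioning are assumptions, not a prescribed design.

```python
from datetime import date

# Hypothetical zone layout: one container per zone on an ADLS Gen2
# account named "lakehouse" (both names are placeholders).
ZONES = ("raw", "curated", "serving")

def zone_path(zone: str, dataset: str, ingest_date: date) -> str:
    """Build a hierarchical, date-partitioned path for a dataset in a zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return (
        f"abfss://{zone}@lakehouse.dfs.core.windows.net/"
        f"{dataset}/year={ingest_date.year}/month={ingest_date.month:02d}/"
        f"day={ingest_date.day:02d}"
    )

print(zone_path("raw", "card_auth_events", date(2024, 5, 17)))
# abfss://raw@lakehouse.dfs.core.windows.net/card_auth_events/year=2024/month=05/day=17
```

A convention like this also gives security a concrete surface: directory-level POSIX ACLs (an ADLS Gen2 feature Blob Storage's flat namespace lacks) can grant fraud analysts the curated zone while fencing off raw PII.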
| Dataset | Example fields | Notes |
|---|---|---|
card_auth_events | event_id, user_id, merchant_id, amount, currency, event_ts, status, device_fingerprint | Out-of-order events; duplicates possible; PII present |
ledger_entries | entry_id, account_id, debit, credit, posting_ts, effective_date, journal_id | Must be immutable; reconciliation critical |
user_profile | user_id, email, country, kyc_status, updated_ts | Arrives via change data capture (CDC); subject to GDPR right-to-erasure |
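The card_auth_events characteristics (out-of-order arrival, possible duplicates) imply an explicit dedup step before the curated zone, since object stores enforce no uniqueness. A minimal sketch, assuming events are plain dicts keyed by `event_id` with a comparable `event_ts`; the function name is hypothetical:

```python
from typing import Iterable

def dedupe_latest(events: Iterable[dict]) -> list[dict]:
    """Keep one record per event_id, preferring the latest event_ts,
    then emit in event-time order for curated-zone writes."""
    latest: dict[str, dict] = {}
    for e in events:
        cur = latest.get(e["event_id"])
        if cur is None or e["event_ts"] > cur["event_ts"]:
            latest[e["event_id"]] = e
    return sorted(latest.values(), key=lambda e: e["event_ts"])
```

In a SQL database this would instead be a primary-key constraint plus an upsert; in the lake it has to be a deliberate transformation, which is exactly the kind of trade-off the exercise asks you to surface.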
Design an end-to-end pipeline and answer the following:
We’re not looking for a generic definition list. We want you to connect the differences between Blob Storage, ADLS Gen2, and SQL databases to concrete pipeline decisions: filesystem semantics, hierarchical namespace, ACLs, throughput patterns, schema enforcement, transactional guarantees, query patterns, and operational burden.
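To ground the schema-enforcement point: a SQL database rejects malformed rows at write time via its schema and constraints, whereas files in Blob Storage or ADLS Gen2 accept any bytes, so an ELT pipeline must validate explicitly before promoting data to curated. A minimal sketch of such a check for ledger_entries; the schema dict is an assumption (it models debit/credit as integer minor units, e.g. cents, which is itself a design choice, not something the scenario specifies):

```python
# Illustrative record-level schema for ledger_entries in the curated zone.
LEDGER_SCHEMA = {"entry_id": str, "account_id": str, "debit": int, "credit": int}

def validate(record: dict, schema: dict) -> list[str]:
    """Return a list of violations; empty list means the record passes."""
    errors = []
    for field, ftype in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors
```

A strong answer would note where this check runs (e.g. raw → curated promotion), what happens to rejects (quarantine path, not silent drops, given SOX auditability), and why the serving layer might still be a SQL database where these guarantees come for free.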