You’re interviewing with the Data Platform team at a global fintech that offers consumer credit and payment products in the US and EU. The company has 18M monthly active users and processes $4B/month in card spend. Fraud, risk, and finance teams rely on analytics that must be explainable and auditable (SOX) and, for EU users, compliant with GDPR (right-to-erasure, retention controls).
Today, the platform ingests data from (1) card authorization events, (2) ledger postings, (3) customer profile updates, and (4) third-party KYC providers. The current architecture is a patchwork: some teams land files in Azure Blob Storage, others use Azure Data Lake Storage Gen2 (ADLS Gen2), and several operational reporting workflows write directly into a traditional SQL database (Azure SQL / SQL Server). As a result, pipelines are inconsistent, costs are hard to predict, and analysts complain about stale or contradictory metrics.
Leadership wants a standardized pipeline pattern that clearly defines where data should live at each stage (raw → curated → serving), how it should be secured, and how it should be processed (ETL vs ELT). You are asked to propose a design and explicitly explain the differences and trade-offs between Blob Storage, ADLS Gen2, and traditional SQL databases in the context of production pipelines.
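One way to make the raw → curated → serving stages concrete is a shared path convention on ADLS Gen2, whose hierarchical namespace makes directory-per-zone, date-partitioned layouts natural. The sketch below is illustrative only: the `lakehouse` storage account name, the zone-per-container layout, and the `year=/month=/day=` partitioning are assumptions, not a prescribed design.

```python
from datetime import date

# Hypothetical zone layout: one container per zone on an ADLS Gen2
# account named "lakehouse" (both names are placeholders).
ZONES = ("raw", "curated", "serving")

def zone_path(zone: str, dataset: str, ingest_date: date) -> str:
    """Build a hierarchical, date-partitioned path for a dataset in a zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return (
        f"abfss://{zone}@lakehouse.dfs.core.windows.net/"
        f"{dataset}/year={ingest_date.year}/month={ingest_date.month:02d}/"
        f"day={ingest_date.day:02d}"
    )

print(zone_path("raw", "card_auth_events", date(2024, 5, 17)))
# abfss://raw@lakehouse.dfs.core.windows.net/card_auth_events/year=2024/month=05/day=17
```

A convention like this also gives security a concrete surface: directory-level POSIX ACLs (an ADLS Gen2 feature Blob Storage's flat namespace lacks) can grant fraud analysts the curated zone while fencing off raw PII.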
| Dataset | Example fields | Notes |
|---|---|---|
card_auth_events | event_id, user_id, merchant_id, amount, currency, event_ts, status, device_fingerprint | Out-of-order events; duplicates possible; PII present |
ledger_entries | entry_id, account_id, debit, credit, posting_ts, effective_date, journal_id | Must be immutable; reconciliation critical |
user_profile | user_id, email, country, kyc_status, updated_ts | Arrives via change data capture (CDC); subject to GDPR right-to-erasure |
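The card_auth_events characteristics (out-of-order arrival, possible duplicates) imply an explicit dedup step before the curated zone, since object stores enforce no uniqueness. A minimal sketch, assuming events are plain dicts keyed by `event_id` with a comparable `event_ts`; the function name is hypothetical:

```python
from typing import Iterable

def dedupe_latest(events: Iterable[dict]) -> list[dict]:
    """Keep one record per event_id, preferring the latest event_ts,
    then emit in event-time order for curated-zone writes."""
    latest: dict[str, dict] = {}
    for e in events:
        cur = latest.get(e["event_id"])
        if cur is None or e["event_ts"] > cur["event_ts"]:
            latest[e["event_id"]] = e
    return sorted(latest.values(), key=lambda e: e["event_ts"])
```

In a SQL database this would instead be a primary-key constraint plus an upsert; in the lake it has to be a deliberate transformation, which is exactly the kind of trade-off the exercise asks you to surface.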
Design an end-to-end pipeline and answer the following:
We’re not looking for a generic definition list. We want you to connect the differences between Blob Storage, ADLS Gen2, and SQL databases to concrete pipeline decisions: filesystem semantics, hierarchical namespace, ACLs, throughput patterns, schema enforcement, transactional guarantees, query patterns, and operational burden.
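To ground the schema-enforcement point: a SQL database rejects malformed rows at write time via its schema and constraints, whereas files in Blob Storage or ADLS Gen2 accept any bytes, so an ELT pipeline must validate explicitly before promoting data to curated. A minimal sketch of such a check for ledger_entries; the schema dict is an assumption (it models debit/credit as integer minor units, e.g. cents, which is itself a design choice, not something the scenario specifies):

```python
# Illustrative record-level schema for ledger_entries in the curated zone.
LEDGER_SCHEMA = {"entry_id": str, "account_id": str, "debit": int, "credit": int}

def validate(record: dict, schema: dict) -> list[str]:
    """Return a list of violations; empty list means the record passes."""
    errors = []
    for field, ftype in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors
```

A strong answer would note where this check runs (e.g. raw → curated promotion), what happens to rejects (quarantine path, not silent drops, given SOX auditability), and why the serving layer might still be a SQL database where these guarantees come for free.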