Context
You’re interviewing with the Data Platform team at MediClaims, a US healthcare claims processor that handles eligibility checks and claim adjudication for regional insurers. The company processes ~120M claim line items/day and must meet HIPAA requirements (auditability, least privilege, encryption, and controlled access). Analytics teams use downstream lakehouse data to detect provider fraud, monitor denial rates, and report operational KPIs to insurer partners.
Today, core transactional data lives in an on‑prem SQL Server 2019 AlwaysOn cluster in two data centers. Analysts currently get data via a brittle nightly export: a SQL Agent job dumps full tables to a file share, then an ops script copies files to Azure. This creates an 8–24 hour freshness lag, frequent partial loads, and no reliable way to handle late-arriving updates (e.g., claims that are re-priced days later). Leadership wants a production-grade incremental pipeline that lands data in Azure Blob Storage (ADLS Gen2) as a durable raw layer, enabling downstream transformations (e.g., Spark/dbt/Synapse/Snowflake—pick later).
Data Sources & Characteristics
You need to incrementally extract the following tables (simplified) from SQL Server:
| Table | Approx Rows | Daily Change | Notes |
|---|---|---|---|
| dbo.Claims | 450M | 8M inserts, 1M updates | Updates can occur up to 30 days late |
| dbo.ClaimLines | 2.8B | 120M inserts, 10M updates | Large, wide table; frequent schema additions |
| dbo.Providers | 12M | 50K updates | Slowly changing attributes |
Assume:
- SQL Server is OLTP and must not be overloaded; you have a 2-hour nightly window for heavier reads, but incremental loads should run every 15 minutes.
- Not all tables have a reliable updated_at. Some have rowversion; others only have an identity PK (a rowversion watermark sketch follows this list).
- Deletes occur (e.g., compliance removals), but are rare (<0.01% of rows/day). You must still represent them downstream.
- Network between on‑prem and Azure is via site-to-site VPN today; ExpressRoute is planned but not available for 6 months.
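For tables that expose only rowversion (and no trustworthy updated_at), a high-watermark pull is one common option. The sketch below is illustrative only: it assumes pyodbc connectivity, hypothetical column names (ClaimId, MemberId, Status, RowVer), and that the caller persists the watermark between runs.

```python
import pyodbc

BATCH_SIZE = 50_000  # assumed per-run cap to keep OLTP impact bounded

def extract_incremental(conn: pyodbc.Connection, last_watermark: bytes):
    """Pull rows changed since last_watermark; return (rows, new_watermark)."""
    cursor = conn.cursor()

    # Upper-bound the read at MIN_ACTIVE_ROWVERSION() so rows touched by
    # still-open transactions are not skipped forever.
    cursor.execute("SELECT MIN_ACTIVE_ROWVERSION();")
    upper_bound = cursor.fetchone()[0]

    cursor.execute(
        """
        SELECT TOP (?) ClaimId, MemberId, Status, RowVer
        FROM dbo.Claims
        WHERE RowVer > ? AND RowVer < ?
        ORDER BY RowVer;
        """,
        BATCH_SIZE, last_watermark, upper_bound,
    )
    rows = cursor.fetchall()

    # Advance the watermark only as far as we actually read, so a capped
    # batch simply resumes on the next 15-minute run.
    new_watermark = rows[-1].RowVer if rows else last_watermark
    return rows, new_watermark
```

Because rowversion is bumped on every update, claims re-priced 30 days later are picked up by the same query without any date-range scan; deletes, however, are invisible to this method and need a separate mechanism (CDC, Change Tracking, or soft-delete flags).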
Scale Requirements
- Freshness / latency: data should be queryable in Blob within 30 minutes of commit for high-value tables (Claims, ClaimLines).
- Throughput: peak incremental extraction of ~5–10K rows/sec sustained without impacting OLTP p95 latency by more than 5%.
- Storage: raw retention in Blob for 7 years (regulatory/audit), with immutable history for key tables.
- Correctness: exactly-once effects in the lake (idempotent writes; no double counting on retries).
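One way to get idempotent, exactly-once effects in the raw layer is to derive the blob path purely from the table name and extraction window, so a retried batch overwrites the same object instead of appending a duplicate. A minimal sketch with the azure-storage-blob SDK; the container name, path layout, and serialized payload are assumptions:

```python
from azure.storage.blob import BlobServiceClient

def land_batch(service: BlobServiceClient, table: str,
               lo_watermark: str, hi_watermark: str, payload: bytes) -> str:
    """Write one extracted batch to a deterministic, retry-safe blob path."""
    # The path is a pure function of (table, watermark range): re-running the
    # same batch targets the same blob, and overwrite=True replaces it,
    # so retries cannot double-count rows or leave duplicate files.
    blob_path = f"raw/{table}/watermark={lo_watermark}-{hi_watermark}/part-0000.parquet"
    blob = service.get_blob_client(container="lake-raw", blob=blob_path)
    blob.upload_blob(payload, overwrite=True)
    return blob_path
```

A commit marker written last (e.g., a small _SUCCESS blob or a per-batch manifest) lets downstream readers ignore partially written batches.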
Your Task
Design an end-to-end incremental pipeline from on‑prem SQL Server → Azure Blob Storage.
Functional Requirements
- Incremental extraction for inserts and updates, with support for late-arriving updates (up to 30 days).
- Delete handling: represent deletes as tombstones or change events so downstream consumers can reconcile (see the change-event sketch after this list).
- Raw landing format: choose file format and partitioning strategy suitable for large-scale analytics.
- Schema evolution: handle added columns and type widening with minimal operational burden.
- Backfills: support reprocessing a date range (e.g., last 14 days) without corrupting data.
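One model that covers deletes, late updates, and backfills together is an append-only change log: every extracted row is wrapped in an envelope carrying an operation code and source ordering metadata, and deletes become key-only tombstone events. The field names below are illustrative, not a fixed schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class ChangeEvent:
    source_table: str            # e.g. "dbo.Claims"
    op: str                      # "I" insert, "U" update, "D" delete (tombstone)
    source_order: str            # CDC LSN, Change Tracking version, or rowversion
    extracted_at: datetime       # when the pipeline read the row
    payload: Optional[dict]      # full row image for I/U; business key only for D

def tombstone(table: str, key_columns: dict, source_order: str) -> ChangeEvent:
    """Represent a compliance delete downstream without shipping row contents."""
    return ChangeEvent(
        source_table=table,
        op="D",
        source_order=source_order,
        extracted_at=datetime.now(timezone.utc),
        payload=key_columns,
    )
```

Partitioning this log by ingestion date (e.g., raw/Claims/ingest_date=YYYY-MM-DD/) keeps backfills bounded: reprocessing the last 14 days means rewriting only those date partitions, and downstream consumers rebuild the latest snapshot by taking the highest source_order per key and dropping tombstoned keys.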
Non-Functional Requirements
- Idempotency: retries must not create duplicates or partial partitions.
- Observability: metrics, logs, lineage, and data quality checks with actionable alerts.
- Security & compliance: encryption in transit/at rest, secrets management, least privilege, audit logs.
- Operational simplicity: a small team (4 DEs) should be able to run it; on-call load must be reasonable.
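For the secrets-management piece, extraction jobs can fetch the read-only SQL connection string from Key Vault at startup instead of storing credentials in job configuration. A minimal sketch with the azure-identity and azure-keyvault-secrets SDKs; the vault URL and secret name are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# The job's managed identity needs only "get" on this one secret,
# which keeps the access grant aligned with least privilege.
credential = DefaultAzureCredential()
vault = SecretClient(
    vault_url="https://<vault-name>.vault.azure.net",  # placeholder vault
    credential=credential,
)
conn_string = vault.get_secret("sqlserver-readonly-connstring").value  # placeholder name
```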
Constraints
- You may use Azure-native services (ADF, Azure Databricks, Event Hubs, Functions, Key Vault, Monitor) and common OSS tools (Airflow, dbt). Prefer managed where it reduces ops.
- You cannot install heavy agents on the SQL Server hosts without a security review; lightweight configuration changes (enabling CDC) are possible but require justification.
- Budget target: <$35K/month incremental cloud spend.
Discussion Prompts (what the interviewer will probe)
- How do you implement incremental capture on SQL Server: CDC vs Change Tracking vs rowversion watermark vs trigger-based? (A Change Tracking sketch follows this list.)
- How do you guarantee idempotent writes to Blob when jobs retry mid-batch?
- How do you manage late updates without scanning huge tables every 15 minutes?
- How do you model raw data: append-only change log vs upserted “latest snapshot” files?
- What’s your plan for schema drift and downstream compatibility?
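On the capture question, Change Tracking is often the middle ground when full CDC is hard to justify to security: it is a lightweight table-level setting, surfaces deletes with their primary keys, and needs no triggers, at the cost of keeping no intermediate row images. A hedged sketch of a pull, assuming Change Tracking is enabled on the database and on dbo.Claims, with illustrative column names:

```python
import pyodbc

CT_QUERY = """
DECLARE @from_version bigint = ?;
SELECT
    CT.SYS_CHANGE_OPERATION AS op,          -- 'I', 'U', or 'D'
    CT.SYS_CHANGE_VERSION   AS change_version,
    CT.ClaimId,
    C.MemberId, C.Status                    -- NULL for deleted rows
FROM CHANGETABLE(CHANGES dbo.Claims, @from_version) AS CT
LEFT JOIN dbo.Claims AS C
       ON C.ClaimId = CT.ClaimId;
"""

def pull_changes(conn: pyodbc.Connection, last_sync_version: int):
    """Return changed rows since last_sync_version and the next sync version."""
    cursor = conn.cursor()
    # Capture the target version before reading, so the next run starts here
    # and nothing committed during this read is lost (at worst it is re-read,
    # which the idempotent landing layer absorbs).
    cursor.execute("SELECT CHANGE_TRACKING_CURRENT_VERSION();")
    next_version = cursor.fetchone()[0]
    rows = cursor.execute(CT_QUERY, last_sync_version).fetchall()
    return rows, next_version
```

CDC, by contrast, retains full before/after images in change tables queried via the cdc.fn_cdc_get_all_changes_* functions, which suits an append-only change log better but adds log-reader load and more objects to administer on the OLTP hosts.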