Context
You’re interviewing with the Data Platform team at MercuryMart, a large e-commerce marketplace operating in North America and the EU. MercuryMart processes orders, payments, and fulfillment events across web and mobile with 25M DAUs and frequent traffic spikes during promotions. The analytics and finance teams rely on a centralized warehouse in Azure Synapse for revenue reporting, fraud monitoring, and regulatory audits (SOX + GDPR). A recent incident caused a $1.8M revenue misstatement because a pipeline silently skipped a subset of late-arriving payment adjustments.
Today, MercuryMart uses Azure Data Factory (ADF) to orchestrate ingestion from multiple sources into ADLS Gen2 and then into Synapse. The current implementation mixes orchestration logic (branching, retries, notifications) with heavy transformations inside a single monolithic pipeline. This has led to brittle deployments, unclear ownership, and poor debuggability—engineers can’t quickly tell whether a failure is due to orchestration (control plane) or transformation/compute (data plane).
Your task is to propose a production-grade ADF design that clearly distinguishes control flow from data flow, and to explain how you would structure pipelines, datasets, linked services, and monitoring so that failures are isolated and recoverable.
Current Architecture (Problem Statement)
| Layer | Current Implementation | Pain Point |
|---|---|---|
| Ingestion | ADF Copy Activity from SQL + REST into ADLS | Late-arriving updates not consistently captured |
| Transform | ADF Mapping Data Flows for cleansing + joins | Expensive runs, hard to test, unclear lineage |
| Load | Stored procedures into Synapse | Non-idempotent loads cause duplicates on retry |
| Orchestration | One pipeline with nested if/foreach + transformations | Failures are opaque; reruns are risky |
Scale Requirements
- Daily data volume: ~20 TB/day raw landing in ADLS Gen2
- Peak ingestion: 120k rows/sec from OLTP (Azure SQL) during business hours
- Files: ~60k files/day from partner SFTP drops (CSV/JSON)
- Freshness SLA:
  - Finance tables: < 30 minutes end-to-end
  - Product analytics: < 2 hours
- Late-arriving data: up to 72 hours late for payment adjustments and refunds
- Retention: raw immutable landing 90 days, curated tables 2 years
Data Characteristics
You ingest three primary domains:
- Orders (Azure SQL)
  - Columns: order_id, user_id, order_ts, status, total_amount, updated_at
  - CDC available via updated_at (no native SQL CDC enabled yet)
- Payments (REST API; see the landing-table sketch after this list)
  - JSON events with payment_id, order_id, event_type, event_ts, amount, currency, ingested_at
  - Duplicates possible (at-least-once delivery)
- Fulfillment (SFTP partner drops)
  - Daily/hourly files, inconsistent schema versions, occasional missing columns
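To ground the Payments contract, here is a minimal sketch of the append-only raw landing table as it might look once staged into Synapse. The schema name, table name, column types, and table options are assumptions for illustration, not part of the scenario.

```sql
-- Hypothetical raw landing table for Payments events (append-only; never updated in place).
-- Column types and distribution/index options are assumptions; tune to the real API contract.
CREATE TABLE raw.payments_events
(
    payment_id   VARCHAR(64)   NOT NULL,
    order_id     VARCHAR(64)   NOT NULL,
    event_type   VARCHAR(32)   NOT NULL,  -- e.g. AUTHORIZED, CAPTURED, REFUNDED
    event_ts     DATETIME2     NOT NULL,  -- when the event happened at the source
    amount       DECIMAL(18,2) NOT NULL,
    currency     CHAR(3)       NOT NULL,
    ingested_at  DATETIME2     NOT NULL,  -- when the row landed; drives late-data windows
    source_file  VARCHAR(400)  NULL       -- lineage back to the ADLS Gen2 path
)
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN);  -- cheap staging-friendly defaults
```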
Quality issues observed:
- Duplicate payment events (same payment_id + event_type); see the duplicate check after this list
- Schema drift in fulfillment files
- Late updates to orders and refunds
- Partial loads when downstream Synapse is under maintenance
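As a first diagnostic for the duplicate-payments issue, a window-function check over the raw events quantifies how many rows violate the (payment_id, event_type) natural key. raw.payments_events is the hypothetical landing table sketched above.

```sql
-- Count rows beyond the first per (payment_id, event_type), treating the latest
-- event as the survivor; the dedup load reuses the same ranking.
WITH ranked AS
(
    SELECT
        payment_id,
        event_type,
        ROW_NUMBER() OVER (
            PARTITION BY payment_id, event_type
            ORDER BY event_ts DESC, ingested_at DESC
        ) AS rn
    FROM raw.payments_events
)
SELECT COUNT(*) AS duplicate_rows
FROM ranked
WHERE rn > 1;
```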
Requirements
Functional
- Explain and implement separation of concerns:
  - What belongs in ADF control flow (pipeline activities) vs ADF data flow (Mapping Data Flow transformations)?
- Build an ingestion + transform + load design that supports:
  - Incremental loads for Orders (based on an updated_at watermark)
  - Deduplication for Payments (idempotent upserts)
  - Schema drift handling for Fulfillment
- Provide a reprocessing strategy for late-arriving data (up to 72 hours) without full reloads.
- Ensure idempotent loads into Synapse so retries do not create duplicates (see the watermark + MERGE sketch after this list).
- Define a data quality gate (row counts, null checks, uniqueness) that can block promotion to curated tables.
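A minimal sketch of the incremental + idempotent pattern the bullets above ask for, assuming a watermark table (etl.watermarks), a staged slice (stg.payments_events), and a curated target (curated.payments); all object names are illustrative. MERGE availability and restrictions vary by Synapse dedicated SQL pool version, so an UPDATE + INSERT pair is the portable fallback.

```sql
-- 1) Incremental extract for Orders. In ADF, a Lookup activity reads the stored
--    watermark and passes both bounds as pipeline parameters; literals keep the
--    sketch self-contained.
DECLARE @last_watermark DATETIME2 = '2024-03-01T00:00:00';
DECLARE @new_watermark  DATETIME2 = '2024-03-01T00:30:00';

SELECT order_id, user_id, order_ts, status, total_amount, updated_at
FROM dbo.orders
WHERE updated_at >  @last_watermark
  AND updated_at <= @new_watermark;   -- closed upper bound so a rerun sees the same slice

-- 2) Idempotent load for Payments: dedupe the staged slice on its natural key,
--    then MERGE so a retry updates rather than duplicates.
MERGE curated.payments AS tgt
USING (
    SELECT payment_id, order_id, event_type, event_ts, amount, currency
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY payment_id, event_type
                   ORDER BY event_ts DESC, ingested_at DESC) AS rn
        FROM stg.payments_events
    ) AS d
    WHERE rn = 1
) AS src
   ON tgt.payment_id = src.payment_id
  AND tgt.event_type = src.event_type
WHEN MATCHED AND src.event_ts >= tgt.event_ts THEN
    UPDATE SET amount = src.amount, currency = src.currency, event_ts = src.event_ts
WHEN NOT MATCHED BY TARGET THEN
    INSERT (payment_id, order_id, event_type, event_ts, amount, currency)
    VALUES (src.payment_id, src.order_id, src.event_type, src.event_ts,
            src.amount, src.currency);

-- 3) A simple quality gate the orchestrator can evaluate before promotion:
--    null or duplicate natural keys in staging block the load.
SELECT
    SUM(CASE WHEN payment_id IS NULL OR event_type IS NULL THEN 1 ELSE 0 END) AS null_keys,
    COUNT(*) - COUNT(DISTINCT CONCAT(payment_id, '|', event_type)) AS duplicate_keys
FROM stg.payments_events;

-- 4) Only after a successful, gated load does the pipeline advance the watermark.
UPDATE etl.watermarks SET last_value = @new_watermark WHERE source = 'orders';
```

The operational point is the last step: advancing the watermark only after a successful, gated load is what makes a failed run safely rerunnable.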
Non-functional
- Observability: end-to-end lineage and metrics to answer “what failed and where?” within 5 minutes.
- Reliability: safe retries, bounded backoff, and clear DLQ/quarantine for bad files/events.
- Cost control: Mapping Data Flows should not run continuously; use the smallest viable compute and avoid unnecessary shuffles.
Constraints
- Must stay on Azure (ADF, ADLS Gen2, Synapse).
- Team skillset: strong SQL + ADF experience; limited Spark expertise.
- Budget: incremental spend capped at $35k/month.
- Compliance:
  - GDPR delete requests must propagate to curated tables within 72 hours (see the delete-propagation sketch after this list).
  - Auditability: keep immutable raw data and pipeline run logs for 1 year.
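One hedged way to meet the 72-hour GDPR constraint: the privacy service writes requests to a small queue table and a scheduled pipeline propagates deletes to each curated table. All names (gdpr.delete_requests, curated.orders) are assumptions; Payments and Fulfillment rows would need an order-to-user mapping captured before the Orders rows are removed.

```sql
-- Propagate pending GDPR delete requests to a curated table, then mark them done
-- so the 72-hour SLA is auditable. In production you would snapshot the pending
-- request IDs first so requests arriving mid-run are not marked prematurely.
DELETE FROM curated.orders
WHERE user_id IN (SELECT user_id FROM gdpr.delete_requests WHERE status = 'PENDING');

UPDATE gdpr.delete_requests
SET status = 'PROPAGATED',
    propagated_at = SYSUTCDATETIME()
WHERE status = 'PENDING';
```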
Interview Prompts
- Conceptual: In ADF, describe the difference between control flow and data flow. Give concrete examples from this scenario.
- Design: Propose how you would structure:
  - A master orchestration pipeline (control flow)
  - Domain-specific pipelines
  - Mapping Data Flows (data flow)
  - Parameterization (date ranges, watermarks, table names)
- Late data: How do you design backfills and rolling reprocessing windows (e.g., last 3 days) without blowing up cost? (A rolling-window sketch follows these prompts.)
- Idempotency: How do you guarantee that rerunning a failed pipeline run doesn’t duplicate rows in Synapse?
- Monitoring: What metrics and alerts do you implement for both control flow and data flow?
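For the late-data prompt, one possible shape of a rolling 72-hour reprocess: recompute only the affected slice from immutable raw and swap it into curated atomically, so a rerun never touches more than three days of data. Window bounds would be ADF pipeline parameters; table names and the dedup rule mirror the earlier sketches and are equally illustrative.

```sql
-- Rolling reprocess window (last 3 days); ADF would inject the bounds as parameters.
DECLARE @window_start DATETIME2 = DATEADD(DAY, -3, CAST(SYSUTCDATETIME() AS DATE));
DECLARE @window_end   DATETIME2 = SYSUTCDATETIME();

BEGIN TRANSACTION;

-- 1) Remove only the affected slice from curated.
DELETE FROM curated.payments
WHERE event_ts >= @window_start AND event_ts < @window_end;

-- 2) Rebuild that slice straight from immutable raw, applying the same dedup rule
--    as the normal load, so late arrivals and duplicates are reconciled together.
INSERT INTO curated.payments (payment_id, order_id, event_type, event_ts, amount, currency)
SELECT payment_id, order_id, event_type, event_ts, amount, currency
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY payment_id, event_type
               ORDER BY event_ts DESC, ingested_at DESC) AS rn
    FROM raw.payments_events
    WHERE event_ts >= @window_start AND event_ts < @window_end
) AS d
WHERE rn = 1;

COMMIT;
```

Because the delete and insert are bounded by the same window, cost scales with the window size rather than the full table, which is what keeps the 72-hour backfill affordable.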
Deliver your answer as a short architecture proposal with key ADF activities, data flow steps, and operational practices.