Context
You’re interviewing with the Data Platform team at MercuryMart, a large e-commerce marketplace operating in North America and the EU. MercuryMart processes orders, payments, and fulfillment events across web and mobile with 25M DAUs and frequent traffic spikes during promotions. The analytics and finance teams rely on a centralized warehouse in Azure Synapse for revenue reporting, fraud monitoring, and regulatory audits (SOX + GDPR). A recent incident caused a $1.8M revenue misstatement because a pipeline silently skipped a subset of late-arriving payment adjustments.
Today, MercuryMart uses Azure Data Factory (ADF) to orchestrate ingestion from multiple sources into ADLS Gen2 and then into Synapse. The current implementation mixes orchestration logic (branching, retries, notifications) with heavy transformations inside a single monolithic pipeline. This has led to brittle deployments, unclear ownership, and poor debuggability—engineers can’t quickly tell whether a failure is due to orchestration (control plane) or transformation/compute (data plane).
Your task is to propose a production-grade ADF design that clearly distinguishes control flow from data flow, and to explain how you would structure pipelines, datasets, linked services, and monitoring so that failures are isolated and recoverable.
Current Architecture (Problem Statement)
| Layer | Current Implementation | Pain Point |
|---|---|---|
| Ingestion | ADF Copy Activity from SQL + REST into ADLS | Late-arriving updates not consistently captured |
| Transform | ADF Mapping Data Flows for cleansing + joins | Expensive runs, hard to test, unclear lineage |
| Load | Stored procedures into Synapse | Non-idempotent loads cause duplicates on retry |
| Orchestration | One pipeline with nested if/foreach + transformations | Failures are opaque; reruns are risky |
Scale Requirements
- Daily data volume: ~20 TB/day raw landing in ADLS Gen2
- Peak ingestion: 120k rows/sec from OLTP (Azure SQL) during business hours
- Files: ~60k files/day from partner SFTP drops (CSV/JSON)
- Freshness SLA:
  - Finance tables: < 30 minutes end-to-end
  - Product analytics: < 2 hours
- Late-arriving data: up to 72 hours late for payment adjustments and refunds
- Retention: raw immutable landing 90 days, curated tables 2 years
Data Characteristics
You ingest three primary domains:
- Orders (Azure SQL)
  - Columns: order_id, user_id, order_ts, status, total_amount, updated_at
  - CDC available via updated_at (no native SQL CDC enabled yet)
- Payments (REST API; see the landing-table sketch after this list)
  - JSON events with payment_id, order_id, event_type, event_ts, amount, currency, ingested_at
  - Duplicates possible (at-least-once delivery)
- Fulfillment (SFTP partner drops)
  - Daily/hourly files, inconsistent schema versions, occasional missing columns
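To ground the Payments contract, here is a minimal sketch of the append-only raw landing table as it might look once staged into Synapse. The schema name, table name, column types, and table options are assumptions for illustration, not part of the scenario.

```sql
-- Hypothetical raw landing table for Payments events (append-only; never updated in place).
-- Column types and distribution/index options are assumptions; tune to the real API contract.
CREATE TABLE raw.payments_events
(
    payment_id   VARCHAR(64)   NOT NULL,
    order_id     VARCHAR(64)   NOT NULL,
    event_type   VARCHAR(32)   NOT NULL,  -- e.g. AUTHORIZED, CAPTURED, REFUNDED
    event_ts     DATETIME2     NOT NULL,  -- when the event happened at the source
    amount       DECIMAL(18,2) NOT NULL,
    currency     CHAR(3)       NOT NULL,
    ingested_at  DATETIME2     NOT NULL,  -- when the row landed; drives late-data windows
    source_file  VARCHAR(400)  NULL       -- lineage back to the ADLS Gen2 path
)
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN);  -- cheap staging-friendly defaults
```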
Quality issues observed:
- Duplicate payment events (same payment_id + event_type); see the duplicate check after this list
- Schema drift in fulfillment files
- Late updates to orders and refunds
- Partial loads when downstream Synapse is under maintenance
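As a first diagnostic for the duplicate-payments issue, a window-function check over the raw events quantifies how many rows violate the (payment_id, event_type) natural key. raw.payments_events is the hypothetical landing table sketched above.

```sql
-- Count rows beyond the first per (payment_id, event_type), treating the latest
-- event as the survivor; the dedup load reuses the same ranking.
WITH ranked AS
(
    SELECT
        payment_id,
        event_type,
        ROW_NUMBER() OVER (
            PARTITION BY payment_id, event_type
            ORDER BY event_ts DESC, ingested_at DESC
        ) AS rn
    FROM raw.payments_events
)
SELECT COUNT(*) AS duplicate_rows
FROM ranked
WHERE rn > 1;
```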
Requirements
Functional
- Explain and implement separation of concerns:
  - What belongs in ADF control flow (pipeline activities) vs ADF data flow (Mapping Data Flow transformations)?
- Build an ingestion + transform + load design that supports:
  - Incremental loads for Orders (based on an updated_at watermark)
  - Deduplication for Payments (idempotent upserts)
  - Schema drift handling for Fulfillment
- Provide a reprocessing strategy for late-arriving data (up to 72 hours) without full reloads.
- Ensure idempotent loads into Synapse so retries do not create duplicates (see the watermark + MERGE sketch after this list).
- Define a data quality gate (row counts, null checks, uniqueness) that can block promotion to curated tables.
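A minimal sketch of the incremental + idempotent pattern the bullets above ask for, assuming a watermark table (etl.watermarks), a staged slice (stg.payments_events), and a curated target (curated.payments); all object names are illustrative. MERGE availability and restrictions vary by Synapse dedicated SQL pool version, so an UPDATE + INSERT pair is the portable fallback.

```sql
-- 1) Incremental extract for Orders. In ADF, a Lookup activity reads the stored
--    watermark and passes both bounds as pipeline parameters; literals keep the
--    sketch self-contained.
DECLARE @last_watermark DATETIME2 = '2024-03-01T00:00:00';
DECLARE @new_watermark  DATETIME2 = '2024-03-01T00:30:00';

SELECT order_id, user_id, order_ts, status, total_amount, updated_at
FROM dbo.orders
WHERE updated_at >  @last_watermark
  AND updated_at <= @new_watermark;   -- closed upper bound so a rerun sees the same slice

-- 2) Idempotent load for Payments: dedupe the staged slice on its natural key,
--    then MERGE so a retry updates rather than duplicates.
MERGE curated.payments AS tgt
USING (
    SELECT payment_id, order_id, event_type, event_ts, amount, currency
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY payment_id, event_type
                   ORDER BY event_ts DESC, ingested_at DESC) AS rn
        FROM stg.payments_events
    ) AS d
    WHERE rn = 1
) AS src
   ON tgt.payment_id = src.payment_id
  AND tgt.event_type = src.event_type
WHEN MATCHED AND src.event_ts >= tgt.event_ts THEN
    UPDATE SET amount = src.amount, currency = src.currency, event_ts = src.event_ts
WHEN NOT MATCHED BY TARGET THEN
    INSERT (payment_id, order_id, event_type, event_ts, amount, currency)
    VALUES (src.payment_id, src.order_id, src.event_type, src.event_ts,
            src.amount, src.currency);

-- 3) A simple quality gate the orchestrator can evaluate before promotion:
--    null or duplicate natural keys in staging block the load.
SELECT
    SUM(CASE WHEN payment_id IS NULL OR event_type IS NULL THEN 1 ELSE 0 END) AS null_keys,
    COUNT(*) - COUNT(DISTINCT CONCAT(payment_id, '|', event_type)) AS duplicate_keys
FROM stg.payments_events;

-- 4) Only after a successful, gated load does the pipeline advance the watermark.
UPDATE etl.watermarks SET last_value = @new_watermark WHERE source = 'orders';
```

The operational point is the last step: advancing the watermark only after a successful, gated load is what makes a failed run safely rerunnable.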
Non-functional
- Observability: end-to-end lineage and metrics to answer “what failed and where?” within 5 minutes.
- Reliability: safe retries, bounded backoff, and clear DLQ/quarantine for bad files/events.
- Cost control: Mapping Data Flows should not run continuously; use the smallest viable compute and avoid unnecessary shuffles.
Constraints
- Must stay on Azure (ADF, ADLS Gen2, Synapse).
- Team skillset: strong SQL + ADF experience; limited Spark expertise.
- Budget: incremental spend capped at $35k/month.
- Compliance:
  - GDPR delete requests must propagate to curated tables within 72 hours (see the delete-propagation sketch after this list).
  - Auditability: keep immutable raw data and pipeline run logs for 1 year.
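One hedged way to meet the 72-hour GDPR constraint: the privacy service writes requests to a small queue table and a scheduled pipeline propagates deletes to each curated table. All names (gdpr.delete_requests, curated.orders) are assumptions; Payments and Fulfillment rows would need an order-to-user mapping captured before the Orders rows are removed.

```sql
-- Propagate pending GDPR delete requests to a curated table, then mark them done
-- so the 72-hour SLA is auditable. In production you would snapshot the pending
-- request IDs first so requests arriving mid-run are not marked prematurely.
DELETE FROM curated.orders
WHERE user_id IN (SELECT user_id FROM gdpr.delete_requests WHERE status = 'PENDING');

UPDATE gdpr.delete_requests
SET status = 'PROPAGATED',
    propagated_at = SYSUTCDATETIME()
WHERE status = 'PENDING';
```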
Interview Prompts
- Conceptual: In ADF, describe the difference between control flow and data flow. Give concrete examples from this scenario.
- Design: Propose how you would structure:
  - A master orchestration pipeline (control flow)
  - Domain-specific pipelines
  - Mapping Data Flows (data flow)
  - Parameterization (date ranges, watermarks, table names)
- Late data: How do you design backfills and rolling reprocessing windows (e.g., last 3 days) without blowing up cost? (A rolling-window sketch follows these prompts.)
- Idempotency: How do you guarantee that rerunning a failed pipeline run doesn’t duplicate rows in Synapse?
- Monitoring: What metrics and alerts do you implement for both control flow and data flow?
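For the late-data prompt, one possible shape of a rolling 72-hour reprocess: recompute only the affected slice from immutable raw and swap it into curated atomically, so a rerun never touches more than three days of data. Window bounds would be ADF pipeline parameters; table names and the dedup rule mirror the earlier sketches and are equally illustrative.

```sql
-- Rolling reprocess window (last 3 days); ADF would inject the bounds as parameters.
DECLARE @window_start DATETIME2 = DATEADD(DAY, -3, CAST(SYSUTCDATETIME() AS DATE));
DECLARE @window_end   DATETIME2 = SYSUTCDATETIME();

BEGIN TRANSACTION;

-- 1) Remove only the affected slice from curated.
DELETE FROM curated.payments
WHERE event_ts >= @window_start AND event_ts < @window_end;

-- 2) Rebuild that slice straight from immutable raw, applying the same dedup rule
--    as the normal load, so late arrivals and duplicates are reconciled together.
INSERT INTO curated.payments (payment_id, order_id, event_type, event_ts, amount, currency)
SELECT payment_id, order_id, event_type, event_ts, amount, currency
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY payment_id, event_type
               ORDER BY event_ts DESC, ingested_at DESC) AS rn
    FROM raw.payments_events
    WHERE event_ts >= @window_start AND event_ts < @window_end
) AS d
WHERE rn = 1;

COMMIT;
```

Because the delete and insert are bounded by the same window, cost scales with the window size rather than the full table, which is what keeps the 72-hour backfill affordable.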
Deliver your answer as a short architecture proposal with key ADF activities, data flow steps, and operational practices.