Dataford
Interview Guides
Upgrade
All questions/Pipelines/Ensure Data Accuracy in ETL

Ensure Data Accuracy in ETL

Easy
Pipelines
Asked at 5 companies5Data ModelingETLQuality
Also asked at
BlackRockFutures FirstAMichelin North AmericaPattern

Problem

Context

FinPulse, a personal finance platform, ingests transaction, account balance, and user profile data from partner banks and internal product databases. Its current pipeline uses nightly batch jobs to load CSV and JSON extracts into Snowflake, but analysts frequently find duplicate records, schema drift, and inconsistent aggregates between source systems and warehouse tables.

You are asked to design a reliable batch-first data pipeline that ensures data accuracy and integrity from ingestion through serving. The company wants stronger validation, reproducibility, and recovery without rebuilding its entire AWS-based stack.

Scale Requirements

  • Sources: 25 partner bank SFTP feeds, 8 internal PostgreSQL databases, 3 SaaS APIs
  • Volume: 120M transaction records/day, ~1.5 TB raw data/day
  • Batch cadence: Hourly for internal sources, daily for partner feeds
  • Latency target: Internal data queryable within 30 minutes; partner feeds within 2 hours of arrival
  • Retention: Raw data 1 year, curated warehouse tables 5 years
  • Data quality SLA: 99.9% of records loaded with validated schema and referential integrity checks

Requirements

  1. Design ingestion for mixed file-based and database/API sources into a centralized raw zone.
  2. Implement schema validation, null checks, uniqueness checks, and reconciliation against source row counts and control totals.
  3. Prevent duplicate loads during retries and support idempotent backfills for historical dates.
  4. Build transformation layers for standardized transaction, account, and customer tables in Snowflake.
  5. Orchestrate dependencies across ingestion, validation, transformation, and publishing steps.
  6. Define monitoring, alerting, and failure recovery for late files, bad schemas, partial loads, and downstream model failures.
  7. Preserve auditability: every published table must be traceable to source file, extraction timestamp, and pipeline run ID.

Constraints

  • Existing platform is AWS + Snowflake + Airflow; avoid introducing more than one major new platform.
  • Team size is 3 data engineers and 1 analytics engineer.
  • Budget increase is capped at $15K/month.
  • Must satisfy SOX-style auditability and support PII masking for non-production environments.

Problem

Context

FinPulse, a personal finance platform, ingests transaction, account balance, and user profile data from partner banks and internal product databases. Its current pipeline uses nightly batch jobs to load CSV and JSON extracts into Snowflake, but analysts frequently find duplicate records, schema drift, and inconsistent aggregates between source systems and warehouse tables.

You are asked to design a reliable batch-first data pipeline that ensures data accuracy and integrity from ingestion through serving. The company wants stronger validation, reproducibility, and recovery without rebuilding its entire AWS-based stack.

Scale Requirements

  • Sources: 25 partner bank SFTP feeds, 8 internal PostgreSQL databases, 3 SaaS APIs
  • Volume: 120M transaction records/day, ~1.5 TB raw data/day
  • Batch cadence: Hourly for internal sources, daily for partner feeds
  • Latency target: Internal data queryable within 30 minutes; partner feeds within 2 hours of arrival
  • Retention: Raw data 1 year, curated warehouse tables 5 years
  • Data quality SLA: 99.9% of records loaded with validated schema and referential integrity checks

Requirements

  1. Design ingestion for mixed file-based and database/API sources into a centralized raw zone.
  2. Implement schema validation, null checks, uniqueness checks, and reconciliation against source row counts and control totals.
  3. Prevent duplicate loads during retries and support idempotent backfills for historical dates.
  4. Build transformation layers for standardized transaction, account, and customer tables in Snowflake.
  5. Orchestrate dependencies across ingestion, validation, transformation, and publishing steps.
  6. Define monitoring, alerting, and failure recovery for late files, bad schemas, partial loads, and downstream model failures.
  7. Preserve auditability: every published table must be traceable to source file, extraction timestamp, and pipeline run ID.

Constraints

  • Existing platform is AWS + Snowflake + Airflow; avoid introducing more than one major new platform.
  • Team size is 3 data engineers and 1 analytics engineer.
  • Budget increase is capped at $15K/month.
  • Must satisfy SOX-style auditability and support PII masking for non-production environments.
Your answer
Try one AI text evaluation on us
Get structured feedback, scored against a 4-axis rubric. Premium unlocks unlimited.
0 wordstarget ~200
Up next
Voya FinancialDesign Data Integrity ETL PipelineMediumABuild Integrity-Safe Large-Scale ETLMediumRevDesign Data Quality Controls PipelineEasy
Next question