Context
Dun & Bradstreet ingests business records from partner feeds, public filings, web-sourced updates, and internal product refreshes; these records ultimately power D&B Hoovers, D&B Finance Analytics, and the D-U-N-S Number system. Today, validation and enrichment run through a mix of nightly batch jobs and ad hoc replay workflows, creating long recovery times, inconsistent quality checks, and delayed downstream availability.
You are asked to design a new pipeline that can process and validate 500 million business records per day while supporting both large batch loads and near-real-time corrections for high-priority updates.
Scale Requirements
- Daily volume: 500M records/day
- Average record size: 4-8 KB (JSON or Avro)
- Peak ingest: 20K-35K records/sec during regional file drops
- Latency targets:
  - Batch partner files queryable within 2 hours of arrival
  - Priority corrections available within 15 minutes
- Storage: 2-4 TB raw/day, 90-day raw retention, 7-year curated retention (sanity-checked in the sketch after this list)
- Quality target: >= 99.95% successful validation coverage with full audit trail
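These figures hang together; a quick back-of-envelope check in plain Python (illustrative only, using the numbers above) makes the implied rates and footprints explicit:

```python
# Sanity check on the stated scale figures; all inputs come straight
# from the Scale Requirements list above.

DAILY_RECORDS = 500_000_000
RECORD_KB_LOW, RECORD_KB_HIGH = 4, 8

# Raw volume per day: 500M records x 4-8 KB ~= 1.9-3.7 TB,
# matching the stated 2-4 TB/day storage line.
raw_tb_low = DAILY_RECORDS * RECORD_KB_LOW / 1024**3   # KB -> TB
raw_tb_high = DAILY_RECORDS * RECORD_KB_HIGH / 1024**3
print(f"raw/day: {raw_tb_low:.1f}-{raw_tb_high:.1f} TB")

# Average arrival rate vs. the stated peak: the 3-6x gap between
# ~5.8K rec/s average and 20-35K rec/s peak is the burst the
# ingestion layer has to absorb during regional file drops.
avg_rps = DAILY_RECORDS / 86_400
print(f"average: {avg_rps:,.0f} rec/s vs. 20-35K rec/s peak")

# 90-day raw retention lands in the few-hundred-TB range.
print(f"90-day raw footprint: up to ~{raw_tb_high * 90:.0f} TB")
```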
Requirements
- Design an ingestion layer for mixed sources: SFTP file drops, API submissions, and event-based corrections (a common-envelope sketch follows this list).
- Validate schema, mandatory identifiers, address quality, duplicate records, and cross-source conflicts before publishing curated outputs.
- Support both batch ETL/ELT and stream processing in one architecture without duplicating business rules (see the shared-rule sketch below).
- Maintain idempotent reprocessing for late-arriving files, replayed events, and backfills (see the idempotent-publish sketch below).
- Publish validated records to an analytics store and an operational serving layer used by D&B products.
- Define orchestration, dependency management, SLA tracking, and lineage for every dataset (an Airflow example follows this list).
- Include monitoring, alerting, and recovery plans for malformed feeds, partition skew, and downstream load failures (see the metrics sketch below).
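Illustrative sketches for several of the requirements above follow; each is a starting point under stated assumptions, not a prescribed implementation. First, one way to keep the three ingress paths from fragmenting downstream logic is to normalize every record into a single envelope at the edge. The field names below are assumptions for illustration, not a mandated schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

# Illustrative envelope: every source adapter (SFTP file reader, API
# handler, corrections consumer) emits this shape onto the same bus,
# so validation and enrichment never branch on the ingress path.
@dataclass
class RecordEnvelope:
    source: str        # e.g. "partner_sftp", "api", "correction"
    source_ref: str    # file name + offset, request id, or event id
    duns: str | None   # mandatory identifier, checked during validation
    priority: bool     # True routes to the 15-minute correction path
    received_at: str   # ingest timestamp, UTC ISO-8601
    payload: dict[str, Any] = field(default_factory=dict)  # raw record body

def wrap(source: str, source_ref: str, payload: dict[str, Any],
         priority: bool = False) -> RecordEnvelope:
    """Normalize a raw record from any ingress path into the envelope."""
    return RecordEnvelope(
        source=source,
        source_ref=source_ref,
        duns=payload.get("duns"),
        priority=priority,
        received_at=datetime.now(timezone.utc).isoformat(),
        payload=payload,
    )
```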
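To honor the requirement that business rules not be duplicated across batch and stream, validation can live as pure functions registered once and invoked from both engines. A minimal sketch; the rule names and specific checks are illustrative:

```python
import re
from typing import Callable

# Each rule is a pure function: record payload -> error code or None.
# Pure functions can be wrapped in a Spark batch UDF, called from a
# stream operator, or unit-tested directly, without modification.
Rule = Callable[[dict], str | None]
RULES: list[Rule] = []

def rule(fn: Rule) -> Rule:
    """Register a validation rule exactly once, for both paths."""
    RULES.append(fn)
    return fn

@rule
def has_duns(rec: dict) -> str | None:
    # D-U-N-S Numbers are nine digits; anything else fails fast.
    if not re.fullmatch(r"\d{9}", str(rec.get("duns", ""))):
        return "missing_or_malformed_duns"
    return None

@rule
def has_usable_address(rec: dict) -> str | None:
    addr = rec.get("address") or {}
    if not (addr.get("country") and (addr.get("postal_code") or addr.get("city"))):
        return "insufficient_address"
    return None

def validate(rec: dict) -> list[str]:
    """Run every registered rule; an empty list means the record passes."""
    return [err for r in RULES if (err := r(rec)) is not None]
```

Duplicate detection and cross-source conflict resolution need state and would sit in the engine layer, but they can consume the same rule registry's verdicts so the audit trail stays uniform across both paths.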
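Idempotent reprocessing typically reduces to a deterministic content-version key, so that a replayed file or event produces the same write rather than a duplicate. The sketch below pairs such a key with a conditional put into DynamoDB standing in for the operational serving layer; DynamoDB itself, the table schema, and the attribute names are all assumptions for illustration:

```python
import hashlib
import json

from botocore.exceptions import ClientError

def idempotency_key(rec: dict) -> str:
    """Deterministic key: the same source record with the same content
    always hashes the same, so late files, replayed events, and
    backfills collapse into a single logical write."""
    canonical = json.dumps(rec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def publish_operational(table, rec: dict) -> bool:
    """Upsert into the serving table only if this content version is new.
    `table` is a boto3 DynamoDB Table resource with an assumed schema:
    partition key 'duns', attribute 'version_key'."""
    key = idempotency_key(rec)
    try:
        table.put_item(
            # Real code would coerce floats to Decimal for DynamoDB.
            Item={"duns": rec["duns"], "version_key": key, **rec},
            # No-op when this exact version already landed; real code
            # would also guard on a source timestamp so an old replay
            # cannot clobber newer data.
            ConditionExpression=(
                "attribute_not_exists(version_key) OR version_key <> :k"
            ),
            ExpressionAttributeValues={":k": key},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # replay of an already-published version
        raise
```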
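For orchestration with explicit SLA tracking, one option consistent with the AWS-native constraint is Airflow on MWAA, where the two-hour batch target can be declared directly on the tasks. DAG and task names below are illustrative placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Illustrative DAG for one partner feed. The 2-hour latency target is
# attached as a task SLA so misses can page via an sla_miss_callback.
with DAG(
    dag_id="partner_feed_curation",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "sla": timedelta(hours=2),  # batch files queryable within 2h
    },
) as dag:
    land = EmptyOperator(task_id="land_raw")      # write to S3 raw zone
    validate = EmptyOperator(task_id="validate")  # shared rule registry
    curate = EmptyOperator(task_id="curate")      # dedupe + conflict resolution
    publish = EmptyOperator(task_id="publish")    # analytics + serving layers

    # Explicit dependencies double as coarse lineage for the dataset.
    land >> validate >> curate >> publish
```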
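Monitoring the named failure modes largely comes down to emitting per-feed metrics that alarms can watch: failure-count spikes flag malformed feeds, a skew ratio flags hot partitions, and a flatlining success count flags downstream load failures. A boto3 sketch; the namespace, metric, and dimension names are illustrative:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_feed_metrics(feed: str, validated: int, failed: int,
                      skew_ratio: float) -> None:
    """Publish per-feed counters for CloudWatch alarms to sit on."""
    dims = [{"Name": "Feed", "Value": feed}]
    cloudwatch.put_metric_data(
        Namespace="DataPlatform/Ingest",  # illustrative namespace
        MetricData=[
            {"MetricName": "RecordsValidated", "Value": validated,
             "Unit": "Count", "Dimensions": dims},
            {"MetricName": "RecordsFailed", "Value": failed,
             "Unit": "Count", "Dimensions": dims},
            # max partition size / mean partition size, computed upstream
            {"MetricName": "PartitionSkewRatio", "Value": skew_ratio,
             "Unit": "None", "Dimensions": dims},
        ],
    )
```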
Constraints
- Prefer AWS-native infrastructure already common in enterprise data platforms.
- Assume strict auditability, PII handling, and regulator/customer traceability requirements.
- Team size is limited (5-7 engineers), so operational complexity matters.
- Budget should favor managed services where they materially reduce on-call burden.
Provide the target architecture, core data model choices, validation strategy, and orchestration plan, and explain how you would measure correctness, freshness, and cost efficiency.
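One way to pin down those three measures is to fix their formulas up front. A sketch of definitions a response might adopt; the names are illustrative and the thresholds come from the targets above:

```python
def validation_coverage(validated: int, received: int) -> float:
    """Correctness proxy: share of received records that passed
    validation with an audit entry. Target above: >= 99.95%."""
    return validated / received if received else 0.0

def freshness_minutes(published_at: float, arrived_at: float) -> float:
    """Freshness per record: publish time minus arrival time (epoch
    seconds), in minutes. Compare the p95 against 120 min for batch
    files and 15 min for priority corrections."""
    return (published_at - arrived_at) / 60.0

def cost_per_million(records: int, period_cost_usd: float) -> float:
    """Cost efficiency: dollars per million curated records, tracked
    per feed so regressions surface before the monthly bill does."""
    return period_cost_usd / (records / 1_000_000) if records else 0.0
```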