Context
Dun & Bradstreet ingests business records from partner feeds, public filings, web-sourced updates, and internal product refreshes; these records ultimately power D&B Hoovers, D&B Finance Analytics, and the D-U-N-S Number system. Today, validation and enrichment run through a mix of nightly batch jobs and ad hoc replay workflows, creating long recovery times, inconsistent quality checks, and delayed downstream availability.
You are asked to design a new pipeline that can process and validate 500 million business records per day while supporting both large batch loads and near-real-time corrections for high-priority updates.
Scale Requirements
- Daily volume: 500M records/day
- Average record size: 4-8 KB (JSON or Avro)
- Peak ingest: 20K-35K records/sec during regional file drops
- Latency targets:
  - Batch partner files queryable within 2 hours of arrival
  - Priority corrections available within 15 minutes
- Storage: 2-4 TB raw/day, 90-day raw retention, 7-year curated retention (sanity-checked in the sketch after this list)
- Quality target: >= 99.95% successful validation coverage with full audit trail
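These figures hang together; a quick back-of-envelope check in plain Python (illustrative only, using the numbers above) makes the implied rates and footprints explicit:

```python
# Sanity check on the stated scale figures; all inputs come straight
# from the Scale Requirements list above.

DAILY_RECORDS = 500_000_000
RECORD_KB_LOW, RECORD_KB_HIGH = 4, 8

# Raw volume per day: 500M records x 4-8 KB ~= 1.9-3.7 TB,
# matching the stated 2-4 TB/day storage line.
raw_tb_low = DAILY_RECORDS * RECORD_KB_LOW / 1024**3   # KB -> TB
raw_tb_high = DAILY_RECORDS * RECORD_KB_HIGH / 1024**3
print(f"raw/day: {raw_tb_low:.1f}-{raw_tb_high:.1f} TB")

# Average arrival rate vs. the stated peak: the 3-6x gap between
# ~5.8K rec/s average and 20-35K rec/s peak is the burst the
# ingestion layer has to absorb during regional file drops.
avg_rps = DAILY_RECORDS / 86_400
print(f"average: {avg_rps:,.0f} rec/s vs. 20-35K rec/s peak")

# 90-day raw retention lands in the few-hundred-TB range.
print(f"90-day raw footprint: up to ~{raw_tb_high * 90:.0f} TB")
```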
Requirements
- Design an ingestion layer for mixed sources: SFTP file drops, API submissions, and event-based corrections (a common-envelope sketch follows this list).
- Validate schema, mandatory identifiers, address quality, duplicate records, and cross-source conflicts before publishing curated outputs.
- Support both batch ETL/ELT and stream processing in one architecture without duplicating business rules (see the shared-rule sketch below).
- Maintain idempotent reprocessing for late-arriving files, replayed events, and backfills (see the idempotent-publish sketch below).
- Publish validated records to an analytics store and an operational serving layer used by D&B products.
- Define orchestration, dependency management, SLA tracking, and lineage for every dataset (an Airflow example follows this list).
- Include monitoring, alerting, and recovery plans for malformed feeds, partition skew, and downstream load failures (see the metrics sketch below).
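Illustrative sketches for several of the requirements above follow; each is a starting point under stated assumptions, not a prescribed implementation. First, one way to keep the three ingress paths from fragmenting downstream logic is to normalize every record into a single envelope at the edge. The field names below are assumptions for illustration, not a mandated schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

# Illustrative envelope: every source adapter (SFTP file reader, API
# handler, corrections consumer) emits this shape onto the same bus,
# so validation and enrichment never branch on the ingress path.
@dataclass
class RecordEnvelope:
    source: str        # e.g. "partner_sftp", "api", "correction"
    source_ref: str    # file name + offset, request id, or event id
    duns: str | None   # mandatory identifier, checked during validation
    priority: bool     # True routes to the 15-minute correction path
    received_at: str   # ingest timestamp, UTC ISO-8601
    payload: dict[str, Any] = field(default_factory=dict)  # raw record body

def wrap(source: str, source_ref: str, payload: dict[str, Any],
         priority: bool = False) -> RecordEnvelope:
    """Normalize a raw record from any ingress path into the envelope."""
    return RecordEnvelope(
        source=source,
        source_ref=source_ref,
        duns=payload.get("duns"),
        priority=priority,
        received_at=datetime.now(timezone.utc).isoformat(),
        payload=payload,
    )
```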
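To honor the requirement that business rules not be duplicated across batch and stream, validation can live as pure functions registered once and invoked from both engines. A minimal sketch; the rule names and specific checks are illustrative:

```python
import re
from typing import Callable

# Each rule is a pure function: record payload -> error code or None.
# Pure functions can be wrapped in a Spark batch UDF, called from a
# stream operator, or unit-tested directly, without modification.
Rule = Callable[[dict], str | None]
RULES: list[Rule] = []

def rule(fn: Rule) -> Rule:
    """Register a validation rule exactly once, for both paths."""
    RULES.append(fn)
    return fn

@rule
def has_duns(rec: dict) -> str | None:
    # D-U-N-S Numbers are nine digits; anything else fails fast.
    if not re.fullmatch(r"\d{9}", str(rec.get("duns", ""))):
        return "missing_or_malformed_duns"
    return None

@rule
def has_usable_address(rec: dict) -> str | None:
    addr = rec.get("address") or {}
    if not (addr.get("country") and (addr.get("postal_code") or addr.get("city"))):
        return "insufficient_address"
    return None

def validate(rec: dict) -> list[str]:
    """Run every registered rule; an empty list means the record passes."""
    return [err for r in RULES if (err := r(rec)) is not None]
```

Duplicate detection and cross-source conflict resolution need state and would sit in the engine layer, but they can consume the same rule registry's verdicts so the audit trail stays uniform across both paths.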
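Idempotent reprocessing typically reduces to a deterministic content-version key, so that a replayed file or event produces the same write rather than a duplicate. The sketch below pairs such a key with a conditional put into DynamoDB standing in for the operational serving layer; DynamoDB itself, the table schema, and the attribute names are all assumptions for illustration:

```python
import hashlib
import json

from botocore.exceptions import ClientError

def idempotency_key(rec: dict) -> str:
    """Deterministic key: the same source record with the same content
    always hashes the same, so late files, replayed events, and
    backfills collapse into a single logical write."""
    canonical = json.dumps(rec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def publish_operational(table, rec: dict) -> bool:
    """Upsert into the serving table only if this content version is new.
    `table` is a boto3 DynamoDB Table resource with an assumed schema:
    partition key 'duns', attribute 'version_key'."""
    key = idempotency_key(rec)
    try:
        table.put_item(
            # Real code would coerce floats to Decimal for DynamoDB.
            Item={"duns": rec["duns"], "version_key": key, **rec},
            # No-op when this exact version already landed; real code
            # would also guard on a source timestamp so an old replay
            # cannot clobber newer data.
            ConditionExpression=(
                "attribute_not_exists(version_key) OR version_key <> :k"
            ),
            ExpressionAttributeValues={":k": key},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # replay of an already-published version
        raise
```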
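For orchestration with explicit SLA tracking, one option consistent with the AWS-native constraint is Airflow on MWAA, where the two-hour batch target can be declared directly on the tasks. DAG and task names below are illustrative placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Illustrative DAG for one partner feed. The 2-hour latency target is
# attached as a task SLA so misses can page via an sla_miss_callback.
with DAG(
    dag_id="partner_feed_curation",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "sla": timedelta(hours=2),  # batch files queryable within 2h
    },
) as dag:
    land = EmptyOperator(task_id="land_raw")      # write to S3 raw zone
    validate = EmptyOperator(task_id="validate")  # shared rule registry
    curate = EmptyOperator(task_id="curate")      # dedupe + conflict resolution
    publish = EmptyOperator(task_id="publish")    # analytics + serving layers

    # Explicit dependencies double as coarse lineage for the dataset.
    land >> validate >> curate >> publish
```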
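Monitoring the named failure modes largely comes down to emitting per-feed metrics that alarms can watch: failure-count spikes flag malformed feeds, a skew ratio flags hot partitions, and a flatlining success count flags downstream load failures. A boto3 sketch; the namespace, metric, and dimension names are illustrative:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_feed_metrics(feed: str, validated: int, failed: int,
                      skew_ratio: float) -> None:
    """Publish per-feed counters for CloudWatch alarms to sit on."""
    dims = [{"Name": "Feed", "Value": feed}]
    cloudwatch.put_metric_data(
        Namespace="DataPlatform/Ingest",  # illustrative namespace
        MetricData=[
            {"MetricName": "RecordsValidated", "Value": validated,
             "Unit": "Count", "Dimensions": dims},
            {"MetricName": "RecordsFailed", "Value": failed,
             "Unit": "Count", "Dimensions": dims},
            # max partition size / mean partition size, computed upstream
            {"MetricName": "PartitionSkewRatio", "Value": skew_ratio,
             "Unit": "None", "Dimensions": dims},
        ],
    )
```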
Constraints
- Prefer AWS-native infrastructure already common in enterprise data platforms.
- Assume strict auditability, PII handling, and regulator/customer traceability requirements.
- Team size is limited (5-7 engineers), so operational complexity matters.
- Budget should favor managed services where they materially reduce on-call burden.
Provide the target architecture, core data model choices, validation strategy, and orchestration plan, and explain how you would measure correctness, freshness, and cost efficiency.
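One way to pin down those three measures is to fix their formulas up front. A sketch of definitions a response might adopt; the names are illustrative and the thresholds come from the targets above:

```python
def validation_coverage(validated: int, received: int) -> float:
    """Correctness proxy: share of received records that passed
    validation with an audit entry. Target above: >= 99.95%."""
    return validated / received if received else 0.0

def freshness_minutes(published_at: float, arrived_at: float) -> float:
    """Freshness per record: publish time minus arrival time (epoch
    seconds), in minutes. Compare the p95 against 120 min for batch
    files and 15 min for priority corrections."""
    return (published_at - arrived_at) / 60.0

def cost_per_million(records: int, period_cost_usd: float) -> float:
    """Cost efficiency: dollars per million curated records, tracked
    per feed so regressions surface before the monthly bill does."""
    return period_cost_usd / (records / 1_000_000) if records else 0.0
```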