Context
Northstar Retail, a multi-brand commerce company, already ingests orders, inventory, and customer data from its core platform into Snowflake using Airflow, S3, and dbt. The company is adding a new third-party marketplace partner that exposes order and refund data via a REST API, and analytics teams need this source integrated into the existing pipeline without breaking downstream models.
The new source is semi-structured JSON, has occasional schema drift, and delivers updates with up to 24 hours of delay. Your task is to design how you would onboard this source into the current batch-oriented data platform while preserving data quality, lineage, and recoverability.
Scale Requirements
- API volume: 8M order events/day, 1M refund events/day
- Peak ingestion window: 15K records/min during hourly syncs
- Payload size: 3-8 KB JSON per record
- Latency target: New/updated records available in Snowflake within 30 minutes of extraction from the partner API
- Historical load: 18 months of backfill (~4.5 TB raw JSON)
- Retention: Raw zone 1 year, curated warehouse tables 5 years
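
A quick sizing check helps show what these figures imply once they are combined with the 1,200 requests/hour API limit listed under Constraints. The sketch below is only back-of-the-envelope arithmetic. It assumes one record per event, a quota that is usable around the clock, and requests spread evenly within each hour; the constant and function names are illustrative.

```python
# Figures copied from the lists in this document; names are illustrative.
ORDER_EVENTS_PER_DAY = 8_000_000
REFUND_EVENTS_PER_DAY = 1_000_000
PEAK_RECORDS_PER_MIN = 15_000
PAYLOAD_KB_RANGE = (3, 8)
API_REQUESTS_PER_HOUR = 1_200


def sizing() -> dict:
    """Derive per-request and per-day figures from the stated requirements."""
    events_per_day = ORDER_EVENTS_PER_DAY + REFUND_EVENTS_PER_DAY
    # Assumption: the hourly quota is usable around the clock.
    requests_per_day = API_REQUESTS_PER_HOUR * 24
    # Assumption: requests are spread evenly, so 1,200/hour is 20/minute.
    requests_per_min = API_REQUESTS_PER_HOUR / 60
    low_kb, high_kb = PAYLOAD_KB_RANGE
    return {
        "events_per_day": events_per_day,
        # Average page size needed just to keep up with daily volume.
        "avg_records_per_request": events_per_day / requests_per_day,
        # Page size needed to sustain the stated peak rate.
        "peak_records_per_request": PEAK_RECORDS_PER_MIN / requests_per_min,
        # Decimal units: 1 GB = 1,000,000 KB.
        "raw_gb_per_day": (
            events_per_day * low_kb / 1_000_000,
            events_per_day * high_kb / 1_000_000,
        ),
    }


if __name__ == "__main__":
    for name, value in sizing().items():
        print(f"{name}: {value}")
```

Under these assumptions each request has to return about 313 records on average and about 750 at peak, and the raw zone grows by roughly 27 to 72 GB per day. Page size is therefore a first-order design input, and retries compete with regular extraction for the same quota.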
Requirements
- Ingest full and incremental extracts from the partner REST API into the existing platform.
- Handle late-arriving updates, duplicate deliveries, and API pagination/rate limits.
- Validate schema and data quality before loading curated warehouse tables.
- Support idempotent re-runs and historical backfills without creating duplicate facts.
- Transform raw partner data into standardized orders and refunds models used by downstream dbt marts.
- Provide monitoring, alerting, and operational runbooks for failures.
- Minimize impact on existing Airflow DAGs and Snowflake workloads.
Constraints
- Existing stack is AWS + Airflow + Snowflake + dbt; avoid introducing a large new platform.
- Partner API is limited to 1,200 requests/hour and occasionally returns partial pages.
- PII fields must be encrypted at rest and masked in analytics schemas.
- Incremental cloud spend should stay under $12K/month.
- Team size is 3 data engineers; solution should be maintainable with limited operational overhead.
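
The rate-limit and partial-page constraints above could be handled in the extraction loop along the following lines. This is a sketch under stated assumptions: fetch_page is a hypothetical callable standing in for the real HTTP call, the response shape {"records": [...], "next_cursor": ...} is invented for illustration, and a page is treated as partial when it holds fewer than page_size records while still pointing at a next page. It also assumes that re-requesting the same cursor eventually returns the full page.

```python
import time
from typing import Any, Callable, Dict, Iterator, Optional

API_REQUESTS_PER_HOUR = 1_200
MIN_INTERVAL_S = 3600 / API_REQUESTS_PER_HOUR  # 3.0 seconds between requests

Page = Dict[str, Any]  # hypothetical shape: {"records": [...], "next_cursor": ...}


def fetch_all(
    fetch_page: Callable[[Optional[Any], int], Page],
    page_size: int,
    max_retries: int = 3,
    min_interval_s: float = MIN_INTERVAL_S,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Any]:
    """Yield records page by page, pacing calls and retrying partial pages."""
    cursor = None
    while True:
        for _attempt in range(max_retries + 1):
            page = fetch_page(cursor, page_size)
            # Naive pacing: every call, retries included, spends quota.
            sleep(min_interval_s)
            is_last = page["next_cursor"] is None
            # Accept the page if it is full or if it is the final page.
            if is_last or len(page["records"]) == page_size:
                break
        else:
            # Runs only when no attempt produced an acceptable page.
            raise RuntimeError(f"page at cursor {cursor!r} stayed partial")
        yield from page["records"]
        if is_last:
            return
        cursor = page["next_cursor"]
```

Records are yielded only after a page passes the completeness check, so a partial page never reaches the raw zone; a page that stays partial fails the run loudly instead of silently dropping rows. A fixed sleep is the simplest pacing that stays within 1,200 requests/hour; a token bucket would allow bursts during the hourly sync at the cost of more code.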