Context
Qlik operates an existing analytics platform that ingests CRM, billing, and product telemetry into a centralized warehouse for Qlik Sense dashboards and downstream Qlik AutoML use cases. A new SaaS support platform must be integrated quickly. The source exposes both REST APIs and webhook events and has evolving schemas, and onboarding it must not disrupt current SLAs.
The current stack uses Qlik Talend Data Integration for ingestion, Apache Airflow for orchestration, Amazon S3 as the landing zone, and Snowflake for serving curated models. Your task is to design the approach for onboarding this new source into the existing system.
Scale Requirements
- Source volume: 120M-record historical backfill, then 3M new records/day
- Webhook rate: 1,500 events/sec peak, 200 events/sec average
- API limits: 5,000 requests/minute, cursor-based pagination
- Latency target: webhook data queryable in Snowflake within 10 minutes; batch sync within 2 hours
- Storage: ~4 TB raw/year in S3, 1.2 TB curated/year in Snowflake
- Retention: raw immutable data for 180 days; curated tables for 3 years
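As a rough sanity check on the figures above, the following sketch turns them into back-of-the-envelope capacity numbers. The API page size is not given in the requirements, so `ASSUMED_PAGE_SIZE` is a placeholder assumption; the function names are illustrative only.

```python
import math

# Figures taken directly from the scale requirements.
BACKFILL_RECORDS = 120_000_000
DAILY_RECORDS = 3_000_000
API_REQUESTS_PER_MIN = 5_000
WEBHOOK_PEAK_PER_SEC = 1_500
WEBHOOK_AVG_PER_SEC = 200
RAW_TB_PER_YEAR = 4.0
RAW_RETENTION_DAYS = 180

# ASSUMPTION: the page size is not stated anywhere; 100 is a placeholder.
ASSUMED_PAGE_SIZE = 100


def sync_minutes(records: int, page_size: int, requests_per_min: int) -> float:
    """Lower bound on API sync time, assuming every request returns a full page
    and the full rate limit is available to this job."""
    pages = math.ceil(records / page_size)
    return pages / requests_per_min


def webhook_events_per_day(avg_per_sec: int) -> int:
    """Daily webhook volume implied by the average event rate."""
    return avg_per_sec * 86_400


def raw_tb_in_retention(tb_per_year: float, retention_days: int) -> float:
    """Approximate raw S3 footprint held at steady state within the retention window."""
    return tb_per_year * retention_days / 365


if __name__ == "__main__":
    print("backfill minutes:", sync_minutes(BACKFILL_RECORDS, ASSUMED_PAGE_SIZE, API_REQUESTS_PER_MIN))
    print("daily sync minutes:", sync_minutes(DAILY_RECORDS, ASSUMED_PAGE_SIZE, API_REQUESTS_PER_MIN))
    print("webhook events/day:", webhook_events_per_day(WEBHOOK_AVG_PER_SEC))
    print("peak/avg ratio:", WEBHOOK_PEAK_PER_SEC / WEBHOOK_AVG_PER_SEC)
    print("raw TB retained:", round(raw_tb_in_retention(RAW_TB_PER_YEAR, RAW_RETENTION_DAYS), 2))
```

Under the assumed page size, the rate limit alone would not be the bottleneck for the 2-hour batch target, while the 7.5x peak-to-average webhook ratio suggests the landing path needs buffering. A smaller real page size would lengthen the backfill proportionally.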
Requirements
- Design ingestion for both historical backfill and incremental loads using Qlik Talend Data Integration.
- Support schema drift without breaking downstream Qlik Sense apps.
- Ensure idempotent processing for retries, duplicate webhooks, and replayed API pages.
- Define raw, standardized, and curated data models, including CDC-style merge logic in Snowflake.
- Orchestrate dependencies between extraction, validation, transformation, and publish steps.
- Implement data quality checks for null spikes, primary key uniqueness, referential integrity, and freshness.
- Provide a rollback and reprocessing strategy for bad loads and late-arriving updates.
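One way to reason about the idempotency and CDC-style merge requirements is a "latest version wins" rule keyed on the business key. The sketch below models that rule in memory; it is not the Snowflake implementation, and the record shape (`ticket_id`, `updated_at`, `event_id`, `status`, `is_deleted`) is hypothetical. It assumes the source supplies a UTC ISO-8601 `updated_at` in a uniform format, so that string comparison matches chronological order.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TicketChange:
    """Hypothetical change record; field names are assumptions, not the source schema."""
    ticket_id: str            # business key used for the merge
    updated_at: str           # UTC ISO-8601 timestamp, assumed uniformly formatted
    event_id: str             # unique id per webhook delivery or API row (tie-breaker)
    status: str
    is_deleted: bool = False  # soft-delete flag kept as a tombstone row


def merge_changes(
    target: Dict[str, TicketChange],
    changes: Iterable[TicketChange],
) -> Dict[str, TicketChange]:
    """Return a new target state where each key holds its newest known version.

    A change is applied only if it is strictly newer than the stored row, so
    duplicates, replayed pages, and late-arriving older updates are no-ops.
    """
    merged = dict(target)
    for change in changes:
        current = merged.get(change.ticket_id)
        incoming_version = (change.updated_at, change.event_id)
        if current is None or incoming_version > (current.updated_at, current.event_id):
            merged[change.ticket_id] = change
    return merged


if __name__ == "__main__":
    batch = [
        TicketChange("T1", "2024-01-01T10:00:00Z", "e1", "open"),
        TicketChange("T1", "2024-01-01T10:05:00Z", "e2", "solved"),
        TicketChange("T1", "2024-01-01T10:00:00Z", "e1", "open"),  # duplicate webhook
    ]
    state = merge_changes({}, batch)
    # Replaying the same batch leaves the state unchanged.
    print(merge_changes(state, batch) == state)
```

In the warehouse, the same guard would typically appear as a timestamp comparison in the matched branch of a MERGE statement. Because the rule depends only on the version tuple, the result does not depend on arrival order, which is what makes retries and reprocessing safe.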
Constraints
- AWS-first environment; no net-new platform outside current approved stack
- Team of 3 data engineers; solution should minimize operational overhead
- PII fields must be masked in non-production and deletable within 72 hours
- Incremental cloud spend should stay below $18K/month
- Existing curated tables cannot experience more than 15 minutes of unavailability during cutover
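For the non-production masking constraint, one possible approach is deterministic keyed hashing, which hides the original value while keeping joins on the masked column working. The sketch below illustrates the idea only; the PII column names and the per-environment secret are assumptions, and it does not address the 72-hour deletion requirement.

```python
import hashlib
import hmac
from typing import Any, Dict, FrozenSet

# ASSUMPTION: hypothetical PII column names; the real list comes from the source schema.
PII_FIELDS: FrozenSet[str] = frozenset({"requester_email", "requester_name", "phone"})


def mask_value(value: str, secret: bytes) -> str:
    """Replace a value with a keyed SHA-256 digest (HMAC), truncated for readability.

    The same input and secret always give the same output, so masked columns
    can still be joined and grouped in non-production.
    """
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"masked_{digest[:16]}"


def mask_record(
    record: Dict[str, Any],
    secret: bytes,
    pii_fields: FrozenSet[str] = PII_FIELDS,
) -> Dict[str, Any]:
    """Return a copy of the record with PII fields masked; NULLs stay NULL."""
    masked: Dict[str, Any] = {}
    for key, value in record.items():
        if key in pii_fields and value is not None:
            masked[key] = mask_value(str(value), secret)
        else:
            masked[key] = value
    return masked


if __name__ == "__main__":
    row = {"ticket_id": "T1", "requester_email": "a@example.com", "phone": None}
    # The secret here is a placeholder; a real one would come from a secrets manager.
    print(mask_record(row, secret=b"non-prod-placeholder-secret"))
```

Using a different secret per environment keeps masked values from being correlated across environments. Whether masking is applied at copy time, as here, or through warehouse-native policies is a design choice the team would need to weigh against operational overhead.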