Context
InsightLoop, a B2B analytics company, ingests customer CRM exports, product usage logs, and billing data into Snowflake for internal reporting and customer-facing dashboards. Today, the team runs nightly Airflow jobs with limited validation, and analysts often find missing records, duplicate rows, and inconsistent dimensions after the data is already published.
You need to design a batch-first data pipeline that enforces data quality before records reach curated analytics tables. The company wants a practical solution that improves trust in analytics without introducing a complex streaming stack.
Scale Requirements
- Sources: Salesforce CSV exports, PostgreSQL transactional DB, Stripe API extracts
- Volume: 250 GB/day raw data, ~120 million rows/day across all sources
- Batch frequency: Hourly ingestion for operational data, nightly full reconciliation
- Latency target: Source to analytics-ready tables within 60 minutes for hourly loads
- Retention: 2 years raw data in object storage, 5 years curated warehouse data
- Data quality SLA: 99.5% of scheduled pipeline runs complete with all critical checks passing
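To make the SLA concrete, the 99.5% pass target can be translated into a monthly failure budget. This is a rough sketch; the run count assumes 24 hourly runs plus one nightly reconciliation per day, which the requirements above imply but do not state exactly:

```python
# Rough failure-budget arithmetic for the 99.5% data quality SLA.
# Run cadence is an assumption: 24 hourly runs + 1 nightly run per day.
runs_per_day = 24 + 1
runs_per_month = runs_per_day * 30            # 750 scheduled runs
sla = 0.995
allowed_failures = runs_per_month * (1 - sla)
print(runs_per_month, round(allowed_failures, 2))  # 750 3.75
```

In other words, under these assumptions the team can tolerate roughly three to four runs per month in which a critical check fails.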
Requirements
- Design ingestion and transformation pipelines for structured batch data from files, databases, and APIs.
- Implement data quality checks for schema validation, null thresholds, uniqueness, referential integrity, freshness, and reconciliation against source counts.
- Prevent bad data from reaching curated tables; define quarantine and reprocessing flows.
- Support idempotent reruns, backfills for historical partitions, and dependency management across datasets.
- Expose data quality results to analysts and on-call engineers with clear pass/fail status.
- Define monitoring, alerting, and failure recovery for ingestion, transformation, and validation stages.
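The validate-then-gate pattern the requirements describe can be sketched as follows. This is a minimal illustration, not a production implementation: the column names, thresholds, and check set are assumptions, and a real pipeline would run equivalent checks as SQL against staging tables in Snowflake.

```python
# Minimal sketch of the validate-then-gate pattern: run critical checks
# on a staged batch, then either promote it to curated tables or quarantine
# it for reprocessing. All names and thresholds below are illustrative.

def run_checks(rows, source_count, null_threshold=0.01):
    """Return a dict of check name -> pass/fail for a staged batch."""
    results = {}
    # Schema validation: every row must carry the expected columns.
    expected = {"id", "account_id", "amount"}
    results["schema"] = all(expected <= row.keys() for row in rows)
    # Null threshold: at most null_threshold of amounts may be missing.
    nulls = sum(1 for r in rows if r.get("amount") is None)
    results["null_rate"] = (nulls / max(len(rows), 1)) <= null_threshold
    # Uniqueness: the primary key must not repeat within the batch.
    ids = [r["id"] for r in rows if "id" in r]
    results["uniqueness"] = len(ids) == len(set(ids))
    # Reconciliation: staged row count must match the source extract count.
    results["reconciliation"] = len(rows) == source_count
    return results

def gate(rows, source_count):
    """Promote the batch only if every critical check passes; else quarantine."""
    results = run_checks(rows, source_count)
    return ("promote" if all(results.values()) else "quarantine"), results

batch = [
    {"id": 1, "account_id": "a1", "amount": 10.0},
    {"id": 2, "account_id": "a2", "amount": 20.0},
]
decision, results = gate(batch, source_count=2)
print(decision)  # promote
```

Quarantined batches stay in a staging area keyed by run ID, so an idempotent rerun or backfill can reprocess the same partition without touching curated tables.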
Constraints
- Existing stack is AWS + Snowflake; prefer managed services and SQL/Python tooling already familiar to data engineers.
- Team size is 3 data engineers, so operational complexity must stay low.
- Budget for incremental infrastructure is limited to $15K/month.
- SOC 2 controls require auditability of pipeline runs, validation outcomes, and manual overrides.
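The SOC 2 constraint above implies an append-only audit trail covering each run, its validation outcome, and any manual override. One hedged sketch of what a single audit entry might contain (field names and the record shape are assumptions, not a mandated schema):

```python
# Illustrative audit record for one pipeline run, supporting SOC 2
# auditability of runs, validation outcomes, and manual overrides.
# All field names below are assumptions for the sketch.
import json
from datetime import datetime, timezone

def audit_record(run_id, dataset, check_results, override_by=None):
    """Build one immutable audit entry for an append-only log table."""
    return {
        "run_id": run_id,
        "dataset": dataset,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "checks": check_results,              # check name -> pass/fail
        "outcome": "pass" if all(check_results.values()) else "fail",
        "manual_override_by": override_by,    # engineer ID if overridden
    }

entry = audit_record("run-2024-01-01T02", "billing_curated",
                     {"schema": True, "reconciliation": True})
print(json.dumps(entry["outcome"]))  # "pass"
```

Writing these entries to an append-only table (and never updating them in place) gives auditors a tamper-evident history of every run and override.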