Context
FinSight Analytics provides finance and operations reporting for mid-market retailers. A client reports that revenue, order counts, and refund totals differ between Salesforce dashboards, NetSuite exports, and Snowflake-based BI reports. The current stack uses Fivetran for ingestion into Snowflake, dbt for transformations, and Airflow for scheduled reconciliation jobs, but there is no consistent lineage, no freshness SLA enforcement, and no source-of-truth contract.
You are asked to design a pipeline and an investigation approach that pinpoint the first places to check when reporting outputs become inconsistent across systems, and then to harden the platform to prevent recurrence.
Scale Requirements
- Sources: Salesforce, NetSuite, Stripe, PostgreSQL app DB, S3 CSV drops
- Volume: 120M fact rows/day, 2 TB/day raw ingest
- Latency: Finance dashboards must be consistent within 30 minutes of source updates
- Historical retention: 3 years online in Snowflake, 7 years archived in S3
- Reconciliation scope: 250 client-facing metrics across 40 marts
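The 30-minute consistency requirement above implies a freshness check per source: compare each source's latest loaded watermark in Snowflake against the clock and flag breaches. A minimal sketch, assuming hypothetical source names and watermark timestamps (in practice these would come from load-audit metadata):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA from the latency requirement: serving must reflect
# source updates within 30 minutes.
SLA = timedelta(minutes=30)

def freshness_breaches(watermarks, now):
    """Return sources whose latest loaded timestamp lags `now` beyond the SLA.

    `watermarks` maps source name -> max loaded-at timestamp observed
    in the warehouse (illustrative in-memory stand-in).
    """
    return sorted(
        name for name, loaded_at in watermarks.items()
        if now - loaded_at > SLA
    )

now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
watermarks = {
    "salesforce": now - timedelta(minutes=12),  # within SLA
    "netsuite":   now - timedelta(minutes=55),  # breach
    "stripe":     now - timedelta(minutes=31),  # breach
}
print(freshness_breaches(watermarks, now))  # -> ['netsuite', 'stripe']
```

In a real deployment the same comparison would run as a scheduled Airflow task against per-source load-audit tables, alerting before a stale source silently skews the marts.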
Requirements
- Design an ingestion and transformation pipeline that can isolate where metric drift begins: source extraction, load, transformation, or serving.
- Define the first checks you would perform when a client reports inconsistency, including freshness, schema drift, duplicate loads, join cardinality changes, and business logic divergence.
- Build reconciliation tables that compare record counts, checksums, and metric aggregates across systems by business date and source batch.
- Ensure transformations are idempotent and support backfills for corrected source data.
- Add data quality gates before publishing marts used by Looker and client exports.
- Provide monitoring, alerting, and failure recovery for delayed loads, broken models, and mismatched KPI thresholds.
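The reconciliation-table requirement above (record counts, checksums, and metric aggregates by business date and source batch) can be sketched as a batch fingerprint compared across two systems. All row shapes, column names, and the checksum scheme here are illustrative assumptions, not the client's actual schema:

```python
import hashlib

def batch_fingerprint(rows):
    """Summarize a batch of (order_id, amount) rows for one business date:
    row count, revenue aggregate, and an order-insensitive checksum."""
    checksum = 0
    total = 0.0
    for order_id, amount in rows:
        digest = hashlib.sha256(f"{order_id}|{amount:.2f}".encode()).hexdigest()
        checksum ^= int(digest[:16], 16)  # XOR makes the checksum order-insensitive
        total += amount
    return {"row_count": len(rows), "revenue": round(total, 2),
            "checksum": format(checksum, "016x")}

def reconcile(source_rows, target_rows):
    """Compare source vs. target fingerprints; return only the fields that differ."""
    s, t = batch_fingerprint(source_rows), batch_fingerprint(target_rows)
    return {k: (s[k], t[k]) for k in s if s[k] != t[k]}

source = [("o1", 100.00), ("o2", 50.00), ("o3", 25.00)]
target = [("o2", 50.00), ("o1", 100.00)]  # o3 dropped somewhere in the load
print(reconcile(source, target))
```

The empty-diff case localizes drift: if source-vs-raw matches but raw-vs-mart does not, the divergence began in transformation rather than extraction or load.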
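The idempotency and backfill requirement can be illustrated with a merge keyed on a business key, mimicking a Snowflake MERGE: rerunning the same batch or replaying a corrected batch converges to one state instead of duplicating rows. This is an in-memory stand-in with hypothetical column names:

```python
def merge_idempotent(target, batch):
    """Upsert `batch` rows into `target` keyed by order_id. Reruns and
    backfills of the same key converge rather than append duplicates."""
    for row in batch:
        target[row["order_id"]] = row  # last write for a key wins
    return target

target = {}
batch = [{"order_id": "o1", "amount": 100.0, "business_date": "2024-06-01"}]
merge_idempotent(target, batch)
merge_idempotent(target, batch)  # rerun of the same batch: no duplicates
corrected = [{"order_id": "o1", "amount": 90.0, "business_date": "2024-06-01"}]
merge_idempotent(target, corrected)  # backfill with corrected source data
print(len(target), target["o1"]["amount"])  # -> 1 90.0
```

In dbt this maps to an incremental model with a `unique_key`; the point is that replaying any slice of history is safe by construction.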
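The quality-gate requirement before publishing marts can be sketched as named checks that must all pass before a publish step runs; the check names and row shape below are assumptions for illustration:

```python
def run_gates(mart_rows, checks):
    """Run named predicate checks over a mart; return (passed, failures).
    The publish step proceeds only when `passed` is True."""
    failures = [name for name, check in checks.items() if not check(mart_rows)]
    return (not failures, failures)

rows = [
    {"order_id": "o1", "revenue": 100.0},
    {"order_id": "o2", "revenue": -5.0},  # bad row: negative revenue
]
checks = {
    "not_empty": lambda rs: len(rs) > 0,
    "unique_order_id": lambda rs: len({r["order_id"] for r in rs}) == len(rs),
    "non_negative_revenue": lambda rs: all(r["revenue"] >= 0 for r in rs),
}
ok, failed = run_gates(rows, checks)
print(ok, failed)  # -> False ['non_negative_revenue']
```

The same pattern maps onto dbt tests gating a blue/green swap: Looker and client exports only ever read the last table version that passed every gate.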
Constraints
- Must stay on AWS + Snowflake; no full platform rewrite
- Incremental budget limit: $18K/month
- SOX-sensitive finance data requires auditability and reproducible metric definitions
- Source APIs have rate limits and occasional late-arriving corrections
- Team size: 3 data engineers, 1 analytics engineer
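The late-arriving-corrections constraint above means a closed business date can change after the fact; one way to handle it is to diff a fresh extract against what was already loaded and queue only the changed or new keys for reprocessing. A minimal sketch with hypothetical keys and values:

```python
def detect_corrections(loaded, incoming):
    """Diff an already-loaded business date against a fresh extract.
    Returns keys whose values changed (late corrections) and keys that
    are entirely new (late arrivals). Shapes are illustrative."""
    changed = sorted(k for k in incoming if k in loaded and incoming[k] != loaded[k])
    new = sorted(k for k in incoming if k not in loaded)
    return {"changed": changed, "new": new}

loaded = {"o1": 100.0, "o2": 50.0}              # state in the warehouse
incoming = {"o1": 100.0, "o2": 45.0, "o3": 20.0}  # o2 corrected, o3 arrived late
print(detect_corrections(loaded, incoming))
# -> {'changed': ['o2'], 'new': ['o3']}
```

Reprocessing only the diffed keys keeps backfills small, which matters under the API rate limits and the $18K/month incremental budget.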