Context
Northstar Data Consulting is implementing a new analytics platform for a mid-market retail client. The client currently relies on ad hoc Python scripts, manual CSV uploads, and a legacy SQL Server reporting database; reporting delivery is slow and brittle, and the setup is difficult to support across multiple consulting engagements.
You are asked to design a repeatable tool-selection approach and target pipeline architecture that your consulting team can deploy for this client and reuse for similar implementations.
Scale Requirements
- Sources: Shopify, NetSuite, Salesforce, PostgreSQL, and SFTP-delivered CSV files
- Batch volume: 250 GB/day raw ingest, growing 20% YoY
- Tables/files: ~1,200 source objects, 150 business-critical datasets
- Freshness: finance data every 4 hours; sales and inventory data every 15 minutes
- Users: 80 BI users, 12 analysts, 4 data engineers
- Retention: 2 years hot storage, 7 years archived for audit
Requirements
- Propose how you would choose tools for ingestion, transformation, orchestration, storage, and data quality in a consulting implementation.
- Design a pipeline that supports both ELT for SaaS/database sources and ETL for messy file-based feeds.
- Define criteria for build-vs-buy decisions, including implementation speed, maintainability, client skill set, observability, and total cost of ownership.
- Ensure pipelines are idempotent, support backfills, and allow a new source to be onboarded in less than 3 days (see the idempotent-load sketch after this list).
- Include a monitoring and alerting strategy for failed loads, schema drift, freshness SLA breaches, and data quality regressions (a minimal check is sketched after this list).
- Describe how your design would standardize delivery across clients while allowing client-specific customization (see the declarative source-spec sketch after this list).
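
To make the idempotency and backfill requirement concrete, a minimal sketch of a partition-overwrite load into Redshift is shown below. The table names, connection helper, and partition column are hypothetical; the pattern itself (delete the target date range, then reinsert from staging inside one transaction) is what makes reruns and backfills safe.

```python
# Minimal sketch of an idempotent, backfill-friendly load into Redshift.
# get_redshift_conn, the table names, and the order_date partition column
# are illustrative assumptions, not part of the client's current setup.
from datetime import date

import psycopg2


def get_redshift_conn():
    # Placeholder: in practice, pull credentials from Secrets Manager or an
    # IAM-based helper rather than hard-coding them here.
    return psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="loader",
        password="***",
    )


def load_partition(run_date: date, target: str = "mart.daily_sales",
                   staging: str = "staging.daily_sales") -> None:
    """Delete-then-insert one date partition so a rerun or backfill
    produces the same result as a first run (idempotent)."""
    conn = get_redshift_conn()
    try:
        with conn, conn.cursor() as cur:
            # Both statements share one transaction: a failed insert rolls
            # back the delete, so the target is never left half-loaded.
            cur.execute(
                f"DELETE FROM {target} WHERE order_date = %s", (run_date,)
            )
            cur.execute(
                f"INSERT INTO {target} "
                f"SELECT * FROM {staging} WHERE order_date = %s",
                (run_date,),
            )
    finally:
        conn.close()


if __name__ == "__main__":
    # A backfill is just the same function replayed over a date range.
    for day in (date(2024, 1, 1), date(2024, 1, 2)):
        load_partition(day)
```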
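The monitoring requirement can be anchored in a similarly small check that the client's two post-handoff engineers can run from the orchestrator. The dataset names, SLAs, and alert hook below are assumptions taken from the freshness targets in the brief; the idea is to compare each dataset's latest load timestamp against its SLA and to diff live column sets against a stored schema snapshot, alerting on either condition.

```python
# Sketch of freshness-SLA and schema-drift checks, intended to run as a
# scheduled task. All dataset names, SLAs, and the alert hook are
# illustrative assumptions, not a specific monitoring product's API.
from datetime import datetime, timedelta, timezone

# Hypothetical per-dataset SLAs mirroring the brief: finance every 4 hours,
# sales and inventory every 15 minutes.
FRESHNESS_SLAS = {
    "finance.gl_postings": timedelta(hours=4),
    "sales.orders": timedelta(minutes=15),
    "inventory.stock_levels": timedelta(minutes=15),
}

EXPECTED_SCHEMAS = {
    "sales.orders": {"order_id", "order_date", "customer_id", "amount"},
}


def check_freshness(latest_loaded_at: dict[str, datetime]) -> list[str]:
    """Return alert messages for datasets whose last load breaches its SLA."""
    now = datetime.now(timezone.utc)
    alerts = []
    for dataset, sla in FRESHNESS_SLAS.items():
        loaded_at = latest_loaded_at.get(dataset)
        if loaded_at is None or now - loaded_at > sla:
            alerts.append(f"FRESHNESS: {dataset} exceeded SLA of {sla}")
    return alerts


def check_schema_drift(live_schemas: dict[str, set[str]]) -> list[str]:
    """Return alert messages for columns added or dropped vs. the snapshot."""
    alerts = []
    for dataset, expected in EXPECTED_SCHEMAS.items():
        live = live_schemas.get(dataset, set())
        added, dropped = live - expected, expected - live
        if added or dropped:
            alerts.append(
                f"SCHEMA DRIFT: {dataset} added={sorted(added)} "
                f"dropped={sorted(dropped)}"
            )
    return alerts


def send_alerts(alerts: list[str]) -> None:
    # Placeholder: route to Slack, SNS, or PagerDuty in a real deployment.
    for message in alerts:
        print(message)
```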
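One way to reconcile cross-client standardization with client-specific customization, and to hit the sub-three-day onboarding target, is to drive every source from a small declarative spec consumed by shared loader code. The field names and loader registry below are illustrative assumptions, not an existing framework; the point is that onboarding a new source means writing a spec, not new pipeline code.

```python
# Sketch of a declarative source spec: the consulting team maintains the
# shared loaders, and each client supplies only specs like these (or
# equivalent YAML files). Field names and loaders are illustrative.
from typing import Any, Callable

SOURCES: list[dict[str, Any]] = [
    {
        "name": "shopify_orders",
        "type": "saas_elt",          # ELT path: extract-load, transform in Redshift
        "schedule": "*/15 * * * *",  # 15-minute freshness for sales data
        "incremental_key": "updated_at",
        "target": "raw.shopify_orders",
    },
    {
        "name": "vendor_invoices_csv",
        "type": "sftp_etl",          # ETL path: clean messy files before loading
        "schedule": "0 */4 * * *",   # 4-hour freshness for finance data
        "file_pattern": "invoices_*.csv",
        "target": "raw.vendor_invoices",
        "sox_audited": True,         # flag drives extra logging and retention
    },
]


def load_saas_source(spec: dict[str, Any]) -> None:
    print(f"ELT load for {spec['name']} into {spec['target']}")


def load_sftp_source(spec: dict[str, Any]) -> None:
    print(f"ETL load for {spec['name']} from {spec['file_pattern']}")


# Shared, client-agnostic dispatch; client-specific behavior lives in the specs.
LOADERS: dict[str, Callable[[dict[str, Any]], None]] = {
    "saas_elt": load_saas_source,
    "sftp_etl": load_sftp_source,
}

if __name__ == "__main__":
    for spec in SOURCES:
        LOADERS[spec["type"]](spec)
```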
Constraints
- Client prefers AWS and already has S3 and Redshift contracts
- Incremental tooling budget is capped at $12K/month
- Small support team after handoff: 2 client data engineers
- SOX-related auditability is required for finance datasets
- Consulting team must minimize custom code and avoid tools requiring deep platform specialization