Context
FinSure, an insurance technology company, ingests customer policy, claims, and payment data from a PostgreSQL OLTP database, partner SFTP drops, and a Kafka topic carrying application events. The current pipeline is a mix of ad hoc Python scripts and nightly SQL jobs, which has led to duplicate loads, inconsistent record counts, and weak controls around PII handling.
You need to redesign the pipeline so analytics and compliance teams can trust the data while meeting security requirements for regulated customer information.
Scale Requirements
- Sources: PostgreSQL (~150 tables), partner SFTP files (~20K files/day), Kafka events (~15K events/sec peak)
- Daily volume: ~1.2 TB raw data/day
- Latency: batch data available within 30 minutes; streaming events available within 2 minutes
- Retention: 7 years for curated warehouse data, 90 days for raw landing zone (a rough storage estimate follows this list)
- Availability target: 99.9% for scheduled loads
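
A quick back-of-envelope storage estimate from the figures above, useful when weighing the budget constraint later in this brief. The 90-day landing-zone math follows directly from the stated numbers; the curated-layer compression ratio is an assumption, not part of the requirements:

```python
# Rough storage footprint implied by the scale requirements above.
# Only the 5x raw-to-curated compression ratio is an assumption.
RAW_TB_PER_DAY = 1.2
RAW_RETENTION_DAYS = 90
CURATED_RETENTION_YEARS = 7
ASSUMED_COMPRESSION = 5  # assumed ratio of raw size to curated columnar size

raw_landing_tb = RAW_TB_PER_DAY * RAW_RETENTION_DAYS                      # ~108 TB rolling
curated_tb = RAW_TB_PER_DAY * 365 * CURATED_RETENTION_YEARS / ASSUMED_COMPRESSION  # ~600+ TB

print(f"Raw landing zone (90-day rolling): {raw_landing_tb:.0f} TB")
print(f"Curated warehouse after 7 years (assumed {ASSUMED_COMPRESSION}x compression): {curated_tb:.0f} TB")
```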
Requirements
- Design an end-to-end pipeline that ingests batch and streaming data into a centralized platform.
- Ensure data integrity through schema validation, deduplication, idempotent loads, reconciliation checks, and audit logging (see the dedup/upsert sketch after this list).
- Ensure data security through encryption in transit and at rest, RBAC, secrets management, PII masking/tokenization, and access auditing (see the tokenization sketch after this list).
- Support CDC or incremental extraction from PostgreSQL and reliable ingestion of partner files.
- Build curated warehouse tables for policy, claims, and payments analytics.
- Define orchestration, monitoring, alerting, and failure recovery.
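
A minimal sketch of how the integrity and incremental-extraction bullets could fit together: a watermark-based incremental query against PostgreSQL, batch-level deduplication that keeps only the newest version of each record, and an upsert so re-running the same load is idempotent. The table and column names (`claims`, `claim_id`, `updated_at`) are illustrative assumptions, and the reconciliation and audit pieces are only noted in comments:

```python
# Sketch, assuming a claims table keyed by claim_id with an updated_at column.
from datetime import datetime, timezone

def dedupe_latest(rows: list[dict], key: str = "claim_id", version_col: str = "updated_at") -> list[dict]:
    """Keep only the newest version of each record in a batch (handles re-delivered files/events)."""
    latest: dict = {}
    for row in rows:
        existing = latest.get(row[key])
        if existing is None or row[version_col] > existing[version_col]:
            latest[row[key]] = row
    return list(latest.values())

def build_incremental_query(table: str, watermark_col: str = "updated_at") -> str:
    """Incremental extract: only rows changed since the last recorded high-watermark."""
    return f"SELECT * FROM {table} WHERE {watermark_col} > %(last_watermark)s ORDER BY {watermark_col}"

def build_upsert(table: str, columns: list[str], key: str = "claim_id") -> str:
    """Idempotent load: re-running the same batch overwrites rather than duplicates rows."""
    col_list = ", ".join(columns)
    placeholders = ", ".join(f"%({c})s" for c in columns)
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != key)
    return (
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) "
        f"ON CONFLICT ({key}) DO UPDATE SET {updates}"
    )

if __name__ == "__main__":
    batch = [
        {"claim_id": 1, "status": "open",   "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
        {"claim_id": 1, "status": "closed", "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    ]
    deduped = dedupe_latest(batch)
    print(build_incremental_query("claims"))
    print(build_upsert("claims", ["claim_id", "status", "updated_at"]))
    # Reconciliation/audit hook: persist the high-watermark plus input/output row counts per run.
    print(f"{len(batch)} rows in, {len(deduped)} after dedup")
```

Recording the extracted high-watermark and the input/output row counts for every run gives the reconciliation checks and audit logging something concrete to compare against source-side counts.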
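
For the PII bullet, one common approach is deterministic tokenization: the same input always yields the same token, so analysts can still join policy, claims, and payment records on tokenized columns without ever seeing raw values. The sketch below uses HMAC-SHA256; in practice the key would come from a secrets manager rather than being hard-coded, and the field list shown is an assumption:

```python
# Sketch of deterministic PII tokenization; the key name and PII field list are assumptions.
import hmac
import hashlib

def tokenize(value: str, key: bytes) -> str:
    """HMAC-SHA256 of the value under a secret key; not reversible without the key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def mask_record(record: dict, pii_fields: set[str], key: bytes) -> dict:
    """Return a copy of the record with PII fields replaced by tokens."""
    return {
        k: tokenize(str(v), key) if k in pii_fields else v
        for k, v in record.items()
    }

if __name__ == "__main__":
    secret = b"fetch-me-from-a-secrets-manager"  # placeholder only; never hard-code in production
    policy = {"policy_id": "P-1001", "ssn": "123-45-6789", "email": "a@example.com", "premium": 120.0}
    print(mask_record(policy, {"ssn", "email"}, secret))
```

A per-customer key variant of the same idea can also help with the 72-hour deletion requirement below: destroying a customer's key renders their tokens unrecoverable (crypto-shredding). That is a design option to weigh, not a given in this brief.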
Constraints
- AWS is the required cloud platform.
- Incremental budget is capped at $18K/month.
- Compliance requirements: SOC 2, GDPR, and support for customer deletion requests within 72 hours.
- Team size: 3 data engineers, so the solution should favor managed services over heavy platform operations.
- Historical backfill of 2 years must be supported without corrupting current production tables (see the staging-and-swap sketch after this list).
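
For the backfill constraint, a hedged sketch of one safe pattern: load the 2-year history into a separate staging table, reconcile it against the source, and only then swap it in atomically so the production table is never left partially written. The DDL is PostgreSQL-flavored for concreteness; the table names and reconciliation step are assumptions:

```python
# Sketch of a staging-and-swap backfill; table names are illustrative.
def backfill_statements(table: str) -> list[str]:
    staging = f"{table}_backfill"
    retired = f"{table}_retired"
    return [
        f"CREATE TABLE {staging} (LIKE {table});",
        f"-- bulk-load {staging} here, then validate row counts/checksums against the source",
        "BEGIN;",
        f"ALTER TABLE {table} RENAME TO {retired};",
        f"ALTER TABLE {staging} RENAME TO {table};",
        "COMMIT;",
        f"-- keep {retired} until reconciliation passes, then drop it",
    ]

if __name__ == "__main__":
    for stmt in backfill_statements("claims_curated"):
        print(stmt)
```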