Context
FlexPerks, an HR SaaS company serving 12,000 employers, currently stores employee benefits elections in a single transactional PostgreSQL schema optimized for the application. Reporting, payroll exports, and compliance extracts are generated by ad hoc SQL and nightly scripts, which break whenever a new benefit type, carrier rule, or enrollment attribute is introduced.
You need to design a data pipeline and analytics-ready schema that supports a flexible employee benefits system where plans, coverage tiers, dependents, deductions, life events, and carrier-specific attributes evolve frequently without requiring repeated full reloads or schema rewrites.
Scale Requirements
- Employers: 12,000
- Employees: 8M active, 20M historical
- Open enrollment peak: 25K enrollment changes/minute
- Daily batch volume: 150M source rows across HRIS, payroll, carrier, and eligibility feeds
- Latency target: <15 minutes for operational reporting, nightly SLA for payroll exports by 4 AM local employer time
- Storage: 40 TB raw history, 7-year retention for audit/compliance
Requirements
- Design an ingestion and transformation pipeline for HRIS, payroll, carrier eligibility, and enrollment event data.
- Model a flexible benefits schema that supports new benefit types and plan attributes without frequent table redesigns.
- Preserve full history for elections, coverage changes, dependent updates, and payroll deduction changes using auditable SCD/event patterns.
- Support idempotent reprocessing, late-arriving files, and backfills for corrected payroll periods.
- Produce analytics-ready tables for enrollment status, employer-level participation, deduction reconciliation, and carrier export readiness.
- Implement data quality checks for eligibility mismatches, duplicate elections, invalid effective dates, and missing dependent relationships.
- Orchestrate employer-specific batch windows and downstream dependencies reliably.
Constraints
- AWS-first stack; existing systems already use S3 and Airflow.
- HIPAA/PII controls required: encryption, row-level access, audit logs, and least-privilege access.
- Incremental cloud budget increase capped at $35K/month.
- Source systems are inconsistent: SFTP CSV, REST APIs, CDC from PostgreSQL, and carrier files with changing layouts.