Context
You’re interviewing with the Risk & Fraud Data Platform team at a large fintech that issues credit cards and offers BNPL. The company operates in the US and EU, processes $40B/year in payment volume, and must meet PCI-DSS and SOX controls. Fraud losses are materially impacted by data freshness: a 30–60 minute delay in feature availability can increase chargeback exposure by 6–10% during attack spikes.
Today the platform is Azure-first and split across two partially overlapping stacks:
- Azure Synapse Analytics is used as the enterprise warehouse for BI (Power BI), regulatory reporting, and curated marts. Most transformations are SQL-based and orchestrated via Synapse pipelines.
- Azure Databricks is used by the ML and data science org for feature engineering, ad-hoc exploration, and some streaming prototypes.
The CTO wants to standardize the next-generation pipeline for transaction and account events on a single “lakehouse-style” architecture that supports both near-real-time fraud features and auditable financial reporting. Your task is to compare Synapse and Databricks, propose when to use each, and design a production pipeline that a team of 6 data engineers can implement in 90 days.
Scale Requirements
- Ingestion throughput: peak 250K events/sec (card auths + declines + reversals) during seasonal spikes; average 60K events/sec.
- Latency:
  - Fraud feature tables: P95 < 3 minutes from event time to availability.
  - BI/reporting marts: hourly refresh; end-of-day close must finish by 02:00 UTC.
- Storage:
  - Raw retention: 13 months in immutable storage for audit.
  - Curated retention: 7 years for financial aggregates.
- Backfill: ability to reprocess 30 days of history within 8 hours.
- Availability: 99.9% for the feature pipeline; graceful degradation if streaming is impaired.
Data Characteristics
Event sources
- Card authorization stream (JSON/Avro): auth_id, card_id, merchant_id, amount, currency, auth_ts, decision, reason_codes[], device_fingerprint, ip, mcc (a schema sketch follows this list).
- Account lifecycle events (CDC): customer profile changes, KYC status, credit limit updates.
- Reference data: merchant risk tiers, BIN tables, FX rates.
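To make the payload concrete, here is one way the authorization event could be expressed as a PySpark schema. The field names come from the list above; the types are assumptions about how the feed would be parsed.

```python
# Hypothetical PySpark schema for the card authorization stream.
# Field names follow the spec above; types are assumptions.
from pyspark.sql.types import (
    ArrayType, DecimalType, StringType, StructField, StructType, TimestampType,
)

AUTH_EVENT_SCHEMA = StructType([
    StructField("auth_id", StringType(), nullable=False),
    StructField("card_id", StringType(), nullable=False),            # PII: tokenize before silver
    StructField("merchant_id", StringType(), nullable=False),
    StructField("amount", DecimalType(18, 4), nullable=False),
    StructField("currency", StringType(), nullable=False),           # ISO 4217 code
    StructField("auth_ts", TimestampType(), nullable=False),         # event time
    StructField("decision", StringType(), nullable=False),           # e.g. APPROVE / DECLINE
    StructField("reason_codes", ArrayType(StringType()), nullable=True),
    StructField("device_fingerprint", StringType(), nullable=True),  # PII
    StructField("ip", StringType(), nullable=True),
    StructField("mcc", StringType(), nullable=True),                 # merchant category code
])
```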
Data quality issues
- Late-arriving events: up to 45 minutes late due to network retries and processor replays (see the deduplication sketch after this list).
- Duplicates: same auth_id can appear 2–5 times.
- Schema evolution: new reason codes and nested attributes added weekly.
- PII: card_id and device_fingerprint are sensitive; EU data residency applies.
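A minimal Structured Streaming sketch of how the lateness bound and duplicate auth_ids might be absorbed on the way to silver. It assumes Delta tables on ADLS Gen2 (all paths are placeholders) and Spark 3.5+ for dropDuplicatesWithinWatermark.

```python
# Sketch: bronze -> silver dedup with late-arrival handling.
# Paths and names are placeholders, not the team's real layout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

silver = (
    spark.readStream.format("delta")
    .load("abfss://bronze@lake.dfs.core.windows.net/auths")
    # Watermark sits just past the observed 45-minute lateness bound.
    .withWatermark("auth_ts", "50 minutes")
    # The same auth_id appears 2-5 times; keep one copy per watermark window
    # (plain dropDuplicates(["auth_id"]) would hold dedup state forever).
    .dropDuplicatesWithinWatermark(["auth_id"])
)

(silver.writeStream
    .format("delta")
    .option("checkpointLocation", "abfss://checkpoints@lake.dfs.core.windows.net/silver_auths")
    .outputMode("append")
    .trigger(processingTime="30 seconds")
    .start("abfss://silver@lake.dfs.core.windows.net/auths"))
```

Note that in this sketch, events older than the watermark are dropped by Spark; a production version would divert them to the quarantine path called for under Requirements.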
Requirements
Functional
- Ingest streaming auth events and CDC account events into a raw zone with immutable auditability.
- Produce deduplicated, schema-validated “silver” tables with late-arrival handling.
- Produce gold outputs:
  - Fraud features (rolling 5m/1h/24h aggregates per card_id/device/merchant; see the aggregation sketch after this list)
  - Finance marts (daily settlement aggregates, chargeback rates)
- Support both streaming (fraud) and batch (finance close + backfills) using the same code paths where possible.
- Provide data quality gates (freshness, completeness, uniqueness) and a quarantine path.
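One possible shape for the 5-minute fraud features, written as a sliding-window streaming aggregation; the 1h/24h variants follow the same pattern. All identifiers and paths are illustrative, and the short watermark is a deliberate tradeoff explained in the comments.

```python
# Sketch: 5-minute sliding-window card features from the silver auth table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

auths = (
    spark.readStream.format("delta")
    .load("abfss://silver@lake.dfs.core.windows.net/auths")
    # Deliberately short watermark: in append mode a window is only emitted
    # once the watermark passes its end, so freshness (P95 < 3 min) wins here;
    # 45-minute stragglers are reconciled by the batch/backfill path instead.
    .withWatermark("auth_ts", "2 minutes")
)

features_5m = (
    auths.groupBy(F.window("auth_ts", "5 minutes", "1 minute"), "card_id")
    .agg(
        F.count("*").alias("auth_count_5m"),
        F.sum("amount").alias("auth_amount_5m"),
        F.sum(F.when(F.col("decision") == "DECLINE", 1).otherwise(0)).alias("decline_count_5m"),
        # Exact distinct counts are not supported in streaming aggregations.
        F.approx_count_distinct("merchant_id").alias("distinct_merchants_5m"),
    )
)

(features_5m.writeStream
    .format("delta")
    .option("checkpointLocation", "abfss://checkpoints@lake.dfs.core.windows.net/features_5m")
    .outputMode("append")
    .start("abfss://gold@lake.dfs.core.windows.net/card_features_5m"))
```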
Non-functional
- Meet PCI/SOX controls: lineage, access controls, reproducibility, and change management.
- Enable self-serve analytics in Power BI with predictable performance.
- Cost guardrails: incremental platform cost target <$120K/month.
Constraints
- Azure is mandatory. Existing contracts include Synapse and Databricks.
- Team skills: strong SQL + Spark; moderate Kafka; heavy Power BI usage.
- Current ingestion is via Event Hubs (not Kafka), landing into ADLS Gen2 (see the ingestion sketch after this list).
- Must support EU residency: EU events cannot leave EU region; cross-region aggregation must be anonymized.
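Event Hubs exposes a Kafka-compatible endpoint, so Spark's built-in Kafka source can consume it without standing up a Kafka cluster the team would then have to operate. The namespace, topic, and connection-string handling below are placeholders.

```python
# Sketch: reading the auth stream from Event Hubs via its Kafka endpoint.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# In practice, pull this from Azure Key Vault / a secret scope.
EH_CONN = "<event-hubs-connection-string>"

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "fraud-ns.servicebus.windows.net:9093")
    .option("subscribe", "card-auths")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="$ConnectionString" password="{EH_CONN}";',
    )
    .option("startingOffsets", "latest")
    .load()
)
```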
Interview Tasks
- Compare Synapse vs Databricks for this scenario. Be explicit about:
  - Streaming maturity and operational model
  - ELT/ETL ergonomics (SQL vs Spark)
  - Governance (catalog, lineage), security, and compliance
  - Performance for BI vs feature engineering
  - Cost drivers and scaling behavior
- Propose a reference architecture (you may choose “Databricks-first lakehouse with Synapse serving” or “Synapse-first with Spark pools”, or a hybrid). Justify the choice.
- Describe how you will handle late data, deduplication, schema evolution, and backfills.
- Define monitoring/alerting and failure recovery, including replay and idempotency (a freshness-gate sketch follows this list).
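As one concrete example of a data quality gate, here is a minimal freshness check against the 3-minute P95 budget; the path, threshold wiring, and failure behavior are all assumptions.

```python
# Hypothetical freshness gate: compare the newest event time in silver
# against the wall clock and fail loudly past the P95 budget.
from datetime import datetime, timezone

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

latest = (
    spark.read.format("delta")
    .load("abfss://silver@lake.dfs.core.windows.net/auths")  # placeholder path
    .agg(F.max("auth_ts").alias("max_ts"))
    .collect()[0]["max_ts"]
)

# Spark returns naive datetimes; auth_ts is assumed to be stored as UTC.
lag = datetime.now(timezone.utc) - latest.replace(tzinfo=timezone.utc)
if lag.total_seconds() > 180:  # 3-minute P95 feature-freshness budget
    raise RuntimeError(f"Silver freshness breach: {lag.total_seconds():.0f}s behind")
```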
Your answer should make it clear when you would use Synapse over Databricks and vice versa, and how you’d keep the platform coherent rather than duplicative.