Context
You’re interviewing with the Data Platform team at PulsePay, a fast-growing fintech offering consumer credit cards and BNPL (buy now, pay later) in the US and EU. PulsePay has 18M monthly active users, processes ~40K card authorization events/sec at peak (avg 8K/sec), and stores ~3.5 PB of historical transaction and customer-support data. The company is under PCI-DSS, SOX, and GDPR obligations, and is preparing for a partner bank audit after a near-miss incident in which a contractor accidentally queried a table containing full PANs (primary account numbers) and SSNs.
Today, PulsePay runs a mixed pipeline: real-time events land in Kafka, are processed by Spark Structured Streaming into an S3-based lake (Parquet), and are loaded into Snowflake for analytics via Snowpipe. Transformations are managed in dbt, orchestrated by Airflow. Access control is inconsistent: Snowflake has some roles, S3 relies on broad IAM policies, and Kafka topics are mostly open to any service account in the “data” AWS account. Analysts want self-serve access to transaction analytics, but PII/PCI data must be tightly restricted and all access must be auditable.
Your task is to design and implement role-based access control (RBAC) and supporting pipeline patterns so that sensitive data is protected end-to-end (ingestion → processing → storage → serving), without breaking existing downstream consumers or slowing delivery.
Scale Requirements
- Streaming ingest: 8K events/sec avg, 40K events/sec peak; event size 1–3 KB JSON.
- Daily volume: ~1.2B events/day (~2–3 TB/day raw), plus batch backfills up to 30 days.
- Latency: P95 event-to-queryable in Snowflake < 5 minutes for non-sensitive analytics tables.
- Users: ~450 internal users (data analysts, DS, finance, risk) + ~120 service accounts.
- Retention: raw events 90 days in S3; curated analytics tables retained indefinitely in Snowflake; audit logs 1 year.
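
A quick sanity check on these figures, as a back-of-envelope sketch only (the 2.5 KB average event size is an assumed midpoint of the 1–3 KB range above):

```python
# Back-of-envelope sizing from the scale requirements above (illustrative only).
EVENTS_PER_DAY = 1.2e9        # stated daily volume
AVG_EVENT_KB = 2.5            # assumed midpoint of the 1-3 KB event size range
PEAK_EVENTS_PER_SEC = 40_000
RAW_RETENTION_DAYS = 90

daily_raw_tb = EVENTS_PER_DAY * AVG_EVENT_KB / 1e9    # KB -> TB (decimal units)
raw_zone_tb = daily_raw_tb * RAW_RETENTION_DAYS       # before Parquet/compression savings
peak_mb_per_sec = PEAK_EVENTS_PER_SEC * 3 / 1_000     # worst case: 3 KB events at peak
implied_avg_eps = EVENTS_PER_DAY / 86_400             # ~14K/s; worth reconciling with the 8K/s avg

print(f"raw per day ≈ {daily_raw_tb:.1f} TB, 90-day raw zone ≈ {raw_zone_tb:.0f} TB")
print(f"peak ingest ≈ {peak_mb_per_sec:.0f} MB/s, implied average ≈ {implied_avg_eps:,.0f} events/s")
```

This is consistent with the stated 2–3 TB/day and suggests the 90-day raw zone alone is on the order of a couple hundred terabytes before compression, which matters when scoping per-zone prefixes and KMS keys.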
Data Characteristics
Key datasets
| Dataset | Example Fields | Sensitivity |
|---|---|---|
| auth_events (Kafka) | event_id, user_id, card_token, merchant_id, amount, ts, ip | PII/PCI-adjacent |
| customer_profile (batch) | user_id, name, email, phone, dob, ssn_last4, address | PII |
| support_tickets | ticket_id, user_id, free_text, attachments_uri | PII may appear |
| transactions_curated (Snowflake) | user_id, merchant_category, amount, country, ts | Mostly non-PII if modeled correctly |
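
For concreteness, a single auth_events record might look roughly like the sketch below; all values are invented, and card_token is assumed to be a vaulted token reference rather than the raw PAN, which should never reach this topic.

```python
# Illustrative auth_events payload (hypothetical values; fields follow the table above).
example_auth_event = {
    "event_id": "evt_8f2c1a7b",
    "user_id": 18273645,
    "card_token": "tok_2c9e41d7a6",   # vaulted token reference, never the raw PAN
    "merchant_id": "mrc_5521",
    "amount": 42.99,
    "ts": "2024-05-14T09:31:22Z",
    "ip": "203.0.113.45",             # PII-adjacent: identifying when combined with user_id
}
```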
Quality and security issues
- Schema drift in Kafka events (driven by differing mobile app versions) and occasionally missing user_id.
- Late-arriving events, delayed up to 2 hours by mobile offline buffering.
- Free-text leakage: support tickets sometimes contain PAN/SSN typed by users.
- Join risk: even if a table has no direct PII, joining on user_id can re-identify individuals.
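
The free-text leakage issue is typically handled by scrubbing ticket text before it lands in any lake zone. Below is a minimal sketch; a production setup would add Luhn validation for card numbers and likely a managed scanner (for example, Amazon Macie over S3 or Amazon Comprehend PII detection) rather than relying on regexes alone.

```python
import re

# Minimal free-text scrubber sketch for support tickets (not production-grade DLP).
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")   # 13-16 digits, allowing spaces/dashes
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")     # SSN in the common dashed format

def redact_free_text(text: str) -> str:
    """Replace card- and SSN-shaped substrings before text reaches raw/clean/curated zones."""
    text = CARD_RE.sub("[REDACTED_CARD]", text)
    return SSN_RE.sub("[REDACTED_SSN]", text)

print(redact_free_text("my card 4111 1111 1111 1111 and ssn 123-45-6789 were declined"))
# -> my card [REDACTED_CARD] and ssn [REDACTED_SSN] were declined
```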
Requirements
Functional
- Implement RBAC for:
- Human users (Okta/SSO groups)
- Service accounts (Airflow, Spark jobs, dbt, BI tools)
- Enforce least privilege across:
- Kafka topics (produce/consume)
- S3 data lake (raw/clean/curated zones)
- Snowflake (databases, schemas, tables, views)
- Provide tiered access:
- Public analytics (aggregated, non-PII)
- Internal analytics (row-level restricted where needed)
- Restricted PII/PCI (very limited, break-glass)
- Support masking/tokenization patterns so that most analytics never touches raw PII (a tokenization sketch follows this list).
- Ensure auditing: who accessed what, when, from where; include failed access attempts.
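
A minimal sketch of the tokenization pattern referenced above: deterministic HMAC-SHA256 tokens give curated tables a stable join key without ever storing the raw identifier. Key handling is simplified here; in practice the key would live in AWS KMS or Secrets Manager and rotate on a schedule.

```python
import hashlib
import hmac

# Deterministic tokenization sketch: same key + same input -> same token, so analysts can
# join across tables on the token while raw PAN/SSN stays only in the restricted zone.
TOKENIZATION_KEY = b"replace-with-secret-from-kms"  # placeholder; never hard-code in real jobs

def tokenize(value: str, field: str) -> str:
    """HMAC-SHA256 over field name + value; including the field name avoids cross-field collisions."""
    message = f"{field}:{value}".encode("utf-8")
    return hmac.new(TOKENIZATION_KEY, message, hashlib.sha256).hexdigest()

# The curated layer stores only the token, never the SSN itself.
print(tokenize("123-45-6789", field="ssn"))
```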
Non-functional
- Minimal disruption: existing dashboards should keep working (or have a migration plan).
- Clear operational model: onboarding/offboarding users within 24 hours.
- Compliance-ready: demonstrate controls for PCI/GDPR and produce audit evidence.
Constraints
- Cloud: AWS + Snowflake; Kafka is MSK.
- Team: 5 data engineers, 1 security engineer shared across org.
- Budget: prefer configuration/managed features over building a custom authorization service.
- Must keep Spark jobs running on EMR; cannot migrate to Databricks this quarter.
What you should design (interview deliverables)
- RBAC model: define roles, role hierarchy, and mapping to Okta groups and service accounts.
- Data zoning + contract: raw/clean/curated schemas and which roles can access each.
- Snowflake strategy: roles, warehouses, databases/schemas, secure views, masking policies, row access policies.
- S3/IAM strategy: bucket layout, prefix policies, KMS keys, and how Spark/Airflow assume roles.
- Kafka strategy: topic-level ACLs, producer/consumer identities, and schema registry permissions.
- Pipeline changes: where to tokenize/mask, how to prevent PII from leaking into curated tables, and how dbt models enforce the contract.
- Monitoring + audit: metrics, alerts, and periodic access reviews.
Be prepared to discuss trade-offs (e.g., masking in Snowflake vs. in Spark, secure views vs. separate tables, row-level policies vs. physical separation), and how you would roll this out safely. Illustrative sketches for several of these deliverables follow below.
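
For the Snowflake deliverable, the sketch below shows one way to express a tiered role hierarchy and a column masking policy, driven from Python purely for illustration. Role, database, and policy names are invented, masking policies assume Snowflake Enterprise edition, and in practice this DDL would be managed as code (Terraform, schemachange, or similar) rather than run ad hoc.

```python
import snowflake.connector

# Illustrative tiered roles + column masking (hypothetical names throughout).
# Analysts on the public/internal tiers see hashed emails; only the break-glass
# PII_RESTRICTED role sees clear text, and every such query is captured in ACCESS_HISTORY.
DDL = [
    "CREATE ROLE IF NOT EXISTS ANALYTICS_PUBLIC",
    "CREATE ROLE IF NOT EXISTS ANALYTICS_INTERNAL",
    "CREATE ROLE IF NOT EXISTS PII_RESTRICTED",
    # Tiered hierarchy: restricted inherits internal, internal inherits public.
    "GRANT ROLE ANALYTICS_PUBLIC TO ROLE ANALYTICS_INTERNAL",
    "GRANT ROLE ANALYTICS_INTERNAL TO ROLE PII_RESTRICTED",
    """
    CREATE MASKING POLICY IF NOT EXISTS CURATED.POLICIES.EMAIL_MASK
      AS (val STRING) RETURNS STRING ->
      CASE WHEN CURRENT_ROLE() = 'PII_RESTRICTED' THEN val ELSE SHA2(val, 256) END
    """,
    "ALTER TABLE CURATED.CORE.CUSTOMER_PROFILE MODIFY COLUMN EMAIL "
    "SET MASKING POLICY CURATED.POLICIES.EMAIL_MASK",
]

conn = snowflake.connector.connect(
    account="pulsepay-xy12345",          # placeholder account locator
    user="SECURITY_ADMIN_SVC",
    authenticator="externalbrowser",     # key-pair auth would be used for service accounts
    role="SECURITYADMIN",
)
cur = conn.cursor()
for stmt in DDL:
    cur.execute(stmt)
cur.close()
conn.close()
```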
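
For the S3/IAM deliverable, the sketch below shows the assume-role pattern a Spark or Airflow task would use. The role ARN, bucket, and prefixes are invented; the actual least-privilege scoping lives in the role's IAM policy and the bucket policy, not in client code.

```python
import boto3

# Sketch: a job assumes a purpose-built role that can only read the curated zone.
# The role's IAM policy (not shown) would allow s3:GetObject/s3:ListBucket on the
# curated/ prefix only, with raw/ and clean/ denied and encrypted under separate KMS keys.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/lake-curated-reader",   # hypothetical role
    RoleSessionName="airflow-dag-transactions-curated",
)["Credentials"]

s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# Reads under curated/ succeed; the same call against raw/ should fail with AccessDenied,
# and both outcomes land in CloudTrail as audit evidence.
resp = s3.list_objects_v2(Bucket="pulsepay-lake", Prefix="curated/transactions/")
print([obj["Key"] for obj in resp.get("Contents", [])])
```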
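
For the Kafka deliverable, assuming SASL/SCRAM or mTLS client principals with Kafka-native ACLs (rather than MSK's IAM access control), topic-level grants could be scripted roughly as follows. Broker, principal, and topic names are invented, and the brokers would need allow.everyone.if.no.acl.found=false for the ACLs to be enforced.

```python
from confluent_kafka.admin import (
    AdminClient, AclBinding, AclOperation, AclPermissionType,
    ResourcePatternType, ResourceType,
)

# Sketch: least-privilege ACLs — the producer may only WRITE auth_events, the Spark
# consumer may only READ it plus its own consumer group (all names hypothetical).
admin = AdminClient({
    "bootstrap.servers": "b-1.pulsepay-msk.example:9096",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "SCRAM-SHA-512",
    "sasl.username": "acl-admin",
    "sasl.password": "placeholder",
})

acls = [
    AclBinding(ResourceType.TOPIC, "auth_events", ResourcePatternType.LITERAL,
               "User:svc-auth-producer", "*", AclOperation.WRITE, AclPermissionType.ALLOW),
    AclBinding(ResourceType.TOPIC, "auth_events", ResourcePatternType.LITERAL,
               "User:svc-spark-streaming", "*", AclOperation.READ, AclPermissionType.ALLOW),
    AclBinding(ResourceType.GROUP, "spark-auth-events", ResourcePatternType.LITERAL,
               "User:svc-spark-streaming", "*", AclOperation.READ, AclPermissionType.ALLOW),
]

for binding, future in admin.create_acls(acls).items():
    future.result()   # raises if the broker rejected the ACL
    print("created ACL:", binding)
```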
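
For the monitoring and audit deliverable, one concrete form of audit evidence is a periodic query over Snowflake's ACCOUNT_USAGE views (these lag real time by up to a few hours, and ACCESS_HISTORY assumes Enterprise edition). The database, role, and account names below are hypothetical.

```python
import snowflake.connector

# Sketch: who read anything in the restricted PII database over the last 7 days.
# Failed logins and denied queries can be pulled similarly from LOGIN_HISTORY and QUERY_HISTORY.
AUDIT_SQL = """
SELECT ah.user_name,
       ah.query_start_time,
       obj.value:objectName::string AS object_name
FROM snowflake.account_usage.access_history AS ah,
     LATERAL FLATTEN(input => ah.direct_objects_accessed) AS obj
WHERE obj.value:objectName::string ILIKE 'PII_RESTRICTED_DB.%'
  AND ah.query_start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY ah.query_start_time DESC
"""

conn = snowflake.connector.connect(
    account="pulsepay-xy12345",
    user="AUDIT_SVC",
    role="SECURITY_AUDITOR",
    authenticator="externalbrowser",
)
cur = conn.cursor()
for user_name, started_at, object_name in cur.execute(AUDIT_SQL):
    print(user_name, started_at, object_name)
cur.close()
conn.close()
```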