Context
You’re interviewing with the Data Platform team at PulsePay, a fast-growing fintech offering consumer credit cards and BNPL (buy now, pay later) in the US and EU. PulsePay has 18M monthly active users, processes ~40K card authorization events/sec at peak (avg 8K/sec), and stores ~3.5 PB of historical transaction and customer-support data. The company is under PCI-DSS, SOX, and GDPR obligations, and is preparing for a partner bank audit after a near-miss incident in which a contractor accidentally queried a table containing full PANs (primary account numbers) and SSNs.
Today, PulsePay runs a mixed pipeline: real-time events land in Kafka, are processed by Spark Structured Streaming into an S3-based lake (Parquet), and are loaded into Snowflake for analytics via Snowpipe. Transformations are managed in dbt, orchestrated by Airflow. Access control is inconsistent: Snowflake has some roles, S3 relies on broad IAM policies, and Kafka topics are mostly open to any service account in the “data” AWS account. Analysts want self-serve access to transaction analytics, but PII/PCI data must be tightly restricted and all access must be auditable.
Your task is to design and implement role-based access control (RBAC) and supporting pipeline patterns so that sensitive data is protected end-to-end (ingestion → processing → storage → serving), without breaking existing downstream consumers or slowing delivery.
Scale Requirements
- Streaming ingest: 8K events/sec avg, 40K events/sec peak; event size 1–3 KB JSON.
- Daily volume: ~1.2B events/day (~2–3 TB/day raw), plus batch backfills up to 30 days.
- Latency: P95 event-to-queryable in Snowflake < 5 minutes for non-sensitive analytics tables.
- Users: ~450 internal users (data analysts, DS, finance, risk) + ~120 service accounts.
- Retention: raw events 90 days in S3; curated analytics tables retained indefinitely in Snowflake; audit logs 1 year.
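
A quick sanity check on these figures, as a back-of-envelope sketch only (the 2.5 KB average event size is an assumed midpoint of the 1–3 KB range above):

```python
# Back-of-envelope sizing from the scale requirements above (illustrative only).
EVENTS_PER_DAY = 1.2e9        # stated daily volume
AVG_EVENT_KB = 2.5            # assumed midpoint of the 1-3 KB event size range
PEAK_EVENTS_PER_SEC = 40_000
RAW_RETENTION_DAYS = 90

daily_raw_tb = EVENTS_PER_DAY * AVG_EVENT_KB / 1e9    # KB -> TB (decimal units)
raw_zone_tb = daily_raw_tb * RAW_RETENTION_DAYS       # before Parquet/compression savings
peak_mb_per_sec = PEAK_EVENTS_PER_SEC * 3 / 1_000     # worst case: 3 KB events at peak
implied_avg_eps = EVENTS_PER_DAY / 86_400             # ~14K/s; worth reconciling with the 8K/s avg

print(f"raw per day ≈ {daily_raw_tb:.1f} TB, 90-day raw zone ≈ {raw_zone_tb:.0f} TB")
print(f"peak ingest ≈ {peak_mb_per_sec:.0f} MB/s, implied average ≈ {implied_avg_eps:,.0f} events/s")
```

This is consistent with the stated 2–3 TB/day and suggests the 90-day raw zone alone is on the order of a couple hundred terabytes before compression, which matters when scoping per-zone prefixes and KMS keys.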
Data Characteristics
Key datasets
| Dataset | Example Fields | Sensitivity |
|---|---|---|
| auth_events (Kafka) | event_id, user_id, card_token, merchant_id, amount, ts, ip | PII/PCI-adjacent |
| customer_profile (batch) | user_id, name, email, phone, dob, ssn_last4, address | PII |
| support_tickets | ticket_id, user_id, free_text, attachments_uri | PII may appear |
| transactions_curated (Snowflake) | user_id, merchant_category, amount, country, ts | Mostly non-PII if modeled correctly |
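
For concreteness, a single auth_events record might look roughly like the sketch below; all values are invented, and card_token is assumed to be a vaulted token reference rather than the raw PAN, which should never reach this topic.

```python
# Illustrative auth_events payload (hypothetical values; fields follow the table above).
example_auth_event = {
    "event_id": "evt_8f2c1a7b",
    "user_id": 18273645,
    "card_token": "tok_2c9e41d7a6",   # vaulted token reference, never the raw PAN
    "merchant_id": "mrc_5521",
    "amount": 42.99,
    "ts": "2024-05-14T09:31:22Z",
    "ip": "203.0.113.45",             # PII-adjacent: identifying when combined with user_id
}
```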
Quality and security issues
- Schema drift in Kafka events (driven by differing mobile app versions) and occasionally missing user_id.
- Late-arriving events, delayed up to 2 hours by mobile offline buffering.
- Free-text leakage: support tickets sometimes contain PAN/SSN typed by users.
- Join risk: even if a table has no direct PII, joining on user_id can re-identify individuals.
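
The free-text leakage issue is typically handled by scrubbing ticket text before it lands in any lake zone. Below is a minimal sketch; a production setup would add Luhn validation for card numbers and likely a managed scanner (for example, Amazon Macie over S3 or Amazon Comprehend PII detection) rather than relying on regexes alone.

```python
import re

# Minimal free-text scrubber sketch for support tickets (not production-grade DLP).
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")   # 13-16 digits, allowing spaces/dashes
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")     # SSN in the common dashed format

def redact_free_text(text: str) -> str:
    """Replace card- and SSN-shaped substrings before text reaches raw/clean/curated zones."""
    text = CARD_RE.sub("[REDACTED_CARD]", text)
    return SSN_RE.sub("[REDACTED_SSN]", text)

print(redact_free_text("my card 4111 1111 1111 1111 and ssn 123-45-6789 were declined"))
# -> my card [REDACTED_CARD] and ssn [REDACTED_SSN] were declined
```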
Requirements
Functional
- Implement RBAC for:
- Human users (Okta/SSO groups)
- Service accounts (Airflow, Spark jobs, dbt, BI tools)
- Enforce least privilege across:
- Kafka topics (produce/consume)
- S3 data lake (raw/clean/curated zones)
- Snowflake (databases, schemas, tables, views)
- Provide tiered access:
- Public analytics (aggregated, non-PII)
- Internal analytics (row-level restricted where needed)
- Restricted PII/PCI (very limited, break-glass)
- Support masking/tokenization patterns so that most analytics never touches raw PII (a tokenization sketch follows this list).
- Ensure auditing: who accessed what, when, from where; include failed access attempts.
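
A minimal sketch of the tokenization pattern referenced above: deterministic HMAC-SHA256 tokens give curated tables a stable join key without ever storing the raw identifier. Key handling is simplified here; in practice the key would live in AWS KMS or Secrets Manager and rotate on a schedule.

```python
import hashlib
import hmac

# Deterministic tokenization sketch: same key + same input -> same token, so analysts can
# join across tables on the token while raw PAN/SSN stays only in the restricted zone.
TOKENIZATION_KEY = b"replace-with-secret-from-kms"  # placeholder; never hard-code in real jobs

def tokenize(value: str, field: str) -> str:
    """HMAC-SHA256 over field name + value; including the field name avoids cross-field collisions."""
    message = f"{field}:{value}".encode("utf-8")
    return hmac.new(TOKENIZATION_KEY, message, hashlib.sha256).hexdigest()

# The curated layer stores only the token, never the SSN itself.
print(tokenize("123-45-6789", field="ssn"))
```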
Non-functional
- Minimal disruption: existing dashboards should keep working (or have a migration plan).
- Clear operational model: onboarding/offboarding users within 24 hours.
- Compliance-ready: demonstrate controls for PCI/GDPR and produce audit evidence.
Constraints
- Cloud: AWS + Snowflake; Kafka is MSK.
- Team: 5 data engineers, 1 security engineer shared across org.
- Budget: prefer configuration/managed features over building a custom authorization service.
- Must keep Spark jobs running on EMR; cannot migrate to Databricks this quarter.
What you should design (interview deliverables)
- RBAC model: define roles, role hierarchy, and mapping to Okta groups and service accounts.
- Data zoning + contract: raw/clean/curated schemas and which roles can access each.
- Snowflake strategy: roles, warehouses, databases/schemas, secure views, masking policies, row access policies.
- S3/IAM strategy: bucket layout, prefix policies, KMS keys, and how Spark/Airflow assume roles.
- Kafka strategy: topic-level ACLs, producer/consumer identities, and schema registry permissions.
- Pipeline changes: where to tokenize/mask, how to prevent PII from leaking into curated tables, and how dbt models enforce the contract.
- Monitoring + audit: metrics, alerts, and periodic access reviews.
Be prepared to discuss trade-offs (e.g., masking in Snowflake vs. in Spark, secure views vs. separate tables, row-level policies vs. physical separation), and how you would roll this out safely. Illustrative sketches for several of these deliverables follow below.
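
For the Snowflake deliverable, the sketch below shows one way to express a tiered role hierarchy and a column masking policy, driven from Python purely for illustration. Role, database, and policy names are invented, masking policies assume Snowflake Enterprise edition, and in practice this DDL would be managed as code (Terraform, schemachange, or similar) rather than run ad hoc.

```python
import snowflake.connector

# Illustrative tiered roles + column masking (hypothetical names throughout).
# Analysts on the public/internal tiers see hashed emails; only the break-glass
# PII_RESTRICTED role sees clear text, and every such query is captured in ACCESS_HISTORY.
DDL = [
    "CREATE ROLE IF NOT EXISTS ANALYTICS_PUBLIC",
    "CREATE ROLE IF NOT EXISTS ANALYTICS_INTERNAL",
    "CREATE ROLE IF NOT EXISTS PII_RESTRICTED",
    # Tiered hierarchy: restricted inherits internal, internal inherits public.
    "GRANT ROLE ANALYTICS_PUBLIC TO ROLE ANALYTICS_INTERNAL",
    "GRANT ROLE ANALYTICS_INTERNAL TO ROLE PII_RESTRICTED",
    """
    CREATE MASKING POLICY IF NOT EXISTS CURATED.POLICIES.EMAIL_MASK
      AS (val STRING) RETURNS STRING ->
      CASE WHEN CURRENT_ROLE() = 'PII_RESTRICTED' THEN val ELSE SHA2(val, 256) END
    """,
    "ALTER TABLE CURATED.CORE.CUSTOMER_PROFILE MODIFY COLUMN EMAIL "
    "SET MASKING POLICY CURATED.POLICIES.EMAIL_MASK",
]

conn = snowflake.connector.connect(
    account="pulsepay-xy12345",          # placeholder account locator
    user="SECURITY_ADMIN_SVC",
    authenticator="externalbrowser",     # key-pair auth would be used for service accounts
    role="SECURITYADMIN",
)
cur = conn.cursor()
for stmt in DDL:
    cur.execute(stmt)
cur.close()
conn.close()
```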
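
For the S3/IAM deliverable, the sketch below shows the assume-role pattern a Spark or Airflow task would use. The role ARN, bucket, and prefixes are invented; the actual least-privilege scoping lives in the role's IAM policy and the bucket policy, not in client code.

```python
import boto3

# Sketch: a job assumes a purpose-built role that can only read the curated zone.
# The role's IAM policy (not shown) would allow s3:GetObject/s3:ListBucket on the
# curated/ prefix only, with raw/ and clean/ denied and encrypted under separate KMS keys.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/lake-curated-reader",   # hypothetical role
    RoleSessionName="airflow-dag-transactions-curated",
)["Credentials"]

s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# Reads under curated/ succeed; the same call against raw/ should fail with AccessDenied,
# and both outcomes land in CloudTrail as audit evidence.
resp = s3.list_objects_v2(Bucket="pulsepay-lake", Prefix="curated/transactions/")
print([obj["Key"] for obj in resp.get("Contents", [])])
```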
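
For the Kafka deliverable, assuming SASL/SCRAM or mTLS client principals with Kafka-native ACLs (rather than MSK's IAM access control), topic-level grants could be scripted roughly as follows. Broker, principal, and topic names are invented, and the brokers would need allow.everyone.if.no.acl.found=false for the ACLs to be enforced.

```python
from confluent_kafka.admin import (
    AdminClient, AclBinding, AclOperation, AclPermissionType,
    ResourcePatternType, ResourceType,
)

# Sketch: least-privilege ACLs — the producer may only WRITE auth_events, the Spark
# consumer may only READ it plus its own consumer group (all names hypothetical).
admin = AdminClient({
    "bootstrap.servers": "b-1.pulsepay-msk.example:9096",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "SCRAM-SHA-512",
    "sasl.username": "acl-admin",
    "sasl.password": "placeholder",
})

acls = [
    AclBinding(ResourceType.TOPIC, "auth_events", ResourcePatternType.LITERAL,
               "User:svc-auth-producer", "*", AclOperation.WRITE, AclPermissionType.ALLOW),
    AclBinding(ResourceType.TOPIC, "auth_events", ResourcePatternType.LITERAL,
               "User:svc-spark-streaming", "*", AclOperation.READ, AclPermissionType.ALLOW),
    AclBinding(ResourceType.GROUP, "spark-auth-events", ResourcePatternType.LITERAL,
               "User:svc-spark-streaming", "*", AclOperation.READ, AclPermissionType.ALLOW),
]

for binding, future in admin.create_acls(acls).items():
    future.result()   # raises if the broker rejected the ACL
    print("created ACL:", binding)
```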
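
For the monitoring and audit deliverable, one concrete form of audit evidence is a periodic query over Snowflake's ACCOUNT_USAGE views (these lag real time by up to a few hours, and ACCESS_HISTORY assumes Enterprise edition). The database, role, and account names below are hypothetical.

```python
import snowflake.connector

# Sketch: who read anything in the restricted PII database over the last 7 days.
# Failed logins and denied queries can be pulled similarly from LOGIN_HISTORY and QUERY_HISTORY.
AUDIT_SQL = """
SELECT ah.user_name,
       ah.query_start_time,
       obj.value:objectName::string AS object_name
FROM snowflake.account_usage.access_history AS ah,
     LATERAL FLATTEN(input => ah.direct_objects_accessed) AS obj
WHERE obj.value:objectName::string ILIKE 'PII_RESTRICTED_DB.%'
  AND ah.query_start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY ah.query_start_time DESC
"""

conn = snowflake.connector.connect(
    account="pulsepay-xy12345",
    user="AUDIT_SVC",
    role="SECURITY_AUDITOR",
    authenticator="externalbrowser",
)
cur = conn.cursor()
for user_name, started_at, object_name in cur.execute(AUDIT_SQL):
    print(user_name, started_at, object_name)
cur.close()
conn.close()
```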