Context
FinLedger, a fintech company, runs batch and streaming data pipelines on AWS using Apache Airflow, dbt, Spark, and Kubernetes. Today, many credentials are stored as CI/CD variables and injected into deployment jobs, but the platform team has found hardcoded secrets in DAGs, leaked .env files in build artifacts, and inconsistent rotation practices across environments.
You need to design a secure approach to handling secrets and sensitive configuration data across both the deployment pipelines and the runtime execution of these data workloads.
Scale Requirements
- Pipelines: 250 Airflow DAGs, 40 dbt jobs, 15 Spark applications
- Deployments: ~120 CI/CD runs per day across dev, staging, and prod
- Secrets: ~180 managed secrets (database passwords, API tokens, Snowflake keys, Kafka SASL credentials)
- Latency target: secret retrieval must add < 200 ms to task startup, on average
- Rotation target: critical credentials rotated every 30 days with zero manual code changes
- Audit retention: 1 year of access logs for compliance reviews
Requirements
- Design a deployment pipeline that never stores plaintext secrets in source control, container images, or CI logs (see the OIDC workflow sketch after this list).
- Separate build-time, deploy-time, and runtime secret access patterns for Airflow, Spark, and dbt workloads.
- Support environment-specific configuration with strict access boundaries between dev, staging, and prod (see the IAM policy sketch after this list).
- Implement automated secret rotation and rollout without requiring DAG or application redeploys where possible (see the Airflow secrets-backend sketch after this list).
- Provide a strategy for short-lived credentials, least-privilege IAM, and service-to-service authentication.
- Define monitoring, auditing, and alerting for secret access anomalies, failed retrievals, and expired credentials.
- Include failure recovery for secrets-manager outages and misconfigured permissions (see the cached-retrieval sketch after this list).
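Illustrative Sketches
The sketches below are non-normative examples of how individual requirements could be met; role names, ARNs, paths, prefixes, and secret names are illustrative assumptions, not prescribed values.

For the no-plaintext-secrets and short-lived-credentials requirements: a minimal GitHub Actions job that federates to AWS via OIDC, so the pipeline stores no long-lived AWS keys as CI/CD variables and receives only temporary STS credentials at deploy time. The role ARN and deploy script are assumptions.

```yaml
name: deploy-prod
on:
  push:
    branches: [main]

# The OIDC token exchange requires id-token: write; no AWS keys are
# stored as repository or environment secrets.
permissions:
  id-token: write
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: prod  # protected environment: required reviewers enforce separation of duties
    steps:
      - uses: actions/checkout@v4
      - name: Assume a short-lived deploy role via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/finledger-deploy-prod  # assumed role name
          aws-region: us-east-1
      - name: Deploy
        run: ./scripts/deploy.sh prod  # hypothetical deploy script; sees only temporary STS credentials
```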
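For runtime access and rotation without redeploys: Airflow 2.x can resolve connections and variables directly from AWS Secrets Manager through the Amazon provider's secrets backend, so a rotated secret is picked up on the next lookup with no DAG change. A minimal configuration sketch, where the prefixes are an assumed naming convention:

```ini
# airflow.cfg -- or set the equivalent AIRFLOW__SECRETS__* environment variables
[secrets]
backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}
```

With this in place, a task referencing conn_id "snowflake_prod" resolves the secret airflow/connections/snowflake_prod at runtime, so rotating that secret in place satisfies the 30-day rotation target without touching DAG code. dbt and Spark jobs can follow the same pattern by reading their credentials from Secrets Manager at job start rather than from baked-in configuration.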
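For environment boundaries: one option is per-environment secret name prefixes (dev/, staging/, prod/ -- an assumed convention) with IAM policies that only grant reads within the matching prefix, attached to each environment's roles (for example via IRSA for the Airflow and Spark service accounts on EKS). A sketch of the prod-side policy, with account ID and region as placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadProdSecretsOnly",
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue", "secretsmanager:DescribeSecret"],
      "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/*"
    }
  ]
}
```

Because dev and staging roles carry no statement matching prod/*, a misdeployed job fails closed rather than silently reading production credentials.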
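For failure recovery: a minimal Python sketch of task-side retrieval with a short in-process cache and a stale-value fallback, so a Secrets Manager outage degrades to serving the last known value instead of failing every task at startup, and cache hits keep retrieval well under the 200 ms budget. The function name and TTL are illustrative.

```python
import time

import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

_client = boto3.client("secretsmanager")
_cache: dict[str, tuple[float, str]] = {}  # secret_id -> (fetched_at, value)
TTL_SECONDS = 300  # serve cached values for 5 minutes between fetches


def get_secret(secret_id: str) -> str:
    """Fetch a secret, preferring the in-process cache; fall back to a
    stale cached value if Secrets Manager is unreachable or denies access."""
    now = time.monotonic()
    hit = _cache.get(secret_id)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]  # fresh enough: no network call at task startup
    try:
        value = _client.get_secret_value(SecretId=secret_id)["SecretString"]
    except (ClientError, EndpointConnectionError):
        if hit:
            return hit[1]  # outage or misconfigured permissions: last known value
        raise  # nothing cached to fall back to; surface the failure
    _cache[secret_id] = (now, value)
    return value
```

AWS also publishes an official caching client for Python (aws-secretsmanager-caching) that could replace this hand-rolled cache; either way, fallback events should feed the alerting required above so misconfigured permissions are surfaced rather than masked.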
Constraints
- AWS is the primary cloud; GitHub Actions is the CI/CD platform.
- Existing tools must remain: Airflow 2.x, dbt Core, Spark on EKS, Snowflake.
- SOC 2 and PCI-related controls require auditability and separation of duties.
- The team is small (3 data engineers, 1 platform engineer), so the solution should minimize operational overhead.