Context
FinLedger, a fintech company, runs batch and streaming data pipelines on AWS using Apache Airflow, dbt, Spark, and Kubernetes. Today, many credentials are stored as CI/CD variables and injected into deployment jobs, but the platform team has found hardcoded secrets in DAGs, leaked .env files in build artifacts, and inconsistent rotation practices across environments.
You need to design a secure approach to handling secrets and sensitive configuration data across both the deployment pipelines and the runtime execution of these data workloads.
Scale Requirements
- Pipelines: 250 Airflow DAGs, 40 dbt jobs, 15 Spark applications
- Deployments: ~120 CI/CD runs per day across dev, staging, and prod
- Secrets: ~180 managed secrets (database passwords, API tokens, Snowflake keys, Kafka SASL credentials)
- Latency target: secret retrieval must add < 200 ms to task startup, on average
- Rotation target: critical credentials rotated every 30 days with zero manual code changes
- Audit retention: 1 year of access logs for compliance reviews
Requirements
- Design a deployment pipeline that never stores plaintext secrets in source control, container images, or CI logs (see the OIDC workflow sketch after this list).
- Separate build-time, deploy-time, and runtime secret access patterns for Airflow, Spark, and dbt workloads.
- Support environment-specific configuration with strict access boundaries between dev, staging, and prod (see the IAM policy sketch after this list).
- Implement automated secret rotation and rollout without requiring DAG or application redeploys where possible (see the Airflow secrets-backend sketch after this list).
- Provide a strategy for short-lived credentials, least-privilege IAM, and service-to-service authentication.
- Define monitoring, auditing, and alerting for secret access anomalies, failed retrievals, and expired credentials.
- Include failure recovery for secrets-manager outages and misconfigured permissions (see the cached-retrieval sketch after this list).
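Illustrative Sketches
The sketches below are non-normative examples of how individual requirements could be met; role names, ARNs, paths, prefixes, and secret names are illustrative assumptions, not prescribed values.

For the no-plaintext-secrets and short-lived-credentials requirements: a minimal GitHub Actions job that federates to AWS via OIDC, so the pipeline stores no long-lived AWS keys as CI/CD variables and receives only temporary STS credentials at deploy time. The role ARN and deploy script are assumptions.

```yaml
name: deploy-prod
on:
  push:
    branches: [main]

# The OIDC token exchange requires id-token: write; no AWS keys are
# stored as repository or environment secrets.
permissions:
  id-token: write
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: prod  # protected environment: required reviewers enforce separation of duties
    steps:
      - uses: actions/checkout@v4
      - name: Assume a short-lived deploy role via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/finledger-deploy-prod  # assumed role name
          aws-region: us-east-1
      - name: Deploy
        run: ./scripts/deploy.sh prod  # hypothetical deploy script; sees only temporary STS credentials
```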
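For runtime access and rotation without redeploys: Airflow 2.x can resolve connections and variables directly from AWS Secrets Manager through the Amazon provider's secrets backend, so a rotated secret is picked up on the next lookup with no DAG change. A minimal configuration sketch, where the prefixes are an assumed naming convention:

```ini
# airflow.cfg -- or set the equivalent AIRFLOW__SECRETS__* environment variables
[secrets]
backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}
```

With this in place, a task referencing conn_id "snowflake_prod" resolves the secret airflow/connections/snowflake_prod at runtime, so rotating that secret in place satisfies the 30-day rotation target without touching DAG code. dbt and Spark jobs can follow the same pattern by reading their credentials from Secrets Manager at job start rather than from baked-in configuration.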
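For environment boundaries: one option is per-environment secret name prefixes (dev/, staging/, prod/ -- an assumed convention) with IAM policies that only grant reads within the matching prefix, attached to each environment's roles (for example via IRSA for the Airflow and Spark service accounts on EKS). A sketch of the prod-side policy, with account ID and region as placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadProdSecretsOnly",
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue", "secretsmanager:DescribeSecret"],
      "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/*"
    }
  ]
}
```

Because dev and staging roles carry no statement matching prod/*, a misdeployed job fails closed rather than silently reading production credentials.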
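For failure recovery: a minimal Python sketch of task-side retrieval with a short in-process cache and a stale-value fallback, so a Secrets Manager outage degrades to serving the last known value instead of failing every task at startup, and cache hits keep retrieval well under the 200 ms budget. The function name and TTL are illustrative.

```python
import time

import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

_client = boto3.client("secretsmanager")
_cache: dict[str, tuple[float, str]] = {}  # secret_id -> (fetched_at, value)
TTL_SECONDS = 300  # serve cached values for 5 minutes between fetches


def get_secret(secret_id: str) -> str:
    """Fetch a secret, preferring the in-process cache; fall back to a
    stale cached value if Secrets Manager is unreachable or denies access."""
    now = time.monotonic()
    hit = _cache.get(secret_id)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]  # fresh enough: no network call at task startup
    try:
        value = _client.get_secret_value(SecretId=secret_id)["SecretString"]
    except (ClientError, EndpointConnectionError):
        if hit:
            return hit[1]  # outage or misconfigured permissions: last known value
        raise  # nothing cached to fall back to; surface the failure
    _cache[secret_id] = (now, value)
    return value
```

AWS also publishes an official caching client for Python (aws-secretsmanager-caching) that could replace this hand-rolled cache; either way, fallback events should feed the alerting required above so misconfigured permissions are surfaced rather than masked.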
Constraints
- AWS is the primary cloud; GitHub Actions is the CI/CD platform.
- Existing tools must remain: Airflow 2.x, dbt Core, Spark on EKS, Snowflake.
- SOC 2 and PCI-related controls require auditability and separation of duties.
- The team is small (3 data engineers, 1 platform engineer), so the solution should minimize operational overhead.