Context
BlueRail, a B2B SaaS company, runs its ETL and deployment workloads on Linux build servers that host Apache Airflow schedulers, dbt jobs, CI runners, and deployment scripts. Access is currently managed through shared SSH keys, broad sudo privileges, and manually created service accounts. This leaves audit gaps and exposes production pipelines to accidental or unauthorized changes.
You are asked to design a permissions and access-control model for these Linux servers that supports both human operators and automated pipeline jobs while minimizing blast radius and maintaining delivery velocity.
Scale Requirements
- Fleet size: 40 Linux servers across dev, staging, and production
- Users: 25 software engineers (developers), 6 data engineers, 4 SREs, 8 service accounts
- Workload volume: ~3,000 Airflow task runs/day, 500 dbt runs/day, 150 deployments/day
- Access propagation: provisioning or revoking a user must take effect within 15 minutes
- Audit retention: 1 year of command, login, and privilege-escalation logs
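To get a feel for these numbers, the short calculation below turns the stated daily volumes into average rates. It is only a rough sizing sketch: it assumes runs are spread evenly over the day (batch workloads rarely are), treats the approximate figures as exact, and uses variable names that are purely illustrative.

```python
# Figures taken from the scale requirements above.
DAILY_RUNS = {"airflow_tasks": 3000, "dbt_runs": 500, "deployments": 150}
RETENTION_DAYS = 365        # "1 year" of audit retention
PROPAGATION_SLA_MIN = 15    # provisioning/revocation window in minutes

# Assumption: runs are evenly distributed across 24 hours.
runs_per_day = sum(DAILY_RUNS.values())
runs_per_minute = runs_per_day / (24 * 60)
runs_per_year = runs_per_day * RETENTION_DAYS
runs_in_sla_window = runs_per_minute * PROPAGATION_SLA_MIN

print(f"job runs/day: {runs_per_day}")
print(f"average job runs/minute: {runs_per_minute:.2f}")
print(f"job runs over the retention window: {runs_per_year}")
print(f"average job runs in one {PROPAGATION_SLA_MIN}-minute window: {runs_in_sla_window:.0f}")
```

Under these assumptions, roughly 38 job runs start during one 15-minute propagation window, which is the amount of work a revoked service account could still launch before revocation takes effect. About 1.3 million job runs fall inside the one-year audit retention period.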
Requirements
- Design role-based access control for Linux hosts used by ETL orchestration, batch jobs, and deployment automation.
- Separate permissions for developers, data engineers, SREs, and CI/CD service accounts.
- Ensure Airflow, dbt, and deployment jobs run with least privilege and with isolated filesystem and network access.
- Eliminate shared credentials; require centralized authentication and short-lived credentials where possible.
- Provide audit trails for SSH logins, sudo usage, file changes to DAGs/scripts, and secret access.
- Support automated provisioning, offboarding, and periodic access reviews.
- Define monitoring, alerting, and failure recovery for permission drift, unauthorized access attempts, and broken service-account permissions.
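One way to reason about the role separation and drift requirements above is a deny-by-default permission matrix plus a comparison of desired and observed group membership. The sketch below is illustrative only: the role names follow the requirements, but the permission names, the per-environment split, the group names, and the helpers `is_allowed` and `detect_drift` are all hypothetical, not part of any existing BlueRail tooling.

```python
# Hypothetical role -> environment -> permissions matrix.
# Permission names and the split per environment are assumptions.
POLICY = {
    "developer": {
        "dev": {"ssh", "read_logs", "edit_dags"},
        "staging": {"ssh", "read_logs"},
        "prod": {"read_logs"},
    },
    "data_engineer": {
        "dev": {"ssh", "read_logs", "edit_dags", "run_dbt"},
        "staging": {"ssh", "read_logs", "run_dbt"},
        "prod": {"read_logs"},
    },
    "sre": {env: {"ssh", "read_logs", "sudo"} for env in ("dev", "staging", "prod")},
    "ci_service": {env: {"deploy"} for env in ("dev", "staging", "prod")},
}


def is_allowed(role: str, env: str, permission: str) -> bool:
    """Deny by default: unknown roles, environments, or permissions return False."""
    return permission in POLICY.get(role, {}).get(env, set())


def detect_drift(desired: dict[str, set[str]],
                 observed: dict[str, set[str]]) -> dict[str, dict[str, set[str]]]:
    """Compare desired group membership (e.g. from IaC) with what a host reports."""
    drift = {}
    for group in desired.keys() | observed.keys():
        want = desired.get(group, set())
        have = observed.get(group, set())
        unexpected = have - want   # members present on the host but not declared
        missing = want - have      # declared members absent from the host
        if unexpected or missing:
            drift[group] = {"unexpected": unexpected, "missing": missing}
    return drift


# Example with made-up principals.
desired = {"sre": {"alice", "bob"}, "deploy": {"svc-deploy"}}
observed = {"sre": {"alice", "bob", "mallory"}, "deploy": {"svc-deploy"}}

print(is_allowed("developer", "prod", "ssh"))   # False
print(is_allowed("sre", "prod", "sudo"))        # True
print(detect_drift(desired, observed))
```

In this example the drift report contains only the `sre` group, with `mallory` listed as unexpected. A real design would source the desired state from Terraform and the observed state from the hosts themselves, and would route any non-empty report to alerting.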
Constraints
- Infrastructure is AWS-based and uses Ubuntu 22.04 hosts.
- Existing stack includes Apache Airflow 2.x, dbt Core, GitHub Actions runners, Terraform, and CloudWatch.
- The team prefers infrastructure-as-code and cannot add heavy manual approval steps to every deployment.
- Production servers must meet SOC 2 controls for least privilege, auditability, and access review.
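Because manual approvals on every deployment are ruled out, the access-review control is a natural candidate for automation. The sketch below shows one possible shape for a periodic review that flags grants nobody has used recently. The `Grant` record, the `stale_grants` helper, the sample principals, and the 90-day idle threshold are all assumptions chosen for illustration; SOC 2 does not prescribe a specific threshold.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Grant:
    """Hypothetical record of one principal's access to one environment."""
    principal: str
    role: str
    environment: str
    last_used: date


def stale_grants(grants: list[Grant], today: date, max_idle_days: int = 90) -> list[Grant]:
    """Return grants whose last use is older than the idle threshold (assumed 90 days)."""
    cutoff = today - timedelta(days=max_idle_days)
    return [g for g in grants if g.last_used < cutoff]


# Example with made-up data and a fixed review date.
grants = [
    Grant("alice", "sre", "prod", date(2024, 6, 15)),
    Grant("svc-legacy-etl", "ci_service", "prod", date(2024, 3, 1)),
]

for g in stale_grants(grants, today=date(2024, 6, 30)):
    print(f"review: {g.principal} ({g.role}) on {g.environment}, last used {g.last_used}")
```

With the review date fixed at 2024-06-30, the cutoff is 2024-04-01, so only `svc-legacy-etl` is flagged. The output of such a job could feed the periodic access review as evidence, with revocation still applied through the normal infrastructure-as-code path.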