Context
AMD validation labs currently rely on engineers to run ad hoc shell and Python scripts after a new server, board, or firmware image is brought online. Results are scattered across log files and shared folders, making it hard to compare runs, detect regressions, or re-run failed checks consistently.
Design a simple script-driven pipeline that automates bring-up sanity checks for AMD lab systems using AMD ROCm validation hosts, AMD EPYC-based lab servers, and a centralized results store. The goal is not a full manufacturing system, but a reliable batch pipeline that executes a standard test pack, captures structured outputs, and publishes pass/fail status within minutes of run completion.
Scale Requirements
- Lab footprint: 300 active systems across 5 labs
- Execution volume: 2,000 sanity runs/day, burst of 150 concurrent runs after firmware drops
- Per-run payload: 20-50 scripts per run, producing ~10 MB of logs and ~2 MB of structured metrics
- Latency target: run completion to dashboard visibility in < 3 minutes
- Retention: raw logs for 180 days, summarized results for 2 years
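The figures above imply rough storage and ingest volumes, and the arithmetic can be sketched as follows. This is an estimate under stated assumptions, not a sizing recommendation: it uses decimal units (1 GB = 1000 MB), takes the upper ~10 MB and ~2 MB per-run sizes at face value, and treats the 2-year summary store as holding the full ~2 MB of metrics per run, which is likely an upper bound. The function and constant names are illustrative.

```python
# Figures taken from the scale requirements; decimal units (1 GB = 1000 MB).
RUNS_PER_DAY = 2_000
LOG_MB_PER_RUN = 10
METRICS_MB_PER_RUN = 2
BURST_CONCURRENT_RUNS = 150
RAW_RETENTION_DAYS = 180
SUMMARY_RETENTION_DAYS = 2 * 365  # assumption: a year is counted as 365 days


def estimate_capacity() -> dict:
    """Return rough daily and retained volumes implied by the requirements."""
    raw_gb_per_day = RUNS_PER_DAY * LOG_MB_PER_RUN / 1000
    summary_gb_per_day = RUNS_PER_DAY * METRICS_MB_PER_RUN / 1000
    burst_mb = BURST_CONCURRENT_RUNS * (LOG_MB_PER_RUN + METRICS_MB_PER_RUN)
    return {
        "raw_gb_per_day": raw_gb_per_day,
        "raw_tb_retained": raw_gb_per_day * RAW_RETENTION_DAYS / 1000,
        "summary_gb_per_day": summary_gb_per_day,
        # Upper bound: assumes summaries are as large as the raw metrics.
        "summary_tb_retained": summary_gb_per_day * SUMMARY_RETENTION_DAYS / 1000,
        # Data produced by one full burst of concurrent runs.
        "burst_upload_gb": burst_mb / 1000,
    }


if __name__ == "__main__":
    for name, value in estimate_capacity().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the raw-log store settles at a few terabytes, which is consistent with the stated preference for plain object storage rather than a large proprietary platform.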
Requirements
- Orchestrate a batch workflow that triggers a standard sanity suite on target lab machines after bring-up events or on-demand requests.
- Support script execution in a controlled order with dependencies, retries, timeouts, and environment setup.
- Collect stdout/stderr, exit codes, hardware metadata, BIOS/firmware versions, and test metrics into structured records.
- Ensure idempotent re-runs so the same bring-up event does not create duplicate result records.
- Store raw artifacts separately from curated run summaries used by dashboards and trend reports.
- Add data quality checks for missing logs, malformed JSON outputs, duplicate run IDs, and incomplete test packs.
- Provide monitoring, alerting, and failure recovery for orchestration and ingestion failures.
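Two of the requirements above, ordered execution with retries and timeouts and idempotent re-runs, can be illustrated with a minimal sketch. It assumes Python 3.9+ on the host that drives the suite, wraps the existing Bash and Python scripts as subprocess commands, and derives a deterministic run ID from fields of the bring-up event. The `Step` fields, the event fields passed to `make_run_id`, and the status strings are hypothetical and not part of any existing lab tooling.

```python
import hashlib
import subprocess
import sys
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Step:
    name: str
    command: list[str]  # e.g. ["bash", "check_bios.sh"] (hypothetical script)
    depends_on: list[str] = field(default_factory=list)
    timeout_s: float = 300.0
    max_attempts: int = 2


def make_run_id(system_id: str, firmware_version: str, event_id: str) -> str:
    """Same bring-up event gives the same ID, so a re-run can overwrite
    its earlier record instead of creating a duplicate."""
    key = "|".join([system_id, firmware_version, event_id])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def run_step(step: Step) -> dict:
    """Run one script, retrying on failure or timeout; return the last attempt."""
    record: dict = {}
    for attempt in range(1, max(step.max_attempts, 1) + 1):
        record = {"step": step.name, "attempts": attempt,
                  "exit_code": None, "stdout": "", "stderr": ""}
        try:
            proc = subprocess.run(step.command, capture_output=True,
                                  text=True, timeout=step.timeout_s)
        except subprocess.TimeoutExpired:
            record["status"] = "timeout"
            continue
        except OSError as exc:  # e.g. script missing or not executable
            record["status"] = "error"
            record["stderr"] = str(exc)
            continue
        record.update(exit_code=proc.returncode, stdout=proc.stdout,
                      stderr=proc.stderr,
                      status="pass" if proc.returncode == 0 else "fail")
        if proc.returncode == 0:
            break
    return record


def run_pack(steps: list[Step]) -> list[dict]:
    """Run steps in dependency order; skip steps whose dependencies did not pass."""
    by_name = {s.name: s for s in steps}
    unknown = {d for s in steps for d in s.depends_on} - set(by_name)
    if unknown:
        raise ValueError(f"unknown dependencies: {sorted(unknown)}")
    order = TopologicalSorter({s.name: s.depends_on for s in steps}).static_order()
    results: dict[str, dict] = {}
    for name in order:
        step = by_name[name]
        blocked = [d for d in step.depends_on if results[d]["status"] != "pass"]
        if blocked:
            results[name] = {"step": name, "status": "skipped", "blocked_by": blocked}
        else:
            results[name] = run_step(step)
    return list(results.values())


if __name__ == "__main__":
    pack = [
        Step("env_setup", [sys.executable, "-c", "print('env ready')"]),
        Step("smoke", [sys.executable, "-c", "print('smoke ok')"],
             depends_on=["env_setup"]),
    ]
    print(make_run_id("lab1-node07", "fw-1.2.3", "event-42"))
    for result in run_pack(pack):
        print(result["step"], result["status"])
```

Using the run ID as the storage key for both the raw artifacts and the curated summary is one way to make a repeated upload an overwrite rather than a new record. A production design would more likely delegate scheduling to an open-source orchestrator and keep only the per-step wrapper.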
Constraints
- Prefer a lightweight design that a small platform team can operate.
- Existing lab scripts are mostly Bash and Python and cannot be rewritten immediately.
- Some lab networks are intermittently disconnected; the pipeline must tolerate delayed uploads.
- Budget favors open-source orchestration and object storage over large proprietary platforms.
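The delayed-upload constraint above suggests a store-and-forward pattern: each lab host writes results to a local spool directory first and a separate flush step uploads them when the network is available. The sketch below is one possible shape under stated assumptions. The `upload` callable stands in for whatever object-storage client is chosen and is hypothetical, as are the function names and the one-JSON-file-per-run layout. Network failures are assumed to surface as `OSError` subclasses such as `ConnectionError` or `TimeoutError`.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def spool_result(spool_dir: Path, run_id: str, summary: dict) -> Path:
    """Write a run summary to the local spool.

    The file is written under a temporary name and then renamed, so the
    flush step never sees a half-written JSON document. Spooling the same
    run_id twice overwrites the earlier file.
    """
    spool_dir.mkdir(parents=True, exist_ok=True)
    final_path = spool_dir / f"{run_id}.json"
    fd, tmp_name = tempfile.mkstemp(dir=spool_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(summary, handle)
    os.replace(tmp_name, final_path)  # atomic when on the same filesystem
    return final_path


def flush_spool(spool_dir: Path,
                upload: Callable[[str, bytes], None]) -> tuple[int, int]:
    """Try to upload every spooled file; return (sent, still_pending).

    A file is deleted only after its upload returns without raising, so an
    outage leaves the spool intact for the next flush.
    """
    sent = pending = 0
    for path in sorted(spool_dir.glob("*.json")):
        try:
            upload(path.stem, path.read_bytes())  # key is the run ID
        except OSError:  # includes ConnectionError and TimeoutError
            pending += 1
            continue
        path.unlink()
        sent += 1
    return sent, pending


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        spool = Path(tmp) / "spool"
        spool_result(spool, "run-0001", {"status": "pass"})
        store: dict[str, bytes] = {}  # in-memory stand-in for object storage
        print(flush_spool(spool, store.__setitem__))
```

Because the upload key is the run ID, a flush that is retried after a partial failure overwrites the same object rather than adding a duplicate. Running the flush from a periodic job (cron or a systemd timer, for example) keeps the design light enough for a small team to operate.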