Context
NorthBridge Securities runs daily trading workflows that prepare reference data, ingest prior-day executions, calculate positions and P&L, and publish files to downstream risk and finance systems before the market opens. Today, these jobs are run through manually managed cron scripts on EC2, which leads to missed dependencies, inconsistent reruns, and limited visibility during failures.
You need to design a production-grade scheduling and orchestration solution for recurring batch jobs that supports daily trading operations with strict cutoffs and reliable recovery.
Scale Requirements
- Job volume: 120 scheduled jobs per trading day across 8 workflows
- Data volume: 250 GB/day of CSV, JSON, and Parquet from brokers, market data vendors, and internal OMS systems
- Critical SLA: All pre-market datasets available by 06:30 ET on trading days
- Latency: Retries of critical jobs and re-evaluation of dependencies should begin within 2 minutes of the triggering event (a task failure or an upstream arrival)
- Retention: 7 years of audit logs and run metadata for compliance
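
To make the 06:30 ET cutoff concrete, the following is a minimal sketch of an SLA-risk check, not a prescribed design. It assumes all timestamps are already expressed as ET wall-clock time, and the function names and the 10-minute safety buffer are hypothetical choices made for illustration.

```python
from datetime import date, datetime, timedelta

# Assumption: every datetime here is ET wall-clock time; conversion from
# UTC or server-local time is expected to happen before these calls.

def sla_deadline(trade_date: date) -> datetime:
    """Return the 06:30 ET cutoff for the given trading day."""
    return datetime(trade_date.year, trade_date.month, trade_date.day, 6, 30)

def sla_at_risk(now: datetime,
                remaining_runtime: timedelta,
                buffer: timedelta = timedelta(minutes=10)) -> bool:
    """True if the projected finish time plus a safety buffer passes the cutoff.

    The 10-minute default buffer is an illustrative value, not a requirement.
    """
    projected_finish = now + remaining_runtime + buffer
    return projected_finish > sla_deadline(now.date())
```

A monitor could evaluate this check every minute or two for each critical workflow, using an estimate of the remaining runtime on the critical path, and raise an alert as soon as it returns True.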
Requirements
- Design a scheduler that supports daily, hourly, and market-calendar-based recurring jobs, including holidays and early market close days.
- Model job dependencies across ingestion, validation, transformation, and publishing steps.
- Ensure idempotent reruns for failed jobs without duplicating downstream data or reports.
- Support backfills for missed trading dates and selective reruns for a single portfolio or broker feed.
- Include data quality gates before downstream publishing, such as row-count checks, schema validation, and reconciliation against source control totals.
- Provide operational monitoring, alerting, and an auditable run history for compliance and incident review.
- Show how business users and operators can see workflow state, SLA risk, and failed task context.
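
Several of the requirements above can be illustrated with a small, tool-agnostic sketch: a market-calendar check, a dependency ordering across the four step types, a deterministic run key for idempotent reruns and selective backfills, and a simple quality gate. All names, the calendar entries, and the key format are assumptions made for illustration; a real design would load the calendar from an authoritative exchange source and would likely rely on a managed orchestrator rather than hand-rolled code.

```python
from datetime import date, time
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Illustrative calendar entries only; not a complete holiday schedule.
HOLIDAYS = {date(2024, 7, 4)}
EARLY_CLOSE = {date(2024, 7, 3): time(13, 0)}
REGULAR_CLOSE = time(16, 0)

def is_trading_day(d: date) -> bool:
    """Weekdays that are not listed as market holidays."""
    return d.weekday() < 5 and d not in HOLIDAYS

def market_close(d: date) -> time:
    """Close time in ET, honoring early-close days."""
    return EARLY_CLOSE.get(d, REGULAR_CLOSE)

# Each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "ingest": set(),
    "validate": {"ingest"},
    "transform": {"validate"},
    "publish": {"transform"},
}

def execution_order(deps: dict) -> list:
    """Return tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())

def run_key(workflow: str, trade_date: date, partition: str = "all") -> str:
    """Deterministic key: a rerun for the same inputs overwrites the same
    output location instead of appending a duplicate."""
    return f"{workflow}/{trade_date.isoformat()}/{partition}"

def quality_gate(row_count: int, expected_rows: int,
                 total: float, control_total: float,
                 tolerance: float = 0.0) -> bool:
    """Pass only if row counts match and the total reconciles to the
    source control total within the given tolerance."""
    return row_count == expected_rows and abs(total - control_total) <= tolerance
```

Keying every output by workflow, trading date, and partition (for example a single broker feed or portfolio) is what lets a backfill or selective rerun replace exactly one slice without touching the rest of the day's data.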
Constraints
- Infrastructure must remain primarily on AWS using managed services where possible.
- Budget allows only modest expansion beyond the current EC2 + S3 + Redshift footprint.
- The platform must satisfy SOX-style auditability and preserve immutable execution logs.
- Some upstream broker files may arrive late or be re-sent with corrected records.
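
One possible way to handle late or re-sent broker files is to track each delivery by content hash, so that an identical re-send is ignored and a corrected file is recorded as a new version that triggers a targeted rerun. The sketch below is an in-memory illustration under that assumption; the class name, the return labels, and the storage model are hypothetical, and a production version would persist this state durably for audit purposes.

```python
import hashlib

class FeedRegistry:
    """Tracks file versions per (broker, trade_date) by SHA-256 digest."""

    def __init__(self):
        # (broker, trade_date) -> list of digests in arrival order
        self._versions = {}

    def register(self, broker: str, trade_date: str, content: bytes) -> str:
        """Classify an arriving file as 'new', 'duplicate', or 'correction'."""
        digest = hashlib.sha256(content).hexdigest()
        versions = self._versions.setdefault((broker, trade_date), [])
        if digest in versions:
            return "duplicate"   # identical re-send: nothing to reprocess
        versions.append(digest)
        # First file for the key is 'new'; any later distinct content
        # is treated as a correction that supersedes earlier versions.
        return "new" if len(versions) == 1 else "correction"

    def latest(self, broker: str, trade_date: str):
        """Digest of the most recent version, or None if nothing arrived."""
        versions = self._versions.get((broker, trade_date), [])
        return versions[-1] if versions else None
```

Because earlier digests are kept rather than overwritten, the registry also gives a simple history of which file version each run consumed.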