Context
AutoVision, an autonomous vehicle manufacturer, aims to enhance its real-time perception capabilities by running multiple machine learning models concurrently on vehicle GPUs. The current architecture executes models sequentially, producing unacceptable latencies for safety-critical functions such as obstacle detection and lane keeping. A new architecture must guarantee that sensor data is ingested and every model produces its output within 30ms of data capture.
Scale Requirements
- Throughput: Process data from 10 cameras and 5 LiDAR sensors at 60 FPS, totaling ~900 MB/s.
- Latency: Each model must produce results within 30ms of data capture to ensure real-time responsiveness.
- Concurrency: Support for at least 5 concurrent ML models on a single GPU, each handling different tasks (e.g., object detection, segmentation).
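A quick back-of-envelope check ties these numbers together. The sensor counts, frame rate, aggregate throughput, and latency budget below come from the list above; the per-frame size is derived from them, not stated in the brief.

```python
# Sanity-check the stated scale numbers.
CAMERAS = 10
LIDARS = 5
FPS = 60
TOTAL_THROUGHPUT_MB_S = 900   # stated aggregate across all sensors
LATENCY_BUDGET_MS = 30
CONCURRENT_MODELS = 5

# 15 sensors at 60 FPS produce 900 frames every second.
frames_per_second = (CAMERAS + LIDARS) * FPS

# 900 MB/s over 900 frames/s implies roughly 1 MB per frame on average.
avg_frame_mb = TOTAL_THROUGHPUT_MB_S / frames_per_second

# If the 5 models ran sequentially, each would get only 6 ms of the
# 30 ms budget -- the motivation for concurrent execution.
serial_per_model_ms = LATENCY_BUDGET_MS / CONCURRENT_MODELS

print(frames_per_second, avg_frame_mb, serial_per_model_ms)
```

The 6ms-per-model figure in a serial pipeline leaves no headroom for preprocessing or data transfer, which is why the brief rules out the sequential design.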
Requirements
- Design a streaming architecture that can ingest and preprocess data from multiple sensors in real-time.
- Implement a model serving layer that can load and execute multiple models concurrently on the GPU.
- Ensure that the system can handle dynamic model updates without downtime.
- Incorporate real-time data quality checks to validate input data before model inference.
- Design monitoring and alerting systems to ensure compliance with latency and throughput requirements.
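The serving, data-quality, and monitoring requirements above can be sketched together. This is a minimal CPU-side illustration only: `run_model` is a hypothetical stand-in for GPU inference (a production system would dispatch each model on its own CUDA stream, e.g., via NVIDIA Triton or TensorRT), and `valid_frame` stands in for real sensor validation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY_BUDGET_S = 0.030  # the 30 ms end-to-end budget from the brief


def valid_frame(frame):
    """Data-quality gate run before inference. A real check would also
    validate timestamps, value ranges, and sensor health flags."""
    return bool(frame) and "pixels" in frame


def run_model(name, frame):
    """Hypothetical stand-in for one GPU model; here it just simulates
    a short inference kernel."""
    time.sleep(0.002)  # pretend inference takes ~2 ms
    return {"model": name, "status": "ok"}


def infer_concurrently(frame, models):
    """Run all models on one frame concurrently and report whether the
    cycle met the latency budget (the monitoring hook)."""
    if not valid_frame(frame):
        return None, False  # reject bad input before it reaches the GPU
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        results = list(pool.map(lambda m: run_model(m, frame), models))
    within_budget = (time.monotonic() - start) <= LATENCY_BUDGET_S
    return results, within_budget


# Example cycle with the 5 concurrent models the brief calls for.
models = ["detect", "segment", "lane", "depth", "sign"]
results, ok = infer_concurrently({"pixels": [0] * 10}, models)
```

A `within_budget` value of `False` is the signal the alerting system would act on; dynamic model updates would be handled by swapping entries in `models` between cycles, which this sketch deliberately leaves out.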
Constraints
- Infrastructure: Limited to NVIDIA GPUs (e.g., RTX 3090) with a maximum of 24GB of memory per device.
- Budget: Monthly operating cost should not exceed $10,000, including cloud and hardware expenses.
- Compliance: Must adhere to automotive safety standards (ISO 26262).