Product Context
VisionGrid provides video analytics for retail chains, warehouses, and campuses. Customers upload or stream camera feeds and expect near-real-time detection, search, and alerting for events such as person entry, vehicle counting, safety violations, and suspicious activity.
Scale
| Signal | Value |
|---|---|
| Customers | 18,000 businesses |
| Daily active operators | 220,000 |
| Active cameras | 2.4M |
| Peak concurrent video streams | 850,000 |
| Ingest rate at peak | ~3.2M frames/sec after adaptive sampling |
| Events stored per day | 9B detections / tracks |
| Searchable video archive | 14 PB hot + warm storage |
| Alert query QPS | 45,000 |
| Investigative search QPS | 6,000 |
| End-to-end alert latency budget | p99 < 2 seconds |
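
A few derived quantities follow directly from the table and are useful when sizing the system. The sketch below is only back-of-envelope arithmetic on the figures above; the constant and function names are illustrative, and the per-stream and per-camera averages assume load is spread evenly, which real fleets will not satisfy.

```python
# Back-of-envelope figures derived from the Scale table.
# All names here are illustrative; the inputs are copied from the table.
ACTIVE_CAMERAS = 2_400_000
PEAK_STREAMS = 850_000
PEAK_FRAMES_PER_SEC = 3_200_000  # approximate, after adaptive sampling
EVENTS_PER_DAY = 9_000_000_000
CUSTOMERS = 18_000

SECONDS_PER_DAY = 24 * 60 * 60


def derived_scale() -> dict[str, float]:
    """Return simple averages implied by the table (uniform-load assumption)."""
    return {
        # Average sampled frame rate per concurrent stream at peak.
        "sampled_fps_per_stream": PEAK_FRAMES_PER_SEC / PEAK_STREAMS,
        # Share of active cameras streaming concurrently at peak.
        "peak_stream_fraction": PEAK_STREAMS / ACTIVE_CAMERAS,
        # Average event write rate over a full day.
        "avg_events_per_sec": EVENTS_PER_DAY / SECONDS_PER_DAY,
        # Average detections/tracks stored per camera per day.
        "events_per_camera_per_day": EVENTS_PER_DAY / ACTIVE_CAMERAS,
        # Average fleet size per customer.
        "cameras_per_customer": ACTIVE_CAMERAS / CUSTOMERS,
    }


if __name__ == "__main__":
    for name, value in derived_scale().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions, each concurrent stream contributes roughly 3.8 sampled frames per second at peak, and the event store absorbs on the order of 100,000 writes per second on average. Peak write rates will be higher than the daily average.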
Task
Design the end-to-end ML system for this platform. Address the following:
- Clarify the product requirements and define the primary ML tasks, outputs, and users.
- Estimate system scale and propose a multi-stage architecture for ingest, candidate retrieval, ranking, and alert generation.
- Choose models for each stage and explain online vs batch inference decisions.
- Design the training, feature, and feedback pipelines, including how labels are created from delayed human review.
- Define offline and online evaluation, monitoring, and rollout strategy.
- Identify major failure modes, especially around feature drift, training-serving skew, camera heterogeneity, and operational outages.
Constraints
- Cameras vary widely in resolution, frame rate, lighting, and placement; many are low quality.
- Raw video retention is limited for compliance and cost reasons: 7 days hot, 30 days warm, after which only derived features are kept.
- Some customers require on-prem or edge inference for privacy; others allow cloud processing.
- False negatives on safety alerts are costly, but excessive false positives cause alert fatigue.
- Serving cost must stay below $0.015 per camera-hour on average.
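
To make the cost constraint concrete, the sketch below converts the per-camera-hour cap into a fleet-wide hourly budget and a rough per-frame budget. It rests on two assumptions not stated in the prompt: that every active camera is billed for every hour, and that the peak sampled frame rate per stream is representative of a typical camera-hour. The function names are hypothetical.

```python
# Rough serving-cost budget implied by the constraint above.
# Assumptions (not from the prompt): all active cameras count toward the
# average, and the peak per-stream sampled frame rate is representative.
COST_CAP_PER_CAMERA_HOUR = 0.015  # USD, from the constraint
ACTIVE_CAMERAS = 2_400_000        # from the Scale table
PEAK_STREAMS = 850_000            # from the Scale table
PEAK_FRAMES_PER_SEC = 3_200_000   # from the Scale table (approximate)

SECONDS_PER_HOUR = 3600


def fleet_budget_per_hour(
    cameras: int = ACTIVE_CAMERAS,
    cap: float = COST_CAP_PER_CAMERA_HOUR,
) -> float:
    """Upper bound on total serving spend per hour across the fleet, in USD."""
    return cameras * cap


def budget_per_frame(cap: float = COST_CAP_PER_CAMERA_HOUR) -> float:
    """Approximate USD available per sampled frame for one streaming camera."""
    frames_per_stream_hour = PEAK_FRAMES_PER_SEC / PEAK_STREAMS * SECONDS_PER_HOUR
    return cap / frames_per_stream_hour


if __name__ == "__main__":
    print(f"fleet budget per hour: ${fleet_budget_per_hour():,.0f}")
    print(f"fleet budget per day:  ${fleet_budget_per_hour() * 24:,.0f}")
    print(f"budget per frame:      ${budget_per_frame():.2e}")
```

Under these assumptions the cap works out to about $36,000 per hour for the whole fleet and roughly one millionth of a dollar per sampled frame. That suggests running a cheap first-stage filter on most frames and reserving heavier models for a small fraction of candidates.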