Business Context
VideoPro, a leading video editing software provider with over 1 million active users, aims to integrate AI-driven features that enhance user creativity and efficiency. The goal is to build a multimodal large language model (LLM) that can understand user commands in natural language and interact with video content in real-time, enabling seamless editing workflows.
Dataset
| Feature Group | Count | Examples |
|---|---|---|
| Video Metadata | 10 | duration, resolution, frame_rate, codec |
| User Commands | 5 | 'Trim video', 'Add filter', 'Speed up', 'Add text overlay' |
| Video Content | 50K | frames, audio segments, color histograms, scene descriptors |
| User Interaction | 15 | click events, time spent on each tool, undo actions |
- Size: 50K video clips, each with 100 frames, 4 audio channels, and user interaction logs
- Target: Multimodal output based on user commands (e.g., video edits, effects)
- Class balance: Varied; some commands are more frequent than others (e.g., 'Trim video' is used 40% of the time, while 'Add text overlay' is used 10% of the time)
- Missing data: 5% of user commands missing for new users; 10% of video frames may be corrupted in legacy videos
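The stated class imbalance ('Trim video' at 40% vs. 'Add text overlay' at 10%) suggests weighting the loss or sampler so rare commands are not drowned out during training. A minimal sketch of inverse-frequency class weights, assuming a flat list of command labels (the function name and data are illustrative, not part of any VideoPro pipeline):

```python
from collections import Counter

def command_class_weights(commands):
    """Inverse-frequency weights: rare commands (e.g. 'Add text overlay')
    receive proportionally larger weights than frequent ones ('Trim video'),
    so each class contributes comparably to the training loss."""
    counts = Counter(commands)
    total = sum(counts.values())
    num_classes = len(counts)
    # weight = total / (num_classes * class_count); uniform data gives weight 1.0
    return {cmd: total / (num_classes * n) for cmd, n in counts.items()}

# Toy distribution mirroring the reported skew
labels = ["Trim video"] * 4 + ["Add filter"] * 2 + ["Speed up"] * 3 + ["Add text overlay"]
weights = command_class_weights(labels)
```

These weights can be passed to a weighted cross-entropy loss or a weighted sampler; corrupted legacy frames would be filtered out separately during data loading.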
Requirements
- Develop and train a multimodal LLM to process and generate video editing commands.
- Achieve at least 85% accuracy in command interpretation and execution.
- Ensure the model can process inputs with minimal latency (under 200ms).
- Implement a mechanism for real-time user feedback to improve model performance.
Constraints
- The model must be capable of real-time inference to support user interactions without noticeable delays.
- It should handle various video formats and resolutions, ensuring broad compatibility.
- The model should be easily interpretable for debugging and user feedback purposes.
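One lightweight way to make the model's behavior interpretable for debugging and user feedback is to emit a structured trace for every interpreted command. A minimal sketch; the field names are hypothetical, not part of any existing VideoPro schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class EditTrace:
    """One interpretable record per user command, suitable for debugging
    dashboards and for collecting real-time user feedback.

    All fields are illustrative assumptions, not an established API."""
    raw_command: str      # natural-language input, e.g. "Trim video"
    parsed_intent: str    # model's interpretation, e.g. "trim"
    confidence: float     # model confidence in [0, 1]
    latency_ms: float     # end-to-end time for this request

    def to_json(self) -> str:
        return json.dumps(asdict(self))

trace = EditTrace("Trim video", "trim", 0.93, 45.0)
```

Logging these traces per request gives a direct audit trail when a command is misinterpreted, and the confidence field lets the feedback mechanism prioritize low-confidence edits for user confirmation.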