Business Context
RoboMotion operates warehouse picking robots that must move a 6-DoF arm smoothly and accurately to target positions. The controls team wants an offline-to-online training pipeline for a continuous-action policy that can be improved in simulation before limited deployment on real hardware.
Dataset
You are given logged transition data collected from a MuJoCo-style robotic reaching simulator, plus a small amount of real-robot replay data. The task is to train and evaluate a Soft Actor-Critic (SAC) agent for continuous control.
| Feature Group | Count | Examples |
|---|---|---|
| State features | 24 | joint angles, joint velocities, end-effector position, target position, gripper state |
| Action features | 6 | continuous joint torque commands in [-1, 1] |
| Reward signals | 4 | dense distance reward, action penalty, collision penalty, success bonus |
| Episode metadata | 5 | episode_id, timestep, domain (sim/real), reset_type, done flag |
- Size: 2.4M transitions across 180K episodes
- Target: there is no supervised label; the objective is a policy that maximizes expected discounted return
- Action space: 6 continuous actions
- Missing data: ~1.5% corrupted sensor rows in real-robot logs; simulator data is complete
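The brief does not say what a corrupted sensor row looks like, so the following is only a minimal sketch under an explicit assumption: a row counts as corrupted if any state, action, or reward value is non-finite, or if an action falls outside the documented [-1, 1] range. The function name `filter_corrupted` and the array layout are hypothetical.

```python
import numpy as np

def filter_corrupted(states, actions, rewards):
    """Return a boolean mask that is True for rows considered clean.

    Assumption (not stated in the brief): corruption appears as NaN/inf
    values or as actions outside the documented [-1, 1] range.
    states:  (N, 24) array, actions: (N, 6) array,
    rewards: (N,) or (N, 4) array of reward signals.
    """
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=float)
    rewards = np.asarray(rewards, dtype=float).reshape(len(states), -1)

    # A row is finite only if every field in it is finite.
    finite = (
        np.isfinite(states).all(axis=1)
        & np.isfinite(actions).all(axis=1)
        & np.isfinite(rewards).all(axis=1)
    )
    # NaN compares False here, so non-finite actions are rejected as well.
    in_range = (np.abs(actions) <= 1.0).all(axis=1)
    return finite & in_range
```

Applying the mask as `states[mask]`, `actions[mask]`, and so on drops individual transitions. That is usually acceptable for an off-policy method such as SAC, provided each logged row carries its own next state; if next states are instead reconstructed from the following row, the transition preceding a dropped row needs to be dropped too.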
Success Criteria
A strong solution should:
- Achieve average episode return >= 220 on the simulator test environments
- Reach success rate >= 85% on held-out target positions
- Train stably across random seeds, without diverging Q-value estimates
- Explain why SAC is appropriate for continuous control compared with DQN or vanilla policy gradients
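The two numeric criteria above can be checked with a small helper. This is a sketch only: it assumes per-episode returns and success flags have already been collected from evaluation rollouts, and the name `summarize_eval` is hypothetical. The brief does not say whether the return threshold is discounted or undiscounted, so the helper simply averages whatever returns it is given.

```python
import numpy as np

def summarize_eval(episode_returns, episode_successes,
                   return_target=220.0, success_target=0.85):
    """Compare evaluation rollouts against the thresholds from the brief."""
    returns = np.asarray(episode_returns, dtype=float)
    successes = np.asarray(episode_successes, dtype=bool)

    mean_return = float(returns.mean())
    success_rate = float(successes.mean())
    return {
        "mean_return": mean_return,
        "success_rate": success_rate,
        "meets_return_target": mean_return >= return_target,
        "meets_success_target": success_rate >= success_target,
    }
```

Running the same summary once per training seed and reporting the spread of `mean_return` gives a simple view of seed-to-seed stability.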
Constraints
- Training budget is limited to 1 GPU for 8 hours
- Inference on the robot controller must be < 10 ms per step
- The policy should remain robust to moderate domain shift between simulator data and real-robot logs
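One way to sanity-check the 10 ms inference constraint is to time single-state forward passes. The sketch below uses a stand-in NumPy MLP (`toy_policy`, with made-up layer sizes) in place of the trained actor, and the helper name `measure_latency_ms` is hypothetical. Timings taken on a development machine are only indicative; the constraint applies to the robot controller, so the same measurement should be repeated there.

```python
import time
import numpy as np

def measure_latency_ms(policy_fn, state_dim=24, n_warmup=20, n_steps=200, seed=0):
    """Time one policy call per state and report latency statistics in ms."""
    rng = np.random.default_rng(seed)
    states = rng.standard_normal((n_warmup + n_steps, state_dim)).astype(np.float32)

    # Warm-up calls are excluded so one-off setup costs do not skew the numbers.
    for s in states[:n_warmup]:
        policy_fn(s)

    samples = []
    for s in states[n_warmup:]:
        t0 = time.perf_counter()
        policy_fn(s)
        samples.append((time.perf_counter() - t0) * 1e3)

    samples = np.asarray(samples)
    return {
        "median_ms": float(np.median(samples)),
        "p99_ms": float(np.percentile(samples, 99)),
        "max_ms": float(samples.max()),
    }

# Stand-in actor: 24 -> 256 -> 6 MLP with tanh-squashed outputs (illustrative only).
_rng = np.random.default_rng(1)
_W1 = (_rng.standard_normal((24, 256)) * 0.1).astype(np.float32)
_W2 = (_rng.standard_normal((256, 6)) * 0.1).astype(np.float32)

def toy_policy(state):
    hidden = np.maximum(state @ _W1, 0.0)   # ReLU hidden layer
    return np.tanh(hidden @ _W2)            # actions bounded in [-1, 1]
```

Reporting the 99th percentile and maximum alongside the median matters here, because a control loop is sensitive to the slowest steps rather than the typical one.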
Deliverables
- Explain SAC and why entropy-regularized RL is useful for continuous control.
- Build a training pipeline for actor, twin critics, and target networks.
- Describe preprocessing and filtering for corrupted replay data.
- Evaluate the learned policy with concrete metrics and ablations.
- Recommend deployment safeguards for testing the policy on real hardware.
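As an illustration of how the twin critics and target networks fit together, the sketch below computes the standard SAC critic target, y = r + γ(1 − d)(min(Q1', Q2') − α log π(a'|s')), and a Polyak (soft) target-network update. It is framework-agnostic NumPy rather than a full training loop. The function names and the default values of `gamma`, `alpha`, and `tau` are assumptions (commonly used defaults), not values given in the brief.

```python
import numpy as np

def soft_td_target(reward, done, q1_next, q2_next, logp_next,
                   gamma=0.99, alpha=0.2):
    """Entropy-regularized TD target used to train both critics.

    q1_next, q2_next: target-critic values at (s', a'), with a' sampled
                      from the current policy.
    logp_next:        log pi(a' | s') for those sampled actions.
    done:             1.0 where the episode terminated, else 0.0.
    """
    # Taking the minimum of the twin critics counteracts Q-value overestimation.
    min_q = np.minimum(q1_next, q2_next)
    # Subtracting alpha * log pi adds the entropy bonus to the bootstrapped value.
    soft_value = min_q - alpha * logp_next
    return reward + gamma * (1.0 - done) * soft_value

def polyak_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * p
            for t, p in zip(target_params, online_params)]
```

In a full pipeline, both critics regress onto the same target with a squared-error loss, and `polyak_update` is applied to the target critics after each gradient step.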