Train SAC for Robot Arm Control

Business Context

RoboMotion operates warehouse picking robots that must move a 6-DoF arm smoothly and accurately to target positions. The controls team wants an offline-to-online training pipeline for a continuous-action policy that can be improved in simulation before limited deployment on real hardware.

Dataset

You are given logged transition data collected from a MuJoCo-style robotic reaching simulator and a small amount of real robot replay data. The task is to train and evaluate a Soft Actor-Critic (SAC) agent for continuous control.

Feature Group	Count	Examples
State features	24	joint angles, joint velocities, end-effector position, target position, gripper state
Action features	6	continuous joint torque commands in [-1, 1]
Reward signals	4	dense distance reward, action penalty, collision penalty, success bonus
Episode metadata	5	episode_id, timestep, domain (sim/real), reset_type, done flag

Size: 2.4M transitions across 180K episodes
Target: Learn a policy maximizing discounted return, not a supervised label
Action space: 6 continuous actions
Missing data: ~1.5% corrupted sensor rows in real-robot logs; simulator data is complete
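
One way to handle the ~1.5% corrupted real-robot rows is to drop transitions that fail basic sanity checks before they reach the replay buffer. The sketch below is only illustrative: it assumes each transition is a dict with hypothetical keys `state`, `action`, and `reward`, and that corruption shows up as NaN/inf values, wrong vector lengths, or actions outside [-1, 1]. The actual log schema and failure modes are not specified here.

```python
import math

STATE_DIM = 24   # from the feature table
ACTION_DIM = 6   # from the feature table


def is_valid_transition(row):
    """Return True if a logged transition passes basic sanity checks.

    The dict keys used here are hypothetical; adapt them to the real schema.
    """
    state, action, reward = row["state"], row["action"], row["reward"]
    if len(state) != STATE_DIM or len(action) != ACTION_DIM:
        return False
    # Reject NaN / inf anywhere in the numeric fields.
    if not all(math.isfinite(x) for x in (*state, *action, reward)):
        return False
    # Torque commands are documented as lying in [-1, 1].
    return all(-1.0 <= a <= 1.0 for a in action)


def filter_transitions(rows):
    """Return (kept_rows, dropped_count)."""
    kept = [r for r in rows if is_valid_transition(r)]
    return kept, len(rows) - len(kept)
```

If next states are reconstructed from consecutive rows, dropping a single row mid-episode breaks the (s, a, r, s') chain, so truncating or discarding the affected episode may be the safer choice.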

Success Criteria

A strong solution should:

Achieve average episode return >= 220 on the simulator test environments
Reach success rate >= 85% on held-out target positions
Produce stable training with low Q-value divergence across seeds
Explain why SAC is appropriate for continuous control compared with DQN or vanilla policy gradients
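
The two numeric criteria above can be checked mechanically once evaluation episodes are collected. The following is a minimal sketch, assuming each evaluation episode is summarized as a hypothetical `(episode_return, succeeded)` tuple. The `seed_spread` helper is one possible proxy for stability across seeds (standard deviation of a per-seed statistic such as final mean Q-value), not a metric defined by this problem.

```python
from statistics import mean, pstdev

RETURN_THRESHOLD = 220.0   # from the success criteria
SUCCESS_THRESHOLD = 0.85   # from the success criteria


def summarize_episodes(episodes):
    """episodes: list of (episode_return, succeeded) tuples."""
    avg_return = mean(ret for ret, _ in episodes)
    success_rate = sum(1 for _, ok in episodes if ok) / len(episodes)
    return {
        "avg_return": avg_return,
        "success_rate": success_rate,
        "meets_return": avg_return >= RETURN_THRESHOLD,
        "meets_success": success_rate >= SUCCESS_THRESHOLD,
    }


def seed_spread(per_seed_values):
    """Population standard deviation of one statistic measured once per seed."""
    return pstdev(per_seed_values)
```

Reporting both thresholds separately makes it visible when a policy earns high return through dense reward shaping without actually reaching the targets.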

Constraints

Training budget is limited to 1 GPU for 8 hours
Inference on the robot controller must be < 10 ms per step
The policy should remain robust to moderate domain shift between simulation and real logs
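
The < 10 ms inference constraint can be checked by timing single-step calls to the policy. The sketch below shows only the measurement pattern: `tiny_policy` is a hypothetical pure-Python stand-in for the actor, and the numbers it produces say nothing about the real network. The actual figure would need to be measured on the robot controller with the deployed model.

```python
import math
import time

LATENCY_BUDGET_MS = 10.0  # from the constraints


def tiny_policy(state):
    """Hypothetical stand-in for the actor: 24 state inputs -> 6 actions in [-1, 1]."""
    total = sum(state)
    return [math.tanh(0.01 * (i + 1) * total) for i in range(6)]


def measure_latency_ms(policy, state, warmup=10, repeats=200):
    """Return (median_ms, worst_ms) over repeated single-step calls."""
    for _ in range(warmup):
        policy(state)  # warm caches before timing
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        policy(state)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2], samples[-1]
```

Comparing the worst-case sample, not just the median, against `LATENCY_BUDGET_MS` is the more conservative check for a real-time control loop.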

Deliverables

Explain SAC and why entropy-regularized RL is useful for continuous control.
Build a training pipeline for actor, twin critics, and target networks.
Describe preprocessing and filtering for corrupted replay data.
Evaluate the learned policy with concrete metrics and ablations.
Recommend deployment safeguards for testing the policy on real hardware.
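
For the training-pipeline deliverable, the twin critics and target networks meet in the critic target: SAC takes the minimum of the two target-critic values at the next state, subtracts the entropy term, and bootstraps through the discount. The sketch below works on plain scalars and lists to keep the arithmetic visible. The default values for `gamma`, `alpha`, and `tau` are illustrative assumptions, not values given in this problem.

```python
def soft_td_target(reward, done, q1_next, q2_next, logp_next, gamma=0.99, alpha=0.2):
    """SAC critic target for one transition.

    q1_next, q2_next: target-critic values at (s', a'), with a' sampled
    from the current policy.
    logp_next: log-probability of a' under the current policy.
    """
    # Clipped double-Q plus entropy bonus.
    soft_value = min(q1_next, q2_next) - alpha * logp_next
    # No bootstrapping past a terminal step.
    return reward + gamma * (1.0 - float(done)) * soft_value


def polyak_update(target_params, online_params, tau=0.005):
    """Return new target parameters: tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_params, online_params)]
```

Taking the minimum of the two critics counteracts Q-value overestimation, and the slow Polyak update keeps the bootstrap target from chasing the online critics, both of which bear on the stability criterion above.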
