What They Don't Tell You About Making a Robot Arm Grab Things
Stefano Maestri — Software Engineer & Robotics Tinkerer
LeRobot + SO101 hardware, assembly, cameras, recording pipeline
5 real bugs: CPU policy, action chunking, causal confounding, calibration drift, distribution shift
Profiling, diagnostics, tools we built, training metrics that matter
What worked, what did not, and the easier path with Cyberwave
5 bugs 5 fixes 1 robot 17.5 degrees
lerobot-record, lerobot-train, lerobot-rollout
"Grab the red block and put it in the box"
Sounds simple. It was not.
YouTube: robots folding laundry, cooking eggs, sorting warehouse boxes.
Papers: "straightforward training pipeline."
LeRobot README: "simple, accessible, state-of-the-art."
Narrator: it was not done by Sunday lunch.
Assembly surprises no tutorial prepares you for
3D-printed shoulder bracket had layer adhesion failure. Caused binding under load. Reprinted with higher infill.
Some servos shipped on different firmware versions. Motors on mismatched firmware don't work together — every servo had to be flashed to the same version.
The FD debug tool for STS3215 firmware updates runs exclusively on Windows. Linux users: find a Windows machine or VM.
Every joint must be manually aligned to its zero position. Millimeter precision required. One wrong offset propagates to the entire kinematic chain.
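To get a feel for how one wrong offset propagates, the sketch below computes how far a point on the arm moves when a single joint is off by a fixed angle. The 0.20 m lever arm is an assumed, illustrative distance from the joint axis to the gripper tip, not a measured SO101 dimension.

```python
import math


def tip_error_m(lever_arm_m: float, offset_deg: float) -> float:
    """Chord length swept by a point at lever_arm_m from the joint axis
    when that joint is rotated by offset_deg."""
    return 2.0 * lever_arm_m * math.sin(math.radians(offset_deg) / 2.0)


# Hypothetical lever arm of 0.20 m; the offsets are examples only.
for offset in (1.0, 5.0, 17.5):
    error_cm = tip_error_m(0.20, offset) * 100
    print(f"{offset:5.1f} deg offset -> {error_cm:.1f} cm at the tip")
```

The error grows with both the offset and the distance to the tip, so a small angular mistake near the base of the chain can matter more than a larger one at the wrist.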
The leader arm moves. The follower arm follows.
You move your wrist and a machine six inches away mirrors you in real time. It is, genuinely, magical.
Welcome to embedded Linux.
| Camera | Position | Resolution |
|---|---|---|
| front | Top-down view | 640x480 @ 30Hz |
| right | Side angle | 640x480 @ 30Hz |
| wrist | Gripper-mounted | 640x480 @ 30Hz |
Fix MJPEG at camera level + spread across USB controllers.
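A rough bandwidth estimate shows why this fix helps. The sketch assumes the cameras would otherwise stream uncompressed YUYV at 2 bytes per pixel (an assumption, since the slides do not name the raw format) and compares three streams against the 480 Mbit/s nominal rate of USB 2.0 High Speed.

```python
USB2_NOMINAL_MBIT_S = 480.0  # USB 2.0 High Speed signalling rate


def raw_stream_mbit_s(width: int, height: int, fps: int,
                      bytes_per_pixel: int = 2) -> float:
    """Bandwidth of one uncompressed video stream in Mbit/s."""
    return width * height * bytes_per_pixel * fps * 8 / 1e6


per_camera = raw_stream_mbit_s(640, 480, 30)
total = 3 * per_camera

print(f"one camera   : {per_camera:.1f} Mbit/s")
print(f"three cameras: {total:.1f} Mbit/s")
print(f"share of USB 2.0 nominal rate: {total / USB2_NOMINAL_MBIT_S:.0%}")
```

Usable throughput on a shared controller is lower than the nominal rate, so three raw streams on one controller leave essentially no headroom. Compressing with MJPEG and spreading the cameras across controllers both reduce the load per bus.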
lerobot-record
Wayland blocks global keyboard event snooping. The pynput.keyboard.Listener silently fails. No arrow-key controls during recording.
[Y/n/q] between episodes
--dataset.reset_time_s=-1
# Activate interactive mode
lerobot-record \
--robot.type=so101_follower \
--dataset.reset_time_s=-1 \
--dataset.single_task="Grab the red block" \
...
# Between episodes:
[INTERACTIVE RESET] Episode 3 recorded.
Keep scene and record next? [Y/n/q]: y
# During recording:
# Press Ctrl+\ to end episode early
[INTERACTIVE] Episode end requested via SIGQUIT
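A minimal sketch of the signal-based approach, assuming a Unix-like system. The `EpisodeControl` class and `record_episode` loop are hypothetical names for illustration, not the actual lerobot-record code: the handler only sets a flag, and the recording loop checks that flag between steps.

```python
import signal


class EpisodeControl:
    """Holds an 'end this episode' flag that a SIGQUIT handler can set."""

    def __init__(self) -> None:
        self.end_requested = False

    def _on_sigquit(self, signum, frame) -> None:
        # Keep the handler tiny: set a flag and return.
        self.end_requested = True

    def install(self) -> None:
        # Ctrl+\ in a terminal sends SIGQUIT to the foreground process.
        # Must be called from the main thread.
        signal.signal(signal.SIGQUIT, self._on_sigquit)


def record_episode(control: EpisodeControl, max_steps: int, step_fn) -> int:
    """Run step_fn up to max_steps times, stopping early if requested.
    Returns the number of steps actually recorded."""
    control.end_requested = False
    steps = 0
    for _ in range(max_steps):
        if control.end_requested:
            break
        step_fn()
        steps += 1
    return steps
```

Because signals are delivered by the terminal rather than a desktop keyboard hook, this style of control does not depend on the display server.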
50 attempts to grab a red block. Each recording session:
Lesson #1: Data collection IS the job.
Not a step before the job. Not a prerequisite. The job.
Symptom: GPU utilization at 2%. Training "works" but policy is garbage.
Fix: policy.to("cuda") explicitly after loading. Check next(policy.parameters()).device.
Symptom: Policy works perfectly in replay, fails on real robot.
Fix: Remove leader arm from camera frame. It leaks future actions into observations.
Symptom: Works at 10am, fails at 3pm. Same setup, same code.
Fix: Record across lighting conditions. Enable color augmentation (but not hue jitter for color tasks).
Common thread: the system never tells you something is wrong.
No errors. No warnings. Just a robot that doesn't work.
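Since nothing warns you, the device check from the first fix can be turned into a hard failure. The helper below is a sketch with hypothetical names; it is duck-typed, so it works on any object with a `parameters()` method whose items expose `.device`, which includes a PyTorch module.

```python
from collections import Counter


def device_summary(module) -> Counter:
    """Count parameters per device type, e.g. {'cuda': 120} or {'cpu': 120}."""
    return Counter(str(p.device).split(":")[0] for p in module.parameters())


def assert_on(module, expected: str = "cuda") -> None:
    """Raise if any parameter lives on a device other than `expected`."""
    summary = device_summary(module)
    if set(summary) != {expected}:
        raise RuntimeError(
            f"expected all parameters on {expected!r}, found {dict(summary)}"
        )
```

With a loaded policy, a call such as `assert_on(policy, "cuda")` right after `policy.to("cuda")` would stop the run immediately instead of letting it proceed silently on the CPU.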
ACT policy. 1 hour on GPU.
Loss: 0.08
Looks great.
Robot reaches for the block...
and misses.
Every. Single. Time.
Not randomly. Systematically. Always 6-7 cm to the right.
Maybe the model isn't big enough?
Try ACT → misses →
Try Pi0.5 → misses →
Try SmolVLA → misses
The problem isn't the model.
Choosing the right model for a budget setup
chunk_size=100, n_action_steps=100
Language prompt + vision encoder
chunk_size=50, pretrained backbone
When "GPU idle 90% of the time" is actually correct
chunk_size = 100, n_action_steps = 100
1 forward pass = 100 actions = 3.3s of motion at 30 Hz
Sporadic "running slower than requested fps" warnings = timing jitter, not a bug.
Cost us 2 days debugging a non-problem.
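The arithmetic behind the idle GPU can be checked in a few lines. The 0.3 s forward-pass time below is a made-up placeholder, not a measurement from the talk; substitute your own profiled value.

```python
def chunk_duration_s(n_action_steps: int, control_hz: float) -> float:
    """Seconds of robot motion covered by one predicted action chunk."""
    return n_action_steps / control_hz


def gpu_idle_fraction(inference_s: float, n_action_steps: int,
                      control_hz: float) -> float:
    """Fraction of wall time the GPU waits while the chunk is executed."""
    busy = inference_s / chunk_duration_s(n_action_steps, control_hz)
    return max(0.0, 1.0 - busy)


duration = chunk_duration_s(100, 30)
idle = gpu_idle_fraction(0.3, 100, 30)  # 0.3 s is a hypothetical value
print(f"one chunk covers {duration:.2f} s of motion")
print(f"GPU idle fraction with a 0.3 s forward pass: {idle:.0%}")
```

When one forward pass feeds seconds of motion, low GPU utilization during rollout is the expected result of action chunking rather than a symptom.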
Reads every episode in a dataset. Computes per-joint drift statistics between leader commands and follower positions.
Found the 17.5° wrist_flex drift
# Usage
python compare_leader_follower.py \
--dataset ./data/pick_block
Flags outlier episodes by trajectory smoothness, gripper timing, and completion metrics.
Identified 12% corrupted episodes
# Usage
python evaluate_dataset_quality.py \
--dataset ./data/pick_block \
--threshold 2.0
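One plausible way to flag outlier episodes is a z-score test on a per-episode metric. This is a sketch, not the actual evaluate_dataset_quality.py: it assumes the threshold is measured in standard deviations, and the smoothness values are invented for illustration.

```python
from statistics import mean, pstdev


def flag_outliers(values: list[float], threshold: float = 2.0) -> list[int]:
    """Return indices of values more than `threshold` standard deviations
    away from the mean of the whole list."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all episodes identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Invented per-episode smoothness scores; episode 9 is deliberately jerky.
smoothness = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 10.0]
print("outlier episodes:", flag_outliers(smoothness))
```

A single global z-score is sensitive to the outliers it is trying to find, so a median-based variant may be more robust on small datasets.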
Static calibration check. Reads current joint positions in real time. Compares leader vs follower live.
Verify calibration before recording
# Usage
python read_leader_pos.py
python read_follower_pos.py
~600 lines of Python
that changed everything
The policy learned to cheat
Leader arm visible in camera during teleoperation recording. Policy learns a shortcut: track the leader arm, not the block.
Leader arm is absent. Policy sees an unfamiliar scene. Output: random, erratic movements.
Reposition cameras so the leader arm is never in frame. Re-record entire dataset.
The root cause of everything
# compare_leader_follower.py output
Joint Mean |diff| Max |diff|
───────────────────────────────────
shoulder_pan 1.2° 3.1°
shoulder_lift 0.9° 2.4°
elbow_flex 1.1° 2.8°
wrist_flex 17.5° 22.3°
wrist_roll 0.7° 1.9°
gripper 2.3° 4.1°
WARNING: wrist_flex exceeds 5° threshold!
Recalibrate this joint.
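The statistics in that report are simple to compute. The sketch below is not the original compare_leader_follower.py: the function names and the toy trajectories are invented, and it assumes the 5° threshold applies to the mean absolute difference.

```python
from statistics import mean


def joint_drift(leader: dict[str, list[float]],
                follower: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """Per joint, return (mean |diff|, max |diff|) in degrees between
    leader commands and follower positions sampled at the same timesteps."""
    stats = {}
    for joint, lead in leader.items():
        diffs = [abs(l - f) for l, f in zip(lead, follower[joint])]
        stats[joint] = (mean(diffs), max(diffs))
    return stats


def joints_over_threshold(stats: dict[str, tuple[float, float]],
                          threshold_deg: float = 5.0) -> list[str]:
    """Names of joints whose mean drift exceeds the threshold."""
    return [j for j, (avg, _) in stats.items() if avg > threshold_deg]


# Toy data, not taken from the real dataset.
leader = {"wrist_flex": [10.0, 20.0, 30.0], "wrist_roll": [0.0, 5.0, 10.0]}
follower = {"wrist_flex": [27.0, 38.0, 47.0], "wrist_roll": [1.0, 5.0, 9.0]}

stats = joint_drift(leader, follower)
for joint, (avg, worst) in stats.items():
    print(f"{joint:12s} mean {avg:5.1f} deg   max {worst:5.1f} deg")
print("recalibrate:", joints_over_threshold(stats))
```

A constant offset on one joint shows up as a large mean with a comparatively small spread, which distinguishes miscalibration from ordinary tracking lag.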
| Metric | Before | After |
|---|---|---|
| wrist_flex offset | 17.5° | 0.85° |
| Gripper accuracy | ±6-7 cm | ±0.3 cm |
| Grasp success | 0% | 80% |
Every single episode I recorded taught the robot
a physically incorrect mapping of the world.
A 40-line Python script found in 3 seconds what I couldn't find in 2 weeks.
Same policy. Same code. Same hyperparameters.
Morning light training, afternoon deployment
Policy trained with morning lighting fails in afternoon conditions. Shadows change, color temperature shifts, white balance differs.
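The idea of color augmentation without hue jitter can be illustrated on a single pixel: vary saturation and brightness, leave hue alone, so a red block stays red. This is a standard-library sketch with hypothetical names, not the transform pipeline LeRobot uses.

```python
import colorsys
import random


def jitter_keep_hue(rgb: tuple[float, float, float], rng: random.Random,
                    strength: float = 0.3) -> tuple[float, float, float]:
    """Randomly scale saturation and value of an RGB pixel (channels in
    [0, 1]) by up to +/- strength while leaving hue unchanged."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    s = min(1.0, max(0.0, s * rng.uniform(1 - strength, 1 + strength)))
    v = min(1.0, max(0.0, v * rng.uniform(1 - strength, 1 + strength)))
    return colorsys.hsv_to_rgb(h, s, v)


rng = random.Random(0)  # fixed seed so the example is repeatable
pixel = (0.2, 0.6, 0.3)
augmented = jitter_keep_hue(pixel, rng)
print("hue before:", round(colorsys.rgb_to_hsv(*pixel)[0], 4))
print("hue after :", round(colorsys.rgb_to_hsv(*augmented)[0], 4))
```

When the task instruction names a color, shifting hue would change the label-relevant feature itself, which is why only brightness-like properties are perturbed here.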
# Enable built-in transforms
lerobot-train \
--training.image_transforms.enable=true \
...
Default augmentations:
What l1_loss actually means for your robot
Your most powerful debugging tool: lerobot-dataset-viz
LIVE DEMO
Rerun visualization: 3 camera streams + joint positions + episode timeline
lerobot-dataset-viz \
--repo-id smaestri/so101_pick \
--episode-index 0
Requires pip install 'lerobot[viz]'
Success and failure — because both matter
The 30Hz loop breakdown revealed that compute was never the bottleneck.
Every bug was found by interrogating the recorded data, not the model.
One flag (image_transforms.enable=true) fixed lighting sensitivity.
600 lines of Python saved weeks of guesswork.
Switched to SmolVLA when the problem was a 17.5° calibration offset. Model size is irrelevant if the data is wrong.
Trained for 200k steps instead of 100k. Loss plateaued at 0.065. The problem was data distribution, not underfitting.
Hypothesized fewer inputs = easier learning. Wrong: wrist perspective is critical for fine grasping.
Spent 3 weeks on software when the answer was a recalibration that took 10 minutes.
Every visible symptom had its root cause 2-3 layers below the surface.
1. Verify calibration before every recording session
2. Build dataset inspection tools before training tools
3. Start with ACT. Add complexity only when needed
4. Record across conditions (lighting, position, angle)
5. Check your GPU utilization. Always
After fighting every layer of the stack, I found a platform that automates most of it.
One API for any robot. Write Python once — it runs in simulation AND on real hardware.
What if the hard parts were automated?
cyberwave.com
docs.cyberwave.com/tutorials/so101-voice-pick-and-place
| Challenge | Us (time spent) | Cyberwave |
|---|---|---|
| Calibration | 3 weeks | Auto-detected |
| Camera setup | 2 days | Pre-configured |
| Training infra | 1 day | Cloud GPU |
| Data augmentation | 1 day | Built-in |
| Deployment | 2 days | One-click |
| Diagnostics | 1 week | Dashboard |
"Start local to understand. Go cloud to ship."