Hacker News
A Revolution in How Robots Learn (newyorker.com)
82 points by jsomers on Nov 26, 2024 | hide | past | favorite | 21 comments


I did a review of the state of the art in robotics recently in prep for some job interviews, and the stack is the same as for all other ML problems these days: take a large pretrained multimodal model and do supervised fine-tuning on your domain data.

In this case it's "VLA," as in Vision-Language-Action models, where a multimodal decoder predicts action tokens. "Behavior cloning" is a fancy made-up term for supervised learning, because the RL people can't bring themselves to admit that supervised learning works way better than reinforcement learning in the real world.
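To make the point concrete, here's a toy sketch of "behavior cloning" as plain supervised learning: fit a classifier over discretized action tokens to (observation, action) pairs, as if collected by teleop. The linear softmax model, the dimensions, and the synthetic "expert" are all made up for illustration; a real VLA fine-tunes a large pretrained multimodal decoder instead.

```python
import numpy as np

# Toy "behavior cloning": supervised classification over discretized action
# tokens, trained on (observation, action) pairs from demonstrations.
rng = np.random.default_rng(0)
NUM_ACTIONS, OBS_DIM = 4, 8

# Synthetic "expert" demonstrations: the expert picks the action whose row
# of a hidden matrix scores the observation highest.
W_true = rng.normal(size=(NUM_ACTIONS, OBS_DIM))
obs = rng.normal(size=(512, OBS_DIM))
actions = np.argmax(obs @ W_true.T, axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Plain cross-entropy gradient descent -- nothing RL about it.
W = np.zeros((NUM_ACTIONS, OBS_DIM))
for _ in range(2000):
    grad_logits = softmax(obs @ W.T)
    grad_logits[np.arange(len(actions)), actions] -= 1.0  # dL/dlogits
    W -= 0.1 * (grad_logits.T @ obs) / len(obs)

accuracy = float(np.mean(np.argmax(obs @ W.T, axis=1) == actions))
```

The whole "algorithm" is a classification loss on demonstration data, which is why calling it imitation learning or anything more exotic is mostly branding.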

Proper imitation learning, where a robot learns from a third-person view of humans doing stuff, does not work yet, but some people in the field like to pretend that teleoperation plus "behavior cloning" is a form of imitation learning.


I'd like to see the resources you found most helpful as well


Would you mind sharing which readings you found most useful in that review?


just coming back to this thread, this paper is quite a good read: https://arxiv.org/html/2406.09246v3#S3

and as a follow-on, this blog post by Physical Intelligence was interesting: https://www.physicalintelligence.company/blog/pi0


Hey, just got back on; the papers you shared are the main works I was about to link. There's also a new VLA paper from Waymo: https://arxiv.org/abs/2410.23262v2

and some recent talks on youtube:

- OpenVLA: https://www.youtube.com/watch?v=-0s0v3q7mBk

- The current state of robotics, by Alex Irpan: https://www.youtube.com/watch?v=XocmVe1FCMY

- Robot Learning, with inspiration from child development, by Jitendra Malik: https://www.youtube.com/watch?v=69ZWEaOKnQQ

- AI Symposium 2024 | Dieter Fox Keynote: https://www.youtube.com/watch?v=vgqHR9gK9bQ

- 1st Workshop on X-Embodiment Robot Learning, CoRL'24: https://www.youtube.com/watch?v=ELUMFpJCUS0


In addition to the papers on end-to-end learning for robotics, it's also worth reading about the state of the art in classical robotics. There's a lot of debate in the field about whether end-to-end learning and scaling will solve robotics[1]. On the E2E side, there's the bitter lesson, scaling for LLMs, and other AI success cases. On the skeptical side, there's the reliability limit (has anyone seen ML cross the 1-failure-in-100,000 barrier on real data?) and the bitter-er lesson (scaling search can beat scaling data, and classical robotics scales search instead of data). Data availability is a blocker for research, but in production many use cases are profitable with teleop, so data can be collected profitably, especially with UX design that makes teleop more efficient.

Navigating stair-free commercial environments was solved in mid-2009 by classical planning + SLAM with LIDAR, and open-sourced in the ROS navstack. A LIDAR-free version using stereo cameras was open-sourced shortly thereafter. The navstack is still maintained and integrated by Open Robotics[2] and Opennav[3]. These techniques (in many cases forks of the OSS code) power, e.g., 10,000 bear.ai robots in restaurants today, as well as some of the newer Roombas. All of this is CPU-only and can run on a NUC.
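For a sense of how small the core of classical 2-D navigation is, here's a minimal A* planner on an occupancy grid. The grid and coordinates are invented for the example; the real navstack layers costmaps, recovery behaviors, and local control on top of a planner like this.

```python
import heapq

# Minimal A* on an occupancy grid: the planning core of classical
# 2-D navigation. '#' marks obstacle cells; moves are 4-connected.
def astar(grid, start, goal):
    """Return the shortest path length from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    frontier = [(h(start), 0, start)]          # (f = g + h, g, cell)
    best = {start: 0}
    while frontier:
        _, g, cur = heapq.heappop(frontier)
        if cur == goal:
            return g
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] != '#'
                    and g + 1 < best.get(nxt, float('inf'))):
                best[nxt] = g + 1
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt))
    return None

grid = ["....",
        ".##.",
        ".#..",
        "...."]
cost = astar(grid, (0, 0), (2, 3))
```

Here `cost` comes out to the Manhattan distance (5), since a shortest path around the obstacle block exists along the top row. This all runs trivially on a CPU, which is part of the point.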

Classical planning has also solved arm motion quite well. The modern technology here is MoveIt! 2[4]. MoveIt! uses essentially the CAD model of the arm (which most robot manufacturers provide in the correct format), plus sensor data about objects in the environment, to plan motions. There are modules to create smoother, more human-like motions as well. All of this runs efficiently on CPU only.
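The primitive underneath a planner like MoveIt! is a kinematic model of the arm. A toy sketch of that layer, with made-up link lengths and target: forward kinematics for a planar 2-link arm, and an iterative Jacobian-transpose inverse-kinematics solve (real planners add collision checking and joint limits on top).

```python
import math

# Forward kinematics for a planar 2-link arm plus an iterative IK solve.
# Link lengths and the target point are made up for illustration.
LEN1, LEN2 = 1.0, 0.8

def forward(t1, t2):
    """End-effector (x, y) for joint angles t1, t2."""
    x = LEN1 * math.cos(t1) + LEN2 * math.cos(t1 + t2)
    y = LEN1 * math.sin(t1) + LEN2 * math.sin(t1 + t2)
    return x, y

def ik(target, t1=0.3, t2=0.3, lr=0.2, steps=5000):
    """Jacobian-transpose gradient descent on the end-effector error."""
    for _ in range(steps):
        x, y = forward(t1, t2)
        ex, ey = x - target[0], y - target[1]
        # Jacobian of (x, y) with respect to (t1, t2)
        j11 = -LEN1 * math.sin(t1) - LEN2 * math.sin(t1 + t2)
        j12 = -LEN2 * math.sin(t1 + t2)
        j21 = LEN1 * math.cos(t1) + LEN2 * math.cos(t1 + t2)
        j22 = LEN2 * math.cos(t1 + t2)
        # Step along -J^T e, the gradient of 0.5 * ||error||^2
        t1 -= lr * (j11 * ex + j21 * ey)
        t2 -= lr * (j12 * ex + j22 * ey)
    return t1, t2

t1, t2 = ik((1.2, 0.9))
x, y = forward(t1, t2)
err = math.hypot(x - 1.2, y - 0.9)
```

Because the arm model is just geometry from the CAD file, none of this needs a GPU, which matches the CPU-only point above.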

Lastly, LIDAR-less SLAM and mapping is also starting to appear: https://docs.luxonis.com/software/ros/vio-slam. LIDAR costs have also fallen to the point where robot vacuums are sold with integrated LIDARs.

The main areas where classical methods have made less progress are soft objects (e.g. folding towels) and object detection. Classical point-cloud object detection, for example, is based on correspondence grouping[5], but overall everyone now uses at least partially neural approaches for these problems.
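A toy sketch of the correspondence-grouping idea: match model points to scene points by local-descriptor similarity, then keep only the matches that agree on a common rigid motion. Everything here (random stand-in descriptors, translation-only consistency) is simplified for illustration; PCL clusters correspondences by pairwise geometric consistency or Hough voting.

```python
import numpy as np

# Toy correspondence grouping: descriptor matching + geometric consistency.
rng = np.random.default_rng(1)
model_pts = rng.uniform(0, 1, size=(20, 3))
model_desc = rng.normal(size=(20, 8))           # stand-in local descriptors

# Scene = a translated copy of the model plus clutter points.
true_shift = np.array([2.0, -1.0, 0.5])
scene_pts = np.vstack([model_pts + true_shift,
                       rng.uniform(0, 1, size=(10, 3))])
scene_desc = np.vstack([model_desc + 0.01 * rng.normal(size=(20, 8)),
                        rng.normal(size=(10, 8))])

# 1) Descriptor matching: nearest scene descriptor for each model descriptor.
d = np.linalg.norm(model_desc[:, None, :] - scene_desc[None, :, :], axis=2)
matches = np.argmin(d, axis=1)                  # model i -> scene matches[i]

# 2) Geometric consistency: each match implies a translation; keep the
#    matches that agree with the consensus (median) translation.
shifts = scene_pts[matches] - model_pts
consensus = np.median(shifts, axis=0)
inliers = np.linalg.norm(shifts - consensus, axis=1) < 0.1
```

The consistency step is what makes this robust to clutter: a clutter match implies a wildly different translation and gets voted out.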

As for end-to-end in prod without a human in the loop, Covariant and Ambi are the only cases I've seen so far. They benefit from having a classical safety layer and a classical success detector via e.g. object weights (I'm not sure what approach they're using; I've just seen object weights used elsewhere). With that they can get the much-desired data-flywheel effect of self-improving systems.
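For illustration only, here's a guess at what a weight-based classical success detector could look like (this is not Covariant's or Ambi's actual method): compare the measured weight change against the expected item weight, and use the result to label episodes for the flywheel. The episode records and tolerance are invented.

```python
# Hypothetical weight-based success check for a pick-and-place cell.
def pick_succeeded(measured_g, expected_g, tolerance_g=5.0):
    """True if the measured weight change matches the expected item weight."""
    return abs(measured_g - expected_g) <= tolerance_g

# Automatically label episodes so the self-improvement loop gets
# success/failure signals without a human in the loop.
episodes = [
    {"id": "ep1", "measured_g": 251.2, "expected_g": 250.0},  # good pick
    {"id": "ep2", "measured_g": 0.3,   "expected_g": 250.0},  # dropped item
    {"id": "ep3", "measured_g": 498.9, "expected_g": 250.0},  # double pick
]
labels = [pick_succeeded(e["measured_g"], e["expected_g"]) for e in episodes]
```

The appeal of a check like this is that it's independent of the learned policy, so a policy bug can't also corrupt the success labels it trains on.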

1. https://spectrum.ieee.org/solve-robotics
2. https://openrobotics.org
3. https://opennav.org
4. https://moveit.picknik.ai/humble/index.html
5. https://pcl.readthedocs.io/projects/tutorials/en/latest/corr...


This is excellent - thank you so much


brilliant, thank you


yeah, I'd love to see your list


We should consider that it may be possible to train a model that first maps 3rd-person views to 1st person views, before a secondary model then trains on the first person view.

An untapped area is existing first-person video of small-object manipulation, like police body cameras, where officers handle flashlights and other objects regularly. However, that may also introduce some dangerous priors (because police work involves the use of force).

- This reply generated by P.R.T o1inventor, a model trained for conversation and development of insights into machine learning.


One particularly fascinating aspect of this essay is the comparison between human motor learning and robotic dexterity development, particularly the concept of “motor babbling.” The author highlights how babies use seemingly random movements to calibrate their brains with their bodies, drawing a parallel to how robots are being trained to achieve precise physical tasks. This framing makes the complexity of robotic learning, such as a robot tying shoelaces or threading a needle, more relatable and underscores the immense challenge of replicating human physical intelligence in machines. For me it is also a vivid reminder of how much we take our own physical adaptability for granted.


Machine learning is not equal to human infant development, full stop.



Hey, I wonder if we can use LLMs to learn learning patterns. I guess the bottleneck would be the curse of dimensionality for real-world problems, but I think (correct me if I'm wrong) geographic/domain-specific attention networks could be used.

Maybe it's like:

1. Intention, context

2. Attention scanning for components

3. Attention network discovery

4. Rescan for missing components

5. If no relevant context exists or is found

6. Learned parameters are initially greedy

7. Storage of parameters gets reduced over time by other contributors

I guess this relies on there being the tough parts: induction, deduction, abductive reasoning.

Can we fake reasoning to test hypotheses that alter the weights of whatever model we use for reasoning?


Maybe I'm just complicating unsupervised reinforcement learning, and adding central authorities for domain specific models.


A research result reported before, but, as usual, the New Yorker has better writers.

Is there something which shows what the tokens they use look like?


> the New Yorker has better writers.

Really? I suppose it's very subjective, but I find their style, both in this article and in general, to be unbearably long-winded, almost as if their journalists enjoy writing for the sake of writing, with the transmission of information being a minor concern.


There's a big asterisk on the word "learn" in that headline.


Oh my, that has to be one of the worst jobs ever invented.


Anyone else find it suspicious that all these paywalled fluff pieces from legacy tech media keep ending up on HN? Feels like an op. Who in tech actually reads the NYT, for example?


People procrastinating on their ML job.



