In the summer of 2021, OpenAI quietly shuttered its robotics team, announcing that progress was being stifled by a lack of data necessary to train robots in how to move and reason using artificial intelligence. 

Now three of OpenAI’s early research scientists say the startup they spun off in 2017, called Covariant, has solved that problem and unveiled a system that combines the reasoning skills of large language models with the physical dexterity of an advanced robot.

The new model, called RFM-1, was trained on years of data collected from Covariant’s small fleet of item-picking robots that customers like Crate & Barrel and Bonprix use in warehouses around the world, as well as words and videos from the internet. In the coming months, the model will be released to Covariant customers. The company hopes the system will become more capable and efficient as it’s deployed in the real world. 

So what can it do? In a demonstration I attended last week, Covariant cofounders Peter Chen and Pieter Abbeel showed me how users can prompt the model using five different types of input: text, images, video, robot instructions, and measurements. 

For example, show it an image of a bin filled with sports equipment, and tell it to pick up the pack of tennis balls. The robot can then grab the item, generate an image of what the bin will look like after the tennis balls are gone, or create a video showing a bird’s-eye view of how the robot will look doing the task. 

If the model predicts it won’t be able to properly grasp the item, it might even type back, “I can’t get a good grip. Do you have any tips?” A response could advise it to use a specific number of the suction cups on its arms to give it a better grasp (eight versus six, for example).

This represents a leap forward, Chen told me, in robots that can adapt to their environment using training data rather than the complex, task-specific code that powered the previous generation of industrial robots. It’s also a step toward worksites where managers can issue instructions in human language without concern for the limitations of human labor. (“Pack 600 meal-prep kits for red pepper pasta using the following recipe. Take no breaks!”)

Lerrel Pinto, a researcher who runs the general-purpose robotics and AI lab at New York University and has no ties to Covariant, says that even though roboticists have built basic multimodal robots before and used them in lab settings, deploying one at scale that’s able to communicate in this many modes marks an impressive feat for the company. 

To outpace its competitors, Covariant will have to get its hands on enough data for the robot to become useful in the wild, Pinto told me. Warehouse floors and loading docks are where it will be put to the test, constantly interacting with new instructions, people, objects, and environments. 

“The groups which are going to train good models are going to be the ones that have either access to already large amounts of robot data or capabilities to generate those data,” he says.

Covariant says the model has a “human-like” ability to reason, but it has its limitations. During the demonstration, in which I could see a live feed of a Covariant robot as well as a chat window to communicate with it, Chen invited me to prompt the model with anything I wanted. When I asked the robot to “return the banana to Tote Two,” it struggled with retracing its steps, leading it to pick up a sponge, then an apple, then a host of other items before it finally accomplished the banana task. 

“It doesn’t understand the new concept,” Chen said by way of explanation, “but it’s a good example—it might not work well yet in the places where you don’t have good training data.”

The company’s new model embodies a paradigm shift rippling through the robotics world. Rather than manually teaching a robot how the world works, through instructions like physics equations and code, researchers are teaching it in the same way humans learn: through millions of observations.

The result “really can act as a very effective flexible brain to solve arbitrary robot tasks,” Chen said. 

The playing field of companies using AI to power more nimble robotic systems is likely to grow crowded this year. Earlier this month, the humanoid-robotics startup Figure AI announced it would be partnering with OpenAI and raised $675 million from tech giants like Nvidia and Microsoft. Marc Raibert, the founder of Boston Dynamics, recently started an initiative to better integrate AI into robotics.  

This means that advancements in machine learning will likely start translating to advancements in robotics. However, some issues remain unresolved. If large language models continue to be trained on millions of words without compensating the authors of those words, perhaps robotics models will likewise be trained on videos whose creators go unpaid. And if language models hallucinate and perpetuate biases, what equivalents will surface in robotics?

In the meantime, Covariant will push forward, keen on having RFM-1 continually learn and improve. Eventually, the researchers aim to have the robot train on videos that the model itself creates, the type of meta-learning that not only makes my head spin but also sparks concern about what will happen if errors made by the model compound themselves. But with such a hunger for more training data, researchers see it as almost inevitable.

“Training on that will be a reality,” Abbeel says. “If we talk again a half year from now, that’s what we’ll be talking about.”