
Google announces Gemini Robotics for building general purpose robots 

Google DeepMind today announced Gemini Robotics, bringing Gemini and “AI into the physical world” with new models able to “perform a wider range of real-world tasks than ever before.”

In order for AI to be useful and helpful to people in the physical realm, it has to demonstrate “embodied” reasoning — the humanlike ability to comprehend and react to the world around us — as well as safely take action to get things done.

The aim is to build general-purpose robots, with CEO Sundar Pichai adding that Google has “always thought of robotics as a helpful testing ground for translating AI advances into the physical world.”

“Gemini Robotics” is a vision-language-action (VLA) model built on Gemini 2.0 “with the addition of physical actions as a new output modality for the purpose of directly controlling robots.” 
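“Physical actions as a new output modality” means the model emits robot commands alongside ordinary text. As a rough, hypothetical illustration of that idea — none of these names, shapes, or functions come from Google’s API — a VLA-style interface could look something like this:

```python
# Hypothetical sketch of a vision-language-action (VLA) interface: in addition
# to text, the model emits low-level robot actions. All names and data shapes
# here are illustrative assumptions, not Google's published interface.
from dataclasses import dataclass
from typing import List

@dataclass
class VLAOutput:
    text: str                  # language output (e.g. a short plan or confirmation)
    joint_deltas: List[float]  # action modality: per-joint position deltas for one control step

def step(image_bytes: bytes, instruction: str) -> VLAOutput:
    """Placeholder for a single perception-to-action inference step."""
    # A real VLA model would condition on the camera frame and the instruction;
    # this stub just returns a no-op action to show the data flow.
    return VLAOutput(text=f"Acknowledged: {instruction}", joint_deltas=[0.0] * 7)

if __name__ == "__main__":
    out = step(b"<camera frame>", "pick up the banana and place it in the bowl")
    print(out.text, out.joint_deltas)
```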

Google says AI models for robotics need “three principal qualities”:


Generality: “able to adapt to different situations”

  • Gemini Robotics is “adept at dealing with new objects, diverse instructions, and new environments,” including “tasks it has never seen before in training” by leveraging Gemini’s underlying world understanding.

Interactivity: “understand and respond quickly to instructions or changes in their environment”

  • Google’s new model can “respond to commands phrased in everyday, conversational language and in different languages”

Dexterity: “can do the kinds of things people generally can do with their hands and fingers, like carefully manipulate objects.”

  • “Gemini Robotics can tackle extremely complex, multi-step tasks that require precise manipulation such as origami folding or packing a snack into a Ziploc bag.”

Google also announced Gemini Robotics-ER (“embodied reasoning”), a vision-language model that enhances Gemini’s “understanding of the world in ways necessary for robotics, focusing especially on spatial reasoning, and allows roboticists to connect it with their existing low-level controllers.”

 For example, when shown a coffee mug, the model can intuit an appropriate two-finger grasp for picking it up by the handle and a safe trajectory for approaching it.
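In that setup, the model supplies the spatial decision (where to grasp, how to approach) while execution stays with the roboticist’s own controller stack. A minimal, hypothetical sketch of that hand-off, with the response format and controller API assumed purely for illustration:

```python
# Hypothetical glue code between a spatial-reasoning VLM and an existing
# low-level controller. The response format and controller API are assumptions
# made for illustration; they are not Google's published interface.
from dataclasses import dataclass

@dataclass
class GraspProposal:
    x: float             # grasp point in the camera frame (normalized 0-1)
    y: float
    approach_deg: float  # approach angle around the vertical axis

def query_vlm_for_grasp(image_bytes: bytes, target: str) -> GraspProposal:
    """Stand-in for asking the VLM where and how to grasp the target object."""
    # e.g. the mug example: a two-finger pinch on the handle.
    return GraspProposal(x=0.62, y=0.48, approach_deg=90.0)

class LowLevelController:
    """Stand-in for a roboticist's existing controller stack."""
    def move_to(self, x: float, y: float, approach_deg: float) -> None:
        print(f"planning approach to ({x:.2f}, {y:.2f}) at {approach_deg} deg")

    def close_gripper(self) -> None:
        print("closing two-finger gripper")

if __name__ == "__main__":
    grasp = query_vlm_for_grasp(b"<camera frame>", "coffee mug")
    arm = LowLevelController()
    arm.move_to(grasp.x, grasp.y, grasp.approach_deg)
    arm.close_gripper()
```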

These models run on various robot form factors (including bi-arm and humanoid robots), with trusted testers like Agile Robots, Agility Robotics, Boston Dynamics, and Enchanted Tools.




Author

Abner Li

Editor-in-chief. Interested in the minutiae of Google and Alphabet. Tips/talk: abner@9to5g.com