
Gemini Robotics: Next-Generation Robot AI by Google


Introduction

Google DeepMind's Gemini Robotics, announced in 2025, is a foundational AI model specialized for robotics. Based on Gemini 2.0, it is designed as a VLA (Vision-Language-Action) model that integrates vision, language, and action.

This article provides an overview of Gemini Robotics and its primary features.

By reading this article, you will understand the following:

  • Basic mechanisms of Gemini Robotics
  • Features and performance of the VLA model
  • Introduction of the On-Device version
  • Actual use cases and future outlook

What is Gemini Robotics?

Gemini Robotics is an AI model for robots operating in the physical world. Built upon the multimodal reasoning capabilities of Gemini 2.0, it takes visual information and natural language instructions as input and translates them into robot action commands.

VLA (Vision-Language-Action) Model

At the core of Gemini Robotics is the VLA architecture:

  • Vision: Understanding images and videos from cameras and sensors
  • Language: Interpreting instructions in natural language
  • Action: Generating specific robotic movements

This approach allows robots to understand natural instructions like "Please pick up the cup on the table," identify the target object from visual information, and execute the appropriate action.
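Conceptually, one VLA step maps an observation (camera frame plus instruction) to an action. The sketch below is purely illustrative, with made-up types and a keyword-matching stub standing in for the neural network; it is not the actual Gemini Robotics interface:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    image: bytes       # camera frame (raw bytes here for simplicity)
    instruction: str   # natural-language command

@dataclass
class Action:
    gripper: str       # e.g. "open" / "close"
    target_xy: tuple   # where to move the end effector (normalized)

def vla_step(obs: Observation) -> Action:
    """Stub: a real VLA model maps (vision, language) to low-level
    actions with a single learned network, not keyword matching."""
    if "pick up" in obs.instruction:
        return Action(gripper="close", target_xy=(0.5, 0.5))
    return Action(gripper="open", target_xy=(0.0, 0.0))

obs = Observation(image=b"", instruction="Please pick up the cup on the table")
print(vla_step(obs).gripper)  # → close
```

The point of the structure is that a single model call consumes both modalities at once, rather than routing vision and language through separate subsystems.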

Key Features

1. High Generality

Gemini Robotics scores more than double that of conventional VLA models on a comprehensive generalization benchmark. It can adapt to new objects and unseen environments without additional task-specific training.

2. Dexterity

Supports complex manipulation tasks:

  • Folding origami
  • Packing items into a bag
  • Opening and closing zippers
  • Assembling industrial products

These tasks require precise control and multi-step planning.

3. Interactivity

  • Responds to natural language instructions
  • Adapts to environmental changes in real-time
  • Allows instruction changes during task execution
  • Multilingual support

Gemini Robotics-ER (Embodied Reasoning)

Gemini Robotics has a version called Gemini Robotics-ER, which is specialized for spatial awareness and reasoning.

Spatial Awareness Features

Gemini Robotics-ER excels at understanding physical space:

  • Object Position Recognition: Generates 2D coordinates and bounding boxes for objects in images
  • 3D Spatial Understanding: Supports experimental 3D reasoning features
  • Spatial Relationship Understanding: Grasps positional relationships between objects

Task Decomposition and Planning

It breaks down complex instructions into specific steps:

Instruction: "Please clean up the kitchen"

→ Decomposed steps:
1. Identify dishes on the table
2. Move dishes to the cupboard
3. Wipe the counter
4. Take out the trash
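The decomposition above amounts to mapping one high-level instruction to an ordered step list. A minimal sketch, with a fixed lookup standing in for the model's learned planning:

```python
def decompose(instruction: str) -> list[str]:
    """Hypothetical planner stub: a hard-coded plan table standing in
    for Gemini Robotics-ER's learned task decomposition."""
    plans = {
        "Please clean up the kitchen": [
            "Identify dishes on the table",
            "Move dishes to the cupboard",
            "Wipe the counter",
            "Take out the trash",
        ],
    }
    # Unknown instructions pass through as a single step
    return plans.get(instruction, [instruction])

for i, step in enumerate(decompose("Please clean up the kitchen"), 1):
    print(f"{i}. {step}")
```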

Integration via API

Gemini Robotics-ER is available through Google AI Studio and the Gemini API. It uses a normalized coordinate space (0-1000) and provides consistent output independent of image resolution.
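Because the coordinate space is normalized to 0-1000, client code must rescale returned boxes to the actual frame size before using them. A minimal helper (the function name is our own; the `[ymin, xmin, ymax, xmax]` ordering follows the Gemini API's documented bounding-box format):

```python
def denormalize_box(box, width, height):
    """Convert a [ymin, xmin, ymax, xmax] box in the 0-1000
    normalized space to pixel coordinates for a given image size."""
    ymin, xmin, ymax, xmax = box
    return (
        int(xmin / 1000 * width),
        int(ymin / 1000 * height),
        int(xmax / 1000 * width),
        int(ymax / 1000 * height),
    )

# A 640x480 frame: a box covering the center quarter of the image
print(denormalize_box([250, 250, 750, 750], 640, 480))
# → (160, 120, 480, 360)
```

This is what makes the output resolution-independent: the same box works for any camera as long as the client rescales it.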

Gemini Robotics On-Device

In June 2025, Google DeepMind announced Gemini Robotics On-Device. This version is optimized to be executed directly on the robot itself.

Advantages of On-Device

1. No Cloud Connection Required

Robots can operate without an internet connection:

  • Elimination of communication latency
  • Resilience to network failures
  • Improved privacy
  • Realization of real-time control

2. Maintaining High Performance

Maintains performance close to the cloud version despite local execution:

  • Performance significantly exceeding conventional on-device models
  • High success rate in complex tasks
  • Operation with low latency

3. Rapid Adaptation

New tasks can be learned from just 50-100 demonstrations, a substantial efficiency gain over previous models that required more than 500.

SDK Provision

The Gemini Robotics SDK includes the following:

  • Evaluation tools: Testing in the MuJoCo physics simulator
  • Fine-tuning pipeline: Adaptation to new tasks
  • Lifecycle management: CLI and Python tools
  • Agent framework: Building robot agents

# Installing the SDK (via PyPI)
pip install safari_sdk

Training Platform: ALOHA 2

Gemini Robotics is trained on the ALOHA 2 platform.

Features of ALOHA 2

  • Bimanual Robots: Coordinated operation with two arms
  • Open Source: Hardware design and software are publicly available
  • Teleoperation: Intuitive operation by humans is possible
  • Low Cost: Pricing suitable for research and development purposes

Demonstration Examples

At Google I/O 2025, a demo of Gemini Robotics using ALOHA 2 was showcased:

  • Packing a lunch box
  • Dunking a basketball
  • Responding to voice instructions
  • Adapting to new objects

These tasks succeeded even though the model had not been specifically trained on them.

Partnership with Boston Dynamics

In January 2026, Boston Dynamics and Google DeepMind announced a strategic partnership.

Partnership Details

  • Integration of Gemini Robotics into the Atlas humanoid robot
  • Addition of AI capabilities to the Spot quadruped robot
  • Pilot testing in manufacturing (at Hyundai plants)

Expected Effects

While conventional robots were limited to pre-programmed tasks, Gemini Robotics enables:

  • Understanding of instructions in natural language
  • Adaptation to unstructured environments
  • Complex planning and reasoning
  • Manipulation of new objects

These capabilities will be introduced to industrial robots.

Performance Benchmarks

Gemini Robotics shows performance that significantly exceeds existing VLA models.

Comparison Table

Item                            Gemini Robotics   Conventional VLA Models
Generality benchmark            Over 2x           Baseline
Dexterity                       High              Moderate
Instruction understanding       Excellent         Good
Demos required for adaptation   50-100            500+
On-device performance           High              Low to Moderate

Real-world Verification

Testing in real-world environments is underway in collaboration with several companies:

  • Apptronik (Humanoid robots)
  • Boston Dynamics (Atlas, Spot)
  • Agility Robotics (Logistics robots)
  • Enchanted Tools (Service robots)

Use Cases

Manufacturing

  • Assembly of industrial products
  • Quality inspection
  • Inventory management
  • Flexible production lines

Logistics

  • Product picking
  • Packing operations
  • Movement within warehouses
  • Delivery preparation

Medical and Nursing Care

  • Preparation of medical instruments
  • Patient support
  • Environmental organization and tidying
  • Rehabilitation support

Home

  • Household chore assistance
  • Tidying and organizing
  • Transporting items
  • Everyday tasks

Technical Mechanisms

Multimodal Processing

Gemini Robotics processes multiple inputs simultaneously:

Input:
- Camera images/videos
- Voice instructions
- Text instructions
- Sensor data

↓ Integrated understanding by Gemini 2.0

Output:
- Robot control commands
- Action sequences
- Execution plans

Continual Learning

Robots continue to learn during execution:

  1. Execute the task
  2. Feed back the results
  3. Fine-tune the model
  4. Improve in the next execution

Through this mechanism, performance improves the more it is used.
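The cycle above can be sketched as a toy execute-feedback loop. Everything here is illustrative: real fine-tuning runs through the SDK's pipeline on demonstration data, not a scalar "skill" update:

```python
import random

def run_episode(skill: float) -> bool:
    """Simulate one task execution; higher skill means a higher success rate."""
    return random.random() < skill

def fine_tune(skill: float, success: bool) -> float:
    """Toy update: nudge skill upward after each episode, capped at 1.0.
    Stands in for steps 2-3 (feed back results, fine-tune the model)."""
    return min(1.0, skill + (0.05 if success else 0.01))

random.seed(0)
skill = 0.3
for _ in range(50):   # 1. execute  2. feed back  3. fine-tune  4. repeat
    skill = fine_tune(skill, run_episode(skill))
print(round(skill, 2))  # skill has improved with use
```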

Safety Considerations

Google places high importance on ethics and safety:

  • Collaboration with experts
  • Development of safety guidelines
  • Conducting risk assessments
  • Ensuring transparency

Access Methods

Gemini Robotics-ER 1.5

  • Available through Google AI Studio
  • Integration via Gemini API
  • Python and REST interfaces
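Assuming the documented JSON point format (a list of `{"point": [y, x], "label": ...}` objects with 0-1000 normalized coordinates), a model response can be parsed like this; the sample string is illustrative, not real API output:

```python
import json

def parse_points(raw: str) -> list[tuple]:
    """Parse a JSON point list like [{"point": [y, x], "label": "cup"}]
    (0-1000 normalized coordinates) into (label, y, x) tuples."""
    return [(p["label"], p["point"][0], p["point"][1]) for p in json.loads(raw)]

# Hypothetical response: one detected cup near the image center
print(parse_points('[{"point": [400, 500], "label": "cup"}]'))
# → [('cup', 400, 500)]
```

In a real integration, `raw` would come from a `generate_content` call against the Robotics-ER model via the Gemini API, and the parsed coordinates would then be rescaled to the camera's resolution.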

Gemini Robotics On-Device

  • Application to the Trusted Tester Program is required
  • Documentation published on GitHub
  • Provision of SDK (safari_sdk)

Summary

Gemini Robotics represents a significant leap forward in robotics. Through the VLA model integrating vision, language, and action, robots are now able to understand human instructions and execute complex tasks.

Main points:

  • VLA model based on Gemini 2.0
  • Over twice the generality of conventional models
  • Support for on-device execution
  • Collaboration with major companies like Boston Dynamics
  • Wide range of applications from manufacturing to home use

The emergence of the On-Device version has enabled high-performance robot control without the need for a cloud connection. The partnership with Boston Dynamics is expected to accelerate practical implementation in industrial sectors.

As part of the initiative to integrate AI into the physical world, Gemini Robotics is opening up new possibilities in the field of robotics.
