
Action Chunking Transformer (ACT) policies and ALOHA: what are they?


What is ACT?

1. Action Chunking Transformer (ACT) Policies

The Action Chunking Transformer (ACT) is a transformer-based imitation-learning policy that predicts short sequences of actions (“action chunks”) rather than single steps, which helps capture longer‐range dependencies and reduces compounding errors during execution (Tony Z. Zhao).

ACT policies are typically trained and evaluated on the ALOHA suite of simulated bimanual manipulation tasks (e.g., cube transfer and insertion) provided via the gym-aloha package, and they can also be deployed on low-cost physical ALOHA robot kits to learn from human demonstrations shared through a community Data Pool (GitHub, Community).

1.1 Concept and Motivation

  • Action Chunks vs. Single Actions
    Instead of predicting one action at a time, ACT generates a fixed-length sequence of future actions—an “action chunk”—in a single forward pass, giving the policy awareness of short-term temporal context and improving stability (Tony Z. Zhao); a minimal execution-loop sketch follows this list.
  • Conditional Variational Modeling
    ACT is implemented as the decoder of a Conditional VAE: a transformer encoder processes multi-view camera images, robot joint states, and a latent style variable 𝑧, and the transformer decoder outputs the entire action chunk (Tony Z. Zhao).
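To make the execute-then-requery pattern concrete, here is a minimal Python sketch of a chunked control loop. The `policy` callable and the `chunk_size` argument are hypothetical stand-ins for whatever interface a given ACT implementation exposes; the only point is that the environment is stepped N times per policy query.

```python
import numpy as np

def run_chunked(env, policy, chunk_size=10, max_steps=400):
    """Roll out a policy that predicts a chunk of future actions per query.

    Assumes `policy(obs)` returns an array of shape (chunk_size, action_dim)
    and that `env` follows the gymnasium step API.
    """
    obs, _ = env.reset()
    steps = 0
    while steps < max_steps:
        # One forward pass -> a whole chunk of future actions.
        action_chunk = np.asarray(policy(obs))[:chunk_size]
        for action in action_chunk:  # execute the whole chunk before re-querying
            obs, reward, terminated, truncated, _ = env.step(action)
            steps += 1
            if terminated or truncated or steps >= max_steps:
                return steps
    return steps
```

The original ACT paper goes one step further with temporal ensembling, where overlapping chunks are predicted at every step and averaged; the sketch above shows only the simpler execute-then-requery variant.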

1.2 Architecture

  • Encoder
    Aggregates visual observations and proprioceptive inputs into a unified representation, enabling the model to reason over multiple modalities (Tony Z. Zhao).
  • Decoder
    Decodes a sequence of 𝑁 future actions (e.g., 10 steps) in one pass, which the robot can execute in batch before querying the next chunk, reducing decision overhead and error compounding (Radek Osmulski); a toy decoder sketch follows this list.
  • Open-Source Implementation
    The tonyzhaozh/act GitHub repository provides code to train and evaluate ACT on both simulated (gym-aloha) and real ALOHA hardware, supporting tasks like Transfer Cube and Bimanual Insertion (GitHub).
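To make the decoder idea more tangible, here is a heavily simplified PyTorch sketch (not the reference tonyzhaozh/act implementation): a fixed set of learned chunk queries cross-attends to already-fused observation tokens and is projected to joint-space actions. The image backbone, CVAE encoder, and all real hyperparameters are omitted.

```python
import torch
import torch.nn as nn

class TinyACTDecoder(nn.Module):
    """Toy stand-in for ACT's action decoder (not the official architecture)."""

    def __init__(self, d_model=256, chunk_size=10, action_dim=14):
        super().__init__()
        # One learned query embedding per future action in the chunk.
        self.chunk_queries = nn.Parameter(torch.randn(chunk_size, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, obs_tokens):
        # obs_tokens: (batch, num_tokens, d_model) fused image/state/latent features
        batch = obs_tokens.shape[0]
        queries = self.chunk_queries.unsqueeze(0).expand(batch, -1, -1)
        # All chunk positions are decoded in a single pass (no autoregression).
        decoded = self.decoder(tgt=queries, memory=obs_tokens)
        return self.action_head(decoded)  # (batch, chunk_size, action_dim)

# Example: 3 fused observation tokens -> a 10-step chunk of 14-D actions.
chunk = TinyACTDecoder()(torch.randn(1, 3, 256))
print(chunk.shape)  # torch.Size([1, 10, 14])
```

Because the queries are fixed learned embeddings rather than previously generated actions, the whole chunk comes out of one decoder pass, which is what keeps per-step decision overhead low.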

1.3 Applications and Performance

  • Fine-Grained Bimanual Manipulation
    ACT has demonstrated strong results on tasks requiring precise coordination of two arms, such as cube transfer and peg insertion, even when using low-cost, imprecise hardware (arXiv).
  • Success Rates in Simulation
    For example, the act_aloha_sim_transfer_cube_human policy trained via LeRobot achieved around an 83% success rate on the AlohaTransferCube task by learning 10-step action chunks from human demonstrations (Hugging Face).
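A checkpoint like the one above can, in principle, be pulled straight from the Hugging Face Hub and queried step by step. The sketch below assumes the module layout and observation keys used by recent lerobot releases (`lerobot.common.policies.act`, `observation.images.top`, `observation.state`); both have shifted between versions, so treat the exact paths as assumptions to verify against your installed version.

```python
import torch
from lerobot.common.policies.act.modeling_act import ACTPolicy  # path varies by lerobot version

policy = ACTPolicy.from_pretrained("lerobot/act_aloha_sim_transfer_cube_human")
policy.eval()
policy.reset()  # clears the internal action-chunk queue before a new rollout

# Dummy observation with the shapes used by the ALOHA sim tasks
# (one 480x640 top-camera image and a 14-D joint-state vector).
batch = {
    "observation.images.top": torch.zeros(1, 3, 480, 640),
    "observation.state": torch.zeros(1, 14),
}
with torch.no_grad():
    # Returns one action per call, replaying the internally buffered chunk.
    action = policy.select_action(batch)
print(action.shape)  # expected: torch.Size([1, 14])
```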

2. ALOHA: Simulated and Physical Robotic Environments

ALOHA (A Low-cost Open-source Hardware system for bimanual teleoperation) is the name of a family of robotics environments that exists in both simulated and physical form.

2.1 Gym-ALOHA Simulation

  • Two Core Tasks
    The gym-aloha package offers TransferCube and Insertion environments for bimanual robots, each featuring a 14-dimensional continuous action space (six joint commands plus one gripper command for each of the two arms) (GitHub); a short usage sketch follows this list.
  • Demonstration Datasets
    Hugging Face hosts human-collected demo datasets (e.g., lerobot/aloha_sim_transfer_cube_human), which are essential for behavior-cloning and ACT training pipelines (Hugging Face).
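Here is a minimal sketch of creating one of these environments and stepping it with random 14-D actions. The environment ID follows the gym-aloha README, but check the installed package for the exact registered names.

```python
import gymnasium as gym
import gym_aloha  # noqa: F401  (importing registers the ALOHA environments)

env = gym.make("gym_aloha/AlohaTransferCube-v0")
obs, info = env.reset()
print(env.action_space.shape)  # (14,) -- 7 commands per arm

for _ in range(5):
    action = env.action_space.sample()  # random 14-D bimanual command
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```

The matching human demonstrations can be pulled from the Hub with lerobot's dataset loader (e.g., `LeRobotDataset("lerobot/aloha_sim_transfer_cube_human")`), though, as with the policy classes, the exact import location depends on the installed lerobot version.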

2.2 Physical ALOHA Kit and Data Pool

  • Low-Cost Hardware
    The ALOHA system, originally developed by Tony Zhao and collaborators and now sold as a kit built from Trossen Robotics arms, is an affordable bimanual robot platform that hobbyists and researchers can assemble, enabling real-world data collection at low cost (Community).
  • Collaborative Data Sharing
    Users contribute their demonstration recordings to the Aloha Data Pool on Hugging Face Spaces, fostering community-driven improvements to policies trained with ACT and related methods (Community).

2.3 Real-World Success Stories

  • Beyond Simulation
    Real ALOHA robots have been trained to perform complex tasks—such as precise object transfer, garment manipulation, and even threading a hanger—showcasing the ability to learn dexterous skills on inexpensive hardware (newyorker.com).
