
CVAE explanation


Here’s a pared-down view of how ACT uses a Conditional VAE to turn what the robot “sees” and “feels” into a short burst of actions:

1. The Big Picture: Conditional VAE

  • VAE core idea: learn an encoder–decoder pair that can compress data into a “latent” code and then reconstruct it.
  • Conditional: we don’t just encode the data itself; we also feed in extra context (here, camera images and joint states) so the decoder knows what to generate (see the sketch below).
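
To make that concrete, here is a minimal PyTorch-style sketch of the training-time dataflow. The names (`style_encoder`, `policy`) and their call signatures are placeholders for illustration, not ACT’s actual modules; the encoder that produces the latent looks at the demonstrated actions, as discussed in section 3.

```python
import torch

def training_step(style_encoder, policy, images, joints, demo_actions):
    # Compress the demonstrated action chunk (given the robot state) into latent statistics.
    mu, logvar = style_encoder(demo_actions, joints)
    # Sample a latent code z from that distribution (reparameterization trick, section 3).
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    # Conditioned on what the robot sees and feels plus z, decode a chunk of future actions.
    predicted_actions = policy(images, joints, z)
    return predicted_actions, mu, logvar
```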

2. Inputs to the Encoder

  1. Multi-view camera images

  2. Current robot joint states (e.g. angles, velocities)

  3. Latent style variable 𝑧

    • This is like a “behavior knob”: you can tweak 𝑧 to get gentler, faster, or more exploratory movements (section 3 explains where 𝑧 comes from).

The encoder network (a transformer) combines all three into a compact context representation that the decoder can unfold into actions.
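
As a rough illustration, here is one way those three inputs could be fused into a single context for the decoder. The module name, layer sizes, and dimensions below are invented for this sketch (the real ACT model, for instance, first runs the raw images through a CNN backbone to get feature tokens):

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Toy policy encoder: fuses per-camera image features, joint state, and z
    into a set of context tokens for the action decoder."""
    def __init__(self, img_feat_dim=512, joint_dim=14, latent_dim=32, d_model=256):
        super().__init__()
        self.img_proj = nn.Linear(img_feat_dim, d_model)   # one token per camera feature
        self.joint_proj = nn.Linear(joint_dim, d_model)    # one token for the joint state
        self.z_proj = nn.Linear(latent_dim, d_model)       # one token for the style variable
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, img_feats, joints, z):
        # img_feats: (B, n_cams, img_feat_dim), joints: (B, joint_dim), z: (B, latent_dim)
        tokens = torch.cat([
            self.img_proj(img_feats),
            self.joint_proj(joints).unsqueeze(1),
            self.z_proj(z).unsqueeze(1),
        ], dim=1)
        return self.backbone(tokens)   # (B, n_cams + 2, d_model) context for the decoder
```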

3. Latent Space & Style Variable

  • During training, a separate encoder looks at the demonstrated actions and produces a mean and variance for 𝑧, so you can sample slightly different 𝑧’s and still get plausible actions; this extra encoder is only needed at training time.
  • At inference time you can sample 𝑧 from the prior (or simply take its mean), or pick a 𝑧 you like to control style.
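
A small sketch of both modes, assuming a `latent_dim` of 32 and treating the prior as a unit Gaussian (which is what the KL term in training pulls the latent distribution toward); the concrete values are purely illustrative:

```python
import torch

def sample_z(mu, logvar):
    # Reparameterization trick: z = mu + sigma * eps keeps the sampling step differentiable,
    # so the encoder that produced (mu, logvar) can be trained by backpropagation.
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

latent_dim = 32
z_mean_style = torch.zeros(1, latent_dim)   # inference option 1: the prior mean ("average" style)
z_sampled = torch.randn(1, latent_dim)      # inference option 2: sample from the unit-Gaussian prior
z_handpicked = 0.5 * z_sampled              # inference option 3: scale or hand-pick a code to steer style
```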

4. Decoder = Transformer Action Generator

  • The decoder takes the compressed context (which embeds images + joint info + 𝑧) and unfolds it into a sequence of future joint targets.
  • Because it’s a transformer decoder, it can flexibly produce an entire “action chunk” (e.g. the next 10–20 robot commands) in one go.
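
A matching sketch of that idea: a fixed set of learned query embeddings, one per future timestep, cross-attends to the fused context and is projected to joint commands, so the whole chunk comes out of a single forward pass. Names and sizes are again illustrative:

```python
import torch
import torch.nn as nn

class ChunkDecoder(nn.Module):
    """Toy action decoder: chunk_len learned queries cross-attend to the context tokens
    and are mapped to chunk_len future joint commands in one pass."""
    def __init__(self, d_model=256, act_dim=14, chunk_len=20):
        super().__init__()
        self.queries = nn.Parameter(torch.zeros(1, chunk_len, d_model))  # one query per future step
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, act_dim)   # decoded token -> joint command

    def forward(self, context_tokens):
        # context_tokens: (B, n_tokens, d_model) from the encoder above (images + joints + z)
        queries = self.queries.expand(context_tokens.size(0), -1, -1)
        out = self.decoder(queries, context_tokens)   # cross-attention over the fused context
        return self.head(out)                         # (B, chunk_len, act_dim) action chunk
```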

5. What You Get

  • A short burst of motion (“action chunk”) that’s consistent with what the robot sees, how it’s currently positioned, and the style you want.
  • By training end-to-end, the model learns both what to do next and how (the style) in one streamlined architecture.
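
For completeness, a sketch of the kind of end-to-end objective this implies: an action-reconstruction term plus a weighted KL regularizer that keeps the latent well-behaved. The L1 reconstruction loss and the weight value below are illustrative choices, not a claim about ACT’s exact hyperparameters:

```python
import torch
import torch.nn.functional as F

def cvae_loss(pred_actions, demo_actions, mu, logvar, kl_weight=10.0):
    # How closely the decoded chunk matches the demonstrated actions.
    reconstruction = F.l1_loss(pred_actions, demo_actions)
    # Keeps the learned latent distribution close to the unit-Gaussian prior,
    # so that sampling z at inference time still yields plausible behavior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return reconstruction + kl_weight * kl
```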
