
CVAE explanation


Here’s a pared-down view of how ACT uses a Conditional VAE to turn what the robot “sees” and “feels” into a short burst of actions:

1. The Big Picture: Conditional VAE

  • VAE core idea: learn an encoder–decoder pair that can compress data into a “latent” code and then reconstruct it.
  • Conditional: we don’t just encode the data itself; we also feed in extra context (here, camera images and joint states) so the decoder knows what to generate (see the sketch below).
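
To make that concrete, here is a minimal PyTorch-style sketch of the training-time dataflow. The names (`style_encoder`, `policy`) and their call signatures are placeholders for illustration, not ACT’s actual modules; the encoder that produces the latent looks at the demonstrated actions, as discussed in section 3.

```python
import torch

def training_step(style_encoder, policy, images, joints, demo_actions):
    # Compress the demonstrated action chunk (given the robot state) into latent statistics.
    mu, logvar = style_encoder(demo_actions, joints)
    # Sample a latent code z from that distribution (reparameterization trick, section 3).
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    # Conditioned on what the robot sees and feels plus z, decode a chunk of future actions.
    predicted_actions = policy(images, joints, z)
    return predicted_actions, mu, logvar
```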

2. Inputs to the Encoder

  1. Multi-view camera images

  2. Current robot joint states (e.g. angles, velocities)

  3. Latent style variable 𝑧

    • This is like a “behavior knob”: you can tweak 𝑧 to get gentler, faster, or more exploratory movements (section 3 explains where 𝑧 comes from).

The encoder network (a transformer) combines all three into a compact context representation that the decoder can unfold into actions.
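
As a rough illustration, here is one way those three inputs could be fused into a single context for the decoder. The module name, layer sizes, and dimensions below are invented for this sketch (the real ACT model, for instance, first runs the raw images through a CNN backbone to get feature tokens):

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Toy policy encoder: fuses per-camera image features, joint state, and z
    into a set of context tokens for the action decoder."""
    def __init__(self, img_feat_dim=512, joint_dim=14, latent_dim=32, d_model=256):
        super().__init__()
        self.img_proj = nn.Linear(img_feat_dim, d_model)   # one token per camera feature
        self.joint_proj = nn.Linear(joint_dim, d_model)    # one token for the joint state
        self.z_proj = nn.Linear(latent_dim, d_model)       # one token for the style variable
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, img_feats, joints, z):
        # img_feats: (B, n_cams, img_feat_dim), joints: (B, joint_dim), z: (B, latent_dim)
        tokens = torch.cat([
            self.img_proj(img_feats),
            self.joint_proj(joints).unsqueeze(1),
            self.z_proj(z).unsqueeze(1),
        ], dim=1)
        return self.backbone(tokens)   # (B, n_cams + 2, d_model) context for the decoder
```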

3. Latent Space & Style Variable

  • During training, a separate encoder looks at the demonstrated actions and produces a mean and variance for 𝑧, so you can sample slightly different 𝑧’s and still get plausible actions; this extra encoder is only needed at training time.
  • At inference time you can sample 𝑧 from the prior (or simply take its mean), or pick a 𝑧 you like to control style.
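
A small sketch of both modes, assuming a `latent_dim` of 32 and treating the prior as a unit Gaussian (which is what the KL term in training pulls the latent distribution toward); the concrete values are purely illustrative:

```python
import torch

def sample_z(mu, logvar):
    # Reparameterization trick: z = mu + sigma * eps keeps the sampling step differentiable,
    # so the encoder that produced (mu, logvar) can be trained by backpropagation.
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

latent_dim = 32
z_mean_style = torch.zeros(1, latent_dim)   # inference option 1: the prior mean ("average" style)
z_sampled = torch.randn(1, latent_dim)      # inference option 2: sample from the unit-Gaussian prior
z_handpicked = 0.5 * z_sampled              # inference option 3: scale or hand-pick a code to steer style
```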

4. Decoder = Transformer Action Generator

  • The decoder takes the compressed context (which embeds images + joint info + 𝑧) and unfolds it into a sequence of future joint targets.
  • Because it’s a transformer decoder, it can flexibly produce an entire “action chunk” (e.g. the next 10–20 robot commands) in one go.
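
A matching sketch of that idea: a fixed set of learned query embeddings, one per future timestep, cross-attends to the fused context and is projected to joint commands, so the whole chunk comes out of a single forward pass. Names and sizes are again illustrative:

```python
import torch
import torch.nn as nn

class ChunkDecoder(nn.Module):
    """Toy action decoder: chunk_len learned queries cross-attend to the context tokens
    and are mapped to chunk_len future joint commands in one pass."""
    def __init__(self, d_model=256, act_dim=14, chunk_len=20):
        super().__init__()
        self.queries = nn.Parameter(torch.zeros(1, chunk_len, d_model))  # one query per future step
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, act_dim)   # decoded token -> joint command

    def forward(self, context_tokens):
        # context_tokens: (B, n_tokens, d_model) from the encoder above (images + joints + z)
        queries = self.queries.expand(context_tokens.size(0), -1, -1)
        out = self.decoder(queries, context_tokens)   # cross-attention over the fused context
        return self.head(out)                         # (B, chunk_len, act_dim) action chunk
```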

5. What You Get

  • A short burst of motion (“action chunk”) that’s consistent with what the robot sees, how it’s currently positioned, and the style you want.
  • By training end-to-end, the model learns both what to do next and how (the style) in one streamlined architecture.
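
For completeness, a sketch of the kind of end-to-end objective this implies: an action-reconstruction term plus a weighted KL regularizer that keeps the latent well-behaved. The L1 reconstruction loss and the weight value below are illustrative choices, not a claim about ACT’s exact hyperparameters:

```python
import torch
import torch.nn.functional as F

def cvae_loss(pred_actions, demo_actions, mu, logvar, kl_weight=10.0):
    # How closely the decoded chunk matches the demonstrated actions.
    reconstruction = F.l1_loss(pred_actions, demo_actions)
    # Keeps the learned latent distribution close to the unit-Gaussian prior,
    # so that sampling z at inference time still yields plausible behavior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return reconstruction + kl_weight * kl
```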
