🐶
CVAE explanation
Here’s a pared-down view of how ACT uses a Conditional VAE to turn what the robot “sees” and “feels” into a short burst of actions:
1. The Big Picture: Conditional VAE
- VAE core idea: learn an encoder–decoder pair that can compress data into a “latent” code and then reconstruct it.
- Conditional: we don’t just encode the data itself; we also feed in extra context (here, camera images and joint states) so the decoder knows what to generate. A minimal sketch follows below.
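To make the encoder–decoder idea concrete, here is a minimal conditional-VAE sketch in PyTorch. It is not ACT’s network: the `TinyCVAE` name, the layer sizes, and the plain MLP layers are placeholders chosen for brevity. The point is only the shape of the computation: encode data plus context into (mean, log-variance), sample 𝑧, then decode with the context attached.

```python
# Minimal conditional-VAE sketch (illustrative, not ACT's architecture):
# the encoder compresses data + context into a latent code, and the decoder
# reconstructs the data from that code plus the same context.
import torch
import torch.nn as nn

class TinyCVAE(nn.Module):
    def __init__(self, data_dim=16, cond_dim=32, latent_dim=8):
        super().__init__()
        # Encoder: sees the data and the context, outputs latent statistics.
        self.enc = nn.Sequential(nn.Linear(data_dim + cond_dim, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, latent_dim)
        self.to_logvar = nn.Linear(64, latent_dim)
        # Decoder: reconstructs the data from the latent code plus the context.
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 64), nn.ReLU(), nn.Linear(64, data_dim)
        )

    def forward(self, x, cond):
        h = self.enc(torch.cat([x, cond], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: differentiable sample from N(mu, exp(logvar)).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        recon = self.dec(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar
```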
2. Inputs to the Encoder
- Multi-view camera images
- Current robot joint states (e.g. angles, velocities)
- During training, the demonstrated action chunk the model should learn to reproduce
The encoder network (a transformer) combines these and compresses them into the latent style variable 𝑧, which acts like a “behavior knob”: you can tweak 𝑧 to get gentler, faster, or more exploratory movements. A sketch of such an encoder follows this list.
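Below is a hedged sketch of what such an encoder could look like when everything is handled as transformer tokens: a summary token attends over per-view image features, the joint state, and the demonstrated actions, and is projected to the statistics of 𝑧. The class name `StyleEncoder`, the dimensions, and the assumption of pre-extracted image features are illustrative, not ACT’s published architecture.

```python
# Illustrative encoder sketch: tokens for image features, joint state, and the
# demonstrated action chunk are summarized into (mu, logvar) for the style z.
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    def __init__(self, img_feat_dim=512, joint_dim=14, action_dim=14,
                 d_model=128, latent_dim=32):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))    # summary token
        self.embed_img = nn.Linear(img_feat_dim, d_model)      # one token per camera view
        self.embed_joints = nn.Linear(joint_dim, d_model)      # one token for proprioception
        self.embed_actions = nn.Linear(action_dim, d_model)    # one token per future command
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_stats = nn.Linear(d_model, 2 * latent_dim)     # -> (mu, logvar)

    def forward(self, img_feats, joints, actions):
        # img_feats: (B, num_views, img_feat_dim), joints: (B, joint_dim),
        # actions: (B, chunk_len, action_dim)
        B = joints.shape[0]
        tokens = torch.cat([
            self.cls.expand(B, -1, -1),
            self.embed_img(img_feats),
            self.embed_joints(joints).unsqueeze(1),
            self.embed_actions(actions),
        ], dim=1)
        summary = self.encoder(tokens)[:, 0]                   # read the summary token
        mu, logvar = self.to_stats(summary).chunk(2, dim=-1)
        return mu, logvar
```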
3. Latent Space & Style Variable
- During training, the encoder actually produces a mean and variance for 𝑧, so you can sample slightly different 𝑧’s and still get plausible actions.
- At inference time there is no demonstration to encode, so you either sample 𝑧 from the prior or pin it to a fixed value (e.g. the prior mean) to control style, as in the helper below.
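A small helper showing the two regimes. The function name and the placeholder shapes are illustrative; only the reparameterization trick and the prior-sampling idea are the point.

```python
# Sampling the style variable z during training vs. choosing it at inference.
import torch

def sample_z(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Reparameterization trick: a differentiable sample from N(mu, exp(logvar))."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

# Training-time usage: the encoder supplies (mu, logvar) for each example.
mu, logvar = torch.zeros(4, 32), torch.zeros(4, 32)   # placeholder encoder outputs
z_train = sample_z(mu, logvar)

# Inference-time usage: no demonstration to encode, so either draw from the
# prior N(0, I) for varied behavior, or pin z (e.g. to zero, the prior mean)
# for a repeatable "style".
z_sampled = torch.randn(4, 32)
z_fixed = torch.zeros(4, 32)
```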
4. Decoder = Transformer Action Generator
- The decoder takes the compressed context (which embeds images + joint info + 𝑧) and unfolds it into a sequence of future joint targets.
- Because it’s a transformer decoder, it can flexibly produce an entire “action chunk” (e.g. the next 10–20 robot commands) in one go; see the decoder sketch below.
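One way to realize this is DETR-style: a fixed set of learned queries, one per future timestep, cross-attends to the fused context and each query is projected to a single joint-target command, so the whole chunk comes out in a single forward pass. The sketch below follows that pattern with illustrative names and sizes; it is not the exact ACT implementation.

```python
# Illustrative action-chunk decoder: learned per-timestep queries attend to the
# fused context (image features + joint state + z) and map to joint targets.
import torch
import torch.nn as nn

class ChunkDecoder(nn.Module):
    def __init__(self, d_model=128, action_dim=14, chunk_len=20, n_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.zeros(chunk_len, d_model))   # one query per future step
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, action_dim)                     # query -> joint command

    def forward(self, context):
        # context: (B, num_context_tokens, d_model), i.e. image/joint/z embeddings
        B = context.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)                # (B, chunk_len, d_model)
        decoded = self.decoder(tgt=q, memory=context)
        return self.head(decoded)                                      # (B, chunk_len, action_dim)
```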
5. What You Get
- A short burst of motion (“action chunk”) that’s consistent with what the robot sees, how it’s currently positioned, and the style you want.
- By training end-to-end with a reconstruction-plus-KL objective (sketched below), the model learns both what to do next and how (the style) in one streamlined architecture.
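A sketch of that objective under standard CVAE training: reconstruct the demonstrated action chunk and regularize 𝑧 toward a unit-Gaussian prior. The L1 reconstruction term, the `beta` weight, and the function name are illustrative choices, not confirmed hyperparameters.

```python
# End-to-end training sketch: action reconstruction + KL regularization of z.
import torch
import torch.nn.functional as F

def cvae_loss(pred_actions, true_actions, mu, logvar, beta=10.0):
    recon = F.l1_loss(pred_actions, true_actions)                   # match the demonstrated chunk
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # KL(q(z|...) || N(0, I))
    return recon + beta * kl
```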
Discussion