
1X: World Modeling Challenge at CVPR2024 workshop

Published 2024/06/23

Overview

  • At the 5th Embodied AI Workshop held at CVPR 2024, Eric Jang, who leads the AI team at 1X, gave an invited talk
    • The 4th workshop's videos are available on YouTube, so the 5th's may be published as well
    • If the video of the 5th workshop is published, this article will be updated
  • 1X announced that it will hold a competition for learned models

World Modeling Challenge

Predict Everything There Is to Know

  • flatten the data into a 1D sequence, predict with an LLM
  • Yann LeCun's cake analogy
    • Most data is cake, very little data is frosting + cherry, therefore you see better scaling laws when you train on the entire cake, not just the cherry
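The "flatten everything into one sequence" idea can be sketched concretely. This is my own toy illustration, not 1X's actual pipeline: per-frame image tokens and action tokens are interleaved into a single 1-D stream so a standard next-token objective can model all of the data jointly.

```python
# Toy sketch (assumed layout, not 1X's actual tokenization): interleave
# per-frame image tokens and action tokens into one 1-D sequence, so a
# next-token LLM objective can be trained on the whole stream.

def flatten_episode(frames, actions):
    """frames: list of per-frame image-token id lists.
    actions: one action-token id per frame.
    Returns [img tokens of frame 0..., action 0, img tokens of frame 1..., ...]."""
    seq = []
    for img_tokens, act in zip(frames, actions):
        seq.extend(img_tokens)   # spatial tokens of frame t
        seq.append(act)          # action taken at step t
    return seq

frames = [[11, 12, 13], [21, 22, 23]]    # toy 3-token "images"
actions = [900, 901]                     # toy action-token ids
print(flatten_episode(frames, actions))  # [11, 12, 13, 900, 21, 22, 23, 901]
```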

Scaling Laws Benefit from Massive # of Tasks

  • Intuition: when the number of tasks is large (as in LLMs, which model complex distributions), almost all data helps with some task in one way or another.
  • Slow data scaling: most pairs of tokens (B and C interactions) don't help with predicting what A does next. Performance only improves when new information about p(a|b,c) is revealed.

Why Does Scaling Not "Just Work" for Robotics Yet?

  • Real-world evaluation is not reproducible, so it is hard to measure the effect of scaling
  • Data freshness confounders
    • collect data Sunday, evaluate a 100M-parameter model on Monday, evaluate a 1B-parameter model on Friday; the 1B model is worse because it's Friday
  • Cross-entropy loss vs. other losses (e.g., MSE)
  • Architectural bottlenecks in robotics that restrict parameter efficiency
    • Clever inductive biases that hurt end-to-end optimization at some scale (e.g., many RNNs)
    • MSE loss limits the prediction to a single mode
  • Large models like RT-2-X (55B) are undertrained w.r.t. the test domain
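The "MSE limits prediction to one mode" point can be shown with a tiny numeric example of my own (not from the talk): with a bimodal target, the MSE-optimal point prediction is the mean of the modes, a value the data never actually takes, while a cross-entropy model over discrete bins preserves both modes.

```python
# Toy illustration (my example): MSE collapses a bimodal target to its mean,
# while a categorical (cross-entropy) model keeps both modes.

targets = [-1.0, -1.0, 1.0, 1.0]           # two modes, at -1 and +1

# argmin_p sum (y - p)^2 is the empirical mean -- neither mode.
mse_optimum = sum(targets) / len(targets)
print(mse_optimum)                         # 0.0

# A cross-entropy model over discrete bins matches empirical frequencies,
# so both modes survive.
bins = {-1.0: 0, 1.0: 0}
for y in targets:
    bins[y] += 1
probs = {k: v / len(targets) for k, v in bins.items()}
print(probs)                               # {-1.0: 0.5, 1.0: 0.5}
```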

Why 1X World Model Challenge?

  • We think it may soon be possible to do end-to-end evaluation (+ training?) of millions of tasks in a learned world model
    • seeing scaling laws requires millions of tasks, not dozens
  • Solving the evaluation problem would greatly accelerate progress in general-purpose & humanoid robotics
  • The set of people who are qualified to help is far larger than the set of academic researchers, so we are going for a public challenge rather than an academic benchmark.
    • inspiration: scrollprize.org, commavq

1X World Model Challenge

Challenge Overview (v0.0.1)

  • https://github.com/1x-technologies/1xgpt
  • Dataset
    • 100+ hours of EVE humanoids doing many tasks (driving, manipulation, laundry, tidying)
    • Tokenized images (20x20, C=1000)
    • Actions (joint angles, wheel velocities, gripper open/close)
  • Task
    • centered around predicting the future: given past image tokens, predict the next image
  • Baselines: Llama3-style LLM, GENIE
  • Largely inspired by the commavq compression challenge
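To make the dataset description concrete, here is a rough sketch of the data layout, with shapes taken from the bullets above (20x20 tokens per frame, codebook size C=1000) and the 16-frame window described later in the challenge details; the random tokens are obviously placeholders, not real data.

```python
# Sketch of the assumed data layout: 16 frames of 20x20 discrete image
# tokens drawn from a 1000-entry codebook. Random ids stand in for real data.
import numpy as np

T, H, W, C = 16, 20, 20, 1000
rng = np.random.default_rng(0)
video_tokens = rng.integers(0, C, size=(T, H, W))  # discrete image tokens

# Next-frame prediction: context is frames 0..T-2, target is the final frame.
context = video_tokens[:-1].reshape(-1)            # (T-1)*H*W flattened ids
target = video_tokens[-1].reshape(-1)              # 400 token ids to predict
print(context.shape, target.shape)                 # (6000,) (400,)
```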

GENIE Baseline

  • Next-word LLMs are too slow to run in real time
  • We provide an open-source GENIE (Bruce et al., 2024) implementation in the GitHub repo
  • ST-Transformer alternates self-attention between spatial (image) and time dimensions to keep sequence length tractable
  • Sample image tokens within a frame in parallel, combined with MaskGIT sampling
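The ST-Transformer bullet can be backed by a back-of-the-envelope comparison (my arithmetic, using the dataset's 16-frame, 20x20 token grid): joint attention over all tokens scales with the square of T*H*W, while alternating spatial and temporal attention only ever attends within a frame (length H*W) or along one token's timeline (length T).

```python
# Back-of-the-envelope attention-cost comparison (my numbers, using the
# 16x20x20 token grid from the dataset) showing why alternating
# spatial/temporal attention keeps sequence length tractable.
T, H, W = 16, 20, 20
S = H * W                       # 400 tokens per frame

full_attn = (T * S) ** 2        # joint attention over all 6400 tokens
st_attn = T * S**2 + S * T**2   # spatial layers: T sequences of length S;
                                # temporal layers: S sequences of length T
print(full_attn, st_attn)       # 40960000 2662400, roughly a 15x reduction
```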

Open Questions

  • Should we be scaling policies p(a|x0...x15) or world models p(x0, ..., x15, a)?
    • I.e., if we see scaling emerge faster with predicting image tokens, is that the right thing to scale up if we ultimately care about actions?
  • Is predicting future images a waste of parameters?
    • The future contains a lot of irreducible entropy, maybe parameters are better spent modeling longer horizons with contrastive methods (V-JEPA)?
  • How do we run inference with big world models fast enough?
    • Does the dexterous 10 Hz policy ultimately act as a "parameter bottleneck" in scaling up VLMs?

1X's EVE

https://www.youtube.com/watch?v=iHXuU3nTXfQ

https://www.1x.tech/androids/eve
https://robotsguide.com/robots/eve

  • Hardware specs
    • Height: 1889 mm
    • Weight: 87 kg
    • Travel speed (walking): 4 m/s
    • Payload: 14.96 kg
    • Runtime: 6 hours of operation on 1 hour of charge
  • Cameras
    • Two fisheye cameras mounted on the head
      • Since they are placed side by side horizontally, they may be used in a stereo configuration
  • End effector

1x-technologies/1xgpt

  • The assets folder contains several GIFs of inference results on EVE's camera images

Challenges

  • Predict the next image from the preceding images, using sequences of 16 images (8 seconds total) captured at 2 Hz by the camera in EVE's face

1. Compression

  • Prize: $10,000
  • Predict the discrete distribution over the next image's tokens
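Why predicting a discrete distribution amounts to compression can be shown with a small worked example (hypothetical probabilities, not the official metric): the code length for a token under an entropy coder is about -log2 of the probability the model assigned to it, so sharper predictions mean fewer bits per frame.

```python
# Sketch of the compression view (my toy numbers): the bits needed to encode
# each ground-truth token is -log2 of the model's predicted probability.
import math

# Model's predicted probability for each ground-truth token of a tiny "frame".
p_true_tokens = [0.5, 0.25, 0.125, 0.125]

bits = sum(-math.log2(p) for p in p_true_tokens)  # total code length
print(bits)                          # 1 + 2 + 3 + 3 = 9.0 bits
print(bits / len(p_true_tokens))     # 2.25 bits per token
```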

2. Sampling

  • Prize: $10,000
  • Generate the next image using methods such as GANs, diffusion, or MaskGIT

3. Evaluation

  • Given a set of N policies, each predicting action tokens from image tokens, evaluate all of the policies inside the world model
  • Also rank which policy is best
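The evaluation task can be sketched as a rollout-and-rank loop. Everything here is hypothetical scaffolding of my own, not the challenge's actual API: `ToyWorldModel`, `score_fn`, and the policies are stand-ins for a learned world model and real policies.

```python
# Hypothetical sketch: roll each policy out inside a (stubbed) world model,
# score the rollouts, and rank the policies best-first. All names here are
# placeholders, not the challenge's actual interface.

def evaluate_policies(policies, world_model, score_fn, steps=16):
    scores = {}
    for name, policy in policies.items():
        state = world_model.reset()
        total = 0.0
        for _ in range(steps):
            action = policy(state)            # action from current observation
            state = world_model.step(action)  # model-predicted next observation
            total += score_fn(state)
        scores[name] = total
    return sorted(scores, key=scores.get, reverse=True)  # best policy first

class ToyWorldModel:
    """Stub world model: the 'observation' is just a running sum of actions."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, action):
        self.s += action
        return self.s

policies = {"noop": lambda s: 0, "greedy": lambda s: 1}
ranking = evaluate_policies(policies, ToyWorldModel(), score_fn=lambda s: s)
print(ranking)  # ['greedy', 'noop']
```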

Data (Version: 0.0.1)

Discussion