
1X: World Modeling Challenge at CVPR2024 workshop

Published 2024/06/23

Overview

  • At the 5th Embodied AI Workshop held at CVPR 2024, Eric Jang, who leads the AI team at 1X, gave an invited talk
    • The 4th workshop's videos are available on YouTube, so the 5th's may be published as well
    • If the video of the 5th workshop is published, this article will be updated
  • 1X announced that it will hold a competition for learned models

World Modeling Challenge

Predict Everything There Is to Know

  • flatten the data into a 1D sequence, predict with an LLM
  • Yann LeCun's cake analogy
    • Most data is cake, very little data is frosting + cherry, therefore you see better scaling laws when you train on the entire cake, not just the cherry
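The "flatten everything into one sequence" idea can be sketched concretely. This is my own toy illustration, not 1X's actual pipeline: per-frame image tokens and action tokens are interleaved into a single 1-D stream so a standard next-token objective can model all of the data jointly.

```python
# Toy sketch (assumed layout, not 1X's actual tokenization): interleave
# per-frame image tokens and action tokens into one 1-D sequence, so a
# next-token LLM objective can be trained on the whole stream.

def flatten_episode(frames, actions):
    """frames: list of per-frame image-token id lists.
    actions: one action-token id per frame.
    Returns [img tokens of frame 0..., action 0, img tokens of frame 1..., ...]."""
    seq = []
    for img_tokens, act in zip(frames, actions):
        seq.extend(img_tokens)   # spatial tokens of frame t
        seq.append(act)          # action taken at step t
    return seq

frames = [[11, 12, 13], [21, 22, 23]]    # toy 3-token "images"
actions = [900, 901]                     # toy action-token ids
print(flatten_episode(frames, actions))  # [11, 12, 13, 900, 21, 22, 23, 901]
```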

Scaling Laws Benefit from Massive # of Tasks

  • Intuition: when the number of tasks is large (as in LLMs, which model complex distributions), almost all data helps with some task in one way or another.
  • Slow data scaling: most pairs of tokens (B and C interactions) don't help with predicting what A does next. Performance only improves when new information about p(a|b,c) is revealed.

Why Does Scaling Not "Just Work" for Robotics Yet?

  • Real-world evaluation is not reproducible, so it is hard to measure the effect of scaling
  • Data freshness confounders
    • collect data Sunday, evaluate a 100M-parameter model on Monday, evaluate a 1B-parameter model on Friday; the 1B model is worse because it's Friday
  • Cross-entropy loss vs. other losses (e.g., MSE)
  • Architectural bottlenecks in robotics that restrict parameter efficiency
    • Clever inductive biases that hurt end-to-end optimization at some scale (e.g., many RNNs)
    • MSE loss limits the prediction to a single mode
  • Large models like RT-2-X (55B) are undertrained w.r.t. the test domain
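The "MSE limits prediction to one mode" point can be shown with a tiny numeric example of my own (not from the talk): with a bimodal target, the MSE-optimal point prediction is the mean of the modes, a value the data never actually takes, while a cross-entropy model over discrete bins preserves both modes.

```python
# Toy illustration (my example): MSE collapses a bimodal target to its mean,
# while a categorical (cross-entropy) model keeps both modes.

targets = [-1.0, -1.0, 1.0, 1.0]           # two modes, at -1 and +1

# argmin_p sum (y - p)^2 is the empirical mean -- neither mode.
mse_optimum = sum(targets) / len(targets)
print(mse_optimum)                         # 0.0

# A cross-entropy model over discrete bins matches empirical frequencies,
# so both modes survive.
bins = {-1.0: 0, 1.0: 0}
for y in targets:
    bins[y] += 1
probs = {k: v / len(targets) for k, v in bins.items()}
print(probs)                               # {-1.0: 0.5, 1.0: 0.5}
```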

Why 1X World Model Challenge?

  • We think it may soon be possible to do end-to-end evaluation (+ training?) of millions of tasks in a learned world model
    • seeing scaling laws requires millions of tasks, not dozens
  • Solving the evaluation problem would greatly accelerate progress in general-purpose & humanoid robotics
  • The set of people who are qualified to help is far larger than the set of academic researchers, so we are going for a public challenge rather than an academic benchmark.
    • inspiration: scrollprize.org, commavq

1X World Model Challenge

Challenge Overview (v0.0.1)

  • https://github.com/1x-technologies/1xgpt
  • Dataset
    • 100+ hours of EVE humanoids doing many tasks (driving, manipulation, laundry, tidying)
    • Tokenized images (20x20, C=1000)
    • Actions (joint angles, wheel velocities, gripper open/close)
  • Task
    • centered around predicting the future: given past image tokens, predict the next image
  • Baselines: Llama3-style LLM, GENIE
  • Largely inspired by the commavq compression challenge
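To make the dataset description concrete, here is a rough sketch of the data layout, with shapes taken from the bullets above (20x20 tokens per frame, codebook size C=1000) and the 16-frame window described later in the challenge details; the random tokens are obviously placeholders, not real data.

```python
# Sketch of the assumed data layout: 16 frames of 20x20 discrete image
# tokens drawn from a 1000-entry codebook. Random ids stand in for real data.
import numpy as np

T, H, W, C = 16, 20, 20, 1000
rng = np.random.default_rng(0)
video_tokens = rng.integers(0, C, size=(T, H, W))  # discrete image tokens

# Next-frame prediction: context is frames 0..T-2, target is the final frame.
context = video_tokens[:-1].reshape(-1)            # (T-1)*H*W flattened ids
target = video_tokens[-1].reshape(-1)              # 400 token ids to predict
print(context.shape, target.shape)                 # (6000,) (400,)
```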

GENIE Baseline

  • Next-word LLMs are too slow to run in real time
  • We provide an open-source GENIE (Bruce et al., 2024) implementation in the GitHub repo
  • ST-Transformer alternates self-attention between spatial (image) and time dimensions to keep sequence length tractable
  • Sample image tokens within a frame in parallel, combined with MaskGIT sampling
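The ST-Transformer bullet can be backed by a back-of-the-envelope comparison (my arithmetic, using the dataset's 16-frame, 20x20 token grid): joint attention over all tokens scales with the square of T*H*W, while alternating spatial and temporal attention only ever attends within a frame (length H*W) or along one token's timeline (length T).

```python
# Back-of-the-envelope attention-cost comparison (my numbers, using the
# 16x20x20 token grid from the dataset) showing why alternating
# spatial/temporal attention keeps sequence length tractable.
T, H, W = 16, 20, 20
S = H * W                       # 400 tokens per frame

full_attn = (T * S) ** 2        # joint attention over all 6400 tokens
st_attn = T * S**2 + S * T**2   # spatial layers: T sequences of length S;
                                # temporal layers: S sequences of length T
print(full_attn, st_attn)       # 40960000 2662400, roughly a 15x reduction
```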

Open Questions

  • Should we be scaling policies p(a|x0...x15) or world models p(x0, ..., x15, a)?
    • I.e., if we see scaling emerge faster with predicting image tokens, is that the right thing to scale up if we ultimately care about actions?
  • Is predicting future images a waste of parameters?
    • The future contains a lot of irreducible entropy, maybe parameters are better spent modeling longer horizons with contrastive methods (V-JEPA)?
  • How do we run inference with big world models fast enough?
    • Does the dexterous 10 Hz policy ultimately act as a "parameter bottleneck" in scaling up VLMs?

1X's EVE

https://www.youtube.com/watch?v=iHXuU3nTXfQ

https://www.1x.tech/androids/eve
https://robotsguide.com/robots/eve

  • Hardware specs
    • Height: 1889 mm
    • Weight: 87 kg
    • Travel speed (walking): 4 m/s
    • Payload: 14.96 kg
    • Runtime: 6 hours of operation on 1 hour of charge
  • Cameras
    • Two fisheye cameras mounted on the head
      • Since they are placed side by side horizontally, they may be used in a stereo configuration
  • End effector

1x-technologies/1xgpt

  • The assets folder contains several GIFs of inference results on EVE's camera images

Challenges

  • Predict the next image from the preceding images, using sequences of 16 images (8 seconds total) captured at 2 Hz by the camera in EVE's face

1. Compression

  • Prize: $10,000
  • Predict the discrete distribution over the next image's tokens
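Why predicting a discrete distribution amounts to compression can be shown with a small worked example (hypothetical probabilities, not the official metric): the code length for a token under an entropy coder is about -log2 of the probability the model assigned to it, so sharper predictions mean fewer bits per frame.

```python
# Sketch of the compression view (my toy numbers): the bits needed to encode
# each ground-truth token is -log2 of the model's predicted probability.
import math

# Model's predicted probability for each ground-truth token of a tiny "frame".
p_true_tokens = [0.5, 0.25, 0.125, 0.125]

bits = sum(-math.log2(p) for p in p_true_tokens)  # total code length
print(bits)                          # 1 + 2 + 3 + 3 = 9.0 bits
print(bits / len(p_true_tokens))     # 2.25 bits per token
```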

2. Sampling

  • Prize: $10,000
  • Generate the next image using methods such as GANs, diffusion, or MaskGIT

3. Evaluation

  • Given a set of N policies, each predicting action tokens from image tokens, evaluate all of the policies inside the world model
  • Also rank which policy is best
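The evaluation task can be sketched as a rollout-and-rank loop. Everything here is hypothetical scaffolding of my own, not the challenge's actual API: `ToyWorldModel`, `score_fn`, and the policies are stand-ins for a learned world model and real policies.

```python
# Hypothetical sketch: roll each policy out inside a (stubbed) world model,
# score the rollouts, and rank the policies best-first. All names here are
# placeholders, not the challenge's actual interface.

def evaluate_policies(policies, world_model, score_fn, steps=16):
    scores = {}
    for name, policy in policies.items():
        state = world_model.reset()
        total = 0.0
        for _ in range(steps):
            action = policy(state)            # action from current observation
            state = world_model.step(action)  # model-predicted next observation
            total += score_fn(state)
        scores[name] = total
    return sorted(scores, key=scores.get, reverse=True)  # best policy first

class ToyWorldModel:
    """Stub world model: the 'observation' is just a running sum of actions."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, action):
        self.s += action
        return self.s

policies = {"noop": lambda s: 0, "greedy": lambda s: 1}
ranking = evaluate_policies(policies, ToyWorldModel(), score_fn=lambda s: s)
print(ranking)  # ['greedy', 'noop']
```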

Data (Version: 0.0.1)

Discussion