Open · Comment added on 2021/02/18 · 3 comments

Looking into TensorFlow 3D

TensorFlow

Machine Learning

Shuichi Tsutsumi

First, read this: the Google AI Blog post published on 2021-02-11:

It describes what TensorFlow 3D is and what it can do. I'll excerpt a few passages below.

Background to the release of TensorFlow 3D

but entry to the field can be challenging due to the limited availability [of] tools and resources that can be applied to 3D data.

In order to further improve 3D scene understanding and reduce barriers to entry for interested researchers, we are releasing TensorFlow 3D (TF 3D), a highly modular and efficient library that is designed to bring 3D deep learning capabilities into TensorFlow.

What TensorFlow 3D can do

TF 3D provides a set of popular operations, loss functions, data processing tools, models and metrics that enables the broader research community to develop, train and deploy state-of-the-art 3D scene understanding models.

TF 3D contains training and evaluation pipelines for state-of-the-art 3D semantic segmentation, 3D object detection and 3D instance segmentation, with support for distributed training. It also enables other potential applications like 3D object shape prediction, point cloud registration and point cloud densification. In addition, it offers a unified dataset specification and configuration for training and evaluation of the standard 3D scene understanding datasets. It currently supports the Waymo Open, ScanNet, and Rio datasets. However, users can freely convert other popular datasets, such as NuScenes and Kitti, into a similar format and use them in the pre-existing or custom created pipelines, and can leverage TF 3D for a wide variety of 3D deep learning research and applications, from quickly prototyping and trying new ideas to deploying a real-time inference system.

(Left) Example output of the 3D object detection model
(Right) Example output of the 3D instance segmentation model

3D Sparse Convolutional Network

According to the post, this 3D Sparse Convolutional Network is the key to achieving state-of-the-art results on a variety of 3D scene understanding tasks.

The 3D data captured by sensors often consists of a scene that contains a set of objects of interest (e.g. cars, pedestrians, etc.) surrounded mostly by open space, which is of limited (or no) interest. As such, 3D data is inherently sparse. In such an environment, standard implementation of convolutions would be computationally intensive and consume a large amount of memory. So, in TF 3D we use submanifold sparse convolution and pooling operations, which are designed to process 3D sparse data more efficiently. Sparse convolutional models are core to the state-of-the-art methods applied in most outdoor self-driving (e.g. Waymo, NuScenes) and indoor benchmarks (e.g. ScanNet).
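To make the idea of sparsity and submanifold convolution concrete, here is a minimal pure-Python sketch. It is not TF 3D's implementation or API: the function names `voxelize` and `submanifold_conv`, the point-count feature, and the box-filter weights are all assumptions made for illustration. It stores only occupied voxels in a dictionary and evaluates a 3x3x3 convolution only at those occupied sites, so the set of active voxels does not grow.

```python
from itertools import product


def voxelize(points, voxel_size):
    """Map each 3D point to an integer voxel index; store only occupied voxels."""
    voxels = {}
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        voxels[key] = voxels.get(key, 0.0) + 1.0  # illustrative feature: point count
    return voxels


def submanifold_conv(voxels, weights):
    """3x3x3 convolution evaluated only at active (occupied) voxels.

    Output sites are exactly the input sites, so the active set does not
    dilate. Inactive neighbours would contribute zero and are skipped.
    """
    out = {}
    for (i, j, k) in voxels:
        total = 0.0
        for off in product((-1, 0, 1), repeat=3):
            neighbour = (i + off[0], j + off[1], k + off[2])
            if neighbour in voxels:
                total += weights[off] * voxels[neighbour]
        out[(i, j, k)] = total
    return out


# Four made-up points: three close together, one far away.
points = [(0.1, 0.2, 0.3), (0.4, 0.1, 0.2), (1.2, 0.3, 0.1), (7.5, 7.5, 7.5)]
voxels = voxelize(points, voxel_size=1.0)

# Box filter (all weights 1.0), chosen only to keep the arithmetic easy to follow.
weights = {off: 1.0 for off in product((-1, 0, 1), repeat=3)}
features = submanifold_conv(voxels, weights)

# Fraction of an 8x8x8 grid that is actually occupied.
occupancy = len(voxels) / 8 ** 3
```

In this toy scene only 3 of 512 voxels are occupied, so a dense convolution would spend almost all of its work on empty space, while the sparse version touches only the occupied sites.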

TF 3D then uses the 3D submanifold sparse U-Net architecture to extract a feature for each voxel. The U-Net architecture has proven to be effective by letting the network extract both coarse and fine features and combining them to make the predictions. The U-Net network consists of three modules, an encoder, a bottleneck, and a decoder, each of which consists of a number of sparse convolution blocks with possible pooling or un-pooling operations.

A 3D sparse voxel U-Net architecture. Note that a horizontal arrow takes in the voxel features and applies a submanifold sparse convolution to it. An arrow that is moving down performs a submanifold sparse pooling. An arrow that is moving up will gather back the pooled features, concatenate them with the features coming from the horizontal arrow, and perform a submanifold sparse convolution on the concatenated features.
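The data flow in the caption above (pool on the way down, gather back and concatenate on the way up) can be sketched in a few lines of pure Python. This is an illustration under assumptions, not TF 3D code: `sparse_pool` and `unpool_and_concat` are hypothetical names, max pooling over 2x2x2 blocks is an assumed choice (the post only says "pooling"), and the convolution steps are left out to keep the focus on how features move between resolutions.

```python
def sparse_pool(voxels, factor=2):
    """Max-pool features of active voxels into coarser active voxels."""
    pooled = {}
    for (i, j, k), feat in voxels.items():
        key = (i // factor, j // factor, k // factor)
        if key in pooled:
            pooled[key] = [max(a, b) for a, b in zip(pooled[key], feat)]
        else:
            pooled[key] = list(feat)
    return pooled


def unpool_and_concat(fine, coarse, factor=2):
    """Gather coarse features back to each fine voxel and concatenate them
    with that voxel's own features (the skip connection)."""
    return {
        (i, j, k): list(feat) + coarse[(i // factor, j // factor, k // factor)]
        for (i, j, k), feat in fine.items()
    }


# Three active voxels, each with a one-channel feature vector (made-up values).
fine = {(0, 0, 0): [1.0], (1, 0, 0): [3.0], (4, 5, 6): [2.0]}

coarse = sparse_pool(fine)                # downward arrow
merged = unpool_and_concat(fine, coarse)  # upward arrow plus concatenation
```

After the round trip, every original voxel carries both its own fine feature and the coarse feature of the block it belongs to, which is what lets the decoder combine coarse and fine information.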

According to the post, this sparse convolutional network is the backbone of the three pipelines explained later (3D semantic segmentation, 3D instance segmentation, and 3D object detection):

The sparse convolutional network described above is the backbone for the 3D scene understanding pipelines that are offered in TF 3D. Each of the models described below uses this backbone network to extract features for the sparse voxels, and then adds one or multiple additional prediction heads to infer the task of interest. The user can configure the U-Net network by changing the number of encoder / decoder layers and the number of convolutions in each layer, and by modifying the convolution filter sizes, which enables a wide range of speed / accuracy tradeoffs to be explored through the different backbone configurations.
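As a rough picture of what "configuring the backbone" could look like, the sketch below models the knobs mentioned in the quote (number of encoder / decoder layers, convolutions per layer, filter size) as a small dataclass. Everything here is hypothetical: `UNetBackboneConfig`, its field names, and the channel counts are invented for illustration and do not reflect TF 3D's actual configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UNetBackboneConfig:
    """Hypothetical backbone config; names are illustrative, not TF 3D's API."""
    encoder_channels: tuple   # one entry per encoder layer
    bottleneck_channels: int
    decoder_channels: tuple   # one entry per decoder layer
    convs_per_layer: int = 2
    kernel_size: int = 3

    def num_convolutions(self):
        """Total convolutions across encoder, bottleneck, and decoder."""
        layers = len(self.encoder_channels) + 1 + len(self.decoder_channels)
        return layers * self.convs_per_layer


# A shallow configuration that trades accuracy for speed (made-up numbers).
fast = UNetBackboneConfig((32, 64), 128, (64, 32), convs_per_layer=1)

# A deeper configuration that trades speed for accuracy (made-up numbers).
accurate = UNetBackboneConfig((32, 64, 128), 256, (128, 64, 32), convs_per_layer=3)
```

Comparing `fast.num_convolutions()` with `accurate.num_convolutions()` gives a crude sense of how much more work the deeper backbone does per forward pass, which is the speed / accuracy tradeoff the post refers to.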