Building TensorFlow v2.12.0 with CUDA enabled on Ubuntu 20.04 + CUDA 11.6/11.7 aborts at the final task with an undefined reference
bazel-out/k8-opt/bin/tensorflow/compiler/xla/stream_executor/cuda/libcuda_graph.pic.a(cuda_graph.pic.o)
undefined reference to `cudaGraphDebugDotPrint`
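Apparently, with CUDA 12.0 or later the build goes through with only minor tweaks (GCC 12-specific adjustments). On Ubuntu 20.04 (GCC 9) + CUDA 11.6/11.7, however, it aborts at the very end.
The abort looks related to XLA + CUDA, so rather than patching the source in clever ways, I will try disabling XLA through build options. ./configure presumably reads the configuration patterns defined in .bazelrc and generates the default combination of Bazel parameters. Passing --define= options to the bazel build command then appears to override the Bazel startup parameters that ./configure generated internally (in practice they seem to be simply written out to a temporary file) with user-specified values.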
Looking inside .tf_configure.bazelrc, the file generated by running ./configure: as expected, it contains exactly the answers I typed in by hand for the interactive questions that ./configure asks.
build --action_env PYTHON_BIN_PATH="/usr/bin/python3"
build --action_env PYTHON_LIB_PATH="/usr/lib/python3.8/dist-packages"
build --python_path="/usr/bin/python3"
build --config=tensorrt
build --action_env CUDA_TOOLKIT_PATH="/usr/local/cuda-11.6"
build --action_env TF_CUDA_COMPUTE_CAPABILITIES="6.1,7.5,8.6"
build --action_env LD_LIBRARY_PATH="/home/xxxx/intel/openvino_2022/tools/compile_tool:/home/xxxx/intel/openvino_2022/runtime/3rdparty/tbb/lib::/home/xxxx/intel/openvino_2022/runtime/3rdparty/hddl/lib:/home/xxxx/intel/openvino_2022/runtime/lib/intel64:/usr/local/lib/python3.8/dist-packages/nvidia/nccl/lib:/usr/local/lib/python3.8/dist-packages/nvidia/cufft/lib:/usr/local/cuda/lib64"
build --action_env GCC_HOST_COMPILER_PATH="/usr/bin/x86_64-linux-gnu-gcc-9"
build --config=cuda
build:opt --copt=-Wno-sign-compare
build:opt --host_copt=-Wno-sign-compare
test --flaky_test_attempts=3
test --test_size_filters=small,medium
test --test_env=LD_LIBRARY_PATH
test:v1 --test_tag_filters=-benchmark-test,-no_oss,-no_gpu,-oss_serial
test:v1 --build_tag_filters=-benchmark-test,-no_oss,-no_gpu
test:v2 --test_tag_filters=-benchmark-test,-no_oss,-no_gpu,-oss_serial,-v1only
test:v2 --build_tag_filters=-benchmark-test,-no_oss,-no_gpu,-v1only
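The settings that ./configure recorded can be pulled out of that file directly. A minimal sketch, assuming the file sits in the repository root under the name shown above; show_configure_settings is a hypothetical helper name, not part of TensorFlow or Bazel:

```shell
# Hypothetical helper: print the "build" lines of a bazelrc-style file that
# select a --config or set an --action_env value.
# $1: path to the file (defaults to .tf_configure.bazelrc).
show_configure_settings() {
  grep -E '^build[[:space:]]+--(config=|action_env[[:space:]])' "${1:-.tf_configure.bazelrc}"
}
```

Called without arguments in the TensorFlow checkout, it should list lines such as build --config=cuda, which makes it easy to see what a later --define on the command line would be competing with.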
For example, here. It appears to contain the definitions behind the --config= options that I have been passing to the Bazel command in past TensorFlow versions.
- The --config options I always specify
--config=noaws \
--config=nohdfs \
--config=nonccl \
When the Bazel build command runs, I assume that
--config=noaws
is internally read as
--define=no_aws_support=true
If so, picking up definitions like this one, which appear in BUILD files all over the tree, and passing them to the Bazel build command should set exactly the options I want. In other words,
define_values = {"with_xla_support": "true"},
is mentally converted to
--define=with_xla_support=true
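Whether a given --config name really expands to a --define can be checked rather than assumed. A rough sketch, assuming .bazelrc declares configs on lines of the form build:<name> <flags>; expand_config is a hypothetical helper name:

```shell
# Hypothetical helper: print the flags that a named config expands to.
# $1: config name (e.g. noaws), $2: bazelrc file (defaults to .bazelrc).
expand_config() {
  # Strip the "build:<name> " prefix and print only the lines that had it.
  sed -n "s/^build:$1[[:space:]][[:space:]]*//p" "${2:-.bazelrc}"
}
```

For example, expand_config noaws prints whatever flags the checked-out .bazelrc assigns to that config, which confirms or refutes the guess about --define=no_aws_support=true. An empty result means the config is not declared in that file.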
sed -i '15a #include <assert.h>' tensorflow/tsl/framework/fixedpoint/MatMatProductAVX2.h
sudo bazel build \
--config=monolithic \
--config=noaws \
--config=nohdfs \
--config=nonccl \
--config=v2 \
--define=tflite_pip_with_flex=true \
--define=tflite_with_xnnpack=true \
--define=with_xla_support=false \
--ui_actions_shown=20 \
//tensorflow/tools/pip_package:build_pip_package
The following should presumably mean the same thing.
sudo bazel build \
--config=monolithic \
--define=no_aws_support=true \
--define=no_hdfs_support=true \
--define=no_nccl_support=true \
--config=v2 \
--define=tflite_pip_with_flex=true \
--define=tflite_with_xnnpack=true \
--define=with_xla_support=false \
--ui_actions_shown=20 \
//tensorflow/tools/pip_package:build_pip_package
Argh...
ERROR: /home/xxxx/work/tensorflow/tensorflow/BUILD:1591:19: Executing genrule //tensorflow:tf_python_api_gen_v2 failed: (Aborted): bash failed: error executing command /bin/bash -c ... (remaining 1 argument skipped)
2023-02-20 01:29:50.810341: F ./tensorflow/core/framework/variant_op_registry.h:114] Check failed: existing == nullptr (0x1e44d28 vs. nullptr)UnaryVariantDeviceCopy for direction: 1 and type_index: tensorflow::Tensor already registered
/bin/bash: line 1: 3477165 Aborted (core dumped) bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2 --root_init_template=tensorflow/api_template.__init__.py --apidir=bazel-out/k8-opt/bin/tensorflow_api/v2/ --apiname=tensorflow --apiversion=2 --compat_apiversion=1 --compat_apiversion=2 --
:
opt/bin/tensorflow/_api/v2/compat/v2/compat/v2/compat/__init__.py
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: /home/xxxx/work/tensorflow/tensorflow/lite/python/BUILD:69:10 Middleman _middlemen/tensorflow_Slite_Spython_Stflite_Uconvert-runfiles failed: (Aborted): bash failed: error executing command /bin/bash -c ... (remaining 1 argument skipped)
INFO: Elapsed time: 5337.954s, Critical Path: 760.32s
INFO: 36308 processes: 11892 internal, 24416 local.
FAILED: Build did NOT complete successfully
What on earth is this...
F ./tensorflow/core/framework/variant_op_registry.h:114]
Check failed: existing == nullptr (0x1e44d28 vs. nullptr)UnaryVariantDeviceCopy for direction:
1 and type_index: tensorflow::Tensor already registered
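The line that matters is buried in a long Bazel log, so it helps to save the output and filter it. A small sketch, assuming the build output was saved to a file (for instance with 2>&1 | tee build.log); fatal_lines is a hypothetical helper, and the patterns are simply the shapes seen in the logs on this page:

```shell
# Hypothetical helper: print the lines of a saved build log that look fatal:
#   - TensorFlow log lines with severity "F" (e.g. "F ./path/file.h:114]")
#   - failed CHECK messages
#   - protobuf ERROR/FATAL lines
# $1: path to the saved log file.
fatal_lines() {
  grep -E '(^|[[:space:]])F [^ ]+:[0-9]+\]|Check failed|CHECK failed|\[libprotobuf (ERROR|FATAL)' "$1"
}
```

Lines such as INFO: Elapsed time or FAILED: Build did NOT complete successfully are deliberately not matched, since they say nothing about the cause.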
Retry without the Flex option.
sudo bazel build \
--config=monolithic \
--config=noaws \
--config=nohdfs \
--config=nonccl \
--config=v2 \
--define=tflite_with_xnnpack=true \
--define=with_xla_support=false \
--ui_actions_shown=20 \
//tensorflow/tools/pip_package:build_pip_package
If this fails too, I will try dropping the monolithic build.
The next command to try: change the compiler options for GCC 9, with XLA enabled and Flex enabled.
sed -i '15a #include <assert.h>' tensorflow/tsl/framework/fixedpoint/MatMatProductAVX2.h
sed -i -e 's/c++17/c++14/g' .bazelrc
supports compute capabilities >= 3.5 [Default is: 3.5,7.0]: 6.1,7.5,8.6
sudo bazel build \
--config=monolithic \
--config=noaws \
--config=nohdfs \
--config=nonccl \
--config=v2 \
--define=tflite_pip_with_flex=true \
--define=tflite_with_xnnpack=true \
--ui_actions_shown=20 \
//tensorflow/tools/pip_package:build_pip_package
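This is what happened. It seems is_arithmetic_v was added in c++17 as a shorthand for is_arithmetic<T>::value (is_arithmetic itself still exists), so it is not available once the standard is lowered to c++14.
Also, the inline specifier on variables (inline variables) is a c++17 language feature, which is why it triggers warnings here.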
ERROR: /home/xxxx/work/tensorflow/tensorflow/tsl/platform/BUILD:906:11: Compiling tensorflow/tsl/platform/strcat.cc failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -MD -MF bazel-out/k8-opt/bin/tensorflow/tsl/platform/_objs/strcat/strcat.d ... (remaining 69 arguments skipped)
In file included from ./tensorflow/tsl/platform/types.h:22,
from ./tensorflow/tsl/platform/numbers.h:23,
from ./tensorflow/tsl/platform/strcat.h:26,
from tensorflow/tsl/platform/strcat.cc:16:
./tensorflow/tsl/platform/float8.h:169:53: error: ‘is_arithmetic_v’ is not a member of ‘std’; did you mean ‘is_arithmetic’?
169 | typename EnableIf = std::enable_if<std::is_arithmetic_v<T>>>
| ^~~~~~~~~~~~~~~
| is_arithmetic
./tensorflow/tsl/platform/float8.h:169:53: error: ‘is_arithmetic_v’ is not a member of ‘std’; did you mean ‘is_arithmetic’?
169 | typename EnableIf = std::enable_if<std::is_arithmetic_v<T>>>
| ^~~~~~~~~~~~~~~
| is_arithmetic
./tensorflow/tsl/platform/float8.h:169:69: error: template argument 1 is invalid
169 | typename EnableIf = std::enable_if<std::is_arithmetic_v<T>>>
| ^
./tensorflow/tsl/platform/float8.h:169:72: error: expected unqualified-id before ‘>’ token
169 | typename EnableIf = std::enable_if<std::is_arithmetic_v<T>>>
| ^
./tensorflow/tsl/platform/float8.h:184:53: error: ‘is_arithmetic_v’ is not a member of ‘std’; did you mean ‘is_arithmetic’?
184 | typename EnableIf = std::enable_if<std::is_arithmetic_v<T>>>
| ^~~~~~~~~~~~~~~
| is_arithmetic
./tensorflow/tsl/platform/float8.h:184:53: error: ‘is_arithmetic_v’ is not a member of ‘std’; did you mean ‘is_arithmetic’?
184 | typename EnableIf = std::enable_if<std::is_arithmetic_v<T>>>
| ^~~~~~~~~~~~~~~
| is_arithmetic
./tensorflow/tsl/platform/float8.h:184:69: error: template argument 1 is invalid
184 | typename EnableIf = std::enable_if<std::is_arithmetic_v<T>>>
| ^
./tensorflow/tsl/platform/float8.h:184:72: error: expected unqualified-id before ‘>’ token
184 | typename EnableIf = std::enable_if<std::is_arithmetic_v<T>>>
| ^
./tensorflow/tsl/platform/float8.h:219:53: error: ‘is_arithmetic_v’ is not a member of ‘std’; did you mean ‘is_arithmetic’?
219 | typename EnableIf = std::enable_if<std::is_arithmetic_v<T>>>
| ^~~~~~~~~~~~~~~
| is_arithmetic
./tensorflow/tsl/platform/float8.h:219:53: error: ‘is_arithmetic_v’ is not a member of ‘std’; did you mean ‘is_arithmetic’?
219 | typename EnableIf = std::enable_if<std::is_arithmetic_v<T>>>
| ^~~~~~~~~~~~~~~
| is_arithmetic
./tensorflow/tsl/platform/float8.h:219:69: error: template argument 1 is invalid
219 | typename EnableIf = std::enable_if<std::is_arithmetic_v<T>>>
| ^
./tensorflow/tsl/platform/float8.h:219:72: error: expected unqualified-id before ‘>’ token
219 | typename EnableIf = std::enable_if<std::is_arithmetic_v<T>>>
| ^
./tensorflow/tsl/platform/float8.h:234:53: error: ‘is_arithmetic_v’ is not a member of ‘std’; did you mean ‘is_arithmetic’?
234 | typename EnableIf = std::enable_if<std::is_arithmetic_v<T>>>
| ^~~~~~~~~~~~~~~
| is_arithmetic
./tensorflow/tsl/platform/float8.h:234:53: error: ‘is_arithmetic_v’ is not a member of ‘std’; did you mean ‘is_arithmetic’?
234 | typename EnableIf = std::enable_if<std::is_arithmetic_v<T>>>
| ^~~~~~~~~~~~~~~
| is_arithmetic
./tensorflow/tsl/platform/float8.h:234:69: error: template argument 1 is invalid
234 | typename EnableIf = std::enable_if<std::is_arithmetic_v<T>>>
| ^
./tensorflow/tsl/platform/float8.h:234:72: error: expected unqualified-id before ‘>’ token
234 | typename EnableIf = std::enable_if<std::is_arithmetic_v<T>>>
| ^
./tensorflow/tsl/platform/float8.h:258:10: warning: inline variables are only available with ‘-std=c++17’ or ‘-std=gnu++17’
258 | static inline constexpr const bool is_specialized = true;
| ^~~~~~
./tensorflow/tsl/platform/float8.h:259:10: warning: inline variables are only available with ‘-std=c++17’ or ‘-std=gnu++17’
259 | static inline constexpr const bool is_signed = true;
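Both the is_arithmetic_v errors and the inline-variable warnings indicate that float8.h was compiled in a pre-C++17 mode, which is consistent with the sed command that rewrote c++17 to c++14 in .bazelrc. Before starting an hours-long build, the language standard requested there can be checked. A small sketch, assuming the standard is set through -std= flags in .bazelrc; std_flags is a hypothetical helper name:

```shell
# Hypothetical helper: list every -std=c++NN / -std=gnu++NN flag in a
# bazelrc file, prefixed with its line number.
# $1: path to the file (defaults to .bazelrc).
std_flags() {
  grep -n -o -E -e '-std=(c|gnu)\+\+[0-9a-z]+' "${1:-.bazelrc}"
}
```

If the output shows c++14 after the sed edit, git checkout -- .bazelrc restores the file as committed in the repository.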
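Installing gcc-12 on Ubuntu 20.04 apparently requires building it from source yourself.
Building the full TensorFlow v2.12.0 package looks like it needs gcc-12 (c++17), and it seems gcc-12 cannot be installed on Ubuntu 20.04 without building it from source. Isn't that pretty much fatal? I don't want to go as far as building gcc from source just to get it onto Ubuntu 20.04. Is the message "move to 22.04 already"?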
I got a reply almost immediately: PPA. It also looks like c++17 became the default starting with gcc-11.
- Ubuntu 22.04 + gcc-12 + g++-12
$ sudo apt update && sudo apt upgrade -y && \
sudo apt install -y \
libhdf5-dev unzip pkg-config python3-pip \
cmake make python-is-python3 gcc-12 g++-12 && \
sudo pip3 install pip --upgrade && \
sudo pip3 install numpy==1.24.2 && \
sudo pip3 install keras_applications==1.0.8 --no-deps && \
sudo pip3 install keras_preprocessing==1.1.2 --no-deps && \
sudo pip3 install gdown h5py==3.6.0 && \
sudo pip3 install pybind11==2.9.2 && \
sudo pip3 install packaging && \
sudo pip3 install protobuf==3.20.3 && \
pip3 install -U --user six wheel mock
$ git clone -b r2.12 https://github.com/tensorflow/tensorflow.git && cd tensorflow
$ sed -i '15a #include <assert.h>' tensorflow/tsl/framework/fixedpoint/MatMatProductAVX2.h
$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12
$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 11
$ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-12 12
$ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-11 11
$ wget -O bazel https://github.com/bazelbuild/bazel/releases/download/5.3.0/bazel-5.3.0-linux-arm64 \
&& sudo chmod 777 bazel \
&& sudo cp bazel /usr/local/bin \
&& sudo bazel clean --expunge \
&& ./configure
$ sudo bazel build \
--config=monolithic \
--config=noaws \
--config=nohdfs \
--config=nonccl \
--config=v2 \
--define=tflite_pip_with_flex=true \
--define=tflite_with_xnnpack=true \
--copt="-Wno-stringop-overflow" \
--ui_actions_shown=64 \
//tensorflow/tools/pip_package:build_pip_package
ERROR: /home/ubuntu/tensorflow/tensorflow/BUILD:1591:19: Executing genrule //tensorflow:tf_python_api_gen_v2 failed: (Aborted): bash failed: error executing command /bin/bash -c ... (remaining 1 argument skipped)
2023-02-20 03:08:41.757015: F tensorflow/core/common_runtime/executor_factory.cc:44] Two executor factories are being registered under
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: /home/ubuntu/tensorflow/tensorflow/python/tools/BUILD:314:10 Middleman _middlemen/tensorflow_Spython_Stools_Ssaved_Umodel_Ucli-runfiles failed: (Aborted): bash failed: error executing command /bin/bash -c ... (remaining 1 argument skipped)
INFO: Elapsed time: 1135.285s, Critical Path: 390.04s
INFO: 14036 processes: 1609 internal, 12427 local.
FAILED: Build did NOT complete successfully
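A side note on the setup steps: the Bazel download hard-codes the linux-arm64 binary, which only runs on an ARM64 host. A sketch that picks the release asset from the machine type instead; bazel_asset is a hypothetical helper, and it assumes the bazel-<version>-linux-x86_64 / bazel-<version>-linux-arm64 asset naming used on the Bazel GitHub releases page:

```shell
# Hypothetical helper: print the Bazel release asset name for this machine.
# $1: Bazel version (e.g. 5.3.0)
# $2: machine type (defaults to the output of `uname -m`)
bazel_asset() {
  case "${2:-$(uname -m)}" in
    x86_64|amd64)  printf '%s\n' "bazel-$1-linux-x86_64" ;;
    aarch64|arm64) printf '%s\n' "bazel-$1-linux-arm64" ;;
    *) echo "unsupported machine type: ${2:-$(uname -m)}" >&2; return 1 ;;
  esac
}

# Example use (not executed here):
#   wget -O bazel "https://github.com/bazelbuild/bazel/releases/download/5.3.0/$(bazel_asset 5.3.0)"
```

This keeps the same command sequence usable on both the x86_64 and the ARM64 machines.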
$ sudo bazel build \
--config=monolithic \
--config=noaws \
--config=nohdfs \
--config=nonccl \
--config=v2 \
--define=tflite_with_xnnpack=true \
--copt="-Wno-stringop-overflow" \
--ui_actions_shown=64 \
//tensorflow/tools/pip_package:build_pip_package
ERROR: /home/ubuntu/tensorflow/tensorflow/BUILD:1591:19: Executing genrule //tensorflow:tf_python_api_gen_v2 failed: (Aborted): bash failed: error executing command /bin/bash -c ... (remaining 1 argument skipped)
2023-02-20 03:08:41.757015: F tensorflow/core/common_runtime/executor_factory.cc:44] Two executor factories are being registered under
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: /home/ubuntu/tensorflow/tensorflow/python/tools/BUILD:314:10 Middleman _middlemen/tensorflow_Spython_Stools_Ssaved_Umodel_Ucli-runfiles failed: (Aborted): bash failed: error executing command /bin/bash -c ... (remaining 1 argument skipped)
INFO: Elapsed time: 1135.285s, Critical Path: 390.04s
INFO: 14036 processes: 1609 internal, 12427 local.
FAILED: Build did NOT complete successfully
- Without --config=monolithic, without --define=tflite_pip_with_flex=true, gcc-12, g++-12
- Build succeeded
$ sudo bazel build \
--config=noaws \
--config=nohdfs \
--config=nonccl \
--config=v2 \
--define=tflite_with_xnnpack=true \
--copt="-Wno-stringop-overflow" \
--ui_actions_shown=64 \
//tensorflow/tools/pip_package:build_pip_package
- Without --config=monolithic, gcc-12, g++-12
$ sudo bazel build \
--config=noaws \
--config=nohdfs \
--config=nonccl \
--config=v2 \
--define=tflite_pip_with_flex=true \
--define=tflite_with_xnnpack=true \
--copt="-Wno-stringop-overflow" \
--ui_actions_shown=64 \
//tensorflow/tools/pip_package:build_pip_package
Build failed. This shows that the build errors out whenever --config=monolithic or --define=tflite_pip_with_flex=true is specified.
ERROR: /home/ubuntu/tensorflow/tensorflow/BUILD:1591:19: Executing genrule //tensorflow:tf_python_api_gen_v2 failed: (Aborted): bash failed: error executing command /bin/bash -c ... (remaining 1 argument skipped)
[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/descriptor_database.cc:642] File already exists in database: tensorflow/compiler/jit/xla_compilation_cache.proto
[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/descriptor.cc:1986] CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
terminate called after throwing an instance of 'google::protobuf::FatalException'
what(): CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: /home/ubuntu/tensorflow/tensorflow/python/tools/BUILD:82:10 Middleman _middlemen/tensorflow_Spython_Stools_Sfreeze_Ugraph-runfiles failed: (Aborted): bash failed: error executing command /bin/bash -c ... (remaining 1 argument skipped)
INFO: Elapsed time: 1223.799s, Critical Path: 386.39s
INFO: 15672 processes: 1784 internal, 13888 local.
FAILED: Build did NOT complete successfully
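The attempts in this section differ only in two toggles, monolithic and Flex, so the command can be assembled by a small helper instead of being retyped. A sketch only: tf_build_cmd is a hypothetical name, the flag set is copied from the commands on this page, and --copt and --ui_actions_shown are left out for brevity. It prints the command and does not run it.

```shell
# Hypothetical helper: print a bazel build command for one combination.
# $1: 1 to include --config=monolithic, anything else to leave it out
# $2: 1 to include --define=tflite_pip_with_flex=true, anything else to leave it out
tf_build_cmd() {
  cmd="bazel build"
  if [ "$1" = "1" ]; then cmd="$cmd --config=monolithic"; fi
  cmd="$cmd --config=noaws --config=nohdfs --config=nonccl --config=v2"
  if [ "$2" = "1" ]; then cmd="$cmd --define=tflite_pip_with_flex=true"; fi
  cmd="$cmd --define=tflite_with_xnnpack=true"
  printf '%s\n' "$cmd //tensorflow/tools/pip_package:build_pip_package"
}
```

tf_build_cmd 0 0 prints the combination that built successfully with gcc-12, and tf_build_cmd 0 1 prints the Flex-only variant that failed with the protobuf error shown above.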
- Without --config=monolithic, without --define=tflite_pip_with_flex=true, gcc-11, g++-11
- Build succeeded; this is the default setup on Ubuntu 22.04
sudo update-alternatives --config gcc
There are 2 choices for the alternative gcc (providing /usr/bin/gcc).
Selection Path Priority Status
------------------------------------------------------------
* 0 /usr/bin/gcc-12 12 auto mode
1 /usr/bin/gcc-11 11 manual mode
2 /usr/bin/gcc-12 12 manual mode
Press <enter> to keep the current choice[*], or type selection number: 1
sudo update-alternatives --config g++
There are 2 choices for the alternative g++ (providing /usr/bin/g++).
Selection Path Priority Status
------------------------------------------------------------
* 0 /usr/bin/g++-12 12 auto mode
1 /usr/bin/g++-11 11 manual mode
2 /usr/bin/g++-12 12 manual mode
Press <enter> to keep the current choice[*], or type selection number: 1
$ sudo bazel build \
--config=noaws \
--config=nohdfs \
--config=nonccl \
--config=v2 \
--define=tflite_with_xnnpack=true \
--ui_actions_shown=64 \
//tensorflow/tools/pip_package:build_pip_package
- Without --config=monolithic, without --define=tflite_pip_with_flex=true, gcc-9, g++-9
- Build succeeded; this is the default setup on Ubuntu 20.04
$ sudo bazel build \
--config=noaws \
--config=nohdfs \
--config=nonccl \
--config=v2 \
--define=tflite_with_xnnpack=true \
--ui_actions_shown=64 \
//tensorflow/tools/pip_package:build_pip_package
- Without --config=monolithic, without --define=tflite_pip_with_flex=true, gcc-8, g++-8
- Build succeeded; this is the default setup on Debian 11 (Bullseye)
$ sudo bazel build \
--config=noaws \
--config=nohdfs \
--config=nonccl \
--config=v2 \
--define=tflite_with_xnnpack=true \
--ui_actions_shown=64 \
//tensorflow/tools/pip_package:build_pip_package
- Without --config=monolithic, without --define=tflite_pip_with_flex=true, gcc-12, g++-12
- Build succeeded; this is the default setup on Debian 12 (Bookworm)
$ sudo bazel build \
--config=noaws \
--config=nohdfs \
--config=nonccl \
--config=v2 \
--define=tflite_with_xnnpack=true \
--copt="-Wno-stringop-overflow" \
--ui_actions_shown=64 \
//tensorflow/tools/pip_package:build_pip_package
- TensorFlow v1.13.0-rc0
- Failed
sudo bazel build \
--config=monolithic \
--config=noaws \
--config=nohdfs \
--config=nonccl \
--config=v2 \
--define=tflite_pip_with_flex=true \
--define=tflite_with_xnnpack=true \
--copt="-Wno-stringop-overflow" \
--ui_actions_shown=20 \
//tensorflow/tools/pip_package:build_pip_package
- Failed
sudo bazel build \
--config=monolithic \
--config=noaws \
--config=nohdfs \
--config=nonccl \
--config=v2 \
--define=tflite_with_xnnpack=true \
--copt="-Wno-stringop-overflow" \
--ui_actions_shown=20 \
//tensorflow/tools/pip_package:build_pip_package
- Failed
sudo bazel build \
--config=opt \
--config=noaws \
--config=nohdfs \
--config=nonccl \
--config=v2 \
--define=tflite_pip_with_flex=true \
--define=tflite_with_xnnpack=true \
--copt="-Wno-stringop-overflow" \
--ui_actions_shown=20 \
//tensorflow/tools/pip_package:build_pip_package
sudo bazel build \
--config=noaws \
--config=nohdfs \
--config=nonccl \
--config=v2 \
--define=tflite_pip_with_flex=true \
--define=tflite_with_xnnpack=true \
--copt="-Wno-stringop-overflow" \
--ui_actions_shown=20 \
//tensorflow/tools/pip_package:build_pip_package