Trying Out Databricks Serverless GPU ~ MegatronLM Edition ~
At Data + AI Summit 2025, a feature called "Serverless GPU" was announced. As of June 18, 2025, it is available in beta. It is slated to eventually extend to multi-GPU H100 configurations, but for now a single A10 can be used on demand.
HuggingFace Transformers, LLM-Foundry, and similar frameworks should run without any trouble, so this time I check whether MegatronLM runs.
① First, attach a GPU (A10) to the notebook
Click "Environment" in the menu on the right side of the notebook; an Accelerator can now be selected there, so choose "A10" and press the "Apply" button at the bottom right of the screen. A hot-standby A10 is attached immediately. Not having to explicitly launch a GPU compute cluster, as we used to, makes this much more convenient.
② Check the GPU
Just to be sure, let's look at the GPU information with the following command.
!nvidia-smi
Wed Jun 18 05:59:07 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03 Driver Version: 550.144.03 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 27C P8 25W / 300W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
We can confirm that a single A10 is up and running.
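As an extra sanity check, you can also confirm that PyTorch sees the GPU. This is just a minimal sketch and assumes PyTorch is preinstalled in the notebook's default Python environment.
%sh
# Minimal sanity check (assumes PyTorch is already available on this runtime).
python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"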
③ Install the libraries
Now let's install the libraries, but one thing to be careful about is the Python environment. Several Python environments exist underneath the notebook, and since we will launch MegatronLM with the torchrun command, we have to install into the environment that actually contains the torchrun binary. The path to that environment is preset in the environment variable DATABRICKS_ROOT_VIRTUALENV_ENV, so we use it as-is (a quick way to check this is shown below).
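The following sketch simply prints the preset path and confirms that the torchrun binary lives inside it; nothing here is Megatron-specific.
%sh
# Show the preset virtualenv path and confirm the torchrun binary is inside it.
echo "DATABRICKS_ROOT_VIRTUALENV_ENV = ${DATABRICKS_ROOT_VIRTUALENV_ENV}"
ls -l ${DATABRICKS_ROOT_VIRTUALENV_ENV}/bin/torchrun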
Install from PyPI
First, install the following libraries from PyPI.
%sh
${DATABRICKS_ROOT_VIRTUALENV_ENV}/bin/python3 -m pip install nltk tiktoken zarr einops pybind11
Install APEX
Next we install APEX, which essentially has to be built from source.
%sh
git clone https://github.com/NVIDIA/apex ${APEX_PATH}
cd ${APEX_PATH}
git checkout ${APEX_COMMIT_ID}
# Build APEX. This takes about 20 minutes.
# Generate a wheel and save it to a temporary folder (dist/).
pip wheel -v --no-deps --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" -w ${APEX_PATH}/dist .
# Save the wheel file to a convenient folder for future reuse.
mkdir -p ${WHEEL_PATH}
cp ${APEX_PATH}/dist/apex-*.whl ${WHEEL_PATH}
# pip install the wheel we just built.
${DATABRICKS_ROOT_VIRTUALENV_ENV}/bin/python3 -m pip install ${WHEEL_PATH}/apex-0.1-cp312-cp312-linux_x86_64.whl
Let's confirm that APEX was installed correctly.
%sh
${DATABRICKS_ROOT_VIRTUALENV_ENV}/bin/python3 -c "from apex import amp;from apex.parallel import DistributedDataParallel"
④ Clone MegatronLM and build the helper module
Clone the MegatronLM repository and build the helper module. Normally this build runs automatically at the very beginning of the training job, but the build relies on a command called python3-config, which is not included in the runtime and is awkward to install (it would require apt install), so we run the g++ build explicitly as shown below.
%sh
git clone https://github.com/NVIDIA/Megatron-LM /tmp/Megatron-LM
cd /tmp/Megatron-LM
git checkout d905ac7f1b45bde189e8df51d1ecd1a726984d32
# "VIRTUAL_ENV"という環境変数はプリセットされています。
cd /tmp/Megatron-LM/megatron/core/datasets
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/usr/include/python3.12 -I${VIRTUAL_ENV}/lib/python3.12/site-packages/pybind11/include helpers.cpp -o helpers_cpp.cpython-312-x86_64-linux-gnu.so
Either way, as long as helpers_cpp.cpython-312-x86_64-linux-gnu.so is produced, we're good.
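If you want to double-check, the following minimal sketch verifies that the shared object exists and can actually be imported; the cd is only there so that the module directory is on the import path.
%sh
# Quick sanity check: the exact .so name depends on the Python version (3.12 here).
ls -l /tmp/Megatron-LM/megatron/core/datasets/helpers_cpp*.so
cd /tmp/Megatron-LM/megatron/core/datasets && ${DATABRICKS_ROOT_VIRTUALENV_ENV}/bin/python3 -c "import helpers_cpp; print('helpers_cpp OK')"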
⑤ Training configuration and execution
Now, at last, it's time to train.
Since we only have a single A10, the model is kept quite small, and features that only high-end GPUs support, such as FP8, are left disabled. The goal here is simply to get it running.
Note that the dataset is assumed to already be in a Unity Catalog Volume (a rough sketch of one way to prepare such a dataset is shown below for reference).
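The .bin/.idx pair that Megatron-LM reads (my-gpt2_text_document here) can be produced in advance with Megatron-LM's tools/preprocess_data.py. The following is only an illustrative sketch; it assumes a hypothetical corpus.jsonl with one {"text": "..."} record per line sitting in the same Volume, and writes the output next to it. The _text_document suffix in DATA_PATH comes from the default "text" JSON key, so an --output-prefix of my-gpt2 matches the path used in the training command.
%sh
# Illustrative only: one way to create the .bin/.idx dataset used below.
cd /tmp/Megatron-LM
${DATABRICKS_ROOT_VIRTUALENV_ENV}/bin/python3 tools/preprocess_data.py \
  --input /Volumes/YOUR_CATALOG/YOUR_SCHEMA/YOUR_VOLUME/gpt2/corpus.jsonl \
  --output-prefix /Volumes/YOUR_CATALOG/YOUR_SCHEMA/YOUR_VOLUME/gpt2/my-gpt2 \
  --vocab-file /Volumes/YOUR_CATALOG/YOUR_SCHEMA/YOUR_VOLUME/gpt2/gpt2-vocab.json \
  --merge-file /Volumes/YOUR_CATALOG/YOUR_SCHEMA/YOUR_VOLUME/gpt2/gpt2-merges.txt \
  --tokenizer-type GPT2BPETokenizer \
  --append-eod \
  --workers 4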
%sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=1
VOCAB_FILE=/Volumes/YOUR_CATALOG/YOUR_SCHEMA/YOUR_VOLUME/gpt2/gpt2-vocab.json
MERGE_FILE=/Volumes/YOUR_CATALOG/YOUR_SCHEMA/YOUR_VOLUME/gpt2/gpt2-merges.txt
DATA_PATH=/Volumes/YOUR_CATALOG/YOUR_SCHEMA/YOUR_VOLUME/gpt2/my-gpt2_text_document
DISTRIBUTED_ARGS=(
--nproc_per_node $GPUS_PER_NODE
--nnodes 1
--node_rank 0
--master_addr $HOST_IP # This environment variable is preset by the runtime
--master_port 6001
)
GPT_MODEL_ARGS=(
--num-layers 12
--hidden-size 512
--num-attention-heads 8
--seq-length 1024
--max-position-embeddings 1024
)
TRAINING_ARGS=(
--micro-batch-size 1
--global-batch-size 16
--train-iters 200
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.95
--init-method-std 0.006
--clip-grad 1.0
--fp16
--lr 6.0e-5
--lr-decay-style cosine
--min-lr 6.0e-6
--lr-warmup-fraction .001
--lr-decay-iters 430000
--transformer-impl local
--no-persist-layer-norm
--no-gradient-accumulation-fusion
--no-masked-softmax-fusion
)
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 1
--pipeline-model-parallel-size 1
)
DATA_ARGS=(
--data-path $DATA_PATH
--vocab-file $VOCAB_FILE
--merge-file $MERGE_FILE
--split 949,50,1
)
EVAL_AND_LOGGING_ARGS=(
--eval-interval 100
--eval-iters 10
)
cd ${MEGATRON_PATH}
torchrun ${DISTRIBUTED_ARGS[@]} pretrain_gpt.py \
${GPT_MODEL_ARGS[@]} \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${DATA_ARGS[@]} \
${EVAL_AND_LOGGING_ARGS[@]} \
--distributed-backend nccl
It launched successfully. It's a bit long, but I'll paste the execution log below.
using world size: 1, data-parallel size: 1, context-parallel size: 1, hierarchical context-parallel sizes: None, tensor-model-parallel size: 1, encoder-tensor-model-parallel size: 0, pipeline-model-parallel size: 1, encoder-pipeline-model-parallel size: 0
Number of virtual stages per pipeline stage: None
WARNING: Setting args.check_for_nan_in_loss_and_grad to False since dynamic loss scaling is being used
using torch.float16 for parameters ...
------------------------ arguments ------------------------
account_for_embedding_in_pipeline_split ......... False
account_for_loss_in_pipeline_split .............. False
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.95
adam_eps ........................................ 1e-08
add_bias_linear ................................. True
add_position_embedding .......................... True
add_qkv_bias .................................... True
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
align_grad_reduce ............................... True
align_param_gather .............................. False
app_tag_run_name ................................ None
app_tag_run_version ............................. 0.0.0
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... False
apply_residual_connection_post_layernorm ........ False
apply_rope_fusion ............................... False
async_save ...................................... None
async_tensor_model_parallel_allreduce ........... True
attention_backend ............................... AttnBackend.auto
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
auto_detect_ckpt_format ......................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ False
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
bias_swiglu_fusion .............................. True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
calc_ft_timeouts ................................ False
calculate_per_token_loss ........................ False
check_for_large_grads ........................... False
check_for_nan_in_loss_and_grad .................. False
check_for_spiky_loss ............................ False
check_weight_hash_across_dp_replicas_interval ... None
ckpt_assume_constant_structure .................. False
ckpt_convert_format ............................. None
ckpt_convert_save ............................... None
ckpt_convert_update_legacy_dist_opt_format ...... False
ckpt_format ..................................... torch_dist
ckpt_fully_parallel_load ........................ False
ckpt_fully_parallel_save ........................ True
ckpt_fully_parallel_save_deprecated ............. False
ckpt_step ....................................... None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
clone_scatter_output_in_embedding ............... True
config_logger_dir ...............................
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
context_parallel_size ........................... 1
cp_comm_type .................................... ['p2p']
create_attention_mask_in_dataloader ............. True
cross_entropy_fusion_impl ....................... native
cross_entropy_loss_fusion ....................... False
cuda_graph_scope ................................ full
cuda_graph_warmup_steps ......................... 3
data_args_path .................................. None
data_cache_path ................................. None
data_parallel_random_init ....................... False
data_parallel_sharding_strategy ................. no_shard
data_parallel_size .............................. 1
data_path ....................................... ['/Volumes/hiroshi/air/dataset/gpt2/my-gpt2_text_document']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
ddp_average_in_collective ....................... False
ddp_bucket_size ................................. None
ddp_num_buckets ................................. None
ddp_pad_buckets_for_high_nccl_busbw ............. False
decoder_first_pipeline_num_layers ............... None
decoder_last_pipeline_num_layers ................ None
decoder_num_layers .............................. None
decoder_seq_length .............................. None
decoupled_lr .................................... None
decoupled_min_lr ................................ None
decrease_batch_size_if_needed ................... False
defer_embedding_wgrad_compute ................... False
deprecated_use_mcore_models ..................... False
deterministic_mode .............................. False
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
disable_bf16_reduced_precision_matmul ........... False
disable_mamba_mem_eff_path ...................... False
disable_straggler_on_startup .................... False
dist_ckpt_format_deprecated ..................... None
dist_ckpt_strictness ............................ assume_ok_unexpected
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout_minutes ..................... 10
embedding_path .................................. None
empty_unused_memory_level ....................... 0
enable_cuda_graph ............................... False
enable_ft_package ............................... False
enable_gloo_process_groups ...................... True
enable_msc ...................................... True
enable_one_logger ............................... True
encoder_num_layers .............................. 12
encoder_pipeline_model_parallel_size ............ 0
encoder_seq_length .............................. 1024
encoder_tensor_model_parallel_size .............. 0
end_weight_decay ................................ 0.1
eod_mask_loss ................................... False
error_injection_rate ............................ 0
error_injection_type ............................ transient_error
eval_interval ................................... 100
eval_iters ...................................... 10
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
exp_avg_dtype ................................... torch.float32
exp_avg_sq_dtype ................................ torch.float32
expert_model_parallel_size ...................... 1
expert_tensor_parallel_size ..................... 1
external_cuda_graph ............................. False
ffn_hidden_size ................................. 2048
finetune ........................................ False
first_last_layers_bf16 .......................... False
flash_decode .................................... False
fp16 ............................................ True
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8 ............................................. None
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_param_gather ................................ False
fp8_recipe ...................................... delayed
fp8_wgrad ....................................... True
global_batch_size ............................... 16
grad_reduce_in_bf16 ............................. False
gradient_accumulation_fusion .................... False
gradient_reduce_div_fusion ...................... True
group_query_attention ........................... False
head_lr_mult .................................... 1.0
heterogeneous_layers_config_encoded_json ........ None
heterogeneous_layers_config_path ................ None
hidden_dropout .................................. 0.1
hidden_size ..................................... 512
hierarchical_context_parallel_sizes ............. None
hybrid_attention_ratio .......................... 0.0
hybrid_mlp_ratio ................................ 0.0
hybrid_override_pattern ......................... None
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... -1
inference_dynamic_batching ...................... False
inference_dynamic_batching_buffer_guaranteed_fraction 0.2
inference_dynamic_batching_buffer_overflow_factor None
inference_dynamic_batching_buffer_size_gb ....... 40.0
inference_dynamic_batching_chunk_size ........... 256
inference_dynamic_batching_max_requests_override None
inference_dynamic_batching_max_tokens_override .. None
inference_max_batch_size ........................ 8
inference_max_seq_length ........................ 2560
inference_rng_tracker ........................... False
init_method_std ................................. 0.006
init_method_xavier_uniform ...................... False
init_model_with_meta_device ..................... False
initial_loss_scale .............................. 4294967296
is_hybrid_model ................................. False
iter_per_epoch .................................. 1250
iterations_to_skip .............................. []
keep_fp8_transpose_cache_when_using_custom_fsdp . False
kv_channels ..................................... 64
kv_lora_rank .................................... 32
lazy_mpu_init ................................... None
load ............................................ None
local_rank ...................................... 0
log_interval .................................... 100
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_progress .................................... False
log_straggler ................................... False
log_throughput .................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
logging_level ................................... None
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 6e-05
lr_decay_iters .................................. 430000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. 0.001
lr_warmup_init .................................. 0.0
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
lr_wsd_decay_iters .............................. None
lr_wsd_decay_samples ............................ None
lr_wsd_decay_style .............................. exponential
main_grads_dtype ................................ torch.float32
main_params_dtype ............................... torch.float32
make_vocab_size_divisible_by .................... 128
mamba_head_dim .................................. 64
mamba_num_groups ................................ 8
mamba_num_heads ................................. None
mamba_state_dim ................................. 128
manual_gc ....................................... False
manual_gc_eval .................................. True
manual_gc_interval .............................. 0
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... False
max_position_embeddings ......................... 1024
max_tokens_to_oom ............................... 12000
memory_snapshot_path ............................ snapshot.pickle
merge_file ...................................... /Volumes/hiroshi/air/dataset/gpt2/gpt2-merges.txt
micro_batch_size ................................ 1
microbatch_group_size_per_vp_stage .............. None
mid_level_dataset_surplus ....................... 0.005
min_loss_scale .................................. 1.0
min_lr .......................................... 6e-06
mlp_chunks_for_prefill .......................... 1
mmap_bin_files .................................. True
mock_data ....................................... False
moe_aux_loss_coeff .............................. 0.0
moe_enable_deepep ............................... False
moe_expert_capacity_factor ...................... None
moe_extended_tp ................................. False
moe_ffn_hidden_size ............................. None
moe_grouped_gemm ................................ False
moe_input_jitter_eps ............................ None
moe_layer_freq .................................. 1
moe_layer_recompute ............................. False
moe_pad_expert_input_to_capacity ................ False
moe_per_layer_logging ........................... False
moe_permute_fusion .............................. False
moe_router_bias_update_rate ..................... 0.001
moe_router_dtype ................................ None
moe_router_enable_expert_bias ................... False
moe_router_group_topk ........................... None
moe_router_load_balancing_type .................. aux_loss
moe_router_num_groups ........................... None
moe_router_pre_softmax .......................... False
moe_router_score_function ....................... softmax
moe_router_topk ................................. 2
moe_router_topk_scaling_factor .................. None
moe_shared_expert_intermediate_size ............. None
moe_shared_expert_overlap ....................... False
moe_token_dispatcher_type ....................... allgather
moe_token_drop_policy ........................... probs
moe_use_legacy_grouped_gemm ..................... False
moe_use_upcycling ............................... False
moe_z_loss_coeff ................................ None
mrope_section ................................... None
mscale .......................................... 1.0
mscale_all_dim .................................. 1.0
mtp_loss_scaling_factor ......................... 0.1
mtp_num_layers .................................. None
multi_latent_attention .......................... False
nccl_communicator_config_path ................... None
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... True
no_save_optim ................................... None
no_save_rng ..................................... None
non_persistent_ckpt_type ........................ None
non_persistent_global_ckpt_dir .................. None
non_persistent_local_ckpt_algo .................. fully_parallel
non_persistent_local_ckpt_dir ................... None
non_persistent_save_interval .................... None
norm_epsilon .................................... 1e-05
normalization ................................... LayerNorm
num_attention_heads ............................. 8
num_channels .................................... 3
num_classes ..................................... 1000
num_dataset_builder_threads ..................... 1
num_distributed_optimizer_instances ............. 1
num_experts ..................................... None
num_layers ...................................... 12
num_layers_at_end_in_bf16 ....................... 1
num_layers_at_start_in_bf16 ..................... 1
num_layers_per_virtual_pipeline_stage ........... None
num_query_groups ................................ 1
num_virtual_stages_per_pipeline_rank ............ None
num_workers ..................................... 2
object_storage_cache_path ....................... None
one_logger_async ................................ False
one_logger_project .............................. megatron-lm
one_logger_run_name ............................. None
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
optimizer_cpu_offload ........................... False
optimizer_offload_fraction ...................... 1.0
output_bert_embeddings .......................... False
overlap_cpu_optimizer_d2h_h2d ................... False
overlap_grad_reduce ............................. False
overlap_p2p_comm ................................ False
overlap_p2p_comm_warmup_flush ................... False
overlap_param_gather ............................ False
overlap_param_gather_with_optimizer_step ........ False
override_opt_param_scheduler .................... False
params_dtype .................................... torch.float16
patch_dim ....................................... 16
per_split_data_args_path ........................ None
perform_initialization .......................... True
pin_cpu_grads ................................... True
pin_cpu_params .................................. True
pipeline_model_parallel_comm_backend ............ None
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... learned_absolute
pretrained_checkpoint ........................... None
profile ......................................... False
profile_ranks ................................... [0]
profile_step_end ................................ 12
profile_step_start .............................. 10
q_lora_rank ..................................... None
qk_head_dim ..................................... 128
qk_layernorm .................................... False
qk_pos_emb_head_dim ............................. 64
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_modules ............................... None
recompute_num_layers ............................ None
record_memory_history ........................... False
relative_attention_max_distance ................. 128
relative_attention_num_buckets .................. 32
replication ..................................... False
replication_factor .............................. 2
replication_jump ................................ None
rerun_mode ...................................... disabled
reset_attention_mask ............................ False
reset_position_ids .............................. False
result_rejected_tracker_filename ................ None
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_attention_gate ............................ 1
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_project_dir ............................... None
retro_verify_neighbor_count ..................... True
rope_scaling_factor ............................. 8.0
rotary_base ..................................... 10000
rotary_interleaved .............................. False
rotary_percent .................................. 1.0
rotary_scaling_factor ........................... 1.0
rotary_seq_len_interpolation_factor ............. None
run_workload_inspector_server ................... False
sample_rate ..................................... 1.0
save ............................................ None
save_interval ................................... None
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 1024
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
skip_train ...................................... False
skipped_train_samples ........................... 0
spec ............................................ None
split ........................................... 949,50,1
squared_relu .................................... False
start_weight_decay .............................. 0.1
straggler_ctrlr_port ............................ 65535
straggler_minmax_count .......................... 1
suggested_communication_unit_size ............... None
swiglu .......................................... False
swin_backbone_type .............................. tiny
te_rng_tracker .................................. False
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_mode ....................................... False
tiktoken_num_special_tokens ..................... 1000
tiktoken_pattern ................................ None
tiktoken_special_tokens ......................... None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_model ................................. None
tokenizer_type .................................. GPT2BPETokenizer
tp_comm_bootstrap_backend ....................... nccl
tp_comm_bulk_dgrad .............................. True
tp_comm_bulk_wgrad .............................. True
tp_comm_overlap ................................. False
tp_comm_overlap_ag .............................. True
tp_comm_overlap_cfg ............................. None
tp_comm_overlap_rs .............................. True
tp_comm_overlap_rs_dgrad ........................ False
tp_comm_split_ag ................................ True
tp_comm_split_rs ................................ True
train_data_path ................................. None
train_iters ..................................... 200
train_samples ................................... None
train_sync_interval ............................. None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 1
untie_embeddings_and_output_weights ............. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_cpu_initialization .......................... None
use_custom_fsdp ................................. False
use_dist_ckpt ................................... True
use_dist_ckpt_deprecated ........................ False
use_distributed_optimizer ....................... False
use_flash_attn .................................. False
use_legacy_models ............................... False
use_mp_args_from_checkpoint_args ................ False
use_one_sent_docs ............................... False
use_persistent_ckpt_worker ...................... False
use_precision_aware_optimizer ................... False
use_pytorch_profiler ............................ False
use_ring_exchange_p2p ........................... False
use_rope_scaling ................................ False
use_rotary_position_embeddings .................. False
use_tokenizer_model_from_checkpoint_args ........ True
use_torch_fsdp2 ................................. False
use_torch_optimizer_for_cpu_offload ............. False
use_tp_pp_dp_mapping ............................ False
v_head_dim ...................................... 128
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... /Volumes/hiroshi/air/dataset/gpt2/gpt2-vocab.json
vocab_size ...................................... None
wandb_exp_name ..................................
wandb_project ...................................
wandb_save_dir ..................................
weight_decay .................................... 0.1
weight_decay_incr_style ......................... constant
wgrad_deferral_limit ............................ 0
world_size ...................................... 1
yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
INFO:megatron.core.num_microbatches_calculator:setting number of microbatches to constant 16
> building GPT2BPETokenizer tokenizer ...
> padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
WARNING:megatron.core.rerun_state_machine:RerunStateMachine initialized in mode RerunMode.DISABLED
> initializing torch distributed ...
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory '/tmp/hiroshi.ouchiyama@databricks.com/20250618045849/Megatron-LM/megatron/core/datasets'
make: python3-config: No such file or directory
make: python3-config: No such file or directory
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/usr/include/python3.12 -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-9dc2b951-5bfc-4f90-87ba-6ff17400a912/lib/python3.12/site-packages/pybind11/include helpers.cpp -o helpers_cpp
In file included from helpers.cpp:12:
In function ‘pybind11::ssize_t pybind11::detail::byte_offset_unsafe(const Strides&, pybind11::ssize_t, Ix ...) [with long int Dim = 0; Strides = std::array<long int, 1>; Ix = {}]’,
inlined from ‘const T& pybind11::detail::unchecked_reference<T, Dims>::operator()(Ix ...) const [with Ix = {long int}; T = long int; long int Dims = 1]’ at /local_disk0/.ephemeral_nfs/envs/pythonEnv-9dc2b951-5bfc-4f90-87ba-6ff17400a912/lib/python3.12/site-packages/pybind11/include/pybind11/numpy.h:545:65,
inlined from ‘const T& pybind11::detail::unchecked_reference<T, Dims>::operator[](pybind11::ssize_t) const [with long int D = 1; <template-parameter-2-2> = void; T = long int; long int Dims = 1]’ at /local_disk0/.ephemeral_nfs/envs/pythonEnv-9dc2b951-5bfc-4f90-87ba-6ff17400a912/lib/python3.12/site-packages/pybind11/include/pybind11/numpy.h:553:26,
inlined from ‘void build_exhaustive_blending_indices(pybind11::array_t<short int>&, pybind11::array_t<long int>&, const pybind11::array_t<long int>&, int32_t)’ at helpers.cpp:67:31:
/local_disk0/.ephemeral_nfs/envs/pythonEnv-9dc2b951-5bfc-4f90-87ba-6ff17400a912/lib/python3.12/site-packages/pybind11/include/pybind11/numpy.h:494:14: warning: ‘error_argmax’ may be used uninitialized [-Wmaybe-uninitialized]
494 | return i * strides[Dim] + byte_offset_unsafe<Dim + 1>(strides, index...);
| ~~^~~~~~~~~~
helpers.cpp: In function ‘void build_exhaustive_blending_indices(pybind11::array_t<short int>&, pybind11::array_t<long int>&, const pybind11::array_t<long int>&, int32_t)’:
helpers.cpp:49:13: note: ‘error_argmax’ was declared here
49 | int64_t error_argmax;
| ^~~~~~~~~~~~
make: Leaving directory '/tmp/hiroshi.ouchiyama@databricks.com/20250618045849/Megatron-LM/megatron/core/datasets'
>>> done with dataset index builder. Compilation time: 6.658 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
> compiling and loading fused kernels ...
[rank0]:[W618 04:59:39.002912817 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
>>> done with compiling and loading fused kernels. Compilation time: 1.331 seconds
time to initialize megatron (seconds): 13.500
[after megatron is initialized] datetime: 2025-06-18 04:59:45
building GPT model ...
/tmp/hiroshi.ouchiyama@databricks.com/20250618045849/Megatron-LM/megatron/core/models/gpt/gpt_layer_specs.py:205: UserWarning: The fp8 argument in "get_gpt_layer_local_spec" has been deprecated and will be removed soon. Please update your code accordingly.
warnings.warn(
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 64109568
INFO:megatron.core.distributed.distributed_data_parallel:Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=False, overlap_grad_reduce=False, overlap_param_gather=False, align_param_gather=False, use_distributed_optimizer=False, num_distributed_optimizer_instances=1, check_for_nan_in_grad=False, check_for_large_grads=False, bucket_size=None, pad_buckets_for_high_nccl_busbw=False, average_in_collective=False, fp8_param_gather=False, use_custom_fsdp=False, data_parallel_sharding_strategy='no_shard', gradient_reduce_div_fusion=True, suggested_communication_unit_size=None, preserve_fp32_weights=True, keep_fp8_transpose_cache_when_using_custom_fsdp=False)
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
Params for bucket 1 (64109568 elements, 64109568 padded size):
module.decoder.layers.4.mlp.linear_fc1.weight
module.decoder.layers.3.pre_mlp_layernorm.bias
module.decoder.layers.10.input_layernorm.bias
module.decoder.layers.7.mlp.linear_fc1.bias
module.decoder.layers.5.self_attention.linear_proj.bias
module.decoder.layers.3.input_layernorm.weight
module.decoder.layers.1.mlp.linear_fc2.weight
module.embedding.word_embeddings.weight
module.decoder.layers.11.pre_mlp_layernorm.weight
module.decoder.layers.8.self_attention.linear_qkv.weight
module.decoder.layers.6.mlp.linear_fc1.weight
module.decoder.layers.5.pre_mlp_layernorm.bias
module.decoder.layers.3.mlp.linear_fc2.bias
module.decoder.layers.8.mlp.linear_fc1.bias
module.decoder.layers.7.pre_mlp_layernorm.weight
module.decoder.layers.7.self_attention.linear_proj.bias
module.decoder.layers.5.input_layernorm.weight
module.decoder.layers.2.mlp.linear_fc2.weight
module.decoder.layers.0.mlp.linear_fc2.weight
module.decoder.layers.10.self_attention.linear_qkv.weight
module.decoder.layers.5.mlp.linear_fc2.bias
module.decoder.layers.3.self_attention.linear_proj.weight
module.decoder.layers.0.self_attention.linear_qkv.bias
module.decoder.layers.10.mlp.linear_fc1.bias
module.decoder.layers.9.pre_mlp_layernorm.weight
module.decoder.layers.7.input_layernorm.weight
module.decoder.layers.4.mlp.linear_fc2.weight
module.decoder.layers.2.input_layernorm.weight
module.decoder.layers.11.self_attention.linear_proj.bias
module.decoder.layers.9.self_attention.linear_qkv.bias
module.decoder.layers.5.self_attention.linear_proj.weight
module.decoder.layers.3.input_layernorm.bias
module.decoder.layers.1.mlp.linear_fc1.weight
module.decoder.layers.1.self_attention.linear_qkv.weight
module.decoder.layers.0.input_layernorm.weight
module.decoder.layers.11.pre_mlp_layernorm.bias
module.decoder.layers.6.mlp.linear_fc2.weight
module.decoder.layers.0.mlp.linear_fc2.bias
module.decoder.layers.11.self_attention.linear_qkv.bias
module.decoder.layers.8.mlp.linear_fc1.weight
module.decoder.layers.7.pre_mlp_layernorm.bias
module.decoder.layers.7.self_attention.linear_proj.weight
module.decoder.layers.5.input_layernorm.bias
module.decoder.layers.11.mlp.linear_fc2.bias
module.decoder.layers.9.self_attention.linear_proj.bias
module.decoder.layers.3.self_attention.linear_qkv.weight
module.decoder.layers.0.mlp.linear_fc1.weight
module.decoder.final_layernorm.weight
module.decoder.layers.10.mlp.linear_fc1.weight
module.decoder.layers.9.pre_mlp_layernorm.bias
module.decoder.layers.7.input_layernorm.bias
module.decoder.layers.3.mlp.linear_fc1.bias
module.decoder.layers.2.pre_mlp_layernorm.weight
module.decoder.layers.0.pre_mlp_layernorm.weight
module.decoder.layers.11.self_attention.linear_proj.weight
module.decoder.layers.9.input_layernorm.weight
module.decoder.layers.5.self_attention.linear_qkv.weight
module.decoder.layers.9.mlp.linear_fc2.bias
module.decoder.layers.5.mlp.linear_fc1.bias
module.decoder.layers.4.pre_mlp_layernorm.weight
module.decoder.layers.11.input_layernorm.bias
module.decoder.layers.8.mlp.linear_fc2.weight
module.decoder.layers.7.self_attention.linear_qkv.weight
module.decoder.layers.4.self_attention.linear_qkv.bias
module.decoder.layers.0.self_attention.linear_proj.bias
module.decoder.layers.0.input_layernorm.bias
module.decoder.layers.9.self_attention.linear_proj.weight
module.decoder.layers.6.pre_mlp_layernorm.weight
module.embedding.position_embeddings.weight
module.decoder.layers.10.mlp.linear_fc2.weight
module.decoder.layers.6.self_attention.linear_qkv.bias
module.decoder.layers.3.mlp.linear_fc1.weight
module.decoder.layers.2.pre_mlp_layernorm.bias
module.decoder.layers.1.mlp.linear_fc2.bias
module.decoder.layers.1.self_attention.linear_proj.weight
module.decoder.layers.11.self_attention.linear_qkv.weight
module.decoder.layers.9.input_layernorm.bias
module.decoder.layers.4.self_attention.linear_proj.bias
module.decoder.layers.2.input_layernorm.bias
module.decoder.layers.1.input_layernorm.bias
module.decoder.layers.11.mlp.linear_fc1.bias
module.decoder.layers.5.mlp.linear_fc1.weight
module.decoder.layers.4.pre_mlp_layernorm.bias
module.decoder.layers.2.mlp.linear_fc2.bias
module.decoder.layers.1.input_layernorm.weight
module.decoder.layers.7.mlp.linear_fc2.bias
module.decoder.layers.6.self_attention.linear_proj.bias
module.decoder.layers.4.input_layernorm.weight
module.decoder.layers.9.self_attention.linear_qkv.weight
module.decoder.layers.6.pre_mlp_layernorm.bias
module.decoder.layers.4.mlp.linear_fc2.bias
module.decoder.layers.2.self_attention.linear_qkv.weight
module.decoder.layers.9.mlp.linear_fc1.bias
module.decoder.layers.8.pre_mlp_layernorm.weight
module.decoder.layers.6.input_layernorm.weight
module.decoder.layers.3.mlp.linear_fc2.weight
module.decoder.layers.1.pre_mlp_layernorm.bias
module.decoder.layers.8.self_attention.linear_qkv.bias
module.decoder.layers.6.mlp.linear_fc2.bias
module.decoder.layers.4.self_attention.linear_proj.weight
module.decoder.layers.11.mlp.linear_fc1.weight
module.decoder.layers.10.pre_mlp_layernorm.weight
module.decoder.layers.5.mlp.linear_fc2.weight
module.decoder.layers.1.self_attention.linear_proj.bias
module.decoder.layers.10.self_attention.linear_qkv.bias
module.decoder.layers.7.mlp.linear_fc1.weight
module.decoder.layers.7.self_attention.linear_qkv.bias
module.decoder.layers.6.self_attention.linear_proj.weight
module.decoder.layers.4.input_layernorm.bias
module.decoder.layers.8.self_attention.linear_proj.bias
module.decoder.layers.2.self_attention.linear_proj.bias
module.decoder.layers.0.self_attention.linear_proj.weight
module.decoder.final_layernorm.bias
module.decoder.layers.9.mlp.linear_fc1.weight
module.decoder.layers.8.pre_mlp_layernorm.bias
module.decoder.layers.6.input_layernorm.bias
module.decoder.layers.2.mlp.linear_fc1.bias
module.decoder.layers.2.self_attention.linear_qkv.bias
module.decoder.layers.10.self_attention.linear_proj.bias
module.decoder.layers.8.input_layernorm.weight
module.decoder.layers.4.self_attention.linear_qkv.weight
module.decoder.layers.1.self_attention.linear_qkv.bias
module.decoder.layers.0.self_attention.linear_qkv.weight
module.decoder.layers.11.mlp.linear_fc2.weight
module.decoder.layers.10.pre_mlp_layernorm.bias
module.decoder.layers.8.mlp.linear_fc2.bias
module.decoder.layers.4.mlp.linear_fc1.bias
module.decoder.layers.3.pre_mlp_layernorm.weight
module.decoder.layers.0.mlp.linear_fc1.bias
module.decoder.layers.10.input_layernorm.weight
module.decoder.layers.7.mlp.linear_fc2.weight
module.decoder.layers.6.self_attention.linear_qkv.weight
module.decoder.layers.3.self_attention.linear_qkv.bias
module.decoder.layers.1.mlp.linear_fc1.bias
module.decoder.layers.10.mlp.linear_fc2.bias
module.decoder.layers.8.self_attention.linear_proj.weight
module.decoder.layers.6.mlp.linear_fc1.bias
module.decoder.layers.5.pre_mlp_layernorm.weight
module.decoder.layers.11.input_layernorm.weight
module.decoder.layers.9.mlp.linear_fc2.weight
module.decoder.layers.5.self_attention.linear_qkv.bias
module.decoder.layers.2.mlp.linear_fc1.weight
module.decoder.layers.1.pre_mlp_layernorm.weight
module.decoder.layers.10.self_attention.linear_proj.weight
module.decoder.layers.8.input_layernorm.bias
module.decoder.layers.3.self_attention.linear_proj.bias
module.decoder.layers.2.self_attention.linear_proj.weight
module.decoder.layers.0.pre_mlp_layernorm.bias
INFO:megatron.core.optimizer:Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=6e-05, min_lr=6e-06, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=True, bf16=False, params_dtype=torch.float16, use_precision_aware_optimizer=False, main_grads_dtype=torch.float32, main_params_dtype=torch.float32, exp_avg_dtype=torch.float32, exp_avg_sq_dtype=torch.float32, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=False, overlap_param_gather_with_optimizer_step=False, optimizer_cpu_offload=False, optimizer_offload_fraction=1.0, use_torch_optimizer_for_cpu_offload=False, overlap_cpu_optimizer_d2h_h2d=False, pin_cpu_grads=True, pin_cpu_params=True, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x7fe020e2b230>, config_logger_dir='')
INFO:megatron.core.optimizer_param_scheduler:> learning rate decay style: cosine
[after model, optimizer, and learning rate scheduler are built] datetime: 2025-06-18 04:59:46
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 3200
validation: 480
test: 160
INFO:megatron.core.datasets.blended_megatron_dataset_config:Let split_matrix = [(0, 0.949), (0.949, 0.999), (0.999, 1.0)]
> building train, validation, and test datasets for GPT ...
INFO:megatron.core.datasets.blended_megatron_dataset_builder:Building GPTDataset splits with sizes=(3200, 480, 160) and config=GPTDatasetConfig(random_seed=1234, sequence_length=1024, blend=(['/Volumes/hiroshi/air/dataset/gpt2/my-gpt2_text_document'], None), blend_per_split=None, split='949,50,1', split_matrix=[(0, 0.949), (0.949, 0.999), (0.999, 1.0)], num_dataset_builder_threads=1, path_to_cache=None, mmap_bin_files=True, mock=False, tokenizer=<megatron.training.tokenizer.tokenizer._GPT2BPETokenizer object at 0x7fe020aff920>, mid_level_dataset_surplus=0.005, reset_position_ids=False, reset_attention_mask=False, eod_mask_loss=False, create_attention_mask=True, drop_last_partial_validation_sequence=True, add_extra_token_to_sequence=True, object_storage_cache_path=None)
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from /Volumes/hiroshi/air/dataset/gpt2/my-gpt2_text_document.idx
INFO:megatron.core.datasets.indexed_dataset: Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset: Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset: Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 86400
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 86400
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset: Load the document index from eb33695630bbb79a379091699f1e6d2e-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the sample index from eb33695630bbb79a379091699f1e6d2e-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the shuffle index from eb33695630bbb79a379091699f1e6d2e-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 118428
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset valid indices
INFO:megatron.core.datasets.gpt_dataset: Load the document index from d5e628ae007a5962ac41e1cad89d76cd-GPTDataset-valid-document_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the sample index from d5e628ae007a5962ac41e1cad89d76cd-GPTDataset-valid-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the shuffle index from d5e628ae007a5962ac41e1cad89d76cd-GPTDataset-valid-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 6239
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset test indices
INFO:megatron.core.datasets.gpt_dataset: Load the document index from 0b1d5f9095bafbb844813872cf9e6135-GPTDataset-test-document_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the sample index from 0b1d5f9095bafbb844813872cf9e6135-GPTDataset-test-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the shuffle index from 0b1d5f9095bafbb844813872cf9e6135-GPTDataset-test-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 245
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2025-06-18 04:59:47
done with setup ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (174.86, 174.86)
train/valid/test-data-iterators-setup ..........: (1167.20, 1167.20)
training ...
Setting rerun_state_machine.current_iteration to 0...
[before the start of training step] datetime: 2025-06-18 04:59:48
[2025-06-18 05:01:25] iteration 100/ 200 | consumed samples: 1600 | elapsed time per iteration (ms): 967.3 | learning rate: 1.172093E-05 | global batch size: 16 | lm loss: 1.038274E+01 | loss scale: 131072.0 | grad norm: 2.749 | num zeros: 1221.0 | number of skipped iterations: 16 | number of nan iterations: 0 |
Number of parameters in transformer block in billions: 0.04
Number of parameters in embedding layers in billions: 0.03
Total number of parameters in billions: 0.06
Number of parameters in most loaded shard in billions: 0.0635
Theoretical memory footprints: weight and optimizer=1090.56 MB
[Rank 0] (after 100 iterations) memory (MB) | allocated: 1267.92333984375 | max allocated: 2012.49951171875 | reserved: 2130.0 | max reserved: 2130.0
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
(min, max) time across ranks (ms):
evaluate .......................................: (6381.33, 6381.33)
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
-----------------------------------------------------------------------------------------------
validation loss at iteration 100 | lm loss value: 9.953546E+00 | lm loss PPL: 2.102664E+04 |
-----------------------------------------------------------------------------------------------
[2025-06-18 05:03:00] iteration 200/ 200 | consumed samples: 3200 | elapsed time per iteration (ms): 884.6 | learning rate: 2.567442E-05 | global batch size: 16 | lm loss: 9.336311E+00 | loss scale: 131072.0 | grad norm: 3.071 | num zeros: 886.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
(min, max) time across ranks (ms):
evaluate .......................................: (3951.58, 3951.58)
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
-----------------------------------------------------------------------------------------------
validation loss at iteration 200 | lm loss value: 8.537742E+00 | lm loss PPL: 5.103805E+03 |
-----------------------------------------------------------------------------------------------
[after training is done] datetime: 2025-06-18 05:03:04
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
Evaluating on 160 samples
Evaluating iter 1/10
Evaluating iter 2/10
Evaluating iter 3/10
Evaluating iter 4/10
Evaluating iter 5/10
Evaluating iter 6/10
Evaluating iter 7/10
Evaluating iter 8/10
Evaluating iter 9/10
Evaluating iter 10/10
(min, max) time across ranks (ms):
evaluate .......................................: (3926.28, 3926.28)
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
-----------------------------------------------------------------------------------------------------------------
validation loss at iteration 200 on validation set | lm loss value: 8.532976E+00 | lm loss PPL: 5.079541E+03 |
-----------------------------------------------------------------------------------------------------------------
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
Evaluating on 160 samples
Evaluating iter 1/10
Evaluating iter 2/10
Evaluating iter 3/10
Evaluating iter 4/10
Evaluating iter 5/10
Evaluating iter 6/10
Evaluating iter 7/10
Evaluating iter 8/10
Evaluating iter 9/10
Evaluating iter 10/10
(min, max) time across ranks (ms):
evaluate .......................................: (3950.18, 3950.18)
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
-----------------------------------------------------------------------------------------------------------
validation loss at iteration 200 on test set | lm loss value: 8.540809E+00 | lm loss PPL: 5.119483E+03 |
-----------------------------------------------------------------------------------------------------------
[rank0]:[W618 05:03:12.704246014 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
BFN!