
Trying Out Databricks Serverless GPU ~MegatronLM Edition~


At Data + AI Summit 2025, a feature called "Serverless GPU" was announced. As of June 18, 2025, it is available in beta. It is apparently planned to expand to multi-GPU H100 in the future, but for now a single A10 can be used on demand.

HuggingFace Transformers, LLM-Foundry, and the like should work without any fuss, so this time I'll check whether MegatronLM runs.

① First, attach a GPU (A10) to the notebook

In the menu on the right side of the notebook, click "Environment"; you can now select an Accelerator there, so choose "A10" and press the "Apply" button at the bottom right of the screen. A hot-standby A10 is attached immediately. Not having to explicitly spin up a GPU compute cluster as before makes this much easier.

② Check the GPU

Just to be sure, let's look at the GPU information with the following command.

!nvidia-smi
Wed Jun 18 05:59:07 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   27C    P8             25W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

We can confirm that a single A10 is up and running (the driver reports it as an A10G).
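
As an extra sanity check, you can also confirm that PyTorch inside the notebook environment sees the GPU. This is only a minimal sketch: it assumes torch is already present in the environment referenced by DATABRICKS_ROOT_VIRTUALENV_ENV (covered in step ③ below).

%sh
${DATABRICKS_ROOT_VIRTUALENV_ENV}/bin/python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"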

③ Install the libraries

We'll now install the libraries, but the thing to watch out for here is the Python environment.
There are multiple Python environments under the notebook. Running MegatronLM relies on the torchrun command, so we have to install into the environment that contains that command's binary. The path to that environment is preset in the DATABRICKS_ROOT_VIRTUALENV_ENV environment variable, so we use it as-is.
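
If you want to double-check that this is indeed the environment containing torchrun, a quick look such as the following works (a minimal sketch; the resolved path will differ per session):

%sh
echo ${DATABRICKS_ROOT_VIRTUALENV_ENV}
ls ${DATABRICKS_ROOT_VIRTUALENV_ENV}/bin/torchrun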

Install from PyPI

First, install the following libraries from PyPI.

%sh
${DATABRICKS_ROOT_VIRTUALENV_ENV}/bin/python3 -m pip install nltk tiktoken zarr einops pybind11

Install APEX

Next we install APEX, which basically has to be built from source. In the snippet below, APEX_PATH, APEX_COMMIT_ID, and WHEEL_PATH are placeholders for a working directory, the APEX commit to check out, and a directory for keeping the built wheel, respectively.
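
These variables are not preset anywhere; the following is just a sketch of how they might be defined, and all three values are placeholders I've assumed. Note that shell variables generally do not carry over between separate %sh cells, so in practice these definitions belong at the top of the same cell as the build commands.

%sh
# All three values are placeholders / assumptions -- adjust them to your setup.
APEX_PATH=/tmp/apex
APEX_COMMIT_ID="the-apex-commit-id-you-want-to-build"   # placeholder
WHEEL_PATH=/Volumes/YOUR_CATALOG/YOUR_SCHEMA/YOUR_VOLUME/wheels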

%sh
git clone https://github.com/NVIDIA/apex ${APEX_PATH}
cd ${APEX_PATH}
git checkout ${APEX_COMMIT_ID} 

# Build APEX. This takes about 20 minutes.
# Generate a wheel and save it to a temporary folder (dist/)
pip wheel -v --no-deps --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" -w ${APEX_PATH}/dist .

# Save the wheel file to a convenient folder for later reuse.
mkdir -p ${WHEEL_PATH}
cp ${APEX_PATH}/dist/apex-*.whl ${WHEEL_PATH}

# pip install it.
${DATABRICKS_ROOT_VIRTUALENV_ENV}/bin/python3 -m pip install ${WHEEL_PATH}/apex-0.1-cp312-cp312-linux_x86_64.whl

Let's verify that APEX has been installed correctly.

%sh
${DATABRICKS_ROOT_VIRTUALENV_ENV}/bin/python3 -c "from apex import amp;from apex.parallel import DistributedDataParallel"

④ Clone MegatronLM and build the helper module

Clone the MegatronLM repository and build the helper module. Normally this build runs automatically at the start of training, but the "python3-config" command it uses is not included in the runtime and is hard to install (it would require apt install), so we run the g++ build explicitly, as shown below.

%sh
git clone https://github.com/NVIDIA/Megatron-LM /tmp/Megatron-LM
cd /tmp/Megatron-LM
git checkout d905ac7f1b45bde189e8df51d1ecd1a726984d32

# The environment variable "VIRTUAL_ENV" is preset.
cd /tmp/Megatron-LM/megatron/core/datasets
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/usr/include/python3.12 -I${VIRTUAL_ENV}/lib/python3.12/site-packages/pybind11/include helpers.cpp -o helpers_cpp.cpython-312-x86_64-linux-gnu.so

Either way, as long as helpers_cpp.cpython-312-x86_64-linux-gnu.so is produced, you're good.
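
If you want to make sure the shared object actually loads, something like the following can be used (a minimal sketch; importing helpers_cpp directly like this is just a check I added, not part of the MegatronLM procedure):

%sh
${DATABRICKS_ROOT_VIRTUALENV_ENV}/bin/python3 -c "import sys; sys.path.insert(0, '/tmp/Megatron-LM/megatron/core/datasets'); import helpers_cpp; print('helpers_cpp loaded')"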

⑤ Training configuration and execution

Now, finally, time to train.
Since there is only a single A10, the model is kept quite small, and features that require high-end GPUs, such as FP8, are left unused. The goal here is simply to get it running.

Note that the dataset is assumed to already be in a Unity Catalog Volume. Also, MEGATRON_PATH in the script below is assumed to point to the Megatron-LM directory cloned in step ④.
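
For reference, here is a rough sketch of how such a dataset can be prepared with Megatron-LM's tools/preprocess_data.py. The input JSONL file and all paths are placeholders (the output prefix just needs to line up with DATA_PATH below, and the vocab/merges files with VOCAB_FILE and MERGE_FILE); this is an assumption about how the data was produced, not something specific to Serverless GPU.

%sh
cd /tmp/Megatron-LM
# Paths below are placeholders -- point them at your own corpus and Volume.
${DATABRICKS_ROOT_VIRTUALENV_ENV}/bin/python3 tools/preprocess_data.py \
    --input /Volumes/YOUR_CATALOG/YOUR_SCHEMA/YOUR_VOLUME/gpt2/corpus.jsonl \
    --output-prefix /Volumes/YOUR_CATALOG/YOUR_SCHEMA/YOUR_VOLUME/gpt2/my-gpt2 \
    --vocab-file /Volumes/YOUR_CATALOG/YOUR_SCHEMA/YOUR_VOLUME/gpt2/gpt2-vocab.json \
    --merge-file /Volumes/YOUR_CATALOG/YOUR_SCHEMA/YOUR_VOLUME/gpt2/gpt2-merges.txt \
    --tokenizer-type GPT2BPETokenizer \
    --append-eod \
    --workers 4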

%sh
export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=1

VOCAB_FILE=/Volumes/YOUR_CATALOG/YOUR_SCHEMA/YOUR_VOLUME/gpt2/gpt2-vocab.json
MERGE_FILE=/Volumes/YOUR_CATALOG/YOUR_SCHEMA/YOUR_VOLUME/gpt2/gpt2-merges.txt
DATA_PATH=/Volumes/YOUR_CATALOG/YOUR_SCHEMA/YOUR_VOLUME/gpt2/my-gpt2_text_document

DISTRIBUTED_ARGS=(
    --nproc_per_node $GPUS_PER_NODE # this environment variable is preset in the runtime
    --nnodes 1 
    --node_rank 0
    --master_addr $HOST_IP # this environment variable is preset in the runtime
    --master_port 6001
)

GPT_MODEL_ARGS=(
    --num-layers 12 
    --hidden-size 512 
    --num-attention-heads 8 
    --seq-length 1024 
    --max-position-embeddings 1024 
)

TRAINING_ARGS=(
    --micro-batch-size 1 
    --global-batch-size 16 
    --train-iters 200 
    --weight-decay 0.1 
    --adam-beta1 0.9 
    --adam-beta2 0.95 
    --init-method-std 0.006 
    --clip-grad 1.0 
    --fp16
    --lr 6.0e-5 
    --lr-decay-style cosine 
    --min-lr 6.0e-6
    --lr-warmup-fraction .001 
    --lr-decay-iters 430000 
    --transformer-impl local
    --no-persist-layer-norm
    --no-gradient-accumulation-fusion
    --no-masked-softmax-fusion
)

MODEL_PARALLEL_ARGS=(
  --tensor-model-parallel-size 1 
  --pipeline-model-parallel-size 1 
)

DATA_ARGS=(
    --data-path $DATA_PATH 
    --vocab-file $VOCAB_FILE 
    --merge-file $MERGE_FILE 
    --split 949,50,1
)

EVAL_AND_LOGGING_ARGS=(
    --eval-interval 100 
    --eval-iters 10
)

cd ${MEGATRON_PATH}
torchrun ${DISTRIBUTED_ARGS[@]} pretrain_gpt.py \
    ${GPT_MODEL_ARGS[@]} \
    ${TRAINING_ARGS[@]} \
    ${MODEL_PARALLEL_ARGS[@]} \
    ${DATA_ARGS[@]} \
    ${EVAL_AND_LOGGING_ARGS[@]} \
    --distributed-backend nccl

It started up without any problems. The execution log is a bit long, but I'm pasting it below.

using world size: 1, data-parallel size: 1, context-parallel size: 1, hierarchical context-parallel sizes: Nonetensor-model-parallel size: 1, encoder-tensor-model-parallel size: 0, pipeline-model-parallel size: 1, encoder-pipeline-model-parallel size: 0
Number of virtual stages per pipeline stage: None
WARNING: Setting args.check_for_nan_in_loss_and_grad to False since dynamic loss scaling is being used
using torch.float16 for parameters ...
------------------------ arguments ------------------------
  account_for_embedding_in_pipeline_split ......... False
  account_for_loss_in_pipeline_split .............. False
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.95
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  add_qkv_bias .................................... True
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  align_grad_reduce ............................... True
  align_param_gather .............................. False
  app_tag_run_name ................................ None
  app_tag_run_version ............................. 0.0.0
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  apply_rope_fusion ............................... False
  async_save ...................................... None
  async_tensor_model_parallel_allreduce ........... True
  attention_backend ............................... AttnBackend.auto
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  auto_detect_ckpt_format ......................... False
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ True
  bias_swiglu_fusion .............................. True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  calc_ft_timeouts ................................ False
  calculate_per_token_loss ........................ False
  check_for_large_grads ........................... False
  check_for_nan_in_loss_and_grad .................. False
  check_for_spiky_loss ............................ False
  check_weight_hash_across_dp_replicas_interval ... None
  ckpt_assume_constant_structure .................. False
  ckpt_convert_format ............................. None
  ckpt_convert_save ............................... None
  ckpt_convert_update_legacy_dist_opt_format ...... False
  ckpt_format ..................................... torch_dist
  ckpt_fully_parallel_load ........................ False
  ckpt_fully_parallel_save ........................ True
  ckpt_fully_parallel_save_deprecated ............. False
  ckpt_step ....................................... None
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  clone_scatter_output_in_embedding ............... True
  config_logger_dir ............................... 
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  context_parallel_size ........................... 1
  cp_comm_type .................................... ['p2p']
  create_attention_mask_in_dataloader ............. True
  cross_entropy_fusion_impl ....................... native
  cross_entropy_loss_fusion ....................... False
  cuda_graph_scope ................................ full
  cuda_graph_warmup_steps ......................... 3
  data_args_path .................................. None
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_sharding_strategy ................. no_shard
  data_parallel_size .............................. 1
  data_path ....................................... ['/Volumes/hiroshi/air/dataset/gpt2/my-gpt2_text_document']
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  ddp_average_in_collective ....................... False
  ddp_bucket_size ................................. None
  ddp_num_buckets ................................. None
  ddp_pad_buckets_for_high_nccl_busbw ............. False
  decoder_first_pipeline_num_layers ............... None
  decoder_last_pipeline_num_layers ................ None
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  decoupled_lr .................................... None
  decoupled_min_lr ................................ None
  decrease_batch_size_if_needed ................... False
  defer_embedding_wgrad_compute ................... False
  deprecated_use_mcore_models ..................... False
  deterministic_mode .............................. False
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  disable_bf16_reduced_precision_matmul ........... False
  disable_mamba_mem_eff_path ...................... False
  disable_straggler_on_startup .................... False
  dist_ckpt_format_deprecated ..................... None
  dist_ckpt_strictness ............................ assume_ok_unexpected
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 10
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  enable_cuda_graph ............................... False
  enable_ft_package ............................... False
  enable_gloo_process_groups ...................... True
  enable_msc ...................................... True
  enable_one_logger ............................... True
  encoder_num_layers .............................. 12
  encoder_pipeline_model_parallel_size ............ 0
  encoder_seq_length .............................. 1024
  encoder_tensor_model_parallel_size .............. 0
  end_weight_decay ................................ 0.1
  eod_mask_loss ................................... False
  error_injection_rate ............................ 0
  error_injection_type ............................ transient_error
  eval_interval ................................... 100
  eval_iters ...................................... 10
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  exp_avg_dtype ................................... torch.float32
  exp_avg_sq_dtype ................................ torch.float32
  expert_model_parallel_size ...................... 1
  expert_tensor_parallel_size ..................... 1
  external_cuda_graph ............................. False
  ffn_hidden_size ................................. 2048
  finetune ........................................ False
  first_last_layers_bf16 .......................... False
  flash_decode .................................... False
  fp16 ............................................ True
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_param_gather ................................ False
  fp8_recipe ...................................... delayed
  fp8_wgrad ....................................... True
  global_batch_size ............................... 16
  grad_reduce_in_bf16 ............................. False
  gradient_accumulation_fusion .................... False
  gradient_reduce_div_fusion ...................... True
  group_query_attention ........................... False
  head_lr_mult .................................... 1.0
  heterogeneous_layers_config_encoded_json ........ None
  heterogeneous_layers_config_path ................ None
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 512
  hierarchical_context_parallel_sizes ............. None
  hybrid_attention_ratio .......................... 0.0
  hybrid_mlp_ratio ................................ 0.0
  hybrid_override_pattern ......................... None
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... -1
  inference_dynamic_batching ...................... False
  inference_dynamic_batching_buffer_guaranteed_fraction  0.2
  inference_dynamic_batching_buffer_overflow_factor  None
  inference_dynamic_batching_buffer_size_gb ....... 40.0
  inference_dynamic_batching_chunk_size ........... 256
  inference_dynamic_batching_max_requests_override  None
  inference_dynamic_batching_max_tokens_override .. None
  inference_max_batch_size ........................ 8
  inference_max_seq_length ........................ 2560
  inference_rng_tracker ........................... False
  init_method_std ................................. 0.006
  init_method_xavier_uniform ...................... False
  init_model_with_meta_device ..................... False
  initial_loss_scale .............................. 4294967296
  is_hybrid_model ................................. False
  iter_per_epoch .................................. 1250
  iterations_to_skip .............................. []
  keep_fp8_transpose_cache_when_using_custom_fsdp . False
  kv_channels ..................................... 64
  kv_lora_rank .................................... 32
  lazy_mpu_init ................................... None
  load ............................................ None
  local_rank ...................................... 0
  log_interval .................................... 100
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_progress .................................... False
  log_straggler ................................... False
  log_throughput .................................. False
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  logging_level ................................... None
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 6e-05
  lr_decay_iters .................................. 430000
  lr_decay_samples ................................ None
  lr_decay_style .................................. cosine
  lr_warmup_fraction .............................. 0.001
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  lr_wsd_decay_iters .............................. None
  lr_wsd_decay_samples ............................ None
  lr_wsd_decay_style .............................. exponential
  main_grads_dtype ................................ torch.float32
  main_params_dtype ............................... torch.float32
  make_vocab_size_divisible_by .................... 128
  mamba_head_dim .................................. 64
  mamba_num_groups ................................ 8
  mamba_num_heads ................................. None
  mamba_state_dim ................................. 128
  manual_gc ....................................... False
  manual_gc_eval .................................. True
  manual_gc_interval .............................. 0
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... False
  max_position_embeddings ......................... 1024
  max_tokens_to_oom ............................... 12000
  memory_snapshot_path ............................ snapshot.pickle
  merge_file ...................................... /Volumes/hiroshi/air/dataset/gpt2/gpt2-merges.txt
  micro_batch_size ................................ 1
  microbatch_group_size_per_vp_stage .............. None
  mid_level_dataset_surplus ....................... 0.005
  min_loss_scale .................................. 1.0
  min_lr .......................................... 6e-06
  mlp_chunks_for_prefill .......................... 1
  mmap_bin_files .................................. True
  mock_data ....................................... False
  moe_aux_loss_coeff .............................. 0.0
  moe_enable_deepep ............................... False
  moe_expert_capacity_factor ...................... None
  moe_extended_tp ................................. False
  moe_ffn_hidden_size ............................. None
  moe_grouped_gemm ................................ False
  moe_input_jitter_eps ............................ None
  moe_layer_freq .................................. 1
  moe_layer_recompute ............................. False
  moe_pad_expert_input_to_capacity ................ False
  moe_per_layer_logging ........................... False
  moe_permute_fusion .............................. False
  moe_router_bias_update_rate ..................... 0.001
  moe_router_dtype ................................ None
  moe_router_enable_expert_bias ................... False
  moe_router_group_topk ........................... None
  moe_router_load_balancing_type .................. aux_loss
  moe_router_num_groups ........................... None
  moe_router_pre_softmax .......................... False
  moe_router_score_function ....................... softmax
  moe_router_topk ................................. 2
  moe_router_topk_scaling_factor .................. None
  moe_shared_expert_intermediate_size ............. None
  moe_shared_expert_overlap ....................... False
  moe_token_dispatcher_type ....................... allgather
  moe_token_drop_policy ........................... probs
  moe_use_legacy_grouped_gemm ..................... False
  moe_use_upcycling ............................... False
  moe_z_loss_coeff ................................ None
  mrope_section ................................... None
  mscale .......................................... 1.0
  mscale_all_dim .................................. 1.0
  mtp_loss_scaling_factor ......................... 0.1
  mtp_num_layers .................................. None
  multi_latent_attention .......................... False
  nccl_communicator_config_path ................... None
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_persist_layer_norm ........................... True
  no_save_optim ................................... None
  no_save_rng ..................................... None
  non_persistent_ckpt_type ........................ None
  non_persistent_global_ckpt_dir .................. None
  non_persistent_local_ckpt_algo .................. fully_parallel
  non_persistent_local_ckpt_dir ................... None
  non_persistent_save_interval .................... None
  norm_epsilon .................................... 1e-05
  normalization ................................... LayerNorm
  num_attention_heads ............................. 8
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_dataset_builder_threads ..................... 1
  num_distributed_optimizer_instances ............. 1
  num_experts ..................................... None
  num_layers ...................................... 12
  num_layers_at_end_in_bf16 ....................... 1
  num_layers_at_start_in_bf16 ..................... 1
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 1
  num_virtual_stages_per_pipeline_rank ............ None
  num_workers ..................................... 2
  object_storage_cache_path ....................... None
  one_logger_async ................................ False
  one_logger_project .............................. megatron-lm
  one_logger_run_name ............................. None
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  optimizer_cpu_offload ........................... False
  optimizer_offload_fraction ...................... 1.0
  output_bert_embeddings .......................... False
  overlap_cpu_optimizer_d2h_h2d ................... False
  overlap_grad_reduce ............................. False
  overlap_p2p_comm ................................ False
  overlap_p2p_comm_warmup_flush ................... False
  overlap_param_gather ............................ False
  overlap_param_gather_with_optimizer_step ........ False
  override_opt_param_scheduler .................... False
  params_dtype .................................... torch.float16
  patch_dim ....................................... 16
  per_split_data_args_path ........................ None
  perform_initialization .......................... True
  pin_cpu_grads ................................... True
  pin_cpu_params .................................. True
  pipeline_model_parallel_comm_backend ............ None
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... learned_absolute
  pretrained_checkpoint ........................... None
  profile ......................................... False
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  q_lora_rank ..................................... None
  qk_head_dim ..................................... 128
  qk_layernorm .................................... False
  qk_pos_emb_head_dim ............................. 64
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_modules ............................... None
  recompute_num_layers ............................ None
  record_memory_history ........................... False
  relative_attention_max_distance ................. 128
  relative_attention_num_buckets .................. 32
  replication ..................................... False
  replication_factor .............................. 2
  replication_jump ................................ None
  rerun_mode ...................................... disabled
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  result_rejected_tracker_filename ................ None
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_attention_gate ............................ 1
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_project_dir ............................... None
  retro_verify_neighbor_count ..................... True
  rope_scaling_factor ............................. 8.0
  rotary_base ..................................... 10000
  rotary_interleaved .............................. False
  rotary_percent .................................. 1.0
  rotary_scaling_factor ........................... 1.0
  rotary_seq_len_interpolation_factor ............. None
  run_workload_inspector_server ................... False
  sample_rate ..................................... 1.0
  save ............................................ None
  save_interval ................................... None
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 1024
  sequence_parallel ............................... False
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  skipped_train_samples ........................... 0
  spec ............................................ None
  split ........................................... 949,50,1
  squared_relu .................................... False
  start_weight_decay .............................. 0.1
  straggler_ctrlr_port ............................ 65535
  straggler_minmax_count .......................... 1
  suggested_communication_unit_size ............... None
  swiglu .......................................... False
  swin_backbone_type .............................. tiny
  te_rng_tracker .................................. False
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. None
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  test_mode ....................................... False
  tiktoken_num_special_tokens ..................... 1000
  tiktoken_pattern ................................ None
  tiktoken_special_tokens ......................... None
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. None
  tokenizer_type .................................. GPT2BPETokenizer
  tp_comm_bootstrap_backend ....................... nccl
  tp_comm_bulk_dgrad .............................. True
  tp_comm_bulk_wgrad .............................. True
  tp_comm_overlap ................................. False
  tp_comm_overlap_ag .............................. True
  tp_comm_overlap_cfg ............................. None
  tp_comm_overlap_rs .............................. True
  tp_comm_overlap_rs_dgrad ........................ False
  tp_comm_split_ag ................................ True
  tp_comm_split_rs ................................ True
  train_data_path ................................. None
  train_iters ..................................... 200
  train_samples ................................... None
  train_sync_interval ............................. None
  transformer_impl ................................ local
  transformer_pipeline_model_parallel_size ........ 1
  untie_embeddings_and_output_weights ............. False
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... None
  use_custom_fsdp ................................. False
  use_dist_ckpt ................................... True
  use_dist_ckpt_deprecated ........................ False
  use_distributed_optimizer ....................... False
  use_flash_attn .................................. False
  use_legacy_models ............................... False
  use_mp_args_from_checkpoint_args ................ False
  use_one_sent_docs ............................... False
  use_persistent_ckpt_worker ...................... False
  use_precision_aware_optimizer ................... False
  use_pytorch_profiler ............................ False
  use_ring_exchange_p2p ........................... False
  use_rope_scaling ................................ False
  use_rotary_position_embeddings .................. False
  use_tokenizer_model_from_checkpoint_args ........ True
  use_torch_fsdp2 ................................. False
  use_torch_optimizer_for_cpu_offload ............. False
  use_tp_pp_dp_mapping ............................ False
  v_head_dim ...................................... 128
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... /Volumes/hiroshi/air/dataset/gpt2/gpt2-vocab.json
  vocab_size ...................................... None
  wandb_exp_name .................................. 
  wandb_project ................................... 
  wandb_save_dir .................................. 
  weight_decay .................................... 0.1
  weight_decay_incr_style ......................... constant
  wgrad_deferral_limit ............................ 0
  world_size ...................................... 1
  yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
INFO:megatron.core.num_microbatches_calculator:setting number of microbatches to constant 16
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
WARNING:megatron.core.rerun_state_machine:RerunStateMachine initialized in mode RerunMode.DISABLED
> initializing torch distributed ...
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory '/tmp/hiroshi.ouchiyama@databricks.com/20250618045849/Megatron-LM/megatron/core/datasets'
make: python3-config: No such file or directory
make: python3-config: No such file or directory
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/usr/include/python3.12 -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-9dc2b951-5bfc-4f90-87ba-6ff17400a912/lib/python3.12/site-packages/pybind11/include helpers.cpp -o helpers_cpp
In file included from helpers.cpp:12:
In function ‘pybind11::ssize_t pybind11::detail::byte_offset_unsafe(const Strides&, pybind11::ssize_t, Ix ...) [with long int Dim = 0; Strides = std::array<long int, 1>; Ix = {}]’,
    inlined from ‘const T& pybind11::detail::unchecked_reference<T, Dims>::operator()(Ix ...) const [with Ix = {long int}; T = long int; long int Dims = 1]’ at /local_disk0/.ephemeral_nfs/envs/pythonEnv-9dc2b951-5bfc-4f90-87ba-6ff17400a912/lib/python3.12/site-packages/pybind11/include/pybind11/numpy.h:545:65,
    inlined from ‘const T& pybind11::detail::unchecked_reference<T, Dims>::operator[](pybind11::ssize_t) const [with long int D = 1; <template-parameter-2-2> = void; T = long int; long int Dims = 1]’ at /local_disk0/.ephemeral_nfs/envs/pythonEnv-9dc2b951-5bfc-4f90-87ba-6ff17400a912/lib/python3.12/site-packages/pybind11/include/pybind11/numpy.h:553:26,
    inlined from ‘void build_exhaustive_blending_indices(pybind11::array_t<short int>&, pybind11::array_t<long int>&, const pybind11::array_t<long int>&, int32_t)’ at helpers.cpp:67:31:
/local_disk0/.ephemeral_nfs/envs/pythonEnv-9dc2b951-5bfc-4f90-87ba-6ff17400a912/lib/python3.12/site-packages/pybind11/include/pybind11/numpy.h:494:14: warning: ‘error_argmax’ may be used uninitialized [-Wmaybe-uninitialized]
  494 |     return i * strides[Dim] + byte_offset_unsafe<Dim + 1>(strides, index...);
      |            ~~^~~~~~~~~~
helpers.cpp: In function ‘void build_exhaustive_blending_indices(pybind11::array_t<short int>&, pybind11::array_t<long int>&, const pybind11::array_t<long int>&, int32_t)’:
helpers.cpp:49:13: note: ‘error_argmax’ was declared here
   49 |     int64_t error_argmax;
      |             ^~~~~~~~~~~~
make: Leaving directory '/tmp/hiroshi.ouchiyama@databricks.com/20250618045849/Megatron-LM/megatron/core/datasets'
>>> done with dataset index builder. Compilation time: 6.658 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
> compiling and loading fused kernels ...
[rank0]:[W618 04:59:39.002912817 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
>>> done with compiling and loading fused kernels. Compilation time: 1.331 seconds
time to initialize megatron (seconds): 13.500
[after megatron is initialized] datetime: 2025-06-18 04:59:45 
building GPT model ...
/tmp/hiroshi.ouchiyama@databricks.com/20250618045849/Megatron-LM/megatron/core/models/gpt/gpt_layer_specs.py:205: UserWarning: The fp8 argument in "get_gpt_layer_local_spec" has been deprecated and will be removed soon. Please update your code accordingly.
  warnings.warn(
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 64109568
INFO:megatron.core.distributed.distributed_data_parallel:Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=False, overlap_grad_reduce=False, overlap_param_gather=False, align_param_gather=False, use_distributed_optimizer=False, num_distributed_optimizer_instances=1, check_for_nan_in_grad=False, check_for_large_grads=False, bucket_size=None, pad_buckets_for_high_nccl_busbw=False, average_in_collective=False, fp8_param_gather=False, use_custom_fsdp=False, data_parallel_sharding_strategy='no_shard', gradient_reduce_div_fusion=True, suggested_communication_unit_size=None, preserve_fp32_weights=True, keep_fp8_transpose_cache_when_using_custom_fsdp=False)
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
Params for bucket 1 (64109568 elements, 64109568 padded size):
	module.decoder.layers.4.mlp.linear_fc1.weight
	module.decoder.layers.3.pre_mlp_layernorm.bias
	module.decoder.layers.10.input_layernorm.bias
	module.decoder.layers.7.mlp.linear_fc1.bias
	module.decoder.layers.5.self_attention.linear_proj.bias
	module.decoder.layers.3.input_layernorm.weight
	module.decoder.layers.1.mlp.linear_fc2.weight
	module.embedding.word_embeddings.weight
	module.decoder.layers.11.pre_mlp_layernorm.weight
	module.decoder.layers.8.self_attention.linear_qkv.weight
	module.decoder.layers.6.mlp.linear_fc1.weight
	module.decoder.layers.5.pre_mlp_layernorm.bias
	module.decoder.layers.3.mlp.linear_fc2.bias
	module.decoder.layers.8.mlp.linear_fc1.bias
	module.decoder.layers.7.pre_mlp_layernorm.weight
	module.decoder.layers.7.self_attention.linear_proj.bias
	module.decoder.layers.5.input_layernorm.weight
	module.decoder.layers.2.mlp.linear_fc2.weight
	module.decoder.layers.0.mlp.linear_fc2.weight
	module.decoder.layers.10.self_attention.linear_qkv.weight
	module.decoder.layers.5.mlp.linear_fc2.bias
	module.decoder.layers.3.self_attention.linear_proj.weight
	module.decoder.layers.0.self_attention.linear_qkv.bias
	module.decoder.layers.10.mlp.linear_fc1.bias
	module.decoder.layers.9.pre_mlp_layernorm.weight
	module.decoder.layers.7.input_layernorm.weight
	module.decoder.layers.4.mlp.linear_fc2.weight
	module.decoder.layers.2.input_layernorm.weight
	module.decoder.layers.11.self_attention.linear_proj.bias
	module.decoder.layers.9.self_attention.linear_qkv.bias
	module.decoder.layers.5.self_attention.linear_proj.weight
	module.decoder.layers.3.input_layernorm.bias
	module.decoder.layers.1.mlp.linear_fc1.weight
	module.decoder.layers.1.self_attention.linear_qkv.weight
	module.decoder.layers.0.input_layernorm.weight
	module.decoder.layers.11.pre_mlp_layernorm.bias
	module.decoder.layers.6.mlp.linear_fc2.weight
	module.decoder.layers.0.mlp.linear_fc2.bias
	module.decoder.layers.11.self_attention.linear_qkv.bias
	module.decoder.layers.8.mlp.linear_fc1.weight
	module.decoder.layers.7.pre_mlp_layernorm.bias
	module.decoder.layers.7.self_attention.linear_proj.weight
	module.decoder.layers.5.input_layernorm.bias
	module.decoder.layers.11.mlp.linear_fc2.bias
	module.decoder.layers.9.self_attention.linear_proj.bias
	module.decoder.layers.3.self_attention.linear_qkv.weight
	module.decoder.layers.0.mlp.linear_fc1.weight
	module.decoder.final_layernorm.weight
	module.decoder.layers.10.mlp.linear_fc1.weight
	module.decoder.layers.9.pre_mlp_layernorm.bias
	module.decoder.layers.7.input_layernorm.bias
	module.decoder.layers.3.mlp.linear_fc1.bias
	module.decoder.layers.2.pre_mlp_layernorm.weight
	module.decoder.layers.0.pre_mlp_layernorm.weight
	module.decoder.layers.11.self_attention.linear_proj.weight
	module.decoder.layers.9.input_layernorm.weight
	module.decoder.layers.5.self_attention.linear_qkv.weight
	module.decoder.layers.9.mlp.linear_fc2.bias
	module.decoder.layers.5.mlp.linear_fc1.bias
	module.decoder.layers.4.pre_mlp_layernorm.weight
	module.decoder.layers.11.input_layernorm.bias
	module.decoder.layers.8.mlp.linear_fc2.weight
	module.decoder.layers.7.self_attention.linear_qkv.weight
	module.decoder.layers.4.self_attention.linear_qkv.bias
	module.decoder.layers.0.self_attention.linear_proj.bias
	module.decoder.layers.0.input_layernorm.bias
	module.decoder.layers.9.self_attention.linear_proj.weight
	module.decoder.layers.6.pre_mlp_layernorm.weight
	module.embedding.position_embeddings.weight
	module.decoder.layers.10.mlp.linear_fc2.weight
	module.decoder.layers.6.self_attention.linear_qkv.bias
	module.decoder.layers.3.mlp.linear_fc1.weight
	module.decoder.layers.2.pre_mlp_layernorm.bias
	module.decoder.layers.1.mlp.linear_fc2.bias
	module.decoder.layers.1.self_attention.linear_proj.weight
	module.decoder.layers.11.self_attention.linear_qkv.weight
	module.decoder.layers.9.input_layernorm.bias
	module.decoder.layers.4.self_attention.linear_proj.bias
	module.decoder.layers.2.input_layernorm.bias
	module.decoder.layers.1.input_layernorm.bias
	module.decoder.layers.11.mlp.linear_fc1.bias
	module.decoder.layers.5.mlp.linear_fc1.weight
	module.decoder.layers.4.pre_mlp_layernorm.bias
	module.decoder.layers.2.mlp.linear_fc2.bias
	module.decoder.layers.1.input_layernorm.weight
	module.decoder.layers.7.mlp.linear_fc2.bias
	module.decoder.layers.6.self_attention.linear_proj.bias
	module.decoder.layers.4.input_layernorm.weight
	module.decoder.layers.9.self_attention.linear_qkv.weight
	module.decoder.layers.6.pre_mlp_layernorm.bias
	module.decoder.layers.4.mlp.linear_fc2.bias
	module.decoder.layers.2.self_attention.linear_qkv.weight
	module.decoder.layers.9.mlp.linear_fc1.bias
	module.decoder.layers.8.pre_mlp_layernorm.weight
	module.decoder.layers.6.input_layernorm.weight
	module.decoder.layers.3.mlp.linear_fc2.weight
	module.decoder.layers.1.pre_mlp_layernorm.bias
	module.decoder.layers.8.self_attention.linear_qkv.bias
	module.decoder.layers.6.mlp.linear_fc2.bias
	module.decoder.layers.4.self_attention.linear_proj.weight
	module.decoder.layers.11.mlp.linear_fc1.weight
	module.decoder.layers.10.pre_mlp_layernorm.weight
	module.decoder.layers.5.mlp.linear_fc2.weight
	module.decoder.layers.1.self_attention.linear_proj.bias
	module.decoder.layers.10.self_attention.linear_qkv.bias
	module.decoder.layers.7.mlp.linear_fc1.weight
	module.decoder.layers.7.self_attention.linear_qkv.bias
	module.decoder.layers.6.self_attention.linear_proj.weight
	module.decoder.layers.4.input_layernorm.bias
	module.decoder.layers.8.self_attention.linear_proj.bias
	module.decoder.layers.2.self_attention.linear_proj.bias
	module.decoder.layers.0.self_attention.linear_proj.weight
	module.decoder.final_layernorm.bias
	module.decoder.layers.9.mlp.linear_fc1.weight
	module.decoder.layers.8.pre_mlp_layernorm.bias
	module.decoder.layers.6.input_layernorm.bias
	module.decoder.layers.2.mlp.linear_fc1.bias
	module.decoder.layers.2.self_attention.linear_qkv.bias
	module.decoder.layers.10.self_attention.linear_proj.bias
	module.decoder.layers.8.input_layernorm.weight
	module.decoder.layers.4.self_attention.linear_qkv.weight
	module.decoder.layers.1.self_attention.linear_qkv.bias
	module.decoder.layers.0.self_attention.linear_qkv.weight
	module.decoder.layers.11.mlp.linear_fc2.weight
	module.decoder.layers.10.pre_mlp_layernorm.bias
	module.decoder.layers.8.mlp.linear_fc2.bias
	module.decoder.layers.4.mlp.linear_fc1.bias
	module.decoder.layers.3.pre_mlp_layernorm.weight
	module.decoder.layers.0.mlp.linear_fc1.bias
	module.decoder.layers.10.input_layernorm.weight
	module.decoder.layers.7.mlp.linear_fc2.weight
	module.decoder.layers.6.self_attention.linear_qkv.weight
	module.decoder.layers.3.self_attention.linear_qkv.bias
	module.decoder.layers.1.mlp.linear_fc1.bias
	module.decoder.layers.10.mlp.linear_fc2.bias
	module.decoder.layers.8.self_attention.linear_proj.weight
	module.decoder.layers.6.mlp.linear_fc1.bias
	module.decoder.layers.5.pre_mlp_layernorm.weight
	module.decoder.layers.11.input_layernorm.weight
	module.decoder.layers.9.mlp.linear_fc2.weight
	module.decoder.layers.5.self_attention.linear_qkv.bias
	module.decoder.layers.2.mlp.linear_fc1.weight
	module.decoder.layers.1.pre_mlp_layernorm.weight
	module.decoder.layers.10.self_attention.linear_proj.weight
	module.decoder.layers.8.input_layernorm.bias
	module.decoder.layers.3.self_attention.linear_proj.bias
	module.decoder.layers.2.self_attention.linear_proj.weight
	module.decoder.layers.0.pre_mlp_layernorm.bias
INFO:megatron.core.optimizer:Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=6e-05, min_lr=6e-06, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=True, bf16=False, params_dtype=torch.float16, use_precision_aware_optimizer=False, main_grads_dtype=torch.float32, main_params_dtype=torch.float32, exp_avg_dtype=torch.float32, exp_avg_sq_dtype=torch.float32, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=False, overlap_param_gather_with_optimizer_step=False, optimizer_cpu_offload=False, optimizer_offload_fraction=1.0, use_torch_optimizer_for_cpu_offload=False, overlap_cpu_optimizer_d2h_h2d=False, pin_cpu_grads=True, pin_cpu_params=True, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x7fe020e2b230>, config_logger_dir='')
INFO:megatron.core.optimizer_param_scheduler:> learning rate decay style: cosine
[after model, optimizer, and learning rate scheduler are built] datetime: 2025-06-18 04:59:46 
> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      3200
    validation: 480
    test:       160
INFO:megatron.core.datasets.blended_megatron_dataset_config:Let split_matrix = [(0, 0.949), (0.949, 0.999), (0.999, 1.0)]
> building train, validation, and test datasets for GPT ...
INFO:megatron.core.datasets.blended_megatron_dataset_builder:Building GPTDataset splits with sizes=(3200, 480, 160) and config=GPTDatasetConfig(random_seed=1234, sequence_length=1024, blend=(['/Volumes/hiroshi/air/dataset/gpt2/my-gpt2_text_document'], None), blend_per_split=None, split='949,50,1', split_matrix=[(0, 0.949), (0.949, 0.999), (0.999, 1.0)], num_dataset_builder_threads=1, path_to_cache=None, mmap_bin_files=True, mock=False, tokenizer=<megatron.training.tokenizer.tokenizer._GPT2BPETokenizer object at 0x7fe020aff920>, mid_level_dataset_surplus=0.005, reset_position_ids=False, reset_attention_mask=False, eod_mask_loss=False, create_attention_mask=True, drop_last_partial_validation_sequence=True, add_extra_token_to_sequence=True, object_storage_cache_path=None)
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from /Volumes/hiroshi/air/dataset/gpt2/my-gpt2_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 86400
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 86400
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from eb33695630bbb79a379091699f1e6d2e-GPTDataset-train-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from eb33695630bbb79a379091699f1e6d2e-GPTDataset-train-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from eb33695630bbb79a379091699f1e6d2e-GPTDataset-train-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 118428
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset valid indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from d5e628ae007a5962ac41e1cad89d76cd-GPTDataset-valid-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from d5e628ae007a5962ac41e1cad89d76cd-GPTDataset-valid-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from d5e628ae007a5962ac41e1cad89d76cd-GPTDataset-valid-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 6239
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset test indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 0b1d5f9095bafbb844813872cf9e6135-GPTDataset-test-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 0b1d5f9095bafbb844813872cf9e6135-GPTDataset-test-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 0b1d5f9095bafbb844813872cf9e6135-GPTDataset-test-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 245
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2025-06-18 04:59:47 
done with setup ...
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (174.86, 174.86)
    train/valid/test-data-iterators-setup ..........: (1167.20, 1167.20)
training ...
Setting rerun_state_machine.current_iteration to 0...
[before the start of training step] datetime: 2025-06-18 04:59:48 
 [2025-06-18 05:01:25] iteration      100/     200 | consumed samples:         1600 | elapsed time per iteration (ms): 967.3 | learning rate: 1.172093E-05 | global batch size:    16 | lm loss: 1.038274E+01 | loss scale: 131072.0 | grad norm: 2.749 | num zeros: 1221.0 | number of skipped iterations:  16 | number of nan iterations:   0 |
Number of parameters in transformer block in billions:  0.04
Number of parameters in embedding layers in billions: 0.03
Total number of parameters in billions: 0.06
Number of parameters in most loaded shard in billions: 0.0635
Theoretical memory footprints: weight and optimizer=1090.56 MB
[Rank 0] (after 100 iterations) memory (MB) | allocated: 1267.92333984375 | max allocated: 2012.49951171875 | reserved: 2130.0 | max reserved: 2130.0
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
(min, max) time across ranks (ms):
    evaluate .......................................: (6381.33, 6381.33)
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
-----------------------------------------------------------------------------------------------
 validation loss at iteration 100 | lm loss value: 9.953546E+00 | lm loss PPL: 2.102664E+04 | 
-----------------------------------------------------------------------------------------------
 [2025-06-18 05:03:00] iteration      200/     200 | consumed samples:         3200 | elapsed time per iteration (ms): 884.6 | learning rate: 2.567442E-05 | global batch size:    16 | lm loss: 9.336311E+00 | loss scale: 131072.0 | grad norm: 3.071 | num zeros: 886.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
(min, max) time across ranks (ms):
    evaluate .......................................: (3951.58, 3951.58)
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
-----------------------------------------------------------------------------------------------
 validation loss at iteration 200 | lm loss value: 8.537742E+00 | lm loss PPL: 5.103805E+03 | 
-----------------------------------------------------------------------------------------------
[after training is done] datetime: 2025-06-18 05:03:04 
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
Evaluating on 160 samples
Evaluating iter 1/10
Evaluating iter 2/10
Evaluating iter 3/10
Evaluating iter 4/10
Evaluating iter 5/10
Evaluating iter 6/10
Evaluating iter 7/10
Evaluating iter 8/10
Evaluating iter 9/10
Evaluating iter 10/10
(min, max) time across ranks (ms):
    evaluate .......................................: (3926.28, 3926.28)
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
-----------------------------------------------------------------------------------------------------------------
 validation loss at iteration 200 on validation set | lm loss value: 8.532976E+00 | lm loss PPL: 5.079541E+03 | 
-----------------------------------------------------------------------------------------------------------------
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
Evaluating on 160 samples
Evaluating iter 1/10
Evaluating iter 2/10
Evaluating iter 3/10
Evaluating iter 4/10
Evaluating iter 5/10
Evaluating iter 6/10
Evaluating iter 7/10
Evaluating iter 8/10
Evaluating iter 9/10
Evaluating iter 10/10
(min, max) time across ranks (ms):
    evaluate .......................................: (3950.18, 3950.18)
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode RerunMode.DISABLED
-----------------------------------------------------------------------------------------------------------
 validation loss at iteration 200 on test set | lm loss value: 8.540809E+00 | lm loss PPL: 5.119483E+03 | 
-----------------------------------------------------------------------------------------------------------
[rank0]:[W618 05:03:12.704246014 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

BFN!
