Minitron-4B Series Can Also Be Fine-tuned with Unsloth
Hello! I'm Kurogane.
Recently I've found that smaller models are surprisingly capable and have started working with them, so here are some notes on fine-tuning one.
Minitron-4B-width Series
nvidia/Llama-3.1-Minitron-4B-Width-Base is a small model developed by NVIDIA.
It was created by pruning and distilling Llama-3.1-8B. According to the technical report, its GSM8K score is higher than that of nvidia/Minitron-4B-Base (41.2 for Width vs. 24.1 for Base).
In this project, I will use Magpie-Align/MagpieLM-4B-Chat-v0.1.
Also, to confirm that fine-tuning is working correctly, I will train it using the Alpaca format, which differs from its original chat format.
The trained model can be downloaded from here:
kurogane/MagpieLM-4B-Chat-v0.1-alpaca-cleaned-ja
[Note] Updates for transformers and accelerate are required
In my environment, the following updates were necessary to load the Minitron-4B-width series models:
- transformers-4.45.0.dev0
- accelerate-0.34.2
As of this writing (September 18), transformers had to be installed from source in order to load the model.
pip install git+https://github.com/huggingface/transformers
Additionally, I updated accelerate because the following error occurred during training with Unsloth:
AttributeError: 'AdamW' object has no attribute 'train'
According to https://github.com/QwenLM/Qwen2-VL/issues/163, the fix is to update LLaMA Factory and to update accelerate to 0.34.0.
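The version requirements above can be checked before starting a run. The following is only a sketch: the minimum versions are taken from this article, and the helper names (release_tuple, meets_minimum, check_environment) are made up for illustration. It deliberately ignores suffixes such as .dev0, so a source build like 4.45.0.dev0 is treated as 4.45.0.

```python
import re
from importlib import metadata

# Minimum versions mentioned in this article; adjust them for your own setup.
REQUIRED = {"transformers": "4.45.0", "accelerate": "0.34.0"}


def release_tuple(version: str) -> tuple:
    """Return the leading numeric part, e.g. '4.45.0.dev0' -> (4, 45, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = [int(part) for part in match.group(0).split(".")]
    # Pad short versions such as '4.45' so that tuple comparison behaves as expected.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed release is at least the minimum (suffixes ignored)."""
    return release_tuple(installed) >= release_tuple(minimum)


def check_environment() -> dict:
    """Map each package to True (ok), False (too old) or None (not installed)."""
    results = {}
    for name, minimum in REQUIRED.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            results[name] = None
            continue
        results[name] = meets_minimum(installed, minimum)
    return results


if __name__ == "__main__":
    for name, ok in check_environment().items():
        status = "missing" if ok is None else ("ok" if ok else "too old")
        print(f"{name}: {status}")
```

Running it in the training environment prints one status line per package.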
After these updates, you just need to replace the model_name with "Magpie-Align/MagpieLM-4B-Chat-v0.1" in the Unsloth notebook.
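As a rough sketch of that one-line change: the model name is the only value taken from this article. The other arguments (max_seq_length, dtype, load_in_4bit) are assumptions based on typical Unsloth notebook settings, and build_load_kwargs / load_model are hypothetical helper names.

```python
MODEL_NAME = "Magpie-Align/MagpieLM-4B-Chat-v0.1"


def build_load_kwargs(model_name: str = MODEL_NAME) -> dict:
    """Collect the arguments for loading the model with Unsloth."""
    return {
        "model_name": model_name,  # the only value changed from the notebook
        "max_seq_length": 2048,    # assumed setting, not confirmed by the article
        "dtype": None,             # assumed: let Unsloth pick the dtype
        "load_in_4bit": True,      # assumed setting, not confirmed by the article
    }


def load_model(model_name: str = MODEL_NAME):
    """Load the model and tokenizer; requires Unsloth and a CUDA GPU."""
    # Imported lazily so the rest of this file works without Unsloth installed.
    from unsloth import FastLanguageModel

    return FastLanguageModel.from_pretrained(**build_load_kwargs(model_name))
```

In a notebook this would be used as model, tokenizer = load_model().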
I ran training with the Unsloth notebook's default max_steps = 60.
The dataset used was also the default yahma/alpaca-cleaned.
It took about 1 minute and 23 seconds on an RTX A6000.
100%|██████████| 60/60 [01:23<00:00, 1.39s/it]
{'train_runtime': 83.4627, 'train_samples_per_second': 5.751, 'train_steps_per_second': 0.719, 'train_loss': 1.0135845323403676, 'epoch': 0.01}
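The log line above also reveals the effective batch size, even though the article does not list it. Here is a small worked calculation using only the logged numbers; the function names are just for illustration.

```python
# Values copied from the training log above.
train_log = {
    "train_runtime": 83.4627,
    "train_samples_per_second": 5.751,
    "train_steps_per_second": 0.719,
}
MAX_STEPS = 60


def samples_per_step(log: dict) -> int:
    """Effective batch size: samples processed per optimizer step."""
    return round(log["train_samples_per_second"] / log["train_steps_per_second"])


def seconds_per_step(log: dict, steps: int) -> float:
    """Average wall-clock time of one optimizer step."""
    return log["train_runtime"] / steps


effective_batch = samples_per_step(train_log)
total_samples = effective_batch * MAX_STEPS

print(f"samples per step : {effective_batch}")
print(f"seconds per step : {seconds_per_step(train_log, MAX_STEPS):.2f}")
print(f"samples seen     : {total_samples}")
```

This gives 8 samples per step and 480 samples in total, which is why the run covers only about 0.01 of an epoch. An effective batch of 8 would match, for example, a per-device batch size of 2 with 4 gradient accumulation steps, but that split is my assumption and is not stated in the log.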
Peak reserved memory was about 11.3 GB, so it might be possible to run this locally even on a 12 GB RTX 3060.
83.4627 seconds used for training.
1.39 minutes used for training.
Peak reserved memory = 11.314 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 23.808 %.
Peak reserved memory for training % of max memory = 0.0 %.
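To make the RTX 3060 estimate a bit more concrete, here is a short calculation based only on the numbers in the log above. The 12 GB figure is the nominal VRAM of that card and is an assumption; the memory actually usable is somewhat lower, and reserved memory also includes blocks the allocator is holding but not using, so treat the result as a rough guide.

```python
# Values copied from the memory statistics above.
PEAK_RESERVED_GB = 11.314
PEAK_RESERVED_PCT = 23.808


def implied_total_gb(peak_gb: float, peak_pct: float) -> float:
    """Total GPU memory implied by 'peak is pct % of max memory'."""
    return peak_gb / (peak_pct / 100.0)


def headroom_gb(peak_gb: float, card_gb: float) -> float:
    """Memory left on a card with card_gb of VRAM (negative means it does not fit)."""
    return card_gb - peak_gb


total = implied_total_gb(PEAK_RESERVED_GB, PEAK_RESERVED_PCT)
print(f"implied total memory     : {total:.2f} GB")
print(f"headroom on a 12 GB card : {headroom_gb(PEAK_RESERVED_GB, 12.0):.3f} GB")
```

The implied total is about 47.5 GB, which fits the A6000, and the headroom on a 12 GB card is under 0.7 GB. That margin is thin, so a smaller batch size may be needed in practice.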
Here is an example of training on the entire shi3z/alpaca_cleaned_ja_json dataset.
Since I didn't change the batch size, the VRAM usage remained the same.
7062.299 seconds used for training.
117.7 minutes used for training.
Peak reserved memory = 11.314 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 23.808 %.
Peak reserved memory for training % of max memory = 0.0 %.
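Switching datasets mainly comes down to turning each record into an Alpaca-style training text. The sketch below shows one way to do that. It assumes the dataset has instruction, input and output columns (I have not verified this for shi3z/alpaca_cleaned_ja_json), the template is copied from the inference code in this article, and format_examples is a hypothetical helper name.

```python
# Template as it appears in the inference code in this article.
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""


def format_examples(batch: dict, eos_token: str) -> dict:
    """Turn a batch of Alpaca-style records into training texts.

    batch is a dict of equal-length lists with the keys 'instruction',
    'input' and 'output' (assumed column names).
    """
    texts = []
    for instruction, input_text, output in zip(
        batch["instruction"], batch["input"], batch["output"]
    ):
        # Append the EOS token so the model learns where a response ends.
        texts.append(ALPACA_PROMPT.format(instruction, input_text, output) + eos_token)
    return {"text": texts}


if __name__ == "__main__":
    sample = {
        "instruction": ["Continue the Fibonacci sequence."],
        "input": ["1, 1, 2, 3, 5, 8"],
        "output": ["13, 21, 34"],
    }
    print(format_examples(sample, "<|end_of_text|>")["text"][0])
```

With the datasets library, a function like this is typically applied with dataset.map(..., batched=True), passing tokenizer.eos_token as the EOS token.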
(I wish I could do something about the A6000 hitting the 80°C range during training...)
Wed Sep 18 22:49:35 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02 Driver Version: 550.107.02 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:01:00.0 On | Off |
| 63% 84C P2 292W / 300W | 16083MiB / 49140MiB | 94% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX A2000 12GB Off | 00000000:02:00.0 Off | Off |
| 30% 33C P8 10W / 70W | 12MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 11918 G /usr/lib/xorg/Xorg 282MiB |
| 0 N/A N/A 12074 C+G ...libexec/gnome-remote-desktop-daemon 264MiB |
| 0 N/A N/A 12129 G /usr/bin/gnome-shell 224MiB |
| 0 N/A N/A 12547 G /usr/libexec/gnome-initial-setup 4MiB |
| 0 N/A N/A 13339 G ...yOnDemand --variations-seed-version 51MiB |
| 0 N/A N/A 13795 G ...irefox/4848/usr/lib/firefox/firefox 807MiB |
| 0 N/A N/A 13838 G ...erProcess --variations-seed-version 103MiB |
| 0 N/A N/A 53959 C ...e/anaconda3/envs/llmfact/bin/python 3374MiB |
| 0 N/A N/A 61570 C ...e/anaconda3/envs/llmfact/bin/python 4880MiB |
| 0 N/A N/A 2581221 G gnome-control-center 4MiB |
| 0 N/A N/A 2583312 C ...e/anaconda3/envs/llmfact/bin/python 6020MiB |
| 1 N/A N/A 11918 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
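To keep an eye on temperature without reading the full table each time, nvidia-smi can also print selected fields as CSV. The following is a minimal sketch that parses that output; the choice of fields and the helper names are mine, and the sample line in the usage note is reconstructed from the table above rather than captured from a real query.

```python
import subprocess

# Fields requested from nvidia-smi, in output order.
QUERY_FIELDS = ["temperature.gpu", "power.draw", "utilization.gpu", "memory.used"]


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line, e.g. '84, 292.00, 94, 16083', into a dict of floats."""
    values = [value.strip() for value in line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(f"expected {len(QUERY_FIELDS)} fields, got: {line!r}")
    # Note: a GPU that reports '[N/A]' for a field will raise ValueError here.
    return {field: float(value) for field, value in zip(QUERY_FIELDS, values)}


def read_gpus() -> list:
    """Query all GPUs once; requires nvidia-smi on the PATH."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    output = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in output.splitlines() if line.strip()]
```

For example, parse_gpu_line("84, 292.00, 94, 16083") returns the temperature in °C, power draw in W, utilization in % and memory used in MiB for the A6000 as shown above. Calling read_gpus() in a loop during training would show how the temperature develops.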
Trained Model
I have uploaded the model here: kurogane/MagpieLM-4B-Chat-v0.1-alpaca-cleaned-ja
You can verify the operation with the following code:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "kurogane/MagpieLM-4B-Chat-v0.1-alpaca-cleaned-ja"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
)

# Load model in half precision and move it to the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    # trust_remote_code=True,
)
model = model.to("cuda")

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""

# Build the prompt. The instruction is in Japanese because the model was
# fine-tuned on a Japanese dataset; in English it reads
# "Continue the following Fibonacci sequence."
prompt = alpaca_prompt.format(
    "以下のフィボナッチ数列の続きを書いてください。",  # instruction
    "1, 1, 2, 3, 5, 8",  # input
    "",  # output - leave this blank for generation!
)

inputs = tokenizer(
    [prompt],
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
It will produce results like this:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
以下のフィボナッチ数列の続きを書いてください。
### Input:
1, 1, 2, 3, 5, 8
### Response:
13、21、34、55、89、144、233、377、610、987
<|end_of_text|>
For comparison, here is what the response looked like before fine-tuning:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
以下のフィボナッチ数列の続きを書いてください。
### Input:
1, 1, 2, 3, 5, 8
### Response:
次の数列の続きは 13, 21, 34, 55, 89 です。
Answer: 次の数列の続きは 13, 21, 34, 55, 89 です。<|end_of_text|>
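You can see that after fine-tuning, the model follows the Alpaca format correctly: it returns only the continuation of the sequence, without the extra "Answer:" line that appeared before.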
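To compare the two outputs programmatically instead of by eye, the response part can be cut out of the decoded text. This is a minimal sketch, assuming the "### Response:" marker and the <|end_of_text|> token appear exactly as in the outputs above; extract_response is a hypothetical helper name.

```python
RESPONSE_MARKER = "### Response:"
EOS_TOKEN = "<|end_of_text|>"


def extract_response(decoded: str) -> str:
    """Return only the text generated after the response marker."""
    _, marker, tail = decoded.partition(RESPONSE_MARKER)
    if not marker:
        raise ValueError("response marker not found in decoded text")
    # Drop the end-of-text token and surrounding whitespace.
    return tail.replace(EOS_TOKEN, "").strip()


if __name__ == "__main__":
    # Tails of the two outputs shown above.
    after = "### Response:\n13、21、34、55、89、144、233、377、610、987\n<|end_of_text|>"
    before = (
        "### Response:\n次の数列の続きは 13, 21, 34, 55, 89 です。\n"
        "Answer: 次の数列の続きは 13, 21, 34, 55, 89 です。<|end_of_text|>"
    )
    print(extract_response(after))
    print(extract_response(before))
```

The fine-tuned output yields a single line, while the original model's output yields two lines, the second starting with "Answer:".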
Conclusion
That's all. Thanks for reading.
Since the original Minitron-4B-Width-Base itself was already quite capable, I was intrigued by MagpieLM-4B-Chat-v0.1, which improved performance using Magpie's synthetic dataset, so I gave fine-tuning a try.
That it took so little effort is really thanks to the flexibility of the libraries.
One caveat: MagpieLM-4B-Chat-v0.1 was trained on quite a mix of datasets, so its licensing is worth double-checking.
It falls under both the Meta Llama 3.1 Community License and the Gemma License.
Commercial use appears to be allowed, but various conditions may apply, so please check the terms before using it.
Anyway, that was Kurogane!
If you enjoyed this, please feel free to give a like, leave a comment, or follow!