Pretraining a MoE Transformer on an RTX 3060
Introduction
If you want to take part in the event but have not done any pretraining yet, following this post should give you something that can at least output text in about three days.
I basically followed the article below, but I fixed some parts that did not work and added a few customizations before running the pretraining.
Since this covers Transformer pretraining only, what you end up with is a base model.
Instruction tuning, preference tuning, tokenizer training, and so on are not covered. Please keep that in mind.
Environment
OS: Ubuntu 20.04
GPU: RTX 3060 12GB
Memory: 64GB (only about 12GB is actually used)
SSD: at least 200-300GB of free space recommended
Python: 3.11
Pretrained model, dataset, and other details
This section simply lists the facts.
If you only want to know how to run things, feel free to skip ahead.
Model features
- Uses a small Transformer of roughly 400M parameters
- Uses RoPE for the positional embeddings
- Uses Mixture of Experts (MoE)
- Uses shared experts, as in DeepSeekMoE
- Uses Grouped Query Attention (GQA) to reduce memory
- Uses SwiGLU as the activation function (a minimal sketch follows this list)
- Supports a KV cache at inference time
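As a point of reference, a SwiGLU feed-forward block can be written roughly as follows. This is my own minimal sketch (the class and variable names are mine, not necessarily the repository's); it also shows why each expert in the module dump further below consists of three Linear layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Minimal SwiGLU feed-forward block: w2( silu(w1(x)) * w3(x) )."""
    def __init__(self, dim: int = 512, hidden_dim: int = 2048):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))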
Model parameters
- Actual parameter count
  - 418.5M
- Trainable parameters
  - 418.5M
- Active parameters
  - 191.93M
Model structure
- Number of MHA heads
  - 16
- Number of GQA groups
  - 4
- MoE
  - Experts: 4
  - Shared experts: 1
  - Active: 2
    - One routed expert plus the shared expert, so two experts are used per token
The full module structure is shown below; a rough sketch of the MoE routing follows it.
Transformer(
(rotary_emb): Rotary()
(tokens_embedding): Embedding(49152, 512)
(blocks): ModuleList(
(0-31): 32 x Block(
(attention): GroupedQueryAttention(
(wq): Linear(in_features=512, out_features=512, bias=False)
(wk): Linear(in_features=512, out_features=128, bias=False)
(wv): Linear(in_features=512, out_features=128, bias=False)
(wo): Linear(in_features=512, out_features=512, bias=False)
)
(ffn): FFNwMoE(
(router): Linear(in_features=512, out_features=4, bias=False)
(experts): ModuleList(
(0-3): 4 x ModuleList(
(0): Linear(in_features=512, out_features=2048, bias=False)
(1): Linear(in_features=2048, out_features=512, bias=False)
(2): Linear(in_features=512, out_features=2048, bias=False)
)
)
(shared_experts): ModuleList(
(0): ModuleList(
(0): Linear(in_features=512, out_features=2048, bias=False)
(1): Linear(in_features=2048, out_features=512, bias=False)
(2): Linear(in_features=512, out_features=2048, bias=False)
)
)
)
(norm_attention): RMSNorm((512,), eps=1e-06, elementwise_affine=True)
(norm_ffn): RMSNorm((512,), eps=1e-06, elementwise_affine=True)
)
)
(norm): RMSNorm((512,), eps=1e-06, elementwise_affine=True)
(ll_head): Linear(in_features=512, out_features=49152, bias=False)
)
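To make the MoE part concrete, here is a rough sketch of how top-1 routing plus a shared expert could be combined, reusing the SwiGLUFFN sketch from the features section. This is an assumption about the general shape of the computation, not the repository's actual implementation (details such as the auxiliary load-balancing loss are omitted).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFNSketch(nn.Module):
    """Top-1 routed expert plus 1 shared expert per token (illustrative only)."""
    def __init__(self, dim: int = 512, hidden_dim: int = 2048, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLUFFN(dim, hidden_dim) for _ in range(num_experts))
        self.shared_expert = SwiGLUFFN(dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.size(-1))                 # (num_tokens, dim)
        probs = F.softmax(self.router(tokens), dim=-1)     # (num_tokens, num_experts)
        weight, idx = probs.max(dim=-1)                    # top-1 expert per token
        routed = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                routed[mask] = weight[mask].unsqueeze(-1) * expert(tokens[mask])
        out = routed + self.shared_expert(tokens)          # the shared expert sees every token
        return out.view_as(x)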
Tokenizer
This time I did not train a tokenizer and simply reused an existing one.
I use HuggingFaceTB/SmolLM-360M, the same tokenizer used in the article I referenced.
If you want to pretrain on a Japanese dataset, I recommend switching to a different tokenizer.
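For reference, a quick way to check the tokenizer's vocabulary size and see how it encodes text (nothing here is specific to this repository):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-360M")
print(tokenizer.vocab_size)   # 49152, matching the embedding size in the module dump above
print(tokenizer("Friendship is one of the most important aspects of our lives.")["input_ids"])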
Dataset
I used HuggingFaceFW/fineweb-edu.
It is the dataset introduced in the original article, and at a glance it looked easy to work with, so I went with it.
The text is stored in the text column, and the data is split into per-folder files, both of which were convenient for this use case.
Using the entire dataset would kill the HDD in my PC, so I only used the data under the data/CC-MAIN-2025-26 folder.
Even so, the total comes to 7,186,815,488 tokens (about 7 billion tokens)...
On my home RTX 3060 that is nowhere near one full epoch in a few days, so I probably should have narrowed it down further.
(The results in this post come from training on roughly 20% of it, about 1.5 billion tokens.)
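If you want to pull down the same subset yourself, something along these lines should work with the datasets library; the data_files pattern matches the folder mentioned above (this is a sketch, not the repository's loading code).
from datasets import load_dataset

# Download only the CC-MAIN-2025-26 parquet files instead of the full fineweb-edu dataset
dataset = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    data_files="data/CC-MAIN-2025-26/*.parquet",
    split="train",
)
print(dataset[0]["text"][:200])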
Below are some of the documents and the token IDs produced by the tokenizer.
Contents of the dataset
Documents split into 512-token chunks (first 5 only)
Table of Contents
Friendship is one of the most important aspects of our lives. The deep yearning for a sense of belonging transcends all cultural, economic, and political borders. Our social support is right beside us throughout the different stages of our lives. Friends keep us grounded and help us remember what we value and want to achieve in life – even when things get tricky.
In this blog post, we will explore why friendship is important, the different types of friendships, how to cultivate and maintain them, the benefits of being a good friend, and tips on making new friends.
Why is Friendship Important?
We all know that having friends makes us happy, but did you know that friendships have a huge impact on your mental health and happiness? Research shows that friends bring more happiness into our lives than virtually anything else. Friends are actually even more important to our psychological welfare than finding that right person who we think will make us happy and fulfilled.
Mental Health Benefits
Friendships enrich our lives in many ways. Friends give us both practical and emotional support when we need it. As a result, there are many emotional and physical health benefits of friendships – the more people prioritize friendships, the happier and healthier they are.
According to a recent survey, Americans report having fewer friendships than they once did. And the more traditional ways of making friends (like school, church, or through existing friends) are on the decline. Americans are more likely to make new friends at work than any other avenue. Making friends and keeping those healthy relationships isn’t always a walk in the park. It can be hard to recognize why friendship is important, too. But sometimes we need to evaluate how many friendships we have, what we can actively do to strengthen them, and when to let them go.
Physical Health Benefits
Our friendships help our mental health and overall happiness. We build human connections in our professional lives and personal lives. And over time, those connections may grow. They connect us to our core values at work, when facing challenges, or during our daily life. The emotional support we receive from our close friends helps inspire us when life feels dull and provides encouragement to overcome challenges. Sociability is connected to a reduced risk of getting sick, and good friends improve your mental health. Friends make you feel safe and at home.
Types of Friendships
Not all friendships are created equal. There are different types of friendships that can serve different purposes in our lives. Here are some examples:
These are the people you can count on no matter what
==========================
Ever wonder what life was like before internet search engines like Google? How did people get answers to all-important questions like ‘what’s the average life expectancy of a frog’, or ‘is there a full moon every night in Acapulco?’ Many of them turned to the ‘ask a librarian’ phone service of the New York Public Library, and, believe it or not, some of them still do it today.
They are called the “human google”, a team of real-life people whose main purpose while on the job is to search 120-years worth of archives and provide answers to some of the strangest, most complicated questions available. Set up during the 1940s, when search engines and the internet weren’t even ideas yet, the Ask NYPL department is made up of nine librarians and information assistants who cater to the needs of people who don’t have access to modern technology or simply prefer to interact with another human being.
Read More »<|endoftext|><|endoftext|><|endoftext|> ... (the remainder of this 512-token chunk is filled with repeated <|endoftext|> tokens)
==========================
DEER FAMILY (CERVIDS)
The deer family includes deer, reindeer and elk. The largest deer are moose, which can weigh nearly a ton, and the smallest is the Chilean pudu, which is not much larger than a rabbit. Deer belong to the family “Cervidae”, which is part of the order “Artiodactyla” (even-toed hoofed mammals). “Cervidae” are similar to “Bovidae” (cattle, antelopes, sheep and goat) in that they chew the cud but differ in that have solid horns that are shed periodically (“Bovidae” have hollow ones).
A male deer is called a buck or stag. A female is called a doe. Young are called fawns. A group is called a herd. Deer don't hibernate and sometimes group together to stay warm. Particularly cold winters sometimes kill deer outright, mainly by robbing them of food, especially when a hard layer of ice and snow keeps them from getting at food.
The family Cervidae, commonly referred to as "the deer family", consists of 23 genera containing 47 species, and includes three subfamilies: Capriolinae (brocket deer, caribou, deer, moose, and relatives), Cervinae (elk, muntjacs, and tufted deer), and Hydropotinae, which contains only one extant species (Chinese water deer). According to Animal Diversity Web: However, classification of cervids has been controversial and a single well-supported phylogenetic and taxonomic history has yet to be established. Cervids range in mass from nine to 816 kilograms (20 to 1800 pounds), and all but one species, Chinese water deer, have antlers. With the exception of caribou, only males have antlers and some species with smaller antlers have enlarged upper canines. In addition to sexually dimorphic ornamentation, most deer species are size-dimorphic as well with males commonly being 25 percent larger than their female counterparts. [Source: Katie Holmes; Jessica Jenkins; Prashanth Mahalin; John Berini, Animal Diversity Web (ADW) /=]
Cervids have a large number of morphological Synapomorphies (characteristics that are shared within a taxonomic group), and range in color from dark to very light brown; however, young are commonly born with cryptic coloration, such as white spots, that helps camouflage them from potential predators.
==========================
There are mainly two types of materials used to make blades for almost all kitchen knives on the market. One is ceramic and the other is steel.
When comparing ceramic vs steel knives, it is important, to begin with, an understanding of the use, which one is more versatile and a better fit for your personal use.
In this article, I will try to explain in detail the differences between ceramic knives and stainless steel knives, which will help you to choose one of them.
What Is a Ceramic Knife?
The knives that are made with ceramic blades are called ceramic knives. Ceramic blades used in knives are mainly made from zirconium dioxide (ZrO2) which is a powder-type material.
What Is a Stainless Steel Knife?
All knives that are made with stainless steel blades are called stainless steel knives. In general, steel is made from carbon, iron, titanium. Steel with a chromium content of at least 10.5% is considered stainless steel.
This metal is widely used in making knives all over the world as it is strong, durable, and versatile.
A Detail Comparison Between Ceramic Knives and Stainless Steel Knives?
Both types of knives work for almost the same purpose. Both are used to make blades. However, when I analyze their practical aspects, I find many big comparable differences between them. Here I highlight those differences in detail. The differences are-
- Manufacturing Process
- Rust Resistance
1. Manufacturing Process Comparison
Ceramic Knife Manufacturing Process
Ceramic knives have a unique way of manufacturing; this precise method is key for an owner to understand to treat the knife accordingly.
To begin making the knife, there is a ceramic powder consists of a variety of minerals: boron, carbide, nitrate, or bauxite. The powder mixes with water and is placed into a mold to give it shape.
Keeping each knife uniform and precise is important for this work. The mold is placed under pressure and just like any ceramic pottery, it now has to be heated in a large oven called a kiln.
After that, it is taken out, sharpened, and ready to have a handle put on.
The hardness of ceramic knives is determined by changes in pressure and heat.
Higher quality ceramic knives, for example, have a sharper edge of the blade because the ceramic powder used is of stronger minerals and in the kiln for a longer time.
Overall, the process of making a ceramic knife teaches
==========================
Get ready to have an amazing time with this cool coloring page featuring Godzilla! Godzilla is the giant monster (also known as Kaiju in Japanese) that everyone loves. This legendary creature has appeared in tons of movies, TV shows, and comic books. Now, it’s your turn to bring Godzilla to life with your artistic skills!
Unique Features of the Coloring Page
The Godzilla coloring page is packed with awesome details that make it really special:
- Detailed Scales: Look at those intricate patterns on Godzilla’s skin! Each scale is drawn to make the monster look super realistic.
- Sharp Claws: Godzilla’s claws are huge and terrifying. They’re ready to scratch and roar!
- Roaring Stance: Godzilla’s mouth is open, showing off all those menacing teeth. You can almost hear the roar!
- Spiky Back: The spikes on Godzilla’s back are one of the most recognizable features. They look like they’re ready to glow with atomic power!
- Powerful Muscles: The strong arms and legs of Godzilla show just how powerful this creature is.
Symbolism and Cultural Significance
Godzilla is much more than just a monster. This character has a deep cultural significance, especially in Japan where it originated. Godzilla first appeared in Japanese movies in the 1950s as a symbol of nature’s power and the dangers of nuclear weapons. Over the years, Godzilla has become a beloved icon worldwide, representing themes like protection, destruction, and the balance of nature. Coloring this page gives you a chance to connect with this rich history!
Creative Coloring Techniques and Color Palette Ideas
Make your Godzilla masterpiece stand out with these creative coloring techniques and color suggestions:
- Shading: Use different shades of green and gray to add depth to Godzilla’s scales. Lighter shades can be used for highlights, and darker shades for shadows.
- Glowing Spines: Add a glowing effect to the spikes on Godzilla’s back. Use bright blues or even neon colors to make them look electrifying!
- Textured Skin: Use small, controlled strokes to mimic the texture of reptilian skin. This can bring a realistic feel to your Godzilla.
- Background Ideas: Create a dramatic scene by adding a city skyline, mountains, or even an ocean for Godzilla
Token IDs (first 5 only)
[tensor([ 8852, 282, 22157, 198, 40834, 974, 314, 582, 282, 260,
768, 1070, 3260, 282, 653, 2397, 30, 378, 2276, 46697,
327, 253, 2588, 282, 10396, 26945, 511, 2642, 28, 2536,
28, 284, 2356, 10036, 30, 3698, 1329, 1199, 314, 1048,
19032, 468, 2126, 260, 896, 5933, 282, 653, 2397, 30,
14766, 1446, 468, 15141, 284, 724, 468, 2915, 732, 392,
1685, 284, 1277, 288, 3025, 281, 1029, 816, 908, 645,
1495, 820, 17047, 30, 198, 788, 451, 5862, 1681, 28,
392, 523, 2217, 1701, 11414, 314, 1070, 28, 260, 896,
1995, 282, 16277, 28, 638, 288, 11115, 284, 2125, 601,
28, 260, 2624, 282, 1036, 253, 1123, 3289, 28, 284,
5608, 335, 1625, 725, 2428, 30, 198, 4898, 314, 39234,
18006, 47, 198, 1882, 511, 699, 338, 1953, 2428, 2022,
468, 5587, 28, 564, 1250, 346, 699, 338, 16277, 457,
253, 4776, 1645, 335, 469, 2898, 864, 284, 8906, 47,
2904, 2744, 338, 2428, 2635, 540, 8906, 618, 653, 2397,
670, 10210, 3534, 1745, 30, 14766, 359, 2390, 908, 540,
1070, 288, 653, 6478, 9365, 670, 4212, 338, 1048, 1055,
617, 392, 1510, 523, 919, 468, 5587, 284, 21257, 30,
198, 31251, 2379, 14721, 198, 40834, 4452, 10816, 653, 2397,
281, 800, 1853, 30, 14766, 1928, 468, 1062, 4786, 284,
4122, 1199, 645, 392, 737, 357, 30, 1032, 253, 966,
28, 665, 359, 800, 4122, 284, 2099, 864, 2624, 282,
16277, 816, 260, 540, 701, 18537, 16277, 28, 260, 17929,
284, 9168, 502, 359, 30, 198, 5449, 288, 253, 2765,
4867, 28, 3913, 1378, 1953, 6694, 16277, 670, 502, 2268,
1250, 30, 1350, 260, 540, 2556, 1853, 282, 1625, 2428,
365, 2579, 1125, 28, 4538, 28, 355, 738, 3832, 2428,
25, 359, 335, 260, 6683, 30, 3913, 359, 540, 2003,
288, 919, 725, 2428, 418, 746, 670, 750, 550, 27728,
30, 11628, 2428, 284, 4825, 967, 2458, 3355, 3247, 417,
100, 1811, 253, 4549, 281, 260, 5949, 30, 657, 416,
325, 1759, 288, 5221, 1701, 11414, 314, 1070, 28, 1147,
30, 1249, 2406, 392, 737, 288, 6611, 638, 800, 16277,
392, 457, 28, 732, 392, 416, 8082, 536, 288, 5858,
601, 28, 284, 645, 288, 1303, 601, 685, 30, 198,
17832, 2379, 14721, 198, 5425, 16277, 724, 653, 2898, 864,
284, 3043, 8906, 30, 1046, 1235, 1205, 3962, 281, 653,
3544, 2397, 284, 2143, 2397, 30, 1350, 690, 655, 28,
967, 3962, 654, 1075, 30, 1069, 2084, 468, 288, 653,
4007, 2396, 418, 746, 28, 645, 6371, 2529, 28, 355,
981, 653, 2956, 1029, 30, 378, 4122, 1199, 392, 3796,
429, 653, 2800, 2428, 2311, 8368, 468, 645, 1029, 7422,
19724, 284, 2433, 17917, 288, 7666, 2529, 30, 4188, 1307,
314, 4395, 288, 253, 3954, 1621, 282, 2967, 6217, 28,
284, 1123, 2428, 1947, 469, 2898, 864, 30, 14766, 919,
346, 1407, 2991, 284, 418, 1478, 30, 198, 15804, 282,
14766, 4452, 198, 5884, 511, 16277, 359, 2312, 4037, 30,
1385, 359, 896, 1995, 282, 16277, 338, 416, 3809, 896,
5132, 281, 653, 2397, 30, 3726, 359, 634, 3480, 42,
198, 3902, 359, 260, 701, 346, 416, 985, 335, 787,
2631, 732]), tensor([28497, 4558, 732, 1029, 436, 702, 1092, 5594, 3179, 9396,
702, 5837, 47, 1073, 1250, 701, 820, 5360, 288, 511,
29, 24743, 2029, 702, 1551, 5588, 417, 99, 260, 3049,
1029, 19060, 282, 253, 20398, 7500, 355, 1551, 271, 665,
253, 2073, 6783, 897, 3163, 281, 3536, 512, 336, 1881,
24022, 3027, 282, 601, 4642, 288, 260, 1551, 1818, 253,
29463, 417, 5460, 2754, 282, 260, 1315, 2863, 4837, 6750,
28, 284, 28, 2875, 357, 355, 441, 28, 634, 282,
601, 1361, 536, 357, 1834, 30, 198, 4948, 359, 1217,
260, 619, 10621, 23833, 5023, 253, 2299, 282, 1345, 29,
3460, 701, 3449, 1085, 3446, 979, 335, 260, 2288, 314,
288, 3179, 216, 33, 34, 32, 29, 17907, 4048, 282,
18464, 284, 1538, 5360, 288, 634, 282, 260, 29484, 381,
28, 768, 6991, 2029, 1770, 30, 5427, 614, 981, 260,
216, 33, 41, 36, 32, 99, 28, 645, 3179, 9396,
284, 260, 5594, 11535, 417, 100, 908, 2821, 2408, 28,
260, 9430, 12182, 8677, 6151, 314, 1135, 614, 282, 7462,
30572, 284, 1096, 21424, 617, 10502, 288, 260, 1923, 282,
701, 617, 1326, 417, 100, 457, 1594, 288, 2148, 1835,
355, 2788, 4640, 288, 2298, 351, 1372, 1205, 1036, 30,
198, 6340, 2783, 17228, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0]), tensor([ 6279, 1754, 426, 4153, 47319, 365, 51, 1754, 6293, 67,
25, 198, 504, 10752, 1564, 2978, 10752, 28, 41161, 284,
30849, 30, 378, 3995, 10752, 359, 34380, 28, 527, 416,
9529, 3920, 253, 9241, 28, 284, 260, 13341, 314, 260,
34518, 273, 413, 101, 28, 527, 314, 441, 1083, 3227,
670, 253, 16024, 30, 31975, 4523, 288, 260, 1564, 619,
51, 684, 12963, 5023, 527, 314, 599, 282, 260, 1686,
619, 9393, 1371, 42933, 81, 573, 365, 9934, 29, 1141,
277, 41468, 16308, 10793, 595, 619, 51, 684, 12963, 573,
359, 1887, 288, 619, 50, 749, 12963, 573, 365, 83,
3537, 28, 1598, 299, 7875, 28, 11108, 284, 22913, 25,
281, 338, 502, 21976, 260, 265, 413, 564, 9094, 281,
338, 457, 4367, 24619, 338, 359, 7792, 19677, 11208, 50,
749, 12963, 573, 457, 17160, 2911, 595, 198, 49, 4672,
10752, 314, 1217, 253, 10330, 355, 42655, 30, 330, 4263,
314, 1217, 253, 536, 85, 30, 8124, 359, 1217, 3561,
45307, 30, 330, 1528, 314, 1217, 253, 19365, 30, 31975,
1326, 982, 29855, 368, 284, 2406, 1528, 1592, 288, 2951,
3091, 30, 36560, 4011, 21960, 2406, 5619, 10752, 24643, 28,
5630, 411, 3453, 7519, 601, 282, 1114, 28, 2117, 645,
253, 1759, 4070, 282, 3908, 284, 5806, 8211, 601, 429,
2967, 418, 1114, 30, 198, 504, 1564, 36987, 12963, 28,
4278, 5124, 288, 347, 476, 1195, 10752, 1564, 1002, 5956,
282, 216, 34, 35, 25079, 5085, 216, 36, 39, 1772,
28, 284, 2978, 1296, 840, 41857, 42, 7413, 431, 308,
32866, 365, 11699, 936, 305, 10752, 28, 43285, 28, 10752,
28, 34380, 28, 284, 11715, 643, 36987, 32866, 365, 299,
91, 28, 283, 2189, 90, 23293, 28, 284, 7269, 926,
277, 10752, 643, 284, 11486, 1561, 320, 32866, 28, 527,
3593, 805, 582, 28306, 1772, 365, 24925, 913, 10752, 595,
3959, 288, 11967, 18718, 5272, 42, 1423, 28, 9782, 282,
13437, 1319, 553, 719, 11385, 284, 253, 2244, 876, 29,
20376, 28190, 284, 32346, 1463, 553, 2408, 288, 325, 3251,
30, 36987, 1319, 1845, 281, 2389, 429, 7462, 288, 216,
40, 33, 38, 27336, 365, 34, 32, 288, 216, 33,
40, 32, 32, 8473, 643, 284, 511, 564, 582, 1772,
28, 4097, 913, 10752, 28, 457, 1598, 6130, 30, 1929,
260, 7768, 282, 43285, 28, 805, 8225, 457, 1598, 6130,
284, 634, 1772, 351, 3764, 1598, 6130, 457, 21096, 5060,
45231, 30, 533, 1706, 288, 16531, 3633, 48481, 27612, 309,
28, 768, 10752, 1772, 359, 2203, 29, 5039, 48481, 347,
876, 351, 8225, 4278, 1036, 216, 34, 37, 1870, 3227,
670, 480, 4263, 12346, 30, 933, 9227, 42, 42668, 26306,
43, 36481, 30726, 43, 2383, 1317, 11709, 9315, 13453, 43,
2322, 6079, 6538, 28, 11967, 18718, 5272, 365, 3907, 71,
25, 37225, 77, 198, 51, 684, 1319, 457, 253, 1507,
1230, 282, 25829, 16190, 512, 17195, 370, 365, 23206, 2659,
338, 359, 3600, 1127, 253, 32346, 1528, 643, 284, 1845,
281, 2380, 429, 3605, 288, 1035, 1420, 6354, 43, 2107,
28, 1805, 359, 4278, 3988, 351, 42290, 34620, 28, 715,
347, 2537, 9265, 28, 338, 2311, 33790, 601, 429, 1566,
11618, 30]), tensor([ 2122, 359, 5630, 827, 1995, 282, 2254, 804, 288, 919,
18696, 327, 2743, 511, 9902, 29020, 335, 260, 2342, 30,
1963, 314, 18912, 284, 260, 550, 314, 6752, 30, 198,
2427, 10348, 18912, 6375, 6752, 29020, 28, 357, 314, 1070,
28, 288, 1956, 351, 28, 354, 1725, 282, 260, 722,
28, 527, 582, 314, 540, 15154, 284, 253, 1365, 3230,
327, 469, 2143, 722, 30, 198, 788, 451, 2524, 28,
339, 523, 1576, 288, 3556, 281, 2202, 260, 3581, 826,
18912, 29020, 284, 20652, 6752, 29020, 28, 527, 523, 724,
346, 288, 3525, 582, 282, 601, 30, 198, 1780, 1431,
253, 42420, 286, 11802, 830, 47, 198, 504, 29020, 338,
359, 1135, 351, 18912, 18696, 359, 1217, 18912, 29020, 30,
42420, 286, 18696, 804, 281, 29020, 359, 5630, 1135, 429,
1892, 42459, 1460, 7600, 365, 74, 98, 63, 34, 25,
527, 314, 253, 9703, 29, 2058, 1376, 30, 198, 1780,
1431, 253, 44486, 1325, 21088, 11802, 830, 47, 198, 4518,
29020, 338, 359, 1135, 351, 20652, 6752, 18696, 359, 1217,
20652, 6752, 29020, 30, 533, 2210, 28, 6752, 314, 1135,
429, 3118, 28, 4688, 28, 27783, 30, 21088, 351, 253,
29988, 2627, 282, 418, 2124, 216, 33, 32, 30, 37,
21, 314, 2500, 20652, 6752, 30, 198, 1348, 4718, 314,
4889, 804, 281, 1625, 29020, 511, 690, 260, 905, 347,
357, 314, 1837, 28, 16770, 28, 284, 15154, 30, 198,
49, 6440, 636, 24508, 10692, 42420, 286, 11802, 898, 284,
44486, 1325, 21088, 11802, 898, 47, 198, 12857, 1995, 282,
29020, 746, 327, 2743, 260, 1142, 3446, 30, 5148, 359,
804, 288, 919, 18696, 30, 1423, 28, 645, 339, 6524,
480, 4786, 3260, 28, 339, 1042, 800, 2066, 13217, 3581,
826, 601, 30, 3726, 339, 5067, 967, 3581, 281, 2202,
30, 378, 3581, 359, 29, 198, 29, 23990, 8277, 198,
29, 36828, 20373, 198, 33, 30, 23990, 8277, 24508, 198,
51, 11999, 286, 11802, 830, 23990, 8277, 198, 51, 11999,
286, 29020, 457, 253, 2116, 970, 282, 6350, 43, 451,
8212, 1341, 314, 1646, 327, 354, 7615, 288, 1044, 288,
1208, 260, 16760, 9839, 30, 198, 2068, 1956, 1625, 260,
16760, 28, 665, 314, 253, 18912, 9703, 5956, 282, 253,
3175, 282, 7693, 42, 42164, 28, 49134, 28, 20922, 28,
355, 278, 13981, 663, 30, 378, 9703, 28602, 351, 913,
284, 314, 4294, 618, 253, 8329, 288, 1928, 357, 3040,
30, 198, 31269, 971, 16760, 8281, 284, 8212, 314, 1070,
327, 451, 746, 30, 378, 8329, 314, 4294, 656, 2456,
284, 915, 702, 750, 18912, 18720, 28, 357, 1209, 553,
288, 325, 12150, 281, 253, 1507, 14571, 1217, 253, 34813,
30, 198, 4582, 338, 28, 357, 314, 2473, 578, 28,
7858, 2302, 28, 284, 4421, 288, 457, 253, 5605, 1830,
335, 30, 198, 504, 26096, 282, 18912, 29020, 314, 5250,
411, 1971, 281, 2456, 284, 2817, 30, 198, 42877, 2174,
18912, 29020, 28, 327, 1183, 28, 457, 253, 44326, 5595,
282, 260, 16755, 975, 260, 18912, 9703, 804, 314, 282,
6184, 7693, 284, 281, 260, 34813, 327, 253, 2848, 655,
30, 198, 20832, 28, 260, 980, 282, 1625, 253, 18912,
16760, 10769]), tensor([ 6197, 4421, 288, 457, 354, 5485, 655, 351, 451, 3122,
16680, 3013, 9963, 1839, 106, 7712, 17, 1839, 106, 7712,
314, 260, 8455, 23251, 365, 9783, 1343, 347, 37957, 17334,
281, 4931, 25, 338, 2573, 14640, 30, 669, 18961, 12251,
553, 6254, 281, 8274, 282, 10026, 28, 7372, 2744, 28,
284, 13571, 2905, 30, 3569, 28, 357, 417, 99, 469,
1607, 288, 2635, 1839, 106, 7712, 288, 1029, 351, 469,
7540, 1954, 17, 198, 43102, 23308, 282, 260, 42954, 10151,
198, 504, 1839, 106, 7712, 16680, 3013, 314, 13448, 351,
16626, 3841, 338, 919, 357, 2159, 1767, 42, 198, 29,
39344, 48008, 42, 8432, 418, 967, 7564, 3077, 335, 1839,
106, 7712, 417, 99, 2574, 17, 3768, 3964, 314, 7141,
288, 919, 260, 23251, 1492, 2076, 11198, 30, 198, 29,
31381, 1661, 7776, 42, 1839, 106, 7712, 417, 99, 30749,
359, 4776, 284, 31679, 30, 1069, 417, 257, 4421, 288,
15774, 284, 651, 280, 17, 198, 29, 11849, 1807, 625,
524, 42, 1839, 106, 7712, 417, 99, 4388, 314, 1440,
28, 5369, 767, 511, 967, 1800, 4761, 4077, 30, 1206,
416, 2743, 4875, 260, 651, 280, 17, 198, 29, 1691,
1873, 105, 8279, 42, 378, 25289, 335, 1839, 106, 7712,
417, 99, 1056, 359, 582, 282, 260, 768, 26628, 3004,
30, 1069, 1492, 702, 502, 417, 257, 4421, 288, 14654,
351, 12000, 1149, 17, 198, 29, 40149, 3606, 3042, 42,
378, 1837, 6548, 284, 7225, 282, 1839, 106, 7712, 1138,
915, 638, 3269, 451, 12251, 314, 30, 198, 28308, 918,
284, 10351, 22507, 198, 14191, 106, 7712, 314, 1083, 540,
670, 915, 253, 23251, 30, 669, 1843, 553, 253, 2276,
2642, 5017, 28, 2117, 281, 3085, 837, 357, 12097, 30,
1839, 106, 7712, 808, 6254, 281, 4931, 10026, 281, 260,
216, 33, 41, 37, 32, 99, 347, 253, 3573, 282,
2177, 417, 99, 1149, 284, 260, 11919, 282, 5147, 7921,
30, 3636, 260, 929, 28, 1839, 106, 7712, 553, 1438,
253, 12059, 7847, 4969, 28, 7834, 5535, 702, 3542, 28,
8040, 28, 284, 260, 3630, 282, 2177, 30, 42954, 451,
3013, 3894, 346, 253, 4468, 288, 2084, 351, 451, 3428,
1463, 17, 198, 43837, 42954, 12011, 284, 7299, 4864, 7103,
19576, 198, 12847, 469, 1839, 106, 7712, 21720, 1481, 578,
351, 623, 4833, 16680, 2544, 284, 2380, 9851, 42, 198,
29, 1541, 8200, 42, 4076, 896, 18806, 282, 2654, 284,
11013, 288, 803, 5856, 288, 1839, 106, 7712, 417, 99,
10156, 30, 450, 6827, 18806, 416, 325, 804, 327, 9041,
28, 284, 17138, 18806, 327, 18778, 30, 198, 29, 452,
683, 274, 1691, 884, 42, 5419, 253, 26538, 977, 288,
260, 25289, 335, 1839, 106, 7712, 417, 99, 1056, 30,
4076, 5681, 22461, 355, 908, 20734, 4683, 288, 919, 601,
1492, 19892, 4318, 17, 198, 29, 9378, 1915, 16470, 42,
4076, 1165, 28, 5918, 16933, 288, 17347, 260, 9871, 282,
16133, 24587, 2574, 30, 669, 416, 2635, 253, 11198, 1407,
288, 469, 1839, 106, 7712, 30, 198, 29, 20744, 19576,
42, 7779, 253, 9184, 6621, 411, 4990, 253, 2240, 6376,
1311, 28, 8718, 28, 355, 908, 354, 5065, 327, 1839,
106, 7712])]
Repository
The training and inference code is available in the repository below.
The trained weights are available here:
Usage
Preparation
Environment setup
Set up the environment with the following command.
pip install -r requirements.txt
If flash-attn is available in your environment, additionally install it with:
pip install flash-attn --no-build-isolation
If you are using uv, comment out the flash-attn entry in pyproject.toml and you should then be able to install it with the following commands.
uv sync
uv add flash-attn --no-build-isolation
Pretraining
Running
Run train.py.
python train.py
While it runs, a checkpoint is saved to model_testing every 250 steps. On my PC that works out to roughly one checkpoint per hour.
Every 500 steps the model is also evaluated on the eval data, and the results are written to log/eval.txt.
I also modified the checkpoints so that they store the complete training state (not just the model weights, but also the optimizer, scheduler, learning rate, dataset index, and so on), so training can now be resumed from the middle of a run.
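As an illustration, saving and restoring that kind of full training state usually looks roughly like the following; the key names here are made up for the example, so check the repository for the actual format.
import torch

def save_checkpoint(path, model, optimizer, scheduler, global_step, dataset_idx):
    """Save everything needed to resume training mid-run, not just the model weights."""
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "global_step": global_step,
        "dataset_idx": dataset_idx,
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    """Restore the training state and return (global_step, dataset_idx) to continue from."""
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["global_step"], state["dataset_idx"]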
About the tokenizer
tokenizer_id = "HuggingFaceTB/SmolLM-360M"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
tokenizer.pad_token = tokenizer.eos_token
The tokenizer is HuggingFaceTB/SmolLM-360M. If you want to use a different tokenizer, change it here.
Training settings
train_config = TrainerConfig(
    vocab_size=tokenizer.vocab_size,
    num_epochs=4,
    use_ddp=False,
    use_moe=True,
    use_lossfreebalance=False,
    clean_cuda_cache=True,
    use_compile=True,  # PyTorch 2.0 optimization (torch.compile)
    use_dtype="bfloat16" if device == 'cuda' else "float32",
    seed=42,
    max_seq_len=512,
    batch_size=2,
    accumulation_steps=64,  # effective batch size = batch_size * accumulation_steps
    weight_decay=0.1,
    warmup_ratio=0.1,
    learning_rate=5e-4,
    betas=(0.90, 0.95),
    update_rate=1e-5,
    val_ratio=0.005,
    steps_for_eval=10000,  # more frequent evaluation
    eval_interval=500,  # shorter interval
    checkpoints_frequency=250,
    path_to_checkpoints="./model_testing",
    tokenized_dataset_path="HuggingFaceFW/fineweb-edu",
    sub_target_files="data/CC-MAIN-2025-26/*.parquet",
    eval_log_file="./log/eval.txt",
    continue_train=False,
    checkpoint_path='model_testing/model.checkpoint.epoch0_step16000_global16000.pt',
)
Setting TrainerConfig lets you configure the training run.
(Sorry, options that are self-explanatory or that normally do not need changing are not listed.)
- vocab_size
  - Vocabulary size; this determines the dimension of the output layer
  - Just set it to the tokenizer's vocabulary size
- num_epochs
  - How many passes over the dataset to make during training
- use_ddp
  - Setting for distributed training; I cannot verify it in my environment, so it is unused. It is a leftover from the original article
- use_moe
  - Whether to use Mixture of Experts
- use_lossfreebalance
  - Whether to balance the experts by a method other than an auxiliary loss
  - I set it to False here for simplicity, but if you care about accuracy it is probably better to set it to True
- max_seq_len
  - Maximum number of tokens per training sequence
  - The dataset is basically chunked to this token length
- batch_size
  - Number of sequences processed per forward pass
- accumulation_steps
  - Number of steps over which gradients are accumulated (gradient accumulation); see the sketch after this list
  - The more steps you accumulate, the closer the effect gets to training with a larger batch size, but the number of gradient updates decreases, so training takes longer
- val_ratio
  - Fraction of the dataset held out for validation when splitting into training and validation sets
- steps_for_eval
  - Number of steps evaluated during each validation run
- eval_interval
  - Number of training steps between validation runs on the validation data
- checkpoints_frequency
  - How often checkpoints are saved (in steps)
- tokenized_dataset_path
  - Name of the dataset repository on Hugging Face
  - Only simple datasets whose data is stored in a text column can be loaded
  - It was a hassle to generalize, so it is tailored to the dataset used here
- sub_target_files
  - If the dataset is huge it cannot even be downloaded, so this specifies a subset
- eval_log_file
  - File where validation results are written
- continue_train
  - Whether to resume training from an existing checkpoint
- checkpoint_path
  - Checkpoint used when resuming training
  - It stores not only the weights but all the intermediate state needed to resume training.
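Here is the gradient accumulation sketch referred to above: a simplified training loop showing how accumulation_steps turns small batches into one large effective batch. The function and variable names are mine, not the repository's.
import torch
import torch.nn.functional as F

def train_with_accumulation(model, train_loader, optimizer, scheduler, accumulation_steps=64):
    """Effective batch size = batch_size * accumulation_steps (illustrative loop only)."""
    optimizer.zero_grad()
    for step, batch in enumerate(train_loader):
        logits = model(batch["input_ids"])                         # (batch, seq, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), batch["labels"].view(-1))
        (loss / accumulation_steps).backward()                     # scale so accumulated grads average out
        if (step + 1) % accumulation_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()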
Model settings
config = ModelConfig(
    vocab_size=tokenizer.vocab_size,
    num_dims=512,
    num_heads=16,
    num_kv_heads=4,  # more efficient thanks to GQA
    num_layers=24,
    ffn_hidden_dims=512 * 4,
    rmsnorm_eps=1e-6,
    rope_theta=1e5,
    context_len=1024,
    use_cache=False,
    use_flash=True,  # if available
    use_moe=True,  # simple configuration
    moe_num_experts=4,
    moe_active_experts=1,
    moe_eps=1e-6,
    moe_aux_loss_coef=0.01,
    moe_shared_experts=1,
    use_lossfreebalance=False,
)
Changing ModelConfig lets you change the structure of the Transformer being trained.
- num_dims
  - Base dimension used in each layer (also the FFN output dimension)
- num_heads
  - Number of MHA heads
- num_kv_heads
  - Number of KV heads shared across the query heads
  - Because GQA reuses the same KV across several heads, this value is smaller than num_heads (a GQA sketch follows this list)
- num_layers
  - Number of Transformer blocks
- ffn_hidden_dims
  - Hidden dimension of the FFN
- rmsnorm_eps
  - The ε parameter used in RMSNorm
- rope_theta
  - The θ parameter used in RoPE
- context_len
  - Context size
- use_cache
  - Whether to use a KV cache
  - I was not able to get it working together with Flash Attention (described below); that one is on me
- use_flash
  - Whether to use Flash Attention
  - Only works on machines with a GPU; it does not work on a Mac, for example
- use_moe
  - Whether to use MoE
- moe_num_experts
  - Number of experts in the MoE (excluding the shared ones)
- moe_active_experts
  - Number of experts that are actually active per token (excluding the shared ones)
- moe_shared_experts
  - Number of experts that are active for every token (shared experts)
- use_lossfreebalance
  - Whether to use the Auxiliary-Loss-Free Load Balancing Strategy
  - If this is not used, an auxiliary loss is added to even out the load across experts
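Here is the GQA sketch mentioned above: 16 query heads sharing 4 key/value heads, which is exactly why wk and wv in the module dump project 512 down to 128. This is my own illustrative code (RoPE and the KV cache are omitted), not the repository's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQASketch(nn.Module):
    """Grouped Query Attention: each KV head serves num_heads // num_kv_heads query heads."""
    def __init__(self, dim: int = 512, num_heads: int = 16, num_kv_heads: int = 4):
        super().__init__()
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.head_dim = dim // num_heads                                    # 32
        self.wq = nn.Linear(dim, num_heads * self.head_dim, bias=False)     # 512 -> 512
        self.wk = nn.Linear(dim, num_kv_heads * self.head_dim, bias=False)  # 512 -> 128
        self.wv = nn.Linear(dim, num_kv_heads * self.head_dim, bias=False)  # 512 -> 128
        self.wo = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # duplicate each KV head so that every group of query heads attends to the same K/V
        k = k.repeat_interleave(self.num_heads // self.num_kv_heads, dim=1)
        v = v.repeat_interleave(self.num_heads // self.num_kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))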
About the data loader
data_loader = DataLoader(train_config, tokenizer=tokenizer, hf_split="train", cache = "./cache", use_cache=False)
This is where the dataset from the specified repository is downloaded, split, and tokenized.
The downloaded dataset itself ends up under ~/.cache/huggingface, but the results of the splitting and tokenization are stored separately under cache = "./cache".
Without this, those steps would run again every time training is resumed; the cache is there to avoid that.
(There is probably a better way to do this, but I was short on time, so this will do for now.)
Whether the cache is used is controlled by the use_cache argument.
On the first run there is no cache yet, so run with use_cache=False. Even with False, the cache data is still written to the cache = "./cache" folder.
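Conceptually, the caching works like the following sketch: tokenize once, save the result, and reload it on later runs. The function and file names here are hypothetical, not the repository's.
import os
import torch

def tokenize_with_cache(dataset, tokenizer, cache_path="./cache/tokenized.pt", use_cache=True):
    """Tokenize the dataset once and reuse the saved result on subsequent runs."""
    if use_cache and os.path.exists(cache_path):
        return torch.load(cache_path)
    token_ids = [tokenizer(example["text"])["input_ids"] for example in dataset]
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    torch.save(token_ids, cache_path)
    return token_ids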
Uploading to Hugging Face
Conversion
The checkpoint of the trained model can be used for inference as-is, but converting it so that it can be loaded easily through the Transformers AutoModel API and uploading it makes it easy for anyone to use.
Edit the following part of the code and then run convert_to_hf.py.
def main():
    # default checkpoint path (from train.py)
    default_checkpoint = "model_testing/model.checkpoint.epoch0_step16000_global16000.pt"
Running the following command saves the checkpoint as safetensors files in the hf_model folder.
python convert_to_hf.py
If you changed the model structure, update ModelConfig here as well.
Upload
Upload all of the converted files saved in the hf_model folder to Hugging Face.
Run the following command.
python upload_to_hub.py \
--model_dir ./hf_model \
--repo_name "username/repo_name" \
--token hf_xxxxxxxxxxxxxxxxxx
Set --repo_name to the Hugging Face repository you want to upload to.
Set --token to your Hugging Face access token. Do not forget to give the token write permission.
You can create a token from "Access Tokens" after clicking the icon in the top-right corner of the Hugging Face page.
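If you prefer not to use the helper script, uploading the folder directly with huggingface_hub also works, roughly like this (replace the repository name and token with your own):
from huggingface_hub import HfApi

api = HfApi(token="hf_xxxxxxxxxxxxxxxxxx")        # a token with write permission
api.create_repo("username/repo_name", exist_ok=True)
api.upload_folder(folder_path="./hf_model", repo_id="username/repo_name")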
Inference
Inference using a checkpoint file
To run inference with a locally saved checkpoint (for example model.checkpoint.epoch0_step16000_global16000.pt), run infer_generate.py.
python infer_generate.py
The following parameters can be set in the code (a rough sampling sketch follows the list):
- text
  - The input prompt
  - Since this is a base model, the output is a continuation of the prompt
- max_tokens
  - Maximum number of tokens to generate
- temperature
  - Softmax temperature; lower values make the output more deterministic
- top_k
  - Sample only from the k most likely tokens
- top_p
  - Nucleus sampling: sample from the smallest set of tokens whose cumulative probability exceeds p
- repetition_penalty
  - Penalty applied to repeated tokens
- use_cache
  - Whether to use the KV cache
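Here is the sampling sketch referred to above, showing how these parameters typically interact in a single decoding step. It is a simplified illustration, not the generate implementation in infer_generate.py.
import torch
import torch.nn.functional as F

def sample_next_token(logits, generated_ids,
                      temperature=0.7, top_k=50, top_p=0.9, repetition_penalty=1.2):
    """One decoding step: repetition penalty -> temperature -> top-k -> top-p -> sample.

    logits: 1D tensor of shape (vocab_size,); generated_ids: 1D tensor of token ids so far.
    """
    logits = logits.clone()
    # penalize tokens that have already been generated
    for tok in set(generated_ids.tolist()):
        logits[tok] = logits[tok] / repetition_penalty if logits[tok] > 0 else logits[tok] * repetition_penalty
    logits = logits / temperature
    # top-k: drop everything outside the k most likely tokens
    if top_k > 0:
        kth_value = torch.topk(logits, min(top_k, logits.size(-1))).values[-1]
        logits[logits < kth_value] = float("-inf")
    # top-p (nucleus): keep the smallest set of tokens whose cumulative probability exceeds top_p
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    remove = cum_probs > top_p
    remove[1:] = remove[:-1].clone()   # shift so the first token crossing the threshold survives
    remove[0] = False
    sorted_logits[remove] = float("-inf")
    logits.scatter_(0, sorted_idx, sorted_logits)
    return torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)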
Inference using the Hugging Face model
You can also run inference by pointing at the local hf_model folder or at the repository you uploaded to Hugging Face.
python hf_inference.py
Before running, set the repository name and the prompt in the code, and pass parameters such as temperature to the generate method.
The configurable parameters are the same as in infer_generate.py.
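If the converted model follows the standard Transformers layout, loading and generating looks roughly like this; whether trust_remote_code is actually needed depends on how the conversion registers the architecture, so treat this as a sketch.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "username/repo_name"          # or the local "./hf_model" folder
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)

inputs = tokenizer("I am Mike. I live in", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))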
Output
Here are the results at step 23,500 from an actual training run with train.py.
(This is the result of roughly four full days of training.)
Training progress
Validation loss at each evaluation step:
Global Step: 500, Epoch: 0, Step: 500, val_loss: 5.9492, norm: 0.8486, lr: 3.0425168129e-05, time: 12.41s, tok/s: 5269.2
Global Step: 1000, Epoch: 0, Step: 1000, val_loss: 5.3255, norm: 0.6868, lr: 3.5839507580e-05, time: 12.45s, tok/s: 5253.7
Global Step: 1500, Epoch: 0, Step: 1500, val_loss: 5.0049, norm: 1.1066, lr: 4.1253847031e-05, time: 12.49s, tok/s: 5237.9
Global Step: 2000, Epoch: 0, Step: 2000, val_loss: 4.7527, norm: 1.0224, lr: 4.6668186481e-05, time: 12.43s, tok/s: 5263.3
Global Step: 2500, Epoch: 0, Step: 2500, val_loss: 4.5338, norm: 0.7729, lr: 5.2240806507e-05, time: 12.36s, tok/s: 5291.6 | dataset idx: 13996382/14036749
Global Step: 3000, Epoch: 0, Step: 3000, val_loss: 4.3769, norm: 0.9872, lr: 5.7682437851e-05, time: 12.50s, tok/s: 5234.0 | dataset idx: 14016382/14036749
Global Step: 3500, Epoch: 0, Step: 3500, val_loss: 4.2353, norm: 0.7856, lr: 6.3124069195e-05, time: 12.41s, tok/s: 5271.1 | dataset idx: 14036382/14036749
Global Step: 4000, Epoch: 0, Step: 4000, val_loss: 4.1052, norm: 0.8533, lr: 6.8565700538e-05, time: 12.34s, tok/s: 5299.1 | dataset idx: 13986198/14036749
Global Step: 4500, Epoch: 0, Step: 4500, val_loss: 3.9782, norm: 0.6816, lr: 7.4007331882e-05, time: 12.49s, tok/s: 5236.2 | dataset idx: 14006198/14036749
Global Step: 5000, Epoch: 0, Step: 5000, val_loss: 3.8549, norm: 0.6257, lr: 7.9448963226e-05, time: 12.52s, tok/s: 5224.8 | dataset idx: 14026198/14036749
Global Step: 5500, Epoch: 0, Step: 5500, val_loss: 3.7511, norm: 0.6076, lr: 8.4890594570e-05, time: 12.65s, tok/s: 5172.4 | dataset idx: 13976014/14036749
Global Step: 6000, Epoch: 0, Step: 6000, val_loss: 3.6625, norm: 0.6298, lr: 9.0332225914e-05, time: 12.33s, tok/s: 5302.7 | dataset idx: 13996014/14036749
Global Step: 6500, Epoch: 0, Step: 6500, val_loss: 3.5878, norm: 0.5791, lr: 9.5773857257e-05, time: 12.45s, tok/s: 5255.2 | dataset idx: 14016014/14036749
Global Step: 7000, Epoch: 0, Step: 7000, val_loss: 3.5198, norm: 0.5710, lr: 1.0121548860e-04, time: 12.51s, tok/s: 5227.7 | dataset idx: 14036014/14036749
Global Step: 7500, Epoch: 0, Step: 7500, val_loss: 3.4485, norm: 0.5154, lr: 1.0665711995e-04, time: 12.58s, tok/s: 5200.7 | dataset idx: 13985830/14036749
Global Step: 8000, Epoch: 0, Step: 8000, val_loss: 3.4143, norm: 0.4877, lr: 1.1209875129e-04, time: 12.40s, tok/s: 5277.0 | dataset idx: 14005830/14036749
Global Step: 8500, Epoch: 0, Step: 8500, val_loss: 3.3639, norm: 0.5223, lr: 1.1754038263e-04, time: 12.39s, tok/s: 5279.2 | dataset idx: 14025830/14036749
Global Step: 9000, Epoch: 0, Step: 9000, val_loss: 3.3330, norm: 0.4476, lr: 1.2298201398e-04, time: 12.40s, tok/s: 5273.8 | dataset idx: 13975646/14036749
Global Step: 9500, Epoch: 0, Step: 9500, val_loss: 3.2958, norm: 0.4881, lr: 1.2842364532e-04, time: 12.48s, tok/s: 5239.5 | dataset idx: 13995646/14036749
Global Step: 10000, Epoch: 0, Step: 10000, val_loss: 3.2575, norm: 0.4186, lr: 1.3386527666e-04, time: 12.54s, tok/s: 5214.6 | dataset idx: 14015646/14036749
Global Step: 10500, Epoch: 0, Step: 10500, val_loss: 3.2171, norm: 0.4518, lr: 1.3930690801e-04, time: 12.40s, tok/s: 5274.5 | dataset idx: 14035646/14036749
Global Step: 11000, Epoch: 0, Step: 11000, val_loss: 3.2012, norm: 0.3990, lr: 1.4474853935e-04, time: 12.48s, tok/s: 5241.4 | dataset idx: 13985462/14036749
Global Step: 11500, Epoch: 0, Step: 11500, val_loss: 3.1675, norm: 0.4280, lr: 1.5019017070e-04, time: 12.32s, tok/s: 5307.0 | dataset idx: 14005462/14036749
Global Step: 12000, Epoch: 0, Step: 12000, val_loss: 3.1577, norm: 0.4015, lr: 1.5563180204e-04, time: 12.52s, tok/s: 5225.5 | dataset idx: 14025462/14036749
Global Step: 12500, Epoch: 0, Step: 12500, val_loss: 3.1262, norm: 0.4024, lr: 1.6107343338e-04, time: 12.45s, tok/s: 5253.0 | dataset idx: 13975278/14036749
Global Step: 13000, Epoch: 0, Step: 13000, val_loss: 3.1110, norm: 0.3950, lr: 1.6651506473e-04, time: 12.59s, tok/s: 5196.2 | dataset idx: 13995278/14036749
Global Step: 13500, Epoch: 0, Step: 13500, val_loss: 3.0869, norm: 0.3840, lr: 1.7195669607e-04, time: 11.69s, tok/s: 5595.7 | dataset idx: 14015278/14036749
Global Step: 14000, Epoch: 0, Step: 14000, val_loss: 3.0776, norm: 0.3832, lr: 1.7739832741e-04, time: 12.33s, tok/s: 5305.0 | dataset idx: 14035278/14036749
Global Step: 14500, Epoch: 0, Step: 14500, val_loss: 3.0565, norm: 0.3679, lr: 1.8283995876e-04, time: 12.53s, tok/s: 5218.8 | dataset idx: 13985094/14036749
Global Step: 15000, Epoch: 0, Step: 15000, val_loss: 3.0368, norm: 0.3729, lr: 1.8828159010e-04, time: 12.45s, tok/s: 5251.8 | dataset idx: 14005094/14036749
Global Step: 15500, Epoch: 0, Step: 15500, val_loss: 3.0335, norm: 0.3604, lr: 1.9372322145e-04, time: 12.41s, tok/s: 5269.1 | dataset idx: 14025094/14036749
Global Step: 16000, Epoch: 0, Step: 16000, val_loss: 3.0106, norm: 0.3694, lr: 1.9916485279e-04, time: 12.34s, tok/s: 5299.0 | dataset idx: 13974910/14036749
Global Step: 16500, Epoch: 0, Step: 16500, val_loss: 2.9915, norm: 0.3634, lr: 2.0460648413e-04, time: 11.50s, tok/s: 5688.1 | dataset idx: 13994910/14036749
Global Step: 17000, Epoch: 0, Step: 17000, val_loss: 2.9861, norm: 0.3599, lr: 2.1004811548e-04, time: 11.48s, tok/s: 5699.1 | dataset idx: 14014910/14036749
Global Step: 17500, Epoch: 0, Step: 17500, val_loss: 2.9890, norm: 0.3543, lr: 2.1548974682e-04, time: 11.49s, tok/s: 5693.4 | dataset idx: 14034910/14036749
Global Step: 18000, Epoch: 0, Step: 18000, val_loss: 2.9757, norm: 0.3568, lr: 2.2093137816e-04, time: 11.58s, tok/s: 5646.0 | dataset idx: 13984726/14036749
Global Step: 18500, Epoch: 0, Step: 18500, val_loss: 2.9617, norm: 0.3656, lr: 2.2637300951e-04, time: 11.51s, tok/s: 5681.4 | dataset idx: 14004726/14036749
Global Step: 19000, Epoch: 0, Step: 19000, val_loss: 2.9480, norm: 0.3613, lr: 2.3181464085e-04, time: 11.50s, tok/s: 5688.7 | dataset idx: 14024726/14036749
Global Step: 19500, Epoch: 0, Step: 19500, val_loss: 2.9368, norm: 0.3493, lr: 2.3725627220e-04, time: 11.50s, tok/s: 5687.1 | dataset idx: 13974542/14036749
Global Step: 20000, Epoch: 0, Step: 20000, val_loss: 2.9310, norm: 0.3604, lr: 2.4269790354e-04, time: 11.54s, tok/s: 5668.6 | dataset idx: 13994542/14036749
Global Step: 20500, Epoch: 0, Step: 20500, val_loss: 2.9194, norm: 0.3619, lr: 2.4813953488e-04, time: 11.66s, tok/s: 5609.6 | dataset idx: 14014542/14036749
Global Step: 21000, Epoch: 0, Step: 21000, val_loss: 2.9157, norm: 0.3612, lr: 2.5358116623e-04, time: 11.63s, tok/s: 5624.5 | dataset idx: 14034542/14036749
Global Step: 21500, Epoch: 0, Step: 21500, val_loss: 2.8953, norm: 0.3504, lr: 2.5902279757e-04, time: 11.58s, tok/s: 5650.7 | dataset idx: 13984358/14036749
Global Step: 22000, Epoch: 0, Step: 22000, val_loss: 2.9018, norm: 0.3499, lr: 2.6446442892e-04, time: 11.60s, tok/s: 5638.1 | dataset idx: 14004358/14036749
Global Step: 22500, Epoch: 0, Step: 22500, val_loss: 2.8960, norm: 0.3700, lr: 2.6990606026e-04, time: 11.51s, tok/s: 5682.4 | dataset idx: 14024358/14036749
Global Step: 23000, Epoch: 0, Step: 23000, val_loss: 2.8937, norm: 0.3620, lr: 2.7534769160e-04, time: 11.53s, tok/s: 5673.2 | dataset idx: 13974174/14036749
Global Step: 23500, Epoch: 0, Step: 23500, val_loss: 2.8894, norm: 0.3713, lr: 2.8078932295e-04, time: 11.75s, tok/s: 5567.9 | dataset idx: 13994174/14036749
At step 23,500 the model has seen roughly 21.5% of the whole dataset.
In terms of tokens, that is about 1.54B tokens (roughly 1.5 billion) of training.
Generation settings
output_ids = model.generate(
    input_ids,
    max_tokens=100,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
    use_cache=True,
)
prompt
text = "I am Mike. I live in"
Generated output
The output below was generated on a Mac CPU, so it takes a little while.
(For reference, with the KV cache turned off it takes about 12 seconds.)
⏱️ Generation took 2.37 seconds
==================================================
📄 Generated Text:
==================================================
I'm Mike. I live in a small town on the outskirts of New Zealand, and I've never seen anyone on my phone.
As a child I'd been told to "look at me," but now I think that there's no way she can tell me about what is happening around them or how they are affecting others.
It turns out we have a lot more to do with our actions than we realize. It seems that some people don't think of themselves as being "bad" because of their behavior - while a
==================================================
As you can see, the text is all over the place, but it does generate something that vaguely reads like sentences.
Training for longer might improve the results.
Summary
Thank you for reading.
It was fun to work with PyTorch in earnest again after a long time.
I understand the overall model structure, but I do not fully understand the finer details such as the loss function and the learning-rate schedule, so I plan to study them a bit more.