Comparing the Gemma tokenizer with other tokenizers

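!pip install -U transformers
!huggingface-cli login

from transformers import AutoTokenizer

# Llama 2 and Gemma are gated repositories, so the login above is required.
tokenizer_llama2 = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer_gemma = AutoTokenizer.from_pretrained("google/gemma-7b-it")
tokenizer_llmjp = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v1.0")
# tokenizer_elyza is used below but was never loaded in the original notes.
# The model ID here is an assumption, chosen because its vocabulary size matches the 45043 reported below.
tokenizer_elyza = AutoTokenizer.from_pretrained("elyza/ELYZA-japanese-Llama-2-7b-fast")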

vocab

print('llama2: ', tokenizer_llama2.vocab_size)
print('gemma:  ', tokenizer_gemma.vocab_size)
print('llm-jp: ', tokenizer_llmjp.vocab_size)
print('elyza:  ', tokenizer_elyza.vocab_size)

llama2: 32000
gemma: 256000
llm-jp: 50570
elyza: 45043

if001

Token count comparison: Japanese

text_ja="""Gemma は、Gemini モデルの作成に使用されたのと同じ研究とテクノロジーから構築された、軽量で最先端のオープンモデルのファミリーです。Google DeepMind と Google の他のチームによって開発された Gemma は、ラテン語で「宝石」を意味する「gemma」にちなんで名付けられました。Gemma モデルの重みは、イノベーション、コラボレーション、AI の責任ある使用を促進するデベロッパー ツールによってサポートされています。
Gemma モデルは、アプリケーションだけでなく、ハードウェア、モバイル デバイス、ホスト型サービスで実行できます。また、チューニング手法を使用してこれらのモデルをカスタマイズし、デベロッパーやユーザーにとって重要なタスクを効果的に実行できるようにすることもできます。Gemma モデルは、Gemini モデル ファミリーからインスピレーションと技術的リネージを引き出し、AI 開発コミュニティが拡張と進化を行えるように作成されています。
Gemma モデルはテキスト生成に使用できますが、特定のタスクの実行に特化したようにチューニングすることもできます。チューニングされた Gemma モデルは、ターゲットを絞った効率的な生成 AI ソリューションをユーザーに提供できます。LoRA によるチューニングに関するガイドを確認し ぜひお試しくださいGemma をぜひご活用ください。
このデベロッパー向けドキュメントでは、利用可能な Gemma モデルと開発ガイドの概要を示し、特定のアプリケーションに合わせて Gemma モデルを適用してチューニングする方法について説明します。
"""

print('llama2: ', len(tokenizer_llama2(text_ja)['input_ids']))
print('gemma : ', len(tokenizer_gemma(text_ja)['input_ids']))
print('llm-jp: ', len(tokenizer_llmjp(text_ja)['input_ids']))
print('elyza:  ', len(tokenizer_elyza(text_ja)['input_ids']))

llama2: 700
gemma : 295
llm-jp: 363
elyza: 400
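
The four print calls above follow the same pattern, so they could be folded into a loop. The following is a minimal sketch, assuming each tokenizer is a callable that returns a mapping with an `input_ids` list, as Hugging Face tokenizers do. The helper name `count_tokens` and the stub tokenizers are hypothetical; the stubs only stand in for the real tokenizers so that the snippet runs offline.

```python
def count_tokens(tokenizers, text):
    """Return {name: number of input_ids} for each tokenizer applied to text."""
    return {name: len(tok(text)["input_ids"]) for name, tok in tokenizers.items()}


# Hypothetical stand-ins so the sketch runs without downloading any model.
# With the real objects this would be {"llama2": tokenizer_llama2, ...}.
stub_tokenizers = {
    "char-level": lambda text: {"input_ids": list(text)},
    "whitespace": lambda text: {"input_ids": text.split()},
}

sample = "Gemma は 軽量 な モデル です"
for name, n in count_tokens(stub_tokenizers, sample).items():
    # Characters per token: higher means the text is encoded more compactly.
    print(f"{name}: {n} tokens, {len(sample) / n:.2f} chars/token")
```

Note that counting `input_ids` this way may include special tokens that a tokenizer adds by default, such as a BOS token, so the figures can be slightly higher than the number of pieces in the text itself.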


Token count comparison: English

text_en="""Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. Developed by Google DeepMind and other teams across Google, Gemma is named after the Latin gemma, meaning "precious stone." The Gemma model weights are supported by developer tools that promote innovation, collaboration, and the responsible use of artificial intelligence (AI).

The Gemma models are available to run in your applications and on your hardware, mobile devices, or hosted services. You can also customize these models using tuning techniques so that they excel at performing tasks that matter to you and your users. Gemma models draw inspiration and technological lineage from the Gemini family of models, and are made for the AI development community to extend and take further.

You can use Gemma models for text generation, however you can also tune these models to specialize in performing specific tasks. Tuned Gemma models can provide you and your users with more targeted and efficient generative AI solutions. Check out our guide on tuning with LoRA and try it out! We are excited to see what you build with Gemma!

This developer documentation provides an overview of the available Gemma models and development guides for how to apply them and tune them for specific applications.
"""

print('llama2: ', len(tokenizer_llama2(text_en)['input_ids']))
print('gemma : ', len(tokenizer_gemma(text_en)['input_ids']))
print('llm-jp: ', len(tokenizer_llmjp(text_en)['input_ids']))
print('elyza:  ', len(tokenizer_elyza(text_en)['input_ids']))

llama2: 286
gemma : 252
llm-jp: 287
elyza: 286
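
To make the two sets of results easier to compare, the reported counts can be turned into a Japanese-to-English ratio per tokenizer. This is only a rough sketch: the two passages are translations of each other rather than identical content, and the function name `ja_en_ratio` is a hypothetical helper introduced for illustration. The numbers are the ones reported above.

```python
# Token counts reported above for the same passage in Japanese and English.
counts_ja = {"llama2": 700, "gemma": 295, "llm-jp": 363, "elyza": 400}
counts_en = {"llama2": 286, "gemma": 252, "llm-jp": 287, "elyza": 286}


def ja_en_ratio(ja, en):
    """Return {name: Japanese tokens / English tokens}, rounded to 2 places."""
    return {name: round(ja[name] / en[name], 2) for name in ja}


ratios = ja_en_ratio(counts_ja, counts_en)
# Print from the smallest ratio (least penalty for Japanese) to the largest.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.2f}x")
```

With these figures, llama2 needs about 2.45 times as many tokens for the Japanese passage as for the English one, while gemma stays close to 1.17 times.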


llama2

['▁Gem', 'ma', '▁', 'は', '、', 'G', 'em', 'ini', '▁', 'モ', 'デ', 'ル', 'の', '作', '成', 'に', '使', '用', 'さ', 'れ', 'た', 'の', 'と', '同', 'じ', '研', '究', 'と', 'テ', 'ク', 'ノ', 'ロ', 'ジ', 'ー', 'か', 'ら', '<0xE6>', '<0xA7>', '<0x8B>', '<0xE7>', '<0xAF>', '<0x89>', 'さ', 'れ', 'た', '、', '<0xE8>', '<0xBB>', '<0xBD>', '量', 'で', '最', '先', '<0xE7>', '<0xAB>', '<0xAF>', 'の', 'オ', 'ー', 'プ', 'ン', 'モ', 'デ', 'ル', 'の', 'フ', 'ァ', 'ミ', 'リ', 'ー', 'で', 'す', '。', 'Google', '▁Deep', 'M', 'ind', '▁', 'と', '▁Google', '▁', 'の', '他', 'の', 'チ', 'ー', 'ム', 'に', 'よ', 'っ', 'て', '開', '発', 'さ', 'れ', 'た', '▁Gem', 'ma', '▁', 'は', '、', 'ラ', 'テ', 'ン', '語', 'で', '「', '宝', '石', '」', 'を', '意', '<0xE5>', '<0x91>', '<0xB3>', 'す', 'る', '「', 'gem', 'ma', '」', 'に', 'ち', 'な', 'ん', 'で', '名', '付', 'け', 'ら', 'れ', 'ま', 'し', 'た', '。']

gemma

['Gemma', '▁は', '、', 'Gemini', '▁モデル', 'の作成', 'に使用', 'された', 'の', 'と同じ', '研究', 'と', 'テク', 'ノロ', 'ジー', 'から', '構築', 'された', '、', '軽量', 'で', '最', '先', '端', 'の', 'オープン', 'モデル', 'の', 'ファミリー', 'です', '。', 'Google', '▁Deep', 'Mind', '▁と', '▁Google', '▁の', '他の', 'チーム', 'によって', '開発', 'された', '▁Gemma', '▁は', '、', 'ラ', 'テン', '語', 'で', '「', '宝石', '」', 'を', '意味', 'する', '「', 'gem', 'ma', '」', 'に', 'ち', 'なんで', '名', '付け', 'られました', '。']

llm-jp

['▁Gem', 'ma', '▁', 'は', '、', 'Ge', 'mini', '▁', 'モデル', 'の', '作成', 'に', '使用', 'さ', 'れ', 'た', 'の', 'と', '同じ', '研究', 'と', 'テクノロジー', 'から', '構築', 'さ', 'れ', 'た', '、', '軽量', 'で', '最先端', 'の', 'オープン', 'モデル', 'の', 'ファミリー', 'です', '。', 'Google', '▁Deep', 'Min', 'd', '▁', 'と', '▁Google', '▁', 'の', '他', 'の', 'チーム', 'によって', '開発', 'さ', 'れ', 'た', '▁Gem', 'ma', '▁', 'は', '、', 'ラテン語', 'で', '「', '宝石', '」', 'を', '意味', 'する', '「', 'gemm', 'a', '」', 'に', 'ちなん', 'で', '名付け', 'られ', 'ま', 'した', '。']

elyza

['▁Gem', 'ma', '▁', 'は', '、', 'G', 'em', 'ini', '▁', 'モデ', 'ルの', '作成', 'に', '使用', 'された', 'の', 'と同じ', '研究', 'と', 'テ', 'ク', 'ノ', 'ロ', 'ジ', 'ー', 'から', '構', '築', 'された', '、', '軽', '量', 'で', '最', '先', '端', 'の', 'オープン', 'モデ', 'ルの', 'ファ', 'ミ', 'リー', 'です', '。', 'Google', '▁Deep', 'M', 'ind', '▁', 'と', '▁Google', '▁', 'の', '他の', 'チーム', 'によって', '開発', 'された', '▁Gem', 'ma', '▁', 'は', '、', 'ラ', 'テン', '語', 'で', '「', '宝', '石', '」', 'を', '意味', 'する', '「', 'gem', 'ma', '」', 'に', 'ちな', 'んで', '名', '付け', 'られ', 'ました', '。']
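
The `<0xNN>` entries in the llama2 list are byte-fallback tokens: a character missing from the vocabulary is emitted as its individual UTF-8 bytes, one token per byte. The sketch below merges such runs back into text to show which characters they stand for. The function name `merge_byte_fallback` is hypothetical, and the excerpt is copied from the llama2 list above.

```python
import re

# Matches a single byte-fallback token such as "<0xE6>".
BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def merge_byte_fallback(tokens):
    """Replace each run of <0xNN> tokens with the UTF-8 text it encodes."""
    merged, pending = [], bytearray()
    for token in tokens:
        match = BYTE_TOKEN.match(token)
        if match:
            # Collect the raw byte and keep going until the run ends.
            pending.append(int(match.group(1), 16))
            continue
        if pending:
            merged.append(pending.decode("utf-8", errors="replace"))
            pending = bytearray()
        merged.append(token)
    if pending:
        merged.append(pending.decode("utf-8", errors="replace"))
    return merged


# Excerpt from the llama2 token list above.
excerpt = ["か", "ら", "<0xE6>", "<0xA7>", "<0x8B>", "<0xE7>", "<0xAF>", "<0x89>", "さ", "れ", "た"]
print(merge_byte_fallback(excerpt))
```

Here six byte tokens cover just the two characters 構築, which is one reason the llama2 token count for Japanese is so much higher than the others.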


Algorithm

llama2: SentencePiece BPE
gemma: SentencePiece
llm-jp: SentencePiece Unigram byte-fallback
elyza: SentencePiece BPE


llama2

['▁Gem', 'ma', '▁is', '▁a', '▁family', '▁of', '▁light', 'weight', ',', '▁state', '-', 'of', '-', 'the', '-', 'art', '▁open', '▁models', '▁built', '▁from', '▁the', '▁same', '▁research', '▁and', '▁technology', '▁used', '▁to', '▁create', '▁the', '▁Gem', 'ini', '▁models', '.', '▁Develop', 'ed', '▁by', '▁Google', '▁Deep', 'M', 'ind', '▁and', '▁other', '▁teams', '▁across', '▁Google', ',', '▁Gem', 'ma', '▁is', '▁named', '▁after', '▁the', '▁Latin', '▁gem', 'ma', ',', '▁meaning', '▁"', 'pre', 'cious', '▁stone', '."']

gemma

['Gemma', '▁is', '▁a', '▁family', '▁of', '▁lightweight', ',', '▁state', '-', 'of', '-', 'the', '-', 'art', '▁open', '▁models', '▁built', '▁from', '▁the', '▁same', '▁research', '▁and', '▁technology', '▁used', '▁to', '▁create', '▁the', '▁Gemini', '▁models', '.', '▁Developed', '▁by', '▁Google', '▁Deep', 'Mind', '▁and', '▁other', '▁teams', '▁across', '▁Google', ',', '▁Gemma', '▁is', '▁named', '▁after', '▁the', '▁Latin', '▁gem', 'ma', ',', '▁meaning', '▁"', 'precious', '▁stone', '."']

llm-jp

['▁Gem', 'ma', '▁is', '▁a', '▁family', '▁of', '▁lightweight', ',', '▁state', '-', 'of', '-', 'the', '-', 'art', '▁open', '▁models', '▁built', '▁from', '▁the', '▁same', '▁research', '▁and', '▁technology', '▁used', '▁to', '▁create', '▁the', '▁Ge', 'mini', '▁models', '.', '▁', 'Develop', 'ed', '▁by', '▁Google', '▁Deep', 'Min', 'd', '▁and', '▁other', '▁teams', '▁across', '▁Google', ',', '▁Gem', 'ma', '▁is', '▁named', '▁after', '▁the', '▁Latin', '▁', 'gemm', 'a', ',', '▁meaning', '▁', '"', 'pre', 'cious', '▁', 'stone', '.']
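
In the lists above, `▁` is the SentencePiece marker for a space, so a token list can be turned back into text by joining the pieces and replacing the marker. A minimal sketch follows, with a hypothetical helper named `detokenize`; the pieces are copied from the start of the gemma and llama2 lists above.

```python
def detokenize(pieces):
    """Join SentencePiece-style pieces, turning the ▁ marker back into a space."""
    return "".join(pieces).replace("▁", " ")


gemma_pieces = ["Gemma", "▁is", "▁a", "▁family", "▁of", "▁lightweight", ","]
llama2_pieces = ["▁Gem", "ma", "▁is", "▁a", "▁family", "▁of", "▁light", "weight", ","]

print(repr(detokenize(gemma_pieces)))
print(repr(detokenize(llama2_pieces)))
print(len(gemma_pieces), "pieces vs", len(llama2_pieces), "pieces")
```

Both lists decode to the same words. The llama2 result starts with an extra space because its first piece is `▁Gem`, and it uses nine pieces where gemma uses seven because "Gemma" and "lightweight" are each split in two.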

This scrap was closed on 2024/02/22.
