🔠

List of Encodings for Each OpenAI Language Model

ryohtaka

Published 2023/03/09

Introduction

This article summarizes the encodings used by the language models available from OpenAI itself and from Azure OpenAI Service.

What is an encoding?

In OpenAI's language models, an encoding is essentially the set of rules by which text is converted into tokens (that is, tokenized). Different models use different encodings.

Types of encodings

The following four encodings exist.

Encoding	Comment
`o200k_base`	Used by `GPT-4o` and `GPT-4o mini`
`cl100k_base`	Used by models from the later `GPT-3.5` models (`gpt-3.5-turbo`) through `GPT-4 Turbo with Vision`
`p50k_base`	Used only by older models that are now deprecated
`r50k_base` (`gpt2`)	Used only by older models that are now deprecated

How to check

You can check the encoding with the Python package tiktoken. The example below checks the encoding for gpt-3.5-turbo.

Code

import tiktoken
print(tiktoken.encoding_for_model('gpt-3.5-turbo'))

Output

<Encoding 'cl100k_base'>


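Comparing encodings

Let's modify the sample code from openai-cookbook so that it can also display Unicode characters, and compare the tokenization results. Note that for some Unicode characters a single character is split across multiple tokens, so the individual tokens can no longer be decoded as UTF-8. Those tokens are shown as �.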

import tiktoken

def compare_encodings(example_string: str) -> None:
    """Prints a comparison of three string encodings."""
    print(f'\nExample string: "{example_string}"')
    for encoding_name in ["cl100k_base", "p50k_base", "r50k_base"]:
        encoding = tiktoken.get_encoding(encoding_name)
        token_integers = encoding.encode(example_string)
        num_tokens = len(token_integers)
        token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]
        token_utf8 = []
        for tb in token_bytes:
            try:
                s = tb.decode('utf-8')
            except UnicodeDecodeError:
                # Tokens that cannot be decoded as UTF-8 are shown as �
                s = "�"
            token_utf8.append(s)
        print()
        print(f"{encoding_name}: {num_tokens} tokens")
        print(f"token integers: {token_integers}")
        print(f"token strings: {token_utf8}")

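In p50k_base and r50k_base, each non-ASCII character (Japanese, in this example) becomes either one token or is split further into multiple tokens. In the example below, the hiragana 「ち」 is split into 2 tokens.
In contrast, cl100k_base seems to merge even non-ASCII text into a single token for some common phrases.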

compare_encodings("こんにちはOpenAI")

Result

Example string: "こんにちはOpenAI"

cl100k_base: 3 tokens
token integers: [90115, 5109, 15836]
token strings: ['こんにちは', 'Open', 'AI']

p50k_base: 8 tokens
token integers: [46036, 22174, 28618, 2515, 94, 31676, 11505, 20185]
token strings: ['こ', 'ん', 'に', '�', '�', 'は', 'Open', 'AI']

r50k_base: 8 tokens
token integers: [46036, 22174, 28618, 2515, 94, 31676, 11505, 20185]
token strings: ['こ', 'ん', 'に', '�', '�', 'は', 'Open', 'AI']
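Why does 「ち」 show up as two � in the result above? The tokens are byte sequences, and 「ち」 is 3 bytes in UTF-8. If those 3 bytes are divided between two tokens, neither piece is valid UTF-8 on its own. The sketch below reproduces this with the standard library only. The split position (2 bytes + 1 byte) is my assumption for illustration, and `decode_or_placeholder` is a hypothetical helper name.

```python
# UTF-8 bytes of the hiragana 「ち」 (U+3061): three bytes in total.
raw = "ち".encode("utf-8")
print(raw, len(raw))

# Assumed split for illustration: first two bytes in one token, last byte in another.
chunks = [raw[:2], raw[2:]]


def decode_or_placeholder(chunk: bytes) -> str:
    """Decode a byte chunk as UTF-8, or return � if it is not valid on its own."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError:
        return "�"


# Each chunk alone is an incomplete UTF-8 sequence, so both become �.
pieces = [decode_or_placeholder(c) for c in chunks]
print(pieces)

# Joining the bytes before decoding restores the original character.
restored = b"".join(chunks).decode("utf-8")
print(restored)
```

So no information is lost by the tokenizer. The � only appears because each token is decoded individually for display.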

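The difference between p50k_base and r50k_base appears to lie in how source code is handled. p50k_base merges indentation into a single token, for example, so its token count is slightly lower (42 vs. 46 tokens in the example below). The openai-cookbook mentioned above also explains it as follows.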

p50k_base overlaps substantially with r50k_base, and for non-code applications, they will usually give the same tokens.

msg = """
def print_message_with_exclamation(message):
    str = message + '!'
    print(str)

print_message_with_exclamation('hello world')
"""

compare_encodings(msg)

Result

Example string: "
def print_message_with_exclamation(message):
    str = message + '!'
    print(str)

print_message_with_exclamation('hello world')
"

cl100k_base: 29 tokens
token integers: [198, 755, 1194, 6598, 6753, 2769, 34084, 7483, 997, 262, 610, 284, 1984, 489, 364, 49827, 262, 1194, 4293, 696, 1374, 6598, 6753, 2769, 34084, 493, 15339, 1917, 1329]
token strings: ['\n', 'def', ' print', '_message', '_with', '_ex', 'clamation', '(message', '):\n', '   ', ' str', ' =', ' message', ' +', " '", "!'\n", '   ', ' print', '(str', ')\n\n', 'print', '_message', '_with', '_ex', 'clamation', "('", 'hello', ' world', "')\n"]

p50k_base: 42 tokens
token integers: [198, 4299, 3601, 62, 20500, 62, 4480, 62, 1069, 20931, 7, 20500, 2599, 198, 50258, 965, 796, 3275, 1343, 705, 13679, 198, 50258, 3601, 7, 2536, 8, 198, 198, 4798, 62, 20500, 62, 4480, 62, 1069, 20931, 10786, 31373, 995, 11537, 198]
token strings: ['\n', 'def', ' print', '_', 'message', '_', 'with', '_', 'ex', 'clamation', '(', 'message', '):', '\n', '   ', ' str', ' =', ' message', ' +', " '", "!'", '\n', '   ', ' print', '(', 'str', ')', '\n', '\n', 'print', '_', 'message', '_', 'with', '_', 'ex', 'clamation', "('", 'hello', ' world', "')", '\n']

r50k_base: 46 tokens
token integers: [198, 4299, 3601, 62, 20500, 62, 4480, 62, 1069, 20931, 7, 20500, 2599, 198, 220, 220, 220, 965, 796, 3275, 1343, 705, 13679, 198, 220, 220, 220, 3601, 7, 2536, 8, 198, 198, 4798, 62, 20500, 62, 4480, 62, 1069, 20931, 10786, 31373, 995, 11537, 198]
token strings: ['\n', 'def', ' print', '_', 'message', '_', 'with', '_', 'ex', 'clamation', '(', 'message', '):', '\n', ' ', ' ', ' ', ' str', ' =', ' message', ' +', " '", "!'", '\n', ' ', ' ', ' ', ' print', '(', 'str', ')', '\n', '\n', 'print', '_', 'message', '_', 'with', '_', 'ex', 'clamation', "('", 'hello', ' world', "')", '\n']
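To see where the 4-token difference comes from, the sketch below takes the two token lists printed above and collapses every run of three single-space tokens (220) in the r50k_base list into token 50258, which the p50k_base output shows as a three-space string. The helper `collapse_indent` and its parameters are my own names for illustration, not part of tiktoken, and this only replays the printed results. It does not describe how the tokenizer works internally.

```python
# Token integers copied from the results above.
R50K = [198, 4299, 3601, 62, 20500, 62, 4480, 62, 1069, 20931, 7, 20500, 2599, 198,
        220, 220, 220, 965, 796, 3275, 1343, 705, 13679, 198,
        220, 220, 220, 3601, 7, 2536, 8, 198, 198,
        4798, 62, 20500, 62, 4480, 62, 1069, 20931, 10786, 31373, 995, 11537, 198]
P50K = [198, 4299, 3601, 62, 20500, 62, 4480, 62, 1069, 20931, 7, 20500, 2599, 198,
        50258, 965, 796, 3275, 1343, 705, 13679, 198,
        50258, 3601, 7, 2536, 8, 198, 198,
        4798, 62, 20500, 62, 4480, 62, 1069, 20931, 10786, 31373, 995, 11537, 198]


def collapse_indent(tokens, space=220, run_len=3, merged=50258):
    """Replace each run of `run_len` consecutive `space` tokens with one `merged` token."""
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i:i + run_len] == [space] * run_len:
            out.append(merged)
            i += run_len
        else:
            out.append(tokens[i])
            i += 1
    return out


collapsed = collapse_indent(R50K)
print(len(R50K), len(collapsed), collapsed == P50K)
```

For this sample, the indentation is the only place where the two encodings differ: each of the two indented lines saves 2 tokens.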

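[Supplement] Metrics for estimating token counts for Japanese text

Using a reasonably large dataset (the Japanese portion of Wiki-40B), I measured how many characters per token and how many tokens per character each encoding actually yields. The results are below. The details are covered in a separate article, so please refer to it.

Metric	o200k_base	cl100k_base	p50k_base	r50k_base
Characters per token	1.2499	0.9170	0.7136	0.7135
Tokens per character	0.8001	1.0905	1.4013	1.4015

Reference

"OpenAI 言語モデルで日本語を扱う際のトークン数推定指標" (Metrics for estimating token counts when handling Japanese with OpenAI language models)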
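As a rough way to use these ratios, the sketch below multiplies the character count by the tokens-per-character value from the table and rounds up. The function name `estimate_tokens_ja` and the rounding rule are my own assumptions for illustration. The ratios are corpus-level averages, so treat the result as a ballpark figure rather than an exact count.

```python
import math

# Tokens per character for Japanese text, taken from the table above.
TOKENS_PER_CHAR_JA = {
    "o200k_base": 0.8001,
    "cl100k_base": 1.0905,
    "p50k_base": 1.4013,
    "r50k_base": 1.4015,
}


def estimate_tokens_ja(text: str, encoding_name: str = "cl100k_base") -> int:
    """Estimate the token count of Japanese text from its character count.

    Rounds up so that the estimate leans toward the safe side.
    """
    ratio = TOKENS_PER_CHAR_JA[encoding_name]
    return math.ceil(len(text) * ratio)


sample = "あ" * 100  # 100 Japanese characters, just as a length example
for name in TOKENS_PER_CHAR_JA:
    print(name, estimate_tokens_ja(sample, name))
```

Short strings can deviate a lot from the average. For example, 「こんにちはOpenAI」 in the comparison above is 11 characters but only 3 tokens in cl100k_base. When an exact count matters, counting with tiktoken directly is the safer option.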

Encodings by model

Mainstream models

As noted above, GPT-4o and GPT-4o mini use o200k_base, and all the other models listed here use cl100k_base.

Series	Model name used with tiktoken	Encoding
GPT-4o mini	gpt-4o-mini	o200k_base
GPT-4o	gpt-4o	o200k_base
GPT-4 Turbo with Vision	gpt-4-vision-preview	cl100k_base
GPT-4 Turbo	gpt-4-1106-preview	cl100k_base
GPT-4 Turbo	gpt-4-0125-preview	cl100k_base
GPT-4	gpt-4	cl100k_base
GPT-4	gpt-4-32k	cl100k_base
GPT-3.5	gpt-3.5-turbo	cl100k_base
GPT-3.5	gpt-35-turbo-16k	cl100k_base
GPT-3.5	gpt-35-turbo-instruct	cl100k_base
Embeddings V3	text-embedding-3-large	cl100k_base
Embeddings V3	text-embedding-3-small	cl100k_base
Embeddings V2	text-embedding-ada-002	cl100k_base


Deprecated models (legacy models)

These models use either p50k_base or r50k_base.

Series	Model name used with tiktoken	Encoding
Codex	code-davinci-002	p50k_base
Codex	code-cushman-001	p50k_base
GPT-3.5	text-davinci-003	p50k_base
GPT-3.5	text-davinci-002	p50k_base
GPT-3	text-curie-001	r50k_base
GPT-3	text-babbage-001	r50k_base
GPT-3	text-ada-001	r50k_base
GPT-3	davinci	r50k_base
GPT-3	curie	r50k_base
GPT-3	babbage	r50k_base
GPT-3	ada	r50k_base
Embeddings (Text Similarity)	text-similarity-davinci-001	r50k_base
Embeddings (Text Similarity)	text-similarity-babbage-001	r50k_base
Embeddings (Text Similarity)	text-similarity-ada-001	r50k_base
Embeddings (Text Search)	text-search-davinci-doc-001	r50k_base
Embeddings (Text Search)	text-search-davinci-query-001	r50k_base
Embeddings (Text Search)	text-search-curie-doc-001	r50k_base
Embeddings (Text Search)	text-search-curie-query-001	r50k_base
Embeddings (Text Search)	text-search-babbage-doc-001	r50k_base
Embeddings (Text Search)	text-search-babbage-query-001	r50k_base
Embeddings (Text Search)	text-search-ada-doc-001	r50k_base
Embeddings (Text Search)	text-search-ada-query-001	r50k_base
Embeddings (Code Search)	code-search-babbage-code-001	r50k_base
Embeddings (Code Search)	code-search-babbage-text-001	r50k_base
Embeddings (Code Search)	code-search-ada-code-001	r50k_base
Embeddings (Code Search)	code-search-ada-text-001	r50k_base


Conclusion

That's all. 🍵

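Microsoft (volunteers) Publication

Bringing you the latest technical information on Microsoft Azure and more. * This Publication reflects the personal views of employees of Microsoft Japan or Microsoft (US) and is not the official position of their organization. * Employees who wish to join the Publication should contact @07JP27.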
