Learning how to create synthetic data for LLMs with Distilabel

masakuri

Published on 2024/12/27

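Hello, I'm Kurihara from the data science team at ZENKIGEN. I currently work on research and development around natural language processing for harutaka EF (Entry Finder).
Our team runs an X account where we share information about AI, so please take a look if you're interested.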

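As LLMs (Large Language Models) have advanced, a growing range of tasks can now be solved accurately with them.
As part of this trend, LLMs are increasingly used for data creation and output evaluation, work that used to be done by hand at considerable cost.
In this article, we learn how to create synthetic data with Distilabel, a framework for efficiently creating synthetic data and evaluating outputs with LLMs.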

What is Distilabel?

https://distilabel.argilla.io/latest/

Distilabel is a framework for efficiently creating synthetic data and evaluating outputs.
Let's first look at how synthetic data creation and output evaluation work in Distilabel.

Setup

Distilabel can be installed with pip. Here we assume that models hosted on Hugging Face are used through Inference Endpoints, so we specify the hf-inference-endpoints extra.

pip install "distilabel[hf-inference-endpoints]" --upgrade

To use other models such as OpenAI's, you need to specify the corresponding dependencies (extras) in the * part of distilabel[*]. Multiple extras can be given separated by commas, as in distilabel[hf-inference-endpoints,openai].

Extras (LLMs) available as of 2024/12/23

anthropic
argilla
cohere
groq
hf-inference-endpoints
hf-transformers
litellm
llama-cpp
mistralai
ollama
openai
vertexai
vllm
sentence-transformers

Creating synthetic data

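Now let's look at an example of implementing and running synthetic data creation.
The following is the full sample implementation. The comments briefly explain what each part does.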

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(
    # Define a Pipeline named "simple-text-generation-pipeline" with a description
    # (the Japanese string means "a simple text generation pipeline").
    name="simple-text-generation-pipeline",
    description="シンプルなテキスト生成パイプライン",
) as pipeline:
    # All steps (Step subclasses) defined inside the context manager are automatically added to the pipeline.

    # LoadDataFromHub step: downloads the dataset from the Hugging Face Hub that is specified
    # through runtime parameters in pipeline.run below. The rows of the dataset are emitted
    # as output batches, and the prompt column is mapped to the instruction field.
    load_dataset = LoadDataFromHub(
        output_mappings={"prompt": "instruction"},
    )

    # TextGeneration task: generates text based on the instruction field of the dataset.
    # Here it uses the InferenceEndpointsLLM class with the Meta-Llama-3.1-8B-Instruct model.
    text_generation = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
        ),
        # System prompt: "You are a creative AI assistant writer."
        system_prompt="あなたは創造的なAIアシスタントライターです。",
        # Template: "Please follow the instruction below: {{ instruction }}"
        template="以下の指示に従ってください: {{ instruction }}"
    )

    # Connect the load_dataset step to the text_generation task with the rshift operator (>>),
    # so that the output of load_dataset is used as the input of text_generation.
    load_dataset >> text_generation

if __name__ == "__main__":
    distiset = pipeline.run(  # Run the pipeline, passing parameters for the load_dataset and text_generation steps.
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    distiset.save_to_disk("output")  # Save the generated Distiset under the output folder.
    # distiset.push_to_hub(repo_id="distilabel-example")  # Optionally, push the generated Distiset to a Hugging Face Hub repository.

The implementation above downloads a dataset from the Hugging Face Hub (distilabel-internal-testing/instruction-dataset-mini)^[1] and generates text with the Meta-Llama-3.1-8B-Instruct model (meta-llama/Meta-Llama-3.1-8B-Instruct).
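To make the `>>` connection in the sample less magical, here is a minimal plain-Python sketch of how a right-shift operator can wire steps together. The `ToyStep` class and its `successors` attribute are hypothetical names invented for this illustration. This is not how Distilabel itself is implemented, only the general operator-overloading idea.

```python
class ToyStep:
    """Hypothetical stand-in for a pipeline step (not a Distilabel class)."""

    def __init__(self, name):
        self.name = name
        self.successors = []  # steps that consume this step's output

    def __rshift__(self, other):
        # `a >> b` records b as a successor of a and returns b,
        # so chains such as `a >> b >> c` read left to right.
        self.successors.append(other)
        return other


load_dataset = ToyStep("load_dataset")
text_generation = ToyStep("text_generation")

load_dataset >> text_generation

print([step.name for step in load_dataset.successors])
```

Returning the right-hand operand from `__rshift__` is what allows several connections to be chained in a single expression.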

Another example: using an OpenAI model

Simply change text_generation to use an OpenAI model as follows.

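import os

from distilabel.llms import OpenAILLM

# Read the API key from the environment (assumes the OPENAI_API_KEY variable is set).
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

...(omitted)...

    text_generation = TextGeneration(
        llm=OpenAILLM(
            model="gpt-4o-mini",
            api_key=OPENAI_API_KEY
        ),
        system_prompt="あなたは創造的なAIアシスタントライターです。",
        template="以下の指示に従ってください: {{ instruction }}"
    )
...(omitted)...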

Running this^[2] saves the generated results under the output folder.

Let's check the saved data.
As shown below, each instruction is paired with generation, the text the LLM produced in response to it.

"instruction": "Joe Biden is the Nth president of the United States. What is N?",
"generation": "Joe Biden is the 46th president of the United States. Therefore, N = 46."

One way to inspect the data

You can load the contents of the output folder and inspect the results as follows.

from distilabel.distiset import Distiset
import json

distiset = Distiset.load_from_disk("output")
for train_d in distiset["default"]["train"]:
    print(json.dumps(train_d))

Example result:

{
  "instruction": "Joe Biden is the Nth president of the United States. What is N?",
  "completion": "46",
  "meta": {
    "category": "Commonsense/logic",
    "completion": "46",
    "id": 8,
    "input": null,
    "motivation_app": null,
    "prompt": "Joe Biden is the Nth president of the United States. What is N?",
    "source": "surge",
    "subcategory": "World knowledge"
  },
  "generation": "Joe Biden is the 46th president of the United States. Therefore, N = 46.",
  "distilabel_metadata": {
    "raw_input_text_generation_0": [
      {
        "content": "あなたは創造的なAIアシスタントライターです。",
        "role": "system"
      },
      {
        "content": "以下の指示に従ってください: Joe Biden is the Nth president of the United States. What is N?",
        "role": "user"
      }
    ],
    "raw_output_text_generation_0": "Joe Biden is the 46th president of the United States. Therefore, N = 46."
  },
  "model_name": "gpt-4o-mini"
}
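The raw_input_text_generation_0 field above records the chat messages that were sent to the model. As a rough illustration of how a dataset row, the system prompt, and the template combine into those messages, here is a plain-Python sketch. The helper names `apply_output_mappings` and `build_messages` are hypothetical, and the placeholder is filled with naive string replacement rather than a real template engine, so treat it as an approximation of the idea and not as Distilabel's code.

```python
def apply_output_mappings(row, mappings):
    """Rename columns of a row, e.g. {"prompt": "instruction"}."""
    return {mappings.get(key, key): value for key, value in row.items()}


def build_messages(row, system_prompt, template):
    """Build chat messages from a row using naive placeholder replacement."""
    user_content = template.replace("{{ instruction }}", row["instruction"])
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


raw_row = {"prompt": "Joe Biden is the Nth president of the United States. What is N?"}
row = apply_output_mappings(raw_row, {"prompt": "instruction"})
messages = build_messages(
    row,
    system_prompt="あなたは創造的なAIアシスタントライターです。",
    template="以下の指示に従ってください: {{ instruction }}",
)
for message in messages:
    print(message["role"], "->", message["content"])
```

The resulting user message is the template text followed by the original prompt, which matches the shape of the recorded input above.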

Evaluating outputs with an LLM

Next is a simple example of having an LLM evaluate outputs.
Below, gpt-4o-mini evaluates two generations ("46" and "Joe Biden is the 45th president of the United States. Therefore, N = 45."^[3]) for the instruction "Joe Biden is the Nth president of the United States. What is N?".

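import os

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import UltraFeedback

# Read the API key from the environment (assumes the OPENAI_API_KEY variable is set).
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

uf = UltraFeedback(  # Evaluate with the UltraFeedback prompt (described later).
    llm=OpenAILLM(model="gpt-4o-mini", api_key=OPENAI_API_KEY),
)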
uf.load()
result = next(
    uf.process(
        [
            {
                "instruction": "Joe Biden is the Nth president of the United States. What is N?",
                "generations": [
                    "46",
                    "Joe Biden is the 45th president of the United States. Therefore, N = 45.",
                ],
            }
        ]
    )
)
print(result[0]["distilabel_metadata"]["raw_output_ultra_feedback_0"])

Running this produced the following result.

#### Output for Text 1
Rating: 5
Rationale: The output correctly identifies Joe Biden as the 46th president of the United States with no inaccuracies. It is concise and directly answers the question posed in the instruction.

#### Output for Text 2
Rating: 1
Rationale: The output inaccurately states that Joe Biden is the 45th president of the United States, which is incorrect. This error leads to a complete failure to follow the instructions accurately, resulting in a low quality rating.

The first output was rated 5 and the second output, which answered "N = 45", was rated 1, each with a rationale.
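The raw evaluator text follows a simple "Rating:" / "Rationale:" layout, so it can be turned into structured values with a few lines of parsing. The sketch below is a plain-Python illustration of that idea. The function name `parse_feedback` is hypothetical, and it assumes each rating and rationale fits on a single line, as in the output above. It is not the parser Distilabel uses internally.

```python
import re


def parse_feedback(raw_output):
    """Extract integer ratings and rationale strings from evaluator text."""
    ratings, rationales = [], []
    for line in raw_output.splitlines():
        line = line.strip()
        if line.startswith("Rating:"):
            # Take the first integer, so "Rating: 5 (Excellent)" also works.
            match = re.search(r"\d+", line)
            ratings.append(int(match.group()) if match else None)
        elif line.startswith("Rationale:"):
            rationales.append(line[len("Rationale:"):].strip())
    return ratings, rationales


raw_output = """#### Output for Text 1
Rating: 5
Rationale: The output correctly identifies Joe Biden as the 46th president of the United States with no inaccuracies. It is concise and directly answers the question posed in the instruction.

#### Output for Text 2
Rating: 1
Rationale: The output inaccurately states that Joe Biden is the 45th president of the United States, which is incorrect. This error leads to a complete failure to follow the instructions accurately, resulting in a low quality rating."""

ratings, rationales = parse_feedback(raw_output)
print(ratings)
print(len(rationales))
```

Keeping ratings as integers makes it straightforward to compare generations or filter rows later.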

Ready-made paper reimplementations

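Here is a brief explanation of UltraFeedback, which appears in the implementation above.
Distilabel provides reimplementations of papers from the synthetic data field, and they are easy to use.
UltraFeedback is one of them. It is a large-scale preference dataset for training (RLHF, DPO, and so on) that makes LLMs produce outputs aligned with human preferences, and the prompts built to create that dataset are available in Distilabel.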

https://arxiv.org/abs/2310.01377

The UltraFeedback evaluation prompt

The evaluation prompt is shown below. In code, it can be displayed with uf.print().

╭──────────────────────────────────────────── Prompt: UltraFeedback  ─────────────────────────────────────────────╮
│ ╭────────────────────────────────────────────── System Message ───────────────────────────────────────────────╮ │
│ │ Your role is to evaluate text quality based on given criteria.                                              │ │
│ │ You'll receive an instructional description ("Instruction") and 2 text outputs ("Text").                    │ │
│ │ Understand and interpret instructions to evaluate effectively.                                              │ │
│ │ Provide annotations for each text with a rating and rationale.                                              │ │
│ │ The 2 texts given are independent, and should be evaluated separately.                                      │ │
│ │                                                                                                             │ │
│ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ ╭─────────────────────────────────────────────── User Message ────────────────────────────────────────────────╮ │
│ │ # General Text Quality Assessment                                                                           │ │
│ │                                                                                                             │ │
│ │ Evaluate the model's outputs based on various criteria:                                                     │ │
│ │                                                                                                             │ │
│ │ 1. **Correctness & Informativeness**: Does the output provide accurate and helpful information?             │ │
│ │ 2. **Honesty & Uncertainty**: How confidently does the model convey its information, and does it express    │ │
│ │ uncertainty appropriately?                                                                                  │ │
│ │ 3. **Truthfulness & Hallucination**: Does the model introduce misleading or fabricated details?             │ │
│ │ 4. **Instruction Following**: Does the model's output align with given instructions and the user's intent?  │ │
│ │                                                                                                             │ │
│ │ Your role is to provide a holistic assessment considering all the above factors.                            │ │
│ │                                                                                                             │ │
│ │ **Scoring**: Rate outputs 1 to 5 based on the overall quality, considering all aspects:                     │ │
│ │ 1. **Low Quality**: Contains inaccuracies, may be entirely wrong or has severe hallucinations.              │ │
│ │ 2. **Moderate Quality**: Addresses some aspects, but has errors or is partially aligned with instructions.  │ │
│ │ 3. **Good**: Generally accurate but may contain minor errors or slight deviations.                          │ │
│ │ 4. **Very Good**: Near perfect, with minor issues in terms of alignment or confidence.                      │ │
│ │ 5, **Excellent**: Accurate, confident, aligned with instructions, and free of hallucinations.               │ │
│ │                                                                                                             │ │
│ │ ## Format:                                                                                                  │ │
│ │                                                                                                             │ │
│ │ ### Input                                                                                                   │ │
│ │ Instruction: [Clearly specify the task goal and restrictions]                                               │ │
│ │                                                                                                             │ │
│ │ Texts:                                                                                                      │ │
│ │ <text 1> [Text 1]                                                                                           │ │
│ │ <text 2> [Text 2]                                                                                           │ │
│ │                                                                                                             │ │
│ │ ### Output                                                                                                  │ │
│ │ #### Output for Text 1                                                                                      │ │
│ │ Rating: [Rating for text 1]                                                                                 │ │
│ │ Rationale: [Rationale for the rating in short sentences]                                                    │ │
│ │                                                                                                             │ │
│ │ #### Output for Text 2                                                                                      │ │
│ │ Rating: [Rating]                                                                                            │ │
│ │ Rationale: [Rationale]                                                                                      │ │
│ │                                                                                                             │ │
│ │ ---                                                                                                         │ │
│ │                                                                                                             │ │
│ │ ## Annotation                                                                                               │ │
│ │                                                                                                             │ │
│ │ ### Input                                                                                                   │ │
│ │ Instruction: <PLACEHOLDER_INSTRUCTION>                                                                      │ │
│ │                                                                                                             │ │
│ │ Texts:                                                                                                      │ │
│ │ <text 1> <PLACEHOLDER_GENERATION_0>                                                                         │ │
│ │ <text 2> <PLACEHOLDER_GENERATION_1>                                                                         │ │
│ │                                                                                                             │ │
│ │ ### Output                                                                                                  │ │
│ │                                                                                                             │ │
│ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

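Besides UltraFeedback, reimplementations of various other methods related to synthetic data generation are provided.
Each page carefully explains the method and how to use it with Distilabel, which makes them good learning material.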

Learning from a use case ~ Creating a preference dataset

Here we follow a Distilabel tutorial to see how Distilabel is used in a setting close to a real use case.

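Training that optimizes LLM outputs toward what humans prefer, such as DPO (Direct Preference Optimization), requires a preference dataset: for two outputs to the same input, the preferred one is labeled chosen and the other one rejected.
This time we use Distilabel to build a pipeline that generates text with two LLMs and has a larger LLM evaluate their outputs to create such a dataset.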

Let's take a look. The implementation below is shown in notebook form.

Setup

Install the required libraries.

!pip install "distilabel[hf-inference-endpoints]"
!pip install "transformers~=4.0" "torch~=2.0"

Import them.

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    LoadDataFromHub,
    GroupColumns,
    FormatTextGenerationDPO,
    PreferenceToArgilla,
)
from distilabel.steps.tasks import TextGeneration, UltraFeedback

Register HF_TOKEN so that inference can be run against models on Hugging Face^[4].

import os
from huggingface_hub import login

login(token=os.getenv("HF_TOKEN"), add_to_git_credential=True)

Defining the pipeline

The flow for creating the preference dataset is shown in the figure below.

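The pipeline for creating a preference dataset with Distilabel in this tutorial. (Quoted from the official Distilabel documentation)
The dataset is loaded from Hugging Face (LoadDataFromHub), and text is generated with two LLMs (TextGeneration).
The outputs are then grouped for evaluation (GroupColumns), evaluated with the evaluation prompt (UltraFeedback), and finally formatted as a preference dataset (FormatTextGenerationDPO)^[5].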
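The flow described here is a small directed acyclic graph: one loader fans out to two generators, which fan back in to the grouping step. As a sketch of that structure, the snippet below writes the dependencies as a plain dictionary and derives a valid execution order with the standard library's `graphlib`. The step labels are hypothetical names chosen for this illustration, and the snippet only models the ordering, not how Distilabel schedules or batches work.

```python
from graphlib import TopologicalSorter

# Each key lists the steps it depends on (its predecessors).
dependencies = {
    "text_generation_0": {"load_dataset"},
    "text_generation_1": {"load_dataset"},
    "group_columns": {"text_generation_0", "text_generation_1"},
    "ultra_feedback": {"group_columns"},
    "format_dpo": {"ultra_feedback"},
}

# static_order() yields the steps so that every step comes after its predecessors.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The two generation steps have no dependency on each other, so they may appear in either order, and a pipeline runner is free to execute them in parallel.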

Now let's look at each step.

Loading the dataset (`LoadDataFromHub`)

For this tutorial we use argilla/10Kprompts-mini, a dataset of 20 examples.
It can be downloaded from Hugging Face as follows.

load_dataset = LoadDataFromHub(
    repo_id="argilla/10Kprompts-mini",
    num_examples=1,
    pipeline=Pipeline(name="showcase-pipeline"),
)
load_dataset.load()
next(load_dataset.process())
# ([{'instruction': 'How can I create an efficient and robust workflow that utilizes advanced automation techniques to extract targeted data, including customer information, from diverse PDF documents and effortlessly integrate it into a designated Google Sheet? Furthermore, I am interested in establishing a comprehensive and seamless system that promptly activates an SMS notification on my mobile device whenever a new PDF document is uploaded to the Google Sheet, ensuring real-time updates and enhanced accessibility.',
#    'topic': 'Software Development'}],
#  True)

Looking at one downloaded example shows that it consists of an instruction^[6] and its topic.

Generating responses (`TextGeneration`)

Here we create steps that generate responses with two models, meta-llama/Meta-Llama-3-8B-Instruct and mistralai/Mixtral-8x7B-Instruct-v0.1^[7].
The implementation is below. This is done by putting the TextGeneration tasks in a list.

generate_responses = [
    TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        ),
        pipeline=Pipeline(name="showcase-pipeline"),
    ),
    TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
            tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        ),
        pipeline=Pipeline(name="showcase-pipeline"),
    ),
]
for task in generate_responses:
    task.load()
    print(next(task.process([{"instruction": "Which are the top cities in Spain?"}])))
# [{'instruction': 'Which are the top cities in Spain?', 'generation': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\n\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Família and Park Güell, designed by Antoni Gaudí.\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alcázar Palace, and lively flamenco music scene.\n5. **Málaga**: A coastal city in southern Spain, Málaga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\n10. **San Sebastián**: A city in the Basque Country, San Sebastián is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebastián International Film Festival.\n\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.', 'distilabel_metadata': {'raw_output_text_generation_0': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\n\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Família and Park Güell, designed by Antoni Gaudí.\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alcázar Palace, and lively flamenco music scene.\n5. **Málaga**: A coastal city in southern Spain, Málaga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\n10. **San Sebastián**: A city in the Basque Country, San Sebastián is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebastián International Film Festival.\n\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.'}, 'model_name': 'meta-llama/Meta-Llama-3-8B-Instruct'}]
# [{'instruction': 'Which are the top cities in Spain?', 'generation': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\n\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\n\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaudí, including the Sagrada Familia and Park Güell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\n\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\n\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alcázar, a beautiful palace made up of a series of rooms and courtyards.\n\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\n\n6. Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\n\n7. Málaga: A coastal city in Andalusia, Málaga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\n\nThese are just a few of the many wonderful cities in Spain.', 'distilabel_metadata': {'raw_output_text_generation_0': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\n\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\n\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaudí, including the Sagrada Familia and Park Güell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\n\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\n\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alcázar, a beautiful palace made up of a series of rooms and courtyards.\n\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\n\n6. Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\n\n7. Málaga: A coastal city in Andalusia, Málaga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\n\nThese are just a few of the many wonderful cities in Spain.'}, 'model_name': 'mistralai/Mixtral-8x7B-Instruct-v0.1'}]

Running this shows the generations from the two models for the instruction "Which are the top cities in Spain?"^[8]^[9].

Grouping responses (`GroupColumns`)

To evaluate the generations from the two models, we reshape the data.
The implementation is below^[10].

group_responses = GroupColumns(
    columns=["generation", "model_name"],
    output_columns=["generations", "model_names"],
    pipeline=Pipeline(name="showcase-pipeline"),
)
next(
    group_responses.process(
        [
            {
                "generation": "Madrid",
                "model_name": "meta-llama/Meta-Llama-3-8B-Instruct",
            },
        ],
        [
            {
                "generation": "Barcelona",
                "model_name": "mistralai/Mixtral-8x7B-Instruct-v0.1",
            }
        ],
    )
)
# [{'generations': ['Madrid', 'Barcelona'],
#   'model_names': ['meta-llama/Meta-Llama-3-8B-Instruct',
#    'mistralai/Mixtral-8x7B-Instruct-v0.1']}]
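To show what this grouping amounts to, here is a plain-Python sketch that merges one row per model into a single row with list-valued columns. The function name `group_columns` is hypothetical and the code is only an illustration of the data transformation under the assumption of one row per branch. It is not the GroupColumns implementation.

```python
def group_columns(branch_rows, columns, output_columns):
    """Merge one row per branch into a single row with list-valued columns."""
    grouped = {}
    for column, output_column in zip(columns, output_columns):
        # Collect the same column from every branch, keeping branch order.
        grouped[output_column] = [row[column] for row in branch_rows]
    return grouped


rows = [
    {"generation": "Madrid", "model_name": "meta-llama/Meta-Llama-3-8B-Instruct"},
    {"generation": "Barcelona", "model_name": "mistralai/Mixtral-8x7B-Instruct-v0.1"},
]
grouped = group_columns(
    rows,
    columns=["generation", "model_name"],
    output_columns=["generations", "model_names"],
)
print(grouped)
```

Because the lists keep branch order, the generation at a given position and the model name at the same position belong together, which is what the evaluation step relies on.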

Evaluating responses (`UltraFeedback`)

To evaluate the responses, we use UltraFeedback, covered earlier in this article, with meta-llama/Meta-Llama-3-70B-Instruct as the evaluator.
The implementation is below.

evaluate_responses = UltraFeedback(
    aspect="overall-rating",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
    ),
    pipeline=Pipeline(name="showcase-pipeline"),
)
evaluate_responses.load()
next(
    evaluate_responses.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
            }
        ]
    )
)
# [{'instruction': "What's the capital of Spain?",
#   'generations': ['Madrid', 'Barcelona'],
#   'ratings': [5, 1],
#   'rationales': ["The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.",
#    "The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent."],
#   'distilabel_metadata': {'raw_output_ultra_feedback_0': "#### Output for Text 1\nRating: 5 (Excellent)\nRationale: The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\n\n#### Output for Text 2\nRating: 1 (Low Quality)\nRationale: The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent."},
#   'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'}]

Running this yields, for generations (the outputs of the two models), the evaluation scores ratings and the reasons behind them rationales^[11]^[12].

Formatting as a preference dataset (`FormatTextGenerationDPO`)

A preference dataset needs a format in which, for two outputs to the same input, the preferred one is chosen and the other one is rejected.
The implementation is below.

format_dpo = FormatTextGenerationDPO(pipeline=Pipeline(name="showcase-pipeline"))
format_dpo.load()
next(
    format_dpo.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
                "generation_models": [
                    "Meta-Llama-3-8B-Instruct",
                    "Mixtral-8x7B-Instruct-v0.1",
                ],
                "ratings": [5, 1],
            }
        ]
    )
)
# [{'instruction': "What's the capital of Spain?",
#   'generations': ['Madrid', 'Barcelona'],
#   'generation_models': ['Meta-Llama-3-8B-Instruct',
#    'Mixtral-8x7B-Instruct-v0.1'],
#   'ratings': [5, 1],
#   'prompt': "What's the capital of Spain?",
#   'prompt_id': '26174c953df26b3049484e4721102dca6b25d2de9e3aa22aa84f25ed1c798512',
#   'chosen': [{'role': 'user', 'content': "What's the capital of Spain?"},
#    {'role': 'assistant', 'content': 'Madrid'}],
#   'chosen_model': 'Meta-Llama-3-8B-Instruct',
#   'chosen_rating': 5,
#   'rejected': [{'role': 'user', 'content': "What's the capital of Spain?"},
#    {'role': 'assistant', 'content': 'Barcelona'}],
#   'rejected_model': 'Mixtral-8x7B-Instruct-v0.1',
#   'rejected_rating': 1}]
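The core of this formatting step can be sketched in plain Python: pick the best-rated generation as chosen, another one as rejected, and wrap each in a chat-style message list. The function `to_preference_record` below is a hypothetical helper. Choosing the lowest-rated generation as rejected and returning None on a tie are assumptions made for this sketch, and Distilabel's own selection logic may differ when there are more than two generations.

```python
def to_preference_record(instruction, generations, ratings):
    """Label the highest-rated generation as chosen and the lowest-rated as rejected."""
    if len(generations) != len(ratings) or len(generations) < 2:
        raise ValueError("need at least two generations, each with a rating")

    best = max(range(len(ratings)), key=ratings.__getitem__)
    worst = min(range(len(ratings)), key=ratings.__getitem__)
    if ratings[best] == ratings[worst]:
        # Assumption of this sketch: a tie carries no preference signal.
        return None

    def as_chat(answer):
        return [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": answer},
        ]

    return {
        "prompt": instruction,
        "chosen": as_chat(generations[best]),
        "chosen_rating": ratings[best],
        "rejected": as_chat(generations[worst]),
        "rejected_rating": ratings[worst],
    }


record = to_preference_record(
    "What's the capital of Spain?",
    generations=["Madrid", "Barcelona"],
    ratings=[5, 1],
)
print(record["chosen"][1]["content"], "/", record["rejected"][1]["content"])
```

With exactly two generations, as in this tutorial, the choice of rejected is unambiguous, so the sketch yields the same chosen and rejected answers as the output shown above.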

We have now gone through each step, from downloading the source dataset to turning it into a preference dataset.

Running the pipeline

Connecting all the steps introduced in the "Defining the pipeline" section gives the following complete pipeline.

with Pipeline(name="generate-dataset") as pipeline:

    load_dataset = LoadDataFromHub(repo_id="argilla/10Kprompts-mini")

    generate_responses = [
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-8B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
                generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
            )
        ),
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
                tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
                generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
            )
        ),
    ]

    group_responses = GroupColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    evaluate_responses = UltraFeedback(
        aspect="overall-rating",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        )
    )

    format_dpo = FormatTextGenerationDPO()

    for task in generate_responses:
        load_dataset.connect(task)
        task.connect(group_responses)
    group_responses.connect(evaluate_responses)
    evaluate_responses.connect(format_dpo)
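
The fan-out to two generators and the fan-in at `GroupColumns` can be pictured with plain Python. The sketch below only illustrates the data flow under stated assumptions and is not the library's code: `group_columns` is a hypothetical helper, rows from each branch are assumed to arrive in the same order, and the per-branch columns are assumed to be dropped after grouping.

```python
def group_columns(branch_rows, columns, output_columns):
    """Hypothetical helper: merge matching rows from several branches into one row."""
    grouped = []
    # zip(*branch_rows) walks the branches in lockstep, one row per branch.
    for rows in zip(*branch_rows):
        merged = dict(rows[0])  # start from a copy of the first branch's row
        for column, output_column in zip(columns, output_columns):
            # Assumption: the per-branch column is dropped once it has been grouped.
            merged.pop(column, None)
            merged[output_column] = [row[column] for row in rows]
        grouped.append(merged)
    return grouped


# One row per branch, as if each TextGeneration task had answered the same prompt.
llama_rows = [
    {
        "instruction": "What's the capital of Spain?",
        "generation": "Madrid",
        "model_name": "Meta-Llama-3-8B-Instruct",
    }
]
mixtral_rows = [
    {
        "instruction": "What's the capital of Spain?",
        "generation": "Barcelona",
        "model_name": "Mixtral-8x7B-Instruct-v0.1",
    }
]

grouped = group_columns(
    [llama_rows, mixtral_rows],
    columns=["generation", "model_name"],
    output_columns=["generations", "model_names"],
)
print(grouped[0]["generations"])  # ['Madrid', 'Barcelona']
```

After this step each row carries a list of candidate answers, which is the shape the evaluation step needs in order to rate them side by side.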

Let's run it.

distiset = pipeline.run()
distiset.save_to_disk("output")  # Save the generated Distiset under the output folder.
# distiset.push_to_hub("[your-owner-name]/example-preference-dataset")  # Optionally, push the generated Distiset to a Hugging Face Hub repository.

You can inspect the results by saving them to a local folder of your choice or by pushing them to a Hugging Face Hub repository.
The results of this run on the argilla/10Kprompts-mini dataset can be viewed here.

That concludes the walkthrough of creating a preference dataset with Distilabel.

Summary

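In this article we covered what Distilabel is and, as a use case, how to build a pipeline that creates a preference dataset.
Here we looked at how to create synthetic data in Distilabel by combining existing implementations, but you can also define your own custom Task for any step of a pipeline.
I found it to be a convenient framework: you can choose from a variety of LLMs and handle generation, evaluation, and dataset formatting efficiently within a single unified pipeline.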

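Announcements

If you are even slightly interested in our company, please feel free to get in touch. Whether you would like to start with a casual chat or want to explore side work, you are welcome.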

Footnotes

1. Following the link shows the actual data: it is an instruction-tuning dataset of 10 examples, each consisting of prompt (the prompt), completion (the response to the prompt), and meta (metadata). Although this dataset already contains responses to the prompts, here we assume they are missing and have an LLM generate them.
2. To download a model from Hugging Face and run generation, you need to register an HF_TOKEN. Obtain an access token (see here for how) and register it with huggingface-cli login.
3. Let's deliberately supply an incorrect output, 45, and see what happens.
4. See here for how to obtain a token.
5. The figure includes a step called PreferenceToArgilla. We omit the details in this article, but you can also add a step that uploads the data to Argilla, a data annotation tool, to run a human feedback loop.
6. Japanese translation of the instruction (DeepL): "高度な自動化技術を活用して、多様なPDF文書から顧客情報を含む対象データを抽出し、指定したGoogle Sheetに簡単に統合する効率的で堅牢なワークフローを構築するにはどうすればよいでしょうか？さらに、新しいPDF文書がGoogle Sheetにアップロードされるたびに、私の携帯端末にSMS通知が即座に作動し、リアルタイムの更新とアクセシビリティの向上を保証する包括的でシームレスなシステムを確立したいと考えています。"
7. Many other models can also be selected.
8. Japanese translation (DeepL) of the output generated by meta-llama/Meta-Llama-3-8B-Instruct: "スペインは豊かな文化、歴史、建築を持つ国で、訪れるべき素晴らしい都市がたくさんある。スペインのトップ都市をいくつか紹介しよう。マドリード スペインの首都で、活気あるナイトライフ、美術館、王宮やプラド美術館などの歴史的建造物で知られています。バルセロナ スペイン第2の都市で、モダニズム建築、ビーチ、アントニ・ガウディが設計したサグラダ・ファミリアやグエル公園などの象徴的なランドマークで有名。バレンシア 地中海沿岸に位置するバレンシアは、美しいビーチや芸術科学都市、パエリアなどのおいしい郷土料理で知られています。セビリア アンダルシア地方の州都セビリアは、見事な大聖堂、王宮アルカサル、活気あるフラメンコ音楽シーンで有名。マラガマラガ：スペイン南部の海岸沿いの都市。豊かな歴史と美しいビーチで知られ、パブロ・ピカソの生誕地でもある。サラゴササラゴサ：アラゴン州の北東部に位置するサラゴサは、ローマ時代の遺跡やゴシック様式の大聖堂、美しい公園で知られる歴史豊かな街。グラナダグラナダ：アンダルシア地方の都市で、ユネスコの世界遺産に登録されている見事なアルハンブラ宮殿とジェネリフェ庭園で有名。ビルバオビルバオ：バスク地方の都市で、グッゲンハイム美術館などの近代建築と豊かな文化遺産で知られる。アリカンテアリカンテ：バレンシア地方の海岸都市で、美しいビーチ、歴史的な城、活気あるナイトライフで有名。サン・セバスティアン：バスク地方の都市サン・セバスティアンは、美しいビーチや美食の街として知られ、サン・セバスティアン国際映画祭のような文化イベントも開催される。"
9. Japanese translation (DeepL) of the output generated by mistralai/Mixtral-8x7B-Instruct-v0.1: " 観光、文化、歴史、生活の質など、さまざまな要素からスペインのトップ都市をいくつか紹介しよう。マドリード：スペインの首都であり、最大の都市であるマドリードは、活気あるナイトライフ、プラド美術館やレイナ・ソフィア美術館などの世界的な美術館、レティーロ公園などの美しい公園、おいしい食べ物で知られている。バルセロナユニークな建築で有名なバルセロナには、サグラダ・ファミリアやグエル公園など、アントニ・ガウディ設計のユネスコ世界遺産がいくつもある。美しいビーチ、活気あるアートシーン、おいしいカタルーニャ料理も自慢。バレンシアスペインの東に位置する海岸沿いの都市バレンシアは、プラネタリウム、オペラハウス、インタラクティブ科学博物館を含む近代建築の複合都市である芸術科学都市で知られている。また、米、野菜、魚介類を使った伝統的なスペイン料理、パエリアでも有名。セビリアアンダルシア地方の州都セビリアは、フラメンコ舞踊、見事な大聖堂（ゴシック様式の大聖堂としては世界最大）、いくつもの部屋と中庭からなる美しい宮殿アルカサルで有名。グラナダシエラネバダ山脈のふもとに位置するグラナダは、9世紀に建てられたムーア人の要塞、アルハンブラ宮殿で知られる。また、スペインの伝統料理であるタパスが有名で、飲み物と一緒に無料で提供されることも多い。ビルバオバスク地方の都市ビルバオは、フランク・ゲーリーが設計した現代美術館グッゲンハイム美術館をはじめとする近代建築で有名。また、バーやレストランで供されるバスク風タパスの一種、ピンチョスでも知られる。マラガ：アンダルシア地方の海岸沿いの都市であるマラガは、美しいビーチ、アルカサバ城やジブラルファロ城などの史跡、そしてこの街で生まれた有名なスペイン人画家に捧げられたピカソ美術館で知られている。"
10. Here the generations from the two models are hard-coded for simplicity; in practice you would pass in the actual model outputs.
11. Japanese translation (DeepL) of the rationale for the rating ("5") given to the meta-llama/Meta-Llama-3-8B-Instruct output ("Madrid"): "答えは正しく、質問に直接対応しており、幻覚や不必要な詳細がない。自信を持って正確な情報を提供し、ユーザーの意図に完璧に沿う。"
12. Japanese translation (DeepL) of the rationale for the rating ("1") given to the mistralai/Mixtral-8x7B-Instruct-v0.1 output ("Barcelona"): "バルセロナはスペインの首都ではないので、答えは不正解です。これは重大な不正確さをもたらし、有益な情報を提供できず、ユーザーの意図から完全に逸脱している。"
