MLflow

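MLflow is open-source software for managing the machine learning lifecycle. It supports the whole process of a machine learning project, including model versioning, dataset management, experiment tracking, and deployment.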
https://mlflow.org/

The documentation is also thorough. Beyond explaining how to use MLflow, I find it very useful for learning best practices in machine learning experiment management.

MLflow's LLM features

The latest MLflow version at the time of writing (2.9.2) adds several features for managing LLM experiments.

They fall into the following three areas, which I will go through in turn.

Deployments
Logging (tracking)
Evaluation

Deployments

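MLflow Deployments provides an interface for handling LLMs from multiple providers in a uniform way, and it is mostly used to connect those LLMs with MLflow's other services. The documentation lists the following benefits: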

Unified Endpoint

Simplified Integrations

Secure Credential Management

Consistent API Experience

Seamless Provider Swapping

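For supported providers, a deployment can be created easily from a YAML file like the one below. Unsupported models, or models you want to run yourself, will not work from YAML alone; they first have to be saved as an MLflow model (the deployment then presumably calls that saved model).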

endpoints:
  - name: completions
    endpoint_type: llm/v1/chat
    model:
      provider: openai
      name: gpt-3.5-turbo
      config:
        openai_api_key: $OPENAI_API_KEY

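Once a deployment is created, LLMs from multiple providers can be used easily from MLflow, both in LLM experiments and in the prompt engineering UI (image below). In the prompt engineering UI, a deployment can be specified when selecting a model, so you can switch between several deployments easily while experimenting.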
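Besides the UI, an endpoint can also be called from Python. The following is a minimal sketch, not a definitive recipe: it assumes the deployments server was started locally from the YAML above, that it is reachable at the URL in `MLFLOW_DEPLOYMENTS_TARGET` (falling back to `http://localhost:5000`, an assumed address), and that the endpoint name is `completions` as in the config. The helper names `build_chat_payload` and `query_endpoint` are made up for this example.

```python
import os


def build_chat_payload(question: str, temperature: float = 0.0) -> dict:
    """Build a chat-style request body for an endpoint of type llm/v1/chat."""
    return {
        "messages": [{"role": "user", "content": question}],
        "temperature": temperature,
    }


def query_endpoint(question: str, endpoint: str = "completions") -> dict:
    """Send one question to the deployments server and return the raw response."""
    # Imported lazily so the payload helper above works without MLflow installed.
    from mlflow.deployments import get_deploy_client

    # Assumption: the server address; adjust to wherever your server listens.
    target = os.environ.get("MLFLOW_DEPLOYMENTS_TARGET", "http://localhost:5000")
    client = get_deploy_client(target)
    return client.predict(endpoint=endpoint, inputs=build_chat_payload(question))


if __name__ == "__main__":
    # Only the payload is built here; query_endpoint needs a running server.
    print(build_chat_payload("What is MLflow?"))
```

Because the provider is fixed in the YAML and not in the calling code, switching providers should only require pointing `endpoint` at a different entry.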

Logging (tracking)

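Version 2.9.2 also supports logging LLMs.
Logging was already possible by writing a custom model class, but methods such as mlflow.openai.log_model make it easier to record models and parameters. The openai part is what MLflow calls a flavor. Flavors exist for other packages as well (pytorch, sklearn, pyfunc, ...), but at the moment LLMs from Google and AWS (Bedrock) do not seem to be supported.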

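import mlflow
import openai

with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    # Wrap "gpt-4" as an MLflow model. The "{question}" placeholder is filled
    # in from the input data when the model is called.
    logged_model_info = mlflow.openai.log_model(
        model="gpt-4",
        # openai.ChatCompletion exists in openai<1.0; newer openai releases
        # expose this as openai.chat.completions instead.
        task=openai.ChatCompletion,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )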
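What gets logged here is essentially a message template, and `{question}` is filled in from a column of the same name at prediction time. The sketch below illustrates that idea in plain Python. It is a conceptual illustration only, not MLflow's internal implementation, and `render_messages` is a hypothetical helper.

```python
def render_messages(template: list, row: dict) -> list:
    """Fill {placeholders} in each message's content with values from one input row."""
    return [
        {"role": message["role"], "content": message["content"].format(**row)}
        for message in template
    ]


template = [
    {"role": "system", "content": "Answer the following question in two sentences"},
    {"role": "user", "content": "{question}"},
]

# One row of input data; the key matches the placeholder name in the template.
rendered = render_messages(template, {"question": "What is MLflow?"})
print(rendered[1]["content"])
```

With the real logged model, the equivalent step is to load it with `mlflow.pyfunc.load_model(logged_model_info.model_uri)` and call `predict` on a table that has a `question` column.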

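To log an unsupported model, writing a custom pyfunc class seems to be the way to go, and a sample notebook is provided for this.

A custom class inherits from mlflow.pyfunc.PythonModel and implements predict, plus load_context when artifacts such as model weights have to be loaded. The example below loads the model locally, but the class could also simply send requests to an external LLM (I am not sure whether that counts as best practice).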

import mlflow
import torch
import transformers


class MPT(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(
            context.artifacts["snapshot"], padding_side="left"
        )
        config = transformers.AutoConfig.from_pretrained(
            context.artifacts["snapshot"], trust_remote_code=True
        )
        self.model = transformers.AutoModelForCausalLM.from_pretrained(
            context.artifacts["snapshot"],
            config=config,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
        )
        self.model.eval()

    def _build_prompt(self, instruction):
        INSTRUCTION_KEY = "### Instruction:"
        RESPONSE_KEY = "### Response:"
        INTRO_BLURB = (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request."
        )
        return f"""{INTRO_BLURB}
        {INSTRUCTION_KEY}
        {instruction}
        {RESPONSE_KEY}
        """

    def predict(self, context, model_input, params=None):
        prompt = model_input["prompt"][0]
        temperature = params.get("temperature", 0.1) if params else 0.1
        max_tokens = params.get("max_tokens", 1000) if params else 1000
        prompt = self._build_prompt(prompt)
        encoded_input = self.tokenizer.encode(prompt, return_tensors="pt").to("cpu")
        output = self.model.generate(
            encoded_input,
            do_sample=True,
            temperature=temperature,
            max_new_tokens=max_tokens,
        )
        generated_text = self.tokenizer.decode(output[0], skip_special_tokens=True)
        prompt_length = len(self.tokenizer.encode(prompt, return_tensors="pt")[0])
        generated_response = self.tokenizer.decode(
            output[0][prompt_length:], skip_special_tokens=True
        )
        return {"candidates": [generated_response]}
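For the other use mentioned above, forwarding requests to an external LLM, a wrapper might look like the sketch below. Everything specific here is an assumption for illustration: `RemoteLLM`, the injected `send_request` callable, and `fake_backend` are hypothetical, and the `try/except` only exists so the sketch can run on a machine without MLflow installed (real use requires MLflow).

```python
try:
    from mlflow.pyfunc import PythonModel
except ImportError:  # fallback for reading/running the sketch without MLflow
    PythonModel = object


class RemoteLLM(PythonModel):
    """Hypothetical pyfunc wrapper that forwards prompts to an external LLM."""

    def __init__(self, send_request):
        # send_request is a hypothetical callable:
        # (prompt, temperature, max_tokens) -> generated text
        self._send_request = send_request

    def predict(self, context, model_input, params=None):
        params = params or {}
        temperature = params.get("temperature", 0.1)
        max_tokens = params.get("max_tokens", 1000)
        # Unlike the MPT example, every prompt in the input is processed.
        candidates = [
            self._send_request(prompt, temperature, max_tokens)
            for prompt in model_input["prompt"]
        ]
        return {"candidates": candidates}


def fake_backend(prompt, temperature, max_tokens):
    """Stand-in for a real API call so the sketch runs offline."""
    return f"echo: {prompt}"


model = RemoteLLM(fake_backend)
result = model.predict(None, {"prompt": ["What is MLflow?"]})
print(result)
```

In a real setup it is probably safer to create the API client inside `load_context` than to pass it to `__init__`, since the model instance is serialized when it is logged.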

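Evaluation

The latest version also provides LLM-specific evaluation for models logged as in the previous section.

General metrics

Used appropriately, mlflow.evaluate makes it easy to add metrics suited to the task. For example, the short snippet below is enough to compute and log four metrics suited to a QA task automatically.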

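import mlflow

# `model` is a logged model (or its URI) and `eval_data` is a table with a
# "ground_truth" column holding the expected answers.
results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
)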

Metrics:

exact-match
toxicity (classifies how toxic the text is)
ari_grade_level
flesch_kincaid_grade_level
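To make the first of these concrete, exact match is simply the fraction of predictions that are identical to their targets. The sketch below is a plain-Python illustration of that definition, assuming no text normalization; MLflow's own implementation may differ in details such as case or whitespace handling.

```python
def exact_match(predictions: list, targets: list) -> float:
    """Return the fraction of predictions that equal their target exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(predictions)


predictions = ["MLflow", "Apache Spark", "an OSS tool"]
targets = ["MLflow", "Spark", "an OSS tool"]
# Two of the three predictions match their target exactly.
print(exact_match(predictions, targets))
```

A strict metric like this suits short factual answers; for longer free-form answers it will score almost everything as a miss.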

Other values for model_type include "text-summarization" (summarization) and "text" (text generation), and each one records a different set of metrics.

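Custom metrics

A common use case is defining your own metric for your own task. This is also easy: define a function that takes the designated arguments, as below, and pass the resulting metric to the mlflow.evaluate call from before. (One thing I stumbled on: the parameter names of the metric function, eval_fn here, determine what data is passed in. Specifically, they correspond to the column names of the dataset you provide.)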

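import mlflow
from mlflow.metrics import MetricValue, make_metric
# Import path assumed for this MLflow version; adjust if it differs.
from mlflow.metrics.base import standard_aggregations


def eval_fn(predictions, targets, metrics):
    # Score 1 when the prediction is longer than 10 characters, otherwise 0.
    # Numeric scores are used so that they can be aggregated (mean, variance, ...).
    scores = [1 if len(prediction) > 10 else 0 for prediction in predictions]
    return MetricValue(
        scores=scores,
        aggregate_results=standard_aggregations(scores),
    )


# Create an EvaluationMetric object.
passing_code_metric = make_metric(
    eval_fn=eval_fn, greater_is_better=False, name="over_10_chars"
)

results = mlflow.evaluate(
    ...,  # model, data, targets, and model_type as in the previous example
    extra_metrics=[passing_code_metric],
)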
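The name-based argument passing that tripped me up can be pictured with the following sketch. It is an illustration of the general mechanism using `inspect.signature`, not MLflow's actual code; `call_with_named_args` and the `inputs` column are hypothetical.

```python
import inspect


def call_with_named_args(fn, available: dict):
    """Call fn, passing only the values whose keys match fn's parameter names."""
    wanted = inspect.signature(fn).parameters
    missing = [name for name in wanted if name not in available]
    if missing:
        raise TypeError(f"no data available for parameters: {missing}")
    return fn(**{name: available[name] for name in wanted})


# Data the evaluator could offer: built-in names plus dataset columns.
available = {
    "predictions": ["short", "a much longer answer"],
    "targets": ["short", "a long answer"],
    "metrics": {},
    "inputs": ["q1", "q2"],  # hypothetical dataset column named "inputs"
}


def count_long(predictions):
    """Count predictions longer than 10 characters."""
    return sum(len(p) > 10 for p in predictions)


def pair_up(inputs, targets):
    """Pair each input with its target; works because both names are available."""
    return list(zip(inputs, targets))


print(call_with_named_args(count_long, available))
print(call_with_named_args(pair_up, available))
```

Seen this way, renaming a parameter changes which data the function receives, which explains why a seemingly harmless rename can break a metric.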

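Metrics using LLM as a judge

Unlike the general metrics, metrics based on "LLM as a judge" can also be created easily. "LLM as a judge" here means using another LLM to evaluate an LLM's predictions, which can greatly reduce evaluation work that would otherwise require costly human review. The implementation is very simple: define a few EvaluationExample objects, each an example of what kind of text gets a high or low score, and pass those examples in to create the metric.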

import mlflow

professionalism_example_score_2 = mlflow.metrics.genai.EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps "
        "you track experiments, package your code and models, and collaborate with your team, making the whole ML "
        "workflow smoother. It's like your Swiss Army knife for machine learning!"
    ),
    score=2,
    justification=(
        "The response is written in a casual tone. It uses contractions, filler words such as 'like', and "
        "exclamation points, which make it sound less professional. "
    ),
)
professionalism_example_score_4 = mlflow.metrics.genai.EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was "
        "developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning engineers face when "
        "developing, training, and deploying machine learning models."
    ),
    score=4,
    justification=("The response is written in a formal language and a neutral tone. "),
)

professionalism = mlflow.metrics.genai.make_genai_metric(
    name="professionalism",
    definition=(
        "Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is "
        "tailored to the context and audience. It often involves avoiding overly casual language, slang, or "
        "colloquialisms, and instead using clear, concise, and respectful language."
    ),
    grading_prompt=(
        "Professionalism: If the answer is written using a professional tone, below are the details for different scores: "
        "- Score 0: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for "
        "professional contexts. "
        "- Score 1: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in "
        "some informal professional settings. "
        "- Score 2: Language is overall formal but still has casual words/phrases. Borderline for professional contexts. "
        "- Score 3: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. "
        "- Score 4: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for formal "
        "business or academic settings. "
    ),
    examples=[professionalism_example_score_2, professionalism_example_score_4],
    model="openai:/gpt-3.5-turbo-16k",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance"],
    greater_is_better=True,
)
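Conceptually, the definition, the grading rubric, and the examples end up in a few-shot prompt for the judge model. The sketch below shows one way such a prompt could be assembled. It is purely illustrative: the layout is my own assumption and not MLflow's actual prompt template, and `Example` and `build_judge_prompt` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """Hypothetical stand-in for a scored evaluation example."""

    input: str
    output: str
    score: int
    justification: str


def build_judge_prompt(definition, grading_prompt, examples, question, answer):
    """Assemble a few-shot grading prompt from a rubric and scored examples."""
    parts = [
        f"Metric definition:\n{definition}",
        f"Grading rubric:\n{grading_prompt}",
    ]
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\nInput: {ex.input}\nOutput: {ex.output}\n"
            f"Score: {ex.score}\nJustification: {ex.justification}"
        )
    # The response to grade goes last, leaving the score for the judge to fill in.
    parts.append(f"Now grade this response.\nInput: {question}\nOutput: {answer}\nScore:")
    return "\n\n".join(parts)


examples = [
    Example("What is MLflow?", "It's like a Swiss Army knife for ML!", 2, "Casual tone."),
    Example("What is MLflow?", "MLflow is an open-source platform.", 4, "Formal tone."),
]
prompt = build_judge_prompt(
    "Professionalism refers to a formal, respectful style.",
    "Score 0 is extremely casual; score 4 is noticeably formal.",
    examples,
    "What is MLflow?",
    "MLflow manages the ML lifecycle.",
)
print(prompt)
```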

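The mechanism is designed so that accuracy improves as more scored examples are provided.
That said, this is still evaluation by an LLM, so I think it is necessary to verify how closely its judgments match human judgment in the first place. Even so, it is certainly cheaper than evaluating a large dataset by hand. Among the new MLflow features, being able to do this so easily is the one I found most valuable.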
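One simple way to run that check is to have a person score a small sample and compare those scores with the judge's. The sketch below computes two basic agreement figures. The function name and the scores are made up for illustration, and more robust statistics such as correlation or Cohen's kappa may be preferable in practice.

```python
def agreement_report(judge_scores: list, human_scores: list) -> dict:
    """Compare judge and human scores given for the same responses."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(judge_scores)
    pairs = list(zip(judge_scores, human_scores))
    return {
        # Fraction of responses where both gave exactly the same score.
        "exact_agreement": sum(j == h for j, h in pairs) / n,
        # Average distance between the two scores on the rubric scale.
        "mean_abs_diff": sum(abs(j - h) for j, h in pairs) / n,
    }


# Made-up scores on the 0-4 professionalism scale for five responses.
judge = [4, 2, 3, 0, 4]
human = [4, 3, 3, 1, 2]
print(agreement_report(judge, human))
```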

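Summary

This was my first time managing LLM experiments with MLflow, and it had more useful features than I expected.
LLMs have made it possible to train and tune (where training mostly means in-context learning) without knowing the details of the model architecture, but evaluating, saving, comparing, and sharing results still has a fairly high barrier to entry. For services built on LLMs, however, understanding and guaranteeing at least a minimum level of performance will become essential, and I think MLflow is very helpful in that situation.

Also, when managing many experiments, for example while searching over models and parameters, doing it by hand in a spreadsheet is tiring and takes time away from the work that actually matters, which defeats the purpose. So I would like to let MLflow take as much of that load as possible.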
