
Trying to Build a Paper Classification Model ~Training Data Redo Edition~

Published 2023/02/17

The story so far

Sakasegawa set out to build an automatic classification model to route arXiv papers into the right Discord channels, but standing in her way was the formidable task of creating training data. She schemed to pad the training data with GPT-3.5, and failed. What will become of her?!

https://zenn.dev/sakasegawa/articles/29feab76ca9afb#gpt-3.5で教師データを作る

Colab

https://colab.research.google.com/drive/1D6DjywwhiGcC90LHVvv8C-VHCSsoMVte?usp=sharing
https://colab.research.google.com/drive/1pL5kygulMD2pba9Foeq6slQVxgJZbTBL?usp=sharing

Redoing the training data

Looking back at the previous approach to creating training data, a few possible improvements stand out:

  • Were the GPT-3.5 settings (temperature and friends) correct?
  • Improving the prompt
  • Checking up front whether the soft-label representations produced by GPT-3.5 approximate the manual labels well

On the GPT-3.5 temperature settings

Reading the GPT3Mix paper cited last time (what, you hadn't read it before? ...yes, guilty......), we find:

For GPT-3 generation, top-p and the temperature was set to 1 and the frequency penalty was set to 0.02

So, following this and the top-p paper cited in the text, I revise the parameters: temperature: 0.3 -> 1, top-p: 1 -> 1, frequency penalty: 1.0 -> 0.02. A sketch of the resulting API call follows below the citation.

https://arxiv.org/abs/1904.09751
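As a minimal sketch, here is roughly what those settings look like with the pre-v1 openai Python package; the model name, placeholder prompt, and max_tokens are my assumptions, not confirmed from the post.

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

prompt_text = "..."  # the classification prompt shown in the next section

# Revised sampling settings; model and max_tokens are placeholder assumptions.
response = openai.Completion.create(
    model="text-davinci-003",   # assumed model
    prompt=prompt_text,
    temperature=1,              # was 0.3
    top_p=1,                    # unchanged
    frequency_penalty=0.02,     # was 1.0
    max_tokens=512,             # assumed
)
print(response["choices"][0]["text"])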

Improving the prompt

Following the well-known CoT paper, I add a reasoning field. There is a long line of prior work on prompt-engineering techniques for boosting reasoning ability, such as Least-to-Most, Decomposed Prompting, PAL, X-Prompt, and Faithful Chain-of-Thought Reasoning, but multi-stage inference like PAL is costly, so I don't adopt it this time.
Also, although I couldn't find a paper on this, writing the input / output in JSON format, as shown in the blog below, makes it easy to separate the reasoning field from the conclusion field. I adopt this too, hoping for easier post-processing and a boost to in-context learning performance. The idea traces back to the fact that the GPT-3.5 family is built on Codex and is good at code generation. Restricting the answer format can constrain reasoning ability, but I expect this JSON format to avoid that loss while still yielding stable reasoning.

https://www.sensibledefaults.io/blog/chatgpt/gpt-json

I also simplified the categories and gave each one a description.

The prompt below incorporates the improvements described above.

Given title and abstract of an arXiv paper, classify it into one of the following categories and output the probabilities, reasoning step by step. Also, if there is no clear relevance to any of the genres, classify it as others.

## Categories

- text: Genres related to language models and NLP
- speech: Genres related to speech recognition and synthesis
- music: Genres related to music generation
- image: Genres related to image recognition and generation
- object: Genres related to object recognition
- video: Genres related to video recognition
- 3d: Genres related to NeRF (neural radiance field) and 3D
- others: Genres that have no strong relevance to the above

## Input / Output

input: {title: title, summary: summary}
output: {reason: reason, category: category_probabilities}

## Test 1

input: {title: "Improved Online Conformal Prediction via Strongly Adaptive Online Learning", summary: "We study the problem of uncertainty quantification via prediction sets, in an online setting where the data distribution may vary arbitrarily over time. Recent work develops online conformal prediction techniques that leverage regret minimization algorithms from the online learning literature to learn prediction sets with approximately valid coverage and small regret. However, standard regret minimization could be insufficient for handling changing environments, where performance guarantees may be desired not only over the full time horizon but also in all (sub-)intervals of time. We develop new online conformal prediction methods that minimize the strongly adaptive regret, which measures the worst-case regret over all intervals of a fixed length. We prove that our methods achieve near-optimal strongly adaptive regret for all interval lengths simultaneously, and approximately valid coverage. Experiments show that our methods consistently obtain better coverage and smaller prediction sets than existing methods on real-world tasks, such as time series forecasting and image classification under distribution shift."}
output: {reason: "This paper focuses on the development and evaluation of methods for learning prediction sets, with an emphasis on uncertainty quantification in the field of machine learning and particularly online learning. The techniques developed aim to handle changing environments and varying data distributions, and the paper proposes a new approach for minimizing strongly adaptive regret to achieve approximately valid coverage for all interval lengths. While this paper may be applicable to object recognition, speech recognition and synthesis, image recognition and generation, and 3D, it does not strongly relate to any of these specific genres. It may also have relevance to natural language processing (NLP), but the methods proposed are not limited to text data and can be applied to other types of data as well. Therefore, this paper can be classified as 'others'.", category: {text: 0, speech: 0, music: 0, image: 0, object: 0, video: 0, 3d: 0, others: 1}}

## Test 2

input: {title: "One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2", summary: "While recent research has progressively overcome the low-resolution constraint of one-shot face video re-enactment with the help of StyleGAN's high-fidelity portrait generation, these approaches rely on at least one of the following: explicit 2D/3D priors, optical flow based warping as motion descriptors, off-the-shelf encoders, etc., which constrain their performance (e.g., inconsistent predictions, inability to capture fine facial details and accessories, poor generalization, artifacts). We propose an end-to-end framework for simultaneously supporting face attribute edits, facial motions and deformations, and facial identity control for video generation. It employs a hybrid latent-space that encodes a given frame into a pair of latents: Identity latent, ID, and Facial deformation latent, F, that respectively reside in the W+ and SS spaces of StyleGAN2. Thereby, incorporating the impressive editability-distortion trade-off of W+ and the high disentanglement properties of SS. These hybrid latents employ the StyleGAN2 generator to achieve high-fidelity face video re-enactment at 10242. Furthermore, the model supports the generation of realistic re-enactment videos with other latent-based semantic edits (e.g., beard, age, make-up, etc.). Qualitative and quantitative analyses performed against state-of-the-art methods demonstrate the superiority of the proposed approach."} 
output: {reason: "This paper proposes an end-to-end framework for one-shot face video re-enactment, which employs hybrid latent spaces of StyleGAN2 to achieve high-fidelity face video re-enactment. A quantitative and qualitative analysis is performed and shows the superiority of the proposed approach. This paper is strongly related to video recognition and generation, and may also be related to image recognition and generation and object recognition. Therefore, the paper should be classified into 'video': 0.7, 'image': 0.2, 'object': 0.1", category: {text: 0, speech: 0, music: 0, image: 0.2, object: 0.1, video: 0.7, 3d: 0, others: 0}}

## Test 3

input: {title: title, summary: summary}
output:

Using this, I check in the OpenAI Playground beforehand that it behaves as intended and that the performance looks promising.
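For reference, a minimal sketch of filling the "## Test 3" slot for a given paper programmatically; the PROMPT_TEMPLATE name and the json.dumps-based quoting are my assumptions, not taken from the post.

import json

# PROMPT_TEMPLATE is assumed to hold everything up through the
# "## Test 3" header, ending right after "input: ".
def build_prompt(template: str, title: str, summary: str) -> str:
    # json.dumps quotes the fields so stray quotes or braces in an
    # abstract don't break the pseudo-JSON format.
    paper = "{title: %s, summary: %s}" % (json.dumps(title), json.dumps(summary))
    return template + paper + "\noutput:"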

Comparing the labels produced by the improved prompt against manually assigned labels

Using the prompt above, I create labels with the same procedure as last time. Note that this time the GPT-3.5 output takes a JSON-like form such as the following.

{reason: This paper introduces a technique called Story Shaping, in which a reinforcement learning agent infers tacit knowledge from an exemplar story of how to accomplish a task and intrinsically rewards itself for performing actions that make its current environment adhere to that of the inferred story world. This paper is strongly related to text and may also be related to object recognition. Therefore, the paper should be classified into text: 0.7, object: 0.3, category: {text: 0.7, speech: 0, music: 0, image: 0, object: 0.3, video: 0, 3d: 0, others: 0}}

The following code turns this string into proper JSON (I wanted to write it more cleanly).

import json
import re
import traceback

def to_json(text: str) -> dict:
    # Process the reason part: everything before "category:"
    reason = text[0:text.find("category:")]
    replace_dict = {'"': '', "'": '', "\n": "",
                    'reason:': '"reason": "',
                    }

    for key, value in replace_dict.items():
        reason = reason.replace(key, value)
    reason = reason.strip() + '"}'

    # Process the category part: quote the bare keys so it parses as JSON
    # (relies on dicts preserving insertion order, Python 3.7+)
    category = "{" + text[text.find("category:"):]
    replace_dict = {"{": '{"', ":": '":', ",": ',"', '," ': ',"',
                    "'": '"', "\n": "",
                    }

    for key, value in replace_dict.items():
        category = category.replace(key, value)
    category = re.sub(r"\s+", "", category)

    try:
        reason_json = json.loads(reason)
        category_json = json.loads(category)
        reason_json['reason'] = reason_json['reason'].strip()
        reason_json['category'] = category_json['category']
        return reason_json
    except Exception:
        traceback.print_exc()
        # Fall back to an empty reason and a hard "others" label
        return json.loads('{"reason": "", "category": {"others": 1}}')

The category with the largest value in the soft-label representation is taken as the estimated category.

estimate_category = max(soft_category, key=soft_category.get)
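Put together, for the JSON-like string shown above (held in a hypothetical raw_output variable), the flow looks like this:

parsed = to_json(raw_output)
soft_category = parsed["category"]  # e.g. {"text": 0.7, "object": 0.3, ...}
estimate_category = max(soft_category, key=soft_category.get)  # -> "text"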

Now, I compare this against the manual labels. I prepared 200 manually labeled examples.
The comparison comes out to 65.5% accuracy.
Looking at the individual cases, though, the model often made a more correct categorization call than I did.
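The accuracy figure itself is a one-liner; a sketch assuming a hypothetical labels.csv with manual_label and estimated_label columns:

import pandas as pd

df = pd.read_csv("labels.csv")  # hypothetical file holding both label sets
accuracy = (df["manual_label"] == df["estimated_label"]).mean()
print(f"accuracy: {accuracy:.1%}")  # the post reports 65.5%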

Visualizing the raw number of papers assigned to each category gives the following.
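A sketch of that visualization, assuming the same hypothetical columns as above:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("labels.csv")  # hypothetical file from the previous sketch
df["estimated_label"].value_counts().plot(kind="bar", title="Papers per category")
plt.tight_layout()
plt.show()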

The training data is imbalanced to begin with, with the others category by far the largest. For model building, with only around 1,000 examples total, the minority categories would likely end up under 100 examples each. Ouch.
The data created this time is available here:

https://docs.google.com/spreadsheets/d/1K1iFHXmAgHBQICxRnS5az2O-tDREiogaQMWIGRnp7a4/edit?usp=sharing

Lessons learned

  • When designing a text classification task, don't create ambiguous classes
  • When classifying with GPT-3.5, it is best to lean on the "common sense" GPT-3.5 has already learned
    • Binary positive/negative sentiment classification probably reaches solid accuracy even zero-shot
    • The categories themselves were a narrow slice of CS papers I picked by hand this time; it is better to ask the model "how are CS papers categorized?", have it produce the list, and classify into that
  • An alternative approach is to fine-tune on CoT prompts generated by GPT-3.5
  • Hammering GPT-3.5 with few-shot prompts to solve a classification task is expensive
    • It cost me $30 ;;

To be continued in the final installment
