Trying out the text-extraction model "NuExtract-1.5-smol"

I found out about it here.
https://x.com/etiennebcp/status/1857427310569259149
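📢 A new member joins the NuExtract family 📢
We (@numind_ai) have released NuExtract 1.5 smol, based on @huggingface's SmolLM 1.7B!
It is a small model, but don't let its size fool you. This model is quite capable!
huggingface.co/numind/NuExtract-1.5-smol
More details:
We have received many requests for a smaller version of NuExtract.

We previously released NuExtract 1.5 tiny, based on Qwen 2.5 0.5B, but it had the following limitations:
It has no multilingual support
It lacks the continuation capability (the ability to keep extracting when given the information that has already been extracted)
Hugging Face recently released SmolLM 2 1.7B. The model showed excellent benchmark results, so we decided to try it.
The result... it turned out to perform very well!

We were able to give it the same capabilities as NuExtract 1.5, reaching performance close to the original NuExtract at less than half the size.
Please make use of this model! If you have feedback, let us know here or on our Discord server (join here).

NuExtract 2.0 is also in development, and we want to deliver the best we can!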


Quoted from https://x.com/etiennebcp/status/1857427310569259149
It supports multiple languages.
The model is here:
https://huggingface.co/numind/NuExtract-1.5-smol

kun432

Official blog post
https://numind.ai/blog/nuextract-1-5---multilingual-infinite-context-still-small-and-better-than-gpt-4o
Below is a ChatGPT-generated summary of the opening part.

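What is NuExtract?
NuExtract is a small open-source model specialized in extracting information from documents and outputting it as JSON. This specialization is what gives it high accuracy. For example, on an English zero-shot benchmark, NuExtract 1.5 (3.8B parameters) outperforms GPT-4o while being 500 times smaller.
Advantages of NuExtract:
It can be used without sharing your data.
Fine-tuning gives strong performance on specific tasks.
It is well suited to repetitive tasks and to processing confidential data.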

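Overview of NuExtract 1.5
The first version of NuExtract was well received, so the project was extended. The two main requests were:
The ability to process long documents
The ability to handle documents in languages other than English
To address these, NuExtract 1.5 was trained on a multilingual dataset using a more recent base model. In addition, a "continuation" capability provides effectively unlimited context size while keeping memory usage down.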

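Multilingual support
There were many requests for NuExtract to support languages other than English, so Phi-3.5 mini was adopted as the base model. It covers many languages (Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian).
The training dataset uses documents from the C4 dataset in English and other major languages (French, German, Spanish, Italian, Portuguese). Each document is paired with a template written either in English or in the document's own language, which makes multilingual documents easier to handle.
The dataset is purely extractive; abstraction and reformulation capabilities are planned for the next release.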

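Infinite context
NuExtract 1.5 supports a context size of 128k tokens (about 200 pages), but processing long documents takes memory and compute. Beyond 10,000 tokens memory usage rises sharply, to the point of requiring 1 TB of GPU memory.
To solve this, NuExtract has a "continuation" capability that performs extraction while reusing previously extracted information, so documents of any length can be processed with a sliding window. Memory usage is then bounded by the window size.
The strategy has drawbacks: the output has to be generated multiple times, and performance may degrade when the window is small. It is nevertheless effective for long documents.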

kun432

I ran this on Colaboratory with a T4 runtime.

I ran the sample code as is. On Colaboratory no package installation was needed.

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def predict_NuExtract(model, tokenizer, texts, template, batch_size=1, max_length=10_000, max_new_tokens=4_000):
    template = json.dumps(json.loads(template), indent=4)
    prompts = [f"""<|input|>\n### Template:\n{template}\n### Text:\n{text}\n\n<|output|>""" for text in texts]
    
    outputs = []
    with torch.no_grad():
        for i in range(0, len(prompts), batch_size):
            batch_prompts = prompts[i:i+batch_size]
            batch_encodings = tokenizer(batch_prompts, return_tensors="pt", truncation=True, padding=True, max_length=max_length).to(model.device)

            pred_ids = model.generate(**batch_encodings, max_new_tokens=max_new_tokens)
            outputs += tokenizer.batch_decode(pred_ids, skip_special_tokens=True)

    return [output.split("<|output|>")[1] for output in outputs]

model_name = "numind/NuExtract-1.5-smol"
device = "cuda"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

text = """We introduce Mistral 7B, a 7–billion-parameter language model engineered for
superior performance and efficiency. Mistral 7B outperforms the best open 13B
model (Llama 2) across all evaluated benchmarks, and the best released 34B
model (Llama 1) in reasoning, mathematics, and code generation. Our model
leverages grouped-query attention (GQA) for faster inference, coupled with sliding
window attention (SWA) to effectively handle sequences of arbitrary length with a
reduced inference cost. We also provide a model fine-tuned to follow instructions,
Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and
automated benchmarks. Our models are released under the Apache 2.0 license.
Code: <https://github.com/mistralai/mistral-src>
Webpage: <https://mistral.ai/news/announcing-mistral-7b/>"""

template = """{
    "Model": {
        "Name": "",
        "Number of parameters": "",
        "Number of max token": "",
        "Architecture": []
    },
    "Usage": {
        "Use case": [],
        "Licence": ""
    }
}"""

prediction = predict_NuExtract(model, tokenizer, [text], template)[0]
print(prediction)

Output

{
    "Model": {
        "Name": "Mistral 7B",
        "Number of parameters": "7\u00d7\u00d713B",
        "Number of max token": "",
        "Architecture": [
            "grouped-query attention (GQA)",
            "sliding window attention (SWA)"
        ]
    },
    "Usage": {
        "Use case": [
            "reasoning",
            "mathematics",
            "code generation"
        ],
        "Licence": "Apache 2.0"
    }
}

Inference took roughly 5 seconds, excluding model loading.
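
For reference, a figure like this can be measured with a small wrapper around the call. The sketch below is only an assumption about how one might time it, not how the number above was obtained. `timed` is a hypothetical helper, and the `sync` hook is there because GPU work is asynchronous, so on CUDA you would want to synchronize before reading the clock.

```python
import time

def timed(fn, *args, sync=None, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds).

    sync is an optional zero-argument callable (for example
    torch.cuda.synchronize) called before each clock read so that
    pending GPU work is counted. It is an assumption of this sketch.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if sync is not None:
        sync()
    return result, time.perf_counter() - start

# Stand-in workload so the sketch runs without a model.
result, seconds = timed(sum, range(1_000_000))
print(f"{seconds:.3f}s")
```

With the model loaded, the call would look something like `timed(predict_NuExtract, model, tokenizer, [text], template, sync=torch.cuda.synchronize)`.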

Next, swap the whole input text for Japanese.

text = """Mistral 7Bをご紹介します。これは、優れた性能と効率を追求して設計された70億パラメータの言語モデルです。
Mistral 7Bは、すべての評価ベンチマークにおいて最良のオープン13Bモデル（Llama 2）を上回り、さらに推論、数学、
コード生成の分野では最良の34Bモデル（Llama 1）も凌駕しています。本モデルは、グループクエリアテンション（GQA）を
活用して高速な推論を可能にし、さらにスライディングウィンドウアテンション（SWA）を組み合わせて、任意の長さのシーケンス
を効率的に処理しつつ推論コストを削減しています。また、指示に従うようファインチューニングされたモデル「Mistral 7B
 – Instruct」も提供しており、Llama 2 13B – チャットモデルを、人間および自動ベンチマークの両方で上回る性能を持ちます。
 これらのモデルは、Apache 2.0ライセンスの下で公開されています。

コード: https://github.com/mistralai/mistral-src
ウェブページ: https://mistral.ai/news/announcing-mistral-7b/"""

template = """{
    "Model": {
        "Name": "",
        "Number of parameters": "",
        "Number of max token": "",
        "Architecture": []
    },
    "Usage": {
        "Use case": [],
        "Licence": ""
    }
}"""

prediction = predict_NuExtract(model, tokenizer, [text], template)[0]
print(prediction)

Output

{
    "Model": {
        "Name": "Mistral 7B",
        "Number of parameters": "70億パラメータ",
        "Number of max token": "",
        "Architecture": [
            "Llama 2",
            "Llama 1"
        ]
    },
    "Usage": {
        "Use case": [
            "GQA",
            "SWA"
        ],
        "Licence": "Apache 2.0"
    }
}

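The result is a bit off: the comparison models (Llama 2, Llama 1) ended up under Architecture, and GQA/SWA ended up under Use case. It looks like it is hallucinating a little.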

kun432

Processing long context with a sliding window.

I use "Hashire Merosu" (Run, Melos!) from Aozora Bunko.

! wget https://www.aozora.gr.jp/cards/000035/files/1567_ruby_4948.zip
! unzip 1567_ruby_4948.zip

import re

with open('hashire_merosu.txt', 'r', encoding='SJIS') as f:
    data = f.read()
    data = re.sub(r'-------------------------------------------------------.*?-------------------------------------------------------\n', '', data, flags=re.DOTALL)
    data = re.sub(r'［＃地から１字上げ[\s\S]*$', '', data)
    data = re.sub("\n\u3000", "", data)
    data = re.sub("《[^》]+》", "", data)

print(data[:100])
print("〜")
print(data[-100:])
print(f"({len(data)})")

Output

走れメロス
太宰治

メロスは激怒した。必ず、かの邪智暴虐の王を除かなければならぬと決意した。メロスには政治がわからぬ。メロスは、村の牧人である。笛を吹き、羊と遊んで暮して来た。けれども邪悪に対しては
〜
気をきかせて教えてやった。
「メロス、君は、まっぱだかじゃないか。早くそのマントを着るがいい。この可愛い娘さんは、メロスの裸体を、皆に見られるのが、たまらなく口惜しいのだ。」勇者は、ひどく赤面した。

(9862)
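
The ruby-stripping step in the cell above can be checked on a single sentence. This is a minimal sketch: `strip_ruby` is a hypothetical helper name, and removing the full-width bar `｜` (which Aozora Bunko notation uses to mark where a ruby base starts) is an extra step assumed here, not something the cell above does.

```python
import re

def strip_ruby(text: str) -> str:
    """Remove Aozora Bunko ruby readings such as 《じゃち》 from text."""
    # Drop the reading enclosed in 《...》 (same pattern as the cell above).
    text = re.sub("《[^》]+》", "", text)
    # Assumption: also drop the full-width bar that marks the ruby base start.
    return text.replace("｜", "")

sample = "かの邪智暴虐《じゃちぼうぎゃく》の王を除かなければならぬ"
print(strip_ruby(sample))
```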

import json

MAX_INPUT_SIZE = 20_000
MAX_NEW_TOKENS = 6000

def clean_json_text(text):
    text = text.strip()
    # raw strings avoid the invalid-escape warning that "\#" and "\&" trigger
    text = text.replace(r"\#", "#").replace(r"\&", "&")
    return text

def predict_chunk(text, template, current, model, tokenizer):
    current = clean_json_text(current)

    input_llm =  f"<|input|>\n### Template:\n{template}\n### Current:\n{current}\n### Text:\n{text}\n\n<|output|>" + "{"
    input_ids = tokenizer(input_llm, return_tensors="pt", truncation=True, max_length=MAX_INPUT_SIZE).to("cuda")
    output = tokenizer.decode(model.generate(**input_ids, max_new_tokens=MAX_NEW_TOKENS)[0], skip_special_tokens=True)

    return clean_json_text(output.split("<|output|>")[1])

def split_document(document, window_size, overlap):
    tokens = tokenizer.tokenize(document)
    print(f"\tLength of document: {len(tokens)} tokens")

    chunks = []
    if len(tokens) > window_size:
        for i in range(0, len(tokens), window_size-overlap):
            print(f"\t{i} to {i + len(tokens[i:i + window_size])}")
            chunk = tokenizer.convert_tokens_to_string(tokens[i:i + window_size])
            chunks.append(chunk)

            if i + len(tokens[i:i + window_size]) >= len(tokens):
                break
    else:
        chunks.append(document)
    print(f"\tSplit into {len(chunks)} chunks")

    return chunks

def handle_broken_output(pred, prev):
    try:
        if all([(v in ["", []]) for v in json.loads(pred).values()]):
            # if empty json, return previous
            pred = prev
    except Exception:
        # if broken json, return previous
        pred = prev

    return pred

def sliding_window_prediction(text, template, model, tokenizer, window_size=4000, overlap=128):
    # split text into chunks of n tokens (split_document tokenizes internally)
    chunks = split_document(text, window_size, overlap)

    # iterate over text chunks
    prev = template
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i}...")
        pred = predict_chunk(chunk, template, prev, model, tokenizer)

        # handle broken output
        pred = handle_broken_output(pred, prev)
            
        # iterate
        prev = pred

    return pred

template = """{
    "Characters": [
        {
            "Name": "",
            "Role": "",
            "Characteristics": []
        }
    ]
}"""

prediction = sliding_window_prediction(data, template, model, tokenizer, window_size=1000, overlap=200)
print(prediction)
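
One thing to be aware of with `handle_broken_output`: its emptiness test only looks at top-level values, so with a nested template such as the `Characters` one above, an output that is still completely blank is not treated as empty. Below is a minimal sketch of a recursive variant. `is_empty` and `keep_or_fallback` are hypothetical names and are not part of the NuExtract sample code.

```python
import json

def is_empty(value):
    """Return True if value holds no extracted text at any nesting level."""
    if isinstance(value, dict):
        return all(is_empty(v) for v in value.values())
    if isinstance(value, list):
        return all(is_empty(v) for v in value)
    return value in ("", None)

def keep_or_fallback(pred: str, prev: str) -> str:
    """Return pred if it is valid, non-empty JSON; otherwise return prev."""
    try:
        parsed = json.loads(pred)
    except json.JSONDecodeError:
        return prev  # broken JSON: keep the previous state
    return prev if is_empty(parsed) else pred

blank = '{"Characters": [{"Name": "", "Role": "", "Characteristics": []}]}'
filled = '{"Characters": [{"Name": "メロス", "Role": "", "Characteristics": []}]}'
print(keep_or_fallback(blank, filled) == filled)               # blank output falls back
print(keep_or_fallback('{"Characters": [', filled) == filled)  # broken JSON falls back
print(keep_or_fallback(filled, blank) == filled)               # real output is kept
```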

Each chunk is processed in turn, like this.

Output

	Length of document: 17598 tokens
	0 to 1000
	800 to 1800
	1600 to 2600
	2400 to 3400
	3200 to 4200
	4000 to 5000
	4800 to 5800
	5600 to 6600
	6400 to 7400
	7200 to 8200
	8000 to 9000
	8800 to 9800
	9600 to 10600
	10400 to 11400
	11200 to 12200
	12000 to 13000
	12800 to 13800
	13600 to 14600
	14400 to 15400
	15200 to 16200
	16000 to 17000
	16800 to 17598
	Split into 22 chunks
Processing chunk 0...
Processing chunk 1...
Processing chunk 2...
Processing chunk 3...
Processing chunk 4...
(snip)
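
The chunk boundaries in the log above follow directly from the stride, which is `window_size - overlap`. The sketch below reproduces them without a tokenizer. `window_spans` is a hypothetical helper written for illustration; it mirrors the loop in `split_document` and, like it, assumes `overlap` is smaller than `window_size`.

```python
def window_spans(n_tokens: int, window_size: int, overlap: int):
    """Return (start, end) token index pairs for a sliding window."""
    if n_tokens <= window_size:
        return [(0, n_tokens)]
    stride = window_size - overlap  # how far each window advances
    spans = []
    for start in range(0, n_tokens, stride):
        end = min(start + window_size, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break  # the last window reached the end of the document
    return spans

spans = window_spans(17598, window_size=1000, overlap=200)
print(len(spans), spans[0], spans[-1])
```

Each span costs one `generate` call, so the 22 windows here mean 22 generations for a single document.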

Result

Output

{
    "Characters": [
        {
            "Name": "メロス",
            "Role": "刑吏",
            "Characteristics": [
                "黒い風",
                "死なせて下さい"
            ]
        },
        {
            "Name": "セリヌンティウス",
            "Role": "",
            "Characteristics": [
                "黒い風",
                "釣り上げられてゆく"
            ]
        }
    ]
}

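My impression is that the processing time depends on the size of the source document, the window size, and the complexity of the template. It seems better to avoid deep nesting in the template and to write specific keys. I'm not yet sure how best to tune these.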

kun432

The variants are as follows.
A 3.8B model based on Phi-3.5-mini-instruct
https://huggingface.co/numind/NuExtract-1.5
A 0.5B model based on Qwen2.5-0.5B
https://huggingface.co/numind/NuExtract-1.5-tiny
Some people have published GGUF conversions
https://huggingface.co/bartowski/NuExtract-v1.5-GGUF
https://huggingface.co/MaziyarPanahi/NuExtract-1.5-smol-GGUF

kun432

I also tried the 3.8B NuExtract-1.5. The Colaboratory runtime was an L4.

When I ran it without any preparation, a flash-attention warning appeared, so I installed a prebuilt wheel. The background is here.

!wget https://github.com/kun432/flash-attention-prebuild-wheels/releases/download/v0.0.0-test/flash_attn-2.6.3+cu121torch2.5-cp310-cp310-linux_x86_64.whl
!pip install --no-dependencies --upgrade flash_attn-2.6.3+cu121torch2.5-cp310-cp310-linux_x86_64.whl

After that, follow the README.

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "numind/NuExtract-v1.5"
device = "cuda"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

import json

MAX_INPUT_SIZE = 20_000
MAX_NEW_TOKENS = 6000

def clean_json_text(text):
    text = text.strip()
    # raw strings avoid the invalid-escape warning that "\#" and "\&" trigger
    text = text.replace(r"\#", "#").replace(r"\&", "&")
    return text

def predict_chunk(text, template, current, model, tokenizer):
    current = clean_json_text(current)

    input_llm =  f"<|input|>\n### Template:\n{template}\n### Current:\n{current}\n### Text:\n{text}\n\n<|output|>" + "{"
    input_ids = tokenizer(input_llm, return_tensors="pt", truncation=True, max_length=MAX_INPUT_SIZE).to("cuda")
    output = tokenizer.decode(model.generate(**input_ids, max_new_tokens=MAX_NEW_TOKENS)[0], skip_special_tokens=True)

    return clean_json_text(output.split("<|output|>")[1])

def split_document(document, window_size, overlap):
    tokens = tokenizer.tokenize(document)
    print(f"\tLength of document: {len(tokens)} tokens")

    chunks = []
    if len(tokens) > window_size:
        for i in range(0, len(tokens), window_size-overlap):
            print(f"\t{i} to {i + len(tokens[i:i + window_size])}")
            chunk = tokenizer.convert_tokens_to_string(tokens[i:i + window_size])
            chunks.append(chunk)

            if i + len(tokens[i:i + window_size]) >= len(tokens):
                break
    else:
        chunks.append(document)
    print(f"\tSplit into {len(chunks)} chunks")

    return chunks

def handle_broken_output(pred, prev):
    try:
        if all([(v in ["", []]) for v in json.loads(pred).values()]):
            # if empty json, return previous
            pred = prev
    except Exception:
        # if broken json, return previous
        pred = prev

    return pred

def sliding_window_prediction(text, template, model, tokenizer, window_size=4000, overlap=128):
    # split text into chunks of n tokens (split_document tokenizes internally)
    chunks = split_document(text, window_size, overlap)

    # iterate over text chunks
    prev = template
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i}...")
        pred = predict_chunk(chunk, template, prev, model, tokenizer)

        # handle broken output
        pred = handle_broken_output(pred, prev)
            
        # iterate
        prev = pred

    return pred

On the L4, it fit in memory without adjusting the window size.

template = """{
    "Characters": [
        {
            "Name": "",
            "Role": "",
            "Characteristics": []
        }
    ]
}"""

prediction = sliding_window_prediction(data, template, model, tokenizer)
print(prediction)

Output

	Length of document: 12281 tokens
	0 to 4000
	3872 to 7872
	7744 to 11744
	11616 to 12281
	Split into 4 chunks
Processing chunk 0...
Processing chunk 1...
Processing chunk 2...
Processing chunk 3...
{
    "Characters": [
        {
            "Name": "メロス",
            "Role": "",
            "Characteristics": [
                "単純な男",
                "悪びれずに",
                "信じる事が出来ぬ"
            ]
        },
        {
            "Name": "ディオニス",
            "Role": "王",
            "Characteristics": [
                "暴君",
                "憫笑",
                "悧巧"
            ]
        },
        {
            "Name": "セリヌンティウス",
            "Role": "石工",
            "Characteristics": [
                "友人",
                "無言",
                "首肯き"
            ]
        },
        {
            "Name": "若い衆",
            "Role": "",
            "Characteristics": []
        },
        {
            "Name": "老爺",
            "Role": "",
            "Characteristics": []
        },
        {
            "Name": "妹",
            "Role": "",
            "Characteristics": []
        }
    ]
}
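
The two logs also show how differently the two tokenizers split the same Japanese text. A quick calculation from the numbers printed above, assuming `data` was the same 9,862-character string in both runs:

```python
n_chars = 9862        # len(data) printed after cleaning the text
tokens_smol = 17598   # token count logged with NuExtract-1.5-smol
tokens_phi = 12281    # token count logged with NuExtract-1.5 (3.8B)

ratio_smol = tokens_smol / n_chars
ratio_phi = tokens_phi / n_chars
print(f"smol: {ratio_smol:.2f} tokens/char")
print(f"3.8B: {ratio_phi:.2f} tokens/char")
print(f"smol needs {tokens_smol / tokens_phi:.2f}x as many tokens")
```

So for this text a window of the same token size covers noticeably less of the story with the smol tokenizer, which is worth keeping in mind when comparing window sizes between the two models.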

Compared with NuExtract-1.5-smol, accuracy does seem better, as you would expect from the larger parameter count.

kun432

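Summary (I forgot to write one)
As mentioned above, Japanese does work, but I'm not sure about the accuracy. Judging from the benchmarks, the multilingual results look better, but from what I tried this time English gave the better results.
Also,
My impression is that the processing time depends on the size of the source document, the window size, and the complexity of the template. It seems better to avoid deep nesting in the template and to write specific keys. I'm not yet sure how best to tune these.
This reminds me of the Function Calling best practices I read a while ago: keep the structure as flat as possible with little nesting, because deep or complex structures reduce accuracy. Models have probably become much smarter since then. To get good extractions I think I need to spend more time using it.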
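
To make "flat versus nested" concrete, here is a small sketch that measures the nesting depth of a template and shows one possible flatter rewrite of the Mistral template used earlier. `template_depth` is a hypothetical helper and the key names in `flat` are illustrative assumptions. I have not measured whether the flat version actually extracts better.

```python
import json

def template_depth(node) -> int:
    """Return the nesting depth of a JSON template (a scalar leaf is 0)."""
    if isinstance(node, dict):
        return 1 + max((template_depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((template_depth(v) for v in node), default=0)
    return 0

nested = {
    "Model": {"Name": "", "Number of parameters": "", "Architecture": []},
    "Usage": {"Use case": [], "Licence": ""},
}
# Hypothetical flatter rewrite with more specific key names.
flat = {
    "Model name": "",
    "Number of parameters": "",
    "Attention mechanisms": [],
    "Licence": "",
}
print(template_depth(nested), template_depth(flat))
print(json.dumps(flat, indent=4))  # string form to pass as the template
```
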
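Another point, and a difference from Function Calling, is this part of the model card.
Note: This model is trained to prioritize pure extraction, so in most cases all text generated by the model is present as is in the original text.
You need to keep in mind that this is strictly extraction. In fact, these days its positioning may be closer to Structured Output.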
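
Because the output is supposed to be copied from the input, one cheap sanity check is to list extracted values that do not occur verbatim in the source text. This is a minimal sketch: `non_verbatim_values` is a hypothetical helper, and the sample reuses the "Number of parameters" value from the English run above with a shortened source sentence.

```python
import json

def non_verbatim_values(prediction: str, source: str):
    """Return extracted leaf strings that do not occur verbatim in source."""
    missing = []

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)
        elif isinstance(node, str) and node and node not in source:
            missing.append(node)

    walk(json.loads(prediction))
    return missing

source = "We introduce Mistral 7B, a 7-billion-parameter language model."
prediction = '{"Model": {"Name": "Mistral 7B", "Number of parameters": "7\\u00d7\\u00d713B"}}'
print(non_verbatim_values(prediction, source))
```

This only catches strings that were invented. It would not flag the Japanese run above, where every value exists in the text but sits under the wrong key.
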
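For now I can't think of a use case that matches my own needs, but as the official blog post says,
If you need high extraction performance on repetitive tasks, or need to process confidential data, NuExtract is the best fit.
Being able to run it locally is what matters here, I think. A general-purpose model could probably do the same job, so the decision comes down to weighing which performs better and which gives better performance per unit of resources.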

