🫢

👌 Phi-4-mini × Copilot+ PC (Intel Ultra2) is blazing fast

阿久津

Published on 2025/04/23

Bottom line: on a Copilot+ PC (Intel Ultra2), phi4-mini (8-bit) running on the integrated GPU generates 21.09 tokens/sec.

Background

Copilot+ PCs are advertised as shipping with an NPU rated at 40 TOPS or more.

That sounds powerful.
https://www.microsoft.com/ja-jp/windows/copilot-plus-pcs?msockid=30f69b18bea863210f1e8ecebff96209&r=1
I wanted to see how it performs in practice, so I benchmarked it with an SLM (Small Language Model).
Specifically, I ran Phi-4-mini on both the NPU and the GPU integrated into the CPU.

Test hardware

Intel Core Ultra 7 258V
NPU Driver: 32.0.100.3967
GPU Driver: 32.0.101.6556

DRAM 32GB
Windows 11
In the rest of this article, "GPU" refers to the Intel Arc 140V.

NPU vs. GPU: results

The results of running Phi-4-mini in int8 and float16 are shown below.
The compute figures (TOPS) are quoted from the spec sheets.


| Compute unit | Model quantization | Token generation speed (tokens/sec) | Spec-sheet TOPS |
|---|---|---|---|
| GPU | int8 | 21.09 | 64 |
| NPU | int8 | 3.13 | 47 |
| NPU | (float16) | 2.89 | - |

Note, however, that although NPU × int8 does run, the model output comes out as gibberish. The quantization probably needs some extra care.
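To put the results in perspective, the short sketch below divides the measured speeds by the spec-sheet TOPS. The numbers are the ones reported in the results table; the dictionary layout and variable names are only for illustration.

```python
# Measured tokens/sec and spec-sheet TOPS, copied from the results table.
results = {
    "GPU int8": {"tokens_per_sec": 21.09, "tops": 64},
    "NPU int8": {"tokens_per_sec": 3.13, "tops": 47},
}

gpu = results["GPU int8"]
npu = results["NPU int8"]

# How much faster the GPU run is, versus how much more compute it has on paper.
speed_ratio = gpu["tokens_per_sec"] / npu["tokens_per_sec"]
tops_ratio = gpu["tops"] / npu["tops"]

print(f"measured speed ratio (GPU/NPU): {speed_ratio:.2f}x")
print(f"spec TOPS ratio (GPU/NPU):      {tops_ratio:.2f}x")

# Tokens/sec obtained per spec-sheet TOPS on each device.
for name, r in results.items():
    print(f"{name}: {r['tokens_per_sec'] / r['tops']:.3f} tokens/sec per TOPS")
```

The GPU run is roughly 6.7 times faster while having only about 1.4 times the nominal compute, which suggests the gap is not explained by raw TOPS alone.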

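Thoughts

ChatGPT's token generation speed is currently said to be around 20 tokens/sec [citation needed], so the GPU in a Copilot+ PC can already generate text at a speed comparable to ChatGPT.
That said, this result does not simply mean "the GPU beats the NPU!". What I find impressive about the NPU is that an SLM can run without putting any load on the GPU, in other words without degrading rendering performance.
At present, token generation on the GPU is far faster than on the NPU. However, the spec-sheet compute figures are not far apart, and because this is a unified memory architecture both units share the same memory bandwidth. Given that, I expect the NPU to reach a sufficient generation speed once NPU-side optimization progresses.
The results also show that the optimization and quantization in OpenVINO, which is used for the GPU run, are already quite good.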
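The memory-bandwidth argument can be made a little more concrete with a back-of-the-envelope estimate. During decoding, every weight is read roughly once per generated token, so dividing bandwidth by model size gives a rough ceiling on tokens/sec. The hardware and model figures below are assumptions of mine, not measurements from this article: about 3.8 billion parameters for Phi-4-mini, and about 136.5 GB/s for LPDDR5X-8533 on a 128-bit bus. The function name is hypothetical.

```python
def decode_ceiling_tokens_per_sec(
    bandwidth_gb_s: float, params_billion: float, bytes_per_weight: float
) -> float:
    """Rough ceiling on decode speed if reading the weights once per token dominates."""
    model_size_gb = params_billion * bytes_per_weight
    return bandwidth_gb_s / model_size_gb


# ASSUMPTIONS (not taken from the measurements in this article):
BANDWIDTH_GB_S = 136.5  # LPDDR5X-8533 on a 128-bit bus: 8533 MT/s * 16 bytes
PARAMS_BILLION = 3.8    # approximate parameter count of Phi-4-mini

ceiling_int8 = decode_ceiling_tokens_per_sec(BANDWIDTH_GB_S, PARAMS_BILLION, 1.0)
ceiling_fp16 = decode_ceiling_tokens_per_sec(BANDWIDTH_GB_S, PARAMS_BILLION, 2.0)

# Measured speeds from the results table, paired with the matching ceiling.
measured = [
    ("GPU int8", 21.09, ceiling_int8),
    ("NPU int8", 3.13, ceiling_int8),
    ("NPU float16", 2.89, ceiling_fp16),
]
for name, speed, ceiling in measured:
    print(f"{name}: {speed:.2f} of ~{ceiling:.1f} tokens/sec ({speed / ceiling:.0%} of ceiling)")
```

Under these assumptions the int8 ceiling is roughly 36 tokens/sec. The GPU run is already in the same range as that limit while the NPU runs are far below it, which would be consistent with the NPU gap being a software matter rather than a hardware one.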

Procedure: GPU

https://huggingface.co/OpenVINO/Phi-4-mini-instruct-int8-ov#running-model-inference-with-openvino-genai
import time

import huggingface_hub as hf_hub
import openvino_genai as ov_genai
from transformers import AutoTokenizer

# Download the pre-quantized int8 OpenVINO model from Hugging Face.
model_id = "OpenVINO/Phi-4-mini-instruct-int8-ov"
model_path = "Phi-4-mini-instruct-int8-ov"
hf_hub.snapshot_download(model_id, local_dir=model_path)

# Build the generation pipeline on the integrated GPU.
device = "GPU"
pipe = ov_genai.LLMPipeline(model_path, device)

# The Hugging Face tokenizer is used only to count the generated tokens.
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Time a single generation call.
start_time = time.time()
output = pipe.generate("What is OpenVINO?", max_length=200)
elapsed = time.time() - start_time

print(output)
tokens = tokenizer.tokenize(output)
num_tokens = len(tokens)
print(f"Generation time: {elapsed:.4f} s")
print(f"Generated tokens: {num_tokens}")
if elapsed > 0:
    print(f"Tokens per second: {num_tokens/elapsed:.2f} tokens/sec")
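One caveat with a single timed call is that the first call to a pipeline may include one-time costs such as model compilation (an assumption on my part; it was not profiled here). Below is a minimal sketch of a device-agnostic timing helper with a warm-up run and several repeats. `measure_tokens_per_sec`, `generate_fn` and `count_tokens_fn` are hypothetical names, not part of OpenVINO GenAI, and the fake generator only stands in for a real pipeline so that the snippet runs anywhere.

```python
import statistics
import time
from typing import Callable, List


def measure_tokens_per_sec(
    generate_fn: Callable[[str], str],
    count_tokens_fn: Callable[[str], int],
    prompt: str,
    warmup: int = 1,
    repeats: int = 3,
) -> List[float]:
    """Return tokens/sec for each timed run, after untimed warm-up runs."""
    for _ in range(warmup):
        generate_fn(prompt)  # warm-up call, not timed

    speeds = []
    for _ in range(repeats):
        start = time.perf_counter()
        text = generate_fn(prompt)
        elapsed = time.perf_counter() - start
        speeds.append(count_tokens_fn(text) / elapsed)
    return speeds


# Stand-in for a real pipeline: "generates" 20 whitespace-separated tokens.
def fake_generate(prompt: str) -> str:
    time.sleep(0.01)
    return " ".join(["tok"] * 20)


speeds = measure_tokens_per_sec(
    fake_generate, lambda s: len(s.split()), "What is OpenVINO?"
)
print(f"median: {statistics.median(speeds):.2f} tokens/sec over {len(speeds)} runs")
```

With the objects from the GPU procedure, one would presumably pass something like `lambda p: str(pipe.generate(p, max_length=200))` as `generate_fn` and `lambda s: len(tokenizer.tokenize(s))` as `count_tokens_fn`.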

Procedure: NPU

To run computation on the Intel NPU, the model has to be converted via the library below.

OpenVINO and DirectML both advertise NPU support, but as far as I can tell this library is what runs as their backend.
https://github.com/intel/intel-npu-acceleration-library?tab=readme-ov-file
To run Phi-4-mini, start from the Phi-3 example below.
https://github.com/intel/intel-npu-acceleration-library/blob/main/examples/phi-3.py
Using it as a reference, I ran the NPU benchmark with the following code.
import torch
from transformers import AutoTokenizer, pipeline, TextStreamer
from intel_npu_acceleration_library.compiler import CompilerConfig
import intel_npu_acceleration_library as npu_lib
import warnings

torch.random.manual_seed(0)

compiler_conf = CompilerConfig(dtype=torch.float16)
model = npu_lib.NPUModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",
    config=compiler_conf,
    torch_dtype="auto",
    weights_only=False,
    export=False,
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-4-mini-instruct",
    trust_remote_code=True,
    export=False,
    use_default_system_prompt=True,
)
streamer = TextStreamer(tokenizer, skip_prompt=True)

messages = [
    {
        "role": "system",
        "content": "You are a helpful digital assistant. Please provide safe, ethical and accurate information to the user.",
    },
    {
        "role": "user",
        "content": "Can you provide ways to eat combinations of bananas and dragonfruits?",
    },
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    trust_remote_code=True,
)

generation_args = {
    "max_new_tokens": 100,
    "return_full_text": False,
    "temperature": 0.7,
    "do_sample": True,
    "streamer": streamer,
}

import time

# Time one generation call and report tokens/sec.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    start = time.time()
    # Exclude the streamer from this timed run.
    run_args = {k: v for k, v in generation_args.items() if k != "streamer"}
    outputs = pipe(messages, **run_args)
    elapsed = time.time() - start
    text = outputs[0]["generated_text"]
    print(text)
    num_tokens = len(tokenizer.tokenize(text))
    print(f"\nGenerated tokens: {num_tokens}, elapsed: {elapsed:.2f}s, speed: {num_tokens/elapsed:.2f} tok/s")
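As for the gibberish answers when running in int8, my guess is that the library's current quantization implementation is close to a simple rescaling, as in the code below, and that the weights carry more information after the step from Phi-3-mini to Phi-4-mini. (This still needs a more detailed analysis.)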
def quantize_fit(
    model: torch.nn.Module, weights_dtype: str, algorithm: str = "RTN"
) -> torch.nn.Module:
    """Quantize a model with a given configuration.

    Args:
        model (torch.nn.Module): The model to quantize
        weights_dtype (str): The datatype for the weights
        algorithm (str, optional): The quantization algorithm. Defaults to "RTN".

    Raises:
        RuntimeError: Quantization error: unsupported datatype

    Returns:
        torch.nn.Module: The quantized model
    """
    if weights_dtype == "int4":
        bits = 4
    elif weights_dtype == "int8":
        bits = 8
    else:
        raise RuntimeError(f"Quantization error: unsupported datatype {weights_dtype}")

    conf = PostTrainingQuantConfig(
        approach="weight_only",
        tuning_criterion=TuningCriterion(timeout=100000),
        op_type_dict={
            ".*": {  # match all ops
                "weight": {
                    "dtype": weights_dtype,
                    "bits": bits,
                    "group_size": -1,
                    "scheme": "sym",
                    "algorithm": algorithm,
                },
                "activation": {
                    "dtype": "fp16",
                },
            }
        },
    )

    return fit(model=model, conf=conf)
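To illustrate what "close to a simple rescaling" can mean, here is a minimal pure-Python sketch of symmetric round-to-nearest (RTN) quantization with a single scale per weight row, which is how I read `"scheme": "sym"` with `"group_size": -1` in the config above. It is a simplified illustration under that assumption, not the library's actual implementation; the function names and the example weights are made up.

```python
from typing import List, Tuple


def rtn_quantize_row(row: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization with one scale for the whole row."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q: List[int], scale: float) -> List[float]:
    """Map the integer codes back to floating-point weights."""
    return [v * scale for v in q]


# A row of small weights, with and without a single large outlier (made-up values).
small = [0.01, -0.02, 0.015, -0.005]
with_outlier = small + [8.0]

for name, row in [("no outlier", small), ("with outlier", with_outlier)]:
    q, scale = rtn_quantize_row(row)
    restored = dequantize_row(q, scale)
    max_err = max(abs(a - b) for a, b in zip(row, restored))
    print(f"{name}: scale={scale:.6f}, q={q}, max abs error={max_err:.6f}")
```

In the outlier case the single large weight sets the scale, and all of the small weights round to zero. If Phi-4-mini's weights have wider ranges than Phi-3-mini's, an effect of this kind could plausibly contribute to the degraded output, though that remains a hypothesis.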


Headwaters

The tech blog of Headwaters Co., Ltd. We post on a wide range of Data & AI and app modernization topics, including AI agents, generative AI, LLMs, Azure services and certifications, IoT, and XR!
