Fitting Phi-4 multimodal into 12GB of VRAM
Why
Running the Phi-4 multimodal sample code as-is does not fit into 12GB of VRAM.
I haven't dug into the details, but 16GB looks like enough to run it unmodified.
I want to run it on my RTX 4080 Laptop GPU with 12GB, so here is how I made it fit.
What
Assumptions
- Only the image and text modalities are used
- I want to poke around inside the model with the Python debugger, so I prefer to stay as close to the stock implementation as possible
(Without this assumption, going straight to the ONNX quantized version would probably be quicker)
Given those assumptions, I do the following (a sketch for checking where the parameters live comes right after this list).
- Remove the audio encoder / projector (keep only what is actually used)
- Fix the internal variables affected by that removal
- Shrink the input image size
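To see whether the audio stack is actually worth deleting, you can print the parameter count of each submodule. This is a rough sketch that assumes the model has already been loaded as in the sample code below; the submodule path model.model.embed_tokens_extend comes from the model's remote code, but treat the exact names as assumptions.

# Rough sketch: parameter count per submodule of the loaded model,
# to see which parts (e.g. the audio encoder) are worth removing.
def param_count(module):
    return sum(p.numel() for p in module.parameters())

for name, child in model.model.named_children():
    print(f"{name}: {param_count(child) / 1e6:.0f}M params")
for name, child in model.model.embed_tokens_extend.named_children():
    print(f"embed_tokens_extend.{name}: {param_count(child) / 1e6:.0f}M params")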
Modified sample code
The original is here.
import requests
import torch
import os
import io
from PIL import Image
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from urllib.request import urlopen
# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"
# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation='flash_attention_2'
).cuda()
## Remove the audio encoder / projector (keep only what is actually used)
if hasattr(model, 'model'):
    if hasattr(model.model, 'embed_tokens_extend'):
        if hasattr(model.model.embed_tokens_extend, 'audio_embed'):
            del model.model.embed_tokens_extend.audio_embed
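# Not in the original: after the del, the freed weights stay in PyTorch's
# caching allocator rather than being returned to the driver. If you want
# nvidia-smi to reflect the drop right away, you could additionally run:
#     import gc; gc.collect(); torch.cuda.empty_cache()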
# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)
# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
## Shrink the input image size
# A smaller image yields fewer image crops/tokens, which reduces activation memory
image = image.resize((256, 256))
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')
## Fix the internal variables affected by removing the audio encoder / projector
## (set the audio input to None so the forward pass never touches the deleted audio path)
inputs.data["input_audio_embeds"] = None
# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
Input / output
>>> Prompt
<|user|><|image_1|>What is shown in this image?<|end|><|assistant|>
>>> Response
The image features a stop sign on a pole, located on a city street. The stop sign is positioned in the foreground, with a red and white color scheme. In the background, there is a car parked on the street, and a building can be seen further back.
There are also two people in the scene, one standing closer to the stop sign and the other person is located further back on the street. The presence of the stop sign and the people in the scene suggest that it is a busy urban area with traffic and pedestrians.
VRAM usage
11.6GB, just barely fits!
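For reference, one way to check peak usage from inside the script is via torch.cuda's memory statistics. This is only a sketch; the 11.6GB figure above may well have come from nvidia-smi, which also includes the CUDA context and cached blocks.

# Sketch: reset the peak-memory counter before generation, read it afterwards.
torch.cuda.reset_peak_memory_stats()
# ... run model.generate(...) here ...
print(f"peak allocated VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")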
Closing
This kind of customizability is what makes OSS models great!
Incidentally, an end-user comment from our company about the Phi-4 series was featured on the Azure blog.