Fitting Phi-4 multimodal into 12GB of VRAM
Why
Running the Phi-4 multimodal sample code as-is does not fit into 12GB of VRAM.
I haven't dug into the details, but 16GB looks like enough to run it unmodified.
I want to run it on my RTX 4080 Laptop GPU with 12GB, so here is how I made it fit.
What
Assumptions
- Only the image and text modalities are used
- I want to poke around inside the model with the Python debugger, so I prefer to stay as close to the stock implementation as possible
(Without this assumption, going straight to the ONNX quantized version would probably be quicker)
Given those assumptions, I do the following (a sketch for checking where the parameters live comes right after this list).
- Remove the audio encoder / projector (keep only what is actually used)
- Fix the internal variables affected by that removal
- Shrink the input image size
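To see whether the audio stack is actually worth deleting, you can print the parameter count of each submodule. This is a rough sketch that assumes the model has already been loaded as in the sample code below; the submodule path model.model.embed_tokens_extend comes from the model's remote code, but treat the exact names as assumptions.

# Rough sketch: parameter count per submodule of the loaded model,
# to see which parts (e.g. the audio encoder) are worth removing.
def param_count(module):
    return sum(p.numel() for p in module.parameters())

for name, child in model.model.named_children():
    print(f"{name}: {param_count(child) / 1e6:.0f}M params")
for name, child in model.model.embed_tokens_extend.named_children():
    print(f"embed_tokens_extend.{name}: {param_count(child) / 1e6:.0f}M params")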
Modified sample code
The original is here.
import requests
import torch
import os
import io
from PIL import Image
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from urllib.request import urlopen
# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"
# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation='flash_attention_2'
).cuda()
## Remove the audio encoder / projector (keep only what is actually used)
if hasattr(model, 'model'):
    if hasattr(model.model, 'embed_tokens_extend'):
        if hasattr(model.model.embed_tokens_extend, 'audio_embed'):
            del model.model.embed_tokens_extend.audio_embed
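# Not in the original: after the del, the freed weights stay in PyTorch's
# caching allocator rather than being returned to the driver. If you want
# nvidia-smi to reflect the drop right away, you could additionally run:
#     import gc; gc.collect(); torch.cuda.empty_cache()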
# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)
# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
## Shrink the input image size
# A smaller image yields fewer image crops/tokens, which reduces activation memory
image = image.resize((256, 256))
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')
## Fix the internal variables affected by removing the audio encoder / projector
## (set the audio input to None so the forward pass never touches the deleted audio path)
inputs.data["input_audio_embeds"] = None
# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
Input / output
>>> Prompt
<|user|><|image_1|>What is shown in this image?<|end|><|assistant|>
>>> Response
The image features a stop sign on a pole, located on a city street. The stop sign is positioned in the foreground, with a red and white color scheme. In the background, there is a car parked on the street, and a building can be seen further back.
There are also two people in the scene, one standing closer to the stop sign and the other person is located further back on the street. The presence of the stop sign and the people in the scene suggest that it is a busy urban area with traffic and pedestrians.
VRAM usage
11.6GB, just barely fits!
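For reference, one way to check peak usage from inside the script is via torch.cuda's memory statistics. This is only a sketch; the 11.6GB figure above may well have come from nvidia-smi, which also includes the CUDA context and cached blocks.

# Sketch: reset the peak-memory counter before generation, read it afterwards.
torch.cuda.reset_peak_memory_stats()
# ... run model.generate(...) here ...
print(f"peak allocated VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")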
Closing
This kind of customizability is what makes OSS models great!
Incidentally, an end-user comment from our company about the Phi-4 series was featured on the Azure blog.