Running Qwen3-VL-30B-A3B in a Local Environment

Published on 2025/10/05

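Introduction

Qwen3-VL-30B-A3B has been released, so I tried running it locally right away.
This article targets Macs with 96GB or more of unified memory. With less than that, loading the model will probably fail because there is not enough VRAM.

Note: On an NVIDIA GPU, inference may be possible with 4-bit quantization if you have 24GB or more of VRAM.
However, I do not own such a GPU and cannot verify this, so NVIDIA GPUs are out of scope for this article.

That said, on an NVIDIA GPU things will probably just work if you follow the official documentation.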
https://github.com/QwenLM/Qwen3-VL
https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct
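
As a rough sanity check on the memory figures above, the size of the weights can be estimated from the parameter count and the number of bits stored per parameter. The sketch below assumes about 30 billion parameters (read off the "30B" in the model name, not measured) and ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: ~30e9 parameters, taken from the "30B" in the model name.
NUM_PARAMS = 30e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions, 16-bit weights come to roughly 60 GB, which fits the 96GB target for Macs, and 4-bit weights come to roughly 15 GB, which is why a 24GB GPU might be enough.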

Environment

Mac Studio (M2 Ultra, 128GB)

Environment setup on Mac

$ git clone https://github.com/QwenLM/Qwen3-VL.git
$ cd Qwen3-VL
$ uv init --python 3.12
$ uv venv
$ uv pip install git+https://github.com/huggingface/transformers # possibly unnecessary, since transformers is upgraded again below
$ uv pip install qwen-vl-utils==0.0.14
$ uv pip install -r requirements_web_demo.txt
$ uv pip install "websockets>=13.0"
$ uv pip install gradio transformers -U

Note: uv pip install -r requirements_web_demo.txt fails as-is, so edit requirements_web_demo.txt as shown below before running it.

requirements_web_demo.txt
 # Core dependencies
-gradio==5.46.1
+gradio
 gradio_client==1.13.1
 transformers-stream-generator==0.0.5
 git+https://github.com/huggingface/transformers.git # this may also be unnecessary
 torch
 torchvision
 accelerate

 # Optional dependency
 # Uncomment the following line if you need flash-attn
 # flash-attn
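
Several of the install commands above overwrite each other, so it can help to confirm which versions actually ended up in the virtual environment. This is a minimal sketch using only the standard library; the package list is my own pick from the commands above, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(
        ["transformers", "gradio", "qwen-vl-utils", "torch", "accelerate"]
    ).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Saved under a name such as check_env.py (any name works), it can be run inside the project with uv run check_env.py.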

Downloading the model

Download the model into ../models/Qwen/Qwen3-VL-30B-A3B-Instruct beforehand.
https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct
An example command:
$ uv pip install hf_transfer
$ HF_HUB_ENABLE_HF_TRANSFER=1 uv run hf download Qwen/Qwen3-VL-30B-A3B-Instruct --local-dir ../models/Qwen/Qwen3-VL-30B-A3B-Instruct
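
A checkpoint of this size is split across many shard files, and an interrupted download is easy to miss. The sketch below checks that every shard named in the index file exists on disk. It assumes the standard sharded safetensors layout (a `model.safetensors.index.json` file with a `weight_map` entry); `missing_shards` is a hypothetical helper, not part of any library.

```python
import json
from pathlib import Path


def missing_shards(model_dir):
    """Return the shard files listed in the index that are absent from model_dir."""
    model_dir = Path(model_dir)
    index_path = model_dir / "model.safetensors.index.json"
    index = json.loads(index_path.read_text())
    # weight_map maps each tensor name to the shard file that stores it.
    shard_names = set(index["weight_map"].values())
    return sorted(name for name in shard_names if not (model_dir / name).is_file())


if __name__ == "__main__":
    model_dir = "../models/Qwen/Qwen3-VL-30B-A3B-Instruct"
    if Path(model_dir, "model.safetensors.index.json").is_file():
        print("missing shards:", missing_shards(model_dir) or "none")
    else:
        print("index file not found; is the download finished?")
```

An empty result means every shard referenced by the index is present; it does not verify file contents or checksums.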

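Running

I used web_demo_mm.py this time. Two places in the source need to be patched first. Without the torch_dtype change the model needs about twice as much memory, and without the longer streamer timeout the response can time out.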
web_demo_mm.py
--- a/web_demo_mm.py
+++ b/web_demo_mm.py
@@ -99,7 +99,7 @@ def _load_model_processor(args):
                                                                     attn_implementation='flash_attention_2',
                                                                     device_map=device_map)
         else:
-            model = AutoModelForImageTextToText.from_pretrained(args.checkpoint_path, device_map=device_map)
+            model = AutoModelForImageTextToText.from_pretrained(args.checkpoint_path, torch_dtype='auto', device_map=device_map)
 
         processor = AutoProcessor.from_pretrained(args.checkpoint_path)
         return model, processor, 'hf'
@@ -226,7 +226,7 @@ def _launch_demo(args, model, processor, backend):
             )
 
             tokenizer = processor.tokenizer
-            streamer = TextIteratorStreamer(tokenizer, timeout=20.0, skip_prompt=True, skip_special_tokens=True)
+            streamer = TextIteratorStreamer(tokenizer, timeout=120.0, skip_prompt=True, skip_special_tokens=True)
 
             inputs = {k: v.to(model.device) for k, v in inputs.items()}
             gen_kwargs = {'max_new_tokens': 1024, 'streamer': streamer, **inputs}
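
The timeout change matters because the streamer hands tokens from the generation thread to the UI through a queue, and the consumer gives up if nothing arrives within `timeout` seconds. On slower hardware the first token can presumably take longer than 20 seconds to arrive, so the read fails before generation really gets going. The sketch below imitates that mechanism with the standard library only; it is an analogy for the behaviour, not the `TextIteratorStreamer` implementation, and the delays are scaled down to fractions of a second.

```python
import queue
import threading
import time


def stream_tokens(first_token_delay, timeout):
    """Collect tokens from a slow producer; return None if the wait times out."""
    token_queue = queue.Queue()
    end_marker = object()

    def producer():
        time.sleep(first_token_delay)  # stands in for slow prompt processing
        for token in ["Hello", ",", " world"]:
            token_queue.put(token)
        token_queue.put(end_marker)

    threading.Thread(target=producer, daemon=True).start()

    tokens = []
    while True:
        try:
            item = token_queue.get(timeout=timeout)
        except queue.Empty:
            return None  # consumer gave up before the producer delivered
        if item is end_marker:
            return "".join(tokens)
        tokens.append(item)


print(stream_tokens(first_token_delay=0.3, timeout=0.05))  # timeout too short
print(stream_tokens(first_token_delay=0.3, timeout=2.0))   # timeout long enough
```

The first call gives up and returns None, while the second waits long enough to collect the full text, which is the difference between timeout=20.0 and timeout=120.0 in the patch.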

Then start the demo with the hf backend:
$ uv run web_demo_mm.py --checkpoint-path ../models/Qwen/Qwen3-VL-30B-A3B-Instruct --server-name 0.0.0.0 --backend hf

Warning: vLLM not available. Install vllm and qwen-vl-utils to use vLLM backend.
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:29<00:00,  2.30s/it]
/Volumes/20240625/AI/Qwen3-VL/web_demo_mm.py:339: UserWarning: You have not specified a value for the `type` parameter. Defaulting to the 'tuples' format for chatbot messages, but this is deprecated and will be removed in a future version of Gradio. Please set type='messages' instead, which uses openai-style dictionaries with 'role' and 'content' keys.
  chatbot = gr.Chatbot(label='Qwen3-VL', elem_classes='control-height', height=500)
* Running on local URL:  http://0.0.0.0:7860
* To create a public link, set `share=True` in `launch()`.
The app is now running. If it is running on the machine in front of you, open http://localhost:7860 to use the web app.

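Inference example

I summarized the results in the post below. Inference is fairly slow.
https://x.com/gosrum/status/1974825022306029954
The test image was a simple graph, but my impression is that the model reads graphs and Japanese text quite well.
On the other hand, in some other examples generation kept going without ever stopping, which is a weak point.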
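
One thing to try for the runaway generations is to pass extra decoding options through `gen_kwargs` in web_demo_mm.py. The sketch below only builds the dictionary. `repetition_penalty` and `max_new_tokens` are standard `generate()` arguments in transformers, but the values here are untested guesses rather than recommended settings, and `build_gen_kwargs` is a hypothetical helper.

```python
def build_gen_kwargs(inputs, streamer, max_new_tokens=1024, repetition_penalty=1.05):
    """Assemble keyword arguments for model.generate() with a repetition penalty."""
    gen_kwargs = {
        "max_new_tokens": max_new_tokens,          # hard cap on the reply length
        "repetition_penalty": repetition_penalty,  # values > 1.0 discourage repeats
        "streamer": streamer,
    }
    gen_kwargs.update(inputs)  # input_ids, attention_mask, pixel_values, ...
    return gen_kwargs


# Stand-in values so the sketch runs without loading the model.
example = build_gen_kwargs({"input_ids": [[1, 2, 3]]}, streamer=None)
print(sorted(example))
```

In the demo, the existing gen_kwargs line would be replaced by a call to this helper. Note that max_new_tokens already bounds the worst case at 1024 tokens, so the penalty is only an attempt to make the model stop sooner on its own.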

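Bonus: Can it run on EVO-X2?

Environment setup and inference did work on EVO-X2, but the whole PC froze fairly often.

It would not start with ROCm 6.4.4, so I used ROCm 7.9.0rc from TheRock.
The freeze rate is high enough to be a constant source of frustration, so I will not go into further detail here.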

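Summary

This article covered how to run Qwen3-VL-30B-A3B-Instruct on a Mac Studio.
It is slower than I expected, but the quality of its answers looks quite good.

I expect it to get faster once GGUF quantized versions become available.
There were also cases where the answer never finished and generation went on indefinitely, so finding parameters that prevent this would make the model more practical.
As an aside, it bothers me that nearly half a year after the EVO-X2 went on sale there is still no stable ROCm + PyTorch release for it.

I hope it becomes usable without the stress soon.
Thank you for reading, and I hope to see you again next time.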
