Running Qwen3-VL-30B-A3B Locally
Introduction
Qwen3-VL-30B-A3B has been released, so I'm going to try it out in a local environment right away.
Note that this article targets Macs with 96GB or more of unified memory. With less than that, model loading will likely fail due to insufficient memory.
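The 96GB figure can be sanity-checked with back-of-envelope arithmetic (my own estimate, counting weights only; the KV cache, activations, and the OS all add to this):

```python
# Rough memory estimate for a 30B-parameter model, weights only.
PARAMS = 30e9  # 30 billion parameters

def weights_gib(bytes_per_param: float) -> float:
    """Weight memory in GiB for a given parameter width."""
    return PARAMS * bytes_per_param / 1024**3

print(f"bf16/fp16: {weights_gib(2):.0f} GiB")  # ~56 GiB, fits in 96 GB
print(f"fp32:      {weights_gib(4):.0f} GiB")  # ~112 GiB, does not fit
```

This is also why the torch_dtype='auto' fix later in this article matters: loading in float32 roughly doubles the footprint.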
Environment
- Mac Studio (M2 Ultra, 128GB)
Environment Setup on Mac
$ git clone https://github.com/QwenLM/Qwen3-VL.git
$ cd Qwen3-VL
$ uv init --python 3.12
$ uv venv
$ uv pip install git+https://github.com/huggingface/transformers # It might be unnecessary since transformers is updated later
$ uv pip install qwen-vl-utils==0.0.14
$ uv pip install -r requirements_web_demo.txt
$ uv pip install "websockets>=13.0"
$ uv pip install gradio transformers -U
Downloading the Model
Download the model in advance to ../models/Qwen/Qwen3-VL-30B-A3B-Instruct.
Below is an example of the commands.
$ uv pip install hf_transfer
$ HF_HUB_ENABLE_HF_TRANSFER=1 uv run hf download Qwen/Qwen3-VL-30B-A3B-Instruct --local-dir ../models/Qwen/Qwen3-VL-30B-A3B-Instruct
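The same download can also be scripted from Python. This is a hedged sketch — `target_dir` and `download` are helper names I made up — but `snapshot_download` is the real huggingface_hub API, and the repo id and directory layout mirror the commands above:

```python
# Hypothetical alternative to the CLI: download via the huggingface_hub API.
from pathlib import Path

REPO_ID = "Qwen/Qwen3-VL-30B-A3B-Instruct"

def target_dir(models_root: str = "../models") -> str:
    """Local directory convention used in this article: <root>/<repo_id>."""
    return str(Path(models_root) / REPO_ID)

def download(models_root: str = "../models") -> str:
    # Imported locally so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=REPO_ID, local_dir=target_dir(models_root))
```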
Execution
This time, I used web_demo_mm.py. Two places in the source code need to be modified beforehand: without the first change the model loads in float32 and requires roughly twice the memory, and without the second the streamer can time out before the (slow) generation produces output.
--- a/web_demo_mm.py
+++ b/web_demo_mm.py
@@ -99,7 +99,7 @@ def _load_model_processor(args):
attn_implementation='flash_attention_2',
device_map=device_map)
else:
- model = AutoModelForImageTextToText.from_pretrained(args.checkpoint_path, device_map=device_map)
+ model = AutoModelForImageTextToText.from_pretrained(args.checkpoint_path, torch_dtype='auto', device_map=device_map)
processor = AutoProcessor.from_pretrained(args.checkpoint_path)
return model, processor, 'hf'
@@ -226,7 +226,7 @@ def _launch_demo(args, model, processor, backend):
)
tokenizer = processor.tokenizer
- streamer = TextIteratorStreamer(tokenizer, timeout=20.0, skip_prompt=True, skip_special_tokens=True)
+ streamer = TextIteratorStreamer(tokenizer, timeout=120.0, skip_prompt=True, skip_special_tokens=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
gen_kwargs = {'max_new_tokens': 1024, 'streamer': streamer, **inputs}
$ uv run web_demo_mm.py --checkpoint-path ../models/Qwen/Qwen3-VL-30B-A3B-Instruct --server-name 0.0.0.0 --backend hf
Warning: vLLM not available. Install vllm and qwen-vl-utils to use vLLM backend.
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:29<00:00, 2.30s/it]
/Volumes/20240625/AI/Qwen3-VL/web_demo_mm.py:339: UserWarning: You have not specified a value for the `type` parameter. Defaulting to the 'tuples' format for chatbot messages, but this is deprecated and will be removed in a future version of Gradio. Please set type='messages' instead, which uses openai-style dictionaries with 'role' and 'content' keys.
chatbot = gr.Chatbot(label='Qwen3-VL', elem_classes='control-height', height=500)
* Running on local URL: http://0.0.0.0:7860
* To create a public link, set `share=True` in `launch()`.
The application is now up. If you are running it on your local machine, open http://localhost:7860 to access the web app.
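If you prefer a script to the web UI, the same loading path can be sketched as minimal standalone inference. This is my own hedged sketch, not code from the Qwen repository: `build_messages` and `run_inference` are hypothetical helper names, the `dtype='auto'` load mirrors the first fix above, and the `apply_chat_template` call assumes a recent transformers release.

```python
# Hypothetical minimal inference sketch (requires torch, transformers,
# qwen-vl-utils, and the downloaded checkpoint; paths are placeholders).

def build_messages(image_path: str, prompt: str) -> list:
    """Build a Qwen-VL chat message list with one image and one text turn."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }]

def run_inference(checkpoint: str, image_path: str, prompt: str) -> str:
    # Heavy imports kept local so build_messages stays usable without torch.
    from transformers import AutoModelForImageTextToText, AutoProcessor
    model = AutoModelForImageTextToText.from_pretrained(
        checkpoint, dtype="auto", device_map="auto")
    processor = AutoProcessor.from_pretrained(checkpoint)
    inputs = processor.apply_chat_template(
        build_messages(image_path, prompt),
        add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=1024)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

On an M-series Mac, device_map="auto" should place the model on MPS; the first token can take a while to arrive, which is the same slowness the raised streamer timeout works around in the demo.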

Inference Examples
As summarized in this post, the inference is relatively slow.
Even though the graph is simple, it handles both chart reading and Japanese text quite well.
On the other hand, in other examples the generation sometimes ran on indefinitely, which is a bit of a weak point.
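One workaround for the runaway generation (my own sketch, not from the Qwen repo) is to cap max_new_tokens in gen_kwargs and post-process the decoded text, cutting it at known Qwen stop markers:

```python
# Sketch of a post-processing guard against runaway generation:
# truncate the decoded text at the first known stop marker.
def truncate_at_stop(text: str,
                     stop_sequences=("<|im_end|>", "<|endoftext|>")) -> str:
    """Cut `text` at the first occurrence of any stop sequence."""
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text

print(truncate_at_stop("The chart shows sales rising.<|im_end|>garbage"))
```

Tuning standard generate() parameters such as repetition_penalty may also help, though I have not verified which values work best for this model.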
Bonus: Can it run on EVO-X2?
I was able to set up the environment and run inference on the EVO-X2 as well, but the machine itself froze quite often.
By the way, it wouldn't start with ROCm 6.4.4, so I used ROCm 7.9.0rc from "The Rock."
The freezes are frequent enough to be stressful, so I won't go into further detail in this article.
Summary
In this article, I summarized how to run Qwen3-VL-30B-A3B-Instruct on a Mac Studio.
It's not as fast as I expected, but the output quality itself seems quite good.
I hope it will become faster if it gets quantized to GGUF in the future.
Also, since there were cases where the answer never terminated and generation continued indefinitely, knowing the right generation parameters to prevent that would make it much more practical.
Additionally, as a side note, it's concerning that a stable ROCm + PyTorch combination still hasn't been released for the EVO-X2, even though it's been nearly six months since its launch...
I hope we can use it stress-free soon.
Thank you for reading this far. See you in the next one!