Interpreting YouTube Videos and Comparing Model Sizes with Qwen2-VL
This is a memo from when I tried out Qwen2-VL after hearing that it's a good model for video-to-text tasks.
Apparently, Qwen supports not only its primary language, Chinese, but also major European languages, Japanese, Korean, Arabic, and Vietnamese.
In this article, I will briefly compare the GPU usage and generation results of the 2B and 7B models.
Library Installation
First, install the necessary libraries. It seems that transformers must be the latest version, otherwise Qwen will throw an error.
pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils
pip install yt-dlp
pip install flash-attn --no-build-isolation
pip install accelerate
Additionally, install the version of PyTorch that fits your specific environment.
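For reference, one way to pick a matching build is to check which CUDA version your driver supports and then install from the corresponding PyTorch wheel index (the `cu121` tag below is just an example; adjust it to your environment):

```shell
# Check the CUDA version reported by your driver
nvidia-smi

# Example: install a CUDA 12.1 build of PyTorch
pip install torch --index-url https://download.pytorch.org/whl/cu121
```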
Code
Download the specified YouTube video.
from yt_dlp import YoutubeDL
YOUTUBE_URL = "https://www.youtube.com/watch?v=wQ93J_HOD-c&t=480s&ab_channel=SUSURUTV."
# Download the video with the best image and sound quality
ydl_opts = {
"format": "best",
"outtmpl":"../data/youtube.mp4"
}
with YoutubeDL(ydl_opts) as ydl:
ydl.download([YOUTUBE_URL])
The video used for input this time is from "Ikedaya," a famous Jiro-style ramen shop in Ichijoji, Kyoto (well-known because searching for "Kyoto Ikedaya" often brings up results related to the Shinsengumi incident).
The following code is used for inference.
It follows the standard flow: loading the model, processing the input, and performing the inference.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
# Select the model to compare (uncomment the one you want to run)
# MODEL_NAME = "Qwen/Qwen2-VL-7B-Instruct"
MODEL_NAME = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
MODEL_NAME,
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(MODEL_NAME)
# When inputting a video
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "../data/youtube.mp4",
"max_pixels": 1280 * 780,
"fps": 0.1,
},
{"type": "text", "text": "Summarize the video."},
],
}
]
# When inputting an image
# messages = [
# {
# "role": "user",
# "content": [
# {
# "type": "image",
# "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
# },
# {"type": "text", "text": "Describe this image."},
# ],
# }
# ]
# Input preprocessing
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
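As a sanity check on memory use, it helps to estimate how many frames `fps=0.1` actually feeds the model: one frame every 10 seconds. The function below is my own back-of-the-envelope estimate, not the exact rounding that `qwen_vl_utils` performs internally.

```python
def sampled_frames(duration_s: float, fps: float) -> int:
    """Rough estimate of how many frames are sampled from a video.

    qwen_vl_utils samples at the requested rate; its exact rounding
    may differ slightly from this estimate.
    """
    return round(duration_s * fps)

# A ~9-minute video at fps=0.1 yields roughly one frame per 10 seconds:
print(sampled_frames(9 * 60, 0.1))  # → 54
```

So even a 9-minute video only contributes on the order of 54 frames to the context, which is why the VRAM numbers below stay manageable.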
First, let's have the 2B model summarize the video.
Inference Result (2B) "Summarize the video."
The video shows a young man standing outside a building with a yellow sign above it. He is wearing a blue and yellow scarf and a white shirt. He then walks inside a restaurant and sits at a table with other people. He is eating a large bowl of noodles with chopsticks. The video ends with the young man standing outside again.
Isn't it impressive that it managed to mention the large serving of noodles...?
I changed the prompt because I wanted it to identify the specific ramen being eaten in the video.
Inference Result (2B) "What kind of ramen is being eaten in this video?"
The video shows a young man eating ramen in a restaurant. He is wearing a white shirt and a blue and yellow scarf. He is holding chopsticks and eating the noodles. The restaurant has a sign that reads "SUSURU TV" and a menu with different sizes of ramen. The video also shows the young man standing on a street with a sign that reads "池田屋" and a sign that reads "SUSURU TV."
I wish it would answer in Japanese when asked with a Japanese prompt...
And the result remains just a summary...
That said, it's impressive that it properly recognizes text like "Ikedaya" and "SUSURU TV."
Next, the 7B model:
Inference Result (7B) "Summarize the video."
The video shows a man eating a large bowl of noodles while standing outside in the rain. He then sits down and continues eating the noodles
"What kind of ramen is being eaten in this video?"
The video shows a man eating a bowl of noodles with chopsticks. The caption mentions "ラーメン" which translates to "ramen" in English. Therefore, the man is eating ramen noodles.
It thinks he's eating ramen in the rain...
It gives the impression of saying things that are slightly off the mark, even though it seems smarter than the 2B model.
It's strange that the generated text is in English even though the model clearly understands the Japanese prompt.
Since it does return answers in Japanese when asked with image inputs, this might be an issue specific to video.
Re-running the inference with "Please answer in Japanese." added.
Inference Result (7B) "Please describe this video. In Japanese, please."
この動画は、雨の中を歩く男性がラーメン屋でラーメンを食べるシーンを収めたもので、日本語で解説されています。男性はラーメンを食べながら、周囲の景色や店舗の情報などを紹介しています。また、ラーメンの種類や価格、店舗の特徴なども紹介されています。最後には、ラーメンを食べ終えた男性が、満足の表情でカメラに向かって話しています。
(English Translation)
This video captures a man walking in the rain and then eating ramen at a ramen shop, narrated in Japanese. While eating, he introduces the surrounding scenery and information about the shop. The types and prices of the ramen and the shop's features are also covered. At the end, having finished his ramen, the man speaks to the camera with a satisfied expression.
This time it properly replied in Japanese, but it still outputs English every few tries.
Also, I've omitted the 2B results here as it didn't return any Japanese at all.
I realized that Qwen's video inference is quite optimized for summarization and QA; it seems likely to provide good accuracy if used in English.
Also, since things unique to Japan like "Jiro" are likely not included in the training set, I really feel the need for a domestically produced VLM (especially one capable of Video-to-Text).
GPU Usage
Having 24GB of VRAM is great because it eliminates the need to perform processes like lowering the resolution of YouTube videos for the 7B model. It is also very positive that the 2B model seems capable of inference on a 12GB GPU.
Inference time was approximately 42s for the 7B model and 20s for the 2B model.
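These timings were measured informally. For anyone reproducing them, a minimal sketch of a timing wrapper is shown below; `timed` is a hypothetical helper of mine, and the `torch.cuda` calls mentioned in the comments are the standard way to read peak VRAM after a run.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Usage (hypothetical), wrapping the generate call from the script above:
# generated_ids, secs = timed(model.generate, **inputs, max_new_tokens=128)
#
# For peak VRAM, call torch.cuda.reset_peak_memory_stats() before
# generation and torch.cuda.max_memory_allocated() after; the latter
# returns the peak in bytes.
```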
# When loading the 7B model
nvidia-smi
Fri Dec 13 19:25:43 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02 Driver Version: 555.42.02 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:01:00.0 Off | N/A |
| 0% 45C P8 30W / 350W | 16475MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
# During inference of a 9-minute video with the 7B model
nvidia-smi
Fri Dec 13 19:26:57 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02 Driver Version: 555.42.02 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:01:00.0 Off | N/A |
| 31% 51C P0 121W / 350W | 21447MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
# When loading the 2B model
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02 Driver Version: 555.42.02 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:01:00.0 Off | N/A |
| 0% 53C P0 121W / 350W | 4911MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
# During inference of a 9-minute video with the 2B model
| NVIDIA-SMI 555.42.02 Driver Version: 555.42.02 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:01:00.0 Off | N/A |
| 0% 40C P8 31W / 350W | 9843MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Discussion