Running the Video Generation AI CogVideoX Locally

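Published 2024/10/31
 What is CogVideoX
https://github.com/THUDM/CogVideo
The source code has been public on GitHub since the paper was published in 2022, and "CogVideoX-2B" was open-sourced on August 6, 2024. For a long time after that, CogVideoX held its position as the highest-performing video generation model that can run locally.
More capable models such as Pyramid-Flow have since been released, so CogVideoX now trails slightly in generation quality. In my experience, though, it produces more stable videos (with less collapse) than Pyramid-Flow, so it is still very much a usable model.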

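(Anything more capable than that will not run on something like an RTX 3060 in the first place.)
It was also integrated into the Diffusers library about two months ago, which makes it very easy to use.
CogVideoX can generate 6-second videos at 480p resolution and 8 FPS. Because the frame rate is only 8 FPS, the raw output looks quite choppy.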

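For that reason, it is common to raise the frame rate with RIFE, the frame interpolation technique that the CogVideoX demo also uses.

Once the frame rate is raised this way, you can reliably produce smooth, high-quality videos.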
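To make the idea of frame interpolation concrete, here is a minimal sketch. It is not RIFE: RIFE estimates motion between frames, whereas this toy version only cross-fades neighbouring frames. The function name `interpolate_linear` and the tiny two-pixel "frames" are assumptions made purely for illustration.

```python
def interpolate_linear(frames, factor=2):
    """Insert (factor - 1) blended frames between each pair of neighbours.

    Each frame is a flat list of pixel values. This is plain cross-fading,
    not RIFE: no motion is estimated.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if not frames:
        return []
    out = []
    for prev, nxt in zip(frames, frames[1:]):
        out.append(prev)
        for k in range(1, factor):
            t = k / factor  # blend weight of the later frame
            out.append([(1 - t) * a + t * b for a, b in zip(prev, nxt)])
    out.append(frames[-1])
    return out


# 49 tiny two-pixel "frames", standing in for a 49-frame CogVideoX clip
frames = [[float(i), float(2 * i)] for i in range(49)]
smooth = interpolate_linear(frames, factor=2)

print(len(frames), len(frames) / 8)    # frame count and duration in seconds at 8 FPS
print(len(smooth), len(smooth) / 16)   # frame count and duration in seconds at 16 FPS
```

Doubling the frame count while doubling the playback rate keeps the clip at roughly 6 seconds, which is why interpolation improves smoothness without changing the length.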
You can check the quality of the generated videos in the repository README.

https://github.com/THUDM/CogVideo/blob/main/README_ja.md#引用
In addition, you can try the video generation demo at the link below.

https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space
As usual, I cannot use a WebUI for religious reasons, so I will set things up to run from Python code.

 Environment setup
CogVideoX is integrated into the Diffusers library, so it is very easy to use.

CogVideoX comes in 5B and 2B variants. The larger 5B model ran on my PC, so I only try the 5B model here.

 Execution environment
OS: Ubuntu 20.04

GPU:RTX3060 12GB

CUDA:12.2

RAM:64GB

Python:3.11.7
Create a virtual environment with venv.

 Creating a virtual environment with venv
python -m venv env
source env/bin/activate

 Installing the required packages
Install them with the following command.
pip install transformers accelerate diffusers torch==2.4.1 sentencepiece opencv-python imageio imageio-ffmpeg
Note: If the torch version does not match the CUDA version, running the code produces an error like the following.
Traceback (most recent call last):
  File "/home/xxx/xxx/xxx/xxx/CogVideoX-sample/main.py", line 23, in <module>
    import torch
  File "/home/xxx/xxx/xxx/xxx/CogVideoX-sample/env_test/lib/python3.11/site-packages/torch/__init__.py", line 368, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: /home/xxx/xxx/xxx/xxx/CogVideoX-sample/env_test/lib/python3.11/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
torch==2.4.1 ran stably in my environment, so I use it here.

A different version may work better in your environment.
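Because the error above is raised by `import torch` itself, it can help to check which versions are installed without importing the packages. The following is a minimal sketch using only the standard library; the helper name `report_versions` is an assumption for illustration.

```python
from importlib import metadata


def report_versions(packages):
    """Return {package: installed version or None} without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    installed = report_versions(["torch", "diffusers", "transformers", "accelerate"])
    for name, version in installed.items():
        print(f"{name}: {version or 'not installed'}")
```

Once `import torch` succeeds, `torch.version.cuda` and `torch.cuda.is_available()` show which CUDA build is installed and whether the GPU is visible.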

 Writing the code
Prepare a main.py like the following.

(It is long, so it is collapsed.)
It is based on the implementation in the repository below, rearranged to suit my own use.

https://github.com/THUDM/CogVideo
Full code: main.py
from typing import Literal

import torch
from diffusers import (
    CogVideoXPipeline,
    CogVideoXDDIMScheduler,
    CogVideoXDPMScheduler,
    CogVideoXImageToVideoPipeline,
    CogVideoXVideoToVideoPipeline,
)

from diffusers.utils import export_to_video, load_image, load_video


def generate_video(
    prompt: str,
    model_path: str,
    lora_path: str | None = None,
    lora_rank: int = 128,
    output_path: str = "./output.mp4",
    image_or_video_path: str = "",
    num_inference_steps: int = 50,
    guidance_scale: float = 6.0,
    num_videos_per_prompt: int = 1,
    strength: float = 0.8,
    dtype: torch.dtype = torch.bfloat16,
    generate_type: Literal["t2v", "i2v", "v2v"] = "t2v",  # t2v: text to video, i2v: image to video, v2v: video to video
    seed: int = 42,
):

    image = None
    video = None

    print(image_or_video_path)

    if generate_type == "i2v":
        pipe = CogVideoXImageToVideoPipeline.from_pretrained(model_path, torch_dtype=dtype)
        image = load_image(image=image_or_video_path)
    elif generate_type == "t2v":
        pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype)
    else:
        pipe = CogVideoXVideoToVideoPipeline.from_pretrained(model_path, torch_dtype=dtype)
        video = load_video(image_or_video_path)

    # If you're using with lora, add this code
    if lora_path:
        pipe.load_lora_weights(lora_path, weight_name="pytorch_lora_weights.safetensors", adapter_name="test_1")
        pipe.fuse_lora(lora_scale=1 / lora_rank)

    pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")

    pipe.enable_sequential_cpu_offload()
    #pipe.enable_model_cpu_offload()

    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    if generate_type == "i2v":
        video_generate = pipe(
            prompt=prompt,
            image=image,  # The loaded input image that the video is generated from
            num_videos_per_prompt=num_videos_per_prompt,  # Number of videos to generate per prompt
            num_inference_steps=num_inference_steps,  # Number of inference steps
            num_frames=49,  # Number of frames to generate; changed to 49 for diffusers version `0.30.3` and after.
            use_dynamic_cfg=True,  # This is used for the DPM scheduler; for the DDIM scheduler it should be False
            guidance_scale=guidance_scale,
            generator=torch.Generator().manual_seed(seed),  # Set the seed for reproducibility
        ).frames[0]
    elif generate_type == "t2v":
        video_generate = pipe(
            prompt=prompt,
            num_videos_per_prompt=num_videos_per_prompt,
            num_inference_steps=num_inference_steps,
            num_frames=49,
            use_dynamic_cfg=True,
            guidance_scale=guidance_scale,
            generator=torch.Generator().manual_seed(seed),
        ).frames[0]
    else:
        video_generate = pipe(
            prompt=prompt,
            video=video,  # The loaded input video (a list of frames) to be transformed
            num_videos_per_prompt=num_videos_per_prompt,
            num_inference_steps=num_inference_steps,
            # num_frames=49,
            use_dynamic_cfg=True,
            strength=strength,
            guidance_scale=guidance_scale,
            generator=torch.Generator().manual_seed(seed),  # Set the seed for reproducibility
        ).frames[0]
    # Export the generated frames to a video file. fps must be 8 for the original video.
    export_to_video(video_generate, output_path, fps=8)


if __name__ == "__main__":

    dtype = torch.bfloat16
    generate_type = "t2v" # t2v: text to video, i2v: image to video, v2v: video to video
    image_or_video_path = "./videoframe_2013.png"
    #image_or_video_path = "./hiker.mp4"
    lora_path = None
    strength = 0.8

    prompts=[
        "A garden comes to life as a kaleidoscope of butterflies flutters amidst the blossoms, their delicate wings casting shadows on the petals below. In the background, a grand fountain cascades water with a gentle splendor, its rhythmic sound providing a soothing backdrop. Beneath the cool shade of a mature tree, a solitary wooden chair invites solitude and reflection, its smooth surface worn by the touch of countless visitors seeking a moment of tranquility in nature's embrace.",
        "An elderly gentleman, with a serene expression, sits at the water's edge, a steaming cup of tea by his side. He is engrossed in his artwork, brush in hand, as he renders an oil painting on a canvas that's propped up against a small, weathered table. The sea breeze whispers through his silver hair, gently billowing his loose-fitting white shirt, while the salty air adds an intangible element to his masterpiece in progress. The scene is one of tranquility and inspiration, with the artist's canvas capturing the vibrant hues of the setting sun reflecting off the tranquil sea.",
        "A golden retriever, sporting sleek black sunglasses, with its lengthy fur flowing in the breeze, sprints playfully across a rooftop terrace, recently refreshed by a light rain. The scene unfolds from a distance, the dog's energetic bounds growing larger as it approaches the camera, its tail wagging with unrestrained joy, while droplets of water glisten on the concrete behind it. The overcast sky provides a dramatic backdrop, emphasizing the vibrant golden coat of the canine as it dashes towards the viewer.",
        "On a brilliant sunny day, the lakeshore is lined with an array of willow trees, their slender branches swaying gently in the soft breeze. The tranquil surface of the lake reflects the clear blue sky, while several elegant swans glide gracefully through the still water, leaving behind delicate ripples that disturb the mirror-like quality of the lake. The scene is one of serene beauty, with the willows' greenery providing a picturesque frame for the peaceful avian visitors.",
        "A small boy, head bowed and determination etched on his face, sprints through the torrential downpour as lightning crackles and thunder rumbles in the distance. The relentless rain pounds the ground, creating a chaotic dance of water droplets that mirror the dramatic sky's anger. In the far background, the silhouette of a cozy home beckons, a faint beacon of safety and warmth amidst the fierce weather. The scene is one of perseverance and the unyielding spirit of a child braving the elements.",
    ]

    #prompts = ["FPV flying over the Great Wall"]

    #prompts = ["An astronaut stands triumphantly at the peak of a towering mountain. Panorama of rugged peaks and valleys. Very futuristic vibe and animated aesthetic. Highlights of purple and golden colors in the scene. The sky is looks like an animated/cartoonish dream of galaxies, nebulae, stars, planets, moons, but the remainder of the scene is mostly realistic."]

    if generate_type == "t2v":
        model_path = "THUDM/CogVideoX-5b"
    elif generate_type == "i2v":
        model_path = "THUDM/CogVideoX-5b-I2V"
    else:
        model_path = "THUDM/CogVideoX-5b"


    import os
    os.makedirs("./output", exist_ok=True)

    for step, prompt in enumerate(prompts):
        steps = step + 1
        output_path = f"./output/output_{steps}.mp4"
        print(f"Generating video for step: {steps}")
        generate_video(
            prompt=prompt,
            model_path=model_path,
            lora_path=lora_path,
            lora_rank=128,
            output_path=output_path,
            image_or_video_path=image_or_video_path,
            num_inference_steps=50,
            strength=strength,
            guidance_scale=6.0,
            num_videos_per_prompt=1,
            dtype=dtype,
            generate_type=generate_type,
            seed=42,
        )

 Running
Run the following command to generate the videos.
python main.py

 Generated videos
 Trying Text to Video
First, I try t2v.

The prompts are the same as those used for the demo videos.
I tried five of the demo prompts.

The prompts are as follows.
prompts=[
        "A garden comes to life as a kaleidoscope of butterflies flutters amidst the blossoms, their delicate wings casting shadows on the petals below. In the background, a grand fountain cascades water with a gentle splendor, its rhythmic sound providing a soothing backdrop. Beneath the cool shade of a mature tree, a solitary wooden chair invites solitude and reflection, its smooth surface worn by the touch of countless visitors seeking a moment of tranquility in nature's embrace.",
        "An elderly gentleman, with a serene expression, sits at the water's edge, a steaming cup of tea by his side. He is engrossed in his artwork, brush in hand, as he renders an oil painting on a canvas that's propped up against a small, weathered table. The sea breeze whispers through his silver hair, gently billowing his loose-fitting white shirt, while the salty air adds an intangible element to his masterpiece in progress. The scene is one of tranquility and inspiration, with the artist's canvas capturing the vibrant hues of the setting sun reflecting off the tranquil sea.",
        "A golden retriever, sporting sleek black sunglasses, with its lengthy fur flowing in the breeze, sprints playfully across a rooftop terrace, recently refreshed by a light rain. The scene unfolds from a distance, the dog's energetic bounds growing larger as it approaches the camera, its tail wagging with unrestrained joy, while droplets of water glisten on the concrete behind it. The overcast sky provides a dramatic backdrop, emphasizing the vibrant golden coat of the canine as it dashes towards the viewer.",
        "On a brilliant sunny day, the lakeshore is lined with an array of willow trees, their slender branches swaying gently in the soft breeze. The tranquil surface of the lake reflects the clear blue sky, while several elegant swans glide gracefully through the still water, leaving behind delicate ripples that disturb the mirror-like quality of the lake. The scene is one of serene beauty, with the willows' greenery providing a picturesque frame for the peaceful avian visitors.",
        "A small boy, head bowed and determination etched on his face, sprints through the torrential downpour as lightning crackles and thunder rumbles in the distance. The relentless rain pounds the ground, creating a chaotic dance of water droplets that mirror the dramatic sky's anger. In the far background, the silhouette of a cozy home beckons, a faint beacon of safety and warmth amidst the fierce weather. The scene is one of perseverance and the unyielding spirit of a child braving the elements.",
    ]
The videos generated from these prompts are below.

(480p resolution, 8 FPS, 6 seconds)
https://youtu.be/TsrE2eSTKow
https://youtu.be/LyW36Rs9-n4
https://youtu.be/Lh6BvKfK7Yg
https://youtu.be/QvWgNrzH8_0
https://youtu.be/JYToogtv7rg
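The frame rate is low, but I thought the quality of the generated videos was quite high.

In particular, I was impressed by how rarely the video collapses. The model gives the impression of producing stable output.

(However, because CPU offloading is done layer by layer, each video takes about 30 minutes to generate.)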

 Trying Image to Video
Next, I try i2v.
For this I reuse the sample from the Pyramid-Flow demo.

(I could not obtain a sample image for CogVideoX i2v.)
The prompt is as follows.
FPV flying over the Great Wall
The image used is the one below.

The generated video is below.
https://youtu.be/R8frruFVKQk
Even though the sample comes from another model, the result shows that high-quality I2V is achieved.

(I was a little moved.)

 Trying Video to Video
Next, I try v2v.
I was able to obtain the sample for this one, so I use the demo sample.
I fetched the video with the following code.
import torch
from diffusers.utils import export_to_video, load_video
input_video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4")
export_to_video(input_video, "hiker.mp4", fps=8)
The input video is below.
https://youtu.be/ffxLk4yfXaE
The prompt is as follows.
An astronaut stands triumphantly at the peak of a towering mountain. Panorama of rugged peaks and valleys. Very futuristic vibe and animated aesthetic. Highlights of purple and golden colors in the scene. The sky is looks like an animated/cartoonish dream of galaxies, nebulae, stars, planets, moons, but the remainder of the scene is mostly realistic.
The generated video is below.

https://youtu.be/JAX0NdrBxIw
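As the prompt says, the subject has been turned into an astronaut.

The prompt contains much more detail than that, however, and most of it does not seem to be reflected.

Still, the video is stable.
It is great that even an inexpensive GPU like the RTX 3060 can generate videos reliably and with stable results!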

 A brief walkthrough of the code
For the most part, the code follows standard diffusers conventions.

 Model definition
For example, the model is defined as follows.
if generate_type == "i2v":
    pipe = CogVideoXImageToVideoPipeline.from_pretrained(model_path, torch_dtype=dtype)
    image = load_image(image=image_or_video_path)
elif generate_type == "t2v":
    pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype)
else:
    pipe = CogVideoXVideoToVideoPipeline.from_pretrained(model_path, torch_dtype=dtype)
    video = load_video(image_or_video_path)
Each mode uses a different Pipeline class, and the input image or video is loaded when needed.

This is a pattern you see very often with Diffusers.
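The same mode-to-pipeline mapping can also be written as a lookup table. The following is a sketch, not the article's code: the names `PIPELINES`, `resolve_pipeline`, and `load_pipeline` are hypothetical. The class names and model ids are the ones used in main.py. Unlike the if/else chain, which treats anything that is not "i2v" or "t2v" as v2v, this version rejects unknown modes.

```python
PIPELINES = {
    # mode: (diffusers class name, Hugging Face model id)
    "t2v": ("CogVideoXPipeline", "THUDM/CogVideoX-5b"),
    "i2v": ("CogVideoXImageToVideoPipeline", "THUDM/CogVideoX-5b-I2V"),
    "v2v": ("CogVideoXVideoToVideoPipeline", "THUDM/CogVideoX-5b"),
}


def resolve_pipeline(generate_type):
    """Return (class name, model id) for a mode, rejecting unknown modes."""
    try:
        return PIPELINES[generate_type]
    except KeyError:
        raise ValueError(f"unknown generate_type: {generate_type!r}") from None


def load_pipeline(generate_type, dtype):
    """Instantiate the pipeline; diffusers is imported only when this is called."""
    import diffusers

    class_name, model_id = resolve_pipeline(generate_type)
    pipeline_cls = getattr(diffusers, class_name)
    return pipeline_cls.from_pretrained(model_id, torch_dtype=dtype)
```

With this in place, a typo such as "v2i" fails immediately with a clear message instead of silently loading the video-to-video pipeline.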

 Registering the LoRA and scheduler
The following registers the LoRA and the scheduler.
# If you're using with lora, add this code
if lora_path:
    pipe.load_lora_weights(lora_path, weight_name="pytorch_lora_weights.safetensors", adapter_name="test_1")
    pipe.fuse_lora(lora_scale=1 / lora_rank)

pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")

 Reducing VRAM with CPU offloading and related settings
The following configures CPU offloading and related settings to reduce VRAM usage.
pipe.enable_sequential_cpu_offload()
#pipe.enable_model_cpu_offload()

pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
On the RTX 3060, anything other than layer-by-layer offloading (enable_sequential_cpu_offload) resulted in OOM.

This method is much slower than enable_model_cpu_offload, but at least it runs.
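If you want to switch between the two offload strategies depending on the GPU, the settings above can be wrapped in a small helper. This is a sketch: the function name `apply_memory_savers` and its `sequential` flag are assumptions for illustration, while the four methods it calls are the same ones used in main.py.

```python
def apply_memory_savers(pipe, sequential=True):
    """Enable CPU offload and VAE slicing/tiling on a diffusers pipeline."""
    if sequential:
        # Moves weights to the GPU submodule by submodule: lowest VRAM, slowest
        pipe.enable_sequential_cpu_offload()
    else:
        # Moves one whole component at a time: faster, but needs more VRAM
        pipe.enable_model_cpu_offload()
    pipe.vae.enable_slicing()  # decode the batch in slices instead of all at once
    pipe.vae.enable_tiling()   # decode each frame in tiles
    return pipe
```

On a GPU with more VRAM, calling `apply_memory_savers(pipe, sequential=False)` is worth trying first, falling back to the sequential mode only if it runs out of memory.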

 Calling the Pipeline
if generate_type == "i2v":
    video_generate = pipe(
        prompt=prompt,
        image=image,  # The loaded input image that the video is generated from
        num_videos_per_prompt=num_videos_per_prompt,  # Number of videos to generate per prompt
        num_inference_steps=num_inference_steps,  # Number of inference steps
        num_frames=49,  # Number of frames to generate; changed to 49 for diffusers version `0.30.3` and after.
        use_dynamic_cfg=True,  # This is used for the DPM scheduler; for the DDIM scheduler it should be False
        guidance_scale=guidance_scale,
        generator=torch.Generator().manual_seed(seed),  # Set the seed for reproducibility
    ).frames[0]
elif generate_type == "t2v":
    video_generate = pipe(
        prompt=prompt,
        num_videos_per_prompt=num_videos_per_prompt,
        num_inference_steps=num_inference_steps,
        num_frames=49,
        use_dynamic_cfg=True,
        guidance_scale=guidance_scale,
        generator=torch.Generator().manual_seed(seed),
    ).frames[0]
else:
    video_generate = pipe(
        prompt=prompt,
        video=video,  # The loaded input video (a list of frames) to be transformed
        num_videos_per_prompt=num_videos_per_prompt,
        num_inference_steps=num_inference_steps,
        # num_frames=49,
        use_dynamic_cfg=True,
        strength=strength,
        guidance_scale=guidance_scale,
        generator=torch.Generator().manual_seed(seed),  # Set the seed for reproducibility
    ).frames[0]
# Export the generated frames to a video file. fps must be 8 for the original video.
export_to_video(video_generate, output_path, fps=8)

The Pipeline is called with different arguments for each mode.

This is also standard Diffusers usage, so it is easy to follow.
Finally, the video is saved at 8 FPS.
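The three calls share most of their arguments, so the differences are easier to see when the arguments are collected in one place. The following is a sketch of that idea; the helper name `build_call_kwargs` is hypothetical, and the argument names and values are taken from main.py.

```python
def build_call_kwargs(generate_type, prompt, *, image=None, video=None,
                      num_inference_steps=50, guidance_scale=6.0,
                      num_videos_per_prompt=1, strength=0.8, generator=None):
    """Collect the keyword arguments for pipe(...) for one of the three modes."""
    kwargs = {
        "prompt": prompt,
        "num_videos_per_prompt": num_videos_per_prompt,
        "num_inference_steps": num_inference_steps,
        "use_dynamic_cfg": True,  # paired with the DPM scheduler in main.py
        "guidance_scale": guidance_scale,
        "generator": generator,
    }
    if generate_type == "t2v":
        kwargs["num_frames"] = 49
    elif generate_type == "i2v":
        kwargs.update(image=image, num_frames=49)
    elif generate_type == "v2v":
        # As in main.py, num_frames is not passed for v2v
        kwargs.update(video=video, strength=strength)
    else:
        raise ValueError(f"unknown generate_type: {generate_type!r}")
    return kwargs
```

The call then becomes `pipe(**build_call_kwargs(generate_type, prompt, image=image, video=video, generator=generator)).frames[0]` for every mode.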

 Parameter settings
generate_type = "v2v" # t2v: text to video, i2v: image to video, v2v: video to video
#image_or_video_path = "./videoframe_2013.png"
image_or_video_path = "./hiker.mp4"
lora_path = None
strength = 0.8

prompts = ["An astronaut stands triumphantly at the peak of a towering mountain. Panorama of rugged peaks and valleys. Very futuristic vibe and animated aesthetic. Highlights of purple and golden colors in the scene. The sky is looks like an animated/cartoonish dream of galaxies, nebulae, stars, planets, moons, but the remainder of the scene is mostly realistic."]

if generate_type == "t2v":
    model_path = "THUDM/CogVideoX-5b"
elif generate_type == "i2v":
    model_path = "THUDM/CogVideoX-5b-I2V"
else:
    model_path = "THUDM/CogVideoX-5b"

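The code above sets the parameters needed to run the model.

generate_type selects the operating mode. There are three modes:

t2v: generate a video from text

i2v: generate a video from an image and text

v2v: generate a video from a video and text

The i2v and v2v modes need an input image or video, which is specified with image_or_video_path.
In video-to-video mode, you can also set how strongly the output is pushed away from the input video toward the prompt.

That is strength. The default is 0.8, so I run it with the default.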
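As a rough intuition for strength, diffusers image-to-image style pipelines usually add noise to the input in proportion to strength and then run only the matching fraction of the denoising schedule. The sketch below assumes the CogVideoX video-to-video pipeline follows that common convention; I have not confirmed this against its source, and the function name `effective_steps` is hypothetical.

```python
def effective_steps(num_inference_steps, strength):
    """Denoising steps actually run when starting from a partially noised input.

    Assumes the common diffusers convention of
    min(int(num_inference_steps * strength), num_inference_steps).
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


for s in (0.5, 0.8, 1.0):
    print(s, effective_steps(50, s))
```

Under that assumption, the settings in this article (50 steps, strength 0.8) run 40 denoising steps on the noised input video, and a lower strength keeps the output closer to the input.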

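 Summary
I was impressed that such stable video generation is available on an inexpensive consumer GPU.

I sincerely hope that high-quality LoRAs and fine-tuned models appear for this model.
On the other hand, compared with closed video generation models, the low frame rate stood out.

This weakness can be mitigated to some extent with techniques such as frame interpolation.
For example, the CogVideoX demo page also lets you apply frame interpolation to produce videos at a higher frame rate.

I hope to write an article about that technique at some point.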