[Paper Introduction] The latest paper from the AniSora team! Alignment for anime generation
Paper title: Aligning Anime Video Generation with Human Feedback
Link: https://arxiv.org/pdf/2504.10044
Author: Dinh Hoang Dai, Student Researcher at EQUES
Note: Unless otherwise stated, all figures on this page are taken from this paper.
In this blog post, we introduce a follow-up paper to AniSora, the powerful anime video diffusion model. The authors are from Fudan University and Bilibili.
Abstract
This study proposes AniSora, a framework for improving anime video generation through human feedback alignment. To address the lack of anime-specific reward models, the authors build a 30k-sample human-annotated dataset evaluating both visual appearance and visual consistency. Based on this, a multi-dimensional reward model called AnimeReward is developed to better reflect human aesthetic preferences. Furthermore, a new training strategy, Gap-Aware Preference Optimization (GAPO), is introduced to emphasize stronger preference differences during optimization. Experiments show that this approach significantly enhances the visual quality, temporal coherence, and human alignment of anime video generation.
Diffusion models and Preference Optimization
Diffusion model:
- Diffusion models work by gradually denoising random noise into meaningful images or videos.
- At each step, the model predicts the noise component and removes part of it, eventually reconstructing a clear video sequence.
- This process allows high-quality, detailed, and temporally consistent frame generation.
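To make the denoising loop concrete, here is a minimal Python sketch. The `predict_noise` and `step` callables are hypothetical stand-ins for the denoising network and the scheduler update rule; they are not from the paper.

```python
import torch

def sample_video_latents(predict_noise, step, timesteps, shape):
    """Minimal sketch of a diffusion sampling loop (not the paper's sampler).

    predict_noise(latents, t) and step(latents, noise_pred, t) are hypothetical
    callables standing in for the denoising network and the scheduler update.
    """
    latents = torch.randn(shape)                 # start from pure Gaussian noise
    for t in timesteps:                          # e.g. T, T-1, ..., 1
        noise_pred = predict_noise(latents, t)   # predict the noise component at step t
        latents = step(latents, noise_pred, t)   # remove part of the predicted noise
    return latents                               # clean video latents after the last step
```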
DPO for Diffusion Models
- DPO is a method for fine-tuning generative models based on human preferences.
- Instead of relying only on loss functions like MSE or CLIP similarity, DPO learns from pairs of outputs: one preferred (v^w) and one less preferred (v^l).
- The model then updates its parameters to make the preferred sample more likely.
- However, the DPO objective cannot be applied to diffusion models directly (the model likelihood is intractable), so the authors use Diffusion-DPO, which reformulates the objective using the noise predicted during the denoising process.
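As a rough illustration of what a Diffusion-DPO style objective optimizes, the sketch below compares the denoising errors of the fine-tuned model and a frozen reference model on the preferred and less preferred samples. The per-timestep weighting from the original Diffusion-DPO formulation is omitted, and the tensor names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_theta_w, eps_ref_w, eps_theta_l, eps_ref_l,
                       noise_w, noise_l, beta=5000.0):
    """Sketch of a Diffusion-DPO style objective.

    Inputs are noise predictions for the preferred (w) and less preferred (l)
    videos at the same noised timestep; noise_* are the true added noises.
    """
    # Relative denoising error of the fine-tuned model vs. the frozen reference
    err_w = (noise_w - eps_theta_w).pow(2).mean() - (noise_w - eps_ref_w).pow(2).mean()
    err_l = (noise_l - eps_theta_l).pow(2).mean() - (noise_l - eps_ref_l).pow(2).mean()
    # The preferred sample should have the smaller relative error
    return -F.logsigmoid(-beta * (err_w - err_l))
```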
Anime Reward Dataset:
- One of the biggest challenges in anime video generation is the lack of datasets that reflect human preferences.
- To address this, the authors introduce the AnimeReward Dataset — the first large-scale human-annotated dataset specifically designed for anime video evaluation.
Dataset Overview
- 5,000 original anime videos from more than 100 action categories (walking, waving, jumping, hugging, etc.)
- Around 30,000 generated samples collected from various anime/video generation models such as Hailuo, Vidu, OpenSora, and CogVideoX
- 6,000 validation samples (completely different from the training set)
- The dataset covers six human-centric evaluation dimensions, divided into two main groups: Visual Appearance (Smoothness, Motion, Appeal) and Visual Consistency (Text-Video Consistency, Image-Video Consistency, Character Consistency)
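To make the annotation format concrete, here is a hypothetical record illustrating the six dimensions; the field names and score scale are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical structure of one annotated sample (field names are illustrative).
sample = {
    "prompt": "a girl waving her hand",       # text prompt
    "reference_image": "ref_0001.png",        # reference / first-frame image
    "video": "gen_0001.mp4",                  # generated anime clip
    "scores": {
        # Visual Appearance
        "smoothness": 4, "motion": 3, "appeal": 4,
        # Visual Consistency
        "text_video": 5, "image_video": 4, "character": 3,
    },
}
```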
AnimeReward:
- The AnimeReward model is a multi-dimensional reward model designed to capture both visual appearance and visual consistency from human feedback. It is trained on the Anime Reward Dataset.
- It defines a function R(v) that assigns each anime video a reward score reflecting how well it matches human preferences.
- The model covers the six evaluation dimensions mentioned in the previous section, each handled by a different vision-language model (VLM) backbone.
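A minimal sketch of how a single score R(v) could be formed from the six per-dimension scores is shown below; treating the combination as a plain average is a simplifying assumption here, not necessarily the paper's exact formula.

```python
def anime_reward(scores: dict[str, float]) -> float:
    """Combine the six per-dimension scores into one reward R(v).

    A plain average is used here as a simplifying assumption.
    """
    dims = ["smoothness", "motion", "appeal",
            "text_video", "image_video", "character"]
    return sum(scores[d] for d in dims) / len(dims)
```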
Visual Appearance:
- Visual Smoothness
  - Trains a regression model to detect flicker or unstable frame transitions.
  - Uses frame-level embeddings from a vision encoder E_v:
    S_{\text{smooth}} = Reg(E_v(I_1, \ldots, I_N))
- Visual Motion
  - Based on ActionCLIP, which measures how well the motion matches the described action T_m:
    S_{\text{motion}} = Cos(MCLIP(V), MCLIP(T_m))
- Visual Appeal
  - Evaluates overall visual beauty and composition.
  - Extracts keyframes and feeds them into SigLIP + an aesthetic regressor:
    S_{\text{appeal}} = Aes(SigLIP(I_{0,1,\ldots,K})), \quad I_i \in KeyFrm(V)
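The sketch below shows how the smoothness regressor and the ActionCLIP-style motion score could look in code; the encoder, embedding size, and pooling are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothnessHead(nn.Module):
    """Regression head over frame-level embeddings E_v(I_1, ..., I_N) -> S_smooth.

    The pooling and layer sizes are assumptions for illustration.
    """
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.reg = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, frame_embeds: torch.Tensor) -> torch.Tensor:
        # frame_embeds: (num_frames, embed_dim)
        return self.reg(frame_embeds.mean(dim=0))  # pooled embedding -> scalar score

def motion_score(video_embed: torch.Tensor, text_embed: torch.Tensor) -> torch.Tensor:
    """S_motion = Cos(MCLIP(V), MCLIP(T_m)) given precomputed ActionCLIP-style embeddings."""
    return F.cosine_similarity(video_embed, text_embed, dim=-1)
```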
Visual Consistency:
- Text-Video Consistency
  - Checks whether the generated motion and objects match the input text prompt:
    S_{\text{tvc}} = Reg(E_v(V), E_t(T))
- Image-Video Consistency
  - Ensures the style and character identity match the reference image I_p:
    S_{\text{ivc}} = Reg(E_v(V), E_v(I_p))
- Character Consistency
  - Detects identity drift (e.g., a character's face changing).
  - Combines GroundingDINO, SAM, and BLIP to extract and compare character features:
    S_{\text{cc}} = \frac{1}{N} \sum_{i}^{N} Cos(BLIP(M_i), fea_c)
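For the character-consistency score, a minimal sketch is given below. It assumes the per-frame character regions M_i have already been extracted (e.g., by GroundingDINO + SAM) and encoded (e.g., by BLIP); that upstream pipeline is not reproduced here.

```python
import torch
import torch.nn.functional as F

def character_consistency(frame_char_feats: torch.Tensor,
                          ref_char_feat: torch.Tensor) -> torch.Tensor:
    """S_cc: mean cosine similarity between per-frame character features and
    the reference character feature fea_c.

    frame_char_feats: (N, d) features of the character region in each frame.
    ref_char_feat:    (d,)   feature of the reference character.
    """
    sims = F.cosine_similarity(frame_char_feats, ref_char_feat.unsqueeze(0), dim=-1)
    return sims.mean()
```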
Anime Video Alignment
DPO
- To align anime video generation with human judgment, the authors use Direct Preference Optimization (DPO).
- For each input image and prompt, the model generates several videos (v_1, v_2, \ldots, v_N). Each video receives a reward score R(v_i) from the AnimeReward model, which reflects how well it matches human preferences.
- The video with the highest score becomes the preferred sample v^w, and the lowest one becomes the less preferred v^l.
- DPO then fine-tunes the model to make outputs like v^w more likely than v^l.
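A minimal sketch of the pair construction described above: score each candidate video with the reward model and keep the best and worst as (v^w, v^l). `reward_fn` is a stand-in for AnimeReward.

```python
def build_preference_pair(videos, reward_fn):
    """Pick the highest- and lowest-scoring videos as the DPO pair (v_w, v_l)."""
    scores = [reward_fn(v) for v in videos]
    best = max(range(len(videos)), key=lambda i: scores[i])
    worst = min(range(len(videos)), key=lambda i: scores[i])
    return videos[best], videos[worst]
```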
Gap-Aware Preference Optimization (GAPO)
- However, standard DPO treats all video pairs equally, even when the preference difference is small.
- To solve this, the authors propose GAPO, which gives a higher training weight to pairs with larger preference gaps.
- The new loss is defined as:
  \mathcal{L}_{\text{GAPO}}(\theta) = (G_w - G_l) \cdot \mathcal{L}_{\text{DPO}}(x, c, v^w, v^l)
  where G_w and G_l are the reward gains of the winning sample v^w and the losing sample v^l.
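The reweighting itself is simple to express: the sketch below scales the pairwise Diffusion-DPO loss by the reward gap. How the gains G_w and G_l are normalized in the paper is not reproduced here, so this is only an illustration.

```python
import torch

def gapo_loss(dpo_loss: torch.Tensor, gain_w: float, gain_l: float) -> torch.Tensor:
    """Scale the pairwise DPO loss by the preference gap (G_w - G_l).

    Pairs with a clearer preference (larger gap) contribute more to training.
    """
    gap = gain_w - gain_l   # >= 0 when the winner truly outscores the loser
    return gap * dpo_loss
```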
Experiment
Setup
- For evaluation, the authors use CogVideoX-5B as the base diffusion model.
- They generate 2,000 anime videos from prompts and images, then assign reward scores using AnimeReward to create training pairs.
- The model is fine-tuned for alignment with the GAPO loss, and the comparison models include:
  - Baseline (no preference optimization)
  - SFT (Supervised Fine-Tuning)
  - GAPO (proposed method)
- Metrics: the authors use automated metrics (VBench-I2V, AnimeReward, and VideoScore) and human evaluations (500 generated videos rated by 3 annotators).
Results
- Across all benchmarks, the proposed GAPO-based model achieves the highest scores. On VBench-I2V, it outperforms both the Baseline and SFT in every metric.
- Qualitatively, GAPO also performs better than CogVideoX and SFT: there are fewer flickering or distortion artifacts, stronger text-video and character consistency, and better overall animation smoothness.
Conclusion:
- In summary, AniSora introduces:
- The first large-scale human preference dataset for anime videos
- A multi-dimensional reward model (AnimeReward) for aesthetic and consistency evaluation
- A novel Gap-Aware Preference Optimization (GAPO) training method
Closing Remarks
ANIMINS (ANIMe INSight) is a data and generative-AI utilization demonstration project carried out by OLM Digital. It clearly positions AI as "one of many tools, something that supports creators," and thoroughly investigates whether AI can genuinely be put to use in anime production workflows.
For details, please also see the website below.
At EQUES, we will continue to pursue research, development, and real-world deployment under our mission of "mastering cutting-edge machine learning technology to accelerate the development of society." We are looking for teammates to create new businesses together. For details, please see below.