
A Simple Guide to Keling 2.0: China’s Latest AI Video Generation Model Explained


Introduction

The evolution of AI technology has been dizzying lately, with new models appearing one after another, particularly in the field of video generation. Following the massive attention garnered by OpenAI's Sora, the competition for AI video generation technology has intensified globally.

Development of original AI video generation models is also progressing within the Chinese tech industry. On April 15, 2025, Kuaishou, a major Chinese technology company, announced the latest version 2.0 of its AI model, "Keling (可灵)." This model is not just a simple update; it is an innovative leap that introduces a new interaction philosophy for AI video generation called "MVL (Multi-modal Visual Language)."

To make it easy enough for even a rabbit to understand, this article explains the overview, features, and technical innovations of Keling 2.0. Let's hop right in!

What is Keling 2.0?

Keling AI Overview and Development History

Keling AI is an AI video generation model developed by the Chinese short-video platform "Kuaishou." Since its first announcement in June 2024, it has undergone more than 20 updates in just 10 months.

Keling AI mainly consists of two major components:

  1. Keling (可灵): Video generation model
  2. Katu (可图): Image generation model

With this announcement, both have been upgraded to version 2.0.

Overall structure of Keling AI

Announcement on April 15, 2025

On April 15, 2025, Gai Kun, Senior Vice President and head of Community Science at Kuaishou, announced the upgrade of the Keling AI base models at the "Inspiration Comes True" 2.0 model launch event held in Beijing.

Specifically:

  • Keling 2.0 Video Generation Model
  • Katu 2.0 Image Generation Model

Both were officially released for a global audience.

Basic Features of Keling 2.0 and Katu 2.0

The Keling 2.0 model has undergone significant evolution in the following areas:

  • Dynamic texture (naturalness of movement)
  • Language understanding and response to instructions
  • Aesthetic quality of the video

Additionally, the Katu 2.0 model is characterized by:

  • Improved instruction following capabilities
  • Enhanced cinematic aesthetic expression
  • Support for more diverse artistic styles

Notably, it now supports over 60 different stylization effects.

Key Functions and Features of Keling 2.0

Improvements in Video Generation Capabilities

In Keling 2.0, video generation capabilities have been significantly improved over the previous version. Particularly in the "Master Version," notable improvements are seen in the following areas:

  1. Enhanced Language Understanding: Understanding more complex instructions and generating videos that align with the user's intent.
  2. Naturalness of Movement: Movements of people and objects have become more natural and smooth.
  3. Aesthetic Quality of the Video: Enabling more artistic and beautiful visual expressions.

Details of Multi-modal Editing Functions

One of the most innovative features of Keling 2.0 is the newly added multi-modal editing function. This feature allows for:

  • Editing existing videos by "adding," "deleting," or "modifying" elements.
  • Providing editing instructions not just through text, but by referencing parts of images or other videos.
  • More flexible understanding of user intent, making it easy to edit video content.

This feature enables users, even those without professional video editing skills, to perform advanced video editing through AI.
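The edit operations described above can be pictured as structured instructions sent to the model. The sketch below is a hypothetical data model (the class and field names are illustrative assumptions, not Kuaishou's published API) showing how an "add / delete / modify" edit might bundle a textual target with visual references:

```python
from dataclasses import dataclass, field
from typing import Literal, Optional, List, Tuple

# Hypothetical schema: Kuaishou has not published this exact request
# format, so every name below is an assumption for illustration only.

@dataclass
class EditReference:
    """A visual reference attached to an edit instruction."""
    kind: Literal["image", "video"]          # modality of the reference
    uri: str                                 # where the reference asset lives
    region: Optional[Tuple[int, int, int, int]] = None  # (x, y, w, h) crop

@dataclass
class EditInstruction:
    """One multi-modal edit: add, delete, or modify an element."""
    operation: Literal["add", "delete", "modify"]
    target: str                              # textual description of the element
    references: List[EditReference] = field(default_factory=list)

# Example: change a character's jacket by pointing at a reference image
# instead of describing the desired look purely in words.
edit = EditInstruction(
    operation="modify",
    target="the main character's jacket",
    references=[EditReference(kind="image", uri="style/red_jacket.png")],
)
print(edit.operation, edit.target)
```

The point of the structure is that the `references` list lets a user say "make it look like this" rather than spelling the change out in text alone.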

Conceptual diagram of Keling 2.0's multi-modal editing function

Image Generation Features of "Katu 2.0" and Their Improvements

Katu 2.0 is the image generation component of Keling AI, and it has significantly improved in the following areas:

  1. Instruction Following Capability: It is now possible to generate images that more accurately reflect the user's instructions.
  2. Cinematic Aesthetic Expression: The ability to generate images with a film-like texture has been improved.
  3. Diverse Artistic Styles: It now supports more than 60 types of stylization effects.
  4. Creativity and Imagination: The model's creative expression has been greatly enhanced.

Interestingly, about 85% of video generation in Keling AI today is "image-to-video," and the quality of the base image greatly impacts the quality of the video. In other words, the improvements in Katu 2.0 indirectly boost the video generation capabilities of Keling 2.0.
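A minimal sketch of this two-stage workflow, with placeholder functions standing in for Katu 2.0 (text-to-image) and Keling 2.0 (image-to-video). The function names and return values are assumptions for illustration, not the real API:

```python
# Two-stage "image-to-video" pipeline sketch. Both functions are
# stand-ins: the real services would return generated media assets.

def generate_base_image(prompt: str) -> str:
    """Stand-in for Katu 2.0: return a path to a generated still image."""
    return f"images/{abs(hash(prompt)) % 10000}.png"

def animate_image(image_path: str, motion_prompt: str) -> str:
    """Stand-in for Keling 2.0 image-to-video: return a path to a video."""
    return image_path.replace("images/", "videos/").replace(".png", ".mp4")

# Because ~85% of generations start from a still image, the quality of
# the first stage directly bounds the quality of the final video.
still = generate_base_image("a rabbit astronaut on the moon, cinematic lighting")
clip = animate_image(still, "the rabbit hops forward, dust floating in low gravity")
print(clip)
```

This is why an upgrade to the image model propagates into better videos: the second stage can only animate what the first stage gives it.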

Examples and Use Cases

Main usage scenarios for Keling 2.0 include:

  • Creative content production (short videos, advertisements, etc.)
  • Entertainment industry (visual effects, concept videos)
  • Personal social media content creation
  • Creation of educational content
  • Product demo video creation

Particularly noteworthy is the ability to generate new videos based not only on text but also on images and existing videos. This allows users to convey their "mental images" to the AI more intuitively, even without specialized expertise.

MVL Technology: A New Interaction Philosophy for Video Generation

How Multi-modal Visual Language (MVL) Works

The most significant innovation of Keling 2.0 is the introduction of a new interaction concept for AI video generation called "Multi-modal Visual Language (MVL)." MVL is a new way for users to efficiently communicate complex creative intentions to the AI.

The characteristic of MVL lies in its ability to combine multiple modalities (types of information)—not just text, but also image references and video clips—to directly and efficiently convey more complex and multi-dimensional creative intent to the AI.

Conceptual diagram of MVL (Multi-modal Visual Language)

Combination of TXT and MMW

MVL primarily consists of the following two elements:

  1. TXT (Pure Text): The text portion that serves as the semantic skeleton.
  2. MMW (Multi-modal-document as a Word): Multi-modal descriptors.

By combining these, it is possible to achieve both the basic direction of video generation (specified by TXT) and fine-grained control (specified by MMW), allowing creators to express their creative intent more accurately.

It is worth noting that MMW is designed to incorporate other modal information in the future, such as audio or movement trajectories, in addition to images and videos. This will allow users to achieve even richer expressions.

It might be a bit difficult for a rabbit, but basically, it means "you can convey an image that is hard to explain with words by showing samples or examples."
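As a rough sketch of the idea, an MVL prompt can be modeled as a TXT skeleton whose slots are filled by MMW descriptors. The classes below are an assumption about how such a prompt could be represented, not Kuaishou's actual interface:

```python
from dataclasses import dataclass, field
from typing import Dict

# Illustrative data model only: the announcement describes TXT (a text
# skeleton) and MMW (multi-modal "words" such as image or video
# references); the representation below is a guess at one way to
# combine them, not the real Keling 2.0 prompt format.

@dataclass
class MMW:
    """A multi-modal 'word': an image/video reference (audio and motion
    trajectories are planned as future modalities)."""
    modality: str   # "image", "video", ... extensible by design
    uri: str

@dataclass
class MVLPrompt:
    """TXT skeleton with named slots filled by MMW descriptors."""
    txt: str                                  # semantic skeleton with {slots}
    mmw: Dict[str, MMW] = field(default_factory=dict)  # slot name -> MMW

prompt = MVLPrompt(
    txt="A {hero} walks through {scene} at dusk, camera slowly panning right",
    mmw={
        "hero": MMW("image", "refs/hero_design.png"),
        "scene": MMW("video", "refs/alley_clip.mp4"),
    },
)

# TXT fixes the overall direction; each MMW pins down a detail that is
# hard to put into words (a face, a location, a motion style).
print(sorted(prompt.mmw))
```

The separation mirrors the division of labor described above: the text sets the basic direction, while the visual "words" carry the fine-grained control.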

Differences from Conventional AI Video Generation

The main differences between conventional AI video generation models and Keling 2.0, which employs MVL, are as follows:

| Conventional Models | Keling 2.0 (with MVL) |
| --- | --- |
| Instructions via text prompts only | Instructions via text + images + video, etc. |
| Complex ideas must be expressed in words alone | Intent can be conveyed intuitively using visual references |
| Fine-grained control is difficult | Basic direction and detailed control can be specified separately |
| Maintaining consistent characters or landscapes is hard | Style and consistency are easier to maintain via reference images or videos |

In short, MVL can be seen as applying the principle of "a picture is worth a thousand words" to AI video generation.

Technical Innovations

The primary technical innovations of MVL technology are:

  1. Integration of Multi-modal Information: Integrated processing of different types of input information (text, images, video, etc.).
  2. Layering of Information: Hierarchical processing of basic instructions (TXT) and detailed controls (MMW).
  3. Intuitive Interface: Ability to communicate creative intent intuitively without the need for complex technical knowledge.
  4. Extensibility: Designed to support even more diverse modalities in the future, such as audio and movement patterns.

Through this technology, a solution has been presented for one of the biggest challenges in AI creation: communicating "complex mental images" to an AI.

Keling AI Market Performance and Future Outlook

Statistics on User Numbers and Generated Content

Keling AI has been growing rapidly since its announcement in June 2024. According to the latest statistics:

  • Global user count has surpassed 22 million.
  • User numbers have surged 25-fold in the past 10 months.
  • A cumulative total of 168 million videos and 344 million images have been generated.
  • Over 15,000 developers and companies worldwide are utilizing the API.

Particularly noteworthy is the achievement of such growth in just 10 months. This demonstrates the high demand for AI video generation and the ease of use of Keling AI.
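As a quick sanity check on these figures, a 25-fold increase over 10 months corresponds to roughly 38% compound growth per month:

```python
# 25x growth over 10 months implies a monthly compound growth factor of
# 25 ** (1/10), i.e. about +38% per month on average.
monthly_growth = 25 ** (1 / 10)
print(round((monthly_growth - 1) * 100))  # percent growth per month
```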

Growth data of Keling AI (June 2024 – April 2025)

Expansion of the Developer Ecosystem

Keling AI is focusing on building an ecosystem not just for general users but also for developers. More than 15,000 developers and companies from around the world are leveraging Keling's API for applications across various industrial scenarios.

Through this, Keling AI is increasing its value as a business solution, transcending its role as a mere consumer application.

Comparison with Competing Models

The AI video generation market in 2025 is becoming increasingly competitive. A comparison with major competing models is as follows:

| Model Name | Developer | Features | Strengths |
| --- | --- | --- | --- |
| Sora | OpenAI | Long-duration, high-quality video generation | Scene consistency, understanding of physical laws |
| Vidu | Shengshu Technology | Realistic video | Reproducibility of details |
| Jimeng AI | - | Fast generation, Chinese text generation | Single-frame image quality, generation efficiency |
| Minimax-Video | MiniMax | Diverse video styles | Variety of styles |
| Keling 2.0 | Kuaishou | MVL technology, multi-modal editing | Interaction, editing functions |

In VBench evaluations at the beginning of 2025, Keling 2.0 received high marks, particularly for "dynamic texture," "language understanding," and "visual aesthetics." However, compared to OpenAI's Sora, there is still room for improvement in areas such as consistency in long-duration videos.

Future Development Potential and Challenges

Future development potential and challenges for Keling AI are considered as follows:

Development Potential:

  • Further expansion of MVL technology (integration of new modalities such as audio and touch)
  • Support for longer video generation
  • Development of customized solutions for industries
  • Taking on real-time video generation

Challenges:

  • Addressing ethical and legal issues (copyright, misinformation, etc.)
  • Maintaining consistency in longer videos
  • Improving compute efficiency (generation currently takes about 1 to 5 minutes)
  • Strengthening competitiveness in the international market

Gai Kun stated, "AI holds great potential to aid creative expression, but the current state of industry development is still far from meeting user needs," acknowledging that there are "still many challenges" in the stability of AI-generated content and the accurate transmission of users' complex creativity.

Summary

Keling 2.0 is a prime example of how Chinese AI technology is undergoing its own unique evolution. In particular, the introduction of the new interaction concept called MVL (Multi-modal Visual Language) is an interesting approach to the fundamental challenge in AI video generation: "how to communicate human creative intent to AI."

From the perspective of a Japanese engineer, Keling 2.0 is noteworthy for the following points:

  1. Intuitive creative process through multi-modal input methods
  2. Ease of use through the integration of base models and editing functions
  3. The speed of acquiring a global user base of 22 million

Future AI video generation technology will likely evolve not just toward "generating more realistic videos" but toward "how to support the human creative process." Keling 2.0 is one such example, and interface innovations like MVL may influence future AI creation tools.

The world of AI video generation, which would even surprise a rabbit, is still in its infancy. Keep an eye on its future evolution! 🐰
