Running DiffRhythm, the Latest Music Generation AI, on Google Colab (Simplified Version)


Introduction

Everyone, have you read this paper released on March 3rd, 2025?
https://arxiv.org/abs/2503.01183v1
(I haven't read it thoroughly yet lol, but I've skimmed it enough to understand the general mechanism.)

I also posted about this paper; it's incredibly interesting!
https://x.com/asap2650/status/1896966554275000752

It's a paper about music generation like SUNO or YUE, but the most appealing part is that it's an open model capable of generating up to 4 minutes and 45 seconds of music in just 10 seconds! (Note: it seems limited to 1 minute and 35 seconds for now.)

You can try out a demo in the following Space!
https://huggingface.co/spaces/ASLP-lab/DiffRhythm
(For the results of my experiments, please check the YouTube link in the post mentioned above. I've also included the lyrics and the reference audio I used.)

Since I wanted to try running it in my own environment, I'm going to set it up to run on Google Colab without using a GUI!

Deliverables

Please check the repository below.
https://github.com/personabb/colab_AI_sample/tree/main/colab_DiffRhythm_sample

Preparation

You need to prepare the items in the folder below.
https://github.com/personabb/colab_AI_sample/tree/main/colab_DiffRhythm_sample/example

  • Lyrics file
It will sing according to these lyrics (apparently only English and Chinese are supported).
    • A timestamped lyrics file is required.
      • The original repository includes the following features, but I have not implemented them here:
        • Automatic generation of timestamped lyrics from a specified theme
        • Automatic setting of timestamps from input lyrics
          • It's probably better to use this feature for timestamps, but I just added them randomly.
    • You can refer to example/eg.lrc in my repository.
  • Reference audio file
    • It outputs music similar to this audio.
    • An audio file of 10 seconds or longer is required.
    • In this article, I'm using example/pray.mp3 from my repository.
      • This music is something I previously made with SUNO. Please see the following article.

https://zenn.dev/asap/articles/886af0fe48dda3
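The `.lrc` format is simple enough to validate yourself before feeding it to the model. Here is a minimal sketch (`parse_lrc` is my own helper, not part of the DiffRhythm code) that turns each `[mm:ss.xx]lyrics` line into a `(seconds, lyric)` pair:

```python
import re

# One LRC cue per line: [mm:ss.xx]lyrics, timestamp immediately before the text.
LRC_LINE = re.compile(r"\[(\d{2}):(\d{2})\.(\d{2})\](.+)")

def parse_lrc(text):
    """Parse '[mm:ss.xx]lyrics' lines into (seconds, lyric) pairs."""
    entries = []
    for line in text.splitlines():
        m = LRC_LINE.fullmatch(line.strip())
        if m:  # silently skip blank or malformed lines
            mm, ss, xx, lyric = m.groups()
            entries.append((int(mm) * 60 + int(ss) + int(xx) / 100, lyric))
    return entries
```

Running this over `example/eg.lrc` lets you catch malformed lines before a generation run wastes GPU time.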

If you want to use the features for creating timestamped lyrics from a theme or automatically setting timestamps from input lyrics, I think you can do so by using the following prompts with ChatGPT or similar tools.
(Even the official implementation feeds the following prompts into DeepSeek-R1.)

Creating timestamped lyrics from a theme

You are a professional musician who has been invited to make music-related comments

Please generate a complete set of lyrics in {language} that follows the style of "{tags}" around the theme of "{theme}". Strictly follow these requirements:

### **Mandatory Format Rules**
1. **Output only timestamps and lyrics**, no brackets, narration, or segment markers (such as chorus, bridge, outro annotations).
2. Each line must be in the format `[mm:ss.xx]lyrics content`, with no space between the timestamp and lyrics. Lyrics should be continuous and coherent.
3. Timestamps should be naturally distributed; **the first line must not start at [00:00.00]**, considering the intro silence.

### **Content and Structure Requirements**
1. Lyrics should be varied with emotional progression and a sense of layers. **The length of each line should vary naturally**—do not make them all the same length, which results in a robotic format.
2. **Timestamp allocation should be reasonably inferred based on song tags, lyric emotion, and rhythm**, rather than mechanically assigned by lyric length.
3. Bridges/Outros are represented only through time gaps (e.g., jumping directly from [02:30.00] to [02:50.00]), **no text description needed**.

### **Negative Examples (Prohibited)**
- Error: [01:30.00](Piano Bridge)
- Error: [02:00.00][Chorus]
- Error: Empty lines, line breaks, comments

Example)

theme                  tags                        language
Love and Heartbreak    vocal emotional piano pop   en
Heroic Epic            choir orchestral powerful   zh

Automatically setting timestamps from input lyrics

You are a professional musician who has been invited to make music-related comments

{lyrics_input} These are the lyrics of a song, one line per sentence. {tags_lyrics} is the style I want for this song. I now want to timestamp each line of these lyrics to get an LRC file. I hope the timestamp allocation is reasonably inferred based on the song's tags, lyric emotion, and rhythm, rather than mechanically assigned by lyric length. The timestamp for the first line should consider the intro length, avoiding starting directly from `[00:00.00]`. Output the lyrics strictly in LRC format, with each line formatted as `[mm:ss.xx]lyrics content`. Output only the LRC result without any other explanation.

Example)

tags                         raw lyrics (without timestamps)
acoustic folk happy          I'm sitting here in the boring room / It's just another rainy Sunday afternoon
electronic dance energetic   We're living in a material world / And I am a material girl
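Both prompts are plain templates, so filling the placeholders is a one-liner with `str.format`. A sketch (`THEME_PROMPT` below is only the first line of the full prompt, abbreviated for space; the full text is above):

```python
# Placeholders match the official prompt: {language}, {tags}, {theme}.
THEME_PROMPT = (
    'Please generate a complete set of lyrics in {language} that follows '
    'the style of "{tags}" around the theme of "{theme}".'
)

def build_theme_prompt(language, tags, theme):
    """Fill the theme prompt; the lyrics-timestamping prompt works the
    same way with its {lyrics_input} and {tags_lyrics} placeholders."""
    return THEME_PROMPT.format(language=language, tags=tags, theme=theme)
```

You can then paste the filled prompt into ChatGPT, DeepSeek-R1, or whichever model you prefer.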

By the way, I've written an article about DeepSeek-R1 here, so please take a look.
https://zenn.dev/asap/articles/34237ad87f8511

How to Use

https://github.com/personabb/colab_AI_sample/tree/main/colab_DiffRhythm_sample

Download colab_DiffRhythm_sample/colab_DiffRhythm_sample.ipynb from the repository above and upload it to Google Drive.

Execute the cells in order up to the "second cell."

The pip install in the second cell takes a very long time.
During this time, upload your "reference audio file" to the /content/DiffRhythm/example folder.
(I placed pray.mp3 in the example folder; you can upload it by dragging and dropping into the file browser on the left.)

Colab will then likely prompt you to restart the runtime, so please click "Cancel."

After that, execute the cells all the way to the bottom.
You can then download output.wav from the left sidebar!
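Before downloading, you can sanity-check the generated file's length directly in a notebook cell with only the standard library (this assumes the output path is output.wav in the working directory, as in my notebook):

```python
import wave

def wav_duration(path):
    """Duration of a WAV file in seconds (frame count / sample rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# In the notebook: print(f"{wav_duration('output.wav'):.1f} s")
```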

Output Results

Here are the results! Since I set the timestamps somewhat randomly this time, there are some slightly unstable parts, but the lyric following and general quality seem quite good, don't they?

https://youtu.be/iLKTVWHtWlg

Lyrics
[00:10.00] The cold wind pierces through my heart
[00:13.20] The blurry streetlights hide my tears
[00:16.85] The moment I knew what loneliness was
[00:20.40] I felt like I found a piece of the future
[00:24.15] Beyond the locked door
[00:27.65] I hear a whispering dream
[00:31.30] I want to believe, but I'm so afraid
[00:34.90] Reaching out with trembling hands
[00:38.55] Under the stardust night, I make a wish
[00:42.10] That your smile will never fade away
[00:45.75] Holding onto strength within the fleeting moments
[00:49.25] I will keep chasing the light, again and again
[00:52.00] Guided by the signpost soaked in rain
[00:55.30] I trace back the memories from afar
[00:58.90] The unseen future makes me anxious
[01:02.50] But a small flame flickers deep in my heart
[01:06.25] There's no dream that’s out of reach
[01:09.75] Because you were the one who showed me
[01:13.40] I won’t forget, no matter when
[01:16.95] Your voice will always lead me
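The timestamps above satisfy the two hard rules from the prompt: the first cue is not [00:00.00], and the cue times strictly increase. A quick check for your own cue lists (`check_cues` is a hypothetical helper of mine, not from the repository):

```python
def check_cues(times):
    """True if the first cue starts after 0.0 s and cues strictly increase."""
    return bool(times) and times[0] > 0.0 and all(
        a < b for a, b in zip(times, times[1:])
    )
```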

Summary

The fact that you can easily create audio of this quality on the free version of Google Colab really shows how far we've come.
Moreover, it's incredibly impressive that the model can generate music in just about 10 seconds.
(Although this specific code takes a bit of time for module installation and imports, I still found it quite fast.)

I expect the quality would be even better if the timestamps were set accurately.
I also hope for Japanese language support in the future!
