Translated by AI
No More Lonely Songwriting: Creating Music with AI Partner "Session MUSE" [AI Agent Hackathon]
Introduction: A Partner for Creative Loneliness
Music production, especially the process of composition and arrangement, is often a journey of lonely exploration. A melody that suddenly comes to mind, a guitar riff that resonates in your heart—many creators face the wall of "creative loneliness" while trying to nurture those small seeds of ideas into a full song.
"Is this direction right?"
"My ideas are running dry, and I don't know what to do next..."

If you had band members or a producer, you could count on objective feedback and unexpected inspiration. However, not everyone is blessed with such an environment.
To solve this universal challenge faced by all musicians, we developed the AI music partner "Session MUSE." This is more than just a tool. It is the ultimate band member that ignites your creativity and stays with your ideas 24/7, 365 days a year.
A one-of-a-kind work created from humming, together with AI.
The AI listens to the humming or instrumental performances uploaded by the user and instantly generates a backing track. Through dialogue, it helps you brainstorm the next progression. In this article, we will introduce in detail the idea, the technology behind it, and the results of this product developed for the 2nd AI Agent Hackathon with Google Cloud.
🏆 Project Achievement Summary
- Utilization of Gemini 2.5 Flash Lite Preview: Realized an AI that "creates music directly from humming."
- Robust Architecture: A serverless, workflow-driven architecture using Cloud Run and LangGraph.
- Cross-platform Development: High-efficiency multi-platform support using Flutter.
1. Project Overview
Target Users and the "Creative Wall"
Session MUSE targets "bedroom producers" dedicated to music production at home, solo musicians, and everyone who creates music as a hobby. While they are full of passion, they commonly experience deadlocks in the creative process.
User Pain Points (Actual voices)
"I recorded a melody that came to me, but I don't know where to go from there."
"I can use DAW software, but I can't come up with arrangement ideas."
"I don't have band members, so I have no way to practice playing in a session."
The root causes of these deadlocks are a "lack of an objective perspective" and the "exhaustion of ideas."
Solution: The 24/7 AI Band Member "Session MUSE"
Session MUSE is an AI band member that breaks through these "walls."

Difference from Conventional Methods
| Conventional AI Composition Tools | Session MUSE |
|---|---|
| Generate from text prompts | Direct analysis from humming |
| Output the finished product all at once | Step-by-step collaborative creation |
| Mechanical dialogue | Understand emotional requests |
| Requires knowledge of music theory | No specialized knowledge required |
🎧 Feature 1: "Listening" AI - From Humming to Music Blueprint
Just by uploading an audio file of a user's recorded humming or guitar riff, the AI captures that initial spark of inspiration. The core of this function is the native multimodal understanding capability of Google's latest AI model, Gemini 2.5 Flash Lite Preview. Unlike conventional systems that go through a multi-stage process of converting audio to text, it "listens" to the audio file directly to understand the atmosphere and theme of the music. This enables a human-like, intuitive musical understanding, such as "a bright and energetic J-POP style."
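The call pattern for this kind of direct audio understanding can be sketched with the `google-genai` Python SDK. This is an illustrative sketch, not the production code: the model name, prompt wording, and MIME type are assumptions, and the SDK is imported lazily so the sketch can be read without it installed.

```python
ANALYSIS_PROMPT = (
    "Listen to this recording of a hummed melody. "
    "In one short phrase, describe its mood, genre, and tempo feel, "
    "for example: 'bright and energetic J-POP style'."
)

def analyze_humming(audio_path: str, model: str = "gemini-2.5-flash-lite") -> str:
    """Ask Gemini to describe the atmosphere of a hummed melody in text."""
    # Lazy imports: the google-genai SDK is only needed when actually calling the API.
    from google import genai
    from google.genai import types

    client = genai.Client()  # credentials are picked up from the environment
    with open(audio_path, "rb") as f:
        audio = types.Part.from_bytes(data=f.read(), mime_type="audio/mpeg")
    # The audio part and the text prompt go in the same request: no separate
    # speech-to-text stage is needed.
    response = client.models.generate_content(
        model=model, contents=[audio, ANALYSIS_PROMPT]
    )
    return response.text
```

The key point is that the audio bytes and the instruction travel in one multimodal request, which is what removes the conventional audio-to-text preprocessing stage.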
🎸 Feature 2: "Playing" AI - Instant Backing Tracks
Once the music blueprint is obtained, the session begins. Session MUSE automatically generates a backing track using an algorithm that incorporates music theory based on the analyzed theme. This is the process of giving a three-dimensional feel—rhythm and harmony—to a single line of melody. Musicians who used to hone phrases alone can now test ideas as if they were playing with band members.
(Backing track generation time: Average 16 seconds)
💬 Feature 3: "Interacting" AI - A Sounding Board for Creativity
What sets Session MUSE apart from simple automated composition tools is this dialogue function. Beyond specific questions based on music theory like "Can you make this chord progression more dramatic?", its true value lies in its ability to interpret abstract and emotional requests.
"How can I make it feel poignant, like a rainy day in Tokyo?"
To such poetic requests, the AI provides hints to translate sensitivity into specific sound production, such as "In that case, I recommend using deep delay and reverb to give the sound a lingering resonance." This advanced dialogue is realized through stateful workflow management using Gemini and LangGraph.
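The dialogue function depends on carrying the analyzed theme and the conversation history into every turn. The following is a dependency-free sketch of that stateful pattern; in the real service this state lives in a LangGraph workflow, and the class and prompt wording here are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class SessionState:
    """Conversation state carried across dialogue turns (illustrative)."""
    theme: str                      # e.g. the theme extracted from the humming
    history: list = field(default_factory=list)  # (user, assistant) pairs

    def build_prompt(self, user_request: str) -> str:
        """Compose a prompt that grounds the reply in theme + prior turns."""
        past = "\n".join(f"User: {u}\nMUSE: {a}" for u, a in self.history)
        return (
            f"You are a band member helping arrange a track with this theme: "
            f"{self.theme}.\n"
            "Translate the user's emotional request into concrete production "
            "advice (chord progressions, effects, dynamics).\n"
            f"{past}\nUser: {user_request}\nMUSE:"
        )

state = SessionState(theme="bright and energetic pop-rock style")
prompt = state.build_prompt(
    "How can I make it feel poignant, like a rainy day in Tokyo?"
)
```

Because each prompt embeds the theme and history, an abstract request like "a rainy day in Tokyo" is answered in the context of the specific track being built, not in isolation.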
2. The Technology Behind Session MUSE
Technical Features: An AI Pipeline that Translates Humming into Music via "Language"
The biggest technical challenge of this project was generating musically meaningful backing tracks from ambiguous inputs like humming. Initially, we tried to have the generative AI directly estimate the scale and rhythm of the humming, but this proved extremely difficult. However, through trial and error, we discovered that the AI could accurately capture the "atmosphere" and "tone" of the humming through language.
Based on this discovery, instead of generating music directly from audio, we implemented a unique pipeline that goes through "language."
- Audio → Language (Atmosphere Extraction): Gemini listens to the humming and extracts the atmosphere in text, such as "bright and bouncy pop."
- Language → Musical Score (MusicXML Generation): Using the extracted text as a prompt, it generates musical score data (MusicXML) that matches the atmosphere.
- Musical Score → Audio Source (Backing Track Synthesis): The generated MusicXML is converted to MIDI and rendered using a SoundFont audio source. By using established technology for this step, we ensured stable quality.
This flow of "Audio → Language → Musical Score → Audio Source" is a practical approach to overcoming the current limitations of generative AI.
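The last two steps of the pipeline lean on established tooling, so they can be sketched concretely. The sketch below assumes the `fluidsynth` command-line tool and a local SoundFont file for rendering (MusicXML-to-MIDI conversion itself, done in practice with a library such as music21, is omitted); the validation helper simply checks that generated text is well-formed MusicXML before it moves down the pipeline.

```python
import subprocess
import xml.etree.ElementTree as ET

def is_musicxml(xml_text: str) -> bool:
    """Cheap sanity check that generated text is well-formed MusicXML."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    # MusicXML documents use one of these two root elements.
    return root.tag in ("score-partwise", "score-timewise")

def render_command(soundfont: str, midi_path: str, wav_path: str) -> list[str]:
    """Build a fluidsynth invocation that renders MIDI to WAV offline."""
    # -ni: no MIDI input / non-interactive, -F: fast-render to file, -r: sample rate
    return ["fluidsynth", "-ni", soundfont, midi_path, "-F", wav_path, "-r", "44100"]

def render(soundfont: str, midi_path: str, wav_path: str) -> None:
    subprocess.run(render_command(soundfont, midi_path, wav_path), check=True)
```

Using a deterministic renderer for the final step is what makes the output quality stable: the generative AI only produces the score, never the raw audio.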
System Architecture
This AI pipeline is realized with an event-driven architecture combining the following Google Cloud services.

- Main Processing: FastAPI on Cloud Run accepts requests, and LangGraph manages the sequence of processes in the above pipeline as an asynchronous workflow.
- Generative AI Utilization: Gemini on Vertex AI handles "atmosphere extraction" and "MusicXML generation," which are the core of the music.
- CI/CD and IaC: Infrastructure is managed as code with Terraform and automatically deployed to Cloud Run via a CI/CD pipeline using Cloud Build and Artifact Registry.
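The workflow-driven shape of the backend can be illustrated with a dependency-free sketch: each pipeline stage is a node that takes the shared state, enriches it, and passes it on. In the real service these nodes are wired up as a LangGraph graph running on Cloud Run; the node names and state keys below are illustrative placeholders.

```python
def extract_mood(state: dict) -> dict:
    # In production: a Gemini multimodal call on the uploaded audio.
    state["mood"] = f"analyzed mood of {state['audio_path']}"
    return state

def generate_score(state: dict) -> dict:
    # In production: Gemini generates MusicXML from the extracted mood text.
    state["musicxml"] = f"<!-- score for: {state['mood']} -->"
    return state

def render_audio(state: dict) -> dict:
    # In production: MusicXML -> MIDI -> SoundFont rendering.
    state["wav_path"] = state["audio_path"].replace(".mp3", "_backing.wav")
    return state

# LangGraph would register these as graph nodes connected by edges;
# here they run as a simple ordered pipeline over one state dict.
PIPELINE = [extract_mood, generate_score, render_audio]

def run_workflow(audio_path: str) -> dict:
    state = {"audio_path": audio_path}
    for node in PIPELINE:
        state = node(state)
    return state
```

Keeping all intermediate results in one explicit state object is what makes the workflow resumable and easy to run asynchronously behind a FastAPI endpoint.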
3. Demonstration
Scenario: User "Takeshi's" Creative Session
Production Flow
- Humming Upload: Record and upload a pop-style melody of about 15 seconds.
- AI Analysis: The AI extracts a theme like "bright and energetic pop-rock style."
- Music Generation: Automatically generates a 30-second backing track MP3 based on the theme.
- Dialogue Session: The user requests things like "make it more dramatic" or "like a rainy day in Tokyo." The AI suggests specific chord progressions and effects.
- Final Result: The arranged backing track, MusicXML data, and theme analysis results become available for download.
Actual Screen Flow
(Screenshots of the actual screen flow)
4. Technical Challenges and Learnings
Challenges in the Hackathon
- Pursuit of Real-time Performance: Minimizing latency through an asynchronous architecture was a mandatory requirement to ensure users could test ideas without stress.
- Ensuring Musical Quality: We combined genre-specific pattern databases with music theory algorithms so that even simple patterns would sound musically natural.
- Countermeasures against AI Hallucination: We introduced validation logic based on music theory to Gemini's analysis results to enhance output stability.
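The hallucination countermeasure can be illustrated with a simplified validator: chord symbols coming back from the model are checked against a known vocabulary before they reach the score generator, and anything that fails can trigger a re-prompt. The vocabulary and parsing rules here are deliberately simplified assumptions, not the production rule set.

```python
# Simplified music-theory vocabulary: natural roots plus common chord qualities.
ROOTS = {"C", "D", "E", "F", "G", "A", "B"}
QUALITIES = {"", "m", "7", "m7", "maj7", "dim", "sus4"}

def is_valid_chord(symbol: str) -> bool:
    """Accept symbols like 'C', 'Am7', 'F#m'; reject nonsense like 'H7'."""
    root, rest = symbol[:1], symbol[1:]
    if rest[:1] in ("#", "b"):   # allow an accidental on the root
        rest = rest[1:]
    return root in ROOTS and rest in QUALITIES

def validate_progression(chords: list[str]) -> list[str]:
    """Keep only valid chords; a caller can re-prompt the model if any were dropped."""
    return [c for c in chords if is_valid_chord(c)]
```

Gating generative output through deterministic checks like this is a general pattern for stabilizing LLM pipelines: the model proposes, the validator disposes.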
Learnings and Results from the Project
- Potential of Multimodal AI: We demonstrated that the audio understanding capability of Gemini 2.5 Flash Lite Preview is higher than expected, significantly simplifying conventional complex audio processing pipelines.
- The Power of Workflow-Driven Architecture: We confirmed that the combination of LangGraph and Cloud Run enables the rapid construction of scalable and cost-efficient AI applications.
- The Importance of User Experience: We reaffirmed that an intuitive interface, rather than just technological advancement, is the most crucial factor in determining a product's value.
5. Future Prospects: From a Partner to a Creativity Ecosystem
The Session MUSE developed for this hackathon is just the first step of a grand vision. We aim to evolve this AI music partner into a more powerful "creativity ecosystem."
- Real-time Session Function: Our ultimate goal is to realize an AI band member that can follow a user's performance in real-time and jam just like a human.
- DAW Integration: Seamlessly integrate generated MIDI data into professional production workflows such as Logic Pro and Ableton Live.
- Democratization of Music: We aim to create a world where anyone can produce a song from a simple source of inspiration like humming, even without specialized knowledge or expensive equipment.
Conclusion
I would like to thank my teammates who built this together.
"I truly felt that the power of human creativity is significantly enhanced by AI Agents. This potential is infinite, and I am confident that even more exciting things will be born in the future." by Keisuke Karijuku
"I felt there is still immense room for growth in the field of Music x AI. I look forward to seeing this field become even more vibrant in the future." by Takafumi Kubota



