As AI-generated text and images capture the world’s attention, music is catching up.
What’s new: Andrea Agostinelli, Timo I. Denk, and colleagues at Google and Sorbonne Université introduced MusicLM, a system that generates music from text descriptions. You can hear its output here.
Key insight: Paired natural-language descriptions of music and corresponding music recordings are relatively scarce. How, then, to train a text-to-music generator? Previous work trained a model to map corresponding text and music to the same embedding. Because a clip and its description land in the same embedding space, a generator can learn to regenerate music from a large corpus of recordings while conditioned on embeddings of the music, and then, at inference, be prompted with embeddings of text instead.
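To make the shared-embedding idea concrete, here's a minimal sketch of contrastive text-audio pretraining in PyTorch. The encoder modules, batch format, and temperature are assumptions for illustration, not MuLan's actual code; only the objective, which pulls matched music and descriptions toward the same embedding and pushes mismatched pairs apart, reflects the idea described above.

```python
# Minimal sketch of MuLan-style joint embedding training (illustrative only).
# `audio_encoder` and `text_encoder` are hypothetical modules that map a batch
# of clips or captions to vectors of the same dimension.
import torch
import torch.nn.functional as F

def contrastive_step(audio_encoder, text_encoder, audio_batch, caption_batch,
                     temperature=0.07):
    """One training step: matched (music, description) pairs are pulled toward
    the same embedding; mismatched pairs within the batch are pushed apart."""
    a = F.normalize(audio_encoder(audio_batch), dim=-1)    # (batch, dim)
    t = F.normalize(text_encoder(caption_batch), dim=-1)   # (batch, dim)
    logits = a @ t.T / temperature                         # pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)     # i-th clip matches i-th caption
    # Symmetric cross-entropy over both directions (audio-to-text, text-to-audio).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

Once the two modalities share a space, a generator trained to condition on audio embeddings can accept text embeddings at inference, which is how MusicLM sidesteps the shortage of paired text-music data.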
How it works: MusicLM learned to regenerate audio clips (30 seconds long, sampled at 24kHz) from an undisclosed corpus of 280,000 hours of recorded music. The challenge involved modeling sound in three distinct aspects: the correspondence between words and music; large-scale composition, such as a spare introduction that repeats with an added melody; and small-scale details, such as the attack and decay of a single drum beat. The team represented each aspect using a different type of token, each generated by a different pretrained system, as outlined in the list and code sketches below.
- Given an audio clip, MuLan (a transformer-based system) generated 12 audio-text tokens designed to represent both music and corresponding descriptions. It was pretrained on soundtracks of 44 million online music videos and their text descriptions to embed corresponding music and text to the same representation.
- Given the same audio clip, w2v-BERT generated 25 semantic tokens per second that represented large-scale composition. It was pretrained to predict masked tokens in speech and fine-tuned on 8,200 hours of music.
- Given the same audio clip, the encoder component of a SoundStream autoencoder generated 600 acoustic tokens per second, capturing small-scale details. It was pretrained to reconstruct music and speech and fine-tuned on 8,200 hours of music.
- Given the audio-text tokens, a series of transformers learned to generate semantic tokens.
- Given the semantic and audio-text tokens, a second series of transformers learned to generate acoustic tokens.
- At inference, MuLan generated audio-text tokens from an input text description rather than input music. Given the acoustic tokens produced by the second series of transformers, the SoundStream decoder generated a music clip.
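To keep the pieces straight, here's a schematic of how one training clip becomes conditioning and targets for the two transformer stages. The tokenizer objects and their method names are hypothetical stand-ins for the pretrained MuLan, w2v-BERT, and SoundStream models; the token counts come from the figures above (12 audio-text tokens per clip, 25 semantic tokens and 600 acoustic tokens per second).

```python
# Hypothetical tokenizer interfaces; `audio` is assumed to be an array of
# 24kHz samples. Only the token counts come from the description above.

CLIP_SECONDS = 30
SAMPLE_RATE = 24_000

def training_examples(audio, mulan, w2v_bert, soundstream_encoder):
    """Turn one 30-second training clip into (conditioning, target) pairs for
    the two series of transformers."""
    assert audio.shape[-1] == CLIP_SECONDS * SAMPLE_RATE

    audio_text = mulan.audio_tokens(audio)        # 12 tokens: what the clip is about
    semantic = w2v_bert.tokens(audio)             # 25/s * 30s = 750 tokens: structure
    acoustic = soundstream_encoder.tokens(audio)  # 600/s * 30s = 18,000 tokens: detail

    # First series of transformers: predict semantic tokens from audio-text tokens.
    stage1 = {"condition": audio_text, "target": semantic}
    # Second series: predict acoustic tokens from audio-text plus semantic tokens.
    stage2 = {"condition": (audio_text, semantic), "target": acoustic}
    return stage1, stage2
```

And here's the inference path in the same schematic style, again with assumed method names: MuLan's text tower supplies the audio-text tokens that its audio tower supplied during training, and everything downstream stays the same. The `.sample()` calls hide the token-by-token autoregressive decoding.

```python
# Schematic inference. `semantic_stage` and `acoustic_stage` stand in for the
# two series of transformers described above; all method names are assumptions.

def generate_from_text(description, mulan, semantic_stage, acoustic_stage,
                       soundstream_decoder):
    """Generate a music clip from a text description."""
    audio_text_tokens = mulan.text_tokens(description)   # 12 tokens per prompt

    # Stage 1: audio-text tokens -> semantic tokens (large-scale composition).
    semantic_tokens = semantic_stage.sample(condition=audio_text_tokens)

    # Stage 2: audio-text + semantic tokens -> acoustic tokens (fine detail).
    acoustic_tokens = acoustic_stage.sample(
        condition=(audio_text_tokens, semantic_tokens))

    # The SoundStream decoder turns acoustic tokens back into a 24kHz waveform.
    return soundstream_decoder.decode(acoustic_tokens)
```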
Results: The authors fed 1,000 text descriptions from a text-music dataset (released with the paper) to MusicLM and two other recent text-to-music models, Riffusion and Mubert. Listeners judged which clip best matched a given caption, choosing among the generated clips and the dataset's own recordings, which were produced by professional musicians. They judged MusicLM to have created the best match 30.0 percent of the time, Riffusion 15.2 percent of the time, and Mubert 9.3 percent of the time. They judged the ground-truth, human-created music to be the best fit 45.4 percent of the time.
Yes, but: The listeners didn’t evaluate the generated clips based on how musically satisfying they were, just how well they matched the corresponding text.
Why it matters: Rather than relying on a single embedding, the authors combined three embeddings that represent an audio clip with increasing degrees of specificity. This approach, which is analogous to a human writer’s tendency to start with a concept, sketch an outline, and fill in the words, may be useful in other applications that require a computer to generate detailed, dynamic, long-form output.
We’re thinking: MusicLM’s output sounds more coherent than that of previous music generators, but it’s hard to judge musical values that unfold over time from brief clips. That said, it shows an impressive ability to interpret the diverse emotional language found in descriptions of painter Jacques-Louis David’s triumphant “Napoleon Crossing the Alps” and Edvard Munch’s harrowing “The Scream.”