Chinese researchers developed DiffRhythm, a diffusion-based model that generates complete songs up to 4 minutes 45 seconds long, including both vocals and accompaniment. The model uses a variational autoencoder to compress audio into latent representations, which a diffusion transformer, conditioned on lyrics and style prompts, then generates. DiffRhythm can produce a high-quality 4-minute song in just 10 seconds, significantly faster than previous language-model-based approaches. The researchers released their model and code under a noncommercial license.
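To make the pipeline concrete, here is a minimal, purely illustrative sketch of the architecture described above: a VAE compresses audio into latent sequences, and a diffusion transformer conditioned on a lyric/style embedding iteratively refines random latents, which the VAE then decodes back to audio frames. All names (`TinyVAE`, `TinyDiT`, `generate`), shapes, and the simplified refinement loop are assumptions for illustration, not DiffRhythm's actual code or API.

```python
# Hypothetical sketch of a latent-diffusion song pipeline; not DiffRhythm's API.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Stand-in VAE: maps audio frames to compact latents and back."""
    def __init__(self, frame_dim=512, latent_dim=64):
        super().__init__()
        self.encode = nn.Linear(frame_dim, latent_dim)
        self.decode = nn.Linear(latent_dim, frame_dim)

class TinyDiT(nn.Module):
    """Stand-in diffusion transformer: refines latents given conditioning."""
    def __init__(self, latent_dim=64, cond_dim=64, n_heads=4):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(latent_dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, noisy_latents, cond, t):
        # Inject lyric/style conditioning and a scalar timestep signal.
        h = noisy_latents + self.cond_proj(cond) + t
        return self.out(self.backbone(h))

@torch.no_grad()
def generate(dit, vae, cond, seq_len=256, steps=20):
    """Toy iterative-refinement loop (a stand-in for a real diffusion sampler):
    start from noise, repeatedly move latents toward the model's prediction,
    then decode the final latents with the VAE."""
    x = torch.randn(1, seq_len, 64)               # latent sequence for the whole song
    for step in reversed(range(steps)):
        t = torch.full((1, 1, 1), step / steps)   # normalized timestep, broadcast over latents
        pred = dit(x, cond, t)                    # model's estimate of the clean latents
        x = x + (pred - x) / (step + 1)           # simplified update toward the estimate
    return vae.decode(x)                          # back to audio-frame space

cond = torch.randn(1, 256, 64)  # placeholder lyric/style embedding
audio = generate(TinyDiT(), TinyVAE(), cond)
print(audio.shape)  # torch.Size([1, 256, 512])
```

Because the whole latent sequence is denoised in parallel rather than token by token, a handful of refinement steps covers the full song, which is what lets this family of models generate minutes of audio in seconds.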
GitHub and arXiv