
One can also use a hybrid approach (first generate the symbolic music, then render it to raw audio using a wavenet conditioned on piano rolls,[13][14] an autoencoder,[15] or a GAN[16]) or do music style transfer, to transfer styles between classical and jazz music,[17] generate chiptune music,[18] or disentangle musical style and content.[19][20] For a deeper dive into raw audio modelling, we recommend this excellent overview.

We chose to work on music because we want to continue to push the boundaries of generative models. Our previous work on MuseNet explored synthesizing music based on large amounts of MIDI data.

Generating music at the audio level is challenging since the sequences are very long. A typical 4-minute song at CD quality (44 kHz, 16-bit) has over 10 million timesteps. For comparison, GPT-2 had 1,000 timesteps and OpenAI Five took tens of thousands of timesteps per game. Thus, to learn the high-level semantics of music, a model would have to deal with extremely long-range dependencies.
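As a rough check on that figure, the timestep count follows directly from the sample rate and the song length. The sketch below is just illustrative arithmetic (assuming a 44.1 kHz sample rate and a 4-minute track), not code from our models.

```python
# Back-of-the-envelope count of raw audio timesteps in a typical song.
SAMPLE_RATE_HZ = 44_100   # CD quality: 44.1 kHz, i.e. 44,100 samples per second
SONG_SECONDS = 4 * 60     # a typical 4-minute song

timesteps = SAMPLE_RATE_HZ * SONG_SECONDS
print(f"{timesteps:,} timesteps")  # 10,584,000 -> over 10 million

# Context lengths mentioned above, for comparison.
print(f"~{timesteps // 1_000:,}x longer than GPT-2's 1,000-timestep context")
```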



One way of addressing the long input problem is to use an autoencoder that compresses raw audio to a lower-dimensional space by discarding some of the perceptually irrelevant bits of information. We can then train a model to generate audio in this compressed space, and upsample back to the raw audio space.
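To make the compression idea concrete, here is a minimal, hypothetical sketch in PyTorch: a stack of strided 1-D convolutions shortens the raw waveform into a much shorter latent sequence, and transposed convolutions upsample it back. The channel count, number of levels, and stride below are assumptions chosen for illustration, not the configuration of our models.

```python
import torch
import torch.nn as nn

class AudioAutoencoder(nn.Module):
    """Illustrative raw-audio autoencoder: downsample to a latent sequence, upsample back."""

    def __init__(self, channels=64, levels=3, stride=8):
        super().__init__()
        enc, dec = [], []
        in_ch = 1  # mono raw audio shaped (batch, 1, timesteps)
        for _ in range(levels):
            # Each strided conv shortens the sequence by a factor of `stride`.
            enc += [nn.Conv1d(in_ch, channels, kernel_size=2 * stride,
                              stride=stride, padding=stride // 2), nn.ReLU()]
            in_ch = channels
        for i in range(levels):
            out_ch = 1 if i == levels - 1 else channels
            # Transposed convs upsample the latent back toward the raw audio length.
            dec.append(nn.ConvTranspose1d(channels, out_ch, kernel_size=2 * stride,
                                          stride=stride, padding=stride // 2))
            if i != levels - 1:
                dec.append(nn.ReLU())
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(*dec)

    def forward(self, audio):
        latent = self.encoder(audio)            # roughly stride**levels (= 512x) fewer timesteps
        reconstruction = self.decoder(latent)   # approximately the original length
        return latent, reconstruction

# One second of 44.1 kHz audio compresses to a latent sequence a generative
# model could be trained on in place of the raw samples.
x = torch.randn(1, 1, 44_100)
latent, recon = AudioAutoencoder()(x)
print(latent.shape, recon.shape)  # e.g. torch.Size([1, 64, 86]) torch.Size([1, 1, 44032])
```

A generative model trained on such a latent sequence would only have to model tens of thousands of steps per song instead of millions, at the cost of discarding some audio detail in the reconstruction.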
