ISMIR 2024
Jiwoo Ryu¹
Hao-Wen Dong²
Jongmin Jung¹
Dasaem Jeong¹
¹SOGANG UNIVERSITY, ²UNIVERSITY OF CALIFORNIA SAN DIEGO
Diagram of two prediction methods in the sub-decoder.
Note: Our proposed Nested Music Transformer (NMT) predicts sub-tokens in a fully-sequential manner.
Illustrations of the proposed Nested Music Transformer (NMT) and other sub-decoder structures.
An example illustrating the proposed note-based (NB) encodings, (c) NB-Metric1st and (d) NB-Pitch1st, alongside (a) REMI and (b) Compound word.
Note: All encodings represent the same piece of music using five musical features. Because REMI and Compound word were not originally designed for multi-instrument pieces, we append "+I" to their names in (a) and (b). Here, k denotes the number of notes (and thus the sequence length for NB), while r and c denote the length ratios of REMI and Compound word relative to NB, both greater than 1.
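To make the length comparison concrete, here is a minimal sketch (with an illustrative, assumed vocabulary, not the paper's actual token set) contrasting a flattened REMI-style stream with a note-based encoding, where each note is a single timestep holding a tuple of sub-tokens:

```python
# Toy notes: (beat, instrument, pitch, duration, velocity).
# All feature names and values below are illustrative assumptions.
notes = [
    (0, "piano", 60, 4, 80),
    (2, "violin", 67, 2, 72),
]

def remi_flatten(notes):
    """Flatten every feature into one long token stream (length grows ~ r*k)."""
    stream = []
    for beat, inst, pitch, dur, vel in notes:
        stream += [f"Beat_{beat}", f"Inst_{inst}", f"Pitch_{pitch}",
                   f"Dur_{dur}", f"Vel_{vel}"]
    return stream

def nb_encode(notes):
    """One compound timestep per note (length k); a sub-decoder would then
    predict the five sub-tokens within each timestep sequentially."""
    return [(f"Beat_{b}", f"Inst_{i}", f"Pitch_{p}", f"Dur_{d}", f"Vel_{v}")
            for b, i, p, d, v in notes]

print(len(remi_flatten(notes)))  # 10 tokens: 5 features per note, flattened
print(len(nb_encode(notes)))     # 2 timesteps: one per note
```

With k notes and five features, the flattened stream is several times longer than the NB sequence, which is the ratio the note above denotes by r and c.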
The statistics of the datasets used in the experiments.
The hyperparameters used in the experiments for each dataset.
The main results of the experiments on symbolic music generation: comparison of average NLL loss for each model.
Note: We used 12 layers in total, counting the main decoder layers, the sub-decoder layers, and the feature-enricher layers when the latter are used. All models are trained with a hidden size of 512 and 8 attention heads.
Settings: The results of unconditional generation on 4 different datasets. The model is given a random seed with only a Start-of-Sequence (SOS) token and generates the note sequences. The outputs are converted into MIDI files and then rendered to audio with Logic Pro X (a digital audio workstation). We trimmed each audio file to a maximum length of 2 minutes.
| | SOD | Lakh | Pop1k7 |
|---|---|---|---|
| REMI + flattening | | | |
| NB-PF + NMT | | | |
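The unconditional setting above can be sketched as a plain autoregressive sampling loop. This is a hedged illustration only: `toy_next_token_probs` is a uniform placeholder standing in for the trained model's predictive distribution, and the vocabulary is an assumption.

```python
import random

VOCAB = ["EOS", "Pitch_60", "Pitch_62", "Dur_2", "Dur_4"]

def toy_next_token_probs(prefix):
    # Uniform distribution as a placeholder for the model's logits.
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def generate(max_len=64, seed=0):
    random.seed(seed)              # the "random seed" mentioned in the settings
    seq = ["SOS"]                  # generation starts from only an SOS token
    while len(seq) < max_len:
        probs = toy_next_token_probs(seq)
        toks, weights = zip(*probs.items())
        tok = random.choices(toks, weights=weights)[0]
        if tok == "EOS":           # stop when the model emits end-of-sequence
            break
        seq.append(tok)
    return seq

sample = generate()
print(sample[0])  # always "SOS"
```

In the actual system each step would emit a compound note token whose sub-tokens are predicted by the sub-decoder, and the resulting note sequence is then written out as MIDI.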
Settings: The model is provided with the symbolic tokens of selected pieces, each four measures in length, and generates note sequences continuing from them. The selection of prompt pieces is crucial, as captivating motifs give the continuation a strong basis. Although the audio files on the demo page were trimmed to two minutes, during the listening test the samples were limited to 50 seconds each, keeping the total test duration at 20 minutes and helping participants maintain concentration.
| | Prompt 1 | Prompt 2 | Prompt 3 |
|---|---|---|---|
| REMI + flattening | | | |
| CP + Catvec | | | |
| CP + NMT | | | |
| NB-PF + NMT | | | |
Settings: The model receives 10-second audio samples encoded as EnCodec tokens. Six models are employed to generate tokens from these inputs. We then decode the generated audio tokens into audio files using a decoder pretrained on the MAESTRO dataset.
| | Prompt 1 | Prompt 2 | Prompt 3 |
|---|---|---|---|
| Parallel | | | |
| Flatten | | | |
| Delay | | | |
| Self-attention | | | |
| Cross-attention | | | |
| NMT | | | |
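The "Flatten" and "Delay" baselines above refer to standard ways of arranging the multiple EnCodec codebook streams for a single decoder (the delay arrangement following the MusicGen-style idea). A minimal sketch, using an assumed toy grid of 2 codebooks over 4 frames:

```python
PAD = "PAD"
# codes[k][t] = token from codebook k at frame t (illustrative values)
codes = [["a0", "a1", "a2", "a3"],   # codebook 1
         ["b0", "b1", "b2", "b3"]]   # codebook 2

def flatten_pattern(codes):
    """Interleave all codebooks into one long stream (length K*T)."""
    K, T = len(codes), len(codes[0])
    return [codes[k][t] for t in range(T) for k in range(K)]

def delay_pattern(codes):
    """Shift codebook k right by k steps so one token per codebook can be
    predicted in parallel at each decoding step; total length T + K - 1."""
    K, T = len(codes), len(codes[0])
    steps = []
    for t in range(T + K - 1):
        steps.append([codes[k][t - k] if 0 <= t - k < T else PAD
                      for k in range(K)])
    return steps

print(flatten_pattern(codes))      # ['a0', 'b0', 'a1', 'b1', ...], length 8
print(len(delay_pattern(codes)))   # 5 decoding steps for T=4, K=2
```

Flattening preserves exact inter-codebook dependencies at the cost of a K-times-longer sequence, while the delay pattern keeps the sequence short but staggers the codebooks by one step; the NMT's sub-decoder instead models the codebooks within each frame sequentially.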