Nested Music Transformer:
Sequentially Decoding Compound Tokens in Symbolic Music and Audio Generation

ISMIR 2024

Jiwoo Ryu¹, Hao-Wen Dong², Jongmin Jung¹, Dasaem Jeong¹
¹Sogang University, ²University of California San Diego

Paper · Video · Code


NMT Architecture


Diagram of the two prediction methods in the sub-decoder.

Note: Our proposed Nested Music Transformer (NMT) predicts sub-tokens in a fully-sequential manner.

Illustrations of the proposed Nested Music Transformer (NMT) and other sub-decoder structures
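To make the fully-sequential decoding concrete, here is a minimal PyTorch-style sketch of a sub-decoder that predicts one sub-token (feature) at a time, feeding each prediction back in before predicting the next. The feature names, vocabulary sizes, and the GRU recurrence are illustrative assumptions; the actual NMT sub-decoder differs in architecture, and the GRU stands in only for the sequential conditioning.

```python
# Minimal sketch of fully-sequential sub-token decoding (not the authors' code).
# A main decoder yields one hidden state per compound token; a small sub-decoder
# then predicts that token's features one by one, feeding each prediction back
# in before predicting the next feature.
import torch
import torch.nn as nn

FEATURES = ["beat", "instrument", "pitch", "duration", "velocity"]  # assumed order
VOCAB = {"beat": 64, "instrument": 32, "pitch": 128, "duration": 64, "velocity": 32}

class SequentialSubDecoder(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.embed = nn.ModuleDict({f: nn.Embedding(VOCAB[f], d_model) for f in FEATURES})
        self.heads = nn.ModuleDict({f: nn.Linear(d_model, VOCAB[f]) for f in FEATURES})
        # A single-layer causal recurrence over the short intra-token sequence.
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, main_hidden):                       # main_hidden: (B, d_model)
        state = main_hidden.unsqueeze(0).contiguous()     # initial state (1, B, d)
        inp = main_hidden.unsqueeze(1)                    # first sub-step input (B, 1, d)
        logits, samples = {}, {}
        for f in FEATURES:                                # fully sequential: one feature at a time
            out, state = self.rnn(inp, state)
            logits[f] = self.heads[f](out[:, -1])         # (B, vocab_f)
            samples[f] = logits[f].argmax(-1)             # greedy here; sampling in practice
            inp = self.embed[f](samples[f]).unsqueeze(1)  # feed prediction to next sub-step
        return logits, samples

if __name__ == "__main__":
    sub = SequentialSubDecoder()
    h = torch.randn(2, 512)                               # stand-in for main-decoder output
    logits, samples = sub(h)
    print({f: v.shape for f, v in logits.items()})
```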


Encoding Comparison


An example illustrating the proposed note-based (NB) encodings, (c) NB-Metric1st and (d) NB-Pitch1st, alongside (a) REMI and (b) Compound word.

Note: All encodings represent the same piece of music using five musical features. REMI and Compound word were not originally designed for multi-instrument pieces, which is why we append "+I" to their names in (a) and (b). Here, k denotes the number of notes, which equals the sequence length for NB, while r and c denote the corresponding sequence-length ratios for REMI and Compound word, both greater than 1.
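For intuition, the snippet below sketches how the same two notes might be serialized under a flattened REMI-like encoding versus a note-based compound encoding. The feature names, values, and ordering are illustrative assumptions, not the paper's exact vocabularies.

```python
# Illustrative only: the same two notes under a flattened REMI-like encoding
# versus a note-based (NB) compound encoding.

# Flattened (REMI-like): every feature is its own token, so the sequence is
# roughly r*k tokens long for k notes (r > 1).
remi_like = [
    "Bar", "Position_0", "Instrument_Piano", "Pitch_60", "Duration_8", "Velocity_20",
           "Position_8", "Instrument_Piano", "Pitch_64", "Duration_8", "Velocity_20",
]

# Note-based (NB): one compound token per note, so the sequence length equals k;
# the sub-decoder decodes the features of each compound token internally.
nb_metric_first = [
    {"beat": 0, "instrument": "Piano", "pitch": 60, "duration": 8, "velocity": 20},
    {"beat": 8, "instrument": "Piano", "pitch": 64, "duration": 8, "velocity": 20},
]

# NB-Pitch1st differs only in the order in which the sub-decoder predicts the
# features (pitch before the metric features), not in the information stored.
```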


Results


The statistics of the datasets used in the experiments.



The hyperparameters used in the experiments for each dataset.


The main results of the experiments for symbolic music generation: comparison of the average NLL loss for each model.

Note: We used a total of 12 layers, counting the main-decoder layers, the sub-decoder layers, and the feature-enricher layers when the latter are used. All models were trained with a hidden size of 512 and 8 attention heads.
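As an illustration of the layer budget described in the note above, here is a hedged configuration sketch. Only the total of 12 layers, the hidden size of 512, and the 8 attention heads come from the text; the split across components is an assumed example, not the setting used in the paper.

```python
# Hedged sketch of the model configuration implied by the note above.
# The per-component split is an assumption; only the totals come from the text.
config = {
    "d_model": 512,
    "n_heads": 8,
    "n_layers_total": 12,
    "n_main_decoder_layers": 10,      # assumed split
    "n_sub_decoder_layers": 1,        # assumed split
    "n_feature_enricher_layers": 1,   # counted only when the feature enricher is used
}
assert (config["n_main_decoder_layers"]
        + config["n_sub_decoder_layers"]
        + config["n_feature_enricher_layers"]) == config["n_layers_total"]
```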

Generated Results


Best unconditional generation samples

Settings: The results of unconditional generation from 4 different datasets. The model is seeded with only a Start-of-Sequence (SOS) token and generates the note sequences. The outputs are converted into MIDI files and then rendered as audio with Logic Pro X (a digital audio workstation). We trimmed each audio file to a maximum length of two minutes.
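The generation procedure above amounts to standard autoregressive sampling seeded with an SOS token. Below is a minimal sketch of such a loop over a flattened vocabulary; with the NB encodings each step would instead emit one compound token through the sub-decoder, and `model`, its interface, and the stopping criterion are hypothetical stand-ins rather than the released code.

```python
# Minimal sketch of the unconditional generation loop described above:
# start from a Start-of-Sequence token and sample tokens one at a time.
import torch

@torch.no_grad()
def generate_unconditional(model, sos_token, eos_token, max_tokens=4096, temperature=1.0):
    tokens = [sos_token]
    for _ in range(max_tokens):
        logits = model(torch.tensor(tokens).unsqueeze(0))[0, -1]   # next-token logits
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, 1).item()
        if next_token == eos_token:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop SOS; convert to MIDI with the project's decoder afterwards
```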

                    SOD        Lakh       Pop1k7
REMI + flattening   (audio)    (audio)    (audio)
NB-PF + NMT         (audio)    (audio)    (audio)

4-measure continuation comparison among models on the SOD dataset

Settings: The model is provided with the symbolic tokens of selected pieces, each four measures long, and generates note sequences that continue them. The selection of prompt pieces is crucial, as they should supply captivating motifs for the continuation. Although the audio files were trimmed to two minutes for the demo page, the samples in the listening test were limited to 50 seconds each. This kept the total duration of the test at 20 minutes and helped participants maintain concentration.

                    Prompt 1   Prompt 2   Prompt 3
REMI + flattening   (audio)    (audio)    (audio)
CP + Catvec         (audio)    (audio)    (audio)
CP + NMT            (audio)    (audio)    (audio)
NB-PF + NMT         (audio)    (audio)    (audio)

Continuation samples based on EnCodec tokens fine-tuned on MAESTRO

Settings: The model receives 10-second audio samples, encoded as EnCodec tokens. Six models are employed to generate tokens from these inputs. We then decoded the generated audio tokens into audio files using a decoder pretrained with the MAESTRO dataset.

                    Prompt 1   Prompt 2   Prompt 3
Parallel            (audio)    (audio)    (audio)
Flatten             (audio)    (audio)    (audio)
Delay               (audio)    (audio)    (audio)
Self-attention      (audio)    (audio)    (audio)
Cross-attention     (audio)    (audio)    (audio)
NMT                 (audio)    (audio)    (audio)
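For readers who want to reproduce the last step of the audio pipeline, here is a hedged sketch of decoding EnCodec codes back to a waveform with the public `encodec` package's stock 24 kHz model. The authors instead use a decoder fine-tuned on MAESTRO, and the code tensor here is a random stand-in, so this only illustrates the decoding call, not their exact setup.

```python
# Hedged sketch: decoding generated EnCodec codes to audio with the public
# `encodec` package (stock 24 kHz model), not the MAESTRO-fine-tuned decoder
# mentioned above. Codes follow the package convention: integer indices of
# shape (batch, n_codebooks, n_frames).
import torch
import torchaudio
from encodec import EncodecModel

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)               # 8 codebooks at 6 kbps (assumed setting)

codes = torch.randint(0, 1024, (1, 8, 750))   # random stand-in for ~10 s of generated codes
with torch.no_grad():
    wav = model.decode([(codes, None)])       # list of (codes, scale) frames -> (B, C, T)
torchaudio.save("continuation.wav", wav[0], model.sample_rate)
```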