Nested Music Transformer:
Sequentially Decoding Compound Tokens in Symbolic Music and Audio Generation

ISMIR 2024


NMT Architecture


Diagram of the different prediction methods in the sub-decoder.

Note: Our proposed Nested Music Transformer (NMT) predicts each feature in a fully sequential manner, unlike previous decoding architectures.

Illustrations of the proposed Nested Music Transformer (NMT) and other sub-decoder structures
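The fully sequential scheme can be sketched in a few lines: the main decoder emits one hidden state per compound token, and the sub-decoder then predicts the features one at a time, folding each decoded feature back into the state before predicting the next. The dimensions, feature names, and randomly initialized parameters below are hypothetical toy values for illustration, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and feature set (hypothetical, for illustration only).
HIDDEN = 8
FEATURES = ["type", "beat", "pitch", "duration"]   # one compound token = 4 sub-tokens
VOCAB = {"type": 4, "beat": 16, "pitch": 128, "duration": 32}

# Randomly initialized toy parameters: an embedding table and a
# projection head per feature.
emb = {f: rng.normal(size=(VOCAB[f], HIDDEN)) for f in FEATURES}
head = {f: rng.normal(size=(HIDDEN, VOCAB[f])) for f in FEATURES}

def decode_compound_token(main_hidden):
    """Sequentially decode one compound token, feature by feature.

    Each sub-token is predicted from the main decoder's hidden state plus
    the embeddings of the features already decoded -- the fully sequential
    scheme of the NMT sub-decoder, in contrast to parallel heads that
    predict all features at once.
    """
    state = main_hidden.copy()
    decoded = {}
    for f in FEATURES:
        logits = state @ head[f]        # project current state onto this feature's vocab
        token = int(np.argmax(logits))  # greedy pick, for the sketch
        decoded[f] = token
        state = state + emb[f][token]   # condition later features on this choice
    return decoded

out = decode_compound_token(rng.normal(size=HIDDEN))
```

Because each feature conditions on the ones before it, the order of FEATURES matters, which is exactly what the NB-Type1st and NB-Pitch1st encodings below vary.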


Encoding Comparison


An example illustrating the proposed note-based encodings, (c) NB-Type1st and (d) NB-Pitch1st, alongside (a) REMI and (b) Compound word for comparison.

Note: All the encodings represent the same piece of music using 8 features. REMI and Compound word were not originally designed for multi-instrument pieces, so we rename their extended versions with "+I" in (a) and (b); the main ideas of the original encodings are preserved. Suppose the piece contains K notes and each note is encoded with F features. The sequence-length scale factors of REMI (r) and Compound word (c), relative to the number of notes, satisfy the inequality 1 < c <= 2 < r < F. The factor c reaches 2 only when every note has a distinct onset position (no simultaneous notes).
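The inequality can be checked with a toy token-count calculation. The accounting below is a plausible sketch, assuming Compound word emits one compound token per note plus one metric compound token per unique onset, while REMI emits every feature as its own token but shares 2 hypothetical metric features across simultaneous notes; the exact feature split is an assumption for illustration.

```python
# Toy sequence-length accounting for K notes with F features each.
K, F = 100, 8

def compound_word_len(unique_onsets):
    # One compound token per note, plus one metric compound token
    # per unique onset position.
    return K + unique_onsets

def remi_len(unique_onsets):
    # Every feature is its own token; assume 2 metric features are
    # emitted once per unique onset and the other F - 2 once per note.
    return K * (F - 2) + 2 * unique_onsets

c_all_distinct = compound_word_len(K) / K   # every note on its own onset
c_with_chords = compound_word_len(40) / K   # many simultaneous notes
r = remi_len(40) / K
```

Under these assumptions c hits its ceiling of 2 exactly when all K onsets are distinct, while r stays strictly between 2 and F because the shared metric tokens keep REMI below the fully flattened K * F length.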


Results


The statistics of the dataset used in the experiments.



The hyperparameters used in the experiments for each dataset.


The main results of the experiments for symbolic music generation: comparison of average NLL loss for each model.

Note: We used a total of 12 layers, counting the main decoder layers, sub-decoder layers, and feature-enricher layers where utilized. All models were trained with a hidden size of 512 and 8 attention heads.

Generated Results


Best unconditioned generation samples

Settings: The results of unconditional generation from 4 different datasets. The model is given a random seed with only a Start-of-Sequence (SOS) token and generates the note sequences. The outputs are converted into MIDI files and then rendered to audio with Logic Pro X (a digital audio workstation). We trimmed each audio file to a maximum length of 2 minutes.

                     SOD        Lakh       Pop1k7     Pop909
REMI + flattening    (audio)    (audio)    (audio)    (audio)
NB-PF + NMT          (audio)    (audio)    (audio)    (audio)

4-measure continuation comparison among models on the SOD dataset

Settings: The model is provided with the symbolic tokens of selected pieces, each four measures long, and generates note sequences from these prompts. The selection of prompt pieces is crucial, as captivating motifs give the models more to build on in the continuation. Although the audio files on this demo page are trimmed to two minutes, the samples in the listening test were limited to 50 seconds each. This adjustment kept the total duration of the test at 20 minutes, helping participants maintain concentration.

                     Prompt 1   Prompt 2   Prompt 3
REMI + flattening    (audio)    (audio)    (audio)
CP + Catvec          (audio)    (audio)    (audio)
CP + NMT             (audio)    (audio)    (audio)
NB-PF + NMT          (audio)    (audio)    (audio)

Continued generation samples based on MAESTRO fine-tuned EnCodec tokens

Settings: The model receives 10-second audio samples, encoded as EnCodec tokens. Six models are employed to generate tokens from these inputs. We then decoded the generated audio tokens into audio files using a decoder pretrained with the MAESTRO dataset.

                     Prompt 1   Prompt 2   Prompt 3
Parallel             (audio)    (audio)    (audio)
Flatten              (audio)    (audio)    (audio)
Delay                (audio)    (audio)    (audio)
Self-attention       (audio)    (audio)    (audio)
Cross-attention      (audio)    (audio)    (audio)
NMT                  (audio)    (audio)    (audio)