MusPyExpress
1 University of California, San Diego
2 University of Michigan, Ann Arbor
Current work in modeling symbolic music primarily relies on representations extracted from MIDI-like data. While such formats allow for modeling symbolic music as sequences of notes, they omit the large space of symbolic annotations common in Western sheet music, broadly known as expression text, such as tempo or dynamics, which specify time- and velocity-dependent controls on the musical composition and performance. To address this gap, we present MusPyExpress, an extension to the popular symbolic music processing library MusPy[1] that enables the extraction of expression text along with symbolic music for downstream modeling. Using this extension, we parse the PDMX dataset[2,3] to illustrate the wealth of expression text available in MusicXML datasets. Additionally, we introduce multiple generative tasks, including joint expression-note generation, expression-conditioned music generation, and expression tagging, that take advantage of this additional notational information.
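As a rough illustration of the intended workflow, the sketch below loads a MusicXML file and inspects its notes and time-stamped text annotations. `muspy.read_musicxml`, `Track.notes`, and `Music.annotations` exist in base MusPy; whether the expressive branch surfaces expression text exactly through these fields is an assumption here, not a statement of the actual MusPyExpress interface.

```python
# Minimal sketch of extracting expression text alongside notes with MusPyExpress.
# The expression-text fields assumed below (time-stamped entries in
# `music.annotations`) may differ from the expressive branch's actual interface.
import muspy

music = muspy.read_musicxml("example.mxl")  # parse a (compressed) MusicXML file

# Notes are available per track, as in standard MusPy.
for track in music.tracks:
    print(track.name, len(track.notes))

# Expression text (e.g., tempo or dynamic markings) is assumed to be surfaced
# as time-stamped annotations for downstream sequence modeling.
for annotation in music.annotations:
    print(annotation.time, annotation.annotation)
```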
We present three generative tasks enabled by the additional expression text extracted by MusPyExpress (the latter two of which we refer to as conditional tasks): joint note-expression text generation, expression-conditioned note generation, and expression tagging. We describe the details of our experiments, based on the model proposed in Multitrack Music Transformer (MMT)[4], at length in Appendix C of our paper, but we provide an overview here.
We conduct all of our experiments using models trained under two different timing schemes: metrical and real. Under the metrical scheme, an event's onset time is represented with two values, beat and position, where beat is the number of beats since the start of the piece and position is the event's position within the current beat; the metrical timing scheme is therefore tempo-agnostic. Meanwhile, the real timing scheme represents an event's onset time as a single value, time, which is the number of seconds since the start of the piece; many expression text types are temporal in nature and would not be captured at the note level under a metrical scheme.
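For concreteness, here is a minimal sketch of how a single onset could be encoded under the two schemes. The tick-based note times and per-beat resolution follow standard MusPy conventions, but the helper names and the constant-tempo assumption are illustrative simplifications, not the library's actual encoding.

```python
# Sketch: encode one onset time under the metrical and real timing schemes.
# Assumes a single constant tempo; the actual encoding may handle tempo
# changes and quantization differently.

def metrical_time(onset_ticks: int, resolution: int) -> tuple[int, int]:
    """Return (beat, position): beats since the start and position within the beat."""
    return onset_ticks // resolution, onset_ticks % resolution

def real_time(onset_ticks: int, resolution: int, qpm: float) -> float:
    """Return the onset in seconds, assuming one constant tempo in quarter notes per minute."""
    return onset_ticks / resolution * 60.0 / qpm

# e.g., resolution = 24 ticks per beat, onset at tick 100, tempo 120 QPM
print(metrical_time(100, 24))      # (4, 4): beat 4, position 4 within the beat
print(real_time(100, 24, 120.0))   # ~2.08 seconds
```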
As we seek to model the sequence of notes and expression text together with an autoregressive model, how we interleave the notes and expression text into a single sequence x = interleave(n, e) can make a large difference in performance. Under each conditional experiment, “events” are conditioned on “controls.” For example, expression text controls dictate the note events generated by an expression-conditioned model. Prefix conditioning sets the control sequence as a prefix to the main event sequence. While simple, this forces the model to attend to long-term dependencies in order to model the relationship between events and controls that are close in time. Anticipation[5] remedies this by placing controls close in the sequence to the events they affect. Specifically, under anticipatory conditioning, a control with onset time t is placed after the first event i with onset time tᵢ such that tᵢ ≥ t - δ for some offset δ, which effectively interleaves the controls within the event sequence rather than keeping them separate (see [5] for an in-depth description). For all experiments, we set δ = 8, which is interpreted under the metrical and real timing schemes as beats and seconds, respectively.
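A minimal sketch of anticipatory interleaving as described above: a control with onset time t is inserted immediately after the first event whose onset time is at least t - δ. The (onset, payload) tuple layout and function name are illustrative, not the library's actual interface, and both lists are assumed to be sorted by onset time.

```python
# Sketch of anticipatory interleaving: each control with onset time t is placed
# right after the first event whose onset time satisfies t_event >= t - delta.

def anticipatory_interleave(events, controls, delta=8.0):
    """Interleave controls into the event sequence with anticipation offset delta."""
    sequence = []
    c = 0  # index of the next control to place
    for onset, payload in events:
        sequence.append(("event", onset, payload))
        # Place every pending control whose anticipated time (t - delta) has been reached.
        while c < len(controls) and onset >= controls[c][0] - delta:
            sequence.append(("control", *controls[c]))
            c += 1
    # Any remaining controls fall after the final event.
    sequence.extend(("control", *ctrl) for ctrl in controls[c:])
    return sequence

events = [(0, "note C4"), (4, "note E4"), (12, "note G4")]
controls = [(10, "crescendo"), (16, "fermata")]
print(anticipatory_interleave(events, controls, delta=8))
# the crescendo (t = 10) lands after the event at onset 4, the first with onset >= 10 - 8
```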
We use the model from MMT[4] as a baseline for our experiments. This model ignores expression text and generates only notes.
We model both sequences together, pθ(n, e), and thus produce a model that can generate both notes and expression text. This task mirrors how composers actually write music, adding new notes and expression text simultaneously.
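Written out against the interleaved sequence defined above, this is a restatement of the standard autoregressive setup rather than an additional modeling assumption:

```latex
p_\theta(n, e) \;=\; p_\theta(x) \;=\; \prod_{t=1}^{|x|} p_\theta\!\left(x_t \mid x_{<t}\right),
\qquad x = \mathrm{interleave}(n, e).
```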
We condition the standard symbolic music generation task on expression text annotations, pθ(n | e), learning a model to produce notes given a sequence of expression text tokens. This task could prove useful in film music generation and sound design for games and audiobooks, where generated music must often match the climax, tension, and rhythm of the accompanying media.
Given an existing note sequence n, we learn a model pθ(e | n) that tags the sequence with expression text e (i.e., adding expression text to a planned musical score). This task serves as an intermediate step in the MIDI-to-performance rendering of a musical score: given only a plain MIDI file, a trained model could suggest appropriate expression markings such as legato, crescendo, or dolce, enabling automatic annotation of the extensive collections of unexpressive MIDI data available in large-scale datasets.
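For concreteness, a minimal sketch of how the tagging task can be laid out under prefix conditioning: note tokens (the controls) form the prefix, and the model continues with expression-text tokens (the events). The token strings and the choice to compute loss only on the expression suffix are illustrative assumptions, not the actual MMT vocabulary or necessarily the paper's exact setup.

```python
# Sketch: assembling a prefix-conditioned sequence for the tagging task p_theta(e | n).
# Note tokens act as the control prefix; expression-text tokens are the events
# the model learns to generate. Token strings are purely illustrative.

notes = ["note_beat0_C4", "note_beat0_E4", "note_beat1_G4"]      # control tokens n
expression = ["dynamic_beat0_p", "expression_beat1_crescendo"]   # event tokens e

training_sequence = notes + expression  # teacher-forced target (loss on the expression suffix)
inference_prompt = list(notes)          # at test time, sample e ~ p_theta(e | n) after this prefix
print(training_sequence)
print(inference_prompt)
```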
BibTeX
The official MusPyExpress implementation can be found on the expressive branch of the MusPy repository.
We also make all supporting code (i.e., experiments, tables, and figures) available for reference and reproducibility.
Lastly, we provide extensive documentation for the MusPyExpress extension.
[1] Hao-Wen Dong, Ke Chen, Julian McAuley, and Taylor Berg-Kirkpatrick. MusPy: A toolkit for symbolic music generation. In Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR), 2020.
[2] Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, and Julian McAuley. PDMX: A large-scale public domain MusicXML dataset for symbolic music processing. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
[3] Weihan Xu, Julian McAuley, Taylor Berg-Kirkpatrick, Shlomo Dubnov, and Hao-Wen Dong. Generating symbolic music from natural language prompts using an LLM-enhanced dataset. arXiv preprint arXiv:2410.02084, 2024.
[4] Hao-Wen Dong, Ke Chen, Shlomo Dubnov, Julian McAuley, and Taylor Berg-Kirkpatrick. Multitrack music transformer. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
[5] John Thickstun, David Hall, Chris Donahue, and Percy Liang. Anticipatory music transformer. arXiv preprint arXiv:2306.08620, 2023.