Publications

See my Google Scholar profile for more information.


MusPyExpress: Extending MusPy with Enhanced Expression Text Support

Phillip Long, Hao-Wen Dong, Julian McAuley, Zachary Novack
NeurIPS 2025 Workshop on AI for Music

Created MusPyExpress, an extension to the popular symbolic music processing library MusPy that better extracts expression text, the symbolic annotations common in Western sheet music. Expression text is usually omitted by the dominant MIDI standard but is abundant in MusicXML (and MusicXML-adjacent) formats. It often goes unused because current symbolic music processing libraries struggle, or even fail, to parse it. By integrating expression text into the MusPy ecosystem, MusPyExpress enables the extraction of expression text along with symbolic music for downstream modeling.

MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding

Jingyue Huang, Zachary Novack, Phillip Long, Yupeng Hou, Ke Chen, Taylor Berg-Kirkpatrick, Julian McAuley
arXiv Preprint, https://arxiv.org/abs/2510.16273

Proposed a new tokenization method for symbolic music that applies a residual vector-quantized variational autoencoder (RQ-VAE) to bar-wise music segments within a Transformer-based encoder-decoder framework. The resulting music codes achieve high-fidelity music reconstruction and an accurate understanding of music theory. Evaluated the semantic understanding of the framework at the note, bar, and song levels by training classifiers on MuseTok's encoded latents for three music information retrieval tasks: melody extraction, chord detection, and emotion recognition.
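
As a rough illustration of the residual quantization idea behind an RQ-VAE, the sketch below quantizes a single latent vector with a stack of codebooks, where each level encodes whatever the previous levels left over. It is a minimal NumPy sketch under stated assumptions, not MuseTok's implementation: the function name `residual_quantize`, the toy codebooks, and the example vector are all hypothetical, and the learned encoder, decoder, and training losses are left out entirely.

```python
import numpy as np


def residual_quantize(x, codebooks):
    """Quantize vector x with a stack of codebooks, one per residual level.

    Hypothetical helper for illustration only. Each codebook is an array of
    shape (num_entries, dim); x is an array of shape (dim,).
    """
    residual = x.astype(float)
    codes = []
    reconstruction = np.zeros_like(residual)
    for codebook in codebooks:
        # Pick the codebook entry nearest to the current residual.
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))
        codes.append(index)
        # Accumulate the chosen entry and pass the leftover error on.
        reconstruction += codebook[index]
        residual = residual - codebook[index]
    return codes, reconstruction


# Toy example: two levels, two entries each, in a 2-D latent space.
codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),
    np.array([[0.0, 0.0], [0.25, -0.25]]),
]
codes, approx = residual_quantize(np.array([1.2, 0.8]), codebooks)
```

In this toy setup, the list of indices in `codes` plays the role of the discrete tokens for one segment, and each additional level shrinks the gap between `approx` and the input.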

PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, Julian McAuley
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2025, https://arxiv.org/abs/2409.10831

Assembled the largest known copyright-free MusicXML dataset, PDMX, consisting of over 250,000 sheet music pieces for modeling symbolic music. Demonstrated the utility of PDMX's deduplication and rating data as a means of data distillation by training decoder-only transformers to generate symbolic music from four differently filtered subsets of PDMX. PDMX also features an abundance of fine-grained performance directives, which could be harnessed in future work as expression-text controls or natural language captions.

The utility of a closed breeding colony of Peromyscus leucopus for dissecting complex traits

Phillip Long, Vanessa J Cook, Arundhati Majumder, Alan G Barbour, Anthony D Long
Genetics, Volume 221, Issue 1, May 2022, iyac026, https://doi.org/10.1093/genetics/iyac026

Used R, Python, and Bash scripts on UC Irvine's HPC3 cluster to analyze terabytes of genetic data collected from Peromyscus leucopus, the primary reservoir for Lyme disease. Tested a new computational framework for imputing genetic data in closed breeding colonies. Created figures and tables summarizing datasets for the paper.