This website accompanies the NeurIPS 2023 Machine Learning for Audio Workshop paper entitled “InstrumentGen: Generating Sample-Based Musical Instruments From Text.” Here, we provide listening examples of the instruments generated by the model described in the paper.
Abstract
We introduce the text-to-instrument task, which aims at generating sample-based musical instruments based on textual prompts. Accordingly, we propose InstrumentGen, a model that extends a text-prompted generative audio framework to condition on instrument family, source type, pitch (across an 88-key spectrum), velocity, and a joint text/audio embedding. Furthermore, we present a differentiable loss function to evaluate the intra-instrument timbral consistency of sample-based instruments. Our results establish a foundational baseline, paving the way for further research in the domain of automatic sample-based instrument generation.
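To give a feel for the pitch and velocity conditioning space mentioned in the abstract, here is a minimal sketch rather than the paper's implementation. It assumes the standard 88-key piano mapping (MIDI notes 21 to 108) and 12-tone equal temperament with A4 = 440 Hz. The names `midi_to_hz`, `PIANO_KEYS`, `conditioning_grid`, and the velocity layers are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only. Assumptions not taken from the paper:
#   - the 88 keys correspond to MIDI notes 21 (A0) through 108 (C8)
#   - equal temperament with A4 (MIDI 69) tuned to 440 Hz
#   - the velocity layers below are arbitrary example values

A4_MIDI = 69
A4_HZ = 440.0


def midi_to_hz(note: int) -> float:
    """Convert a MIDI note number to its fundamental frequency in Hz."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12)


# All 88 piano keys as MIDI note numbers.
PIANO_KEYS = list(range(21, 109))


def conditioning_grid(pitches, velocities):
    """Enumerate every (pitch, velocity) pair a sample-based instrument would cover."""
    return [(p, v) for p in pitches for v in velocities]


if __name__ == "__main__":
    velocities = [32, 64, 96, 127]  # hypothetical velocity layers
    grid = conditioning_grid(PIANO_KEYS, velocities)
    print(f"{len(PIANO_KEYS)} keys x {len(velocities)} velocities = {len(grid)} samples")
    print(f"Lowest key: {midi_to_hz(PIANO_KEYS[0]):.2f} Hz")
    print(f"Highest key: {midi_to_hz(PIANO_KEYS[-1]):.2f} Hz")
```

Each pair in the grid corresponds to one sample that a complete instrument would need, which is why timbral consistency across the samples of a single instrument matters.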
Text-to-Instrument
Prompt | Generated audio (2 octaves) |
dark concert grand piano | |
bright upright piano | |
deep and punchy sub bass | |
distorted synth bass | |
distorted electric guitar lead | |
bright acoustic guitar | |
aggressive synth lead | |
hammond organ | |
warm cello | |
silky violin | |
Advanced Descriptive Prompts/Limitations
The following examples show how the model responds to more detailed descriptive prompts, illustrating both its capabilities and its current limitations.
Prompt | Generated audio (2 octaves) |
A string ensemble characterized by high harmonics, light bowing, and sparse vibrato, yielding an airy, floating tonal quality. | |
A bass synth with a distorted sawtooth waveform and high resonance, delivering a gritty, aggressive sonic texture. | |
Staccato piano notes augmented with a synthetic overlay and digital delay, producing a crisp, rhythmically precise tonal effect. | |
Sample-to-Instrument
Although it is not the central focus of our paper, our system also supports sample-to-instrument generation, in which a musical instrument is generated from a single reference audio sample. We demonstrate this using an out-of-domain audio sample as the prompt.
Prompt | Generated audio (2 octaves) |