This website accompanies the NeurIPS 2023 Machine Learning for Audio Workshop paper entitled “InstrumentGen: Generating Sample-Based Musical Instruments From Text.” Here, we provide listening examples of the instruments generated by the model described in the paper.

Abstract

We introduce the text-to-instrument task, which aims at generating sample-based musical instruments based on textual prompts. Accordingly, we propose InstrumentGen, a model that extends a text-prompted generative audio framework to condition on instrument family, source type, pitch (across an 88-key spectrum), velocity, and a joint text/audio embedding. Furthermore, we present a differentiable loss function to evaluate the intra-instrument timbral consistency of sample-based instruments. Our results establish a foundational baseline, paving the way for further research in the domain of automatic sample-based instrument generation.
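To make the scale of the task concrete, the sketch below enumerates the pitch/velocity grid that a single sample-based instrument spans on an 88-key range. It is only an illustration, not the paper's implementation: the `SampleRequest` fields, the `instrument_grid` helper, and the four velocity layers are assumptions, and the instrument family, source type, and joint text/audio embedding that the model also conditions on are left out.

```python
from dataclasses import dataclass
from itertools import product

# Standard 88-key piano range in MIDI note numbers: A0 (21) through C8 (108).
LOWEST_KEY, HIGHEST_KEY = 21, 108


@dataclass(frozen=True)
class SampleRequest:
    """One sample to generate; the field names are hypothetical."""
    prompt: str
    midi_pitch: int
    midi_velocity: int


def pitch_to_hz(midi_pitch: int) -> float:
    """Equal-tempered frequency with A4 (MIDI note 69) tuned to 440 Hz."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12)


def instrument_grid(prompt, velocities=(32, 64, 96, 127)):
    """Enumerate every (pitch, velocity) pair for one prompt.

    The default velocity layers are an assumption made for illustration.
    """
    pitches = range(LOWEST_KEY, HIGHEST_KEY + 1)
    return [SampleRequest(prompt, p, v) for p, v in product(pitches, velocities)]


grid = instrument_grid("warm cello")
print(len(grid))                  # 88 pitches x 4 velocity layers = 352
print(round(pitch_to_hz(69), 1))  # 440.0
```

Every entry in such a grid has to sound like the same instrument, which is the property the intra-instrument timbral consistency loss in the abstract is meant to evaluate.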

Text-to-Instrument

Prompt	Generated audio (2 octaves)
dark concert grand piano
bright upright piano
deep and punchy sub bass
distorted synth bass
distorted electric guitar lead
bright acoustic guitar
aggressive synth lead
hammond organ
warm cello
silky violin

Advanced Descriptive Prompts/Limitations

The following examples show how the model responds to longer, more detailed descriptive prompts, including cases that expose its current limitations.

Prompt	Generated audio (2 octaves)
A string ensemble characterized by high harmonics, light bowing, and sparse vibrato, yielding an airy, floating tonal quality.
A bass synth with a distorted sawtooth waveform and high resonance, delivering a gritty, aggressive sonic texture.
Staccato piano notes augmented with a synthetic overlay and digital delay, producing a crisp, rhythmically precise tonal effect.

Sample-to-Instrument

Although it is not the central focus of our paper, our system inherently supports sample-to-instrument generation, in which a musical instrument is generated from a single reference audio sample. We illustrate this with an out-of-domain audio sample used as the prompt.

Prompt	Generated audio (2 octaves)