Audio Diffusion Model for Improving Classification Performance

In this project, we used an audio diffusion model to generate new data points that improve audio classification performance.

For our classification experiment, we selected the popular UrbanSound8K dataset. UrbanSound8K is an audio dataset containing 8732 labeled sound excerpts (<= 4 s) of urban sounds from 10 classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, and street music. The classes are drawn from the urban sound taxonomy, and all excerpts are taken from field recordings uploaded to www.freesound.org. We split the dataset 90:10 into training and validation sets, as sketched below.
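The following is a minimal sketch of that split. It assumes the standard UrbanSound8K layout (metadata/UrbanSound8K.csv with slice_file_name, fold, and classID columns); the repository's own split logic may differ.

```python
# Hedged sketch: stratified 90:10 train/validation split of UrbanSound8K.
# Paths and the use of sklearn here are assumptions, not the repo's exact code.
import pandas as pd
from sklearn.model_selection import train_test_split

meta = pd.read_csv("UrbanSound8K/metadata/UrbanSound8K.csv")
meta["path"] = (
    "UrbanSound8K/audio/fold" + meta["fold"].astype(str) + "/" + meta["slice_file_name"]
)

train_df, val_df = train_test_split(
    meta, test_size=0.10, stratify=meta["classID"], random_state=42
)
print(len(train_df), len(val_df))  # roughly 7858 training and 874 validation clips
```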

Our classification model was inspired by Valerio Velardo's PyTorch for Audio series and modified to match our needs. It consists of 3 convolution blocks followed by adaptive average pooling and two fully connected layers for 10-class classification. All audio files were converted to spectrograms before being passed to the model for prediction. The spectrogram, a visual representation of the spectrum of frequencies of a signal as it varies with time, has been shown to be effective in many audio-processing applications and contains the information needed for accurate classification of audio signals. As a baseline, we trained this CNN on the original audio and reported the best classification accuracy on the validation split.
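A hedged sketch of such a model is shown below; the layer widths and mel-spectrogram parameters are illustrative assumptions rather than the repository's exact values.

```python
import torch
import torch.nn as nn
import torchaudio

class AudioCNN(nn.Module):
    """3 conv blocks -> adaptive average pooling -> 2 fully connected layers (10 classes)."""

    def __init__(self, num_classes=10):
        super().__init__()
        # Mel-spectrogram front end; parameter values here are illustrative.
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=22050, n_fft=1024, hop_length=512, n_mels=64
        )

        def block(in_ch, out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        self.features = nn.Sequential(block(1, 16), block(16, 32), block(32, 64))
        self.pool = nn.AdaptiveAvgPool2d((4, 4))
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, num_classes),
        )

    def forward(self, waveform):
        # waveform: (batch, 1, samples) -> spectrogram: (batch, 1, n_mels, frames)
        spec = self.mel(waveform)
        return self.classifier(self.pool(self.features(spec)))
```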

In the second setting, we used audio diffusion models to generate novel training inputs, building on the https://github.com/teticio/audio-diffusion implementation. Instead of prompting with text, we generated new training data by interpolation: we randomly sampled two data points of the same class from the training set, encoded them, computed the average latent representation, and decoded it with the diffusion model's generator. By repeating this procedure, we generated 100 new data points per class for model training. Because the audio diffusion package was trained on Spotify playlists, generating audio that sounds similar to classes such as car horn or jackhammer via text prompting would have been difficult. The interpolation strategy ensures that the decoded signal sounds similar to our original audio while being perturbed enough to serve as a novel data point during training. The implementation was done in PyTorch on an A100 GPU.
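Below is a hedged sketch of the interpolation step. It assumes the AudioDiffusionPipeline API contributed by teticio to diffusers (mel helpers, an encode method for DDIM inversion, and a noise argument on the pipeline call); exact method names, signatures, and output formats may differ across versions, so treat this as an illustration rather than the repository's implementation.

```python
# Hedged sketch: latent-space interpolation of two same-class clips to produce
# a new training example. API details below are assumptions about the
# teticio/audio-diffusion pipeline as shipped in diffusers.
import torch
from diffusers import AudioDiffusionPipeline

pipe = AudioDiffusionPipeline.from_pretrained("teticio/audio-diffusion-256").to("cuda")

def clip_to_image(path):
    # Convert an audio clip into the mel-spectrogram image the pipeline expects.
    pipe.mel.load_audio(path)
    return pipe.mel.audio_slice_to_image(0)

@torch.no_grad()
def interpolate_clips(path_a, path_b, steps=50):
    # Encode two same-class clips into the model's noise space (DDIM inversion).
    noise = pipe.encode([clip_to_image(path_a), clip_to_image(path_b)], steps=steps)
    # Average the two latent representations (the project used a plain mean).
    mixed = noise.mean(dim=0, keepdim=True)
    # Decode the averaged latent back into a spectrogram image and raw audio
    # (output unpacking assumed; adjust to the installed diffusers version).
    images, (sample_rate, audios) = pipe(noise=mixed, steps=steps, return_dict=False)
    return images[0], sample_rate, audios[0]
```

Repeating this sampling-and-decoding loop per class yields the 100 additional data points per class described above.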

Our base model achieved an accuracy of 57.8% on the validation split. Adding the new data points generated with the diffusion model's interpolation strategy increased the accuracy to 62.8%, a gain of 5 percentage points; the model was able to learn new audio signal patterns from the generated audio files. This is one application of diffusion models that we explored as part of this project, and we expect more innovative applications to emerge as research in this field progresses. Diffusion models can generate novel data points and thereby aid deep-learning model training for many downstream tasks.
