This repository contains the code to run the experiments described in our CBMI2024 paper, "Evaluation of Deep Audio Representations for Semantic Sound Similarity".
We provide a flexible framework to:
- Extract deep audio representations from neural networks,
- Process the embeddings:
    - time aggregation (mean, max, median, none),
    - dimensionality reduction with PCA,
    - L2 normalization,
- Perform similarity search:
    - Maximum Inner Product Search (MIPS),
    - Maximum Cosine Similarity Search (MCSS),
    - Nearest Neighbour Search (NNS),
- Evaluate retrieval results:
    - Metrics:
        - Mean Average Precision@N (MAP@N)
        - Mean Rank of the First Relevant Item (MR1)
    - Granularity:
        - Class-wise
        - Sound Family-wise
        - Macro averaged (average of the class-wise metrics)
We use the FSD50K evaluation set for our experiments, but you can use other datasets with minimal changes.
A copy of the paper and the poster are included in the `paper-poster-presentation/` directory.
To set up the environment and install the necessary dependencies, follow these steps:
1. Clone the repository:

    ```bash
    git clone https://github.com/raraz15/semantic-sound-similarity
    cd semantic-sound-similarity
    ```

2. Create a virtual environment (Python 3.10.12 is required):

    ```bash
    python3.10 -m venv sss
    ```

3. Activate the virtual environment:

    ```bash
    source sss/bin/activate
    ```

4. Install the required dependencies:

    ```bash
    pip install -r requirements.txt
    ```
You're now ready to run the project!
Here are the main parts of our pipeline.
To extract embeddings from audio files using a TensorFlowPredict model, you can run the following command:
```
usage: tensorflow_predict-embedding_extractor.py [-h] [-o OUTPUT_DIR] config_path audio_dir

Takes a FSD50K csv file specifying audio file names and computes embeddings using a TensorFlowPredict model. All frame embeddings are exported without aggregation. Currently works with FSD-Sinet, VGGish, YamNet and OpenL3 models.

positional arguments:
  config_path           Path to config.json file of the model. Assumes the model.pb is next to it.
  audio_dir             Path to directory with audio files.

options:
  -h, --help            show this help message and exit
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Path to output directory. (default: )
```
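For example, with a hypothetical YamNet model directory (paths are illustrative; we also assume the script sits under `code/`, as `analysis_listener.py` does):

```bash
python code/tensorflow_predict-embedding_extractor.py \
    models/yamnet/config.json \
    data/FSD50K.eval_audio
```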
If a model is not included in Essentia's TensorFlowPredict, you should run the following command:
```
usage: embedding_extractor.py [-h] [-o OUTPUT_DIR] model_path audio_dir

This script loads a model with a pre-trained checkpoint and extracts clip level embeddings for all the audio files in the FSD50K evaluation dataset.

positional arguments:
  model_path            Path to model.pt checkpoint.
  audio_dir             Path to an audio file or a directory with audio files.

options:
  -h, --help            show this help message and exit
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Path to output directory. Default: data/embeddings/<dataset_name>/<model_name> (default: )
```
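For example, with a hypothetical BEATs checkpoint (paths are illustrative):

```bash
python code/embedding_extractor.py \
    models/BEATs_iter3_plus_AS2M.pt \
    data/FSD50K.eval_audio
```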
For processing the embeddings, you can use:
```
usage: create_clip_level_embedding.py [-h] [-a {mean,median,max,none}] [-N N] [--no-normalization] [--normalization] [--output-dir OUTPUT_DIR] embed_dir

Takes frame level model embeddings and processes them for similarity search. First aggregates frame level embeddings into clip level embeddings, then applies PCA to reduce the dimensions, and finally normalizes the result to unit length.

positional arguments:
  embed_dir             Path to an embedding or a directory containing embedding.json files.

options:
  -h, --help            show this help message and exit
  -a {mean,median,max,none}, --aggregation {mean,median,max,none}
                        Type of embedding aggregation. (default: mean)
  -N N                  Number of PCA components to keep. Use -1 to skip PCA. (default: 100)
  --no-normalization    Do not normalize the final clip embedding. (default: False)
  --normalization       Normalize the final clip embedding. (default: False)
  --output-dir OUTPUT_DIR, -o OUTPUT_DIR
                        Path to output directory. If not provided, a directory will be created in the same directory as the embed_dir. (default: )
```
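For example, to average the frame embeddings over time, keep 100 PCA components, and L2-normalize the result (the embedding directory below is illustrative):

```bash
python code/create_clip_level_embedding.py \
    -a mean -N 100 --normalization \
    data/embeddings/FSD50K.eval_audio/yamnet
```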
After the embeddings are extracted and processed, you can run the similarity search.
```
usage: similarity_search.py [-h] [-s {dot,nn}] [-N N] [--ground-truth GROUND_TRUTH] [--output-dir OUTPUT_DIR] embed_dir

In a corpus of sound embeddings, takes each sound as a query and searches for similar sounds using user defined strategies.

positional arguments:
  embed_dir             Directory containing embedding.json files. Embeddings should be prepared with create_clip_level_embedding.py.

options:
  -h, --help            show this help message and exit
  -s {dot,nn}, --search {dot,nn}
                        Type of similarity search algorithm. (default: nn)
  -N N                  Number of search results per query to save. (default: 15)
  --ground-truth GROUND_TRUTH
                        Path to the ground truth CSV file. You can provide a subset of the ground truth by filtering the CSV file before passing it to this script. (default: None)
  --output-dir OUTPUT_DIR, -o OUTPUT_DIR
                        Root output directory. Default: data/similarity_rankings/<dataset_name>/<model_name>/<search_type>/ (default: data/similarity_rankings)
```
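For example, to run a nearest neighbour search and save the top 15 results per query (the directory name follows the processed-embedding naming convention and is illustrative):

```bash
python code/similarity_search.py \
    -s nn -N 15 \
    data/embeddings/FSD50K.eval_audio/yamnet-Agg_mean-PCA_100-Norm_True
```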
Now that the query results are obtained, you can evaluate the retrieval results.
```
usage: evaluate_map_at_n.py [-h] [--metrics METRICS [METRICS ...]] [-N N] [--families-json FAMILIES_JSON] [--output-dir OUTPUT_DIR] results_path ground_truth

Compute evaluation metrics for the similarity search result of an embedding over FSD50K.eval_audio.

positional arguments:
  results_path          Path to similarity_results.json file.
  ground_truth          Path to the ground truth CSV file. You can provide a subset of the ground truth by filtering the CSV file before passing it to this script.

options:
  -h, --help            show this help message and exit
  --metrics METRICS [METRICS ...]
                        Metrics to calculate. (default: ['micro_map@n', 'macro_map@n'])
  -N N                  Cutoff rank. (default: 15)
  --families-json FAMILIES_JSON
                        Path to the JSON file containing the family information of the FSD50K Taxonomy. You can also provide the family information from an ontology. (default: None)
  --output-dir OUTPUT_DIR, -o OUTPUT_DIR
                        Root output directory. Default: data/evaluation_results/<dataset_name>/<model_name>/<search_type>/ (default: data/evaluation_results)
```
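For example, to compute the default micro- and macro-averaged MAP@15 (both paths are illustrative; use the same ground truth CSV as in the search step):

```bash
python code/evaluate_map_at_n.py \
    -N 15 \
    data/similarity_rankings/FSD50K.eval_audio/yamnet-Agg_mean-PCA_100-Norm_True/nn/similarity_results.json \
    data/ground_truth/FSD50K.eval.csv
```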
To evaluate MR1, use:

```
usage: evaluate_mr1.py [-h] [-s {dot,nn}] [--metrics METRICS [METRICS ...]] [--families-json FAMILIES_JSON] [--output-dir OUTPUT_DIR] embed_dir ground_truth

Since the calculation of R1 requires the complete ranking for each query, instead of first running similarity_search.py with N=-1 and storing the results, this script performs the similarity search with N=-1 without saving the results and then calculates R1 for each query.

positional arguments:
  embed_dir             Directory containing embedding.json files. Embeddings should be prepared with create_clip_level_embedding.py.
  ground_truth          Path to the ground truth CSV file. You can provide a subset of the ground truth by filtering the CSV file before passing it to this script.

options:
  -h, --help            show this help message and exit
  -s {dot,nn}, --search {dot,nn}
                        Type of similarity search algorithm. (default: nn)
  --metrics METRICS [METRICS ...]
                        Metrics to calculate. (default: ['micro_mr1', 'macro_mr1'])
  --families-json FAMILIES_JSON
                        Path to the JSON file containing the family information of the FSD50K Taxonomy. You can also provide the family information from an ontology. (default: data/ontology/taxonomy/FSD50K_taxonomy-families.json)
  --output-dir OUTPUT_DIR
                        Root output directory. Default: data/evaluation_results/<dataset_name>/<model_name>/<search_type>/ (default: data/evaluation_results)
```
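For example (again with illustrative paths), note that evaluate_mr1.py starts from the processed embeddings rather than from saved rankings:

```bash
python code/evaluate_mr1.py \
    -s nn \
    data/embeddings/FSD50K.eval_audio/yamnet-Agg_mean-PCA_100-Norm_True \
    data/ground_truth/FSD50K.eval.csv
```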
You can also run the bash scripts directly to execute the full pipeline.
First, you need to extract embeddings with a model once, following the Embedding Extraction section above. Then you can define the embedding processing and similarity search parameters inside the following bash script. It will create a grid of parameters and run the full pipeline for each combination.
```bash
scripts/experiments/clap_2023-prepare_search_evaluate_experiments.sh
```
If you have already processed the embeddings and only want to experiment with the search step, you can use the following script.
```bash
scripts/experiments/clap_2023-search_evaluate_experiments.sh
```
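For example, from the repository root:

```bash
bash scripts/experiments/clap_2023-search_evaluate_experiments.sh
```

Each script builds its grid of parameters and runs the search and evaluation steps above for every combination.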
Finally, you can listen to and compare the retrieval results of up to five models with the Streamlit app:

```bash
streamlit run code/analysis_listener.py -- \
    -p0 data/similarity_rankings/FSD50K.eval_audio/fs-essentia-extractor_legacy-PCA_100/nn/similarity_results.json \
    -p1 data/similarity_rankings/FSD50K.eval_audio/BEATs_iter3_plus_AS2M-Agg_mean-PCA_100-Norm_True/nn/similarity_results.json \
    -p2 data/similarity_rankings/FSD50K.eval_audio/CAVMAE-as_46.6-Agg_none-PCA_100-Norm_True/nn/similarity_results.json \
    -p3 data/similarity_rankings/FSD50K.eval_audio/clap-630k-fusion-best-Agg_none-PCA_100-Norm_True/
```
```
usage: analysis_listener.py [-h] [-p0 PATH0] [-p1 PATH1] [-p2 PATH2] [-p3 PATH3] [-p4 PATH4] [-N N] [--gt-path GT_PATH]

Listen to randomly selected target and query sounds from an analysis file.

options:
  -h, --help            show this help message and exit
  -p0 PATH0, --path0 PATH0
                        Similarity Result Path 0. (default: None)
  -p1 PATH1, --path1 PATH1
                        Similarity Result Path 1. (default: None)
  -p2 PATH2, --path2 PATH2
                        Similarity Result Path 2. (default: None)
  -p3 PATH3, --path3 PATH3
                        Similarity Result Path 3. (default: None)
  -p4 PATH4, --path4 PATH4
                        Similarity Result Path 4. (default: None)
  -N N                  Number of top entries to display. (default: 15)
  --gt-path GT_PATH     Path to the ground truth file. Leave empty for auto. (default: None)
```