This repository is a submission for the SISAP 2023 Indexing Challenge by team UTokyo.
- Task: A
- Subset: clip768 (300K, 10M, 30M, 100M)
- Members:
  - Yutaro Oguri (The University of Tokyo, oguri [at] hal.t.u-tokyo.ac.jp)
  - Yusuke Matsui (The University of Tokyo, matsui [at] hal.t.u-tokyo.ac.jp)
The following instructions are for the CI workflow.
- Go to the CI page in the Actions tab.
- Click the `Run workflow` button.
- Select the `main` branch (selected by default).
- Choose the `Size of dataset` you will evaluate (300K, 10M, 30M, 100M). Please execute the workflow for each size (300K, 10M, 30M, 100M) separately and in this order, because the workflows for smaller sizes take less time.
- Fill in the `Ref (commit hash)` input with the commit hash you will evaluate. The commit hash to use in the evaluation is listed on our pre-registration issue.
Each result, including csv files and figures, is saved as an artifact `Results on ${size}`. You can download it from the GitHub Actions page. The plotted figures for sizes 10M, 30M, and 100M may be too large and hard to read, so the final selection of the best-performing configuration should be done by looking at the csv file.
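For convenience, here is a minimal sketch of picking a configuration from the csv. The column names (`recall`, `querytime`) are assumptions about the output of `eval/eval.py`; check the actual header of `res.csv` before relying on them.

```python
# Minimal sketch, assuming res.csv has a "recall" column
# (check the header produced by eval/eval.py; this name is an assumption).
import csv

with open("res.csv") as f:
    rows = list(csv.DictReader(f))

# e.g. highest recall among all recorded configurations
best = max(rows, key=lambda r: float(r["recall"]))
print(best)
```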
Note that query time is measured across all processing that depends on the input query; that is, it includes all pre-processing, searching, and post-processing for the query. If query time should only be measured for the search itself, please contact us and we will provide a script for the correct measurement.
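To illustrate this convention, the sketch below times everything that depends on the query; the three helper functions are hypothetical stand-ins for the corresponding steps in `search/search.py`.

```python
# Hypothetical stand-ins for the query-dependent steps; only these are timed.
import time
import numpy as np

def preprocess(q):           # e.g. PCA-transform the query vectors
    return q

def search(q, k):            # e.g. NSG graph search
    return np.zeros((len(q), k)), np.zeros((len(q), k), dtype="int64")

def postprocess(ids):        # e.g. map internal ids back to dataset ids
    return ids

queries = np.random.rand(100, 768).astype("float32")

t0 = time.time()
q = preprocess(queries)
_, ids = search(q, k=10)
results = postprocess(ids)
elapsed = time.time() - t0   # this whole span counts as query time
print(f"query time: {elapsed:.4f} s")
```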
To use other query sets, change the following section in `search/search.py`. If needed, you can change:
- `public-queries-10k` to another query set file name.
- the base URL `url` to another URL.
```python
def prepare(kind: str, size: Size):
    url = "https://sisap-23-challenge.s3.amazonaws.com/SISAP23-Challenge"
    task = {
        "query": f"{url}/public-queries-10k-{kind}.h5",
        "dataset": f"{url}/laion2B-en-{kind}-n={size}.h5",
    }
    ...
```
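For example, to point at a different query file and host, the section would look like the following. Both the file name and the base URL below are placeholders, not real endpoints.

```python
# Hypothetical example; the file name and base URL are placeholders.
def prepare(kind: str, size: Size):
    url = "https://example.com/my-bucket"                  # custom base url
    task = {
        "query": f"{url}/my-custom-queries-{kind}.h5",     # custom query set
        "dataset": f"{url}/laion2B-en-{kind}-n={size}.h5",
    }
    ...
```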
The following instructions are the same as the CI workflow, but you need to install the dependencies and run the scripts manually on your local environment. Please replace the size 300K with 10M, 30M, or 100M for other subsets.
```bash
conda create -n faiss python=3.10
conda activate faiss
conda install -c pytorch faiss-cpu=1.7.4 mkl=2021 blas=1.0=mkl
conda install matplotlib scikit-learn
pip install h5py
```
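A quick way to confirm the environment is set up correctly (the version shown is the one expected from the install above):

```python
# Sanity check for the conda environment.
import faiss
print(faiss.__version__)  # expected: 1.7.4
```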
```bash
chmod +x ./run_300K.sh
./run_300K.sh
```

or

```bash
python search/search.py -k 10 --size 300K
python eval/eval.py
python eval/plot.py --size 300K res.csv
```
Details will be described in the paper. Here is a short summary (a minimal sketch follows this list):
- Python and the Faiss library are used for the implementation.
- NSG is used as the base index structure. Its implementation is included in Faiss.
- Better entry points for NSG graph traversal are found using k-means clustering.
- Vector dimensionality is reduced with PCA. The target dimension is a SIMD-friendly value.
- AntihubRemoval is used as preprocessing; it reduces the number of data points by removing anti-hub points.
- Hyperparameters are tuned with multi-objective black-box optimization using the Optuna library.
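The following is a minimal, self-contained sketch of these components with Faiss. It is not the actual build code: all dimensions and parameters are placeholders, and the real entry-point logic lives in `search/ep_searcher.py`.

```python
# Minimal sketch (not the actual build code); sizes/parameters are placeholders.
import faiss
import numpy as np

d_in, d_out = 768, 256                                  # PCA to a SIMD-friendly dim
xb = np.random.rand(10000, d_in).astype("float32")      # stand-in for the dataset

# 1) PCA dimensionality reduction
pca = faiss.PCAMatrix(d_in, d_out)
pca.train(xb)
xb_red = pca.apply_py(xb)

# 2) NSG index (implementation included in Faiss); 32 = graph degree R
index = faiss.IndexNSGFlat(d_out, 32)
index.add(xb_red)

# 3) k-means centroids as candidates for better graph-traversal entry points
kmeans = faiss.Kmeans(d_out, 64, niter=20, verbose=False)
kmeans.train(xb_red)
centroids = kmeans.centroids
```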
The hyperparameters to be used are hard-coded in `search/search.py`. There are multiple candidate sets of runtime hyperparameters in the script. Note that the index does not need to be rebuilt for each set, so this is in accordance with the regulations of the submission. The result for each set is saved as a separate record in the same csv file, `res.csv`. Please select the best-performing configuration from the csv file. A rough sketch of this per-candidate evaluation is shown after the table below.
Runtime hyperparameters are:

| Hyperparameter | Description |
|---|---|
| `search_L` | NSG search parameter |
| `ep_search_mode` | Mode of the entry point searcher |
| `threads` | Number of threads for the search section |
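The sketch below illustrates evaluating several candidate sets on one built index without rebuilding it. All values are placeholders, and `ep_search_mode` is handled by the repository's own entry-point searcher, so it is not shown.

```python
# Sketch: evaluate several runtime-parameter candidates without rebuilding.
import faiss
import numpy as np

d = 64
xb = np.random.rand(5000, d).astype("float32")   # placeholder database
xq = np.random.rand(100, d).astype("float32")    # placeholder queries
index = faiss.IndexNSGFlat(d, 16)
index.add(xb)

candidates = [
    {"search_L": 30, "threads": 8},
    {"search_L": 60, "threads": 8},
    {"search_L": 120, "threads": 16},
]

for params in candidates:
    faiss.omp_set_num_threads(params["threads"])  # threads for the search section
    index.nsg.search_L = params["search_L"]       # NSG search parameter
    D, I = index.search(xq, 10)
    # in the real script, one row per candidate is appended to res.csv
```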
- `search/search.py` - Main script for search
- `search/antihub_removal.py` - AntihubRemoval implementation
- `search/ep_searcher.py` - Algorithms to find better entry points for NSG graph traversal using k-means clustering
The following scripts are used for tuning hyperparameters and are not used for evaluation (a minimal sketch of the tuning pattern follows the list):
- `search/analyze_singleobj_opt.py`
- `search/analyze_multiobj_opt.py`
- `search/opt_singleobj_nsg.py`
- `search/opt_multiobj_nsg.py`
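The general pattern in the multi-objective scripts is an Optuna study with two directions (maximize recall, minimize query time). Below is a minimal sketch of that pattern, not the repository's actual objective; `evaluate_search` is a hypothetical stand-in.

```python
# Minimal sketch of multi-objective tuning with Optuna (not the actual
# objective in search/opt_multiobj_nsg.py).
import optuna

def evaluate_search(search_L):
    # Hypothetical stand-in: the real scripts run the NSG search and
    # measure recall / query time on a query set.
    recall = 1.0 - 1.0 / search_L
    query_time = 0.001 * search_L
    return recall, query_time

def objective(trial):
    search_L = trial.suggest_int("search_L", 10, 500)
    recall, query_time = evaluate_search(search_L)
    return recall, query_time          # maximize recall, minimize query time

study = optuna.create_study(directions=["maximize", "minimize"])
study.optimize(objective, n_trials=50)
print(study.best_trials)               # Pareto-optimal trials
```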
The `eval/` directory contains scripts for evaluation from sisap23-laion-challenge-evaluation, included as a submodule.
- AntihubRemoval: K. Tanaka, Y. Matsui, and S. Satoh, "Efficient Nearest Neighbor Search by Removing Anti-hub", ICMR 2021
- NSG: C. Fu, C. Xiang, C. Wang, and D. Cai, "Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph", VLDB 2019
- Faiss: J. Johnson, M. Douze, and H. Jégou, "Billion-scale Similarity Search with GPUs", IEEE Transactions on Big Data, 7(3), 535–547, 2019
- Optuna: T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, "Optuna: A Next-generation Hyperparameter Optimization Framework", KDD, Applied Data Science Track, 2019