From b0b61151fad024f59f185ac26b91b9ef7db65f18 Mon Sep 17 00:00:00 2001
From: Tommie Kerssies <6392002+tommiekerssies@users.noreply.github.com>
Date: Mon, 6 May 2024 15:42:04 +0200
Subject: [PATCH] Update README.md

---
 docs/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/README.md b/docs/README.md
index d54ec3b..4b42a85 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -8,7 +8,7 @@
 **Code**: [GitHub](https://github.com/tue-mps/benchmark-vfm-ss)
 
 ## Abstract
-Recent vision foundation models (VFMs) have demonstrated proficiency in various tasks but require fine-tuning with semantic mask labels for the task of semantic segmentation. Benchmarking their performance is essential for selecting current models and guiding future model developments for this task. The lack of a standardized benchmark complicates comparisons, therefore the primary objective of this paper is to study how VFMs should be benchmarked for semantic segmentation. To do so, various VFMs are fine-tuned under several settings, and the impact of individual settings on the performance ranking and training time is assessed. Based on the results, the recommendation is to fine-tune the ViT-B variants of VFMs with a 16x16 patch size and a linear decoder, as these settings are representative of using a larger model, more advanced decoder and smaller patch size, while reducing training time by more than 13 times. Using multiple datasets for training and evaluation is also recommended, as the performance ranking across datasets and domain shifts varies. Linear probing, a common practice for some VFMs, is not recommended, as it is not representative of end-to-end fine-tuning. The recommended benchmarking setup enables a performance analysis of VFMs for semantic segmentation. The findings of such an analysis reveal that promptable segmentation pretraining is not beneficial, whereas masked image modeling (MIM) with abstract representations appears crucial, even more so than the type of supervision.
+Recent vision foundation models (VFMs) have demonstrated proficiency in various tasks but require supervised fine-tuning to perform the task of semantic segmentation effectively. Benchmarking their performance is essential for selecting current models and guiding future model developments for this task. The lack of a standardized benchmark complicates comparisons. Therefore, the primary objective of this paper is to study how VFMs should be benchmarked for semantic segmentation. To do so, various VFMs are fine-tuned under various settings, and the impact of individual settings on the performance ranking and training time is assessed. Based on the results, the recommendation is to fine-tune the ViT-B variants of VFMs with a 16x16 patch size and a linear decoder, as these settings are representative of using a larger model, more advanced decoder and smaller patch size, while reducing training time by more than 13 times. Using multiple datasets for training and evaluation is also recommended, as the performance ranking across datasets and domain shifts varies. Linear probing, a common practice for some VFMs, is not recommended, as it is not representative of end-to-end fine-tuning. The benchmarking setup recommended in this paper enables a performance analysis of VFMs for semantic segmentation. The findings of such an analysis reveal that pretraining with promptable segmentation is not beneficial, whereas masked image modeling (MIM) with abstract representations is crucial, even more important than the type of supervision used.
 
 ## Citation
 ```