
Combination of multiple modalities #38

Open
anthony-mendil opened this issue Feb 14, 2024 · 7 comments

Comments


anthony-mendil commented Feb 14, 2024

First of all congrats on the paper and thanks for providing the code!

In the paper, under 'Zero-shot language-based multi-modal joint retrieval', you mention that combining the embeddings of multiple modalities improves performance. I am referring specifically to this sentence:

'Similar trends have been observed in other modalities, where each modality has the potential to enhance the performance when combined with other modalities.'

However, the paper does not clarify how the embeddings of the different modalities are actually combined. If, for instance, the input modalities are text, audio, video, and depth, the model would produce an individual embedding for each modality. How do you then combine these embeddings to obtain the results you report?
Do you simply average the different embeddings?

Thanks in advance,
Anthony Mendil.

@LinB203
Member

LinB203 commented Feb 15, 2024

Yes, just average the two modalities' logits.
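For illustration, a minimal numpy sketch of this averaging (not code from the repository; the similarity logits are made up, with rows as queries and columns as candidate texts):

```python
import numpy as np

# Hypothetical similarity logits of two modalities against the same
# set of candidate texts (rows: queries, columns: text candidates).
audio_logits = np.array([[0.9, 0.1, 0.0],
                         [0.2, 0.7, 0.1]])
video_logits = np.array([[0.8, 0.0, 0.2],
                         [0.1, 0.8, 0.1]])

# Average the two modalities' logits, then rank candidates per query.
combined = (audio_logits + video_logits) / 2.0
predictions = combined.argmax(axis=1)
print(predictions.tolist())  # → [0, 1]
```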

@anthony-mendil
Author

Is the code for this available? I cannot seem to locate it in the repository. If not, could you perhaps provide it, for example for the Infrared+RGB -> Text task?

Thanks in advance,
Anthony Mendil.

@anthony-mendil
Author

anthony-mendil commented Mar 4, 2024

And is there a specific reason to average the logits rather than the produced embeddings of the modalities directly? Especially for the retrieval task, no logits are computed if I understand correctly. How would this be done without logits?

@damian0815

I'm not the paper author, but in my experience you can just sum/average regular CLIP model features to achieve similar things. E.g., the text embedding for snow and the text embedding for forest, summed together and then used to search a database of images, returns images of snowy forests. I expect it'd be the same here: just add the output feature vectors (these aren't logits) of the models together.
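That snow+forest behaviour is easy to reproduce with stand-in vectors (only a sketch: random unit vectors play the role of CLIP features, and the "database" is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-ins for the CLIP text features of "snow" and "forest".
snow = normalize(rng.normal(size=64))
forest = normalize(rng.normal(size=64))

# Toy image database: a noisy snowy-forest feature among random distractors.
snowy_forest = normalize(snow + forest + 0.1 * rng.normal(size=64))
database = np.stack([normalize(rng.normal(size=64)),
                     snowy_forest,
                     normalize(rng.normal(size=64))])

# Sum the query features, renormalize, rank by cosine similarity (dot product).
query = normalize(snow + forest)
scores = database @ query
print(int(scores.argmax()))  # index of the snowy-forest entry
```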

@anthony-mendil
Author

anthony-mendil commented Sep 3, 2024

@damian0815 Hey, thanks for the answer. The scenario you describe only works because CLIP uses cosine similarity as its distance metric. That similarity function has a very specific property: the scale of the vectors does not matter, only their angle in high-dimensional space. As a result, adding or averaging produces a vector with an averaged angle that can then be compared against a database. For tasks that do not use cosine similarity, however, this argument does not hold.
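The scale-invariance point can be checked directly (plain numpy, nothing model-specific):

```python
import numpy as np

def cos(a, b):
    # Cosine similarity: dot product of the two vectors after normalization.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([1.0, 2.0, 3.0])
q = np.array([0.5, -1.0, 2.0])

# Cosine similarity ignores rescaling of the inputs...
assert np.isclose(cos(v, q), cos(5.0 * v, q))

# ...but a raw dot product (what a linear layer computes) does not.
print(v @ q, (5.0 * v) @ q)  # → 4.5 22.5
```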

@damian0815

damian0815 commented Sep 18, 2024

@anthony-mendil I would assume, based on my experience working with CLIP and on the understanding that the LanguageBind models are, more or less, CLIP (or CLIP-like) models, that the answer to this question:

If for instance, the input modalities are text, audio, video and depth the model would produce individual embeddings for all of the modalities. How do you then combine these embeddings in order to obtain the results you report?

is literally just:

combined_embedding = torch.nn.functional.normalize(text_embedding + audio_embedding + video_embedding + depth_embedding, dim=-1)

(assuming, as you point out, that the text_embedding etc are normalised already)

What are you doing with the feature vectors that is not cosine similarity? Again, they're not logits and should not be treated as such, unless you're using the hidden states or you've added a classifier layer onto the end or something, in which case I guess you're on your own..?
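Putting that sum-and-renormalize formula into a toy retrieval loop (again only a sketch; random vectors stand in for the encoder outputs, and the variable names are made up):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(1)

# Hypothetical per-modality embeddings of the same query, already unit-norm.
audio_embedding = normalize(rng.normal(size=32))
depth_embedding = normalize(audio_embedding + 0.5 * rng.normal(size=32))

# Candidate text embeddings: the matching caption plus two distractors.
text_db = np.stack([normalize(rng.normal(size=32)),
                    normalize(audio_embedding + depth_embedding),
                    normalize(rng.normal(size=32))])

# Combine the modalities by summing and renormalizing, then rank by cosine.
combined = normalize(audio_embedding + depth_embedding)
best = int((text_db @ combined).argmax())
print(best)  # → 1, the matching caption
```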

@anthony-mendil
Author

@damian0815 I was investigating how this approach could be used in various downstream tasks, which involve adding a classifier on top of the combined embedding space. As you also mention at the end, averaging the embeddings causes undesired behavior in such cases.
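A tiny example of that undesired behaviour (toy numbers, not from any real model): a softmax classifier is sensitive to the norm of its input, so shrinking the combined vector, as averaging partially-cancelling embeddings does, changes the predicted confidence even when the direction is unchanged:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

emb = np.array([3.0, 1.0, 0.0])    # a hypothetical combined embedding
W = np.eye(3)                      # toy classifier weights

# Cosine retrieval would ignore a rescaling of `emb`, but a softmax
# classifier on top does not: confidence drops with the norm.
p_full = softmax(W @ emb)
p_half = softmax(W @ (emb / 2.0))
print(round(p_full.max(), 3), round(p_half.max(), 3))  # → 0.844 0.629
```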
