Clarification questions about the framework #50
You likely cannot just use the embeddings with an arbitrary pre-trained LLM. The idea of LanguageBind is to create a custom set of modality encoders whose embeddings are aligned to a specific set of text embeddings (from the CLIP text encoder, I think).
@lennartmoritz Hi, I wonder whether, during the pre-training process, the authors only use video-language or audio-language pairs for training, or whether they train jointly with audio-video-depth-infrared-language?
They use x-language training pairs, where x denotes any of the supported modalities. So, e.g., video-language, audio-language, depth-language, etc. are all used during training.
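For reference, here is a minimal sketch of extracting and comparing the aligned embeddings, loosely based on the usage example in the LanguageBind README. The class names and imports follow that README, but the exact checkpoint IDs and the asset paths below are assumptions and may differ from the current repo:

```python
import torch
# Imports follow the LanguageBind repo's README; names may differ by version.
from languagebind import LanguageBind, to_device, transform_dict, LanguageBindImageTokenizer

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Each modality gets its own encoder, but all of them are aligned to the same
# CLIP-style text embedding space during pre-training (x-language pairs).
clip_type = {
    'video': 'LanguageBind_Video_FT',  # fully fine-tuned video encoder
    'audio': 'LanguageBind_Audio_FT',
    'depth': 'LanguageBind_Depth',
}
model = LanguageBind(clip_type=clip_type, cache_dir='./cache_dir').to(device).eval()

tokenizer = LanguageBindImageTokenizer.from_pretrained(
    'LanguageBind/LanguageBind_Image', cache_dir='./cache_dir')
modality_transform = {m: transform_dict[m](model.modality_config[m]) for m in clip_type}

# Hypothetical asset paths and caption, for illustration only.
video = ['assets/video/0.mp4']
audio = ['assets/audio/0.wav']
language = ['A dog barking in the rain.']

inputs = {
    'video': to_device(modality_transform['video'](video), device),
    'audio': to_device(modality_transform['audio'](audio), device),
    'language': to_device(tokenizer(language, max_length=77, padding='max_length',
                                    truncation=True, return_tensors='pt'), device),
}

with torch.no_grad():
    emb = model(inputs)

# Because every modality lives in the shared text-aligned space, any x-language
# similarity is just a dot product of the (already normalized) embeddings.
print('video x text:', (emb['video'] @ emb['language'].T).item())
print('audio x text:', (emb['audio'] @ emb['language'].T).item())
```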
Got it. Thank you for your reply.
I'm trying to understand this in the context of other works in the ecosystem. For example, I'm interested in video. For the video encoder, there are LoRA-tuned and fully fine-tuned checkpoints; can I use the embeddings from these models with an already-trained LLM? Can I use these embeddings with Video-LLaVA? Can I use the LanguageBind encoder as a drop-in replacement for the Video-LLaVA encoder (video tower)?
Also, the Gradio demos only show modality comparisons. I'm also trying to understand how to do zero-shot classification. Thank you -- someone who is confused but excited and thankful for the work done.
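On the zero-shot question: with CLIP-style models like this, zero-shot classification is usually done by embedding one text prompt per class and picking the class whose text embedding is most similar to the video embedding. A hedged sketch under that assumption, reusing `model`, `tokenizer`, `modality_transform`, `to_device`, and `device` from the snippet above; the prompt template, class names, and file path are made up:

```python
import torch

# Hypothetical label set and prompt template.
classes = ['playing guitar', 'riding a bike', 'cooking pasta']
prompts = [f'a video of someone {c}' for c in classes]

inputs = {
    'video': to_device(modality_transform['video'](['assets/video/query.mp4']), device),
    'language': to_device(tokenizer(prompts, max_length=77, padding='max_length',
                                    truncation=True, return_tensors='pt'), device),
}

with torch.no_grad():
    emb = model(inputs)

# Similarity of the one video against all class prompts -> class probabilities.
probs = torch.softmax(emb['video'] @ emb['language'].T, dim=-1).squeeze(0)
for c, p in zip(classes, probs.tolist()):
    print(f'{c}: {p:.3f}')
```

The predicted class is simply the prompt with the highest probability; better prompt templates (or averaging several templates per class) typically improve accuracy.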