Use caption to suggest depictions #5422
I don't think splitting the caption would work, because how would we identify which words should be suggested as depictions?
@rohit9625 It is pretty easy using an LLM: even splitting word-by-word would work reasonably well. Prompt for further improvements:
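A minimal sketch of how such a prompt could be built, in Kotlin; the wording is purely illustrative, as the original prompt is not preserved in this thread:

```kotlin
// Hypothetical prompt builder for an on-device LLM. The wording is an
// illustration only, not the prompt from this discussion.
fun buildSplitPrompt(caption: String): String = """
    Split the following image caption into the shortest list of
    expressions that each describe one depictable subject.
    Return one expression per line and nothing else.
    Caption: $caption
""".trimIndent()
```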
Is on-device execution necessary for this use case? The API is available for target API level 21 and higher, and should work if we're not going for Gemini Nano.
On-device execution is necessary. As a privacy policy, we do not send any HTTP request to any server outside of Wikimedia (and Mapbox until recently, but hopefully that will be resolved soon).
Thank you for clarifying! Since captions are publicly available, I hadn't thought of it from this perspective. This seems to call for an on-device LLM, but I'm still exploring 😅
Great link, thanks Ritika! I wonder how many megabytes that would add to the app... if not too many, that's definitely an option.
Hey @nicolas-raoul, I have found a model which we can utilize for NER: https://huggingface.co/dslim/bert-base-NER. They provide a TF model which we can convert to a .tflite model and then implement in the app. I tested it, and I believe the accuracy is good as it provides us with labelled entities.
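For reference, a minimal sketch of running such a converted .tflite model with the TensorFlow Lite Interpreter. The tensor shapes are assumptions, and WordPiece tokenization plus the extra BERT inputs (attention mask, token type ids) are omitted for brevity:

```kotlin
import org.tensorflow.lite.Interpreter
import java.io.File

// Runs a converted .tflite token-classification model on pre-tokenized ids.
// Assumed shapes: [1, seqLen] int ids in, [1, seqLen, 9] logits out
// (9 = the BIO label count of dslim/bert-base-NER). A real BERT model
// also expects attention_mask and token_type_ids inputs, omitted here.
fun tagTokens(modelFile: File, inputIds: IntArray): Array<FloatArray> {
    val interpreter = Interpreter(modelFile)
    val output = Array(1) { Array(inputIds.size) { FloatArray(9) } }
    interpreter.run(arrayOf(inputIds), output)
    interpreter.close()
    return output[0]
}
```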
@nicolas-raoul, I have converted the model to .tflite (~411 MB), which is very large to ship in an Android app (it would increase the app size accordingly). Is it possible to use Google's ML Kit in the app?
@kanahia1 Great research, thanks! I feel like 50 MB should be the maximum... MLKit sounds great! Can it extract keywords? How big is it? 🙂
@nicolas-raoul Sure, it can extract entities such as addresses, date-times, and phone numbers. They have provided examples of input text with the annotated entities. More examples are available here.
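For context, a minimal sketch of ML Kit's on-device Entity Extraction API (the library and calls below are real). Note that it targets entities like addresses, date-times, and phone numbers rather than arbitrary keywords, so species or plant names may not be covered:

```kotlin
import com.google.mlkit.nl.entityextraction.EntityExtraction
import com.google.mlkit.nl.entityextraction.EntityExtractorOptions

// Downloads the English model if needed, then annotates the caption and
// passes the extracted entity strings to the callback.
fun extractEntities(caption: String, onResult: (List<String>) -> Unit) {
    val extractor = EntityExtraction.getClient(
        EntityExtractorOptions.Builder(EntityExtractorOptions.ENGLISH).build()
    )
    extractor.downloadModelIfNeeded()
        .addOnSuccessListener {
            extractor.annotate(caption)
                .addOnSuccessListener { annotations ->
                    onResult(annotations.map { it.annotatedText })
                }
        }
}
```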
@nicolas-raoul If I get enough time by the end of this month, I will try to train a custom model 🙂
We can try running Gemini Nano using AICore, but it would support very few devices.
Yes, this project is very long-term (and not urgent), so even if it currently supports only a few devices, that's OK. Also, development is possible without owning such a device (we will use stubs, and someone else who has such a device will test for real), as in the sketch below.
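A minimal sketch of the stub approach, with hypothetical interface and class names (not from the codebase): the on-device LLM hides behind an interface, and devices without LLM support get a trivial implementation.

```kotlin
// Hypothetical abstraction over the caption-splitting backend.
interface CaptionSplitter {
    suspend fun split(caption: String): List<String>
}

// Stub used on devices (and in tests) without on-device LLM support:
// falls back to naive whitespace tokenization.
class WhitespaceCaptionSplitter : CaptionSplitter {
    override suspend fun split(caption: String): List<String> =
        caption.split(" ").filter { it.isNotBlank() }
}
```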
Fantastic! Since it could be used for more things than just depictions, I mentioned it here: #6143 (comment)
If I caption my picture "Papilio machaon on Asteraceae flower in Croatia", then the app should ideally suggest these depictions:
- Papilio machaon
- Asteraceae
- flower
- Croatia
First step: Parsing the caption into expressions could be delegated to an LLM if one is available locally on the device. That means only the Pixel 8 Pro (or above) and Samsung Galaxy S24 (or above) for now, but hopefully more makers will follow.
On devices where no such technology is available, the app can fall back to more traditional tokenization, maybe even splitting on each space character, or just skip this feature entirely.
Second step: Search for 1 or 2 Wikidata items that match each expression found in the first step, as in the sketch below.
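For illustration, this step could use Wikidata's wbsearchentities API (a real endpoint); the plain-Java networking below is just a sketch, and the app's actual HTTP stack would differ:

```kotlin
import java.net.HttpURLConnection
import java.net.URL
import java.net.URLEncoder

// Searches Wikidata for items matching one expression and returns the raw
// JSON response. Must be called off the main thread on Android.
fun searchWikidata(expression: String, limit: Int = 2): String {
    val query = URLEncoder.encode(expression, "UTF-8")
    val url = URL(
        "https://www.wikidata.org/w/api.php" +
            "?action=wbsearchentities&search=$query" +
            "&language=en&limit=$limit&format=json"
    )
    val conn = url.openConnection() as HttpURLConnection
    return conn.inputStream.bufferedReader().use { it.readText() }
}
```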
Third step: Add these suggestions to the list of all suggestions.