Use caption to suggest depictions #5422

Open
nicolas-raoul opened this issue Jan 8, 2024 · 13 comments
Labels: enhancement gsoc Google Summer of Code

Comments

@nicolas-raoul
Member

nicolas-raoul commented Jan 8, 2024

If I caption my picture "Papilio machaon on Asteraceae flower in Croatia", then the app should ideally suggest these depictions:

  • papilio machaon
  • asteraceae
  • flower
  • Croatia

First step: Parse the caption into expressions. This could be delegated to an LLM if one is available locally on the device. For now that means only Pixel 8 Pro (or above) and Samsung Galaxy S24 (or above), but hopefully more manufacturers will follow.
On devices where no such technology is available, the app can fall back to more traditional tokenization, maybe even splitting on each space character, or just skip this step entirely.

Second step: Search for 1 or 2 Wikidata items that match each expression found in the first step.

Third step: Add these suggestions to the list of all suggestions.
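
The three steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than the app's own code: `split_caption`, `wikidata_search_url`, `suggest_depictions` and the stop-word list are hypothetical names and data introduced here. Step one uses the naive split-on-spaces fallback, step two only builds the URL of a Wikidata `wbsearchentities` query (no request is sent), and the actual lookup is passed in as a function so that it can be stubbed.

```python
from urllib.parse import urlencode

# Hypothetical, deliberately tiny stop-word list for the fallback tokenizer.
STOP_WORDS = {"a", "an", "the", "on", "in", "of", "at", "and"}


def split_caption(caption):
    """Step 1 (fallback): split on whitespace and drop stop words."""
    words = [w.strip(".,;:!?\"'()") for w in caption.split()]
    return [w for w in words if w and w.lower() not in STOP_WORDS]


def wikidata_search_url(expression, language="en", limit=2):
    """Step 2: build a wbsearchentities query URL for one expression."""
    params = {
        "action": "wbsearchentities",
        "search": expression,
        "language": language,
        "limit": limit,  # "1 or 2 Wikidata items" per expression
        "format": "json",
    }
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)


def suggest_depictions(caption, search, existing):
    """Step 3: append new matches to the existing suggestion list.

    `search` is any callable mapping an expression to a list of items;
    it is an assumption of this sketch, not an existing app function.
    """
    suggestions = list(existing)
    for expression in split_caption(caption):
        for item in search(expression):
            if item not in suggestions:
                suggestions.append(item)
    return suggestions
```

With this fallback, "Papilio machaon" is split into two separate tokens, which is one reason an LLM-based parser would be preferable where available.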

@rohit9625
Contributor

I don't think splitting the caption would work, because how do we identify which words need to be shown in depictions?

@nicolas-raoul
Member Author

@rohit9625 It is pretty easy using an LLM:
Screenshot_20240109-075634.png

Even splitting word-by-word would work reasonably well.

Prompt for further improvements:

You are a syntax parser that understands all human languages. For each input sentence, extract all nouns from the sentence. Include each adjective with its noun. Do not output anything but the list of nouns.

Example: "Papilio machaon on Asteraceae flower in Croatia"
 - papilio machaon
 - asteraceae
 - flower
 - Croatia

Now do the same for: "Oxcart passing on a wooden bridge in Papua New Guinea"
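
A minimal sketch of how the prompt above could be filled in for an arbitrary caption and how the reply could be turned back into a list of expressions. The names `build_prompt` and `parse_expressions` are hypothetical, and the sketch assumes the model answers with one expression per line, prefixed by a dash or bullet as in the example; the call to the model itself is left out.

```python
# The prompt text is taken from the comment above; only {caption} varies.
PROMPT_TEMPLATE = (
    "You are a syntax parser that understands all human languages. "
    "For each input sentence, extract all nouns from the sentence. "
    "Include each adjective with its noun. "
    "Do not output anything but the list of nouns.\n\n"
    'Example: "Papilio machaon on Asteraceae flower in Croatia"\n'
    " - papilio machaon\n"
    " - asteraceae\n"
    " - flower\n"
    " - Croatia\n\n"
    'Now do the same for: "{caption}"'
)


def build_prompt(caption):
    """Insert the user's caption into the few-shot prompt."""
    return PROMPT_TEMPLATE.format(caption=caption)


def parse_expressions(reply):
    """Keep only lines that look like list items; ignore any other text."""
    expressions = []
    for line in reply.splitlines():
        text = line.strip()
        if text[:1] in ("-", "*", "•"):
            text = text[1:].strip()
            if text:
                expressions.append(text)
    return expressions
```

Ignoring non-list lines is a defensive choice: even when told to output only the list, a model may add a preamble.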

@RitikaPahwa4444
Collaborator

> That means only on Pixel 8 Pro for now it seems

Is on-device execution necessary for this use case? The API is available for target API level 21 and higher, and should work if we're not going for Gemini Nano.

@nicolas-raoul
Member Author

> Is on-device execution necessary for this use case?

On-device execution is necessary. Under our privacy policy, we do not send any HTTP request to any server outside of Wikimedia (and Mapbox until recently, but hopefully this should be resolved soon).

Details: https://github.com/commons-app/commons-app-documentation/blob/master/android/Privacy-policy.md#privacy-policy

@RitikaPahwa4444
Collaborator

RitikaPahwa4444 commented Jan 9, 2024

Thank you for clarifying! Since captions are publicly available, I'd not thought of it from this perspective. This seems like an on-device LLM, but I'm still exploring😅.

@nicolas-raoul
Member Author

Great link, thanks Ritika! I wonder how many megabytes that would add to the app... If not too many, that's definitely an option.

@kanahia1
Contributor

kanahia1 commented Mar 3, 2024

Hey @nicolas-raoul, I have found a model which we can use for NER: https://huggingface.co/dslim/bert-base-NER. They provide a TensorFlow model which we can convert to a .tflite model and then integrate into the app.

I tested it with "Papilio machaon on Asteraceae flower in Croatia"; these were the results:

[
  {
    "entity_group": "PER",
    "score": 0.4467669427394867,
    "word": "Pa",
    "start": 0,
    "end": 2
  },
  {
    "entity_group": "PER",
    "score": 0.8728312849998474,
    "word": "mac",
    "start": 8,
    "end": 11
  },
  {
    "entity_group": "MISC",
    "score": 0.9185755848884583,
    "word": "Asteraceae",
    "start": 19,
    "end": 29
  },
  {
    "entity_group": "LOC",
    "score": 0.9997634291648865,
    "word": "Croatia",
    "start": 40,
    "end": 47
  }
]

I believe the accuracy is good, as it provides us with Asteraceae and Croatia.
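
As a rough sketch of how such NER output could be post-processed before searching Wikidata, the snippet below drops entities that are low-confidence or that cover only part of a word (such as "Pa" and "mac" above). The helper names and the 0.5 threshold are assumptions made for illustration, not part of the model or of the app.

```python
def is_whole_word(caption, start, end):
    """True if caption[start:end] is not cut out of a longer word."""
    before_ok = start == 0 or not caption[start - 1].isalnum()
    after_ok = end == len(caption) or not caption[end].isalnum()
    return before_ok and after_ok


def usable_entities(caption, entities, min_score=0.5):
    """Keep confident entities whose span aligns with word boundaries."""
    return [
        e["word"]
        for e in entities
        if e["score"] >= min_score
        and is_whole_word(caption, e["start"], e["end"])
    ]


caption = "Papilio machaon on Asteraceae flower in Croatia"
# The entities reported in the comment above.
entities = [
    {"entity_group": "PER", "score": 0.4467669427394867,
     "word": "Pa", "start": 0, "end": 2},
    {"entity_group": "PER", "score": 0.8728312849998474,
     "word": "mac", "start": 8, "end": 11},
    {"entity_group": "MISC", "score": 0.9185755848884583,
     "word": "Asteraceae", "start": 19, "end": 29},
    {"entity_group": "LOC", "score": 0.9997634291648865,
     "word": "Croatia", "start": 40, "end": 47},
]

print(usable_entities(caption, entities))  # → ['Asteraceae', 'Croatia']
```

The word-boundary check matters more than the score here: "mac" has a score of about 0.87 and would pass any reasonable threshold, but it is only a fragment of "machaon".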

@kanahia1
Contributor

kanahia1 commented Mar 4, 2024

@nicolas-raoul, I have converted the model to .tflite. It is ~411 MB, which is very large for use in an Android app (it would increase the app size).

Could we use MLKit by Google in the app?
https://github.com/googlesamples/mlkit/tree/master/android/entityextraction

@nicolas-raoul
Member Author

@kanahia1 Great research, thanks!

I feel like 50 MB should be the maximum...

MLKit sounds great! Can it extract keywords? How big is it? 🙂

@kanahia1
Contributor

kanahia1 commented Mar 4, 2024

@nicolas-raoul Sure, it can:
https://github.com/googlesamples/mlkit/tree/master/android/entityextraction?rgh-link-date=2024-03-04T06%3A00%3A08Z
Entity extraction has an app size impact of up to ~5.6MB.

They have provided examples:

Input text: Meet me at 1600 Amphitheatre Parkway, Mountain View, CA, 94043 Let’s organize a meeting to discuss.
Output text: 1600 Amphitheatre Parkway, Mountain View, CA 94043

More examples are available here

@kanahia1
Contributor

kanahia1 commented Mar 4, 2024

@nicolas-raoul
I tested MLKit, and the results were not satisfactory. I think it would be better to drop the idea of using MLKit until Google makes it more effective, since it is currently in beta.

Test 1 
Input: Dust storm approaching Stratford, Texas
Output: There are no entities detected

Test 2
Input: Aerial view of the city of Mayfield on December 12.
Output: December 12.

Test 3
Input: Interior view of the main nave of the Cathedral in Narbonne, France. The Roman Catholic church is dedicated to Saints Justus and Pastor and it begun its construction in 1272. The choir was finished in 1332, but the rest of the gothic building was never completed, as the result of many factors including sudden changes in the economic status of Narbonne, its unusual size and geographical location (to complete it would have meant demolishing the city wall) and financial constraints.
Output: There are no entities detected

Test 4
Input: Mt Technical from Lewis tops, Lewis Pass Scenic Reserve, New Zealand
Output: There are no entities detected

If I get enough time by the end of this month, I will try to train a custom model 🙂

@Thejas775

Thejas775 commented Jan 22, 2025

> If I caption my picture "Papilio machaon on Asteraceae flower in Croatia", then the app should ideally suggest these depictions:
>
>   • papilio machaon
>   • asteraceae
>   • flower
>   • Croatia
>
> First step: Parse the caption into expressions. This could be delegated to an LLM if one is available locally on the device. For now that means only Pixel 8 Pro (or above) and Samsung Galaxy S24 (or above), but hopefully more manufacturers will follow. On devices where no such technology is available, the app can fall back to more traditional tokenization, maybe even splitting on each space character, or just skip this step entirely.
>
> Second step: Search for 1 or 2 Wikidata items that match each expression found in the first step.
>
> Third step: Add these suggestions to the list of all suggestions.

We can try running Gemini Nano using AICore, but it would support very few devices.
I also found this: https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/android
Maybe this will work. What do you think?

@nicolas-raoul
Member Author

> AICore but it would support very few devices

Yes, this project is very long-term (and not urgent), so even if it currently supports only a few devices that's OK. Also, development is possible without owning such a device (we will use stubs, and someone else who has such a device will test for real).
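
One way the stub-based development mentioned above might look, sketched in Python for brevity. Every name here (`CaptionParser`, `StubLlmParser`, `WhitespaceParser`, `choose_parser`) is hypothetical and not part of the app or of AICore: the idea is only that the rest of the code depends on a small interface, so a canned stub can stand in for the on-device LLM until someone with a supported device tests the real thing.

```python
from typing import Dict, List, Optional, Protocol


class CaptionParser(Protocol):
    """Anything that can turn a caption into a list of expressions."""

    def parse(self, caption: str) -> List[str]: ...


class StubLlmParser:
    """Stands in for an on-device LLM by returning canned answers."""

    def __init__(self, canned: Dict[str, List[str]]):
        self._canned = dict(canned)

    def parse(self, caption: str) -> List[str]:
        return list(self._canned.get(caption, []))


class WhitespaceParser:
    """Fallback for devices without a local LLM: split on whitespace."""

    def parse(self, caption: str) -> List[str]:
        return caption.split()


def choose_parser(llm_parser: Optional[CaptionParser],
                  fallback: CaptionParser) -> CaptionParser:
    """Prefer the LLM-backed parser when the device provides one."""
    return llm_parser if llm_parser is not None else fallback
```

A test could then pass `StubLlmParser({"Papilio machaon on Asteraceae flower in Croatia": ["papilio machaon", "asteraceae", "flower", "Croatia"]})` to exercise the LLM code path on any machine.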

> Well I also found this : https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/android

Fantastic! Since it could be used for more things than just depictions, I mentioned it here: #6143 (comment)
