Use caption to suggest depictions #5422

nicolas-raoul · 2024-01-08T09:03:14Z

If I caption my picture "Papilio machaon on Asteraceae flower in Croatia", then the app should ideally suggest these depictions:

papilio machaon
asteraceae
flower
Croatia

First step: Parse the caption into expressions. This could be delegated to an LLM if one is available locally on the device. For now that means only the Pixel 8 Pro (or above) and the Samsung Galaxy S24 (or above), but hopefully more manufacturers will follow.
On devices where no such technology is available, the app can fall back to more traditional tokenization, perhaps even splitting on each space character, or skip this step entirely.
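A minimal sketch of the fallback tokenization mentioned above, for illustration only. The function name and the stop-word list are assumptions of this sketch, not part of the app; the idea in the issue is simply to split on spaces.

```python
# Hypothetical stop-word list (an assumption of this sketch);
# a real one would depend on the caption language.
STOP_WORDS = {"a", "an", "the", "on", "in", "at", "of", "and", "with", "from"}


def fallback_expressions(caption: str) -> list[str]:
    """Split a caption on whitespace, dropping stop words and duplicates."""
    expressions = []
    seen = set()
    for token in caption.split():
        # Remove punctuation stuck to the word, e.g. "Texas," -> "Texas".
        word = token.strip(".,;:!?()\"'")
        key = word.lower()
        if not word or key in STOP_WORDS or key in seen:
            continue
        seen.add(key)
        expressions.append(word)
    return expressions


print(fallback_expressions("Papilio machaon on Asteraceae flower in Croatia"))
```

Note that this splits "Papilio machaon" into two separate words, which is the main limitation of the word-by-word approach compared to an LLM.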

Second step: Search for 1 or 2 Wikidata items that match each expression found in the first step.

Third step: Add these suggestions to the list of all suggestions.
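The second and third steps could look roughly like the sketch below. It assumes the search goes through the public Wikidata `wbsearchentities` API; the helper names are hypothetical, and the sketch only builds the request URL and processes an already-decoded JSON response, without doing any networking.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(expression: str, language: str = "en", limit: int = 2) -> str:
    """Build a wbsearchentities URL returning at most `limit` items."""
    params = {
        "action": "wbsearchentities",
        "search": expression,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def extract_item_ids(response: dict) -> list[str]:
    """Pull the item IDs out of a decoded wbsearchentities response."""
    return [hit["id"] for hit in response.get("search", [])]


def merge_suggestions(existing: list[str], new_ids: list[str]) -> list[str]:
    """Append new item IDs to the existing suggestions, skipping duplicates."""
    merged = list(existing)
    for item_id in new_ids:
        if item_id not in merged:
            merged.append(item_id)
    return merged


print(build_search_url("papilio machaon"))
```

Keeping `limit` at 1 or 2 per expression matches the second step and avoids flooding the suggestion list.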

rohit9625 · 2024-01-08T13:53:28Z

I don't think splitting the caption would work, because how do we identify which words need to be shown in depictions?

nicolas-raoul · 2024-01-08T23:00:00Z

@rohit9625 It is pretty easy using an LLM.

Even splitting word-by-word would work reasonably well.

Prompt for further improvements:

You are a syntax parser that understands all human languages. For each input sentence, extract all nouns from the sentence. Include each adjective with its noun. Do not output anything but the list of nouns.

Example: "Papilio machaon on Asteraceae flower in Croatia"
 - papilio machaon
 - asteraceae
 - flower
 - Croatia

Now do the same for: "Oxcart passing on a wooden bridge in Papua New Guinea"
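As an illustration, the prompt above could be filled in for any caption, and the model's bulleted answer turned back into a list, along the lines of the sketch below. The function names are hypothetical, and the sketch assumes the model replies in the same " - item" format as the example in the prompt; it does not call any model.

```python
PROMPT_TEMPLATE = (
    "You are a syntax parser that understands all human languages. "
    "For each input sentence, extract all nouns from the sentence. "
    "Include each adjective with its noun. "
    "Do not output anything but the list of nouns.\n\n"
    'Example: "Papilio machaon on Asteraceae flower in Croatia"\n'
    " - papilio machaon\n"
    " - asteraceae\n"
    " - flower\n"
    " - Croatia\n\n"
    'Now do the same for: "{caption}"'
)


def build_prompt(caption: str) -> str:
    """Insert the user's caption into the few-shot prompt."""
    return PROMPT_TEMPLATE.format(caption=caption)


def parse_bullet_list(model_output: str) -> list[str]:
    """Keep only lines that look like ' - item'; ignore anything else."""
    expressions = []
    for line in model_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("-"):
            item = stripped[1:].strip()
            if item:
                expressions.append(item)
    return expressions


print(build_prompt("Oxcart passing on a wooden bridge in Papua New Guinea"))
```

Ignoring non-bullet lines is a defensive choice, in case the model adds text despite the instruction not to.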

RitikaPahwa4444 · 2024-01-09T05:19:33Z

> That means only on Pixel 8 Pro for now it seems

Is on-device execution necessary for this use case? The API is available for target API level 21 and higher, and should work if we're not going for Gemini Nano.

nicolas-raoul · 2024-01-09T05:27:20Z

> Is on-device execution necessary for this use case?

On-device execution is necessary. As a matter of privacy policy, we do not send any HTTP request to any server outside of Wikimedia (and Mapbox until recently but hopefully this should be resolved soon).

Details: https://github.com/commons-app/commons-app-documentation/blob/master/android/Privacy-policy.md#privacy-policy

RitikaPahwa4444 · 2024-01-09T05:49:40Z

Thank you for clarifying! Since captions are publicly available, I'd not thought of it from this perspective. This seems like an on-device LLM, but I'm still exploring😅.

nicolas-raoul · 2024-01-09T08:19:20Z

Great link, thanks Ritika! I wonder how many megabytes that would add to the app... If not too many, that's definitely an option.

kanahia1 · 2024-03-03T17:13:28Z

Hey @nicolas-raoul, I have found a model that we can use for NER (named-entity recognition): https://huggingface.co/dslim/bert-base-NER. They provide a TensorFlow model that we can convert to a .tflite model and then integrate into the app.

I tested it with "Papilio machaon on Asteraceae flower in Croatia"; these were the results:

[
  {
    "entity_group": "PER",
    "score": 0.4467669427394867,
    "word": "Pa",
    "start": 0,
    "end": 2
  },
  {
    "entity_group": "PER",
    "score": 0.8728312849998474,
    "word": "mac",
    "start": 8,
    "end": 11
  },
  {
    "entity_group": "MISC",
    "score": 0.9185755848884583,
    "word": "Asteraceae",
    "start": 19,
    "end": 29
  },
  {
    "entity_group": "LOC",
    "score": 0.9997634291648865,
    "word": "Croatia",
    "start": 40,
    "end": 47
  }
]

I believe the accuracy is good, as it gives us Asteraceae and Croatia.
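The results above contain sub-word fragments ("Pa", "mac") together with character offsets. As a rough sketch, the offsets can be used to recover the whole word from the caption, and a score threshold can drop weak matches. The function names and the threshold values are assumptions of this sketch, not something the model or the app prescribes.

```python
def expand_to_word(caption: str, start: int, end: int) -> str:
    """Grow the [start, end) span until it is bounded by whitespace."""
    while start > 0 and not caption[start - 1].isspace():
        start -= 1
    while end < len(caption) and not caption[end].isspace():
        end += 1
    return caption[start:end]


def candidates_from_ner(caption: str, entities: list[dict], min_score: float = 0.4) -> list[str]:
    """Return whole words for entities scoring at least `min_score`."""
    words = []
    for entity in entities:
        if entity["score"] < min_score:
            continue
        word = expand_to_word(caption, entity["start"], entity["end"])
        if word not in words:
            words.append(word)
    return words


caption = "Papilio machaon on Asteraceae flower in Croatia"
entities = [
    {"entity_group": "PER", "score": 0.4467669427394867, "word": "Pa", "start": 0, "end": 2},
    {"entity_group": "PER", "score": 0.8728312849998474, "word": "mac", "start": 8, "end": 11},
    {"entity_group": "MISC", "score": 0.9185755848884583, "word": "Asteraceae", "start": 19, "end": 29},
    {"entity_group": "LOC", "score": 0.9997634291648865, "word": "Croatia", "start": 40, "end": 47},
]
print(candidates_from_ner(caption, entities))
```

With a stricter threshold such as `min_score=0.9`, only "Asteraceae" and "Croatia" remain. Common nouns like "flower" are not detected by this model either way.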

kanahia1 · 2024-03-04T06:00:08Z

@nicolas-raoul, I have converted the model to .tflite. It is ~411 MB, which is very large for use in an Android app (it would increase the app size considerably).

Would it be possible to use MLKit by Google in the app?
https://github.com/googlesamples/mlkit/tree/master/android/entityextraction

nicolas-raoul · 2024-03-04T10:38:56Z

@kanahia1 Great research, thanks!

I feel like 50 MB should be the maximum...

MLKit sounds great! Can it extract keywords? How big is it? 🙂

kanahia1 · 2024-03-04T10:46:45Z

@nicolas-raoul Sure, it can:
https://github.com/googlesamples/mlkit/tree/master/android/entityextraction
Entity extraction has an app size impact of up to ~5.6 MB.

They have provided examples -

Input Text - Meet me at 1600 Amphitheatre Parkway, Mountain View, CA, 94043 Let’s organize a meeting to discuss.
Output Text - 1600 Amphitheatre Parkway, Mountain View, CA 94043

More examples are available here

kanahia1 · 2024-03-04T16:20:35Z

@nicolas-raoul
I tested MLKit, and the results were not satisfactory. I think it would be better to drop the idea of using MLKit until Google makes it more effective, since it is currently in beta.

Test 1 
Input: Dust storm approaching Stratford, Texas
Output: There are no entities detected

Test 2
Input: Aerial view of the city of Mayfield on December 12.
Output: December 12.

Test 3
Input: Interior view of the main nave of the Cathedral in Narbonne, France. The Roman Catholic church is dedicated to Saints Justus and Pastor and it begun its construction in 1272. The choir was finished in 1332, but the rest of the gothic building was never completed, as the result of many factors including sudden changes in the economic status of Narbonne, its unusual size and geographical location (to complete it would have meant demolishing the city wall) and financial constraints.
Output: There are no entities detected

Test 4
Input: Mt Technical from Lewis tops, Lewis Pass Scenic Reserve, New Zealand
Output: There are no entities detected

If I get enough time by the end of this month, I will try to train a custom model 🙂

Thejas775 · 2025-01-22T09:25:24Z

We could try running Gemini Nano through AICore, but it would support very few devices.
I also found this: https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/android
Maybe this will work. What do you think?

nicolas-raoul · 2025-01-23T08:14:46Z

> AICore but it would support very few devices

Yes, this project is very long-term (and not urgent), so even if it currently supports only a few devices, that's OK. Also, development is possible without owning such a device (we will use stubs, and someone else who has such a device will test for real).
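To illustrate the stub idea, here is a language-neutral sketch written in Python. All class and function names are hypothetical: a common parser interface, a stub standing in for the on-device LLM with canned answers, and a whitespace fallback, so that the rest of the feature can be developed and tested without supported hardware.

```python
from typing import Protocol


class CaptionParser(Protocol):
    """Interface shared by the real LLM parser, the stub, and the fallback."""

    def is_available(self) -> bool: ...

    def parse(self, caption: str) -> list[str]: ...


class StubLlmParser:
    """Stand-in for an on-device LLM; returns canned answers for known captions."""

    def __init__(self, canned: dict[str, list[str]], available: bool = True):
        self._canned = canned
        self._available = available

    def is_available(self) -> bool:
        return self._available

    def parse(self, caption: str) -> list[str]:
        return list(self._canned.get(caption, []))


class WhitespaceParser:
    """Fallback that is always available: split on whitespace."""

    def is_available(self) -> bool:
        return True

    def parse(self, caption: str) -> list[str]:
        return caption.split()


def choose_parser(parsers: list[CaptionParser]) -> CaptionParser:
    """Return the first available parser, in order of preference."""
    for parser in parsers:
        if parser.is_available():
            return parser
    raise LookupError("no caption parser available")


stub = StubLlmParser({"Croatia flower": ["Croatia", "flower"]}, available=False)
print(choose_parser([stub, WhitespaceParser()]).parse("Croatia flower"))
```

Someone with a supported device would then only need to swap the stub for a real implementation of the same interface.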

> Well I also found this: https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/android

Fantastic! Since it could be used for more things than just depictions, I mentioned it here: #6143 (comment)

nicolas-raoul added the enhancement label Jan 8, 2024

nicolas-raoul mentioned this issue Jan 15, 2024

Question: Does anyone have a physical Pixel 8 Pro? #5428

Open

nicolas-raoul self-assigned this Jan 9, 2025

nicolas-raoul mentioned this issue Jan 17, 2025

Implement a helper app to crowdsource AICore testing #6138

Open

nicolas-raoul added the gsoc Google Summer of Code label Jan 19, 2025
