Words flat #83

miqwit · 2017-10-05T09:26:04Z

Hello. Thanks for your great work on image-match.

This branch proposes storing the document in ES in a "flat" way, i.e. in a single field rather than one field per word: "simple_words": "123 456 789" rather than {"simple_word_01": 123, "simple_word_02": 456, "simple_word_03": 789}. The results are comparable: the correct image is always found.
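To make the two layouts concrete, here is a minimal sketch that builds both documents from the same list of words. The helper names are hypothetical and the field names follow the example above; the drivers in this branch may build their documents differently.

```python
from typing import Dict, List


def to_fields_document(words: List[int]) -> Dict[str, int]:
    """One field per word position: simple_word_01, simple_word_02, ..."""
    return {f"simple_word_{i:02d}": w for i, w in enumerate(words, start=1)}


def to_flat_document(words: List[int]) -> Dict[str, str]:
    """All words joined into a single space-separated text field."""
    return {"simple_words": " ".join(str(w) for w in words)}


words = [123, 456, 789]
fields_doc = to_fields_document(words)
flat_doc = to_flat_document(words)
print(fields_doc)
print(flat_doc)
```

The flat document keeps the order of the words inside the string, but a full-text search on that field treats them as an unordered bag of terms.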

In more detail (from a run of tests/test_elasticsearch_driver_speed.py):

The first result is always the correct image.
In some cases (2.23%), the flat search returns fewer results (tested on 6 minimum_should_match values), probably due to TF/IDF scoring.
In some cases (7.37%), the flat search returns more results than the fields search.
47.07% of the results from the flat search are the same as those from the fields search (the result tail).
Out of 3000 searches, the flat search is 55.81% faster than the fields search (66.89 seconds instead of 104.24 seconds).
For 9145 documents (the Caltech Object Categories dataset), the flat index is about 10% smaller than the fields index (39.8 MB instead of 43.9 MB).
For 9145 documents, it took 1.5% more time to ingest them into the flat index (217.34 seconds against 214.21 seconds).
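For readers unfamiliar with how minimum_should_match applies to each layout, the sketch below builds one plausible query body per layout using the standard Elasticsearch query DSL (a bool/should of term queries for the fields layout, a match query for the flat layout). The function names are hypothetical, and these are not necessarily the exact queries sent by the drivers in this branch.

```python
from typing import Any, Dict, List


def fields_query(words: List[int], minimum_should_match: int) -> Dict[str, Any]:
    """One term clause per positional field; at least N clauses must match."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"term": {f"simple_word_{i:02d}": w}}
                    for i, w in enumerate(words, start=1)
                ],
                "minimum_should_match": minimum_should_match,
            }
        }
    }


def flat_query(words: List[int], minimum_should_match: int) -> Dict[str, Any]:
    """A single match query on the text field; at least N terms must match."""
    return {
        "query": {
            "match": {
                "simple_words": {
                    "query": " ".join(str(w) for w in words),
                    "minimum_should_match": minimum_should_match,
                }
            }
        }
    }


print(fields_query([123, 456, 789], 2))
print(flat_query([123, 456, 789], 2))
```

Both bodies are plain dictionaries, so they can be inspected or compared without a running cluster.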

The Caltech Object Categories dataset is available here: http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz

To run the test:

Install the dependencies in a virtualenv:
virtualenv ~/.env/image-match-words-flat/
source ~/.env/image-match-words-flat/bin/activate
pip install -r requirements.txt
Launch Elasticsearch via Docker:
docker run -d --rm --name=es -p 9200:9200 -p 9300:9300 elasticsearch:latest
Run the test (takes around 5 minutes):
python tests/test_elasticsearch_driver_speed.py

rhsimplex · 2017-10-20T14:12:54Z

Hi @miqwit thanks for the PR! I'll be away at a conference this week, but I'll try to look over this ASAP.

SimonSteinberger · 2017-12-12T16:56:04Z

I think this approach is a very good idea. But isn't the positional precision of the individual words lost here? I mean, with separate fields, input word1 is compared exclusively to other word1 values in the ES storage. With the flat approach, word1 is matched against all words simultaneously.

If so, this approach certainly works just as well as, or even better than, the original approach for a few hundred thousand images. But at some point (maybe a billion images), flat search might be less efficient because of the many false-positive matches caused by the lost positional precision...?

Cheers, Simon
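The concern about lost positional precision can be illustrated with a small, self-contained sketch. It does not use Elasticsearch; it only counts matches the way each layout would, with made-up word values and hypothetical function names.

```python
from typing import List


def positional_matches(query: List[int], stored: List[int]) -> int:
    """Fields layout: word i of the query is only compared to word i of the stored image."""
    return sum(1 for q, s in zip(query, stored) if q == s)


def flat_matches(query: List[int], stored: List[int]) -> int:
    """Flat layout: a query word matches if it appears anywhere in the stored words."""
    stored_set = set(stored)
    return sum(1 for q in query if q in stored_set)


query = [10, 20, 30, 40]
stored = [20, 10, 30, 99]  # same values as the query, but two of them swapped

print(positional_matches(query, stored))
print(flat_matches(query, stored))
```

With these values the positional count is 1 and the flat count is 3, so a threshold of 2 matching words would accept the stored image only in the flat layout. That is the kind of false positive described above.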

SimonSteinberger · 2017-12-12T17:18:12Z

I guess if enough words are used, it's not a problem. Another thought: how about using an array of integers for this in ES? Maybe an array would even be faster than the concatenated string...
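As a rough sketch of the integer-array idea: Elasticsearch has no dedicated array type, and any field can hold several values, so the words could be indexed as a list of integers and queried with term clauses on that one field. The field name and function names below are assumptions for illustration, not taken from this branch.

```python
from typing import Any, Dict, List


def to_flatint_document(words: List[int]) -> Dict[str, List[int]]:
    """Store all words as a multi-valued integer field (hypothetical field name)."""
    return {"simple_words": [int(w) for w in words]}


def flatint_query(words: List[int], minimum_should_match: int) -> Dict[str, Any]:
    """One term clause per word, all on the same field; at least N must match."""
    return {
        "query": {
            "bool": {
                "should": [{"term": {"simple_words": int(w)}} for w in words],
                "minimum_should_match": minimum_should_match,
            }
        }
    }


print(to_flatint_document([123, 456, 789]))
print(flatint_query([123, 456, 789], 2))
```

Like the flat text field, this layout does not record which position a word came from.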

miqwit · 2017-12-18T09:49:53Z

Hi Simon. Thanks for taking the time to review this. Yes, you are right: the positional precision is lost, and with billions of images it is better to increase the number of words to match. I could not come up with a formal proof of how many words are needed for the two techniques to reach a similar precision, so I used tests to show that the results are comparable. Your idea is interesting: maybe an array is faster than a string of words. I will investigate this and let you know. Thanks for the tip. My intent was not to replace the way it's done today, but to offer another option (with its differences from the first option understood) that is faster and may be needed in more specific cases.

Mickaël

miqwit · 2017-12-20T09:38:19Z

Hi. I added a speed test with a "flatint" driver where words are stored as an array of long ints. It is not as fast as the flat text version. From my test suite (3000 searches) I have the following results:

111.71158003807068 to search fields documents
71.15072679519653 to search flat documents
91.89284920692444 to search flatint documents

(cumulative time, in seconds)
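To compare these timings on a common scale, the sketch below computes two ratios relative to the fields driver: the fraction of time saved, and the relative speedup (the convention that appears to be behind the 55.81% figure in the PR description). The numbers are the three cumulative times quoted above; the function names are hypothetical.

```python
# Cumulative search times in seconds, copied from the results above.
timings = {
    "fields": 111.71158003807068,
    "flat": 71.15072679519653,
    "flatint": 91.89284920692444,
}


def time_saved(baseline: float, other: float) -> float:
    """Fraction of the baseline time that is saved (0.25 means 25% less time)."""
    return (baseline - other) / baseline


def speedup(baseline: float, other: float) -> float:
    """How much faster 'other' is, as a fraction (0.25 means 25% faster)."""
    return baseline / other - 1.0


baseline = timings["fields"]
saved = {name: time_saved(baseline, t) for name, t in timings.items() if name != "fields"}
faster = {name: speedup(baseline, t) for name, t in timings.items() if name != "fields"}

for name in saved:
    print(f"{name}: {saved[name]:.1%} less time, {faster[name]:.1%} faster than fields")
```

The two ratios differ for the same pair of timings, so it helps to state which one is being reported.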

PS: What is the proper way to access an Elasticsearch instance from Travis so that my tests run properly?

taylorjdawson · 2019-02-18T22:38:17Z

@miqwit are you still pursuing this? Also have you tried testing on a very high number of images? I would be curious to see a graph of the stats based on number of images.

miqwit · 2019-02-21T11:13:25Z

Yes, I tested with this dataset: http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz, which contains about 9,100 images. I guess that would not count as a "very high number of images". What volume would be enough to convince you? Can you suggest a dataset?

You can see my speed test in this file already: https://github.com/miqwit/image-match/blob/words-flat/tests/test_elasticsearch_driver_speed.py

I will work on a graph display, which would indeed be interesting.

taylorjdawson · 2019-02-22T03:44:57Z

I need to index about 8,530,641 images so I would consider that a high number! 😆

miqwit · 2019-02-22T10:49:38Z

Yes, well, I won't test it on that many images :) I'll see if I can raise it to a more significant number. At least I'll pull some statistical data.

miqwit · 2019-03-19T15:07:40Z

Hi @taylorjdawson. I improved my script so that it generates graphs (with matplotlib) of various performance metrics. I am in the process of validating the concept with a probabilistic approach (this should answer @SimonSteinberger's questions more accurately).

Still, you can run my latest benchmark by launching an ES instance with:
docker run -d -p 9200:9200 -p 9300:9300 elasticsearch:5.5.2

and running the test with:
python test_elasticsearch_driver_speed.py --delete-indices=True --populate-indices=True

This will generate PNG images whose names start with plot_

miqwit · 2019-03-19T15:14:11Z

You can find the plots generated on my machine here.

Here is the query time one: [plot image not included]

All of the plots need more explanation; I plan to provide it soon (here, or in a public post somewhere).

heipei · 2022-10-28T12:28:34Z

Hey @miqwit did you ever figure out why the flat_txt is so much faster than the fields query?

miqwit · 2022-10-29T14:37:03Z

Hello @heipei. No, not really... That would require diving into the ES engine. It is based on Lucene and is open source, so I guess it's feasible, but I don't plan to do it. The goal of this work was just to show the difference.

My gut feeling is that the flat search does not test the position of the words, as @SimonSteinberger noted in a previous comment. I can imagine that testing field by field takes more time while providing more "accuracy". But really, I don't know...!

miqwit added 4 commits October 4, 2017 19:03

Store documents words in a single text field

197b2f2

Store documents words in a single text field

f94f007

Add requirements file from my virtualenv pip freeze

ffec258

Remove verbose when untar

f37585b

miqwit added 3 commits December 20, 2017 10:21

Added a speed test with 'flatint', storing words as list of long

f7e98bb

Added a speed test with 'flatint', storing words as list of long

39a7faa

Merge branch 'words-flat' of github.com:miqwit/image-match into words-flat

0b40728

excerebrose mentioned this pull request Nov 14, 2018

Elastic Search query can be modified to improve performance on search image #105

Open

miqwit added 3 commits February 22, 2019 18:48

Merge branch 'words-flat' of github.com:miqwit/image-match into words-flat

dc0f3bd

Generate plots when running benchmark

c05934d

Minor comments and changes

4ddcfa6

Added plots examples

a0028d6
