Words flat #83

Open. miqwit wants to merge 11 commits into base: master

@miqwit miqwit commented Oct 5, 2017

Hello. Thanks for your great work on image-match.

This branch is a suggestion to store the document in ES in a "flat" way rather than with one field per word, i.e. "simple_words": "123 456 789" rather than {"simple_word_01": 123, "simple_word_02": 456, "simple_word_03": 789}. The results are comparable: the correct image is always found.
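To make the two layouts concrete, here is a minimal sketch of converting a per-field document into the flat form. The helper `flatten_words`, the `path` key, and the default field names are hypothetical and only follow the example above; the actual driver code in this branch may differ.

```python
import re


def flatten_words(doc, prefix="simple_word_", flat_key="simple_words"):
    """Collapse simple_word_NN fields into one space-separated string field.

    Hypothetical helper for illustration; not taken from the PR.
    """
    pattern = re.compile(r"^" + re.escape(prefix) + r"(\d+)$")
    words = []
    flat_doc = {}
    for key, value in doc.items():
        match = pattern.match(key)
        if match:
            # Remember the numeric suffix so the words keep their order.
            words.append((int(match.group(1)), value))
        else:
            flat_doc[key] = value
    words.sort()
    flat_doc[flat_key] = " ".join(str(value) for _, value in words)
    return flat_doc


fields_doc = {
    "path": "img.jpg",  # assumed extra field, for illustration only
    "simple_word_01": 123,
    "simple_word_02": 456,
    "simple_word_03": 789,
}
flat_doc = flatten_words(fields_doc)
print(flat_doc)
```

Note that the flat string keeps the word order, but a full-text query against it does not necessarily take that order into account.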

In more detail (from a run of the test test_elasticsearch_driver_speed.py):

  • The first result is always the correct image

  • In some cases (2.23%), the flat search returns fewer results (tested on 6 minimum_should_match values), probably due to TF/IDF scoring

  • In some cases (7.37%), the flat search returns more results than the regular search

  • 47.07% of the results from the flat search are the same as those from the fields search (the result tail)

  • Out of 3000 searches, the flat search is 55.81% faster than the field search (66.89 seconds instead of 104.24 seconds)

  • For 9145 documents (the Caltech Object Categories dataset), the flat index is 10% smaller than the fields index (39.8 MB instead of 43.9 MB)

  • For 9145 documents, it took 1.5% more time to ingest them into the flat index (217.34 seconds versus 214.21 seconds)

The Caltech Object Categories dataset can be found here: http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz
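The percentages in the list can be re-derived from the rounded timings and sizes quoted there. The sketch below only uses those rounded figures, so the results may differ in the last digits from values computed on the raw measurements; the variable names are made up for this example.

```python
def pct_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


# Rounded figures quoted in the list above.
search_fields_s, search_flat_s = 104.24, 66.89
index_fields_mb, index_flat_mb = 43.9, 39.8
ingest_fields_s, ingest_flat_s = 214.21, 217.34

# How much longer the fields search takes, relative to the flat search.
speedup_pct = pct_change(search_fields_s, search_flat_s)
# How much total search time the flat search saves, relative to fields.
time_saved_pct = -pct_change(search_flat_s, search_fields_s)
# How much smaller the flat index is, relative to the fields index.
size_saved_pct = -pct_change(index_flat_mb, index_fields_mb)
# How much longer ingestion into the flat index takes.
ingest_extra_pct = pct_change(ingest_flat_s, ingest_fields_s)

for label, value in [
    ("fields search takes longer by", speedup_pct),
    ("flat search saves", time_saved_pct),
    ("flat index is smaller by", size_saved_pct),
    ("flat ingestion takes longer by", ingest_extra_pct),
]:
    print("%s %.2f%%" % (label, value))
```

With these inputs the "faster" figure corresponds to the fields search taking about 56% longer than the flat search, which is the same as the flat search taking about 36% less total time.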

To run the test:

  1. Install dependencies in a virtualenv
    virtualenv ~/.env/image-match-words-flat/
    source ~/.env/image-match-words-flat/bin/activate
    pip install -r requirements.txt

  2. Launch Elasticsearch via Docker:
    docker run -d --rm --name=es -p 9200:9200 -p 9300:9300 elasticsearch:latest

  3. Run the test (it takes around 5 minutes):
    python tests/test_elasticsearch_driver_speed.py

@rhsimplex
Owner

Hi @miqwit thanks for the PR! I'll be away at a conference this week, but I'll try to look over this ASAP.

@SimonSteinberger

SimonSteinberger commented Dec 12, 2017

I think this approach is a very good idea. But isn't the positional precision of the individual words lost here? I mean, with separate fields, input word1 is compared exclusively to other word1 values in the ES storage. With the flat approach, word1 is matched against all words simultaneously.

If so, this approach certainly works for a few hundred thousand images just as well as, or even better than, the original approach. But at some point (maybe a billion images), flat search might be less efficient because of many false-positive matches due to the lost positional precision...?
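The concern about positional precision can be illustrated with a toy model. This is a deliberate simplification: it counts exact word matches in plain Python and does not reproduce Elasticsearch scoring. The function names and sample signatures are invented for this example.

```python
def positional_matches(query, stored):
    """Count words that match at the same position (per-field behaviour)."""
    return sum(1 for q, s in zip(query, stored) if q == s)


def flat_matches(query, stored):
    """Count query words that occur anywhere in the stored words (flat behaviour)."""
    stored_set = set(stored)
    return sum(1 for q in query if q in stored_set)


query = [123, 456, 789]
near_duplicate = [123, 456, 780]   # differs only in the last word
shuffled = [789, 123, 456]         # same words, all at different positions

for name, stored in [("near_duplicate", near_duplicate), ("shuffled", shuffled)]:
    print(name, positional_matches(query, stored), flat_matches(query, stored))
```

In this model the shuffled signature has no positional matches but matches every word in the flat comparison, which is the kind of false positive described above.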

Cheers, Simon

@SimonSteinberger

I guess if enough words are used, it's not a problem. Another thought: how about using an array of integers for this in ES? Maybe an array would be even faster than the concatenated string...
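As a rough sketch of what the three variants could look like as Elasticsearch query bodies (per-word fields, flat text, and an integer array): the field names follow the example in the PR description, and the builder functions are hypothetical. The queries actually issued by the drivers in this branch may be structured differently.

```python
def fields_query(words, minimum_should_match=1):
    """One term clause per word, each against its own field."""
    should = [
        {"term": {"simple_word_%02d" % (i + 1): word}}
        for i, word in enumerate(words)
    ]
    return {"query": {"bool": {"should": should,
                               "minimum_should_match": minimum_should_match}}}


def flat_text_query(words, minimum_should_match=1):
    """A single match query against the space-separated text field."""
    text = " ".join(str(word) for word in words)
    return {"query": {"match": {"simple_words": {
        "query": text,
        "minimum_should_match": minimum_should_match,
    }}}}


def flat_int_query(words, minimum_should_match=1):
    """One term clause per word, all against the same integer-array field."""
    should = [{"term": {"simple_words": word}} for word in words]
    return {"query": {"bool": {"should": should,
                               "minimum_should_match": minimum_should_match}}}


words = [123, 456, 789]
print(fields_query(words, 2))
print(flat_text_query(words, 2))
print(flat_int_query(words, 2))
```

Only the first variant ties each word to its position; the other two match a word wherever it appears in the document.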

@miqwit
Author

miqwit commented Dec 18, 2017 via email

@miqwit
Author

miqwit commented Dec 20, 2017

Hi. I added a speed test with a "flatint" driver where words are stored as an array of long ints. It is not as fast as the text array. From my test suite (3000 searches) I have the following results:

111.71158003807068 to search fields documents
71.15072679519653 to search flat documents
91.89284920692444 to search flatint documents

(cumulative time, in seconds)
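For readers who want to reproduce this kind of measurement, a cumulative timing loop could be sketched as follows. The `fake_search` function is a stand-in so the example runs without Elasticsearch; it is not the driver code from this branch, and the real test script may measure time differently.

```python
import time


def cumulative_search_time(search_fn, queries):
    """Sum the wall-clock time spent in search_fn over all queries."""
    total = 0.0
    for query in queries:
        start = time.perf_counter()
        search_fn(query)
        total += time.perf_counter() - start
    return total


def fake_search(query):
    """Stand-in for a driver search call (assumption, no ES involved)."""
    return sorted(query)


# Hypothetical driver table; each entry would wrap a real driver's search.
drivers = {"fields": fake_search, "flat": fake_search, "flatint": fake_search}
queries = [[3, 1, 2]] * 3000

timings = {name: cumulative_search_time(fn, queries) for name, fn in drivers.items()}
for name, seconds in timings.items():
    print("%s to search %s documents" % (seconds, name))
```

Timing each call individually and summing keeps setup work, such as building the query list, out of the reported totals.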

PS: what is the proper way to access an Elasticsearch instance from Travis so that my tests run properly?

@taylorjdawson

@miqwit are you still pursuing this? Also have you tried testing on a very high number of images? I would be curious to see a graph of the stats based on number of images.

@miqwit
Author

miqwit commented Feb 21, 2019

> @miqwit are you still pursuing this? Also have you tried testing on a very high number of images? I would be curious to see a graph of the stats based on number of images.

Yes, I tested with this dataset: http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz, which has about 9100 images. I guess that would not count as a "very high number of images". What volume would be appropriate to convince you? Can you suggest a dataset?

You can see my speed test in this file already: https://github.com/miqwit/image-match/blob/words-flat/tests/test_elasticsearch_driver_speed.py

I will work on a graph display, which will indeed be interesting.

@taylorjdawson

I need to index about 8,530,641 images so I would consider that a high number! 😆

@miqwit
Author

miqwit commented Feb 22, 2019 via email

@miqwit
Author

miqwit commented Mar 19, 2019

Hi @taylorjdawson. I improved my script so that it generates graphs (with matplotlib) of various performance metrics. I am in the process of validating the concept with a probabilistic approach (this should answer @SimonSteinberger's thoughts more accurately).

Still, you can run my latest benchmark by launching an ES instance with:
docker run -d -p 9200:9200 -p 9300:9300 elasticsearch:5.5.2

and running the test with:
python test_elasticsearch_driver_speed.py --delete-indices=True --populate-indices=True

This will generate PNG images whose names start with plot_

@miqwit
Author

miqwit commented Mar 19, 2019

You can find generated plots on my machine here

Here is the query time one:

All of these plots need more explanation; I plan to provide it soon (here, or in a public post somewhere).

@heipei

heipei commented Oct 28, 2022

Hey @miqwit did you ever figure out why the flat_txt is so much faster than the fields query?

@miqwit
Author

miqwit commented Oct 29, 2022

Hello @heipei. No, not really... It would require diving into the ES engine. It is based on Lucene and open source, so I guess it's feasible, but I don't plan to do it. This work was just meant to show the effect.

My gut feeling is that the flat search does not test the position of the words, as stated above in a previous comment from @SimonSteinberger. I can conceive that testing field by field takes more time while providing more "accuracy". But really, I don't know...!

5 participants