SQL: Index can return too many entries when used with ORDER BY #1760

Open
gramian opened this issue Oct 16, 2024 · 3 comments

gramian commented Oct 16, 2024

ArcadeDB Version:

ArcadeDB Server v24.11.1-SNAPSHOT (build 4207821d055e37cbd74b76683e45073da759c2de/1729083296142/console)

OS and JDK Version:

Running on Mac OS X 12.7.6 - OpenJDK 64-Bit Server VM 17.0.12 (Homebrew)

A query can return the wrong number of records, specifically too many. The problem is not easy to reproduce, so I attach a small database (75 records) generated from public data. The behavior occurs when certain databases, such as the one provided below, are queried via SQL (i.e. SELECT) and a non-unique indexed property is used both in the projection and as the ordering key (in ORDER BY). The schema is given here: https://github.com/ulbmuenster/dataasee/blob/main/database/schema.sql

Expected behavior

Correct number of records returned.

Actual behavior

Too many records are returned.

Steps to reproduce

Restore this backup: metadatalake-backup-20240829-100753793.zip (337KB)

SELECT count(*) FROM metadata; -- There are 75 records in the database

SELECT name, publicationYear FROM metadata; -- Returns 75 records correctly

SELECT name FROM metadata ORDER BY publicationYear; -- Returns 75 records correctly

SELECT name, publicationYear FROM metadata ORDER BY publicationYear; -- Returns 117 records INCORRECTLY

Notes

At the end of August (2024-08-28 to 2024-08-31), @lvca and I looked into this. Potential sources of this behavior could be:

  • FetchFromIndexStep
  • The UPSERT entering the data.

Here is a sample UPSERT statement that generated the provided database:

UPDATE metadata MERGE {"creators":[{"@type":"pair","name":"Neumann, Georg"},{"@type":"pair","name":"Boivin, Odette"},{"@type":"pair","name":"Kleber, Kristin"},{"@type":"pair","name":"Neumann, Georg"},{"@type":"pair","name":"Boivin, Odette"},{"@type":"pair","name":"Kleber, Kristin"}],"dataLocation":"https://data-management.uni-muenster.de/direct-access/wwurdm/07988641231","dataSteward":"https://datastore.uni-muenster.de","description":"The dataset comprises the RTI-data of the cuneiform tablet VAT 17966 (N5:27).\\n\\nRTI = Reflectance Transformation Imaging (RTI) is a computational photographic method that captures a subject’s surface shape and color and enables the interactive re-lighting of the subject from any direction in a software viewer, revealing details not visible with the naked eye (cf. https://culturalheritageimaging.org/).","identifiers":[{"@type":"pair","data":"10.17879/07988641225","name":"DE-6"},{"@type":"pair","data":"10.17879/07988641225","name":"null"}],"keywords":"Babylon,Cuneiform,Neo-Babylonian","language":"#4:0","metadataQuality":"Incomplete","name":"VAT 17966 - RTI dataset","publicationYear":2024,"publisher":"Universität Münster","rawChecksum":"TcrJlnTQYfugD2rROs2Y3g==","rawType":"marc21","recordId":"MTA0MzQ4NDE4NDMwODAyNDg3NA","resourceType":null,"rights":"CC BY-NC-SA 4.0","source":"https://datastore.uni-muenster.de/oai","synonyms":[{"@type":"pair","data":"GoviB","name":"Alternative Title"}]} UPSERT WHERE recordId == 'MTA0MzQ4NDE4NDMwODAyNDg3NA';

As a workaround, the projection can be renamed, i.e.:

SELECT name, publicationYear AS pubYear FROM metadata ORDER BY publicationYear;

but this also prevents the index from being used!

@gramian gramian added the bug Something isn't working label Oct 16, 2024
@gramian gramian changed the title Index returns too many entries when used with ORDER BY SQL: Index returns too many entries when used with ORDER BY Oct 16, 2024
@gramian gramian changed the title SQL: Index returns too many entries when used with ORDER BY SQL: Index can return too many entries when used with ORDER BY Oct 16, 2024

gramian commented Oct 17, 2024

Using either UPDATE ... MERGE ... UPSERT or UPDATE ... CONTENT ... UPSERT produces too many index results.

gramian commented Nov 29, 2024

Here is some more testing which may help locate the source:

  • Both sort directions, ASC and DESC, produce duplicate results, but in different numbers: 146 for ASC and 117 for DESC, for a database with 81 records.
  • This index behavior also occurs for INSERT INTO ... CONTENT ...
  • Using DISTINCT removes the duplicates, but this is not a workaround, as in paged results the duplicates may be spread across pages.
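
To narrow down which records are returned more than once, the rows of the affected query could be grouped client-side by a unique key. The following is only a minimal sketch in Python. It assumes the rows are available as a list of dictionaries and that recordId (the key used in the UPSERT above) is unique per record and part of the projection. The function name duplicate_counts and the sample rows are made up for illustration and do not come from the attached backup. Note that adding recordId to the projection changes the query, so the duplicate counts may differ from those reported above.

```python
from collections import Counter


def duplicate_counts(rows, key="recordId"):
    """Return {key value: occurrences} for every key value seen more than once.

    `rows` is assumed to be a list of dicts, one per returned record.
    `key` is assumed to identify a record uniquely (hypothetical default).
    """
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Made-up rows for illustration only; not taken from the attached backup.
rows = [
    {"recordId": "a", "publicationYear": 2023},
    {"recordId": "b", "publicationYear": 2024},
    {"recordId": "b", "publicationYear": 2024},
    {"recordId": "c", "publicationYear": 2024},
    {"recordId": "b", "publicationYear": 2024},
]

dups = duplicate_counts(rows)
# Number of rows beyond one per record, i.e. the surplus in the result set.
surplus = sum(n - 1 for n in dups.values())
print(dups, surplus)
```

Comparing the per-record counts for ASC and DESC might show whether the same records are duplicated in both directions or whether the surplus depends on the traversal order.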

gramian commented Dec 11, 2024

More testing reveals:

  • Using multiple order keys, like ORDER BY publicationYear, name, returns the correct number of results. @lvca This also does not use the publicationYear index, right?
  • The data type does not seem to make a difference (tested with SHORT, INTEGER, and LONG).
