
Values in registry that should be numbers/integers are indexed as strings in JSON API response #153

Closed
tloubrieu-jpl opened this issue Dec 6, 2023 · 29 comments
Labels: B14.1, B15.0, bug (Something isn't working), duplicate (This issue or pull request already exists), p.must-have, s.medium

Comments

@tloubrieu-jpl
Member

💡 Description

  • Check if the gt, lt, etc. operators still work on numbers in the q parameter (an example of the kind of comparison in question is shown below).
  • Trace back when/how/why the change was introduced.
  • Decide whether we want to revert to the previous behavior.
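
For reference, a q-parameter comparison of the kind in question looks like the following (a sketch; the endpoint and field name are taken from the discussion further down this thread):

curl -s --get 'https://pds.nasa.gov/api/search/1/products' --data-urlencode 'limit=10' --data-urlencode 'q=(pds:File.pds:records gt "100")' -H 'accept: application/json'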

⚔️ Parent Epic / Related Tickets

No response

@alexdunnjpl
Contributor

@tloubrieu-jpl could I bug you for more details?

  • example node/document
  • example field and expected value
  • example request with relevant query
  • expected behaviour - data type in OpenSearch document
  • expected behaviour - data type in http response

@alexdunnjpl
Contributor

alexdunnjpl commented Feb 2, 2024

Example document/fields
node: psa-prod
id: urn:esa:psa:em16_tgo_cas:data_raw:cas_raw_sc_20230309t163649-20230309t163653-23606-60-blu-1122087748-29-3::3.0

{
  "pds:Axis_Array/pds:sequence_number": [
    "1",
    "2"
  ],
  "em16_tgo_cas:Detector_Information/em16_tgo_cas:pixel_width": [
    "10.0"
  ]
}
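
A quick way to check how these two fields are mapped in the index (a sketch: the index is assumed to be named registry, the endpoint and credentials are placeholders, and the / in each field name must be URL-encoded as %2F for the field-mapping endpoint):

curl -s -u <user> 'https://<psa-prod-endpoint>/registry/_mapping/field/pds:Axis_Array%2Fpds:sequence_number,em16_tgo_cas:Detector_Information%2Fem16_tgo_cas:pixel_width?pretty'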

@alexdunnjpl
Contributor

Of note: this document may never have had repairkit run on it - it lacks version metadata for any sweeper - though it's possible a pre-versioning repairkit was run on it.

Simplest way to confirm is probably just to write a test and run it against various versions (current, and all pre-versioning versions) of the repairkit sweeper.

If it isn't due to a bug in repairkit, then it seems like harvest may not be storing values correctly, assuming that the field metadata in the DD is present and correct.

@jordanpadams
Member

@alexdunnjpl 👍 Not sure where this bug is. That being said, there is definitely some funkiness happening across the schemas, based upon this warning I have seen from our EN cross-cluster search:

[Screenshot, 2024-02-02 12:18 PM: Kibana Index Patterns warning about fields with conflicting types across indices]

This is from the Index Patterns page, where it tries to bring all the indexes' fields together and is hitting type collisions. I imagine something went weird here with past versions of Harvest (hopefully not the current one), and we just need to do a scrub of the types.

@tloubrieu-jpl ☝️

@alexdunnjpl
Contributor

alexdunnjpl commented Feb 2, 2024

@tloubrieu-jpl @jordanpadams I'm softly skeptical that this is new/erroneous behaviour - that is, skeptical that OpenSearch was ever intended to use float/int types when storing numerically-defined fields.

@jordanpadams if you're seeing collisions, does that suggest that some field is differently-typed across different nodes? If so, that would suggest that at some point, harvest changed from (I assume) varied field types to keyword.

Thoughts?

@tloubrieu-jpl
Member Author

Hi @alexdunnjpl ,

Thanks for looking at that.
My basis for thinking we have a regression (all numbers coming back as strings) is that in the past (approximately 2022, maybe later) we had integration tests that filtered on numbers, see for example {{baseUrl}}/collections?limit=10&q=_file_size ne 8042.

Then I removed these tests when I transitioned the integration tests to the I&T team, so that we could restart from something less chaotic than what the test suite had become. So I am not sure when the numbers stopped being numbers.

So you are saying that none of the values are stored as numbers in OpenSearch. Does the schema confirm that? Is the current version of harvest pushing documents with numbers converted to strings?

Thanks

@tloubrieu-jpl
Member Author

Sorry @alexdunnjpl, I think you already answered my question about harvest not converting values to strings.

@alexdunnjpl
Contributor

alexdunnjpl commented Feb 2, 2024

This ElasticSearch issue seems to indicate that while numeric values are stored as-is in _source, they are indexed as strings if typed as keyword.

It seems fraught to intentionally write numerics to keyword-typed fields, on that basis alone.

@tloubrieu-jpl going the other direction, could you give an example of a node, document and numeric field which is in OpenSearch as a numeric type rather than a string?

Regarding the query in the tests, I'm not familiar with the internals of our query-language parsing, but I'd expect that OpenSearch will happily take a numeric type in a query and cast it when doing a comparison, if necessary. I'll test that assumption now.

@alexdunnjpl
Contributor

alexdunnjpl commented Feb 2, 2024

Confirmed: OS will accept a numeric to compare with a keyword field, convert it to a string, and perform a lexicographic comparison (resulting in erroneous results, for example finding that 9.0 > "10.0")
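
As a concrete sketch, a range query like the one below against a keyword-mapped field such as the pixel_width one from the example document above would fail to match its value "10.0", because the comparison is between the strings "9.0" and "10.0" rather than between numbers:

{
    "query": {
        "range": {
            "em16_tgo_cas:Detector_Information/em16_tgo_cas:pixel_width": {
                "gt": 9.0
            }
        }
    }
}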

So it looks like there are two potential issues:

  • numeric fields are erroneously indexed as keyword, when this is inappropriate
  • numeric fields are being written as string in at least some cases (since if they were mapped as keyword but written as int/float, they should still be returned as int/float in _source, per that ES issue, if the info is still current)
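
That second point can be reproduced against a scratch index (a sketch; the endpoint is a placeholder and the index/field names are made up for illustration): a numeric value written into a keyword-mapped field is still returned as a number in _source, while the indexed term is its string form.

curl -X PUT '<endpoint>/scratch-keyword-test' -H 'Content-Type: application/json' -d '{"mappings": {"properties": {"pixel_width": {"type": "keyword"}}}}'
curl -X PUT '<endpoint>/scratch-keyword-test/_doc/1?refresh=true' -H 'Content-Type: application/json' -d '{"pixel_width": 10.0}'
curl -s '<endpoint>/scratch-keyword-test/_doc/1'   # _source still shows 10.0 as a number
curl -s '<endpoint>/scratch-keyword-test/_search' -H 'Content-Type: application/json' -d '{"query": {"range": {"pixel_width": {"gt": 9.0}}}}'   # no hits: 9.0 is cast to "9.0", and "10.0" < "9.0" lexicographically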

Looking at the psa-prod index, I'm seeing plenty of numeric fields, e.g.

"cart:Equirectangular/cart:latitude_of_projection_origin": {
                    "type": "double"
                },
                "cart:Equirectangular/cart:longitude_of_central_meridian": {
                    "type": "double"
                },
                "cart:Equirectangular/cart:standard_parallel_1": {
                    "type": "double"
                },
                "cart:Geo_Transformation/cart:upperleft_corner_x": {
                    "type": "double"
                },
                "cart:Geo_Transformation/cart:upperleft_corner_y": {
                    "type": "double"
                },
                "cart:Geodetic_Model/cart:a_axis_radius": {
                    "type": "double"
                },
                "cart:Geodetic_Model/cart:b_axis_radius": {
                    "type": "double"
                },
                "cart:Geodetic_Model/cart:c_axis_radius": {
                    "type": "double"
                },

but the field I'm using as a test example is keyword:

                "em16_tgo_cas:Detector_Information/em16_tgo_cas:pixel_width": {
                    "type": "keyword"
                },

so either:

  • the cart* properties were indexed with an older version of harvest, and newer versions of harvest switched to using keyword for everything
  • the DD is bad for some fields

in either case, seems necessary to

  • identify the error as harvest or DD
  • fix the error
  • fix the bad data by resolving any disparity between the DD and the index, on every node (using a sweeper for this seems like the obvious choice)
  • reindex each node (this could be incorporated into the sweeper, triggered only if a disparity exists - need to think through what happens if data is harvested during a reindex operation, and whether there's any potential for data indexing to be lost and not re-performed). Can you re-index specific fields, I wonder?
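
On the reindexing point: an existing field's mapping type can't be changed in place, so the usual pattern (a sketch, with placeholder index names) is to create a new index with the corrected mapping and copy documents across with the _reindex API. Note that _reindex operates on whole documents rather than individual fields, though source.query can restrict which documents get copied.

curl -X POST '<endpoint>/_reindex' -H 'Content-Type: application/json' -d '
{
  "source": { "index": "registry" },
  "dest": { "index": "registry-corrected" }
}'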

@alexdunnjpl
Contributor

Suggest moving this ticket to harvest

@tloubrieu-jpl tloubrieu-jpl transferred this issue from NASA-PDS/registry-api Feb 2, 2024
@tloubrieu-jpl
Member Author

@alexdunnjpl @jordanpadams I moved the ticket to the harvest repository. From what was said before, I suggest 2 actions:

  1. check that the latest version of harvest is not breaking the schema
  2. develop a utility to straighten out the schema; it could include a feature for the ticket Keyword fields do not support like queries registry-api#351. This development could be a sub-script of registry-manager or registry-sweeper; my first thought was to make it part of registry-manager/registry-common.

@jordanpadams jordanpadams changed the title from "Investigate why we now have numbers as string in the JSON API response" to "Values in registry that should be numbers/integers are indexed as strings in JSON API response" Feb 5, 2024
@jordanpadams jordanpadams added the bug (Something isn't working) and s.high labels and removed the i&t.skip and task labels Feb 5, 2024
@tloubrieu-jpl
Member Author

The properties which we expect to be number are:

      "pds:Record_Character.pds:record_length" : [ "110" ],
      "pds:File.pds:file_size" : [ "3850" ],
      "pds:Table_Character.pds:records" : [ "35" ],

As seen in request https://pds.nasa.gov/api/search/1/products

@alexdunnjpl
Contributor

@tloubrieu-jpl @jordanpadams I'm trying to inspect the relevant index, but I'm not having much luck tracking it down.

The first product returned by that request is urn:nasa:pds:compil-comet:nuc_properties:extinct::1.0, which has "ops:Harvest_Info.ops:node_name": "PDS_SBN". Directly inspecting the OpenSearch registry index mappings for search-sbnpsi-prod and search-sbnumb-prod yields none of those three properties. Is it just a bad assumption on my part that that's where they'd be?

Is there a way to quickly determine which OS node is serving a particular document?
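
One possible way to check (a sketch, assuming the node clusters are configured as remote clusters on the EN cross-cluster search endpoint mentioned above): the _index metadata field on each hit carries the remote-cluster prefix, so an ids query against the cross-cluster index pattern would show which cluster/index is serving the document.

curl -s -u <user> 'https://<en-endpoint>/*:registry,registry/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "ids": { "values": ["urn:nasa:pds:compil-comet:nuc_properties:extinct::1.0"] } },
  "_source": false
}'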

@jordanpadams
Member

@alexdunnjpl I have no idea what is going on here... Will try out all the registries. Not sure how this is possible.

@jordanpadams
Member

It looks like the collection is there, but the product is not...

@jordanpadams
Member

@alexdunnjpl actually I lied.

curl -u <user> 'https://search-sbnumd-prod-o5i2rnn265gnwmv2quk4n7uram.us-west-2.es.amazonaws.com/registry/_search?q={_id:"urn:nasa:pds:compil-comet:nuc_properties:extinct::1.0"}&pretty=true'

@alexdunnjpl
Contributor

alexdunnjpl commented Feb 15, 2024

Sorry @jordanpadams - didn't mean to send you off to do the thing I was lazily avoiding!

So the issue with the "missing" property mappings was that I forgot to convert them from . to / 🤦‍♂️

The good-ish news is that those are all mapped as long in the index.

@alexdunnjpl
Contributor

More good-ish news.

Searches appear to respect the mapping type, i.e. when performing comparisons the values aren't compared lexicographically and therefore aren't returning erroneous results.

ex. this query

{
    "query": {
        "bool": {
            "must": [
                {
                    "range": {
                        "pds:File/pds:records": {
                            "gte": 0,
                            "lte": 35
                        }
                    }
                }
            ]
        }
    },
    "_source": {
        "includes": ["pds:File/pds:records"]
    },
    "size": 3,
    "track_total_hits": true
}

returns the following (note presence of value "7")

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 150,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "registry",
                "_type": "_doc",
                "_id": "urn:nasa:pds:compil-comet:nuc_properties:extinct::1.0",
                "_score": 1.0,
                "_source": {
                    "pds:File/pds:records": [
                        "35"
                    ]
                }
            },
            {
                "_index": "registry",
                "_type": "_doc",
                "_id": "urn:nasa:pds:gbo-kpno:mosaic-9p:reduced_2005_july0_conv_stand_july04_tab::1.0",
                "_score": 1.0,
                "_source": {
                    "pds:File/pds:records": [
                        "7"
                    ]
                }
            },
            {
                "_index": "registry",
                "_type": "_doc",
                "_id": "urn:nasa:pds:gbo-kpno:mosaic-9p:reduced_200_standards_centers_july04_tab::1.0",
                "_score": 1.0,
                "_source": {
                    "pds:File/pds:records": [
                        "31"
                    ]
                }
            }
        ]
    }
}

@alexdunnjpl
Contributor

alexdunnjpl commented Feb 16, 2024

Testing via the API appears to indicate that comparisons are being correctly performed:

~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products'   --data-urlencode 'limit=0'   --data-urlencode 'q=(pds:File.pds:records gt "100")'   --data-urlencode 'start=0'   -H 'accept: application/json' | jq .summary.hits
1194175
~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products'   --data-urlencode 'limit=0'   --data-urlencode 'q=(pds:File.pds:records gt "1000")'   --data-urlencode 'start=0'   -H 'accept: application/json' | jq .summary.hits
501838
~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products'   --data-urlencode 'limit=0'   --data-urlencode 'q=(pds:File.pds:records gt "10000")'   --data-urlencode 'start=0'   -H 'accept: application/json' | jq .summary.hits
150101
~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products'   --data-urlencode 'limit=0'   --data-urlencode 'q=(pds:File.pds:records gt "100000")'   --data-urlencode 'start=0'   -H 'accept: application/json' | jq .summary.hits
19194
~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products'   --data-urlencode 'limit=0'   --data-urlencode 'q=(pds:File.pds:records gt "1000000")'   --data-urlencode 'start=0'   -H 'accept: application/json' | jq .summary.hits
219
~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products'   --data-urlencode 'limit=0'   --data-urlencode 'q=(pds:File.pds:records gt "10000000")'   --data-urlencode 'start=0'   -H 'accept: application/json' | jq .summary.hits
11

Note that the records gt threshold increases tenfold each time, and the hit count drops drastically, in a manner inconsistent with lexicographic comparison, which would yield fairly consistent hit counts.

@tloubrieu-jpl @jordanpadams with this in mind, what disconnect, if any, still exists between current and desired behaviour, and is it a regression, or a new feature (if that's known for certain)?

@jordanpadams
Member

@alexdunnjpl, @tloubrieu-jpl may have a specific instance where this was breaking, but, in general, I think this is very tightly coupled with NASA-PDS/registry#230. For this particular case, it may have been something we broke in the schema during re-indexing or repair kit or ???, but, in general, I think we may need a way to validate our OpenSearch schema types match the expectations of our released information model schemas, and repair those schema fields where there is a mismatch.

I will try to poke through Kibana to identify where OpenSearch sees a mismatch in schema types across all the clusters.

@alexdunnjpl
Contributor

Roger that - thanks Jordan!

@jordanpadams
Member

jordanpadams commented Feb 20, 2024

@alexdunnjpl per my comment on our Sprint Tag, it doesn't look like there is a way for me to look this up. I can only see from the screenshot above that something is out there.

Is there any way we can just develop a generic sweeper that goes through the fields and checks that their types match what we have in the schema?

@alexdunnjpl
Contributor

@jordanpadams sure can! Do you just want detection/reporting, or automatic sync and re-indexing as well?

What should the source of truth be for mapping fields onto their correct opensearch index mapping type?

@jordanpadams
Member

jordanpadams commented Feb 20, 2024

@alexdunnjpl the source of truth is the JSON files online (e.g. this one). See Harvest for how it retrieves those files.

I'm thinking this ticket may actually be 2 parts:

  1. Functionality to output a report
  2. Functionality to force update the field types using registry manager and the JSON schema needed to update those fields?
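
For part 1, the index side of the report could be as simple as dumping every field's mapped type for comparison against the data dictionary JSON (a sketch; the index name registry, the endpoint, and the jq filter are assumptions):

curl -s -u <user> 'https://<node-endpoint>/registry/_mapping' | jq -r '.registry.mappings.properties | to_entries[] | select(.value.type != null) | [.key, .value.type] | @tsv'

That would give one field name and OpenSearch type per line; mismatches against the types declared in the JSON data dictionary would then feed part 2.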

@jordanpadams
Member

@alexdunnjpl @tloubrieu-jpl have we figured out what is going on here? are we good to close this out or is there still work to be done?

@alexdunnjpl
Contributor

@jordanpadams implementation of your two items in the previous comment is still outstanding.

We can close this ticket iff tickets exist for those two items, otherwise this should stay open. Probably best to split into new tickets since 1 will take much less time than 2.

@alexdunnjpl
Contributor

@jordanpadams the registry-mgr dd family of commands is totally orthogonal to this functionality, yeah?

@jordanpadams
Member

@alexdunnjpl correct. I don't think this is really a registry-mgr or harvest thing to date. They both work with LDDs and try to strongly type things in the schema, but neither deals specifically with manipulating or reviewing the schema after the fact.

@jordanpadams
Member

Created NASA-PDS/registry-sweepers#128 and NASA-PDS/registry-sweepers#129. Closing this one.

@github-project-automation github-project-automation bot moved this from ToDo to 🏁 Done in EN Portfolio Backlog Aug 3, 2024
@jordanpadams jordanpadams added the duplicate (This issue or pull request already exists) label and removed the sprint-backlog label Aug 3, 2024