-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CTD processing 2: batch-queries #584
Comments
This should be able to be solved by a custom pairCurieWithAPIResponse function, I can work on this in the JQ and/or javascript transformer for CTD |
Here is the pairCurieWithAPIResponse JQ solves this problem. |
It's not clear to me how BTE will construct large batch-queries to CTD, and whether we'll need to make adjustments to BTE. I'm specifically thinking about:
Notes:
|
|
Replying to @tokebe (thanks for the quick reply!) with my thoughts:
|
I think a safe batch-size is 80 IDs, assuming a 2048 character-max for the GET url. Rough calculations
The most crucial number is click to see character num for all input IDs
So the equation for this situation is: Rounding down to the nearest ten gets 80. |
I'm getting JQ-related errors when I try to test the batch-size limit, using the process in the next section.
Recreating the error with a simpler example, not testing the batch-size-limit
Noticed on ci/dev instances, but not test/prod. No overrides, no batch-size-limit-testing adjustments done. TRAPI query:
2/3 subqueries fail with Interestingly, I think those two sub-queries are returning 0 hits: this and this, vs the 3rd sub-query that has hits
recreating the problem with a simple query
Follow the steps in the next section, but don't set the batch-size-limit (step 2 in the next section) Then do the simple query that works in dev without the override:
I'd normally get 134 results, but instead I get 0 results. In the console logs, the sub-query fails with My full process to test the batch-size-limit
2. Adding the batch-size limit to the query-handler's config
To API_BATCH_SIZE, add:
3. Setting an override to use CTD x-bte annotation for batch-querying
I actually override to my local file with the branch checked out, but this should do the same thing. Paste into BTE's smartapi_overrides file, so it'll use this x-bte annotation:
Console log of a sub-query
|
Looks like this is a problem in the JQ string, largely due to CTD's inconsistent response structure depending on if anything was found or not. Working on a fix... |
Ok, turns out this was less CTD's inconsistencies and more JQ's inconsistencies (and my lack of familiarity...). I've pushed a fix to dev which should address this. |
The fix worked! I tested all 3 example queries in my previous post in both dev and main (CI) branches. Everything worked as-intended without any errors. The PRs to deploy are:
|
first two (mydisease, biothings rare-source) are for biothings/biothings_explorer#640 last one (ctd) is for https://github.com/biothings/biothings_explorer/issues/584\#issuecomment-1842376052
Update! I've included the CTD x-bte changes in the overrides biothings/bte-server#4 - so it'll deploy alongside the orphanet changes. I think the override will end up deploying with or after the code changes (JQ / batch-size-limit), so I don't anticipate any issues. (aka I think NodeNorm will deploy the orphanet changes at the same pace or slower than our deployments to instances). |
I think we can close this issue once:
We'll then have a separate process to remove the overrides (not needed once the yaml PRs are all merged / registrations refreshed). |
I double-checked and it's not working on CI, probably because of the larger cache-update issues (recent lab Slack convo) My test
POST to CTD through BTE CI
Based on the logs in the TRAPI response, I can tell that 2 sub-queries were sent (1 ID each). But if batch-querying was working, only 1 sub-query should have been sent. This may mean BTE CI didn't successfully use the override.
|
Issue should now be addressed by 3019cec, please test again |
Now it's working on BTE CI! Yay! The previous test now works as-intended - with 1 planned batch-query. Logs:
I also tested the batch-size-limit=80 with a 150-QNode-IDs query (current max, see #762), and it worked too. Two sub-queries were sent (80 + 70) Batch-size-limit test
POST to CTD through BTE CI Logs show that two sub-queries were sent (80 + 70), so the batch-size-limit of 80 was respected
|
I've confirmed that things work as-expected after the Prod deployment. Closing issue, updating the registered yamls and registrations, and opening another issue for removing the overrides. Example: POST to https://bte.transltr.io/v1/smartapi/0212611d1c670f9107baf00b77f0889a/query, will get a response with results and a log saying
|
Intro: see intro section of #583 (comment). Originally noted in #558 (comment)
2. processing batch-queries correctly
The current x-bte-kgs-operations aren't written as batch-queries, even though the CTD API does allow batch-querying.
The problem is how BTE handles the batch-query responses. The API response is an array of associations (objects) - and each association matched to one of the input IDs. Each association has an "Input" field where the value is the matched input ID (all lowercase, has an ID-prefix for diseases (MESH or OMIM) and pathways (REACT or KEGG)).
However, BTE's default api-response-transform isn't correctly handling this - instead, it's linking the first input ID to every possible output ID.
Example:
Edit SmartAPI and run BTE locally
In a local copy of the SmartAPI yaml, copy-paste the following into the
chemical2gene
operation. It's changing thesupportBatch
andqueryInputs
info.Set up a local instance of BTE to override and use your local copy of the CTD yaml. Then POST to that specific api (v1/smartapi/{id}/query endpoint):
CTD's raw response
During execution, BTE should generate this query with two input IDs to CTD.
In CTD's raw response, some genes are only linked to the second ID D015250 / Aclarubicin, like PARP1.
BTE's current flawed response
BTE links every output gene with only the first ID C006303 / acivicin /
PUBCHEM.COMPOUND:294641
. It's easier to see through the console log:desired format for BTE's response
Instead, BTE should correctly link each input ID / entity with its associations. The console log should look like this:
PUBCHEM.COMPOUND:294641
PUBCHEM.COMPOUND:451415
PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:142
. Most genes are linked to only one of the input IDs.The text was updated successfully, but these errors were encountered: