Merging 14 Ontologies (huge merge) #403

Open
OliverHex opened this issue Feb 5, 2024 · 6 comments

@OliverHex

OliverHex commented Feb 5, 2024

Hello,

I am trying to merge 14 ontologies at once with Boomer : DERMO, DO, HUGO, ICDO, IDO, IEDB, MESH, MFOMD, MPATH, NCIT, OBI, OGMS, ORPHANET and SCDO.

This is how I proceed :

  • I compute the 91 LOGMAP alignments between every pair of ontologies (i.e. 91 = n(n-1)/2 with n=14)
  • I convert and merge these alignments into a single ptable (Boomer format)
  • I join all these ontologies into a single "union" OWL file (622K classes ~ 2.5 GB)
  • I launch Boomer on the union OWL file and the single ptable (54K entries ~ 7 MB).
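The first two steps above can be sketched as follows. This is only an illustration, not the actual conversion script: it assumes each ptable row is a tuple `(term_a, term_b, p_sub, p_super, p_equiv, p_none)`, and the helper names `ontology_pairs` and `merge_ptables` are hypothetical. When the same term pair appears in several alignments, the sketch simply keeps the first row it sees, which is one possible policy among others.

```python
from itertools import combinations

ONTOLOGIES = ["DERMO", "DO", "HUGO", "ICDO", "IDO", "IEDB", "MESH",
              "MFOMD", "MPATH", "NCIT", "OBI", "OGMS", "ORPHANET", "SCDO"]


def ontology_pairs(names):
    """Return every unordered pair of ontologies: n(n-1)/2 pairs for n names."""
    return list(combinations(names, 2))


def merge_ptables(tables):
    """Merge per-pair ptables into one, keeping one row per unordered term pair.

    Assumed row layout: (term_a, term_b, p_sub, p_super, p_equiv, p_none).
    If a term pair occurs more than once, the first row seen is kept.
    """
    merged = {}
    for rows in tables:
        for row in rows:
            key = frozenset((row[0], row[1]))
            merged.setdefault(key, row)
    return list(merged.values())


pairs = ontology_pairs(ONTOLOGIES)
print(len(pairs))  # 14 * 13 / 2 = 91
```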

I have run various tests and it seems that when the ptable is too large, the problem becomes intractable.

After removing MESH and NCIT (leaving 12 ontologies to merge), the resulting union ontology has only 81K classes (242 MB) and the ptable contains only 7K entries. In this case, Boomer finishes with a result in 30 min (on an i7 at 1.90 GHz with 32 GB RAM).

But I also need the MESH and the NCIT ontologies to be included in my merge result.

Overall, I am wondering if that's the correct way to proceed ?

Here follow some questions :

  1. Should I continue with this strategy ?
    -> Should I keep trying to merge all at once ? In order to give Boomer complete decision power on selecting the best mappings (without introducing any bias)...

  2. Or should I change my merging strategy ?
    -> Should I split the problem into smaller sub-problems
    -> Then organize them in some order (according to some criteria) : this could introduce some bias...
    -> And launch Boomer following this order.

    For example, I could try this :
    - I convert the 91 alignments into 91 ptables (instead of converting and merging them into 1 single ptable)
    - For each of the 91 ptables
    ----> I launch Boomer with this ptable and the union OWL file.
    ----> In the union OWL file, I add all the equivalence axioms generated by Boomer for this ptable.

    So far, it seems to work much faster.
    But the problem is that the arbitrary order of the for-loop introduces a bias, since each equivalence axiom added at one step influences Boomer's results in the following steps.
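
One way to make the order explicit rather than arbitrary is sketched below. It does not remove the bias; it only makes the order reproducible and tied to a stated criterion. Everything here is an assumption for illustration: the row layout `(term_a, term_b, p_sub, p_super, p_equiv, p_none)`, the scoring heuristic (most confident ptables first), and the `solve` callable, which is a hypothetical stand-in for one Boomer run and not part of Boomer.

```python
from typing import Callable


def mean_top_probability(rows):
    """Average of each row's largest probability: a crude confidence score."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(max(row[2:6]) for row in rows) / len(rows)


def merge_incrementally(ptables: dict, solve: Callable, score=mean_top_probability):
    """Process ptables in a fixed, score-based order, feeding results forward.

    ptables: maps a label such as "DO-NCIT" to its list of rows.
    solve:   hypothetical callable (rows, accepted_axioms) -> new axioms,
             standing in for one Boomer run on the union ontology.
    """
    accepted = []
    # Highest score first; the label breaks ties so the order is deterministic.
    order = sorted(ptables, key=lambda name: (-score(ptables[name]), name))
    for name in order:
        accepted.extend(solve(ptables[name], list(accepted)))
    return order, accepted
```

Running the same loop with two or three different `score` functions and comparing the accepted axioms would give a rough idea of how sensitive the result is to the order.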

Any suggestions ?

Oliver

PS: I couldn't attach the Boomer input union ontology (compressed, ~140 MB) since the maximum attachment size is 25 MB. However, the input ptable is here: ptable-91-mappings.zip.

@balhoff
Member

balhoff commented Feb 15, 2024

Hi @OliverHex, sorry — I was busy last week then was out sick. Doing a really large mapping like this will take some experimentation. I suspect you may have to do it incrementally as you suggest. I think you're pushing the boundaries of what we've applied boomer to so far! @matentzn might have some insights, but I think he has hit some of the same issues (and may have worked with some of the same ontologies). Sorry I haven't been more helpful so far; recently I haven't had too much time to work deeply on boomer.

@matentzn

I hit many of the same limits @OliverHex. Unfortunately I had to shelve my work on this for the time being, despite it being such a super high priority. I think the best workflow is actually to re-imagine boomer as a curation tool rather than a mass alignment tool:

[image]

So basically, you align, and use the low priority cliques to find issues, fix the input alignment and iterate.

But the problem of aligning so many conflicting ontologies remains. In my view, even if we figure out the scaling issue, this problem cannot be solved properly right now unless we can first encode the subclassOf edges in the input as probabilistic statements (there are so many conflicts around disease ontologies).

Please feel free to keep us posted - I unfortunately do not have a good solution for you right now.

@OliverHex
Author

Thank you very much for your answers. I am starting to understand the strengths and limits of Boomer.

Here is an update :

  • I have scaled my server up to 128 GB RAM and 64 CPU cores (2.44 GHz), and the Boomer run now ends after 1 hour for the 14 ontologies including MESH and NCIT (no more "out of memory" problems).
  • Some stats on the execution (see the attached log file):
    • The number of mapping cliques found by Boomer is 22665.
    • If (as I understand it) the number printed on each line is the size of a clique, there is one huge clique that contains 10075 elements. Most cliques have fewer than 10.
  • My problem now is that Boomer doesn't produce any output.
  • The execution ends with error messages :
    • "No possible resolution of perplexity"
    • and "No configuration of the uncertainties can be added to the ontology"

What do these error messages mean?

I have launched Boomer with these parameters :
- window-count : 1
- runs : 100
- exhaustive-search-limit : 14

What do these parameters mean ?
Is there a way to get boomer to produce some output by changing these parameters ?

In your answer, you say that Boomer is used as a curation tool by focusing on the low probability cliques to curate the input mappings.

This is very interesting.

Is there any documentation or wiki that explains the methodology for using Boomer as a curation tool?
How can I find the contents of the cliques (the IRIs of the entities in each clique)? I only get the clique sizes in the console output.
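
As a rough way to inspect clique contents outside Boomer, the sketch below groups the terms of a ptable into connected components with a small union-find. It rests on assumptions: that the ptable is a tab-separated file whose first two columns are the mapped terms, and that a clique corresponds to a connected component of the mapping graph. Boomer may define cliques differently (for example by also using axioms from the ontology), so the sizes need not match the log exactly. The function names are hypothetical.

```python
import csv
from collections import defaultdict


def read_term_pairs(path):
    """Read the first two columns (the mapped terms) of a tab-separated ptable."""
    with open(path, newline="") as handle:
        return [tuple(row[:2]) for row in csv.reader(handle, delimiter="\t")
                if len(row) >= 2]


def mapping_components(pairs):
    """Group terms into connected components of the mapping graph, largest first."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for term_a, term_b in pairs:
        root_a, root_b = find(term_a), find(term_b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for term in list(parent):
        groups[find(term)].add(term)
    return sorted(groups.values(), key=len, reverse=True)
```

Calling `mapping_components(read_term_pairs("ptable.tsv"))` (with a placeholder file name) and printing the first component would list the terms of the largest group, which is a natural starting point for curating the input mappings.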

Oliver
Log file : log.txt

@matentzn

Hey @OliverHex What you are doing is of great importance to me as well. If you like, add me on LinkedIn or send me an email here: https://github.com/monarch-initiative/pheval/blob/a685b171344cedf0f6ab37962fd8e6da36faa575/pyproject.toml#L7 (just a random place I found where my email was published - GitHub hides these), and we can set up a call to see if we can join forces.

@matentzn

matentzn commented Apr 1, 2024

@OliverHex just following up - are you still working on this? Interested to push the envelope a bit together?

@OliverHex
Author

Hello,

Sorry for replying so late...
Yes, sure, I am interested in further exploring ontology alignment and Bayesian merging!
But at the moment, I am working on something else.
I might switch back to ontology alignment and Bayesian merging in a few weeks.
I will keep you updated, thanks for asking!

Oliver

3 participants