Title disambiguation #11
I'll take on this issue.
My approach will be to use a Jupyter notebook to explore possible methods. Other contributors could publish their new strategies there. Once a method is chosen, we can create a final script to be imported into the library. I am creating a notebook now, after having dealt with some issues with the Google BigQuery client.
Hello @gg4u, welcome on board! We've been thinking about an exotic strategy for this question. The idea would be the following:
In brief, there are keys (the
NB: Although 4. might seem tedious, we have been investigating an efficient technical solution using label-studio and have had great support from the community. See here for their guidelines. Importantly, this would provide high-quality disambiguation. If you could investigate this idea, that would be great. In any case, any idea is most welcome! Thanks
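If the label-studio route is taken, the manual tasks could be prepared as a JSON import file. Below is a minimal sketch assuming Label Studio's task-import format (a list of objects with a `data` key); the field names `raw_title` and `candidates` are my own, not the project's:

```python
import json

# Hypothetical ambiguous raw strings mapped to their candidate journal titles.
ambiguous = {
    "Ibm Tdb": ["IBM Technical Disclosure Bulletin"],
    "Ibm Tech-Nical Disclosure Bulletin": ["IBM Technical Disclosure Bulletin"],
}

def build_tasks(ambiguous):
    """Turn {raw_string: [candidate titles]} into Label Studio import tasks."""
    return [
        {"data": {"raw_title": raw, "candidates": candidates}}
        for raw, candidates in sorted(ambiguous.items())
    ]

tasks = build_tasks(ambiguous)
print(json.dumps(tasks, indent=2))
```

The labeller would then only have to tick the matching candidate per task, rather than type the clean title by hand.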
Hi @cverluise, thank you. I started a Jupyter notebook before your note. Please help me clarify a few things. I looked at the BigQuery DB.

1. What is a — is it an arbitrary ID of a record? E.g., could it be a hash key of a record for the query above?
2. The task is entity matching. Any particular reason for not choosing Wikidata? I looked at the Google Knowledge Graph (KG) reference. As an example, I queried a search engine for: By contrast, on Wikidata I can find the two distinct entities: https://www.wikidata.org/wiki/Q15760627?wprov=srpw1_0 and https://www.wikidata.org/wiki/Q15753899?wprov=srpw1_0. In my opinion Wikidata seems a good choice (Google KG is based on Wikidata anyway, although I don't know why the query above reconciles differently). What shall we do with them?
3. In the query above I get dirty results: together with a journal title, I also get part of an article title. I can handle that, but I wonder what people using label-studio will do with it. Shall they select from a list of suggested labels (journal titles) for each item? Please let me know how people should work with it; that will help me model the output.
4. So far I have created a matrix like the table you suggested, of N x N items.

To update you on what I am working on: the approach is endogenous (the only information comes from the string itself; I don't know what it represents). It might be expanded with I am exploring different approaches with tokenizers, n-grams and collations, and will select the ones offering the best results; sharing ideas on the points above (see point 3) will help. Having a list of final labels might also help, since I could train against them. For one entity-matching job (granular food ingredients, about 10^4 raw distinct records reduced to 10^3), I ended up creating an interface in Wolfram Mathematica for a fine-grained overview of the automatic clustering; it would be useful to know how the community will interact here (see point 4).
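As a concrete starting point for the n-gram exploration mentioned above, here is a minimal stdlib-only sketch (function names and thresholds are mine) of the N x N proximity matrix, using character-trigram Jaccard similarity:

```python
def ngrams(s, n=3):
    """Set of character n-grams of a lightly normalized string."""
    s = " ".join(s.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between the two strings' n-gram sets."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

titles = [
    "IBM Technical Disclosure Bulletin",
    "Ibm Tchnical Disclosure Bulletin",   # typo variant
    "Journal of Applied Physics",         # unrelated journal
]
# N x N proximity matrix, as in the table discussed above.
matrix = [[jaccard(a, b) for b in titles] for a in titles]
```

Typo variants score high against the canonical form while unrelated journals score near zero, which gives a first cut for clustering before any manual pass.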
Hello, thanks a lot for your feedback!

On 1. The schema of the table is detailed in the "Schema" pane on GBQ (https://console.cloud.google.com/bigquery?project=npl-parsing&p=npl-parsing&d=patcit&t=v02_npl&page=table). It is also detailed here: https://grobid.readthedocs.io/en/latest/training/Bibliographical-references/

On 2. At this point there is no specific reason to favor the Knowledge Graph API. So, if you find that Wikidata can do better, feel free to experiment!

On 3. That's why I suggest to:

In any case, we cannot fully rely on the output of the Wikidata/Knowledge Graph/... APIs.

On 4.

No, the labeller will just accept the dirty result as long as it actually contains the info for the journal title. The idea is that, from the API, you will have, let's say, 3 keys per ambiguous

We don't have that because patent-to-"science" citations are not standard. For example, the IBM Technical Disclosure Bulletin is absent from major academic databases. By the way, since you should keep only the k (~3) most relevant API entity outputs, you should end up with a dict with "number of distinct title_j" keys x k values. NB: the keys here will eventually be the "candidate values", and the values the "unique keys", of the labeling task. The final clustering task would be high quality and will be done as much as possible by hand; that's why an efficient labeling environment is crucial. Hope it helps. If it is still unclear, we can have a call if you want. Thanks, you are doing great work! Cheers!
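The "distinct titles x k candidate entities" structure described above could be sketched as follows. The scoring function is a stand-in: in practice it would query an entity API (e.g. Wikidata's `wbsearchentities` endpoint or the Knowledge Graph Search API), which is not done here, and the toy scores are invented for illustration:

```python
from heapq import nlargest

def candidate_entities(title):
    """Stand-in for an API lookup returning (entity, score) pairs.
    A real version would query Wikidata/Knowledge Graph and score matches."""
    toy_index = {
        "Ibm Tdb": [("IBM Technical Disclosure Bulletin", 0.9),
                    ("IBM Journal of Research", 0.4),
                    ("IBM Corp", 0.3),
                    ("Technical Bulletin", 0.2)],
    }
    return toy_index.get(title, [])

def top_k_candidates(titles, k=3):
    """Map each distinct ambiguous title to its k best candidate entities.
    These candidates become the choices offered in the labeling task."""
    return {
        t: [e for e, _ in nlargest(k, candidate_entities(t), key=lambda p: p[1])]
        for t in titles
    }

candidates = top_k_candidates(["Ibm Tdb"], k=3)
```

Feeding this dict into the labeling environment keeps the human task down to picking among k pre-ranked options per title.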
Hi, thanks, I had overlooked that. OK, I will look at Wikidata or the Google Knowledge Graph. Interesting. For an entity-matching job at my former startup, I had to match raw ingredients described by people (things like "little chunk tomatoes" and "pomodorini") to the corresponding ingredients in nutritional datasets. Complexity: 10^4 raw ingredients mapped to 10^3 meta-ingredients. I dealt with the task by combining NLP with a GUI built in Wolfram Mathematica, which dynamically selects a group of matching items so that a person can manually tick which suggestions match the queried string. You might want to replicate something similar (@niklub, if of interest to you too), for I found that approach very useful and effective (and also boring if you are the only one doing the fine-graining; with a community the process goes fast). OK.
so. There is a
Mm, here I think we might have longer outputs. This approach worked reasonably well.
Please give me an example. I raise this one:
In the example above, I compute a proximity between the labels. We humans know that A few options: a) b) c) Please let me know what you think and whether we understand each other on these points. Question:
Or could you share a table? (WeTransfer may work, but since we are working with BigQuery it would also be nice to just use that :) )
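For the label proximity discussed above, a quick stdlib baseline (the normalization choices are my own) is `difflib.SequenceMatcher`:

```python
from difflib import SequenceMatcher

def proximity(a, b):
    """Similarity ratio in [0, 1] between two labels, after light normalization."""
    norm = lambda s: " ".join(s.lower().replace("-", "").split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

# A typo variant scores high; an unrelated journal scores much lower.
close = proximity("IBM Technical Disclosure Bulletin",
                  "Ibm Tchnical Disclosure Bulletin")
far = proximity("IBM Technical Disclosure Bulletin",
                "Journal of Applied Physics")
```

A threshold on this ratio can pre-cluster obvious variants, leaving only the ambiguous middle band for the labeling environment.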
Yes, I wrote you an email proposing to have a call and share some thoughts. Glad to connect!
Hello @gg4u, hope you are doing well. Any news to share? Cheers
Could help a lot for this kind of task:
A given title (in `title_j`, `title_m`) can appear under different forms in the database. This might be due to typos (e.g. Ibm Tchnical Disclosure Bulletin), abbreviations (Ibm Tdb), parsing errors (Ibm Tech-Nical Disclosure Bulletin, Ibm Corp), etc. Example ⬇️

Feature description

Title variables are useful to many use-cases. A clean and transparent disambiguation would definitely be a strong plus (`title_j` for the same `ISSN`/`ISSNe`, #6).
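The variant classes listed above (case, parser hyphenation, stray punctuation) suggest that a cheap normalization pass can already collapse some forms before any API lookup or labeling step. A minimal sketch; the normalization rules are my own illustration, not the project's:

```python
import re

def normalize(title):
    """Collapse case, parser hyphenation and punctuation noise."""
    t = title.lower()
    t = t.replace("-", "")          # parsing artifact: "Tech-Nical" -> "technical"
    t = re.sub(r"[^\w\s]", " ", t)  # drop remaining punctuation
    return " ".join(t.split())      # squeeze whitespace

variants = [
    "IBM Technical Disclosure Bulletin",
    "Ibm Tech-Nical Disclosure Bulletin",
    "ibm  technical disclosure bulletin",
]
# All three forms collapse to a single key.
keys = {normalize(v) for v in variants}
```

Genuine typos like "Tchnical" and abbreviations like "Tdb" survive this pass, of course; those are exactly the cases left for fuzzy matching and the manual labeling step.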