Title disambiguation #11
I'll take on this issue.
My approach will be to use a Jupyter notebook to explore possible methods. Other contributors could publish their new strategies there. Once a method is chosen, we can create a final script to be imported into the library. I am creating a notebook now, after having dealt with some issues with the Google BigQuery client.
Hello @gg4u, welcome on board! We've been thinking about an exotic strategy for this question. The idea would be the following:
In brief, there are keys (the
NB: Although 4. might seem tedious, we have been investigating an efficient technical solution using label-studio and have had great support from the community. See here for their guidelines. Importantly, this would provide high-quality disambiguation. If you could investigate this idea, that would be great. In any case, any idea is most welcome! Thanks
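If the label-studio route is taken, the manual tasks could be prepared as a JSON import file. Below is a minimal sketch assuming Label Studio's task-import format (a list of objects with a `data` key); the field names `raw_title` and `candidates` are my own, not the project's:

```python
import json

# Hypothetical ambiguous raw strings mapped to their candidate journal titles.
ambiguous = {
    "Ibm Tdb": ["IBM Technical Disclosure Bulletin"],
    "Ibm Tech-Nical Disclosure Bulletin": ["IBM Technical Disclosure Bulletin"],
}

def build_tasks(ambiguous):
    """Turn {raw_string: [candidate titles]} into Label Studio import tasks."""
    return [
        {"data": {"raw_title": raw, "candidates": candidates}}
        for raw, candidates in sorted(ambiguous.items())
    ]

tasks = build_tasks(ambiguous)
print(json.dumps(tasks, indent=2))
```

The labeller would then only have to tick the matching candidate per task, rather than type the clean title by hand.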
Hi @cverluise, thank you. I started a Jupyter notebook before your note. Please help me clarify a few things. I looked at the BigQuery DB.

1. What is a — is it an arbitrary ID of a record? E.g., could it be a hash key of a record for the query above?
2. The task is entity matching. Any particular reason for not choosing Wikidata? I looked at the Google Knowledge Graph (KG) reference. As an example, I queried a search engine for: By contrast, on Wikidata I can find the two distinct entities: https://www.wikidata.org/wiki/Q15760627?wprov=srpw1_0 and https://www.wikidata.org/wiki/Q15753899?wprov=srpw1_0. In my opinion Wikidata seems a good choice (Google KG is based on Wikidata anyway, although I don't know why the query above reconciles differently). What shall we do with them?
3. In the query above I get dirty results: together with a journal title, I also get part of an article title. I can handle that, but I wonder what people using label-studio will do with it. Shall they select from a list of suggested labels (journal titles) for each item? Please let me know how people should work with it; that will help me model the output.
4. So far I have created a matrix like the table you suggested, of N x N items.

To update you on what I am working on: the approach is endogenous (the only information comes from the string itself; I don't know what it represents). It might be expanded with I am exploring different approaches with tokenizers, n-grams and collations, and will select the ones offering the best results; sharing ideas on the points above (see point 3) will help. Having a list of final labels might also help, since I could train against them. For one entity-matching job (granular food ingredients, about 10^4 raw distinct records reduced to 10^3), I ended up creating an interface in Wolfram Mathematica for a fine-grained overview of the automatic clustering; it would be useful to know how the community will interact here (see point 4).
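As a concrete starting point for the n-gram exploration mentioned above, here is a minimal stdlib-only sketch (function names and thresholds are mine) of the N x N proximity matrix, using character-trigram Jaccard similarity:

```python
def ngrams(s, n=3):
    """Set of character n-grams of a lightly normalized string."""
    s = " ".join(s.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between the two strings' n-gram sets."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

titles = [
    "IBM Technical Disclosure Bulletin",
    "Ibm Tchnical Disclosure Bulletin",   # typo variant
    "Journal of Applied Physics",         # unrelated journal
]
# N x N proximity matrix, as in the table discussed above.
matrix = [[jaccard(a, b) for b in titles] for a in titles]
```

Typo variants score high against the canonical form while unrelated journals score near zero, which gives a first cut for clustering before any manual pass.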
Hello, thanks a lot for your feedback!

On 1. The schema of the table is detailed in the "Schema" pane on GBQ (https://console.cloud.google.com/bigquery?project=npl-parsing&p=npl-parsing&d=patcit&t=v02_npl&page=table). It is also detailed here: https://grobid.readthedocs.io/en/latest/training/Bibliographical-references/

On 2. At this point there is no specific reason to favor the Knowledge Graph API. So, if you find that Wikidata can do better, feel free to experiment!

On 3. That's why I suggest to:

In any case, we cannot fully rely on the output of the Wikidata/Knowledge Graph/... APIs.

On 4.

No, the labeller will just accept the dirty result as long as it actually contains the info for the journal title. The idea is that, from the API, you will have, let's say, 3 keys per ambiguous

We don't have that because patent-to-"science" citations are not standard. For example, the IBM Technical Disclosure Bulletin is absent from major academic databases. By the way, since you should keep only the k (~3) most relevant API entity outputs, you should end up with a dict with "number of distinct title_j" keys x k values. NB: the keys here will eventually be the "candidate values", and the values the "unique keys", of the labeling task. The final clustering task would be high quality and will be done as much as possible by hand; that's why an efficient labeling environment is crucial. Hope it helps. If it is still unclear, we can have a call if you want. Thanks, you are doing great work! Cheers!
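The "distinct titles x k candidate entities" structure described above could be sketched as follows. The scoring function is a stand-in: in practice it would query an entity API (e.g. Wikidata's `wbsearchentities` endpoint or the Knowledge Graph Search API), which is not done here, and the toy scores are invented for illustration:

```python
from heapq import nlargest

def candidate_entities(title):
    """Stand-in for an API lookup returning (entity, score) pairs.
    A real version would query Wikidata/Knowledge Graph and score matches."""
    toy_index = {
        "Ibm Tdb": [("IBM Technical Disclosure Bulletin", 0.9),
                    ("IBM Journal of Research", 0.4),
                    ("IBM Corp", 0.3),
                    ("Technical Bulletin", 0.2)],
    }
    return toy_index.get(title, [])

def top_k_candidates(titles, k=3):
    """Map each distinct ambiguous title to its k best candidate entities.
    These candidates become the choices offered in the labeling task."""
    return {
        t: [e for e, _ in nlargest(k, candidate_entities(t), key=lambda p: p[1])]
        for t in titles
    }

candidates = top_k_candidates(["Ibm Tdb"], k=3)
```

Feeding this dict into the labeling environment keeps the human task down to picking among k pre-ranked options per title.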
Hi, thanks, I had overlooked that. OK, I will look at Wikidata or the Google Knowledge Graph. Interesting. For an entity-matching job at my former startup, I had to match raw ingredients described by people (things like "little chunk tomatoes" and "pomodorini") to the corresponding ingredients in nutritional datasets. Complexity: 10^4 raw ingredients mapped to 10^3 meta-ingredients. I dealt with the task by combining NLP with a GUI built in Wolfram Mathematica, which dynamically selects a group of matching items so that a person can manually tick which suggestions match the queried string. You might want to replicate something similar (@niklub, if of interest to you too), for I found that approach very useful and effective (and also boring if you are the only one doing the fine-graining; with a community the process goes fast). OK.
so. There is a
Mm, here I think we might have longer outputs. This approach worked reasonably well.
Please give me an example. I raise this one:
In the example above, I compute a proximity between the labels. We humans know that A few options: a) b) c) Please let me know what you think and whether we understand each other on these points. Question:
Or could you share a table? (WeTransfer may work, but since we are working with BigQuery it would also be nice to just use that :) )
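For the label proximity discussed above, a quick stdlib baseline (the normalization choices are my own) is `difflib.SequenceMatcher`:

```python
from difflib import SequenceMatcher

def proximity(a, b):
    """Similarity ratio in [0, 1] between two labels, after light normalization."""
    norm = lambda s: " ".join(s.lower().replace("-", "").split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

# A typo variant scores high; an unrelated journal scores much lower.
close = proximity("IBM Technical Disclosure Bulletin",
                  "Ibm Tchnical Disclosure Bulletin")
far = proximity("IBM Technical Disclosure Bulletin",
                "Journal of Applied Physics")
```

A threshold on this ratio can pre-cluster obvious variants, leaving only the ambiguous middle band for the labeling environment.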
Yes, I wrote you an email proposing to have a call and share some thoughts. Glad to connect!
Hello @gg4u, hope you are doing well. Any news to share? Cheers
Could help a lot for this kind of task:
A given title (in `title_j`, `title_m`) can appear under different forms in the database. This might be due to typos (e.g. Ibm Tchnical Disclosure Bulletin), abbreviations (Ibm Tdb), parsing errors (Ibm Tech-Nical Disclosure Bulletin, Ibm Corp), etc. Example ⬇️

Feature description

Title variables are useful to many use-cases. A clean and transparent disambiguation would definitely be a strong plus (`title_j` for the same `ISSN`/`ISSNe`, #6).
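The variant classes listed above (case, parser hyphenation, stray punctuation) suggest that a cheap normalization pass can already collapse some forms before any API lookup or labeling step. A minimal sketch; the normalization rules are my own illustration, not the project's:

```python
import re

def normalize(title):
    """Collapse case, parser hyphenation and punctuation noise."""
    t = title.lower()
    t = t.replace("-", "")          # parsing artifact: "Tech-Nical" -> "technical"
    t = re.sub(r"[^\w\s]", " ", t)  # drop remaining punctuation
    return " ".join(t.split())      # squeeze whitespace

variants = [
    "IBM Technical Disclosure Bulletin",
    "Ibm Tech-Nical Disclosure Bulletin",
    "ibm  technical disclosure bulletin",
]
# All three forms collapse to a single key.
keys = {normalize(v) for v in variants}
```

Genuine typos like "Tchnical" and abbreviations like "Tdb" survive this pass, of course; those are exactly the cases left for fuzzy matching and the manual labeling step.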