Closes #119 - Add loctext #515

napsternxg · 2022-04-24T22:53:42Z

Fixes #119

If the following information is NOT present in the issue, please populate:

Name: LocText
Description: https://pubannotation.org/projects/LocText
Paper: https://doi.org/10.1186/s12859-018-2021-9
Data: https://pubannotation.org/projects/LocText

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

The loader downloads the data from the URL and parses the JSON format into bigbio_kb_schema.
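A minimal sketch of the kind of conversion described here, not the loader code from this PR. It assumes a PubAnnotation-style JSON document with `sourceid`, `text`, `denotations` (`id`, `span.begin`, `span.end`, `obj`) and `relations` (`id`, `pred`, `subj`, `obj`), and it assumes that each denotation's `obj` has the form `<db_name>:<db_id>`. The function name `pubannotation_to_kb` and the sample document are hypothetical, and only the `entities` and `relations` parts of the kb schema are produced.

```python
def pubannotation_to_kb(doc):
    """Convert one PubAnnotation-style dict into kb-like entities and relations.

    Assumption: doc["denotations"][i]["obj"] looks like "<db_name>:<db_id>".
    """
    doc_id = str(doc["sourceid"])
    text = doc["text"]

    entities = []
    for den in doc.get("denotations", []):
        begin, end = den["span"]["begin"], den["span"]["end"]
        # Split on the first colon only, so "go:GO:0005634" keeps "GO:0005634".
        db_name, _, db_id = den["obj"].partition(":")
        entities.append(
            {
                # Prefix with the document ID so IDs are unique across documents.
                "id": f"{doc_id}-{den['id']}",
                "type": db_name,
                "text": [text[begin:end]],
                "offsets": [[begin, end]],
                "normalized": [{"db_name": db_name, "db_id": db_id}],
            }
        )

    relations = [
        {
            "id": f"{doc_id}-{rel['id']}",
            "type": rel["pred"],
            "arg1_id": f"{doc_id}-{rel['subj']}",
            "arg2_id": f"{doc_id}-{rel['obj']}",
            "normalized": [],
        }
        for rel in doc.get("relations", [])
    ]
    return {"document_id": doc_id, "entities": entities, "relations": relations}


# Hypothetical sample document, invented for illustration only.
sample_doc = {
    "sourceid": "10072396",
    "text": "COP1 is nuclear.",
    "denotations": [
        {"id": "T1", "span": {"begin": 0, "end": 4}, "obj": "uniprot:P43254"},
        {"id": "T2", "span": {"begin": 8, "end": 15}, "obj": "go:GO:0005634"},
    ],
    "relations": [{"id": "R1", "pred": "localizeTo", "subj": "T1", "obj": "T2"}],
}

example = pubannotation_to_kb(sample_doc)
```

Prefixing every entity and relation ID with the document ID keeps the IDs unique across the whole split, which is what lets relation arguments be resolved unambiguously.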

hakunanatasha

@napsternxg
This is almost ready to merge. Would you mind converting the relation references to entity names in the relation view (e.g. T18 -> cell wall)? Otherwise the mapping will not be trivial to construct.

napsternxg · 2022-04-30T04:17:03Z

Hi @hakunanatasha, thanks. I will finish this and send it by early next week.

napsternxg · 2022-05-05T07:03:32Z

Hi @hakunanatasha, I have now made the relation arguments map to entity IDs so that they can be resolved uniquely. This is similar to the format used in ddi_corpus.

data["train"]["entities"][0][:5]
data["train"]["relations"][0][:5]

This shows the following entities:

[{'id': '10072396-T1',
  'type': 'go',
  'text': ['nuclear'],
  'offsets': [[46, 53]],
  'normalized': [{'db_name': 'go', 'db_id': 'GO:0005634'}]},
 {'id': '10072396-T2',
  'type': 'go',
  'text': ['cytoplasmic'],
  'offsets': [[58, 69]],
  'normalized': [{'db_name': 'go', 'db_id': 'GO:0005737'}]},
 {'id': '10072396-T3',
  'type': 'taxonomy',
  'text': ['Arabidopsis'],
  'offsets': [[86, 97]],
  'normalized': [{'db_name': 'taxonomy', 'db_id': '3702'}]},
 {'id': '10072396-T4',
  'type': 'uniprot',
  'text': ['COP1'],
  'offsets': [[98, 102]],
  'normalized': [{'db_name': 'uniprot', 'db_id': 'P43254'}]},
 {'id': '10072396-T5',
  'type': 'taxonomy',
  'text': ['Arabidopsis'],
  'offsets': [[108, 119]],
  'normalized': [{'db_name': 'taxonomy', 'db_id': '3702'}]}]

And the following relations:

[{'id': '10072396-R1',
  'type': 'localizeTo',
  'arg1_id': '10072396-T4',
  'arg2_id': '10072396-T2',
  'normalized': []},
 {'id': '10072396-R10',
  'type': 'localizeTo',
  'arg1_id': '10072396-T29',
  'arg2_id': '10072396-T28',
  'normalized': []},
 {'id': '10072396-R2',
  'type': 'localizeTo',
  'arg1_id': '10072396-T4',
  'arg2_id': '10072396-T1',
  'normalized': []},
 {'id': '10072396-R3',
  'type': 'localizeTo',
  'arg1_id': '10072396-T9',
  'arg2_id': '10072396-T11',
  'normalized': []},
 {'id': '10072396-R4',
  'type': 'localizeTo',
  'arg1_id': '10072396-T9',
  'arg2_id': '10072396-T10',
  'normalized': []}]
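For readers who want the entity names back in the relation view, here is a small sketch of resolving the `arg1_id`/`arg2_id` values against the entity list. The records are a subset copied from the output above; the helper name `resolve_relations` is hypothetical and not part of the dataloader.

```python
# Subset of the entities and relations shown above (fields trimmed).
entities = [
    {"id": "10072396-T1", "type": "go", "text": ["nuclear"]},
    {"id": "10072396-T2", "type": "go", "text": ["cytoplasmic"]},
    {"id": "10072396-T4", "type": "uniprot", "text": ["COP1"]},
]
relations = [
    {"id": "10072396-R1", "type": "localizeTo",
     "arg1_id": "10072396-T4", "arg2_id": "10072396-T2"},
    {"id": "10072396-R2", "type": "localizeTo",
     "arg1_id": "10072396-T4", "arg2_id": "10072396-T1"},
]


def resolve_relations(entities, relations):
    """Return (arg1 text, relation type, arg2 text) triples."""
    by_id = {entity["id"]: entity for entity in entities}
    triples = []
    for rel in relations:
        arg1 = by_id.get(rel["arg1_id"])
        arg2 = by_id.get(rel["arg2_id"])
        if arg1 is None or arg2 is None:
            # Skip relations whose arguments are not in the entity list.
            continue
        triples.append((" ".join(arg1["text"]), rel["type"], " ".join(arg2["text"])))
    return triples


resolved = resolve_relations(entities, relations)
for triple in resolved:
    print(triple)
```

Because the IDs carry the document prefix, a single dictionary lookup is enough to recover the entity text for each relation argument.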

napsternxg · 2022-05-14T05:42:14Z

@hakunanatasha, can you approve the PR? I have already addressed the requested changes.

mariosaenger · 2024-09-11T12:45:32Z

The dataset no longer seems to be available :-(

napsternxg and others added 2 commits April 11, 2022 16:23

Fixes bigscience-workshop#119 - Add loctext

f76705b

Working code for LocText dataset.

d125818

Load from URL and parse it via the JSON format into bigbio_kb_schema.

napsternxg requested review from hakunanatasha, jason-fries, sunnnymskang, ruisi-su, galtay, leonweber, sg-wbi and debajyotidatta as code owners April 24, 2022 22:53

napsternxg mentioned this pull request Apr 24, 2022

Create dataset loader for LocText #119

Open

hakunanatasha requested changes Apr 27, 2022

hakunanatasha self-assigned this Apr 27, 2022

Fixed relations arg ids

acd8335

napsternxg requested a review from hakunanatasha May 9, 2022 07:15

sg-wbi changed the title ~~Fixes #119 - Add loctext~~ Closes #119 - Add loctext May 9, 2022

mariosaenger assigned mariosaenger and unassigned hakunanatasha Aug 3, 2024

mariosaenger closed this Sep 11, 2024
