processing :: in the meta-data #145

Open
jheinecke opened this issue Aug 5, 2024 · 5 comments

Comments

@jheinecke
AMR files usually start with an ID and the sentence, followed by the actual PENMAN graph:

# ::id any-ID-001.1
# ::snt the cat is sleeping
# ::save-date Sat Jul 20, 2024 ::file test_0001_2.txt
( s / sleep-01
   :ARG0 ( c / cat))

The penman library parses this without any problem and provides it in the metadata dictionary. Multiple ::keys on one line are parsed correctly.

However, I came across sentences which themselves contain ::

# ::snt this must be separated  using :: unless it is a single line
...

Unfortunately, the penman library cuts the sentence at the :: and creates a metadata entry with a space as key.
For other comment lines, having multiple keys is OK, but for the line containing ::snt it rules out sentences that contain ::. Could this be changed?

@goodmami
Owner

goodmami commented Aug 7, 2024

This part of parsing is separate from that of parsing the PENMAN notation and I don't have any real grammar defined, so there is likely room for improvement:

penman/penman/_parse.py

Lines 87 to 99 in 1f52cbc

def _parse_comments(tokens: TokenIterator):
    """
    Parse PENMAN comments from *tokens* and return any metadata.
    """
    metadata = {}
    while tokens.peek().type == 'COMMENT':
        comment = tokens.next().text
        while comment:
            comment, found, meta = comment.rpartition('::')
            if found:
                key, _, value = meta.partition(' ')
                metadata[key] = value.rstrip()
    return metadata
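
To make the behavior concrete, here is a minimal standalone sketch of the same rpartition loop applied to a single comment string rather than a token stream. The helper name `parse_comment_line` is hypothetical and not part of penman; it only mirrors the inner loop of the function quoted above.

```python
from typing import Dict


def parse_comment_line(comment: str) -> Dict[str, str]:
    """Mirror the inner loop of _parse_comments for one comment string.

    Hypothetical helper for illustration only; not part of penman.
    """
    metadata: Dict[str, str] = {}
    while comment:
        # Split at the LAST '::' and work backwards through the line.
        comment, found, meta = comment.rpartition('::')
        if found:
            # The first word after '::' is the key, the rest is the value.
            key, _, value = meta.partition(' ')
            metadata[key] = value.rstrip()
    return metadata


# Two keys on one line: both are recovered.
ok = parse_comment_line(
    '# ::save-date Sat Jul 20, 2024 ::file test_0001_2.txt')

# A literal '::' inside the sentence is treated as the start of a new key.
broken = parse_comment_line(
    '# ::snt this must be separated using :: unless it is a single line')
```

In this sketch the text after the literal :: ends up under an empty-string key, and the `snt` value is cut off just before it.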

For other comment lines, having multiple keys is OK, but for the line containing ::snt it rules out sentences that contain ::

What I'm hearing is that you think ::snt is a special case that must appear on its own comment line, or at least as the last metadata key on a line. Is that correct?

While I'd agree that it's unfortunate that :: can't currently appear as a literal in a metadata value, can you link to any docs or code showing that ::snt is indeed treated specially?

@jheinecke
Author

Hi,
Yes, your paraphrase is correct, and no, I have no link that specifies whether or not ::snt is a special case. I just stumbled across this when annotating a sentence with a badly formed smiley which happened to contain ::. I agree it's rare.
Personally I'd prefer a single ::key per comment line, with no special cases, but this would mean reformatting the LDC data, where multiple keys happen to be on the same comment line.
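
As a rough idea of what such a reformatting could look like, the following sketch splits a multi-key comment line into one line per key. The function name `split_keys` is hypothetical, and the sketch assumes that no value on the line contains a literal ::, so it is only suitable for header lines such as the ::id or ::save-date lines, not for ::snt.

```python
from typing import List


def split_keys(comment: str) -> List[str]:
    """Rewrite one multi-key comment line as one comment line per key.

    Hypothetical helper; assumes no value contains a literal '::'.
    """
    # Drop the leading '#' and surrounding whitespace.
    body = comment.lstrip('#').strip()
    # Every non-empty chunk between '::' markers is "key value".
    parts = [part.strip() for part in body.split('::') if part.strip()]
    return ['# ::' + part for part in parts]


lines = split_keys('# ::save-date Sat Jul 20, 2024 ::file test_0001_2.txt')
```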

@bact
Contributor

bact commented Aug 7, 2024

Is it possible to escape the :?

For example, if ::snt is intended, we may have:

# ::snt this must be separated using \:\: unless it is a single line

But this also means more work to unescape the strings before further processing.
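
A minimal sketch of that round trip, assuming the escaping is done by the user and not by penman: `escape_value`, `unescape_value`, and `parse_comment_line` are hypothetical names, and `parse_comment_line` only imitates splitting a comment at :: markers.

```python
from typing import Dict


def escape_value(value: str) -> str:
    """Put a backslash before every colon so no '::' remains in the value."""
    return value.replace(':', '\\:')


def unescape_value(value: str) -> str:
    """Undo escape_value by removing the backslash before each colon."""
    return value.replace('\\:', ':')


def parse_comment_line(comment: str) -> Dict[str, str]:
    """Hypothetical stand-in for a parser that splits at every '::'."""
    metadata: Dict[str, str] = {}
    while comment:
        comment, found, meta = comment.rpartition('::')
        if found:
            key, _, value = meta.partition(' ')
            metadata[key] = value.rstrip()
    return metadata


sentence = 'this must be separated using :: unless it is a single line'
line = '# ::snt ' + escape_value(sentence)

# The unescaping step is the extra work: it has to happen after parsing.
parsed = {key: unescape_value(value)
          for key, value in parse_comment_line(line).items()}
```

Escaping each colon individually, rather than only the pair, is a design choice in this sketch: it also covers runs of three or more colons, at the cost of touching single colons such as those in timestamps.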

Having a single ::key per comment line is probably more maintainable.

@jheinecke
Author

I have just found an indirect hint in the AMR 3.0 documentation that ::snt is on a line of its own (cited from amr_annotation_3.0/docs/README.txt of the LDC2020T02 dataset; emphasis mine):

2.3 Structure and content of individual AMRs

Each AMR-sentence pair in the ./data/amrs files comprises the
following data and fields:

  • Header line containing a unique workset-sentence ID for the source
    string that has been AMR annotated (::id), a completion timestamp
    for the AMR (::date), an anonymized ID for the annotator who
    produced the AMR (::annotator), and a marker for the AMRs of
    dually-annotated sentences indicating whether the AMR is the
    preferred representation for the sentence (::preferred)

  • Header line containing the English source sentence that has been
    AMR annotated (::snt)

  • Header line indicating the date on which the AMR was last saved
    (::save-date), and the file name for the AMR-sentence pair
    (::file)

  • Graph containing the manually generated AMR tree for the source
    sentence (see the AMR guidelines for a full description of the
    structure and semantics of AMR graphs).

In the LDC data, ::save-date and ::file occur on the same line, as do ::id, ::date and ::annotator. For instance:

# ::id wiki-minicorpus-a_0001.2 ::date 2017-10-17T03:05:10 ::annotator SDL-AMR-09 ::preferred
# ::snt Like all pitcher plants, it is carnivorous and uses its nectar to attract insects that drown in the pitcher and are digested by the plant.
# ::save-date Sat Jan 20, 2018 ::file wiki-minicorpus-a_0001_2.txt
(a / and
....

If we read the documentation strictly, then there are at least three comment lines, one of which contains only ::snt 😆

@goodmami
Owner

goodmami commented Aug 7, 2024

@bact Keep in mind that we are not proposing a new format, but working with an existing one. And escaping the : characters does prevent the splitting, but there is no mechanism currently for unescaping them unless you do it yourself.

@jheinecke Thanks for digging up that reference. While it doesn't give explicit parsing instructions, it does hint at the expected format.

I'm thinking of passing in some configuration that indicates which metadata keys are full-line (to help with both parsing and formatting). I'd like to put this information in the AMR model instead of building it into the parser, but currently the code is not set up to handle that, so some more changes would be needed.
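
As a rough sketch of the idea, and not how penman is implemented, a line parser could take a set of full-line keys and stop splitting when the first key on the line is one of them. The function name `parse_comment_line` and the `full_line_keys` parameter are assumptions made for illustration.

```python
from typing import Dict, FrozenSet


def parse_comment_line(
    comment: str,
    full_line_keys: FrozenSet[str] = frozenset({'snt'}),
) -> Dict[str, str]:
    """Parse one comment line, treating some keys as full-line.

    Hypothetical sketch; `full_line_keys` is an assumed configuration.
    """
    metadata: Dict[str, str] = {}

    # Look at the FIRST key on the line.
    _, found, rest = comment.partition('::')
    if found:
        key, _, value = rest.partition(' ')
        if key in full_line_keys:
            # The value runs to the end of the line, '::' included.
            metadata[key] = value.rstrip()
            return metadata

    # Otherwise split at every '::', working backwards from the end.
    while comment:
        comment, found, meta = comment.rpartition('::')
        if found:
            key, _, value = meta.partition(' ')
            metadata[key] = value.rstrip()
    return metadata


snt = parse_comment_line(
    '# ::snt this must be separated using :: unless it is a single line')
header = parse_comment_line(
    '# ::save-date Sat Jan 20, 2018 ::file wiki-minicorpus-a_0001_2.txt')
```

With an empty `full_line_keys` the sketch falls back to splitting at every ::, so lines with several keys are handled as before.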

