Adding typing to tree branches #139

chanind · 2024-01-02T04:05:11Z

This PR removes the Any from the Branch type, so trees are fully recursively typed.

This is a first step towards #129, but I figured this would be a small enough, self-contained chunk of work to check whether this is on the right path.

chanind · 2024-01-02T04:07:21Z

penman/types.py


 Variable = str
 Constant = Union[str, float, int, None]  # None for missing values
 Role = str  # '' for anonymous relations
+Symbol = str


Is this the correct name for what's in branch targets? It seemed like the target must always be a string if it's not a Node, with even number constants showing up as strings. Are there edge-cases of trees where this isn't correct?

goodmami · 2024-09-11T05:05:37Z

Is this the correct name for what's in branch targets?

I believe this would be Atom. Considering the grammar, atoms are variables or constants, and constants are either strings or symbols. Here's an example:

(a / A
   :ARG1 (b / B)  ; node target
   :ARG2 b        ; variable target (atomic)
   :ARG3 "a b"    ; string target (atomic)
   :ARG4 abc      ; symbol target (atomic)
   )

The only difference between a symbol and a variable is that a variable is used in a node (such as a and b in the example above). The difference between strings and symbols, besides the quotes, is just that symbols cannot contain control characters like whitespace, parens, colons, etc. Quoted strings may not be used as variables (("a" / A) raises an error, and :ARG1 "a" does not reference a variable a).

even number constants showing up as strings

That is correct. Penman does no interpretation of datatypes on parsing. It will accept a limited number of non-string types (ints, floats, etc.) during encoding, but they will be strings again when decoding.
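Since parsed targets stay strings, any numeric interpretation is left to the caller. A minimal sketch of that step, using a hypothetical helper named `interpret_atom` that is not part of Penman and only illustrates the idea:

```python
from typing import Optional, Union


def interpret_atom(atom: Optional[str]) -> Union[str, int, float, None]:
    """Hypothetical helper: turn a parsed string atom into a number if possible."""
    if atom is None:  # a missing target parses to None
        return None
    for convert in (int, float):
        try:
            return convert(atom)
        except ValueError:
            continue
    return atom  # symbols and quoted strings are left untouched


print(interpret_atom("1"))
print(interpret_atom("0.5"))
print(interpret_atom("abc"))
```

Python's own `int()` and `float()` rules apply here, so a symbol such as `nan` would also be converted to a float; a real caller might want a stricter check.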

Are there edge-cases of trees where this isn't correct?

While unconventional, missing targets are allowed and parse to None (a warning may be issued as well):

>>> penman.parse('(a / A :ARG1)')
Missing target: (a / A :ARG1)
Tree(('a', [('/', 'A'), (':ARG1', None)]))

Similarly, an empty node target means that the type annotation of Node is not entirely correct, as mentioned in #129:

>>> penman.parse('(a / A :ARG1 ())')
Tree(('a', [('/', 'A'), (':ARG1', (None, []))]))
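One way to read the two edge cases above is that both a branch target and a node's variable may be None. The sketch below shows aliases that would admit them, assuming the name `Atom` suggested earlier and `typing` syntax that works on Python 3.8. It is an illustration only, not the annotation this PR or Penman settled on, and `count_nodes` is a hypothetical helper.

```python
from typing import List, Optional, Tuple, Union

Variable = str
Role = str
Atom = Optional[str]  # assumption: None stands for a missing target
Branch = Tuple[Role, Union[Atom, "Node"]]
Node = Tuple[Optional[Variable], List[Branch]]  # None stands for the empty node ()

# The two trees from the examples above, written as plain tuples.
missing_target: Node = ("a", [("/", "A"), (":ARG1", None)])
empty_node: Node = ("a", [("/", "A"), (":ARG1", (None, []))])


def count_nodes(node: Node) -> int:
    """Count this node plus every nested node target."""
    _, branches = node
    return 1 + sum(
        count_nodes(target) for _, target in branches if isinstance(target, tuple)
    )
```

Whether the checker accepts the recursive alias depends on its version and settings, so treat the annotations as a starting point.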

chanind · 2024-01-02T04:08:39Z

penman/types.py


 # Tree types
-Branch = Tuple[Role, Any]
+Branch = Tuple[Role, Union[Symbol, "Node"]]
 Node = Tuple[Variable, List[Branch]]


Is it possible to have an AMR tree that consists of just a single Variable with no branches?

goodmami · 2024-09-11T05:06:12Z

Probably not for AMR, but it is a possible graph for this Penman library:

>>> penman.parse('(a)')  # tree node has empty branch list
Tree(('a', []))
>>> penman.decode('(a)').triples  # graph has :instance None
[('a', ':instance', None)]

chanind · 2024-01-02T04:34:01Z

penman/layout.py

@@ -178,7 +178,8 @@ def _interpret_node(t: Node, variables: Set[Variable], model: Model):
            triples.append(triple)
            epidata.append((triple, epis))
        # nested nodes
-        else:
+        # mypy forgets that (Node ∨ Sym) ^ ¬Sym → Node
+        elif is_tgt_node(target):


This is pretty awkward, and it technically reduces performance just to get typing to work. Annoyingly, Mypy's TypeGuard doesn't work the same way as isinstance() (python/typing#1351): a TypeGuard narrows the type only in the branch where it returns True. Replacing is_tgt_symbol(target) above with isinstance(target, str) would let Mypy infer that the target must be a Node here and a Symbol above, without needing an elif here. I'm not sure which solution is best, since the named function makes it clearer what's going on than a naked isinstance.

Alternatively, we could just use cast to force Mypy to recognize the correct type 🤷‍♂️

goodmami · 2024-09-11T05:24:56Z

Yeah, this is a little awkward. In general, I'm not too concerned about reducing performance as long as it's correct.

I haven't looked closely at this code in a while so I'd need to think a bit more about a better solution, but in the meantime I want to point out that the change from else to elif means that there is no more else case. A reader of the code would have to know that is_tgt_symbol() and is_tgt_node() are defined as opposites to determine that the elif would catch all other cases. Otherwise it looks like a latent bug.
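The isinstance() option raised earlier in this thread would keep the plain else, since the checker can narrow both branches on its own. A simplified sketch follows, with a hypothetical function `collect_variables` standing in for `_interpret_node` and with the type aliases redefined locally so the snippet is self-contained:

```python
from typing import List, Tuple, Union

Symbol = str
Branch = Tuple[str, Union[Symbol, "Node"]]
Node = Tuple[str, List[Branch]]


def collect_variables(node: Node) -> List[str]:
    """Return the variable of this node and of every nested node target."""
    var, branches = node
    found = [var]
    for _role, target in branches:
        if isinstance(target, str):
            # atomic target: narrowed to Symbol, nothing to descend into
            continue
        else:
            # nested node: the checker narrows to Node without a second guard
            found.extend(collect_variables(target))
    return found


tree: Node = ("a", [("/", "A"), (":ARG1", ("b", [("/", "B")])), (":ARG2", "b")])
print(collect_variables(tree))
```

This sketch assumes targets are only strings or nodes; the None targets discussed in the types.py thread would need their own branch.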

goodmami

Apologies for the delayed response. I have some comments below which would require some changes, but I don't have specific suggestions. Hopefully they are informative for you nonetheless.

goodmami · 2024-09-11T05:20:09Z

penman/tree.py

+def is_tgt_node(target: Symbol | Node) -> TypeGuard[Node]:
+    """
+    Inverse of :func:`is_atomic`, only for Symbol | Node from branches.
+    Automatically narrows the type to Node for better type inference
+    """
+    return not is_atomic(target)
+
+
+def is_tgt_symbol(target: Symbol | Node) -> TypeGuard[Symbol]:
+    """
+    Same as :func:`is_atomic`, only for Symbol | Node from branches.
+    Automatically narrows the type to Symbol for better type inference
+    """
+    return is_atomic(target)


There are a few issues here:

It's not great that these don't really do anything different from is_atomic() except for type checking. I think having them as part of the public API would confuse users.

If they were to stay, I'd prefer to use non-abbreviated names in public API functions: is_target_symbol() or maybe is_symbol().

TypeGuard was added in Python 3.10, but Penman currently supports down to Python 3.8.
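If a guard function were kept, one possible way around the version constraint is to import TypeGuard only for the type checker and to quote the return annotation, so nothing from Python 3.10 is needed at runtime. The sketch below assumes `typing_extensions` is available to the type checker, uses the hypothetical name `is_target_node`, and substitutes a plain tuple check for `is_atomic()`. The `Symbol | Node` spelling in the diff also needs Python 3.10 unless annotations are deferred, so `Union` is used instead.

```python
from typing import TYPE_CHECKING, List, Tuple, Union

if TYPE_CHECKING:
    # Seen only by the type checker; never imported at runtime.
    from typing_extensions import TypeGuard

Symbol = str
Branch = Tuple[str, Union[Symbol, "Node"]]
Node = Tuple[str, List[Branch]]


def is_target_node(target: Union[Symbol, Node]) -> "TypeGuard[Node]":
    """Hypothetical guard: True when a branch target is a nested node."""
    # Assumption: nodes are tuples and every other target is atomic.
    return isinstance(target, tuple)


print(is_target_node(("b", [("/", "B")])))
print(is_target_node("b"))
```

This addresses only the Python version issue, not the concern about adding near-duplicates of `is_atomic()` to the public API.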
