Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postpone position assignment #105

Merged
merged 2 commits into from
Jul 30, 2024
Merged

Conversation

dag-erling
Copy link
Collaborator

During AST expansion, it is possible for new literal nodes to be created and given the same position as existing literals. The code attempts to reposition nodes when this happens but does not always succeed. The result is a short-circuit in the TNFA which can manifest as an infinite loop, e.g. a bounded iteration being treated as an unbounded one. This reshuffling can also leave positions unfilled, wasting memory.

This bug was discovered while investigating a report of an issue with a regular expression intended to accept DNS names, which turned out to accept labels of any length, instead of only up to 63 characters. Compiling the minimal reproducer ^((a{1,3})?x)*y with debugging enabled showed the following AST after expansion:

catenation, sub -1, 0 tags
  catenation, sub 0, 0 tags
    catenation, sub -1, 0 tags
      assertions: bol 
      iteration {0, -1}, sub -1, 0 tags, greedy
        catenation, sub 1, 0 tags
          iteration {0, 1}, sub -1, 0 tags, greedy
            catenation, sub 2, 0 tags
              literal (a, a) (97, 97), pos 0, sub -1, 0 tags
              union, sub -1, 0 tags
                literal empty
                catenation, sub -1, 0 tags
                  literal (a, a) (97, 97), pos 2, sub -1, 0 tags
                  union, sub -1, 0 tags
                    literal empty
                    literal (a, a) (97, 97), pos 1, sub -1, 0 tags
          literal (x, x) (120, 120), pos 1, sub -1, 0 tags
    literal (y, y) (121, 121), pos 4, sub -1, 0 tags
  literal (, ) (0, 0), pos 5, sub -1, 0 tags

Note that the last a node and the x node both have position 1. The y node was moved from position 2 to position 4 during expansion; the x node should have been moved to position 3, but wasn't.

The solution is to postpone assigning positions to literal nodes until immediately before TNFA conversion, removing the need to reposition nodes during AST expansion. With this change, we get the following AST instead:

catenation, sub -1, 0 tags
  catenation, sub 0, 0 tags
    catenation, sub -1, 0 tags
      assertions: bol 
      iteration {0, -1}, sub -1, 0 tags, greedy
        catenation, sub 1, 0 tags
          iteration {0, 1}, sub -1, 0 tags, greedy
            catenation, sub 2, 0 tags
              literal (a, a) (97, 97), pos 0, sub -1, 0 tags
              union, sub -1, 0 tags
                literal empty
                catenation, sub -1, 0 tags
                  literal (a, a) (97, 97), pos 1, sub -1, 0 tags
                  union, sub -1, 0 tags
                    literal empty
                    literal (a, a) (97, 97), pos 2, sub -1, 0 tags
          literal (x, x) (120, 120), pos 3, sub -1, 0 tags
    literal (y, y) (121, 121), pos 4, sub -1, 0 tags
  literal (, ) (0, 0), pos 5, sub -1, 0 tags

This PR adds test cases which trigger the bug and implements the solution described above.

This test case exposes a bug which causes the upper limit of a finite
bound to be ignored in certain cases.

The regular expression used in the test case is a minimal reproducer;
a more practical but less readable and less easily debugged example
would be a regular expression that accepts fully-qualified domain
names.
During AST expansion, it is possible for new literal nodes to be created
and given the same position as existing literals.  The code attempts to
reposition nodes when this happens but does not always succeed.  The
result is a short-circuit in the TNFA which can manifest as an infinite
loop, e.g. a bounded iteration being treated as an unbounded one.  This
reshuffling can also leave positions unfilled, wasting memory.

Instead of trying to assign positions to literal nodes on creation, and
then having to renumber them on the fly during AST expansion, wait until
the end and move position assignment into `tre_compute_nfl()`, which we
rename to `tre_compute_npfl()`.
@dag-erling dag-erling merged commit 9c098ea into laurikari:master Jul 30, 2024
2 checks passed
@dag-erling dag-erling deleted the des/position branch July 30, 2024 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Found a simple expression that fails on OSX, in the latest TRE, but passes in ICU on OSX
1 participant