Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add named group support #6

Open
jehna opened this issue Jan 3, 2016 · 16 comments
Open

Add named group support #6

jehna opened this issue Jan 3, 2016 · 16 comments

Comments

@jehna
Copy link

jehna commented Jan 3, 2016

As implemented in Python repo's pull request by @andrii1812

VerbalExpressions/PythonVerbalExpressions#16

@jehna
Copy link
Author

jehna commented Jan 3, 2016

Named groups is something that is missing in at least JavaScript, so this could potentially be an issue with cross-language support.

  • If we'd keep track of the groups' indexes in e.g. JavaScript, we could emulate the behaviour.
  • Should have a look at how XRegExp does this.
  • If this feature is implementable in other languages, this would be a great addition to the library

@roger-
Copy link

roger- commented Jan 6, 2016

How about something like:

tester = (verbal_expression.
            start_of_line().
               find('http').
               maybe('s').
               find('://').
               start_group('address').
                  maybe('www.').
                  anything_but(' ').
               end_group('address').
            end_of_line()
)

@jehna
Copy link
Author

jehna commented Jan 7, 2016

I was thinking more of like the Python implementation:

VerEx().
  start_of_line()
  .group("protocol",
    VerEx()
      .find('http')
      .maybe('s')
      .find('://')
  )
  .group("domain",
    VerEx()
      .maybe('www.')
      .anything_but(' ')
  )
  .end_of_line()

The real problem is, that some languages (e.g. JavaScript) do not natively support named groups for regex, like python does. The following would be implemented in vanilla python regex like:

^(?P<protocol>https?://)(?P<domain>(www\.)?[^\ ]*)$

Which allows us to ask values for protocol and domain groups.

So getting this to function in languages that don't support named groups would be awesome.

@roger-
Copy link

roger- commented Jan 7, 2016

Not bad, but I'm not a fan of the dangling, JavaScript-esque parentheses.

@jehna
Copy link
Author

jehna commented Jan 7, 2016

Also adding group members as arguments would kind of be against how the rest of the library works.
(all other ones working with xx().xx() but groups being an exception to work xx(xx()))

How about automatically wrapping the preceding "ungrouped" part automatically with group()?

Like:

VerEx()
  .start_of_line()
  .find('http')
  .maybe('s')
  .find('://')
  .group("protocol")
  .maybe('www.')
  .anything_but(' ')
  .end_of_line()
  .group("domain")

that would compile to:

(?P<protocol>^https?://)(?P<domain>(www\.)?[^\ ]*$)

@roger-
Copy link

roger- commented Jan 7, 2016

I think it's a little clearer to label the start/end of a group instead of just the end.

@jehna
Copy link
Author

jehna commented Jan 7, 2016

Hmm.. But then there's a human error possibility, especially with large expressions. If a dev forgets to start/end one group, that can be hard to debug.

If there's just a general group() function, it will potentially create unnecessary groups; for example if you'd just like to capture the middle of a string, it would look something like this:

// Capture only names of John and Tracy
var words = ['Name: John', 'Something', 'Name: Tracy', 'Name: Bob'];
VerEx()
  .first('Name: ')
  .group()
  .maybe('John')
  .maybe('Tracy')
  .group("name")

that would create an unnecessary group around the first('Name: '), but it would always compile, even if you removed either of the groups

@roger-
Copy link

roger- commented Jan 7, 2016

How about begin_group('address') ... end_group('address')?

@jehna
Copy link
Author

jehna commented Jan 7, 2016

But consider if you'd have a big expression with lots of things going on. You start modifying the code and somewhere between the copy-pastes you lose a begin_group('address') line. How should the code work? Should it? What would happen if you typo a group's name to other of the function calls? How would you debug your code?

I really think using just the single group method would answer pretty logically to all of these questions.

@roger-
Copy link

roger- commented Jan 7, 2016

I don't think we need to worry about cases where people mess up their code, there's not much we can do about that. That said, if there's an end_group('name') with no matching start_group('name') (and vice versa) then that's easy to detect and trigger an exception.

I just think it should be clear that a group has a well defined start and end. Also a single group() probably won't work when you have nested groups.

@mihai-vlc
Copy link
Contributor

I would also vote for an approach with 2 methods. It makes it more obvious.

VerEx()
  .start_of_line()
  .find('http')
  .maybe('s')
  .find('://')
  .open_group('domain')
    .maybe('www.')
    .anything_but(' ')
    .end_of_line()
  .close_group();

In this case you specify the name only when you open it.

To avoid issues in situations like the one below we could add the group wrapper to the expression only when you call the close_group() method.

VerEx()
  .start_of_line()
  .find('http')
  .maybe('s')
  .find('://')
  .open_group('domain')
    .maybe('www.');
// this would compile to: /(?:http)(?:s)?(?:\:\/\/)(?:www\.)?/gm

For debugging maybe we could add a helper method like check_groups that will throw an error if any previous opened group is not closed.

@lukewestby
Copy link

Given the variety of languages that implement VerbalExpressions I just want to throw in my opinion that I hope we can avoid functions that throw, as some languages don't have a concept of exceptions or try/catch and so the failure case would have to be lifted, adding friction to the actual use of the API

@roger-
Copy link

roger- commented Jan 7, 2016

For debugging maybe we could add a helper method like check_groups

I'd do that by default just to cut down on errors like @jehna mentioned.

I also think it's better to return lists of dicts (or namedtuples?) with results instead of SRE_Match objects.

@jehna
Copy link
Author

jehna commented Jan 11, 2016

Or we could just omit the whole group method for now and just make the then and maybe accept VerEx objects as arguments. Could this cause issues in strongly typed languages?

@jehna
Copy link
Author

jehna commented Jan 11, 2016

Also a bit related: VerbalExpressions/purescript-verbal-expressions#2

@sharkdp
Copy link

sharkdp commented Jan 11, 2016

Also a bit related: VerbalExpressions/purescript-verbal-expressions#2

Thanks for the reference. I have implemented this now. It allows to write something like this:

pattern = do
  find "start"
  some whitespace
  possiblyV do
    find "middle"
    some whitespace
  find "end"

I realize it probably doesn't help this discussion here too much, but its a nice example where a monadic interface has some advantages over the JS-style method-chaining interface, because we can also bind to intermediate names:

pattern = do
  firstWord <- capture word
  whitespace
  capture word
  whitespace
  findAgain firstWord

This pattern matches "foo bar foo" but not "foo bar baz".

The expression x <- capture inner creates a new capture group for the inner expression and binds the index of this capture group to the name x.

This also allows us to "simulate" named groups for purposes of replacing. For more details, see the Replacing with named groups example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants