Add lookarounds #3

Closed
tylerlittlefield opened this issue Mar 6, 2019 · 12 comments
Labels
enhancement New feature or request

Comments

@tylerlittlefield
Member

Add ways to express lookarounds. This was brought up by @dmi3kno and he mentioned a pretty intuitive way using step_ahead() and step_back()

Source: https://twitter.com/dmi3k/status/1103401979152355328

lookarounds

@tylerlittlefield tylerlittlefield added the enhancement New feature or request label Mar 6, 2019
@dmi3kno

dmi3kno commented Mar 6, 2019

Suggested functions (they need good unit testing):

# Swap the opener of the trailing "(?:...)" group for "(?<=" (positive lookbehind)
step_ahead <- function(.data=NULL){
  # val is only used to detect whether a trailing non-capturing group exists
  val <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*(?=\\)\\??$)", .data, perl = TRUE))
  if (!length(val)) return(.data)
  # post: everything after the last "(?:"; pre: everything before it
  post <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*$", .data, perl = TRUE))
  pre <- regmatches(.data, regexpr("^.*(?=\\(\\?\\:[^\\(\\?\\:]*$)", .data, perl = TRUE))
  paste0(pre, "(?<=", post)
}
# Swap the opener of the trailing "(?:...)" group for "(?=" (positive lookahead)
step_back <- function(.data=NULL){
  val <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*(?=\\)\\??$)", .data, perl = TRUE))
  if (!length(val)) return(.data)
  post <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*$", .data, perl = TRUE))
  pre <- regmatches(.data, regexpr("^.*(?=\\(\\?\\:[^\\(\\?\\:]*$)", .data, perl = TRUE))
  paste0(pre, "(?=", post)
}

These functions boil down to tracing back and modifying the previous find or maybe, if one is detected.
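
For readers who want the idea without the nested lookarounds, here is a simplified sketch of the same "rewrite the trailing group" trick. The helper names are hypothetical and not part of the package, and this version only handles a final group whose body contains no parentheses, so it is an illustration rather than a drop-in replacement:

```r
# Hypothetical helper (not from the package): swap the opener of a trailing
# "(?:...)" group for a different one, e.g. "(?<=" or "(?=".
rewrite_last_group <- function(x, opener) {
  # Final non-capturing group whose body has no parentheses
  pattern <- "\\(\\?:([^()]*)\\)$"
  # "\\1" re-inserts the group body after the new opener
  sub(pattern, paste0(opener, "\\1)"), x, perl = TRUE)
}

to_lookbehind_sketch <- function(x) rewrite_last_group(x, "(?<=")
to_lookahead_sketch  <- function(x) rewrite_last_group(x, "(?=")

behind <- to_lookbehind_sketch("(?:USD)")
ahead  <- to_lookahead_sketch("(?:\\d)(?: dollars)")
same   <- to_lookahead_sketch("no group here")
```

Only the last group is touched, and input without a trailing group comes back unchanged.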

Since step_ahead() and step_back() are modifiers of find, perhaps they can be an argument in find (with default being step=0 or step=NULL):

x <- find("(", step='after') %>%     # or find("(", step=1)
  begin_capture() %>% 
  anything() %>% 
  find(")", step='before') %>%     # or find(")", step=-1)
  end_capture()

x
#> [1] "((?<=\\()(?:.*)(?=\\)))"

This does not cover negative lookarounds. The water is getting pretty deep already with my complex regex, so perhaps implementing (all kinds of) lookarounds would be easier as find modifiers. Or as adverbs, lookahead/lookbehind. Or as synonyms such as stop_before() and start_after()?

@tylerlittlefield
Member Author

tylerlittlefield commented Mar 6, 2019

Thanks for all the suggestions, this is awesome. I like your idea on adding an additional step argument to modify find().

I think the current regex in your example, ((?<=\\()(?:.*)(?=\\))), is greedy, so on something like "(extract) foo (me)" it would also swallow the " foo " in between the two groups. Maybe add a greedy argument to anything()? Otherwise, you could get by with something like:

x <- find(value = "(", step = 'forward') %>%
  anything_but(")") %>%
  find(")", step = 'backward')

stringr::str_extract_all("(extract) foo (me)", x)
#> [[1]]
#> [1] "extract" "me"  

Or with a greedy argument in anything():

z <- find(value = "(", step = "forward") %>% 
  anything(greedy = FALSE) %>% 
  find(")", step = "backward")

z
#> [1] "(?<=\\()(?:.*?)(?=\\))"

stringr::str_extract_all("(extract) foo (me)", z)
#> [[1]]
#> [1] "extract" "me"  

Reproducible example:

# in case you want to copy paste and run the example above
library(dplyr)

sanitize <- function(.data) {
  escape_chrs <- c(".", "|", "*", "?", "+", "(", ")", "{", "}", "^", "$", "\\", ":", "=", "[", "]")
  string_chrs <- strsplit(.data, "")[[1]]
  idx <- which(string_chrs %in% escape_chrs)
  idx_new <- paste0("\\", string_chrs[idx])
  paste0(replace(string_chrs, idx, idx_new), collapse = "")
}

# add greedy arg
anything <- function(.data = NULL, greedy = TRUE) {
  if(isTRUE(greedy)) {
    paste0(.data, "(?:.*)")
  } else if(isFALSE(greedy)){
    paste0(.data, "(?:.*?)")
  }
}

# add step arg
find <- function(.data = NULL, value, step = NULL) {
  if(is.null(step)) {
    paste0(.data, "(?:", sanitize(value), ")")
  } else if(step == "forward") {
    paste0(.data, "(?<=", sanitize(value), ")")
  } else if(step == "backward") {
    paste0(.data, "(?=", sanitize(value), ")")
  } else {
    # fail loudly instead of silently returning NULL
    stop("step must be NULL, \"forward\", or \"backward\"")
  }
}
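
To see the greedy vs. lazy difference without the builder functions, here is a small base R sketch that runs the two patterns from this comment directly. It uses regmatches()/gregexpr() with perl = TRUE instead of stringr, which is my substitution and not what the thread uses:

```r
txt <- "(extract) foo (me)"

# Patterns as printed above: greedy ".*" versus lazy ".*?"
greedy <- "(?<=\\()(?:.*)(?=\\))"
lazy   <- "(?<=\\()(?:.*?)(?=\\))"

# Extract all matches with PCRE so the lookarounds are supported
greedy_hits <- regmatches(txt, gregexpr(greedy, txt, perl = TRUE))[[1]]
lazy_hits   <- regmatches(txt, gregexpr(lazy, txt, perl = TRUE))[[1]]

greedy_hits
lazy_hits
```

The greedy pattern returns one long match running from the first "(" to the last ")", while the lazy one stops at the nearest ")" each time.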

@dmi3kno

dmi3kno commented Mar 7, 2019

I think greedy as argument looks a bit ugly. How about making lazy (non-greedy) versions of anything() and everything()?

regex    greedy          non-greedy
.*       anything()      whatever()
.+       everything()    something()

I think find(), as it stands now, should only initiate a non-capturing group. We need another group of verbs for creating lookarounds (positive and negative): seek_suffix, seek_prefix and avoid_suffix, avoid_prefix.

seek_prefix <- function(.data = NULL, value) {
    paste0(.data, "(?<=", sanitize(value), ")")
}

seek_suffix <- function(.data = NULL, value) {
    paste0(.data, "(?=", sanitize(value), ")")
}


avoid_prefix <- function(.data = NULL, value) {
    paste0(.data, "(?<!", sanitize(value), ")")
}

avoid_suffix <- function(.data = NULL, value) {
    paste0(.data, "(?!", sanitize(value), ")")
}

I also think that an exact number of repetitions can be expressed as count() (or n() or repeated()):

count <- function(.data = NULL, n = 1) {
  paste0(.data, "{", n,"}")
}

Here are some unit tests for lookarounds, each returning the single value 100:

# positive lookahead
x <- start_of_line() %>% 
  digit() %>% count(3) %>% 
  seek_suffix(" dollars")
x
stringr::str_extract_all("100 dollars", x)
  
# negative lookahead
x <- start_of_line() %>% 
  digit() %>% count(3) %>%
  avoid_suffix(" dollars")
x
stringr::str_extract_all("100 pesos", x)

# positive lookbehind
x <- seek_prefix(value="USD") %>% 
  digit() %>% count(3)
x
stringr::str_extract_all("USD100", x)

# negative lookbehind
x <- avoid_prefix(value="USD") %>% 
  digit() %>% count(3)
x
stringr::str_extract_all("JPY100", x)
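
The tests above rely on start_of_line() and digit(), which are not shown in this thread. Below is a self-contained sketch that can be pasted and run as is. The definitions of start_of_line() and digit() are my assumptions about what those verbs emit, sanitize() is left out because the values contain no metacharacters, and base R regmatches() stands in for stringr:

```r
# Assumed stand-ins for package verbs not defined in this thread
start_of_line <- function(.data = NULL) paste0(.data, "^")
digit         <- function(.data = NULL) paste0(.data, "\\d")

# Verbs as proposed above (without sanitize(), for brevity)
count        <- function(.data = NULL, n = 1) paste0(.data, "{", n, "}")
seek_prefix  <- function(.data = NULL, value) paste0(.data, "(?<=", value, ")")
seek_suffix  <- function(.data = NULL, value) paste0(.data, "(?=", value, ")")
avoid_prefix <- function(.data = NULL, value) paste0(.data, "(?<!", value, ")")
avoid_suffix <- function(.data = NULL, value) paste0(.data, "(?!", value, ")")

# First match only, using PCRE so lookarounds work
extract <- function(text, pattern) {
  regmatches(text, regexpr(pattern, text, perl = TRUE))
}

# Nested calls instead of %>% to avoid a package dependency
pos_ahead  <- seek_suffix(count(digit(start_of_line()), 3), " dollars")
neg_ahead  <- avoid_suffix(count(digit(start_of_line()), 3), " dollars")
pos_behind <- count(digit(seek_prefix(value = "USD")), 3)
neg_behind <- count(digit(avoid_prefix(value = "USD")), 3)

results <- c(
  extract("100 dollars", pos_ahead),
  extract("100 pesos", neg_ahead),
  extract("USD100", pos_behind),
  extract("JPY100", neg_behind)
)

# The negative variants should reject the inputs the positive ones accept
rejected_ahead  <- extract("100 dollars", neg_ahead)
rejected_behind <- extract("USD100", neg_behind)
```

All four entries of results should be "100", and the two rejected_* values should be empty.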

Finally, as Hadley suggested, you need to start thinking about a prefix for the function names to avoid namespace collisions. I suggest we go for rx_, so it would be rx_whatever(), rx_digit() or rx_count().

@tylerlittlefield
Member Author

tylerlittlefield commented Mar 8, 2019

Thanks for this! One thing to keep in mind is that the following currently exist:

  1. anything()
  2. anything_but()
  3. something()
  4. something_but()

Where:

# matches anything, including nothing, i.e. an empty string
anything()
#> [1] "(?:.*)"
anything_but(value = "foo")
#> [1] "(?:[^foo]*)"

# matches something, excluding nothing
something()
#> [1] "(?:.+)"
something_but(value = "foo")
#> [1] "(?:[^foo]+)"

grepl(anything(), "")
#> [1] TRUE
grepl(something(), "")
#> [1] FALSE
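
One thing worth keeping in mind about the _but() variants: going by the output shown above, "[^foo]" is a negated character class, so it excludes the characters "f" and "o" rather than the word "foo". A quick base R check, with anchors added by me so the whole string has to match:

```r
# something_but(value = "foo") output from above, wrapped in ^...$
pattern <- "^(?:[^foo]+)$"

bar_ok  <- grepl(pattern, "bar", perl = TRUE)   # no "f" or "o" anywhere
off_ok  <- grepl(pattern, "off", perl = TRUE)   # not "foo", but made of "f" and "o"
boat_ok <- grepl(pattern, "boat", perl = TRUE)  # contains a single "o"
```

Here bar_ok is TRUE while off_ok and boat_ok are FALSE, even though neither "off" nor "boat" contains "foo".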

I like the idea of anything(), whatever(), everything(), and something() but they all sound greedy to me. What about anything_lazy(), something_lazy()?

This would create 3 options for each, a total of 6 functions:

  1. anything() matches literally anything, including nothing.
  2. anything_but() matches anything but whatever you give it.
  3. anything_lazy() matches anything as little as needed.
  4. something() matches something, excluding nothing.
  5. something_but() matches something but whatever you give it.
  6. something_lazy() matches something as little as needed.

You could then do something like:

something_lazy <- function(.data = NULL) {
  paste0(.data, "(?:.+?)")
}

anything_lazy <- function(.data = NULL) {
  paste0(.data, "(?:.*?)")
}

x <- seek_prefix(value = "(") %>% 
  something_lazy() %>% 
  seek_suffix(")")
x
#> [1] "(?<=\\()(?:.+?)(?=\\))"
stringr::str_extract_all("(extract) foo (me) then anything ()", x)
#> [[1]]
#> [1] "extract" "me"

y <- seek_prefix(value = "(") %>% 
  anything_lazy() %>% 
  seek_suffix(")")
y
#> [1] "(?<=\\()(?:.*?)(?=\\))"
stringr::str_extract_all("(extract) foo (me) then anything ()", y)
#> [[1]]
#> [1] "extract" "me"      ""

Not sure if this is the right way, but for whatever reason something() and anything() make sense to me.

Also, the lookaround functions and count() look great. rx_ sounds like a good prefix as well. I might just pull the trigger and add rx_ once I get home. Take a look at #1 as well, do you like rx_ better than vex_? I like both, rx_ is nice because it's shorter.

@dmi3kno

dmi3kno commented Mar 8, 2019

After some thinking I agree that introducing more synonyms for non-greedy variants of existing functions is a bad idea (I like the implicit "laziness" of whatever() though :) ).

Having said that, I don't like the _lazy suffix. There are potentially many functions that may need to be turned non-greedy. Examples are one_or_more() (the + modifier) and its * counterpart (which you could call none_or_more()), or even anything_but(), something_but() and generally every function that results in a regex piece ending with a + or * modifier.

Could the "non-greediness" be turned on in any of these by an argument called lazy? We can even leave greedy=!lazy for more advanced users (who know what "greediness" is).

I think the more fundamental decision that we seem to have landed on is that the default verbs should be greedy (to match default perl-style regex behavior). Although I just want to drop it in here that we could think of an rx world where lazy verbs are the default and you need to turn greediness on. This is a difficult world for me to comprehend right now.

References:
Lazy vs greedy

P.S. One afterthought: when inventing new verbs, we should probably try and stay closer to the words that have been implemented in other languages of VerbalExpressions. That would honor the work done by others and make transitions between languages smoother. We are free to invent arguments, though. This means that anything(), anything_but() are here to stay.

@tylerlittlefield
Member Author

Good point, _lazy() isn't going to cut it. rex uses: type = c("greedy", "lazy", "possessive"), what do you think? With the prefix and constructor, it would look like:

x <- rx() %>% 
  seek_prefix("(") %>% 
  anything(type = "lazy") %>% 
  seek_suffix(")")

Regarding lazy by default, I like the idea but regular expressions have been greedy by default for so long that it may just be confusing. There is a thread about this here.
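
For concreteness, here is a rough sketch of how a type argument along those lines could be wired up. The function name rx_anything_sketch() is hypothetical, and this is one possible shape, not the package's implementation:

```r
# Hypothetical sketch: one verb, three quantifier behaviours
rx_anything_sketch <- function(.data = NULL,
                               type = c("greedy", "lazy", "possessive")) {
  # match.arg() picks "greedy" by default and rejects unknown values
  type <- match.arg(type)
  # PCRE quantifier suffixes: "" greedy, "?" lazy, "+" possessive
  suffix <- switch(type, greedy = "", lazy = "?", possessive = "+")
  paste0(.data, "(?:.*", suffix, ")")
}

greedy_rx <- rx_anything_sketch()
lazy_rx   <- rx_anything_sketch("(?<=\\()", type = "lazy")
poss_rx   <- rx_anything_sketch(type = "possessive")
```

Using match.arg() keeps greedy as the default, in line with the point above, and an unsupported type fails with an error instead of returning NULL.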

@dmi3kno

dmi3kno commented Mar 8, 2019

LGTM. Do you want a PR?

@tylerlittlefield
Member Author

Up to you. I don't mind adding the changes but I'd rather let people contribute if they want. So just let me know 😄

@tylerlittlefield
Member Author

tylerlittlefield commented Mar 8, 2019

Oh and if you do submit a PR, please let me know what you're working on so we don't duplicate efforts. I worked a bit on adding the rx_ prefix last night so no need to work on that.

@dmi3kno

dmi3kno commented Mar 8, 2019

I am adding lookarounds and implementing the type (lazy/greedy) argument in the relevant functions.

@dmi3kno

dmi3kno commented Mar 8, 2019

Last question: do you think the argument should be called type or mode? Or something else entirely?

@tylerlittlefield
Member Author

I think mode sounds better than type.

2 participants