Add lookarounds #3

tylerlittlefield · 2019-03-06T21:15:18Z

Add ways to express lookarounds. This was brought up by @dmi3kno, who mentioned a pretty intuitive way of doing it using step_ahead() and step_back().

Source: https://twitter.com/dmi3k/status/1103401979152355328


dmi3kno · 2019-03-06T22:18:09Z

Suggested functions (need good unit testing):

step_ahead <- function(.data=NULL){
  val <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*(?=\\)\\??$)", .data, perl = TRUE))
  if (!length(val)) return(.data)
  post <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*$", .data, perl = TRUE))
  pre <- regmatches(.data, regexpr("^.*(?=\\(\\?\\:[^\\(\\?\\:]*$)", .data, perl = TRUE))
  paste0(pre, "(?<=", post)
}

step_back <- function(.data=NULL){
  val <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*(?=\\)\\??$)", .data, perl = TRUE))
  if (!length(val)) return(.data)
  post <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*$", .data, perl = TRUE))
  pre <- regmatches(.data, regexpr("^.*(?=\\(\\?\\:[^\\(\\?\\:]*$)", .data, perl = TRUE))
  paste0(pre, "(?=", post)
}

These functions boil down to tracing back and modifying the previous find or maybe, if detected.
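For illustration, here is how step_ahead() behaves on a few hand-written inputs, using the definition posted above unchanged. This is only a sketch: the inputs are made up for the example, and the last one suggests that an escaped parenthesis inside the group is not detected, which is the kind of edge case the unit tests would need to cover.

```r
# step_ahead() exactly as posted above
step_ahead <- function(.data = NULL) {
  val <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*(?=\\)\\??$)", .data, perl = TRUE))
  if (!length(val)) return(.data)
  post <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*$", .data, perl = TRUE))
  pre <- regmatches(.data, regexpr("^.*(?=\\(\\?\\:[^\\(\\?\\:]*$)", .data, perl = TRUE))
  paste0(pre, "(?<=", post)
}

# A trailing non-capturing group is rewritten into a lookbehind
step_ahead("(?:foo)")
#> [1] "(?<=foo)"

# Anything before the trailing group is kept as-is
step_ahead("^(?:foo)")
#> [1] "^(?<=foo)"

# An escaped "(" inside the group is excluded by the character class,
# so nothing is detected and the input comes back unchanged
step_ahead("(?:\\()")
#> [1] "(?:\\()"
```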

Since step_ahead() and step_back() are modifiers of find, perhaps they can be an argument in find (with default being step=0 or step=NULL):

x <- find("(", step='after') %>%     # or find("(", step=1)
  begin_capture() %>% 
  anything() %>% 
  find(")", step='before') %>%     # or find(")", step=-1)
  end_capture()

x
#> [1] "((?<=\\()(?:.*)(?=\\)))"

This does not cover negative lookarounds. The water is already getting pretty deep with my complex regex, so perhaps implementing (all kinds of) lookarounds would be easier as find modifiers. Or as adverbs lookahead/lookbehind. Or as synonyms stop_before(), start_after()?

tylerlittlefield · 2019-03-06T23:31:38Z

Thanks for all the suggestions, this is awesome. I like your idea on adding an additional step argument to modify find().

I think the current regex in your example, ((?<=\\()(?:.*)(?=\\))), would also capture the "foo" in between in something like "(extract) foo (me)", since .* is greedy, so maybe add a greedy argument to anything()? Otherwise, you could get by with something like:

x <- find(value = "(", step = 'forward') %>%
  anything_but(")") %>%
  find(")", step = 'backward')

stringr::str_extract_all("(extract) foo (me)", x)
#> [[1]]
#> [1] "extract" "me"

Or with a greedy argument in anything():

z <- find(value = "(", step = "forward") %>% 
  anything(greedy = FALSE) %>% 
  find(")", step = "backward")

z
#> [1] "(?<=\\()(?:.*?)(?=\\))"

stringr::str_extract_all("(extract) foo (me)", z)
#> [[1]]
#> [1] "extract" "me"

Reproducible example:

# in case you want to copy paste and run the example above
library(dplyr)

sanitize <- function(.data) {
  escape_chrs <- c(".", "|", "*", "?", "+", "(", ")", "{", "}", "^", "$", "\\", ":", "=", "[", "]")
  string_chrs <- strsplit(.data, "")[[1]]
  idx <- which(string_chrs %in% escape_chrs)
  idx_new <- paste0("\\", string_chrs[idx])
  paste0(replace(string_chrs, idx, idx_new), collapse = "")
}

# add greedy arg
anything <- function(.data = NULL, greedy = TRUE) {
  if(isTRUE(greedy)) {
    paste0(.data, "(?:.*)")
  } else if(isFALSE(greedy)){
    paste0(.data, "(?:.*?)")
  }
}

# add step arg
find <- function(.data = NULL, value, step = NULL) {
  if(is.null(step)) {
    paste0(.data, "(?:", sanitize(value), ")")
  } else if(step == "forward") {
    paste0(.data, "(?<=", sanitize(value), ")")
  } else if(step == "backward") {
    paste0(.data, "(?=", sanitize(value), ")")
  }
}
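To see why the greedy argument matters, the two patterns can be compared directly. This is a minimal sketch in base R rather than stringr; extract_all() is a helper made up for the comparison, and perl = TRUE is needed because the default regex engine does not support lookarounds.

```r
# Hypothetical helper: return every match of `pattern` in a single string
extract_all <- function(string, pattern) {
  regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]
}

greedy <- "(?<=\\()(?:.*)(?=\\))"   # what anything() produces between the lookarounds
lazy   <- "(?<=\\()(?:.*?)(?=\\))"  # what anything(greedy = FALSE) produces

# Greedy: .* runs to the last ")" and swallows everything in between
extract_all("(extract) foo (me)", greedy)
#> [1] "extract) foo (me"

# Lazy: .*? stops at the first ")" it can, so each pair is matched separately
extract_all("(extract) foo (me)", lazy)
#> [1] "extract" "me"
```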

dmi3kno · 2019-03-07T23:39:31Z

I think greedy as argument looks a bit ugly. How about making lazy (non-greedy) versions of anything() and everything()?

regex    greedy          non-greedy
.*       anything()      whatever()
.+       everything()    something()

I think find(), as it stands now, should only initiate a non-capturing group. We need another group of verbs for creating lookarounds (positive and negative): seek_suffix, seek_prefix and avoid_suffix, avoid_prefix.

seek_prefix <- function(.data = NULL, value) {
    paste0(.data, "(?<=", sanitize(value), ")")
}

seek_suffix <- function(.data = NULL, value) {
    paste0(.data, "(?=", sanitize(value), ")")
}


avoid_prefix <- function(.data = NULL, value) {
    paste0(.data, "(?<!", sanitize(value), ")")
}

avoid_suffix <- function(.data = NULL, value) {
    paste0(.data, "(?!", sanitize(value), ")")
}

I also think that exact number of repetitions can be expressed as count() (or n() or repeated()):

count <- function(.data = NULL, n = 1) {
  paste0(.data, "{", n,"}")
}

Here are some unit tests for lookarounds, all returning the single value 100:

# positive lookahead
x <- start_of_line() %>% 
  digit() %>% count(3) %>% 
  seek_suffix(" dollars")
x
stringr::str_extract_all("100 dollars", x)
  
# negative lookahead
x <- start_of_line() %>% 
  digit() %>% count(3) %>%
  avoid_suffix(" dollars")
x
stringr::str_extract_all("100 pesos", x)

# positive lookbehind
x <- seek_prefix(value="USD") %>% 
  digit() %>% count(3)
x
stringr::str_extract_all("USD100", x)

# negative lookbehind
x <- avoid_prefix(value="USD") %>% 
  digit() %>% count(3)
x
stringr::str_extract_all("JPY100", x)
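The tests above depend on start_of_line() and digit(), which are not defined in this thread. A self-contained way to check them, as a sketch only: the two stand-ins below are assumptions about what those verbs emit ("^" and "\\d"), extract_all() is a made-up base R helper replacing stringr::str_extract_all(), and nested calls are used instead of %>% to avoid loading a pipe.

```r
# sanitize(), count() and the four lookaround verbs are copied from this thread
sanitize <- function(.data) {
  escape_chrs <- c(".", "|", "*", "?", "+", "(", ")", "{", "}", "^", "$", "\\", ":", "=", "[", "]")
  string_chrs <- strsplit(.data, "")[[1]]
  idx <- which(string_chrs %in% escape_chrs)
  idx_new <- paste0("\\", string_chrs[idx])
  paste0(replace(string_chrs, idx, idx_new), collapse = "")
}
count        <- function(.data = NULL, n = 1) paste0(.data, "{", n, "}")
seek_prefix  <- function(.data = NULL, value) paste0(.data, "(?<=", sanitize(value), ")")
seek_suffix  <- function(.data = NULL, value) paste0(.data, "(?=", sanitize(value), ")")
avoid_prefix <- function(.data = NULL, value) paste0(.data, "(?<!", sanitize(value), ")")
avoid_suffix <- function(.data = NULL, value) paste0(.data, "(?!", sanitize(value), ")")

# Assumed stand-ins for verbs that are not shown in this thread
start_of_line <- function(.data = NULL) paste0(.data, "^")
digit         <- function(.data = NULL) paste0(.data, "\\d")

# Hypothetical helper: every match of `pattern` in a single string
extract_all <- function(string, pattern) {
  regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]
}

three_digits <- count(digit(start_of_line()), 3)  # "^\\d{3}"

extract_all("100 dollars", seek_suffix(three_digits, " dollars"))   # positive lookahead
#> [1] "100"
extract_all("100 pesos", avoid_suffix(three_digits, " dollars"))    # negative lookahead
#> [1] "100"
extract_all("USD100", count(digit(seek_prefix(value = "USD")), 3))  # positive lookbehind
#> [1] "100"
extract_all("JPY100", count(digit(avoid_prefix(value = "USD")), 3)) # negative lookbehind
#> [1] "100"

# The negative lookarounds reject the inputs that the positive ones accept
extract_all("100 dollars", avoid_suffix(three_digits, " dollars"))
#> character(0)
extract_all("USD100", count(digit(avoid_prefix(value = "USD")), 3))
#> character(0)
```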

Finally, as Hadley suggested, you need to start thinking about a prefix for the function names to avoid namespace collisions. I suggest we go for rx_, so it would be rx_whatever(), rx_digit() or rx_count().

tylerlittlefield · 2019-03-08T00:42:38Z

Thanks for this! One thing to keep in mind is that the following currently exist:

anything()
anything_but()
something()
something_but()

Where:

# matches anything, including nothing i.e. an empty character
anything()
#> [1] "(?:.*)"
anything_but(value = "foo")
#> [1] "(?:[^foo]*)"

# matches something, excluding nothing
something()
#> [1] "(?:.+)"
something_but(value = "foo")
#> [1] "(?:[^foo]+)"

grepl(anything(), "")
#> [1] TRUE
grepl(something(), "")
#> [1] FALSE

I like the idea of anything(), whatever(), everything(), and something() but they all sound greedy to me. What about anything_lazy(), something_lazy()?

This would create 3 options for each, a total of 6 functions:

anything() matches literally anything, including nothing.
anything_but() matches anything but whatever you give it.
anything_lazy() matches anything as little as needed.
something() matches something, excluding nothing.
something_but() matches something but whatever you give it.
something_lazy() matches something as little as needed.

You could then do something like:

something_lazy <- function(.data = NULL) {
  paste0(.data, "(?:.+?)")
}

anything_lazy <- function(.data = NULL) {
  paste0(.data, "(?:.*?)")
}

x <- seek_prefix(value = "(") %>% 
  something_lazy() %>% 
  seek_suffix(")")
x
#> [1] "(?<=\\()(?:.+?)(?=\\))"
stringr::str_extract_all("(extract) foo (me) then anything ()", x)
#> [[1]]
#> [1] "extract" "me"

y <- seek_prefix(value = "(") %>% 
  anything_lazy() %>% 
  seek_suffix(")")
y
#> [1] "(?<=\\()(?:.*?)(?=\\))"
stringr::str_extract_all("(extract) foo (me) then anything ()", y)
#> [[1]]
#> [1] "extract" "me"      ""

Not sure if this is the right way, but for whatever reason something() and anything() make sense to me.

Also, the lookaround functions and count() look great. rx_ sounds like a good prefix as well. I might just pull the trigger and add rx_ once I get home. Take a look at #1 as well, do you like rx_ better than vex_? I like both, rx_ is nice because it's shorter.

dmi3kno · 2019-03-08T09:51:11Z

After some thinking, I agree that introducing more synonyms for non-greedy variants of existing functions is a bad idea (I like the implicit "laziness" of whatever() though :) ).

Having said that, I don't like the _lazy suffix. There are potentially many functions that may need to be turned non-greedy. Examples are one_or_more() (the + modifier) and its * counterpart, which you could call none_or_more(), or even anything_but(), something_but() and generally every function that results in a regex piece ending with a + or * modifier.

Could the "non-greediness" be turned on in any of these by an argument called lazy? We can even leave greedy=!lazy for more advanced users (who know what "greediness" is).

I think the more fundamental decision that we seem to have landed on is that the default verbs should be greedy (to match default perl-style regex behavior). Although I just want to drop it in here that we could think of an rx world where lazy verbs are the default and you need to turn greediness on. This is a difficult world for me to comprehend right now.

References:
Lazy vs greedy

P.S. One afterthought: when inventing new verbs, we should probably try and stay closer to the words that have been implemented in other languages of VerbalExpressions. That would honor the work done by others and make transitions between languages smoother. We are free to invent arguments, though. This means that anything(), anything_but() are here to stay.

tylerlittlefield · 2019-03-08T17:09:59Z

Good point, _lazy() isn't going to cut it. rex uses: type = c("greedy", "lazy", "possessive"), what do you think? With the prefix and constructor, it would look like:

x <- rx() %>% 
  seek_prefix("(") %>% 
  anything(type = "lazy") %>% 
  seek_suffix(")")

Regarding lazy by default, I like the idea but regular expressions have been greedy by default for so long that it may just be confusing. There is a thread about this here.

dmi3kno · 2019-03-08T17:36:03Z

LGTM. Do you want a PR?

tylerlittlefield · 2019-03-08T17:39:07Z

Up to you. I don't mind adding the changes but I'd rather let people contribute if they want. So just let me know 😄

tylerlittlefield · 2019-03-08T17:49:43Z

Oh and if you do submit a PR, please let me know what you're working on so we don't duplicate efforts. I worked a bit on adding the rx_ prefix last night so no need to work on that.

dmi3kno · 2019-03-08T18:18:00Z

I am adding lookarounds and implementing a type (lazy/greedy) argument in the relevant functions.

dmi3kno · 2019-03-08T18:23:28Z

Last question: do you think the argument should be called type or mode? Or yet something else?

tylerlittlefield · 2019-03-08T18:24:19Z

I think mode sounds better than type.
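For readers following along, a mode argument along these lines could look roughly like the sketch below. It is hypothetical and not the code from the pull request: the function name anything_sketch() is made up, and only the "greedy" and "lazy" values discussed here are handled (the "possessive" option that rex offers is left out).

```r
# Hypothetical sketch of a `mode` argument; not the merged implementation
anything_sketch <- function(.data = NULL, mode = c("greedy", "lazy")) {
  mode <- match.arg(mode)  # defaults to "greedy", errors on unknown values
  quantifier <- switch(mode, greedy = "*", lazy = "*?")
  paste0(.data, "(?:.", quantifier, ")")
}

anything_sketch()
#> [1] "(?:.*)"
anything_sketch(mode = "lazy")
#> [1] "(?:.*?)"

# Chained after an existing pattern, here a lookbehind for "("
anything_sketch("(?<=\\()", mode = "lazy")
#> [1] "(?<=\\()(?:.*?)"
```

Keeping greedy as the first choice in the vector means existing calls without a mode argument behave as before.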

tylerlittlefield added the enhancement label Mar 6, 2019

dmi3kno mentioned this issue Mar 9, 2019

Adding lazy mode, lookarounds, none_or_more, count and digits #7

Merged

tylerlittlefield closed this as completed in #7 Mar 9, 2019
