Add lookarounds #3
Comments
Suggested functions (need good unit testing):

step_ahead <- function(.data = NULL) {
  # Proceed only if .data ends in a non-capturing group, optionally followed by "?"
  val <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*(?=\\)\\??$)", .data, perl = TRUE))
  if (!length(val)) return(.data)
  # Text after the last "(?:" through the end of the string
  post <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*$", .data, perl = TRUE))
  # Everything before the last "(?:"
  pre <- regmatches(.data, regexpr("^.*(?=\\(\\?\\:[^\\(\\?\\:]*$)", .data, perl = TRUE))
  # Reopen the trailing group as a positive lookbehind
  paste0(pre, "(?<=", post)
}

step_back <- function(.data = NULL) {
  # Same detection as step_ahead()
  val <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*(?=\\)\\??$)", .data, perl = TRUE))
  if (!length(val)) return(.data)
  post <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*$", .data, perl = TRUE))
  pre <- regmatches(.data, regexpr("^.*(?=\\(\\?\\:[^\\(\\?\\:]*$)", .data, perl = TRUE))
  # Reopen the trailing group as a positive lookahead
  paste0(pre, "(?=", post)
}

These functions boil down to trailing back and modifying previous
Since

x <- find("(", step = 'after') %>% # or find("(", step=1)
  begin_capture() %>%
  anything() %>%
  find(")", step = 'before') %>% # or find(")", step=-1)
  end_capture()
x
#> [1] "((?<=\\()(?:.*)(?=\\)))"

This does not cover negative lookarounds. The water is getting pretty deep already with my complex regex, so perhaps implementing (all kinds of) lookarounds would be easier as |
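A quick check of `step_ahead()` as posted may help scope the unit tests asked for above. The sketch below copies the function unchanged and runs it on two hand-written patterns; the inputs `"(?:USD)"` and `"(?:\\()"` are illustrative assumptions, not cases taken from the thread.

```r
# step_ahead() exactly as posted in the comment above.
step_ahead <- function(.data = NULL) {
  val <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*(?=\\)\\??$)", .data, perl = TRUE))
  if (!length(val)) return(.data)
  post <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*$", .data, perl = TRUE))
  pre <- regmatches(.data, regexpr("^.*(?=\\(\\?\\:[^\\(\\?\\:]*$)", .data, perl = TRUE))
  paste0(pre, "(?<=", post)
}

# Assumed input: a trailing group with a plain body.
plain <- step_ahead("(?:USD)")

# Assumed input: a trailing group holding an escaped parenthesis.
# The character class [^(?:] cannot span the "(" inside the group,
# so the detection step finds nothing and the input comes back as is.
escaped <- step_ahead("(?:\\()")

print(plain)
print(escaped)
```

The first call yields `(?<=USD)`, while the second returns its input unchanged. That suggests sanitized values containing `(`, `?` or `:` deserve their own test cases.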
Thanks for all the suggestions, this is awesome. I like your idea on adding an additional step argument to modify
I think the current regex in your example

x <- find(value = "(", step = 'forward') %>%
  anything_but(")") %>%
  find(")", step = 'backward')
stringr::str_extract_all("(extract) foo (me)", x)
#> [[1]]
#> [1] "extract" "me"

Or with a greedy argument in

z <- find(value = "(", step = "forward") %>%
  anything(greedy = FALSE) %>%
  find(")", step = "backward")
z
#> [1] "(?<=\\()(?:.*?)(?=\\))"
stringr::str_extract_all("(extract) foo (me)", z)
#> [[1]]
#> [1] "extract" "me"

Reproducible example:

# in case you want to copy paste and run the example above
library(dplyr)
sanitize <- function(.data) {
  escape_chrs <- c(".", "|", "*", "?", "+", "(", ")", "{", "}", "^", "$", "\\", ":", "=", "[", "]")
  string_chrs <- strsplit(.data, "")[[1]]
  idx <- which(string_chrs %in% escape_chrs)
  idx_new <- paste0("\\", string_chrs[idx])
  paste0(replace(string_chrs, idx, idx_new), collapse = "")
}
# add greedy arg
anything <- function(.data = NULL, greedy = TRUE) {
  if (isTRUE(greedy)) {
    paste0(.data, "(?:.*)")
  } else if (isFALSE(greedy)) {
    paste0(.data, "(?:.*?)")
  }
}
# add step arg
find <- function(.data = NULL, value, step = NULL) {
  if (is.null(step)) {
    paste0(.data, "(?:", sanitize(value), ")")
  } else if (step == "forward") {
    paste0(.data, "(?<=", sanitize(value), ")")
  } else if (step == "backward") {
    paste0(.data, "(?=", sanitize(value), ")")
  }
}
} |
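One thing the `find()` sketch above leaves open is what happens when `step` is something other than `"forward"` or `"backward"`: the function falls through and returns `NULL`. A minimal variant, assuming that failing loudly is preferable, could validate the argument with `match.arg()`. The name `find_checked()` and the `"none"` default are hypothetical and not part of the thread; `sanitize()` is copied from the example above so the block runs on its own, and base R is used in place of stringr for the extraction.

```r
# sanitize() as given in the reproducible example above.
sanitize <- function(.data) {
  escape_chrs <- c(".", "|", "*", "?", "+", "(", ")", "{", "}", "^", "$", "\\", ":", "=", "[", "]")
  string_chrs <- strsplit(.data, "")[[1]]
  idx <- which(string_chrs %in% escape_chrs)
  idx_new <- paste0("\\", string_chrs[idx])
  paste0(replace(string_chrs, idx, idx_new), collapse = "")
}

# Hypothetical variant of find(): `step` is limited to three known values,
# and match.arg() raises an error for anything else.
find_checked <- function(.data = NULL, value, step = c("none", "forward", "backward")) {
  step <- match.arg(step)
  opener <- switch(step, none = "(?:", forward = "(?<=", backward = "(?=")
  paste0(.data, opener, sanitize(value), ")")
}

# Rebuild the lazy pattern from the example, without the pipe.
z <- find_checked(value = "(", step = "forward")
z <- paste0(z, "(?:.*?)")
z <- find_checked(z, value = ")", step = "backward")

text <- "(extract) foo (me)"
hits <- regmatches(text, gregexpr(z, text, perl = TRUE))[[1]]
print(z)
print(hits)
```

With this variant a typo such as `step = "forwards"` stops with an error instead of silently dropping the pattern built so far.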
I think
I think

seek_prefix <- function(.data = NULL, value) {
  paste0(.data, "(?<=", sanitize(value), ")")
}
seek_suffix <- function(.data = NULL, value) {
  paste0(.data, "(?=", sanitize(value), ")")
}
avoid_prefix <- function(.data = NULL, value) {
  paste0(.data, "(?<!", sanitize(value), ")")
}
avoid_suffix <- function(.data = NULL, value) {
  paste0(.data, "(?!", sanitize(value), ")")
}

I also think that an exact number of repetitions can be expressed as

count <- function(.data = NULL, n = 1) {
  paste0(.data, "{", n, "}")
}

Here are some unit tests for lookarounds, all returning a single value:

# positive lookahead
x <- start_of_line() %>%
  digit() %>% count(3) %>%
  seek_suffix(" dollars")
x
stringr::str_extract_all("100 dollars", x)
# negative lookahead
x <- start_of_line() %>%
  digit() %>% count(3) %>%
  avoid_suffix(" dollars")
x
stringr::str_extract_all("100 pesos", x)
# positive lookbehind
x <- seek_prefix(value = "USD") %>%
  digit() %>% count(3)
x
stringr::str_extract_all("USD100", x)
# negative lookbehind
x <- avoid_prefix(value = "USD") %>%
  digit() %>% count(3)
x
stringr::str_extract_all("JPY100", x)

Finally, as Hadley suggested, you need to start thinking about a prefix for the function names to avoid namespace collisions. I suggest we go for |
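The tests above do not show their expected results, so here is a rough base R cross-check of the four lookaround forms. It assumes the pipelines render to patterns like the hand-written ones below (for instance that `start_of_line()` gives `^` and `digit() %>% count(3)` gives `\d{3}`); that rendering is a guess, and `first_match()` is a hypothetical helper used only for this sketch.

```r
# Return the first match of `pattern` in `x`, or character(0) if there is none.
first_match <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

# Hand-written stand-ins for what the four pipelines are assumed to produce.
pos_ahead  <- "^\\d{3}(?= dollars)"  # seek_suffix(" dollars")
neg_ahead  <- "^\\d{3}(?! dollars)"  # avoid_suffix(" dollars")
pos_behind <- "(?<=USD)\\d{3}"       # seek_prefix("USD")
neg_behind <- "(?<!USD)\\d{3}"       # avoid_prefix("USD")

# Each pattern on the input from the tests above, then on a counterexample.
print(first_match(pos_ahead, "100 dollars"))
print(first_match(pos_ahead, "100 pesos"))
print(first_match(neg_ahead, "100 pesos"))
print(first_match(neg_ahead, "100 dollars"))
print(first_match(pos_behind, "USD100"))
print(first_match(pos_behind, "JPY100"))
print(first_match(neg_behind, "JPY100"))
print(first_match(neg_behind, "USD100"))
```

Each pattern returns `"100"` on its matching input and an empty character vector on the counterexample, which is the kind of paired assertion a unit test could encode.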
Thanks for this! One thing to keep in mind is that the following currently exist:
Where:

# matches anything, including nothing i.e. an empty character
anything()
#> [1] "(?:.*)"
anything_but(value = "foo")
#> [1] "(?:[^foo]*)"
# matches something, excluding nothing
something()
#> [1] "(?:.+)"
something_but(value = "foo")
#> [1] "(?:[^foo]+)"
grepl(anything(), "")
#> [1] TRUE
grepl(something(), "")
#> [1] FALSE

I like the idea of
This would create 3 options for each, a total of 6 functions:
You could then do something like:

something_lazy <- function(.data = NULL) {
  paste0(.data, "(?:.+?)")
}
anything_lazy <- function(.data = NULL) {
  paste0(.data, "(?:.*?)")
}
x <- seek_prefix(value = "(") %>%
  something_lazy() %>%
  seek_suffix(")")
x
#> [1] "(?<=\\()(?:.+?)(?=\\))"
stringr::str_extract_all("(extract) foo (me) then anything ()", x)
#> [[1]]
#> [1] "extract" "me"
y <- seek_prefix(value = "(") %>%
  anything_lazy() %>%
  seek_suffix(")")
y
#> [1] "(?<=\\()(?:.*?)(?=\\))"
stringr::str_extract_all("(extract) foo (me) then anything ()", y)
#> [[1]]
#> [1] "extract" "me" ""

Not sure if this is the right way, but for whatever reason
Also, the lookaround functions and |
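To make the greedy versus lazy distinction concrete, the sketch below runs both quantifiers between the same lookarounds using base R only. The patterns are written by hand rather than built with the verbs, and `extract_all()` is a hypothetical helper for this illustration.

```r
# Return every match of `pattern` in the single string `x`.
extract_all <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

text <- "(extract) foo (me)"

# Same lookbehind and lookahead, differing only in the quantifier.
greedy <- "(?<=\\()(?:.*)(?=\\))"
lazy   <- "(?<=\\()(?:.*?)(?=\\))"

greedy_hits <- extract_all(greedy, text)
lazy_hits <- extract_all(lazy, text)

print(greedy_hits)
print(lazy_hits)
```

The greedy pattern runs to the last closing parenthesis and returns the single string `extract) foo (me`, while the lazy one stops at the first closing parenthesis each time and returns `extract` and `me`.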
After some thinking I agree that introducing more synonyms for non-greedy variants of existing functions is a bad idea (I like the implicit "laziness" of
Having said that, I don't like the
Could the "non-greediness" be turned on in any of these by an argument called
I think the more fundamental decision that we seem to have landed on is that the default verbs should be greedy (to match default Perl-style regex behavior). Although I just want to drop it in here that we could think of a
References:
P.S. One afterthought: when inventing new verbs, we should probably try and stay closer to the words that have been implemented in other languages of |
Good point,

x <- rx() %>%
  seek_prefix("(") %>%
  anything(type = "lazy") %>%
  seek_suffix(")")

Regarding lazy by default, I like the idea but regular expressions have been greedy by default for so long that it may just be confusing. There is a thread about this here. |
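The call `anything(type = "lazy")` above implies a signature that the thread never spells out. One possible shape, offered only as a sketch, is shown below: the function name `anything_typed()` and the accepted values `"greedy"` and `"lazy"` are assumptions, with greedy kept as the default to match the discussion.

```r
# Hypothetical anything() with a `type` argument; greedy stays the default.
anything_typed <- function(.data = NULL, type = c("greedy", "lazy")) {
  type <- match.arg(type)
  quantifier <- switch(type, greedy = "*", lazy = "*?")
  paste0(.data, "(?:.", quantifier, ")")
}

# Default call, explicit lazy call, and a lazy call appended to a prefix.
default_pattern <- anything_typed()
lazy_pattern <- anything_typed(type = "lazy")
chained <- anything_typed("(?<=\\()", type = "lazy")

print(default_pattern)
print(lazy_pattern)
print(chained)
```

Using a character argument rather than a logical flag leaves room for further modes later without changing the signature.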
LGTM. Do you want a PR? |
Up to you. I don't mind adding the changes but I'd rather let people contribute if they want. So just let me know 😄 |
Oh and if you do submit a PR, please let me know what you're working on so we don't duplicate efforts. I worked a bit on adding the |
I am adding lookarounds and implementing |
Last question: do you think the argument should be called |
I think |
Add ways to express lookarounds. This was brought up by @dmi3kno and he mentioned a pretty intuitive way using step_ahead() and step_back()
Source: https://twitter.com/dmi3k/status/1103401979152355328