Thank you for developing this very useful package. However, I have a problem with the crawlUrlfilter argument.

From a large website, I would like to crawl and scrape only those URLs that match a specific pattern. According to the documentation, crawlUrlfilter does exactly what I am looking for.

When the pattern passed to crawlUrlfilter contains only one level of the URL, as in the following code:

Rcrawler(Website = "https://www.somewebsite.org/", crawlUrlfilter = "/article/")

I get the desired results, i.e. only those URLs that match the pattern "article", e.g.
https://www.somewebsite.org/article/sample-article-217 or
https://www.somewebsite.org/article/2019-01-20-another-example
However, when I want to filter URLs based on a pattern that spans two levels of the URL, such as:
https://www.somewebsite.org/article/news/january-2019-meeting_of_trainers or
https://www.somewebsite.org/article/news/review-of-meetup
the following code does not find any matches:
Rcrawler(Website = "https://www.somewebsite.org/", crawlUrlfilter = "/article/news")
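For what it's worth, the regular expressions themselves seem fine; a quick base-R check (run outside Rcrawler, against the example URLs above) matches as I would expect:

# Sanity check in base R: both filter patterns match the example URLs,
# so the regular expressions themselves do not seem to be the problem.
urls <- c(
  "https://www.somewebsite.org/article/sample-article-217",
  "https://www.somewebsite.org/article/2019-01-20-another-example",
  "https://www.somewebsite.org/article/news/january-2019-meeting_of_trainers",
  "https://www.somewebsite.org/article/news/review-of-meetup"
)
grepl("/article/", urls)      # TRUE TRUE TRUE TRUE
grepl("/article/news", urls)  # FALSE FALSE TRUE TRUE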
Is this a bug, or am I getting something wrong?
Following the example given in the documentation, dataUrlfilter = "/[0-9]{4}/[0-9]{2}/[0-9]{2}/", it should be no problem at all to pass an argument that contains several "/".
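And indeed, that documented date pattern behaves as an ordinary regular expression in R, so several literal "/" characters are not an issue on the regex side; a minimal check against a hypothetical date-style URL (not taken from the real site) matches fine:

# The documented date pattern is plain regex; the "/" characters are matched literally.
# The URL below is a made-up example for illustration only.
grepl("/[0-9]{4}/[0-9]{2}/[0-9]{2}/",
      "https://www.somewebsite.org/2019/01/20/some-post")  # TRUE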