Fix bug with an escaped slash #23
Thanks for your interest! C++ REP parsing is kind of a niche thing, and it's always a fun surprise when someone finds this little corner of the world. I'm not sure I understand this use case -- is the goal to treat My other concern is the use of
Thank you for your response! Yes, performance decreases dramatically, and I should find a better solution. And thank you for pointing out ./bench, I totally missed it. This case appears to be important:
E.g. we have rule So, in the current implementation, http://example.com/about is allowed, but http://example.com/about/ is not.
Interesting! I suppose I missed that section of the RFC initially. The code internal to Originally, I thought it would be possible that code could be adapted or the interface of Still, it's something to investigate, and that
This is a very interesting edge case. I'm not sure exactly what the best way to handle this is, but the recommendation by @dlecocq seems like a reasonable approach to investigate as a first pass. Alternatively, you could try doing the substitution manually with direct character replacements rather than the regex replace, which is probably doing multiple string allocations.
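To make the "direct character replacement" suggestion concrete, here is a minimal sketch of a single-pass substitution that avoids `std::regex`. The helper name `replace_escaped_slash` and its placeholder parameter are hypothetical and not part of this library; the sketch only assumes that the goal is to swap each `%2F` or `%2f` for one placeholder character before the URL is escaped.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption, not part of the library): returns a copy
// of `input` in which every "%2F" or "%2f" is replaced by `placeholder`.
// The string is scanned once and the result buffer is reserved up front, so at
// most one allocation is needed for the output.
std::string replace_escaped_slash(const std::string& input, char placeholder)
{
    std::string result;
    result.reserve(input.size());

    for (std::size_t i = 0; i < input.size(); ++i)
    {
        // Match '%', '2', then 'F' or 'f', checking bounds first.
        if (input[i] == '%' && i + 2 < input.size() && input[i + 1] == '2'
            && (input[i + 2] == 'F' || input[i + 2] == 'f'))
        {
            result.push_back(placeholder);
            i += 2;  // Skip the "2F" that was just consumed.
        }
        else
        {
            result.push_back(input[i]);
        }
    }

    return result;
}
```

Whether this is actually faster than the regex version would need to be confirmed with ./bench; it also does not address the separate question of which placeholder character is safe to use.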
@@ -82,6 +83,14 @@ namespace Rep

std::string Agent::escape(const std::string& query)
{
return Url::Url(query).defrag().escape().fullpath();
std::regex escaped_slash ("%2[Ff]");
std::regex escaped_newline ("\n");
Using a newline character for this seems like a gross hack rather than a long-term solution. I'd really prefer a cleaner fix.
After thinking about this issue a little more, the 1996 RFC for REP does not actually support wildcards in the path. Wildcard support in REP was added later by Yahoo, Google, and other search engines, and I find no mention of special treatment of the escaped slash character in Google's spec, so this particular edge case is odd in that it falls between the public standards for REP.
This PR fixes incorrect checking of Disallow: /*%2F rule