-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix domain_is_in?
for deeply nested subdomain
#270
base: main
Are you sure you want to change the base?
Conversation
domain_is_in?
for deeply nested subdomain
it "calls the the MX servers lookup" do | ||
it "calls the the MX servers lookup" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been automatically fixed by my code editor. Removing unnecessary spaces.
@@ -119,10 +119,12 @@ def domain_is_in?(domain_list) | |||
address_domain = address.domain.downcase | |||
return true if domain_list.include?(address_domain) | |||
|
|||
i = address_domain.index('.') | |||
return false unless i |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmhmmm, i guess it has always been truethy? i mean domains typically contain a dot. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@micke was it on purpose, that in any case, the TLD so com
from example.com
was checked against the list? i guess we can speed up the lib significantly by getting rid of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me as well, the TLD check seems unnecessary.
Shall I remove the TLD check while we have this opportunity? I think it's possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation might look something like this:
domain_hierarchy_level = 0
while (i = address_domain.index('.')) && domain_hierarchy_level < @domain_hierarchy_max_depth
address_domain = address_domain[(i + 1)..-1]
# Skip checking TLD (if there's no more dot, it's considered as TLD)
break unless address_domain.include?('.')
return true if domain_list.include?(address_domain)
domain_hierarchy_level += 1
end
This implementation includes this discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have not tested this thoroughly, but maybe you get the basic idea here and can finish it off
def domain_is_in?(domain_list, max_depth = 3)
address_domain = address.domain.downcase
return true if domain_list.include?(address_domain)
tokens = address_domain.split('.')
return false if tokens.length < 3
limited_sub_domain_part = tokens.reverse.first(max_depth).reverse.join('.')
domain_list.include?(limited_sub_domain_part)
end
there is no iteration over parts, so this is faster and we get rid of the while.
@@ -119,10 +119,12 @@ def domain_is_in?(domain_list) | |||
address_domain = address.domain.downcase | |||
return true if domain_list.include?(address_domain) | |||
|
|||
i = address_domain.index('.') | |||
return false unless i | |||
while i = address_domain.index('.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've seen so many dead systems because of a broken while
that i would try to get rid of it in any library implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it would be good if this was actually a configurable option like max_depth = 3
on the domain so subdomains deeper that x would just be ignored and the rest would need to be configured individually in the disposable list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx, I understand your concern. I think it's possible.
How about an interface like this?
- def initialize(address, dns_timeout = 5, dns_nameserver = nil)
+ def initialize(address, dns_timeout = 5, dns_nameserver = nil, domain_hierarchy_max_depth: 3)
@parse_error = false
@raw_address = address
@dns_timeout = dns_timeout
@dns_nameserver = dns_nameserver
+ @domain_hierarchy_max_depth = domain_hierarchy_max_depth
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation might look something like this:
domain_hierarchy_level = 0
while i = address_domain.index('.') && domain_hierarchy_level < @domain_hierarchy_max_depth
address_domain = address_domain[(i + 1)..-1]
return true if domain_list.include?(address_domain)
domain_hierarchy_level += 1
end
@@ -56,6 +56,30 @@ | |||
end | |||
end | |||
|
|||
describe "#disposable_domain?" do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 great you added tests for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you 🙌
@micke
Hi,
Thank you for reviewing and merging my previous PR.
I found that there are some email address patterns that cannot be detected when using
domain_is_in?
indisposable_domain?
.For example, when
disposable_domain.com
is included in the blocklist:foo@disposable_domain.com
[email protected]_domain.com
can be detected, but:
[email protected]_domain.com
with two or more nested subdomains cannot be detected.
Therefore, I modified
domain_is_in?
to detect deeply nested subdomains as well.I've also added RSpec test cases for this.
Thanks.