Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix domain_is_in? for deeply nested subdomain #270

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

cm-dyoshikawa
Copy link
Contributor

@cm-dyoshikawa cm-dyoshikawa commented Feb 3, 2025

@micke

Hi,
Thank you for reviewing and merging my previous PR.

I found that there are some email address patterns that cannot be detected when using domain_is_in? in disposable_domain?.

For example, when disposable_domain.com is included in the blocklist:

can be detected, but:

with two or more nested subdomains cannot be detected.

Therefore, I modified domain_is_in? to detect deeply nested subdomains as well.
I've also added RSpec test cases for this.

Thanks.

@cm-dyoshikawa cm-dyoshikawa changed the title fix domain checking logic in address.rb Fix domain_is_in? for deeply nested subdomain Feb 3, 2025
Comment on lines -184 to +208
it "calls the the MX servers lookup" do
it "calls the the MX servers lookup" do
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been automatically fixed by my code editor. Removing unnecessary spaces.

@@ -119,10 +119,12 @@ def domain_is_in?(domain_list)
address_domain = address.domain.downcase
return true if domain_list.include?(address_domain)

i = address_domain.index('.')
return false unless i
Copy link
Collaborator

@phoet phoet Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmhmmm, i guess it has always been truethy? i mean domains typically contain a dot. 🤔

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@micke was it on purpose, that in any case, the TLD so com from example.com was checked against the list? i guess we can speed up the lib significantly by getting rid of this.

Copy link
Contributor Author

@cm-dyoshikawa cm-dyoshikawa Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me as well, the TLD check seems unnecessary.

Shall I remove the TLD check while we have this opportunity? I think it's possible.

Copy link
Contributor Author

@cm-dyoshikawa cm-dyoshikawa Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation might look something like this:

      domain_hierarchy_level = 0
      while (i = address_domain.index('.')) && domain_hierarchy_level < @domain_hierarchy_max_depth
        address_domain = address_domain[(i + 1)..-1]
        # Skip checking TLD (if there's no more dot, it's considered as TLD)
        break unless address_domain.include?('.')
        return true if domain_list.include?(address_domain)
        domain_hierarchy_level += 1
      end

This implementation includes this discussion.

Copy link
Collaborator

@phoet phoet Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have not tested this thoroughly, but maybe you get the basic idea here and can finish it off

    def domain_is_in?(domain_list, max_depth = 3)
      address_domain = address.domain.downcase
      return true if domain_list.include?(address_domain)

      tokens = address_domain.split('.')
      return false if tokens.length < 3

      limited_sub_domain_part = tokens.reverse.first(max_depth).reverse.join('.')

      domain_list.include?(limited_sub_domain_part)
    end

there is no iteration over parts, so this is faster and we get rid of the while.

@@ -119,10 +119,12 @@ def domain_is_in?(domain_list)
address_domain = address.domain.downcase
return true if domain_list.include?(address_domain)

i = address_domain.index('.')
return false unless i
while i = address_domain.index('.')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've seen so many dead systems because of a broken while that i would try to get rid of it in any library implementation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it would be good if this was actually a configurable option like max_depth = 3 on the domain so subdomains deeper that x would just be ignored and the rest would need to be configured individually in the disposable list?

Copy link
Contributor Author

@cm-dyoshikawa cm-dyoshikawa Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, I understand your concern. I think it's possible.

How about an interface like this?

-   def initialize(address, dns_timeout = 5, dns_nameserver = nil)
+   def initialize(address, dns_timeout = 5, dns_nameserver = nil, domain_hierarchy_max_depth: 3)
      @parse_error = false
      @raw_address = address
      @dns_timeout = dns_timeout
      @dns_nameserver = dns_nameserver
+     @domain_hierarchy_max_depth = domain_hierarchy_max_depth

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation might look something like this:

      domain_hierarchy_level = 0
      while i = address_domain.index('.') && domain_hierarchy_level < @domain_hierarchy_max_depth
        address_domain = address_domain[(i + 1)..-1]
        return true if domain_list.include?(address_domain)
        domain_hierarchy_level += 1
      end

@@ -56,6 +56,30 @@
end
end

describe "#disposable_domain?" do
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 great you added tests for this.

Copy link
Contributor Author

@cm-dyoshikawa cm-dyoshikawa Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you 🙌

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants