Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nucleo-matcher: Normalization is incomplete. #51

Open
markus-bauer opened this issue Oct 4, 2024 · 0 comments
Open

nucleo-matcher: Normalization is incomplete. #51

markus-bauer opened this issue Oct 4, 2024 · 0 comments

Comments

@markus-bauer
Copy link

Thanks for this crate. But I had some issues with normalization.

Here are some random examples from the polish alphabet, that don't work as expected; there might be more issues, though.

use nucleo_matcher::pattern::{AtomKind, CaseMatching, Normalization};

fn main() {
    // Example: https://en.wikipedia.org/wiki/Polish_alphabet
    let tests = [
        ('ą', 'a'),
        ('ć', 'c'),
        ('ę', 'e'),
        ('ł', 'l'),
        ('ń', 'n'),
        ('ó', 'o'),
        ('ś', 's'),
        ('ź', 'z'),
        ('ż', 'z'),
        ('Ą', 'A'),
        ('Ć', 'C'),
        ('Ę', 'E'),
        ('Ł', 'L'),
        ('Ń', 'N'),
        ('Ó', 'O'),
        ('Ś', 'S'),
        ('Ź', 'Z'),
        ('Ż', 'Z'),
    ];

    // test normalization funcion
    println!("--- normalize ---");
    for (c, expected) in tests {
        let normalized = nucleo_matcher::chars::normalize(c);
        if normalized == expected {
            println!("okay: {} -> {}.", c, normalized);
        } else {
            println!("fail: {} -> {}. expected {}", c, normalized, expected);
            let c_lower = c.to_lowercase().next().unwrap();
            let expected_lower = expected.to_lowercase().next().unwrap();
            let normalized_lower = nucleo_matcher::chars::normalize(c_lower);
            if normalized_lower == expected_lower {
                println!(" but lowercase works: {} -> {}", c_lower, normalized_lower);
            }
        }
    }

    // test matcher
    println!("--- matcher ---");
    let mut matcher = nucleo_matcher::Matcher::new(nucleo_matcher::Config::DEFAULT);

    for (c, expected) in tests {
        // search for c in expected
        let haystack = nucleo_matcher::Utf32String::from(c.to_string());
        let atom = nucleo_matcher::pattern::Atom::new(
            &expected.to_string(),
            CaseMatching::Smart,
            Normalization::Smart,
            AtomKind::Exact,
            false,
        );

        if atom.score(haystack.slice_u32(..), &mut matcher).is_some() {
            println!("okay: {}", c)
        } else {
            println!("fail: {} not in {}", atom.needle_text(), haystack)
        }
    }
}
--- normalize ---
okay: ą -> a.
okay: ć -> c.
okay: ę -> e.
okay: ł -> l.
okay: ń -> n.
okay: ó -> o.
okay: ś -> s.
okay: ź -> z.
okay: ż -> z.
fail: Ą -> Ą. expected A
 but lowercase works: ą -> a
fail: Ć -> Ć. expected C
 but lowercase works: ć -> c
fail: Ę -> Ę. expected E
 but lowercase works: ę -> e
fail: Ł -> Ł. expected L
 but lowercase works: ł -> l
fail: Ń -> Ń. expected N
 but lowercase works: ń -> n
okay: Ó -> O.
fail: Ś -> Ś. expected S
 but lowercase works: ś -> s
fail: Ź -> Ź. expected Z
 but lowercase works: ź -> z
fail: Ż -> Ż. expected Z
 but lowercase works: ż -> z
--- matcher ---
okay: ą
okay: ć
okay: ę
okay: ł
okay: ń
okay: ó
okay: ś
okay: ź
okay: ż
fail: A not in Ą
fail: C not in Ć
fail: E not in Ę
fail: L not in Ł
fail: N not in Ń
okay: Ó
fail: S not in Ś
fail: Z not in Ź
fail: Z not in Ż

Also, I think you might have made some copy-paste/codegen mistakes (here and in other places):

('\u{1E37}', 'l'), // WITH DOT BELOW, LATIN SMALL LETTER ('\u{1E3B}', 'l'), // WITH LINE BELOW, LATIN SMALL LETTER
('\u{1E3D}', 'l'), // WITH CIRCUMFLEX BELOW, LATIN SMALL LETTER
('\u{1E3F}', 'm'), // WITH ACUTE, LATIN SMALL LETTER
('\u{1E41}', 'm'), // WITH DOT ABOVE, LATIN SMALL LETTER
('\u{1E43}', 'm'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1E45}', 'n'), // WITH DOT ABOVE, LATIN SMALL LETTER
('\u{1E47}', 'n'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1E49}', 'n'), // WITH LINE BELOW, LATIN SMALL LETTER
('\u{1E4B}', 'n'), // WITH CIRCUMFLEX BELOW, LATIN SMALL LETTER
('\u{1E55}', 'p'), // WITH ACUTE, LATIN SMALL LETTER
('\u{1E57}', 'p'), // WITH DOT ABOVE, LATIN SMALL LETTER ('\u{1E59}', 'r'), // WITH DOT ABOVE, LATIN SMALL LETTER
('\u{1E5B}', 'r'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1E5F}', 'r'), // WITH LINE BELOW, LATIN SMALL LETTER
('\u{1E61}', 's'), // WITH DOT ABOVE, LATIN SMALL LETTER
('\u{1E63}', 's'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1E6B}', 't'), // WITH DOT ABOVE, LATIN SMALL LETTER
('\u{1E6D}', 't'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1E6F}', 't'), // WITH LINE BELOW, LATIN SMALL LETTER
('\u{1E71}', 't'), // WITH CIRCUMFLEX BELOW, LATIN SMALL LETTER
('\u{1E73}', 'u'), // WITH DIAERESIS BELOW, LATIN SMALL LETTER
('\u{1E75}', 'u'), // WITH TILDE BELOW, LATIN SMALL LETTER ('\u{1E77}', 'u'), // WITH CIRCUMFLEX BELOW, LATIN SMALL LETTER
('\u{1E7D}', 'v'), // WITH TILDE, LATIN SMALL LETTER
('\u{1E7F}', 'v'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1E81}', 'w'), // WITH GRAVE, LATIN SMALL LETTER
('\u{1E83}', 'w'), // WITH ACUTE, LATIN SMALL LETTER
('\u{1E85}', 'w'), // WITH DIAERESIS, LATIN SMALL LETTER ('\u{1E87}', 'w'), // WITH DOT ABOVE, LATIN SMALL LETTER
('\u{1E89}', 'w'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1E8B}', 'x'), // WITH DOT ABOVE, LATIN SMALL LETTER
('\u{1E8D}', 'x'), // WITH DIAERESIS, LATIN SMALL LETTER
('\u{1E8F}', 'y'), // WITH DOT ABOVE, LATIN SMALL LETTER
('\u{1E91}', 'z'), // WITH CIRCUMFLEX, LATIN SMALL LETTER ('\u{1E93}', 'z'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1E95}', 'z'), // WITH LINE BELOW, LATIN SMALL LETTER
('\u{1E96}', 'h'), // WITH LINE BELOW, LATIN SMALL LETTER
('\u{1E97}', 't'), // WITH DIAERESIS, LATIN SMALL LETTER
('\u{1E98}', 'w'), // WITH RING ABOVE, LATIN SMALL LETTER
('\u{1E99}', 'y'), // WITH RING ABOVE, LATIN SMALL LETTER
('\u{1E9A}', 'a'), // WITH RIGHT HALF RING, LATIN SMALL LETTER
('\u{1E9B}', 's'), // WITH DOT ABOVE, LATIN SMALL LETTER LONG
('\u{1EA1}', 'a'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1EA3}', 'a'), // WITH HOOK ABOVE, LATIN SMALL LETTER
('\u{1EB9}', 'e'), // WITH DOT BELOW, LATIN SMALL LETTER ('\u{1EBB}', 'e'), // WITH HOOK ABOVE, LATIN SMALL LETTER
('\u{1EBD}', 'e'), // WITH TILDE, LATIN SMALL LETTER
('\u{1EC9}', 'i'), // WITH HOOK ABOVE, LATIN SMALL LETTER
('\u{1ECB}', 'i'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1ECD}', 'o'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1ECF}', 'o'), // WITH HOOK ABOVE, LATIN SMALL LETTER
('\u{1EE5}', 'u'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1EE7}', 'u'), // WITH HOOK ABOVE, LATIN SMALL LETTER
('\u{1EF3}', 'y'), // WITH GRAVE, LATIN SMALL LETTER
('\u{1EF5}', 'y'), // WITH DOT BELOW, LATIN SMALL LETTER
('\u{1EF7}', 'y'), // WITH HOOK ABOVE, LATIN SMALL LETTER ('\u{1EF9}', 'y'), // WITH TILDE, LATIN SMALL LETTER

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant