Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: An API to optimally encapsulate several searching algorithms and techniques, using one simple class #59649

Closed
panost opened this issue Sep 27, 2021 · 24 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime
Milestone

Comments

@panost
Copy link

panost commented Sep 27, 2021

Background and motivation

Too many overloads of searching in ReadOnlySpan<T> and yet they cover only very simple search cases. In addition to those methods, in each invocation, they calculate or examine several cases before actually find the best course of how to optimally perform the search. See for example SpanHelpers.IndexOfAny or in the same class what IndexOf does when searching for a sequence.

We need a class where not only all those cases will be calculated or examined once and used thousand or even million of times, but also a class that will support more complex searching scenarios.

API Proposal

A. Base class

public abstract class Scanner<T> {
    public abstract bool Contain( T value );
    // Actually a placeholder. This should be overridden by all implementations
    public virtual int IndexOf( ReadOnlySpan<T> span ) {
        for ( int i = 0; i < span.Length; i++ ) {
            if ( Contain( span[ i ] ) )
                return i;
        }
        return -1;
    }

    public virtual int LastIndexOf( ReadOnlySpan<T> span ) {
        for ( int i = span.Length-1; i >=0 ; i-- ) {
            if ( Contain( span[ i ] ) )
                return i;
        }

        return -1;
    }
    
    public int IndexOf( ReadOnlySpan<T> span, int index ) {
        int idx = IndexOf( span.Slice( index ) );
        if ( idx >= 0 )
            return idx + index;
        return -1;
    }
    //..More IndexOf, LastIndexOf and others follow, that actually call the virtual methods above
    
    protected virtual Scanner<T> GetInverse(){
        return new ScannerInverse( this );
    }
    
    protected Scanner<T> _inverse;
    /// <summary>
    /// Get the scanner that performs the inverse search
    /// </summary>
    public Scanner<T> Inverse {
        get => _inverse ??= GetInverse();
        protected internal set => _inverse = value;
    }
}

B. An example of a Scanner that is searching for a simple value. The example is incomplete

public class ScanOne<T> : Scanner<T> where T : struct, IEquatable<T> {
    private readonly T _value;

    public ScanOne( T value ) {
        _value = value;
    }

    public override bool Contain( T value ) {
        return _value.Equals( value );
    }

    public override int IndexOf( ReadOnlySpan<T> span ) {
        return span.IndexOf( _value );
    }

    public override int LastIndexOf( ReadOnlySpan<T> span ) {
        return span.LastIndexOf( _value );
    }

    protected override Scanner<T> GetInverse() {
        return new Inv<T>(this);
    }

    class Inv<T> : Scanner<T> where T : struct, IEquatable<T> {
        private readonly T _value;

        public override bool Contain( T value ) {
            return !_value.Equals( value );
        }
        public Inv( ScanOne<T> parent ) {
            this.Inverse = parent;
            _value = parent._value;
        }
        public override int IndexOf( ReadOnlySpan<T> span ) {
            T value=_value;
            for ( int i = 0; i < span.Length; i++ ) {
                if ( !span[ i ].Equals( value ) )
                    return i;
            }

            return -1;
        }
    }
}

C. An example of a Scanner that is using an ASCII lookup table. The example is incomplete

public class Table16CharScanner : Scanner<char> {
    private readonly ushort _mask;
    private ushort[] _table;

    public Table16CharScanner( ushort mask, ushort[] table ) {
        _mask = mask;
        _table = table;
    }
    /// <summary>
    /// Check for characters that are above 0x7F
    /// </summary>
    /// <param name="ch">character to check</param>
    /// <returns>true if is found;false otherwise</returns>
    protected internal virtual bool ContainExt( char ch ) {
        return false;
    }

   protected override Scanner<char> GetInverse() {
        return new Table16CharScannerInverse( this, _mask, _table );
    }

    public sealed override bool Contain( char value ) {
        if ( value < _table.Length ) {
            return ( _table[ value ] & _mask ) != 0;
        }
        return ContainExt( value );
    }

    public override int IndexOf( ReadOnlySpan<char> span ) {
        var mask = _mask;
        var tableLength = _table.Length;
        var table = _table;
        for ( int i = 0; i < span.Length; i++ ) {
            var ch = span[i];
            if ( ch < tableLength ) {
                if ( ( table[ ch ] & mask ) != 0 ) {
                    return i;
                }
            } else if ( ContainExt( ch ) ) {
                return i;
            }
        }
        return -1;
    }

D. ReadOnlySpan extensions

public static class ScannerExtensions {
    public int IndexOf<T>( this ReadOnlySpan<T> span, Scanner<T> scanner );
    public int LastIndexOf<T>( this ReadOnlySpan<T> span, Scanner<T> scanner );
    public ReadOnlySpan<T> Trim<T>( this ReadOnlySpan<T> span, Scanner<T> trim );
    public ReadOnlySpan<T> Skip<T>( this ReadOnlySpan<T> span, Scanner<T> scanner );
    public ReadOnlySpan<T> Remain<T>( this ReadOnlySpan<T> span, Scanner<T> scanner );
    public ReadOnlySpan<T> Trim<T>( this ReadOnlySpan<T> span, Scanner<T> trimStart, Scanner<T> trimEnd );
    public ReadOnlySpan<T> TrimStart<T>( this ReadOnlySpan<T> span, Scanner<T> trimStart );
    public ReadOnlySpan<T> TrimEnd<T>( this ReadOnlySpan<T> span, Scanner<T> trimEnd );
    public SplitIterator<T> Split<T>( this ReadOnlySpan<T> span, Scanner<T> search );
    public SplitIterator<T> Split<T>( this ReadOnlySpan<T> span, Scanner<T> search, Scanner<T> trim, bool skipEmpty );
    public int Count( this ReadOnlySpan<char> span, Scanner<T> search );
}

API Usage

ReadOnlySpan<char> span="..";
// trim Unicode whitespace
span = span.Trim( WhiteSpaceScanner );
// trim what JS spec considers white spaces from the start and WhiteSpace or ';' or ',' from the end
span = span.Trim( JavascriptWhitespace, WhiteSpaceSemicolonOrComma ); 
// skip invalid id chars, find id until first invalid
span = span.Skip(HtmlIdScanner ).Remain( HtmlIdScanner.Inverse ); 
// split on ';' or ',' and trim whitespaces for each entry
var iterator = span.Split(SemicolonOrCommaScanner,WhiteSpaceScanner); 

Risks

The current API of Regular expressions can't be optimally encapsulated by a Scanner because it requires a string as input

@panost panost added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Sep 27, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Sep 27, 2021
@ghost
Copy link

ghost commented Sep 27, 2021

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

Too many overloads of searching in ReadOnlySpan<T> and yet they cover only very simple search cases. In addition to those methods, in each invocation, they calculate or examine several cases before actually find the best course of how to optimally perform the search. See for example SpanHelpers.IndexOfAny or in the same class what IndexOf does when searching for a sequence.

We need a class where not only all those cases will be calculated or examined once and used thousand or even million of times, but also a class that will support more complex searching scenarios.

API Proposal

A. Base class

public abstract class Scanner<T> {
    public abstract bool Contain( T value );
    // Actually a placeholder. This should be overridden by all implementations
    public virtual int IndexOf( ReadOnlySpan<T> span ) {
        for ( int i = 0; i < span.Length; i++ ) {
            if ( Contain( span[ i ] ) )
                return i;
        }
        return -1;
    }

    public virtual int LastIndexOf( ReadOnlySpan<T> span ) {
        for ( int i = span.Length-1; i >=0 ; i-- ) {
            if ( Contain( span[ i ] ) )
                return i;
        }

        return -1;
    }
    
    public int IndexOf( ReadOnlySpan<T> span, int index ) {
        int idx = IndexOf( span.Slice( index ) );
        if ( idx >= 0 )
            return idx + index;
        return -1;
    }
    //..More IndexOf, LastIndexOf and others follow, that actually call the virtual methods above
    
    protected virtual Scanner<T> GetInverse(){
        return new ScannerInverse( this );
    }
    
    protected Scanner<T> _inverse;
    /// <summary>
    /// Get the scanner that performs the inverse search
    /// </summary>
    public Scanner<T> Inverse {
        get => _inverse ??= GetInverse();
        protected internal set => _inverse = value;
    }
}

B. An example of a Scanner that is searching for a simple value. The example is incomplete

public class ScanOne<T> : Scanner<T> where T : struct, IEquatable<T> {
    private readonly T _value;

    public ScanOne( T value ) {
        _value = value;
    }

    public override bool Contain( T value ) {
        return _value.Equals( value );
    }

    public override int IndexOf( ReadOnlySpan<T> span ) {
        return span.IndexOf( _value );
    }

    public override int LastIndexOf( ReadOnlySpan<T> span ) {
        return span.LastIndexOf( _value );
    }

    protected override Scanner<T> GetInverse() {
        return new Inv<T>(this);
    }

    class Inv<T> : Scanner<T> where T : struct, IEquatable<T> {
        private readonly T _value;

        public override bool Contain( T value ) {
            return !_value.Equals( value );
        }
        public Inv( ScanOne<T> parent ) {
            this.Inverse = parent;
            _value = parent._value;
        }
        public override int IndexOf( ReadOnlySpan<T> span ) {
            T value=_value;
            for ( int i = 0; i < span.Length; i++ ) {
                if ( !span[ i ].Equals( value ) )
                    return i;
            }

            return -1;
        }
    }
}

C. An example of a Scanner that is using an ASCII lookup table. The example is incomplete

public class Table16CharScanner : Scanner<char> {
    private readonly ushort _mask;
    private ushort[] _table;

    public Table16CharScanner( ushort mask, ushort[] table ) {
        _mask = mask;
        _table = table;
    }
    /// <summary>
    /// Check for characters that are above 0x7F
    /// </summary>
    /// <param name="ch">character to check</param>
    /// <returns>true if is found;false otherwise</returns>
    protected internal virtual bool ContainExt( char ch ) {
        return false;
    }

   protected override Scanner<char> GetInverse() {
        return new Table16CharScannerInverse( this, _mask, _table );
    }

    public sealed override bool Contain( char value ) {
        if ( value < _table.Length ) {
            return ( _table[ value ] & _mask ) != 0;
        }
        return ContainExt( value );
    }

    public override int IndexOf( ReadOnlySpan<char> span ) {
        var mask = _mask;
        var tableLength = _table.Length;
        var table = _table;
        for ( int i = 0; i < span.Length; i++ ) {
            var ch = span[i];
            if ( ch < tableLength ) {
                if ( ( table[ ch ] & mask ) != 0 ) {
                    return i;
                }
            } else if ( ContainExt( ch ) ) {
                return i;
            }
        }
        return -1;
    }

D. ReadOnlySpan extensions

public static class ScannerExtensions {
    public int IndexOf<T>( this ReadOnlySpan<T> span, Scanner<T> scanner );
    public int LastIndexOf<T>( this ReadOnlySpan<T> span, Scanner<T> scanner );
    public ReadOnlySpan<T> Trim<T>( this ReadOnlySpan<T> span, Scanner<T> trim );
    public ReadOnlySpan<T> Skip<T>( this ReadOnlySpan<T> span, Scanner<T> scanner );
    public ReadOnlySpan<T> Remain<T>( this ReadOnlySpan<T> span, Scanner<T> scanner );
    public ReadOnlySpan<T> Trim<T>( this ReadOnlySpan<T> span, Scanner<T> trimStart, Scanner<T> trimEnd );
    public ReadOnlySpan<T> TrimStart<T>( this ReadOnlySpan<T> span, Scanner<T> trimStart );
    public ReadOnlySpan<T> TrimEnd<T>( this ReadOnlySpan<T> span, Scanner<T> trimEnd );
    public SplitIterator<T> Split<T>( this ReadOnlySpan<T> span, Scanner<T> search );
    public SplitIterator<T> Split<T>( this ReadOnlySpan<T> span, Scanner<T> search, Scanner<T> trim, bool skipEmpty );
    public int Count( this ReadOnlySpan<char> span, Scanner<T> search );
}

API Usage

ReadOnlySpan<char> span="..";
// trim Unicode whitespace
span = span.Trim( WhiteSpaceScanner );
// trim what JS spec considers white spaces from the start and WhiteSpace or ';' or ',' from the end
span = span.Trim( JavascriptWhitespace, WhiteSpaceSemicolonOrComma ); 
// skip invalid id chars, find id until first invalid
span = span.Skip(HtmlIdScanner ).Remain( HtmlIdScanner.Inverse ); 
// split on ';' or ',' and trim whitespaces for each entry
var iterator = span.Split(SemicolonOrCommaScanner,WhiteSpaceScanner); 

Risks

The current API of Regular expressions can't be optimally encapsulated by a Scanner because it requires a string as input

Author: panost
Assignees: -
Labels:

api-suggestion, area-System.Runtime, untriaged

Milestone: -

@Joe4evr
Copy link
Contributor

Joe4evr commented Sep 28, 2021

The current API of Regular expressions can't be optimally encapsulated by a Scanner because it requires a string as input

Clearly you missed the efforts of span-ifying the Regex API.

@jeffhandley jeffhandley added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Sep 30, 2021
@jeffhandley jeffhandley added this to the Future milestone Sep 30, 2021
@panost
Copy link
Author

panost commented Feb 6, 2023

It seems that is partially implemented under a new name IndexOfAnyValues here #68328. As I can see, it can only have internal implementations, so no custom optimizations such as lookup tables, are supported. I do not know how can be effectively used in cases like those in #78204

@MihaZupan
Copy link
Member

As I can see, it can only have internal implementations, so no custom optimizations such as lookup tables, are supported.

What sort of sets of values were you hoping to see optimized? The IndexOfAnyValues implementation can do much better than a scalar loop around a table lookup in most cases.

I do not know how can be effectively used in cases like those in #78204

Can you elaborate? The issue links to a dozen PRs doing just that

@panost
Copy link
Author

panost commented Feb 7, 2023

Lexer just found a letter and wants to execute an IndexOf first non-valid identifier to locate the end of the identifier. In this case invalid identifier characters are all except A-Z, a-z, _ in the ASCII range. In Unicode range all except letters and digits.
All those rules for the ASCII range are encoded in a lookup table. The same table includes other scan rules, such as valid number characters or white space characters.

If the parser is SQL then use a table that includes the '$' as a valid identifier character, if it is CSS use a lookup table that includes '-'.

Which PR covers such a usage ?
To be clear I want three instances of IndexOfAnyValues to cover identifier scanning for Js, SQL and CSS, so that i can use selectively in a lexer

@MihaZupan
Copy link
Member

You could do something like

private static readonly IndexOfAnyValues<char> s_validJsIdentifierChars = IndexOfAnyValues.Create("abc");
private static readonly IndexOfAnyValues<char> s_validSQLIdentifierChars = IndexOfAnyValues.Create("def");
private static readonly IndexOfAnyValues<char> s_validCSSIdentifierChars = IndexOfAnyValues.Create("ghi");

private static int JsIdentifierLength(ReadOnlySpan<char> text)
{
    int i = text.IndexOfAnyExcept(s_validJsIdentifierChars);
    return i < 0 ? text.Length : i;
}

private static int SQLIdentifierLength(ReadOnlySpan<char> text)
{
    int i = text.IndexOfAnyExcept(s_validSQLIdentifierChars);
    return i < 0 ? text.Length : i;
}

// or a shared thing like
private static int IdentifierLength(ReadOnlySpan<char> text, IndexOfAnyValues<char> identifierChars)
{
    int i = text.IndexOfAnyExcept(identifierChars);
    return i < 0 ? text.Length : i;
}

You can essentially treat the IndexOfAnyValues instance as your lookup table.

@panost
Copy link
Author

panost commented Feb 7, 2023

  1. Where is the support for the cases outside of the ASCII range? I'll admit it is rare case, but still you have to be consistent. To create a string with length > 60,000, so you can use the IndexOfAnyValues.Create is out of the question!
  2. The IndexOfAnyValues.Create when more than 5 chars are used (in my case around 50 chars, just for the ASCII) is finally using an instance of the IndexOfAnyCharValuesProbabilistic class which is a generic Bloom filter and far more slower than a lookup in a table

@MihaZupan
Copy link
Member

MihaZupan commented Feb 7, 2023

We're open to adding more specialized implementations as scenarios come up, within reason.
For example, if you have 50 ascii chars, the implementation will end up using this.
For the probabilistic map (what you'd end up getting when using a mix of ascii and non-ascii values), we can vectorize the non-except case - #80963.

It may make sense to add another implementation that really is just a lookup table to handle the -Except case of the probabilistic map, or for large sets of values where the bloom filter isn't as effective.
How many characters per set are you looking at in practice?

To create a string with length > 60,000, so you can use the IndexOfAnyValues.Create is out of the question!

If you have over 60k characters, you can invert the condition and create the lookup with 65536 - 60000 characters to make it a bit more manageable. Keep in mind that you don't have to use a constant there, you can generate the set at runtime.

cc: @stephentoub

@stephentoub
Copy link
Member

It seems the feature being requested already exists in the form of the new IndexOfAnyValues. What exactly is the feature missing from an API perspective? As Miha cites, we can add more implementations under the covers as need demonstrates. If the remaining concern from an API perspective is simply lack of extensibility, we've discussed that and opted to keep it closed for now because the extensibility point you'd need to implement buys you very little: you'd need to implement effectively all of {Last}IndexOfAny{Except}, at which point a consumer can just use your APIs directly... the only the thing the abstraction would buy you is if you'd expect to e.g. publish a library containing your custom implementations for others to consume generically, but a key aspect of the design here isn't just the implementations but also the logic for choosing the optimal implementation to use. If in the future we saw it would really be beneficial to allow IndexOfAnyValues to be publicly extended, we could do so then, but we don't want to do so now.

@panost
Copy link
Author

panost commented Feb 7, 2023

My first objection is: If I have to write some generic method (or class) that involves searching, for example splitting at some separators but also trim each segment, I have to use two instances of scanners, one that finds the splitting characters and generates string segments and a second one that trims each segment before return them.

What are my options?
Use the IndexOfAnyValues? It is optimal for many cases, but not extensible, meaning that if I have a case of a splitter or a trimmer that uses some unsupported algorithm (Boyer-Moor for example or the one I mention above using a lookup table) I have to write it again or I have to write a generic class Scanner<T> that wraps IndexOfAnyValues for the cases that is optimal and provide custom implementation for the rest.

My second objection is the IndexOfAnyExcept : I cannot pass to a generic search method an instance of IndexOfAnyValues and an additional parameter that says, "do not use IndexOf but IndexOfAnyExcept" because this instance does find all white spaces, but I want you to search for the opposite.
My proposal was that IndexOfAnyValues has an Inverse property that lazily constructs another IndexOfAnyValues instance that does the opposite search. In some algorithms the optimal inverse searching requires a totally different approach with different pre-computed tables.
Also the Inverse can be used by the same method. In the example of the splitter above, I can use the scanner to find the first delimiter and the Inverse scanner of it, to skip any subsequent character that are delimiters (same that StringSplitOptions.RemoveEmptyEntries does)

@stephentoub
Copy link
Member

I have to use two instances of scanners, one that finds the splitting characters and generates string segments and a second one that trims each segment before return them

Why? Do you mean because of X and InverseX? That's why there are both IndexOfAny and IndexOfAnyExcept methods, and there's nothing that requires them to use the same data. The IndexOfAnyValues.Create method can generate whatever state it needs to generate to make all of the exposed operations as fast as possible. For example, the IndexOfAnyAsciiCharValues implementation computes both a vectorization bitmap as well as a scalar lookup table, where the former is the main workhorse behind APIs like IndexOfAny and the latter is the main workhorse behind Contains. If there were a more efficient representation that could be created to support the other operations, it could do that, too. Currently there isn't and so it hasn't.

but not extensible, meaning that if I have a case of a splitter or a trimmer that uses some unsupported algorithm (Boyer-Moor for example or the one I mention above using a lookup table) I have to write it again or I have to write a generic class Scanner that wraps IndexOfAnyValues for the cases that is optimal and provide custom implementation for the rest.

I already addressed that. If we allowed for external extensibility, we'd just be making the handful of entry point methods abstract/virtual. There's no other boilerplate that's saved; the public IndexOfAny(..., IndexOfAnyValues<T>) just delegates to those entry point methods. On top of that, we wouldn't be able to somehow change IndexOfAnyValues.Create to factor custom implementations into the logic it uses to choose which to construct. So even if we made those public, the only thing it buys you is if you want to be able to create some reusable IndexOfAnyValues<T> with your own implementation and pass that off to someone via the base type, e.g. if you wanted to create your own MySuperDuperIndexOfAnyValuesCreator.Create that could special-case for your additional cases and otherwise delegate to Create. There is possibly some benefit there, but we've discussed it multiple times and have opted not to expose that extensibility now. We can in the future if sufficient demand arises. And in the meantime, we can add additional implementations under IndexOfAnyValues.Create where it yields measurable-enough benefit. If there's a specific case you're hoping to see there that can be optimized more than what already is, please feel free to open an issue suggesting such an implementation. It doesn't require a separate public abstract API.

I cannot pass to a generic search method an instance of IndexOfAnyValues and an additional parameter that says, "do not use IndexOf but IndexOfAnyExcept" because this instance does find all white spaces, but I want you to search for the opposite.

Why do you want to restrict it in that manner? Are you concerned about construction overhead necessary to create state to optimize IndexOfAny but not IndexOfAnyExcept?

@panost
Copy link
Author

panost commented Feb 8, 2023

Why?

Because that's the requirements of this method, requires one instance of a IndexOfAnyValues for example to split the string for every ',' or ';' character, but before return the segment must trim the white-spaces using another instance of IndexOfAnyValues. Let's say in that white-spaces is what Javascript considers as white-space (which is different on what filesystem, XML, C# or HTML considers as white-space)

And in the meantime, we can add additional implementations under IndexOfAnyValues.Create where it yields measurable-enough benefit.

Ok, I have to find and read those suggestions and I'll be back

Why do you want to restrict it in that manner?

I wasn't clear, let's have a method that accepts one IndexOfAnyValues instance. This method makes repeatable calls to IndexOf method using a buffered TextStream, some options and configuration and yields some results. There is complex logic in there, in buffering, indexing etc.

I have an instance of IndexOfAnyValues named UriSafeScanner that finds Uri-Safe-characters, how can I call this method to find me the inverse, Non-Uri-Safe characters? I can't. I have to create a new instance named UriNonSafeScanner that does the inverse search and provide some logic to synchronize those two, to avoid overlapping . In my proposal I can call the method, easily just using UriSafeScanner.Inverse property.

@panost
Copy link
Author

panost commented Feb 8, 2023

@MihaZupan I somehow missed reading your answer

if you have 50 ascii chars, the implementation will end up using this.

The algorithm above is :

  • Is the next input character in ASCII range (< 127)?
  • Yes: then make a check using the lookup table, only 50 slots (characters) in that table have the flag set to true. This is the most common case 99.9% of the cases
  • No: then make a slow virtual check with the Unicode character. The code there may check the Unicode category of the character or perform some dictionary search and return true or false

So IndexOfAnyAsciiSearcher will not return the correct value, because in the 0.1% of the cases will return false

@MihaZupan
Copy link
Member

IndexOfAnyAsciiSearcher will only be used if all your values are in the ASCII range.

Currently, the implementation used for mixed sets of ascii and non-ascii values will use the probabilistic map.

If mixed inputs are common enough, we can consider adding an internal implementation that does something like what you describe but vectorized. But to Stephen's point, that's an implementation-only change that doesn't impact the public API.

I have an instance of IndexOfAnyValues named UriSafeScanner that finds Uri-Safe-characters, how can I call this method to find me the inverse, Non-Uri-Safe characters? I can't. I have to create a new instance named UriNonSafeScanner that does the inverse search and provide some logic to synchronize those two, to avoid overlapping . In my proposal I can call the method, easily just using UriSafeScanner.Inverse property.

Just to confirm, the only thing you wish to see from an API perspective is

public class IndexOfAnyValues<T>
{
    public IndexOfAnyValues<T> Inverse { get; }
}

?

I am not sure how common that would be, but if you believe it's important feel free to open a new API proposal issue to discuss just that part (it wasn't obvious to me that's a significant part of the current proposal).

If it's common in your scenario to reuse the same helper for searching for the inverse, why not add that as an option that you use to decide which method to use?

public static int MyComplexIndexOf(ReadOnlySpan<char> data, IndexOfAnyValues<char> values, bool inverse)
{
   // ...
   int i = inverse ? data.IndexOfAnyExcept(values) : data.IndexOfAny(values);
   // ...
}

Alternatively, the pattern we sometimes use in the runtime is to make use of static abstract methods to deduplicate implementations.

public static int MyComplexIndexOf(ReadOnlySpan<char> data, IndexOfAnyValues<char> values) =>
    MyComplexIndexOf<DontNegate>(data, values);

public static int MyComplexInverseIndexOf(ReadOnlySpan<char> data, IndexOfAnyValues<char> values) =>
    MyComplexIndexOf<Negate>(data, values);

private static int MyComplexIndexOf<TNegator>(ReadOnlySpan<char> data, IndexOfAnyValues<char> values)
    where TNegator : struct, INegator
{
   // ...
   int i = TNegator.IndexOfAny(data, values);
   // ...
}

internal interface INegator
{
    static abstract int IndexOfAny(ReadOnlySpan<char> source, IndexOfAnyValues<char> values);
}

internal readonly struct DontNegate : INegator
{
    public static int IndexOfAny(ReadOnlySpan<char> source, IndexOfAnyValues<char> values) =>
        source.IndexOfAny(values);
}

internal readonly struct Negate : INegator
{
    public static int IndexOfAny(ReadOnlySpan<char> source, IndexOfAnyValues<char> values) =>
        source.IndexOfAnyExcept(values);
}

@panost
Copy link
Author

panost commented Feb 8, 2023

I find this pattern confusing, non-optimal and misses many opportunities for pre-calculations

For example if I read correctly here in the main loop of IndexOfAny4Values at SpanHelpers.LastIndexOfAnyValueType.

equals = TNegator.NegateIfNeeded(Vector128.Equals(current, values0) | Vector128.Equals(current, values1) | Vector128.Equals(current, values2)
                        | Vector128.Equals(current, values3) | Vector128.Equals(current, values4));

It combines results with binary OR, which is ok for the Negate case, but is non an optimal approach for the normal case

@MihaZupan
Copy link
Member

misses many opportunities for pre-calculations

What do you mean by that?
IndexOfAnyValues will pre-calculate anything it deems beneficial to speed up both IndexOfAny and IndexOfAnyExcept.

It combines results with binary OR, which is ok for the Negate case, but is non an optimal approach for the normal case

It is not optimal in what way?

@panost
Copy link
Author

panost commented Feb 9, 2023

I mean they are algorithms that can benefit from computed lookups, a computation that is performed once on ctor. The exact algorithm for searching needs some research. An example of precomputed lookups is here in Base64 Vectorized decoding here.

Another is this search (old non-vectorized) that computes some seach masks each time is called

internal static unsafe char* IndexOfAny( char* scan, int length, char c1, char c2 ) {
// char* scan = basePtr + index;
char ch;
if ( length > 12 ) {
    int idx = ( 8 - ( (int)scan & 7 ) ) / 2;
    for ( ; idx > 0; length--, idx--, scan++ ) {
        ch = *scan;
        if ( ch == c1 || ch == c2 ) {
            return scan;
        }
    }
    ulong controlValue1 = ( (uint)c1 << 16 ) | c1;
    controlValue1 |= controlValue1 << 32;
    ulong controlValue2 = ( (uint)c2 << 16 ) | c2;
    controlValue2 |= controlValue2 << 32;
    const ulong Magic = 0x7FFEFFFEFFFEFFFF;
    const ulong Mask = 0x8001000100010000;
    for ( ; length > 3; length -= 4, scan += 4 ) {
        ulong next4Chars = *( (ulong*)scan );
        ulong temp = next4Chars ^ controlValue1;
        temp = ~temp ^ ( Magic + temp );

        if ( ( temp & Mask ) != 0 ) {
            //idx = (int)( scan - basePtr );
            ch = *scan;
            if ( ch == c1 || ch == c2 ) {
                return scan;
            }
            ch = *( ++scan );
            if ( ch == c1 || ch == c2 ) {
                return scan;
            }
            ch = *( ++scan );
            if ( ch == c1 || ch == c2 ) {
                return scan;
            }
            ch = *( ++scan );
            if ( ch == c1 || ch == c2 ) {
                return scan;
            }
            scan -= 3;

        }
        temp = next4Chars ^ controlValue2;
        temp = ~temp ^ ( Magic + temp );
        if ( ( temp & Mask ) != 0 ) {
            //idx = (int)( scan - basePtr );
            if ( *scan == c2 ) {
                return scan;
            }
            if ( *( ++scan ) == c2 ) {
                return scan;
            }
            if ( *( ++scan ) == c2 ) {
                return scan;
            }
            if ( *( ++scan ) == c2 ) {
                return scan;
            }
            scan -= 3;
        }
    }
}

Perhaps this can be vectorized also and keep the vectorized lookup values in a private field in the class

This point is one of the three reasons to justify, "why we need a second instance of IndexOfAnyValues for the Negate search, since there is a IndexOfAnyExcept method"

It is not optimal in what way?

That if the first value matches, there is no need to check the others. Something like short circuit OR (boolean OR). Here it does a binary OR, which is needed only in the Negate searching

@MihaZupan
Copy link
Member

MihaZupan commented Feb 9, 2023

I mean they are algorithms that can benefit from computed lookups, a computation that is performed once on ctor.
This point is one of the three reasons to justify, "why we need a second instance of IndexOfAnyValues for the Negate search, since there is a IndexOfAnyExcept method"

I don't understand this. IndexOfAnyValues already precomputes the data it needs once. What do you think is missing?

You are expected to keep and reuse the IndexOfAnyValues instance. It will store any needed data in its fields.

Perhaps this can be vectorized also and keep the vectorized lookup values in a private field in the class

Helpers like this are already vectorized in the runtime. You can use them by calling span.IndexOfAny('a', 'b') -- there's no need to roll your own.

That if the first value matches, there is no need to check the others. Something like short circuit OR (boolean OR). Here it does a binary OR, which is needed only in the Negate searching

That is not how SIMD works - it would be inefficient to check each value on its own compared to just doing 4 checks upfront (you would need more instructions, more branching ...).

@panost
Copy link
Author

panost commented Feb 9, 2023

You are expected to keep and reuse the IndexOfAnyValues instance. It will store any needed data in its fields.

Yes, my point is that the IndexOfAnyValues stores precalculated values in its fields. The Inverse class of this, also needs it's own precalculated fields. Sometimes the Inverse precalculation may be radical different and has different algorithm and fields.

But the main point why an Inverse (or Negate) class is needed is not that, but how it is used, the consuming methods that accept an IndexOfAnyValues argument. Passing an additional argument to the method to specify if they should call IndexOf or IndexOfExcept is counter-intuitive and of course slower. If you pass to the consumer method just the Inverse instance of IndexOfAnyValues, it is easier and faster.

there's no need to roll your own.

It was an old example (is not used anymore) of an algorithm, to demonstrate what I mean by precalculation and why you need to store it in fields

That is not how SIMD works - it would be inefficient to check each value on its own compared to just doing 4 checks upfront (you would need more instructions, more branching ...).

You imply that this statement is one a simple SIMD instruction? Because is not that simple

@MihaZupan
Copy link
Member

MihaZupan commented Feb 9, 2023

Yes, my point is that the IndexOfAnyValues stores precalculated values in its fields. The Inverse class of this, also needs it's own precalculated fields. Sometimes the Inverse precalculation may be radical different and has different algorithm and fields.

If it is beneficial to store different data or use a different approach for IndexOfAnyExcept, the IndexOfAnyValues instance can store that data as well. It's not limited to using exactly the same algorithm for the inverse.

That is not how SIMD works - it would be inefficient to check each value on its own compared to just doing 4 checks upfront (you would need more instructions, more branching ...).

You imply that this statement is one a simple SIMD instruction? Because is not that simple

Yes, it's not that simple :) You're looking at debug codegen for a method where the vectors are parameters.

If you instead look at the actual codegen for span.IndexOfAny("abcd"), you will see something like
this: https://gist.github.com/MihaZupan/56033bb7f3cc47c9fe6c6a55241ffad6.
With the main loop that processes 32 characters at a time looking like

M01_L08:
       vmovups   ymm4,[rdx]
       vpcmpeqw  ymm5,ymm0,ymm4
       vpcmpeqw  ymm6,ymm1,ymm4
       vpor      ymm5,ymm5,ymm6
       vpcmpeqw  ymm6,ymm2,ymm4
       vpor      ymm5,ymm5,ymm6
       vpcmpeqw  ymm4,ymm3,ymm4
       vpor      ymm4,ymm5,ymm4
       vptest    ymm4,ymm4
       jne       short M01_L09
       add       rdx,20
       cmp       rdx,rax
       jbe       short M01_L08

Each comparison is 1 and each OR is one instruction.

Checking each individual comparison result to return early would require a bunch of extra code and branches.

@panost
Copy link
Author

panost commented Feb 9, 2023

Yes, good point on Debug codegen :)

Since you consider the use of an Inverse class for IndexOfAnyValues redundant and IndexOfAnyExcept an intuitive approach, for all usages and consumers, I am closing this.

@panost panost closed this as completed Feb 9, 2023
@En3Tho
Copy link
Contributor

En3Tho commented Feb 9, 2023

@panost just noting that your sharplab link is pointing to x86 debug assembly code.

@panost
Copy link
Author

panost commented Feb 10, 2023

@En3Tho Yes, I copied the link before selecting Release & x64. Sorry. The correct link is here

Still not a representative use case, because I the pass the Vectors as arguments, missing the opportunity for further optimizations. The code in @MihaZupan's comment above, shows what actually happens

@ghost ghost locked as resolved and limited conversation to collaborators Mar 12, 2023
@tannergooding tannergooding removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Jun 24, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime
Projects
None yet
Development

No branches or pull requests

8 participants