Add Cursively to the benchmark suite. #7
Conversation
At the moment, we're just using Dictionary<ReadOnlyMemory<byte>, string> for encoding.
- Instead of the naive hash code, use XXH64. XXH3 would be better, since most of these keys are very short, but I happen to have an implementation of XXH64 lying around.
- Reimplement a pool of encoded strings using Sylvan.StringPool as a base.
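To make the second bullet concrete, here is a rough sketch of what a UTF-8 pool along those lines might look like. The `Utf8StringPool` name and the use of the `XxHash64` type from the System.IO.Hashing package are placeholders of mine, not the actual benchmark implementation (which uses a hand-rolled XXH64 port with Sylvan.StringPool as a base):

```csharp
using System;
using System.Collections.Generic;
using System.IO.Hashing; // XxHash64 lives in the System.IO.Hashing NuGet package
using System.Text;

// Rough sketch of a UTF-8 string pool keyed on an XXH64 hash of the raw bytes.
// Placeholder code for illustration, not the actual benchmark implementation.
internal sealed class Utf8StringPool
{
    // hash -> every pooled string whose UTF-8 bytes hashed to that value (collisions are possible)
    private readonly Dictionary<ulong, List<string>> _buckets = new();

    public string GetString(ReadOnlySpan<byte> utf8)
    {
        ulong hash = XxHash64.HashToUInt64(utf8);

        if (!_buckets.TryGetValue(hash, out var bucket))
        {
            _buckets[hash] = bucket = new List<string>(1);
        }

        foreach (string candidate in bucket)
        {
            // the hash only narrows it down; verify that the actual bytes match
            if (Utf8Equals(candidate, utf8))
            {
                return candidate;
            }
        }

        string created = Encoding.UTF8.GetString(utf8);
        bucket.Add(created);
        return created;
    }

    private static bool Utf8Equals(string candidate, ReadOnlySpan<byte> utf8)
    {
        if (Encoding.UTF8.GetByteCount(candidate) != utf8.Length)
        {
            return false;
        }

        Span<byte> buffer = utf8.Length <= 256 ? stackalloc byte[256] : new byte[utf8.Length];
        int written = Encoding.UTF8.GetBytes(candidate, buffer);
        return buffer[..written].SequenceEqual(utf8);
    }
}
```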
Side note, it would be interesting to see how everything stacks up when you add the
Haha, I didn't even realize there was a Reddit post! Great discussion, especially between you and @MarkPflug. BTW, your implementation is awesome!

I'm curious whether xxHash is really the best choice for hash coding here. Have you compared its performance to .NET's newer built-in hashing support?

I have been thinking about adding a data mapping benchmark with the same data set.
Unrelated, I have a nice list of non-cryptographic hashing functions for .NET in my repo: https://github.com/jzabroski/Home/#non-cryptographic-hashing-functions
Also, it's a bit off-topic, but I have been meaning to add an answer to https://stackoverflow.com/questions/102742/why-is-397-used-for-resharper-gethashcode-override/
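For context, the 397 in that question is the odd-prime multiplier that ReSharper puts in its generated GetHashCode overrides. A from-memory sketch of that pattern, using a made-up `Person` type, looks roughly like this:

```csharp
// Made-up type, just to show the familiar ReSharper-generated pattern the question asks about:
// combine field hashes with an odd prime multiplier (397) inside unchecked so overflow wraps.
public sealed class Person
{
    public string Name { get; }
    public int Age { get; }

    public Person(string name, int age)
    {
        Name = name;
        Age = age;
    }

    public override int GetHashCode()
    {
        unchecked
        {
            int hashCode = Name != null ? Name.GetHashCode() : 0;
            hashCode = (hashCode * 397) ^ Age;
            return hashCode;
        }
    }
}
```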
I went with XXH64 just because I had a port of it lying around. I think XXH3 has properties that would make it favorable for this particular data set (in particular, it is supposedly much faster at dealing with short keys, of which this data set has a lot); I just didn't want to bother with all the work of porting it for what would likely be only a marginal improvement. If you want to get really lost in this, check out this blog post written by the author of XXH3 and, I think, of XXH32 and XXH64.
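If the porting effort ever becomes the blocker, one off-the-shelf route (an assumption on my part, not what this repo currently uses) is the System.IO.Hashing NuGet package, which ships managed XxHash64 and, in its 8.0+ versions, XxHash3:

```csharp
using System;
using System.IO.Hashing; // NuGet: System.IO.Hashing (XxHash3 requires 8.0.0 or later)
using System.Text;

// Sketch only: hashing a short UTF-8 key with the packaged XXH64 and XXH3 implementations
// instead of maintaining a hand-rolled port. Not what the benchmark currently does.
internal static class HashDemo
{
    private static void Main()
    {
        ReadOnlySpan<byte> key = Encoding.UTF8.GetBytes("TUVALU");

        ulong xxh64 = XxHash64.HashToUInt64(key);
        ulong xxh3 = XxHash3.HashToUInt64(key); // XXH3 is tuned to be faster on short inputs

        Console.WriteLine($"XXH64: {xxh64:x16}  XXH3: {xxh3:x16}");
    }
}
```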
airbreather/Cursively#21 / airbreather/Cursively#22 are there to layer more conventional ways of consuming the data on top of this, but it's not a high priority for me because it's actually a lot of work to flip the loop on its head, and I think there are plenty of other libraries that would fit better if that's how you wanted the data anyway. The main reason to do that layering would be to offer a way to use a single library for all your CSV processing, instead of having to choose Cursively for cases where runtime efficiency matters more than developer productivity and some other library for cases where that's reversed. That said, .NET itself is filled with "start with this productive approach, drop down to more efficient code as needed" APIs, so it's not out of the question. If the issue is just the work of accumulating the individual field callbacks into whole records, there's a middle ground.
For example, we could have a visitor base class where all you have to implement is a method like:

```csharp
protected override void VisitRecord(ReadOnlySpan<byte> fullLineData, ReadOnlySpan<int> firstIndex)
{
    // firstIndex[n] is the offset of the nth field's first byte in fullLineData;
    // there's one extra slot at the end that always holds the value of fullLineData.Length.
    for (int i = 0; i < firstIndex.Length - 1; i++)
    {
        ReadOnlySpan<byte> field = fullLineData[firstIndex[i]..firstIndex[i + 1]];
        DoStuffWith(field);
    }
}
```

To completely get away from users having to implement a visitor, I think it would be reasonably doable to write something that converts this into a more conventional pull-style enumeration.
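Here's a rough sketch of how that record-at-a-time base class might be layered on top of the existing push-style visitor. It assumes the chunk-oriented callbacks on `CsvReaderVisitorBase` are named `VisitPartialFieldContents` / `VisitEndOfField` / `VisitEndOfRecord`; the `CsvRecordVisitorBase` name and the buffer-growth details are made up for illustration:

```csharp
using System;
using Cursively;

// Sketch of a base class that buffers one record's bytes and field offsets,
// then hands the whole record to a single VisitRecord callback. Not part of Cursively today.
public abstract class CsvRecordVisitorBase : CsvReaderVisitorBase
{
    private byte[] _lineData = new byte[1024];
    private int[] _firstIndex = new int[16];
    private int _byteCount;
    private int _fieldCount;

    public override void VisitPartialFieldContents(ReadOnlySpan<byte> chunk) => Append(chunk);

    public override void VisitEndOfField(ReadOnlySpan<byte> chunk)
    {
        Append(chunk);

        // record where the *next* field will start; the extra slot doubles as fullLineData.Length
        if (_fieldCount + 1 >= _firstIndex.Length)
        {
            Array.Resize(ref _firstIndex, _firstIndex.Length * 2);
        }

        _firstIndex[++_fieldCount] = _byteCount;
    }

    public override void VisitEndOfRecord()
    {
        VisitRecord(_lineData.AsSpan(0, _byteCount), _firstIndex.AsSpan(0, _fieldCount + 1));
        _byteCount = 0;
        _fieldCount = 0;
    }

    // subclasses only implement this
    protected abstract void VisitRecord(ReadOnlySpan<byte> fullLineData, ReadOnlySpan<int> firstIndex);

    private void Append(ReadOnlySpan<byte> chunk)
    {
        if (_byteCount + chunk.Length > _lineData.Length)
        {
            Array.Resize(ref _lineData, Math.Max(_lineData.Length * 2, _byteCount + chunk.Length));
        }

        chunk.CopyTo(_lineData.AsSpan(_byteCount));
        _byteCount += chunk.Length;
    }
}
```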
Anything that doesn't strictly require UTF-16 is always a plus for Cursively. Of course, the majority of the .NET ecosystem is built around using UTF-16 for strings, and although there's been lots of great progress towards making UTF-8 more than just "that thing that we have to keep converting to and from when we interact with the rest of the world", UTF-8 still feels like a second-class citizen.
My Cursively library is a lot different from most others. Rather than work on a stream of UTF-16 encoded `char` values, it works directly on the input bytes, leaving it up to the caller to choose how to process the individual fields.

It produces its output in an unusual way as well: rather than something like `IEnumerable<string>` (or similar), it requires the caller to provide a series of callbacks (exposed as a visitor-pattern implementation) that accept slices of the original input.

As a result, Cursively has the potential to be the fastest CSV implementation that I've seen, by far, if you are willing to spend the extra effort to integrate it with your application, and you don't need too much of your data to be UTF-16 encoded (since you'd have to transcode the individual fields one-by-one, which is not very CPU cache friendly).
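To make that consumption model concrete, here is a hedged sketch of processing a CSV with a visitor. The `CsvSyncInput` entry point and the callback names are taken from my reading of the README, so treat the exact identifiers as assumptions:

```csharp
using System;
using System.IO;
using System.Text;
using Cursively;

// Sketch only: a visitor that prints each field as it arrives. For simplicity it transcodes
// each chunk separately, which could mangle a multi-byte character split across chunks;
// that's fine for a demo but not how a production consumer would do it.
internal sealed class FieldPrintingVisitor : CsvReaderVisitorBase
{
    private readonly StringBuilder _currentField = new();

    public override void VisitPartialFieldContents(ReadOnlySpan<byte> chunk) =>
        _currentField.Append(Encoding.UTF8.GetString(chunk));

    public override void VisitEndOfField(ReadOnlySpan<byte> chunk)
    {
        _currentField.Append(Encoding.UTF8.GetString(chunk));
        Console.Write($"[{_currentField}] ");
        _currentField.Clear();
    }

    public override void VisitEndOfRecord() => Console.WriteLine();
}

internal static class Program
{
    private static void Main()
    {
        byte[] csv = Encoding.UTF8.GetBytes("id,name\n1,foo\n2,bar\n");
        using var stream = new MemoryStream(csv);

        // Cursively pushes raw UTF-8 slices from the stream into the visitor's callbacks.
        CsvSyncInput.ForStream(stream).Process(new FieldPrintingVisitor());
    }
}
```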
Since practically every member in this benchmark must be fully hydrated as separate UTF-16 `string` instances, for unknown reasons, I had originally just written a couple of comments on the Reddit thread where your post was linked, and then gave it a rest...

...until @MarkPflug pinged me a while later with his own benchmark suite and tipped me off that the interesting benchmarks in yours actually have enough duplicated strings that a string pool makes a big difference.

So I went ahead and tweaked his string pool to work with a UTF-8 source (and to use XXH64 for hashing, since I had an implementation lying around from when I got bored one day), and now I'm getting some pretty good timings with Cursively, even though "transcode everything to individual UTF-16 `string` instances" is roughly the worst-case scenario for Cursively when it's compared against solutions that are optimized to assume that this is what you need.