Redesign with a better design #31
Hi, thanks for filing this issue! I'm always looking for ways to improve faster's user-facing API. Here are some quick thoughts on some of the recommendations:
I'd argue that this is a good thing, as it forces users of the "manual" API to unroll their partial-vector behaviour. I'm concerned that something like
While I can imagine this would be a reasonable pattern in a more sophisticated vector ISA, I can't think of why one would want to do this with, say, AVX. Do you have an algorithm in mind which could make use of this, or a motivation for using this instead of upcasting the u8s before zipping them?
I remember this definitely not making sense when I added the zipping API, but I think our internal abstractions have improved since then. We can probably do this.
SIMD iterators are usually slice iterators, are they not? I see no harm in exposing this, because there is no situation in which we don't know it.
This is a very good idea, but I wonder if it would cause x86 perf to regress significantly for smaller collections, as their misaligned load penalty is usually minuscule.
The SIMDArray abstraction is used to good effect in our strided iterators, so I'm hesitant to remove it. I'm in the process of removing UnsafeIterator, but I have to nail down a few perf regressions on SSE systems before pushing it upstream.
I'd like to wait until specialization is fully stable before I start using it in core parts of the library.
Well, for `map()` operations there is no need to branch: you just do the operation on the partial vector and pass the number of valid elements through. For reduce operations (e.g. summing all elements) and for collection, there could indeed be an inefficiency, although if `Partial` is represented as a Full/Partial enum, then the iterator will return `Option<Partial>`, and Rust's niche optimizations should coalesce the `Option` and `Partial` discriminant bytes, allowing `Some(Full(_))` to be checked with a single test (though LLVM would somehow need to be told that this is the most likely case).

However, there is a more general problem: a branch is needed on every element to test whether we are at the end of the iterator, and this happens both in the current design and with `Partial`. The solution is to have, potentially in addition to `Partial`-returning iterators, iterators that return a "chunk" type consisting of 2^N vectors and an item count (this could be either a `Partial<[VecType; 16]>`, extending `Partial` and `Packed` to support arrays, or an ad-hoc type). The chunk type would have `map()` and reduce functions that special-case the "full" case, calling the provided closure in an unrolled loop, and using a normal loop otherwise (in the reduce case, an extra closure to handle the final partial chunk would probably be needed). The slice iterator adapter would likewise make a single check for whether `chunk_size` elements are available and then load them all in an unrolled loop. (This design could also allow aligning chunks to the cache-line size instead of just the vector size, although I think that's probably useless and in fact harmful, since it might cause `zip()` to incur extra realignment overhead.)
So you would write something like `iter.simd_chunk_iterator().map(|x| x.map(|y| my_func(y)))` or `iter.simd_chunk_iterator().reduce(|a, x| x.reduce(a, |y| my_func_on_vec(a, y), |y| my_func_on_partial(a, y)))`, and possibly have `simd_map()` and `simd_reduce()` helper methods that reduce this to a single call, avoiding the inner map/reduce calls.
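To make the idea concrete, here is a minimal sketch of such a chunk type. The names (`Chunk`, `map`, `is_full`), the lane/vector counts, and the representation of a vector as a plain `[f32; 4]` are illustrative assumptions, not faster's actual API:

```rust
// Hypothetical sketch of the proposed chunk type: 2^N vectors plus a
// count of valid scalar elements. A vector is modeled as a plain
// [f32; 4] so the sketch is self-contained.

const LANES: usize = 4; // lanes per vector
const VECS: usize = 4;  // vectors per chunk (2^N)

struct Chunk {
    vecs: [[f32; LANES]; VECS],
    len: usize, // valid scalar elements, <= LANES * VECS
}

impl Chunk {
    fn is_full(&self) -> bool {
        self.len == LANES * VECS
    }

    /// Apply `f` to every vector. The full case is a fixed-trip-count
    /// loop the compiler can unroll; the partial case only touches the
    /// vectors that contain valid elements.
    fn map<F: Fn([f32; LANES]) -> [f32; LANES]>(mut self, f: F) -> Chunk {
        if self.is_full() {
            for v in self.vecs.iter_mut() {
                *v = f(*v);
            }
        } else {
            // Partial tail: map only the vectors that hold valid
            // elements; a real implementation would mask the last one.
            let used = (self.len + LANES - 1) / LANES;
            for v in self.vecs[..used].iter_mut() {
                *v = f(*v);
            }
        }
        self
    }
}
```

The `is_full()` branch happens once per chunk rather than once per vector, which is the point of the design.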
Well, in faster you don't know how many elements are in a `u8s` or `f32s` type, and there's no relationship between them, because e.g. `u8s` changes between AVX1-only and AVX2 targets while `f32s` doesn't (or at least it would, if faster had proper support for AVX1-only CPUs). So I don't think there is a way to "upcast" them without specifically providing a helper to do so, and to determine the type of the result. What is probably needed is a `WidthType` associated type in `Packable` that acts as a lightweight type-level integer, so one could write something like `VectorTypeWithWidth<u8, MinWidth<u8s::WidthType, f32s::WidthType>::WidthType>` to get a vector of `u8` with the minimum of the widths of `u8s` and `f32s`.

Regarding `zip()`, the same number of elements must be zipped together so that SIMD width stays completely abstracted. So if you are zipping a `u8x16` and an `f32x4`, you somehow need to convert the `u8x16` iterator into one returning either a series of `u8x4`s or a series of partial `u8x16`s; hence it makes sense for `zip()` to apply such an adapter itself, so everything "just works" regardless of actual vector sizes. This might also mean adding support for "reduced vectors", e.g. a type that contains an `f32x4` of which only 2 elements are valid (this differs from `Partial` in that the number of valid elements is a compile-time constant); using `f32x2` could be an option, but would probably lead to suboptimal code where it's not natively supported.
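A small sketch of the rate-matching adapter described above: zipping a `u8x16` against `f32x4`s requires splitting each `u8x16` into four `u8x4` pieces so that both sides advance by the same number of elements. Vectors are modeled as plain arrays, and all names here (`split_u8x16`, `zip_rate_matched`) are illustrative, not faster's API:

```rust
// Split one u8x16 into four u8x4 pieces so it can be zipped
// element-for-element against f32x4 vectors.
fn split_u8x16(v: [u8; 16]) -> [[u8; 4]; 4] {
    let mut out = [[0u8; 4]; 4];
    for (i, piece) in v.chunks_exact(4).enumerate() {
        out[i].copy_from_slice(piece);
    }
    out
}

/// Zip one u8x16 against four f32x4 vectors at a matching element rate.
fn zip_rate_matched(
    bytes: [u8; 16],
    floats: [[f32; 4]; 4],
) -> Vec<([u8; 4], [f32; 4])> {
    split_u8x16(bytes)
        .iter()
        .copied()
        .zip(floats.iter().copied())
        .collect()
}
```

A real `zip()` would insert this adapter automatically, so the user never sees the width mismatch.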
Well, you can convert any iterator returning a scalar type into a SIMD iterator (by reading scalars and packing them into a vector), so SIMD iterators are just as generic as normal iterators, and there doesn't seem to be any reason to privilege slice iterators. Of course, an extra SIMD "random access"/"index" iterator trait can and probably should be provided, although I'm not sure whether adapters like `map`/`zip`/etc. should preserve it as well: normal iterators don't do that, and it might not be very useful anyway, since recomputing values when random access references a single index multiple times is probably bad, so collecting is probably better.
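The scalar-to-SIMD lifting mentioned above can be sketched as a generic adapter that buffers `LANES` scalars into a vector per step. The names (`PackIter`) and the `[f32; LANES]` vector model are assumptions for illustration:

```rust
// Lift any scalar iterator into a SIMD-style iterator by packing
// LANES scalars into a vector; the last item may be partial.

const LANES: usize = 4;

struct PackIter<I: Iterator<Item = f32>> {
    inner: I,
}

impl<I: Iterator<Item = f32>> Iterator for PackIter<I> {
    // (vector, number of valid lanes)
    type Item = ([f32; LANES], usize);

    fn next(&mut self) -> Option<Self::Item> {
        let mut v = [0.0f32; LANES];
        let mut n = 0;
        while n < LANES {
            match self.inner.next() {
                Some(x) => {
                    v[n] = x;
                    n += 1;
                }
                None => break,
            }
        }
        if n == 0 { None } else { Some((v, n)) }
    }
}
```

This works for any source iterator, which is the argument against special-casing slice iterators, though a slice-backed source can of course load whole vectors directly instead of one scalar at a time.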
I think a design without specialization should be possible, although it would probably require the user to explicitly call a function to readapt a SIMD iterator on which `Iterator::map` has been called so that it again supports the SIMD iterator interface, rather than just having a blanket impl of `SIMDIterator` for `Iterator<Item = ...>`.
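A minimal sketch of that readapt pattern, assuming a hypothetical SIMD extension trait (`SimdExt`, `Readapt`, and `simd_sum` are invented names; vectors are modeled as `[f32; 4]`):

```rust
// Without specialization, a plain Iterator adapter like `map` loses
// the SIMD-specific trait; the user explicitly re-wraps the iterator
// to regain it, instead of relying on a blanket impl.

trait SimdExt: Iterator<Item = [f32; 4]> + Sized {
    /// A SIMD-flavored method available only on readapted iterators.
    fn simd_sum(self) -> f32 {
        let acc = self.fold([0.0f32; 4], |mut acc, v| {
            for i in 0..4 {
                acc[i] += v[i];
            }
            acc
        });
        acc.iter().sum()
    }
}

struct Readapt<I>(I);

impl<I: Iterator<Item = [f32; 4]>> Iterator for Readapt<I> {
    type Item = [f32; 4];
    fn next(&mut self) -> Option<[f32; 4]> {
        self.0.next()
    }
}

impl<I: Iterator<Item = [f32; 4]>> SimdExt for Readapt<I> {}
```

With specialization, the explicit `Readapt(...)` wrapping could instead be a blanket impl that applies automatically.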
The current design of "faster" has several big flaws that should be addressed, including:
Here is a better design:
With this design: