Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sort could be faster #194

Open
odinthenerd opened this issue Aug 8, 2016 · 22 comments
Open

sort could be faster #194

odinthenerd opened this issue Aug 8, 2016 · 22 comments

Comments

@odinthenerd
Copy link
Contributor

odinthenerd commented Aug 8, 2016

Ok I admit its my pet peeve ;) I find it interesting how combining algorithms are not efficient in runtime stuff but are efficient in metaprogramming. Basically the current implementation is good to about 500 elements, however even with the fast tracks merging 16 elements into 500 elements will be slow.

Coming back to the original partition implementation the problem was that a bad pivot element would cause a partition of everything only to eliminate one element from the input list. If we were to build up a large list using the merge/insertion sort, split it in the middle and then use that as a pivot at least we are eliminating half of the output list each partition even if we are not eliminating anything from the input list.

Another approach would be to sort chunks of lets say 64 elements each and then "partition"-ing them would essentially be a "split_if" rather than a normal partition. Since we need to presort the lists for merging any way we aren't really losing much and splitting a sorted list should be more efficient than a classic partition on all of its elements. This would essentially be a pure merge sort combined with a split_if.

I would really like to get to 1000 elements for kvasir.

@odinthenerd
Copy link
Contributor Author

@jonathanpoelen OMG your sort works up to several thousand elements! You are my hero!

@odinthenerd
Copy link
Contributor Author

I think the algorithm is taking most of its time in merge. Problem is that merge is hopelessly recursive, I think we need to do some partitioning to shorten merge runs.

@odinthenerd
Copy link
Contributor Author

I added a partition here https://github.com/edouarda/brigand/blob/master/brigand/algorithms/sort.hpp#L191 by splitting our 256 element sorted list and using the middle as a pivot element to partition the rest of the input as described above. The result is a 28% speed up on my machine. I think this is the right direction.

@odinthenerd
Copy link
Contributor Author

here is the fast tracked version https://github.com/porkybrain/brigand/tree/merge-partition-sort

on my machine 2000 !!! element list in random order
old : 12.81s
new : 6.67s

not sure yet if I broke any thing, there are a lot of moving parts in this algorithm.

@brunocodutra
Copy link

brunocodutra commented Aug 24, 2016 via email

@odinthenerd
Copy link
Contributor Author

Wow I wasn't aware that you had improved it that much. I tested metal several months ago and it was much slower but that sounds quite similar to brigand and, as most of your stuff is, solved in a puristic fashion. I think we can slim down the brigand algorithm quite a lot.

@brunocodutra
Copy link

I wasn't aware that you had improved it that much

Since our last discussions metal's API became more stable, so I decided to investigate which algorithms would benefit most from fast tracking and it basically boiled down to join, fold and reverse (brunocodutra/metal#41).

sounds quite similar to brigand

Take it with a grain of salt until we compare results on the same machine, because we might be comparing apples to bananas here depending on how different our hardware and setups are, but at a first glance it looks like metal doesn't lag too much behind brigand, which makes me proud to be honest.

@odinthenerd
Copy link
Contributor Author

either you dev system is awesome or I'm doing something wrong, here is my code:

template<typename T, typename U>
using eager_less = meta::number<(T::value < U::value)>;
using a = typename metal::sort<my_list, metal::lambda<eager_less>>::type;

using brigand I get 0.55s for 300 elements and 15.26s for meta. Meta crashes my laptop at 400 elements, brigand goes to like 3000.

I'm thinking about a making a generic "chunker" which splits a list into smaller lists of a fixed size (like chunker16 or chunker256) which would allow us to fast track and still use generic fold and would save us from writing so damn many typenames in every algorithm.

@brunocodutra
Copy link

brunocodutra commented Aug 25, 2016 via email

@odinthenerd
Copy link
Contributor Author

Ok now I'm using a list of ransom metal::number and using a = metal::sort<my_list, metal::lambda<metal::less>> and the numbers are slightly worse than last run.

@brunocodutra
Copy link

I have been blindly using a list ordered in reverse order as a worst case test for sort, but that obviously doesn't apply for merge sort, now that I take a closer look at it. Silly me, looks like I have been inadvertedly cheating :c

Curiously, that means merging is the culprit, so perhaps it deserves fast tracking. I'll see what I can lern from Brigand's.

@brunocodutra
Copy link

Thanks for benchmarking btw!

@odinthenerd
Copy link
Contributor Author

no problem, @ldionne made the same benchmarking mistake a while back ;) fast tracking merge is pretty hard because it is essentially a proper fold, @jonathanpoelen's idea of making a kind of partial merge seems to be the key behind the last speed up and is a genius idea. splitting the merged list and using that element as a pivot in order to partition the rest of input seems to bring another speed up (can be seen in my unmerged branch) but I think that can be improved on further.

@brunocodutra
Copy link

brunocodutra commented Aug 25, 2016 via email

@odinthenerd
Copy link
Contributor Author

Nice to see attention to sort, it is really something I need done well for kvasir. I would love the hear original ideas!

@brunocodutra
Copy link

BTW, which compiler are you using for these benchmarks?

@odinthenerd
Copy link
Contributor Author

Clang 3.7

@brunocodutra
Copy link

brunocodutra commented Aug 26, 2016

So I was able to pin join as the culprit for the memory overflow. I even managed to rewrite it so as to reduce memory consumption by about 20-30% on gcc, but it is still prohibitively high.

If only there was a way to implement take directly just like one implements drop, divide-and-conquer algorithms wouldn't have to depend on join (nor on any linearly recursive algorithm for what's worth).

@brunocodutra
Copy link

Finally, I decided to implement merge the dumbest way possible and, to my surprise, that's what happens

gcc 6.1 clang 4.0
sort_gcc sort_clang

@odinthenerd
Copy link
Contributor Author

odinthenerd commented Aug 31, 2016

I finally found time to look at your solution. I think one thing your doing that is faster than us is your merge specialization struct _merge<ret, list<xh, xt...>, list<yh, yt...>, lambda<expr>, if_<expr<xh, yh>, true_, false_>> where you know that the last parameter is going to be true_. The problem is that this restricts the concept of a predicate in that it must return a metal::number. Currently (as far as I know) brigand will accept anything that has a ::value as a predicate, which is far less restrictive. I wonder if our loose definition of this is causing a performance hit.

@brunocodutra
Copy link

brunocodutra commented Aug 31, 2016 via email

@odinthenerd
Copy link
Contributor Author

have not been able to put a whole lot of time into this but brigand::sort could benefit from chunking the input using a meta monad pattern as well as joining many lists rather than 2 at a time. Join can be implemented using more aliases and less complex types as well, see metals implementation of join for a reference. Also the nested alias version of conditional should help too.

Just wanted to keep the thought process going.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants