Query Speed and Compression
A while ago somebody asked a nice question on the NumPy list:
How to do this efficiently? In a 1-d array, find the first point where all subsequent points have values less than a threshold, T.
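To pin down what is being asked, here is a naive pure-Python baseline (my own sketch, not from the original thread): scan backwards and report the index of the last element that still meets the threshold; every subsequent point is below it. Slow, but unambiguous:

```python
import numpy as np

def last_at_or_above(x, T):
    # Naive O(n) backward scan: index of the last element >= T;
    # returns -1 if no element reaches the threshold.
    for i in range(len(x) - 1, -1, -1):
        if x[i] >= T:
            return i
    return -1

np.random.seed(1)
x = np.random.normal(scale=100, size=int(1e7))
print(last_at_or_above(x, 100))  # 9999991, matching the answers below
```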
There were many replies to the question, but I think one of the most efficient, in terms of both memory and CPU usage, was:
```
>>> np.random.seed(1)
>>> x = np.random.normal(scale=100, size=1e7)
>>> np.where(x>=100)[0][-1]
9999991
>>> %timeit np.where(x>=100)[0][-1]
10 loops, best of 3: 186 ms per loop
```
After looking at this, I thought that bcolz could do this quite efficiently too. After thinking a bit, one of the best ways I came up with for solving the problem was:
```
>>> [r for r in bcolz.eval("x>=100").wheretrue()][-1]
9999991
>>> %timeit [r for r in bcolz.eval("x>=100").wheretrue()][-1]
1 loops, best of 3: 313 ms per loop
```
But, as can be seen, it is somewhat slower than NumPy's way. Then I thought that the skip parameter of the wheretrue iterator could be used to accelerate the query:
```
>>> %timeit xc = bcolz.eval("x>=100"); [r for r in xc.wheretrue(skip=xc.sum()-1)][0]
1 loops, best of 3: 252 ms per loop
```
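The trick works because summing a boolean carray counts its True values, so skipping all but one of them leaves wheretrue with exactly one index to yield: the last match. A toy illustration of the same idea:

```python
import numpy as np
import bcolz

b = bcolz.carray(np.array([True, False, True, True, False]))
n = b.sum()                           # 3 True values in total
print(list(b.wheretrue(skip=n - 1)))  # [3]: only the last True index remains
```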
Mmh, 252 ms is not bad, but still behind NumPy. What if we disable compression (it is active by default)?
```
>>> %timeit xc = bcolz.eval("x>=100", cparams=bcolz.cparams(0)); [r for r in xc.wheretrue(skip=xc.sum()-1)][0]
10 loops, best of 3: 138 ms per loop
```
Hey! This was actually a good 34% faster than NumPy. That was pretty nice.
But then I wondered about the main bottleneck of the process, and came to the conclusion that it was the wheretrue iterator returning too many values into Python space (i.e., there were many positives). To check this hypothesis, I raised the threshold to 500 so as to reduce the number of positives.
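How much does that shrink the traffic into Python space? A quick count (my own sanity check, assuming the same x as above): for 1e7 samples drawn from N(0, 100), the 1-sigma tail (x>=100) keeps roughly 16% of the points, while the 5-sigma tail (x>=500) keeps only a handful.

```python
# Illustrative check: count the positives at each threshold.
print((x >= 100).sum())  # on the order of 1.6e6 positives
print((x >= 500).sum())  # only a handful; the 5-sigma tail probability is ~3e-7
```

With far fewer values crossing into Python space, here is NumPy again at the new threshold: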
```
>>> np.where(x>=500)[0][-1]
9977747
>>> %timeit np.where(x>=500)[0][-1]
10 loops, best of 3: 154 ms per loop
```
Okay, approximately the same time as before (a bit faster, actually, probably because there are far fewer positives now). Now, back to the bcolz weaponry:
```
>>> [r for r in bcolz.eval("x>=500", cparams=bcolz.cparams(0)).wheretrue()][-1]
9977747
>>> %timeit [r for r in bcolz.eval("x>=500", cparams=bcolz.cparams(0)).wheretrue()][-1]
100 loops, best of 3: 17.2 ms per loop
```
As predicted, that's almost 9x faster than NumPy. Let's see how much speed is lost by using compression:
```
>>> %timeit [r for r in bcolz.eval("x>=500", cparams=bcolz.cparams(5)).wheretrue()][-1]
100 loops, best of 3: 17.4 ms per loop
```
Okay, for compression level 5 (the default), the degradation in performance is barely noticeable. The reason is that, although Blosc uses more CPU in the compressed case, this is compensated by the smaller amount of time needed to bring the boolean array (x>=500) from memory to the CPU.
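One way to see why compression costs so little here is to look at how well the boolean result compresses: a mostly-False boolean array is extremely redundant. A quick sketch using the carray's nbytes (logical size) and cbytes (compressed size) attributes, assuming the same x as above:

```python
import bcolz

# Compressed boolean result with default cparams.
xb = bcolz.eval("x>=500")
# Ratio of uncompressed to compressed size; for a mostly-False boolean
# array this should be very high, meaning very little memory traffic.
print(xb.nbytes / float(xb.cbytes))
```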
The machine used for this benchmark had two CPUs with 4 cores each. Only 4 threads were used for bcolz (choosing more led to worse results because of contention problems).
The operating system was Linux 2.6.32, with SMP support. The compiler was GCC 4.4.5. NumPy was 1.5.0 and bcolz was 0.4.0.
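For completeness, the number of threads used by bcolz/Blosc can be selected explicitly; a sketch of how the 4-thread setup above could be chosen (set_nthreads returns the previous setting):

```python
import bcolz

prev = bcolz.set_nthreads(4)  # use 4 threads, as in this benchmark
print("previous number of threads:", prev)
```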