Query Speed and Compression
A while ago somebody asked a nice question on the NumPy list:
How to do this efficiently? In a 1-d array, find the first point where all subsequent points have values less than a threshold, T.
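To pin down what is being asked, here is a naive pure-Python baseline (my own sketch, not from the original thread): scan backwards and report the index of the last element that still meets the threshold; every subsequent point is below it. Slow, but unambiguous:

```python
import numpy as np

def last_at_or_above(x, T):
    # Naive O(n) backward scan: index of the last element >= T;
    # returns -1 if no element reaches the threshold.
    for i in range(len(x) - 1, -1, -1):
        if x[i] >= T:
            return i
    return -1

np.random.seed(1)
x = np.random.normal(scale=100, size=int(1e7))
print(last_at_or_above(x, 100))  # 9999991, matching the answers below
```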
There were many replies to the question, but I think one of the most efficient, in terms of both memory and CPU usage, was:
```
>>> np.random.seed(1)
>>> x = np.random.normal(scale=100, size=1e7)
>>> np.where(x>=100)[0][-1]
9999991
>>> %timeit np.where(x>=100)[0][-1]
10 loops, best of 3: 186 ms per loop
```
After looking at this, I thought that bcolz could do this quite efficiently too. After thinking a bit, one of the best ways I came up with for solving the problem was:
```
>>> [r for r in bcolz.eval("x>=100").wheretrue()][-1]
9999991
>>> %timeit [r for r in bcolz.eval("x>=100").wheretrue()][-1]
1 loops, best of 3: 313 ms per loop
```
But, as can be seen, it is somewhat slower than NumPy's way. Then I thought that the skip parameter of the wheretrue iterator could be used to accelerate the query:
```
>>> %timeit xc = bcolz.eval("x>=100"); [r for r in xc.wheretrue(skip=xc.sum()-1)][0]
1 loops, best of 3: 252 ms per loop
```
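The trick works because summing a boolean carray counts its True values, so skipping all but one of them leaves wheretrue with exactly one index to yield: the last match. A toy illustration of the same idea:

```python
import numpy as np
import bcolz

b = bcolz.carray(np.array([True, False, True, True, False]))
n = b.sum()                           # 3 True values in total
print(list(b.wheretrue(skip=n - 1)))  # [3]: only the last True index remains
```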
Mmh, 252 ms is not bad, but still behind NumPy. What if we disable compression (it is active by default)?
```
>>> %timeit xc = bcolz.eval("x>=100", cparams=bcolz.cparams(0)); [r for r in xc.wheretrue(skip=xc.sum()-1)][0]
10 loops, best of 3: 138 ms per loop
```
Hey! This was actually a good 34% faster than NumPy. That was pretty nice.
But then I wondered about the main bottleneck of the process, and came to the conclusion that it was the wheretrue iterator returning too many values into Python space (i.e., there were many positives). To check this hypothesis, I raised the threshold to 500 so as to reduce the number of positives.
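How much does that shrink the traffic into Python space? A quick count (my own sanity check, assuming the same x as above): for 1e7 samples drawn from N(0, 100), the 1-sigma tail (x>=100) keeps roughly 16% of the points, while the 5-sigma tail (x>=500) keeps only a handful.

```python
# Illustrative check: count the positives at each threshold.
print((x >= 100).sum())  # on the order of 1.6e6 positives
print((x >= 500).sum())  # only a handful; the 5-sigma tail probability is ~3e-7
```

With far fewer values crossing into Python space, here is NumPy again at the new threshold: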
```
>>> np.where(x>=500)[0][-1]
9977747
>>> %timeit np.where(x>=500)[0][-1]
10 loops, best of 3: 154 ms per loop
```
Okay, approximately the same time as before (a bit faster, actually, probably because there are far fewer positives now). Now, back to the bcolz weaponry:
```
>>> [r for r in bcolz.eval("x>=500", cparams=bcolz.cparams(0)).wheretrue()][-1]
9977747
>>> %timeit [r for r in bcolz.eval("x>=500", cparams=bcolz.cparams(0)).wheretrue()][-1]
100 loops, best of 3: 17.2 ms per loop
```
As predicted, that's almost 9x faster than NumPy. Let's see how much speed is lost by using compression:
```
>>> %timeit [r for r in bcolz.eval("x>=500", cparams=bcolz.cparams(5)).wheretrue()][-1]
100 loops, best of 3: 17.4 ms per loop
```
Okay, for compression level 5 (the default), the degradation in performance is barely noticeable. The reason is that, although Blosc uses more CPU in the compressed case, this is compensated by the smaller amount of time needed to bring the boolean array (x>=500) from memory to the CPU.
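One way to see why compression costs so little here is to look at how well the boolean result compresses: a mostly-False boolean array is extremely redundant. A quick sketch using the carray's nbytes (logical size) and cbytes (compressed size) attributes, assuming the same x as above:

```python
import bcolz

# Compressed boolean result with default cparams.
xb = bcolz.eval("x>=500")
# Ratio of uncompressed to compressed size; for a mostly-False boolean
# array this should be very high, meaning very little memory traffic.
print(xb.nbytes / float(xb.cbytes))
```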
The machine used for this benchmark had two CPUs with 4 cores each. Only 4 threads were used for bcolz (choosing more led to worse results because of contention problems).
The operating system was Linux 2.6.32, with SMP support. The compiler was GCC 4.4.5. NumPy was 1.5.0 and bcolz was 0.4.0.
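For completeness, the number of threads used by bcolz/Blosc can be selected explicitly; a sketch of how the 4-thread setup above could be chosen (set_nthreads returns the previous setting):

```python
import bcolz

prev = bcolz.set_nthreads(4)  # use 4 threads, as in this benchmark
print("previous number of threads:", prev)
```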