Tutorials

Tutorial on barray objects

Creating barrays

A barray can be created from any NumPy ndarray by using the barray constructor:

>>> import numpy as np
>>> a = np.arange(10)
>>> import blz
>>> b = blz.barray(a)                   # for in-memory storage
>>> c = blz.barray(a, rootdir='mydir')  # for on-disk storage

Alternatively, you can create one by using any of the other constructors (see Top level functions for the complete list):

>>> d = blz.arange(10, rootdir='mydir')

Please note that BLZ allows you to create disk-based arrays just by specifying the rootdir parameter in any of its constructors. Disk-based arrays fully support all the operations of their in-memory counterparts, so, depending on your needs, you may want to use one flavor or the other (or even a combination of both).
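
For instance, a disk-based barray can be flushed and re-opened later, even from a different Python session. Here is a minimal sketch (flush() and blz.open() are covered near the end of this tutorial; ‘mydir2’ is just an example path):

>>> e = blz.arange(10, rootdir='mydir2')  # created directly on disk
>>> e.flush()                             # make sure everything is written
>>> e2 = blz.open(rootdir='mydir2')       # get it back
>>> print e2
[0, 1, 2... 7, 8, 9]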

Now, b is a barray object. Just check this:

>>> type(b)
<type 'blz.blz_ext.barray'>

You can have a peek at it by using its string form:

>>> print b
[0, 1, 2... 7, 8, 9]

You can get more info about the uncompressed size (nbytes), the compressed size (cbytes) and the compression ratio (ratio = nbytes/cbytes) by using its representation form:

>>> b   # <==> print repr(b)
barray((10,), int64)  nbytes: 80; cbytes: 4.00 KB; ratio: 0.02
  bparams := bparams(clevel=5, shuffle=True)
[0 1 2 3 4 5 6 7 8 9]

As you can see, the compressed size is much larger than the uncompressed one. How can this be? Well, it turns out that a barray carries an internal I/O buffer for accelerating some operations, so for small arrays (typically those taking less than 1 MB) there is little point in using a barray.

However, when creating barrays larger than 1 MB (their natural scenario), the size of the I/O buffer is generally negligible in comparison:

>>> b = blz.arange(1e8)
>>> b
barray((100000000,), float64)  nbytes: 762.94 MB; cbytes: 23.38 MB; ratio: 32.63
  bparams := bparams(clevel=5, shuffle=True)
[0.0, 1.0, 2.0, ..., 99999997.0, 99999998.0, 99999999.0]

The barray consumes less than 24 MB, while the original data would have taken more than 760 MB; that’s a huge gain. You can always get a hint of how much space your barray takes by using sys.getsizeof():

>>> import sys
>>> sys.getsizeof(b)
24520482

The moral here is that you can create very large arrays without the need to create a NumPy array first (one that might not fit in memory).
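
For example, you can build a huge barray by appending moderately sized NumPy chunks, so that only one chunk has to live in memory at any given time. Here is a minimal sketch (the chunk size of 1e6 elements is an arbitrary choice, and it assumes that, like the btable constructor shown later, barray accepts an empty array to start from):

>>> b = blz.barray(np.empty(0, dtype='f8'))   # start from an empty barray (assumed supported)
>>> for i in xrange(100):
...:    b.append(np.arange(i*1e6, (i+1)*1e6))  # only one chunk in memory at a time
...:
>>> len(b)
100000000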

Finally, you can get a copy of your created barrays by using the copy() method:

>>> c = b.copy()
>>> c
barray((100000000,), float64)  nbytes: 762.94 MB; cbytes: 23.38 MB; ratio: 32.63
  bparams := bparams(clevel=5, shuffle=True)
[0.0, 1.0, 2.0, ..., 99999997.0, 99999998.0, 99999999.0]

and you can control parameters for the newly created copy:

>>> b.copy(bparams=blz.bparams(clevel=9))
barray((100000000,), float64)  nbytes: 762.94 MB; cbytes: 8.22 MB; ratio: 92.78
  bparams := bparams(clevel=9, shuffle=True)
[0.0, 1.0, 2.0, ..., 99999997.0, 99999998.0, 99999999.0]

Enlarging your barray

One of the nicest features of barray objects is that they can be enlarged very efficiently. This can be done via the barray.append() method.

For example, if b is a barray with 10 million elements:

>>> b
barray((10000000,), float64)  nbytes: 80000000; cbytes: 2691722; ratio: 29.72
  bparams := bparams(clevel=5, shuffle=True)
[0.0, 1.0, 2.0... 9999997.0, 9999998.0, 9999999.0]

it can be enlarged by 10 elements with:

>>> b.append(np.arange(10.))
>>> b
barray((10000010,), float64)  nbytes: 80000080; cbytes: 2691722;  ratio: 29.72
  bparams := bparams(clevel=5, shuffle=True)
[0.0, 1.0, 2.0... 7.0, 8.0, 9.0]

Let’s check how fast appending can be:

>>> a = np.arange(1e7)
>>> b = blz.arange(1e7)
>>> %time b.append(a)
CPU times: user 0.06 s, sys: 0.00 s, total: 0.06 s
Wall time: 0.06 s
>>> %time np.concatenate((a, a))
CPU times: user 0.08 s, sys: 0.04 s, total: 0.12 s
Wall time: 0.12 s  # 2x slower than BLZ
array([  0.00000000e+00,   1.00000000e+00,   2.00000000e+00, ...,
         9.99999700e+06,   9.99999800e+06,   9.99999900e+06])

This is especially true when appending small bits to large arrays:

>>> b = blz.barray(a)
>>> %timeit b.append(np.arange(1e1))
100000 loops, best of 3: 3.17 µs per loop
>>> %timeit np.concatenate((a, np.arange(1e1)))
10 loops, best of 3: 64 ms per loop  # 2000x slower than BLZ

You can also enlarge your arrays by using the resize() method:

>>> b = blz.arange(10)
>>> b.resize(20)
>>> b
barray((20,), int64)  nbytes: 160; cbytes: 4.00 KB; ratio: 0.04
  bparams := bparams(clevel=5, shuffle=True)
[0 1 2 3 4 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0]

Note how the appended values are filled with zeros. This is because the default fill value is 0, but you can choose a different value too:

>>> b = blz.arange(10, dflt=1)
>>> b.resize(20)
>>> b
barray((20,), int64)  nbytes: 160; cbytes: 4.00 KB; ratio: 0.04
  bparams := bparams(clevel=5, shuffle=True)
[0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1]

Also, you can trim barrays:

>>> b = blz.arange(10)
>>> b.resize(5)
>>> b
barray((5,), int64)  nbytes: 40; cbytes: 4.00 KB; ratio: 0.01
  bparams := bparams(clevel=5, shuffle=True)
[0 1 2 3 4]

You can even set the size to 0:

>>> b.resize(0)
>>> len(b)
0

Definitely, resizing is one of the strongest points of BLZ objects, so do not be afraid to use that feature extensively.
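
For instance, combining resize(0) with append() lets you reuse a barray as a growable buffer. Here is a minimal sketch:

>>> b = blz.arange(10)
>>> b.resize(0)             # empty it, keeping dtype and compression params
>>> b.append(np.arange(5))  # start filling it again
>>> b[:]
array([0, 1, 2, 3, 4])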

Compression level and shuffle filter

BLZ uses Blosc as the internal compressor, and Blosc can be directed to use different compression levels and to enable (or not) its internal shuffle filter. The shuffle filter is a way to improve compression for items whose type size is larger than 1 byte, although in rare cases it can be counter-productive for some data distributions.

By default, barrays are compressed using Blosc with compression level 5 and the shuffle filter active. But depending on your needs, you can use other compression levels too:

>>> blz.barray(a, blz.bparams(clevel=1))
barray((10000000,), float64)  nbytes: 76.29 MB; cbytes: 9.88 MB; ratio: 7.72
  bparams := bparams(clevel=1, shuffle=True)
[0.0, 1.0, 2.0, ..., 9999997.0, 9999998.0, 9999999.0]
>>> blz.barray(a, blz.bparams(clevel=9))
barray((10000000,), float64)  nbytes: 76.29 MB; cbytes: 1.11 MB; ratio: 68.60
  bparams := bparams(clevel=9, shuffle=True)
[0.0, 1.0, 2.0, ..., 9999997.0, 9999998.0, 9999999.0]

Also, you can decide if you want to disable the shuffle filter that comes with Blosc:

>>> blz.barray(a, blz.bparams(shuffle=False))
barray((10000000,), float64)  nbytes: 80000000; cbytes: 38203113; ratio: 2.09
  bparams := bparams(clevel=5, shuffle=False)
[0.0, 1.0, 2.0... 9999997.0, 9999998.0, 9999999.0]

but, as can be seen, the compression ratio is much worse in this case. In general it is recommended to leave shuffle active (unless you are fine-tuning the performance for a barray of a specific size).
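
If you are not sure which settings pay off for your data, a quick loop over compression levels can help you decide. Here is a minimal sketch that uses the cbytes attribute described in the ‘barray metadata’ section below:

>>> for clevel in (1, 5, 9):
...:    c = blz.barray(a, blz.bparams(clevel=clevel))
...:    print clevel, c.cbytes    # compressed size in bytes for each level
...: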

See the Optimization tips chapter for info on how you can change other internal parameters, like the chunk size.

Accessing BLZ objects data

Accessing BLZ data is very similar to using the NumPy indexing scheme; in fact, BLZ supports all the indexing methods supported by NumPy.

Specifying an index or slice:

>>> a = np.arange(10)
>>> b = blz.barray(a)
>>> b[0]
0
>>> b[-1]
9
>>> b[2:4]
array([2, 3])
>>> b[::2]
array([0, 2, 4, 6, 8])
>>> b[3:9:3]
array([3, 6])

Note that NumPy objects are returned as the result of an indexing operation. This is on purpose, because NumPy objects are normally more fully featured and flexible (especially if they are small). In fact, a handy way to get a NumPy array out of a barray object is to ask for the complete range:

>>> b[:]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Fancy indexing is supported too. For example, indexing with boolean arrays gives:

>>> barr = np.array([True]*5+[False]*5)
>>> b[barr]
array([0, 1, 2, 3, 4])
>>> b[blz.barray(barr)]
array([0, 1, 2, 3, 4])

Or, with a list of indices:

>>> b[[2,3,0,2]]
array([2, 3, 0, 2])
>>> b[blz.barray([2,3,0,2])]
array([2, 3, 0, 2])

Querying barrays

barrays can be queried in different ways. The easiest (yet powerful) way is by using their set of iterators:

>>> a = np.arange(1e7)
>>> b = blz.barray(a)
>>> %time sum(v for v in a if v < 10)
CPU times: user 7.44 s, sys: 0.00 s, total: 7.45 s
Wall time: 7.57 s
45.0
>>> %time sum(v for v in b if v < 10)
CPU times: user 0.89 s, sys: 0.00 s, total: 0.90 s
Wall time: 0.93 s   # 8x faster than NumPy
45.0

The iterator also has support for looking into slices of the array:

>>> %time sum(v for v in b.iter(start=2, stop=20, step=3) if v < 10)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
15.0
>>> %timeit sum(v for v in b.iter(start=2, stop=20, step=3) if v < 10)
10000 loops, best of 3: 121 µs per loop

Note that the time taken in this case is much shorter because the slice used for the lookup is much shorter too.

Also, you can quickly retrieve the indices of a boolean barray that have a true value:

>>> barr = blz.eval("b<10")  # see 'Operating with barrays' section below
>>> [i for i in barr.wheretrue()]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> %timeit [i for i in barr.wheretrue()]
1000 loops, best of 3: 1.06 ms per loop

And get the values where a boolean array is true:

>>> [i for i in b.where(barr)]
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
>>> %timeit [i for i in b.where(barr)]
1000 loops, best of 3: 1.59 ms per loop

Note how the wheretrue and where iterators are really fast. They are also very powerful. For example, they support limit and skip parameters for limiting the number of elements returned and for skipping leading elements, respectively:

>>> [i for i in barr.wheretrue(limit=5)]
[0, 1, 2, 3, 4]
>>> [i for i in barr.wheretrue(skip=3)]
[3, 4, 5, 6, 7, 8, 9]
>>> [i for i in barr.wheretrue(limit=5, skip=3)]
[3, 4, 5, 6, 7]

The advantage of the barray iterators is that you can use them in generator contexts and hence you don’t need to waste memory creating temporaries, which can be important when dealing with large arrays.

We have seen that this iterator toolset is very fast, so try to express your problems in a way that lets you use it extensively.
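
For example, the aggregations below run straight from the iterators, without materializing any intermediate array (reusing the b and barr objects from above):

>>> sum(b.where(barr))                # add up the values where barr is true
45.0
>>> sum(1 for i in barr.wheretrue())  # count the true values
10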

Modifying barrays

Although it is a somewhat slow operation, barrays can be modified too. You can do it by specifying scalar or slice indices:

>>> a = np.arange(10)
>>> b = blz.arange(10)
>>> b[1] = 10
>>> print b
[ 0 10  2  3  4  5  6  7  8  9]
>>> b[1:4] = 10
>>> print b
[ 0 10 10 10  4  5  6  7  8  9]
>>> b[1::3] = 10
>>> print b
[ 0 10 10 10 10  5  6 10  8  9]

Modification by using fancy indexing is supported too:

>>> barr = np.array([True]*5+[False]*5)
>>> b[barr] = -5
>>> print b
[-5 -5 -5 -5 -5  5  6 10  8  9]
>>> b[[1,2,4,1]] = -10
>>> print b
[ -5 -10 -10  -5 -10   5   6  10   8   9]

However, you must be aware that modifying a barray is expensive:

>>> a = np.arange(1e7)
>>> b = blz.barray(a)
>>> %timeit a[2] = 3
10000000 loops, best of 3: 101 ns per loop
>>> %timeit b[2] = 3
10000 loops, best of 3: 161 µs per loop  # 1600x slower than NumPy

although modifying values in the last chunk is somewhat cheaper:

>>> %timeit a[-1] = 3
10000000 loops, best of 3: 102 ns per loop
>>> %timeit b[-1] = 3
10000 loops, best of 3: 42.9 µs per loop  # 420x slower than NumPy

In general, you should avoid modifications (if you can) when using barrays.
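
If you do need many updates, one workable pattern is to batch them on a decompressed NumPy block and then write the block back in a single operation. Here is a minimal sketch (it assumes that slice assignment accepts array values, as in NumPy):

>>> block = b[:16384]     # decompress a block into a NumPy array
>>> block[::2] = 3        # many cheap in-memory modifications
>>> b[:16384] = block     # assumed: slice assignment accepts arrays, as in NumPy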

Multidimensional barrays

You can create multidimensional barrays too. Look at this example:

>>> a = blz.zeros((2,3))
>>> a
barray((2, 3), float64)  nbytes: 48; cbytes: 3.98 KB; ratio: 0.01
  bparams := bparams(clevel=5, shuffle=True)
[[ 0.  0.  0.]
 [ 0.  0.  0.]]

So, you can access any element in any dimension:

>>> a[1]
array([ 0.,  0.,  0.])
>>> a[1,::2]
array([ 0., 0.])
>>> a[1,1]
0.0

As you see, multidimensional barrays support the same multidimensional indexing as their NumPy counterparts.

Also, you can use the reshape() method to give an existing barray the shape you want:

>>> b = blz.arange(12).reshape((3,4))
>>> b
barray((3,), ('int64',(4,)))  nbytes: 96; cbytes: 4.00 KB; ratio: 0.02
  bparams := bparams(clevel=5, shuffle=True)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

Iterators loop over the leading dimension:

>>> [r for r in b]
[array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([ 8,  9, 10, 11])]

And you can select columns in the iteration by using another level of indirection:

>>> [r[2] for r in b]
[2, 6, 10]

Above, the third column has been selected. Although in this case plain indexing is easier:

>>> b[:,2]
array([ 2,  6, 10])

the iterator approach typically consumes less memory.
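
For example, summing the third column via the iterator decompresses just one row at a time:

>>> sum(r[2] for r in b)
18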

Operating with barrays

Right now, you cannot operate with barrays directly (although that might be implemented in Blaze itself):

>>> x = blz.arange(1e7)
>>> x + x
TypeError: unsupported operand type(s) for +:
'blz.blz_ext.barray' and 'blz.blz_ext.barray'

Rather, you should use the eval function:

>>> y = blz.eval("x + x")
>>> y
barray((10000000,), float64)  nbytes: 76.29 MB; cbytes: 2.64 MB; ratio: 28.88
  bparams := bparams(clevel=5, shuffle=True)
[0.0, 2.0, 4.0, ..., 19999994.0, 19999996.0, 19999998.0]

You can also compute arbitrarily complex expressions in one shot:

>>> y = blz.eval(".5*x**3 + 2.1*x**2")
>>> y
barray((10000000,), float64)  nbytes: 76.29 MB; cbytes: 38.00 MB; ratio: 2.01
  bparams := bparams(clevel=5, shuffle=True)
[0.0, 2.6, 12.4, ..., 4.9999976e+20, 4.9999991e+20, 5.0000006e+20]

Note how the output of eval() is also a barray object. You can pass other parameters of the barray constructor too. Let’s force maximum compression for the output:

>>> y = blz.eval(".5*x**3 + 2.1*x**2", bparams=blz.bparams(9))
>>> y
barray((10000000,), float64)  nbytes: 76.29 MB; cbytes: 35.66 MB; ratio: 2.14
  bparams := bparams(clevel=9, shuffle=True)
[0.0, 2.6, 12.4, ..., 4.9999976e+20, 4.9999991e+20, 5.0000006e+20]

By default, eval will use the Numexpr virtual machine if it is installed; if not, it will fall back to the Python one (via NumPy). You can use the vm parameter to select the desired virtual machine (“numexpr” or “python”):

>>> %timeit blz.eval(".5*x**3 + 2.1*x**2", vm="numexpr")
10 loops, best of 3: 303 ms per loop
>>> %timeit blz.eval(".5*x**3 + 2.1*x**2", vm="python")
10 loops, best of 3: 1.9 s per loop

As can be seen, using the “numexpr” virtual machine is generally (much) faster, but there are situations where the “python” one is desirable because it offers much more functionality:

>>> blz.eval("diff(x)", vm="numexpr")
NameError: variable name ``diff`` not found
>>> blz.eval("np.diff(x)", vm="python")
barray((9999389,), float64)  nbytes: 76.29 MB; cbytes: 814.25 KB; ratio: 95.94
  bparams := bparams(clevel=5, shuffle=True)
[1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0]

Finally, eval lets you select a NumPy array as the type of the outcome by using the out_flavor argument:

>>> blz.eval("x**3", out_flavor="numpy")
array([  0.00000000e+00,   1.00000000e+00,   8.00000000e+00, ...,
         9.99999100e+20,   9.99999400e+20,   9.99999700e+20])

For permanently setting your own defaults for vm and out_flavor, see the Defaults for BLZ operation chapter.

barray metadata

barray implements several attributes, like dtype, shape and ndim, that make it ‘quack’ like a NumPy array:

>>> a = np.arange(1e7)
>>> b = blz.barray(a)
>>> b.dtype
dtype('float64')
>>> b.shape
(10000000,)

In addition, it implements the cbytes attribute, which tells how many bytes the barray object uses in memory (or on disk):

>>> b.cbytes
2691722

This figure is approximate, and it is generally lower than the original (uncompressed) data size, which can be accessed by using the nbytes attribute:

>>> b.nbytes
80000000

which is the same as for the equivalent NumPy array:

>>> a.size*a.dtype.itemsize
80000000

To know the compression level and the other optional filters used, check the bparams read-only attribute:

>>> b.bparams
bparams(clevel=5, shuffle=True)

Also, you can check what the default value is (remember, it is used when resizing the barray):

>>> b.dflt
0.0

You can also access the chunklen (the length of each chunk) for this barray:

>>> b.chunklen
16384
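
Taken together, these attributes make it easy to build small reporting helpers. For example, a hypothetical compression_report() function (purely illustrative, not part of BLZ) could summarize a barray in one line:

>>> def compression_report(b):
...:    """One-line summary built from barray's public attributes."""
...:    print "%s: %d -> %d bytes (ratio %.2f, chunklen %d)" % (
...:        b.dtype, b.nbytes, b.cbytes, float(b.nbytes) / b.cbytes, b.chunklen)
...:
>>> compression_report(b)
float64: 80000000 -> 2691722 bytes (ratio 29.72, chunklen 16384)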

For a complete list of public attributes of barray, see section on barray attributes.

barray user attrs

Besides regular attributes like shape, dtype or chunklen, there is another set of attributes that can be added (and removed) by the user in a separate namespace. This namespace is accessible via the special attrs attribute:

>>> a = blz.barray([1,2], rootdir='mydata')
>>> a.attrs
*no attrs*

As you see, by default there are no attributes attached to attrs. Also, notice that the barray we have created is persistent and stored in the ‘mydata’ directory. Let’s add one attribute here:

>>> a.attrs['myattr'] = 234
>>> a.attrs
myattr : 234

So, we have attached the ‘myattr’ attribute with the value 234. Let’s add a couple more attributes:

>>> a.attrs['temp'] = 23
>>> a.attrs['unit'] = 'Celsius'
>>> a.attrs
unit : 'Celsius'
myattr : 234
temp : 23

Good, we have three of them now. You can attach as many as you want; the only current limitation is that they have to be serializable via JSON.

As the ‘a’ barray is persistent, it can be re-opened in another Python session:

>>> a.flush()
>>> ^D
$ python
Python 2.7.3rc2 (default, Apr 22 2012, 22:30:17)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import blz
>>> a = blz.open(rootdir="mydata")
>>> a                            # yeah, our data is back
barray((2,), int64)
  nbytes: 16; cbytes: 4.00 KB; ratio: 0.00
  bparams := bparams(clevel=5, shuffle=True)
  rootdir := 'mydata'
[1 2]
>>> a.attrs                      # and so is user attrs!
temp : 23
myattr : 234
unit : u'Celsius'

Now, let’s remove a couple of user attrs:

>>> del a.attrs['myattr']
>>> del a.attrs['unit']
>>> a.attrs
temp : 23

So, it is really easy to make use of this feature to complement your data with (potentially persistent) metadata of your choice. Of course, the btable object also has this capability.
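
For instance, provenance information can be attached right after creating a dataset. Here is a minimal sketch (the attribute names are just examples):

>>> a.attrs['created_by'] = 'tutorial'      # any JSON-serializable value works
>>> a.attrs['units'] = {'temp': 'Celsius'}  # dicts are fine too
>>> a.flush()                               # make sure everything is on disk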

Tutorial on btable objects

The BLZ package comes with a handy object that arranges data by column (and not by row, as in NumPy’s structured arrays). This allows for much better performance when walking tabular data by column, and also for adding and deleting columns.

Creating a btable

You can build btable objects in many different ways, but perhaps the easiest one is using the fromiter constructor:

>>> N = 100*1000
>>> ct = blz.fromiter(((i,i*i) for i in xrange(N)), dtype="i4,f8", count=N)
>>> ct
btable((100000,), |V12) nbytes: 1.14 MB; cbytes: 283.27 KB; ratio: 4.14
  bparams := bparams(clevel=5, shuffle=True)
[(0, 0.0), (1, 1.0), (2, 4.0), ...,
 (99997, 9999400009.0), (99998, 9999600004.0), (99999, 9999800001.0)]

You can also build an empty btable first and then append data:

>>> ct = blz.btable(np.empty(0, dtype="i4,f8"))
>>> for i in xrange(N):
...:    ct.append((i, i**2))
...:
>>> ct
btable((100000,), |V12) nbytes: 1.14 MB; cbytes: 355.48 KB; ratio: 3.30
  bparams := bparams(clevel=5, shuffle=True)
[(0, 0.0), (1, 1.0), (2, 4.0), ...,
 (99997, 9999400009.0), (99998, 9999600004.0), (99999, 9999800001.0)]

However, we can see that the latter approach does not compress as well. Why? Well, BLZ has machinery for computing ‘optimal’ chunk sizes depending on the number of entries. In the first case, BLZ can figure out the number of entries in the final array, but it cannot in the loop case. You can solve this by passing the final length with the expectedlen argument to the btable constructor:

>>> ct = blz.btable(np.empty(0, dtype="i4,f8"), expectedlen=N)
>>> for i in xrange(N):
...:    ct.append((i, i**2))
...:
>>> ct
btable((100000,), |V12) nbytes: 1.14 MB; cbytes: 283.27 KB; ratio: 4.14
  bparams := bparams(clevel=5, shuffle=True)
[(0, 0.0), (1, 1.0), (2, 4.0), ...,
 (99997, 9999400009.0), (99998, 9999600004.0), (99999, 9999800001.0)]

Okay, the compression ratio is the same now.

Accessing and setting rows

The btable object supports the most common indexing operations in NumPy:

>>> ct[1]
(1, 1.0)
>>> type(ct[1])
<type 'numpy.void'>
>>> ct[1:6]
array([(1, 1.0), (2, 4.0), (3, 9.0), (4, 16.0), (5, 25.0)],
      dtype=[('f0', '<i4'), ('f1', '<f8')])

The first thing to keep in mind is that, similarly to barray objects, the result of an indexing operation is a native NumPy object (in the cases above, a scalar and a structured array).

Fancy indexing is also supported:

>>> ct[[1,6,13]]
array([(1, 1.0), (6, 36.0), (13, 169.0)],
      dtype=[('f0', '<i4'), ('f1', '<f8')])
>>> ct["(f0>0) & (f1<10)"]
array([(1, 1.0), (2, 4.0), (3, 9.0)],
      dtype=[('f0', '<i4'), ('f1', '<f8')])

Note that conditions over columns are expressed as string expressions (in order to use Numexpr under the hood), and that the column names are understood correctly.

Setting rows is also supported:

>>> ct[1] = (0,0)
>>> ct
btable((100000,), |V12) nbytes: 1.14 MB; cbytes: 279.89 KB; ratio: 4.19
  bparams := bparams(clevel=5, shuffle=True)
[(0, 0.0), (0, 0.0), (2, 4.0), ...,
 (99997, 9999400009.0), (99998, 9999600004.0), (99999, 9999800001.0)]
>>> ct[1:6]
array([(0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0)],
      dtype=[('f0', '<i4'), ('f1', '<f8')])

And in combination with fancy indexing too:

>>> ct[[1,6,13]] = (1,1)
>>> ct[[1,6,13]]
array([(1, 1.0), (1, 1.0), (1, 1.0)],
      dtype=[('f0', '<i4'), ('f1', '<f8')])
>>> ct["(f0>=0) & (f1<10)"] = (2,2)
>>> ct[:7]
array([(2, 2.0), (2, 2.0), (2, 2.0), (2, 2.0), (2, 2.0), (2, 2.0),
       (6, 36.0)],
      dtype=[('f0', '<i4'), ('f1', '<f8')])

As you may have noticed, fancy indexing in combination with conditions is a very powerful feature.

Adding and deleting columns

Adding and deleting columns is easy and, due to the column-wise data arrangement, very efficient. Let’s add a new column to an existing btable:

>>> N = 100*1000
>>> ct = blz.fromiter(((i,i*i) for i in xrange(N)), dtype="i4,f8", count=N)
>>> new_col = np.linspace(0, 1, 100*1000)
>>> ct.addcol(new_col)
>>> ct
btable((100000,), |V20) nbytes: 1.91 MB; cbytes: 528.83 KB; ratio: 3.69
  bparams := bparams(clevel=5, shuffle=True)
[(0, 0.0, 0.0), (1, 1.0, 1.000010000100001e-05),
 (2, 4.0, 2.000020000200002e-05), ...,
 (99997, 9999400009.0, 0.99997999979999797),
 (99998, 9999600004.0, 0.99998999989999904), (99999, 9999800001.0, 1.0)]

Now, remove the already existing ‘f1’ column:

>>> ct.delcol('f1')
>>> ct
btable((100000,), |V12) nbytes: 1.14 MB; cbytes: 318.68 KB; ratio: 3.68
  bparams := bparams(clevel=5, shuffle=True)
[(0, 0.0), (1, 1.000010000100001e-05), (2, 2.000020000200002e-05), ...,
 (99997, 0.99997999979999797), (99998, 0.99998999989999904), (99999, 1.0)]

As we said, adding and deleting columns is very cheap, so don’t be afraid of using these operations extensively.

Iterating over btable data

You can make use of the iter() method in order to easily iterate over the values of a btable. iter() has support for start, stop and step parameters:

>>> N = 100*1000
>>> ct = blz.fromiter(((i,i*i) for i in xrange(N)), dtype="i4,f8", count=N)
>>> [row for row in ct.iter(1,10,3)]
[row(f0=1, f1=1.0), row(f0=4, f1=16.0), row(f0=7, f1=49.0)]

Note how the data is returned as namedtuple objects of type row. This allows you to iterate the fields more easily by using field names:

>>> [(f0,f1) for f0,f1 in ct.iter(1,10,3)]
[(1, 1.0), (4, 16.0), (7, 49.0)]

You can also use the [:] accessor to get rid of the row namedtuple, and return just bare tuples:

>>> [row[:] for row in ct.iter(1,10,3)]
[(1, 1.0), (4, 16.0), (7, 49.0)]

Also, you can select specific fields to be read via the outcols parameter:

>>> [row for row in ct.iter(1,10,3, outcols='f0')]
[row(f0=1), row(f0=4), row(f0=7)]
>>> [(nr,f0) for nr,f0 in ct.iter(1,10,3, outcols='nrow__,f0')]
[(1, 1), (4, 4), (7, 7)]

Please note the use of the special ‘nrow__‘ label for referring to the current row.

Iterating over the output of conditions along columns

One of the most powerful capabilities of the btable is the ability to iterate over the rows whose fields fulfill some conditions (without the need to put the results in a NumPy container, as described in the “Accessing and setting rows” section above). This can be very useful for performing operations on very large btables without consuming lots of storage space.

Here is an example of use:

>>> N = 100*1000
>>> ct = blz.fromiter(((i,i*i) for i in xrange(N)), dtype="i4,f8", count=N)
>>> [row for row in ct.where("(f0>0) & (f1<10)")]
[row(f0=1, f1=1.0), row(f0=2, f1=4.0), row(f0=3, f1=9.0)]
>>> sum([row.f1 for row in ct.where("(f1>10)")])
3.3333283333312755e+17

And by using the outcols parameter, you can specify the fields that you want to be returned:

>>> [row for row in ct.where("(f0>0) & (f1<10)", "f1")]
[row(f1=1.0), row(f1=4.0), row(f1=9.0)]

You can even specify the row number fulfilling the condition:

>>> [(f1,nr) for f1,nr in ct.where("(f0>0) & (f1<10)", "f1,nrow__")]
[(1.0, 1), (4.0, 2), (9.0, 3)]

Performing operations on btable columns

The btable object also has an eval() method that is handy for carrying out operations among columns:

>>> ct.eval("cos((3+f0)/sqrt(2*f1))")
barray((1000000,), float64)  nbytes: 7.63 MB; cbytes: 2.23 MB; ratio: 3.42
  bparams := bparams(clevel=5, shuffle=True)
[nan, -0.951363128126, -0.195699435691, ...,
 0.760243218982, 0.760243218983, 0.760243218984]

Here, one can see an exception to the usual btable method behaviour: the resulting output is a barray, and not a NumPy structured array. This is because the output of eval() has the same length as the btable, and thus it can be pretty large, so compression may help reduce its storage needs.
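
If you prefer the uncompressed result, remember that the [:] accessor turns any barray into a NumPy array. Here is a minimal sketch (the expression is just an example):

>>> y = ct.eval("f0 + 1")   # a compressed barray
>>> result = y[:]           # decompress it into a plain NumPy array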
