---------
Tutorials
---------
Tutorial on barray objects
==========================
Creating barrays
----------------
A barray can be created from any NumPy ndarray by using the `barray`
constructor::
>>> import numpy as np
>>> a = np.arange(10)
>>> import blz
>>> b = blz.barray(a) # for in-memory storage
>>> c = blz.barray(a, rootdir='mydir') # for on-disk storage
Or, you can also create it by using one of its multiple constructors
(see :ref:`top-level-constructors` for the complete list)::
>>> d = blz.arange(10, rootdir='mydir')
Please note that BLZ lets you create disk-based arrays by just
specifying the `rootdir` parameter in any of its constructors.
Disk-based arrays fully support all the operations of in-memory
counterparts, so depending on your needs, you may want to use one or
another (or even a combination of both).
Now, `b` is a barray object. Just check this::
>>> type(b)
<type 'blz.blz_ext.barray'>
You can have a peek at it by using its string form::
>>> print b
[0, 1, 2... 7, 8, 9]
You can also get more info about the uncompressed size (nbytes), the
compressed size (cbytes) and the compression ratio (ratio =
nbytes/cbytes) by using its representation form::
>>> b # <==> print repr(b)
barray((10,), int64) nbytes: 80; cbytes: 4.00 KB; ratio: 0.02
bparams := bparams(clevel=5, shuffle=True)
[0 1 2 3 4 5 6 7 8 9]
As you can see, the compressed size is much larger than the
uncompressed one. How can this be? Well, it turns out that a barray
carries an I/O buffer for accelerating some internal operations. So,
for small arrays (typically those taking less than 1 MB), there is
little point in using a barray.
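As a quick check, the ratio in the representation above can be reproduced by hand. This is only a sketch: it uses the formula given in the text (ratio = nbytes/cbytes) and assumes that the displayed "4.00 KB" means 4096 bytes. The helper name `compression_ratio` is hypothetical and not part of BLZ.

```python
def compression_ratio(nbytes, cbytes):
    """Return nbytes/cbytes, the ratio shown in the barray representation."""
    return nbytes / float(cbytes)

# Ten int64 values take 10 * 8 = 80 bytes uncompressed.
small_nbytes = 10 * 8
# Assumption: the "4.00 KB" displayed for cbytes corresponds to 4096 bytes.
small_cbytes = 4 * 1024
small_ratio = compression_ratio(small_nbytes, small_cbytes)
print("ratio: %.2f" % small_ratio)  # a ratio below 1 means the data grew
```

A ratio below 1 simply says that the fixed overhead dominates the payload, which is why tiny arrays do not benefit.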
However, when creating barrays larger than 1 MB (its natural
scenario), the size of the I/O buffer is generally negligible in
comparison::
>>> b = blz.arange(1e8)
>>> b
barray((100000000,), float64) nbytes: 762.94 MB; cbytes: 23.38 MB; ratio: 32.63
bparams := bparams(clevel=5, shuffle=True)
[0.0, 1.0, 2.0, ..., 99999997.0, 99999998.0, 99999999.0]
The barray consumes less than 24 MB, while the original data would have
taken more than 760 MB; that's a huge gain. You can always get a hint
on how much space your barray takes by using `sys.getsizeof()`::
>>> import sys
>>> sys.getsizeof(b)
24520482
The moral here is that you can create very large arrays without the
need to create a NumPy array first (which may not fit in memory).
Finally, you can get a copy of your created barrays by using the
`copy()` method::
>>> c = b.copy()
>>> c
barray((100000000,), float64) nbytes: 762.94 MB; cbytes: 23.38 MB; ratio: 32.63
bparams := bparams(clevel=5, shuffle=True)
[0.0, 1.0, 2.0, ..., 99999997.0, 99999998.0, 99999999.0]
and you can control parameters for the newly created copy::
>>> b.copy(bparams=blz.bparams(clevel=9))
barray((100000000,), float64) nbytes: 762.94 MB; cbytes: 8.22 MB; ratio: 92.78
bparams := bparams(clevel=9, shuffle=True)
[0.0, 1.0, 2.0, ..., 99999997.0, 99999998.0, 99999999.0]
Enlarging your barray
---------------------
One of the nicest features of barray objects is that they can be
enlarged very efficiently. This can be done via the `barray.append()`
method.
For example, if `b` is a barray with 10 million elements::
>>> b
barray((10000000,), float64) nbytes: 80000000; cbytes: 2691722; ratio: 29.72
bparams := bparams(clevel=5, shuffle=True)
[0.0, 1.0, 2.0... 9999997.0, 9999998.0, 9999999.0]
it can be enlarged by 10 elements with::
>>> b.append(np.arange(10.))
>>> b
barray((10000010,), float64) nbytes: 80000080; cbytes: 2691722; ratio: 29.72
bparams := bparams(clevel=5, shuffle=True)
[0.0, 1.0, 2.0... 7.0, 8.0, 9.0]
Let's check how fast appending can be::
>>> a = np.arange(1e7)
>>> b = blz.arange(1e7)
>>> %time b.append(a)
CPU times: user 0.06 s, sys: 0.00 s, total: 0.06 s
Wall time: 0.06 s
>>> %time np.concatenate((a, a))
CPU times: user 0.08 s, sys: 0.04 s, total: 0.12 s
Wall time: 0.12 s # 2x slower than BLZ
array([ 0.00000000e+00, 1.00000000e+00, 2.00000000e+00, ...,
9.99999700e+06, 9.99999800e+06, 9.99999900e+06])
This is especially true when appending small bits to large arrays::
>>> b = blz.barray(a)
>>> %timeit b.append(np.arange(1e1))
100000 loops, best of 3: 3.17 µs per loop
>>> %timeit np.concatenate((a, np.arange(1e1)))
10 loops, best of 3: 64 ms per loop # 2000x slower than BLZ
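One plausible way to picture why appending is cheap is a container split into fixed-size compressed chunks plus a small uncompressed tail. The sketch below is a toy model under that assumption, written with the standard library only (`zlib` stands in for Blosc). The class name `ChunkedFloats` and its attributes are hypothetical and do not describe BLZ's actual internals.

```python
import struct
import zlib

class ChunkedFloats:
    """Toy chunked container of float64 values (illustration only)."""

    def __init__(self, chunklen=4):
        self.chunklen = chunklen
        self.chunks = []    # compressed full chunks, never rewritten on append
        self.leftover = []  # uncompressed tail, shorter than chunklen

    def append(self, values):
        # Only the tail is touched; existing chunks stay as they are.
        self.leftover.extend(float(v) for v in values)
        while len(self.leftover) >= self.chunklen:
            head = self.leftover[:self.chunklen]
            self.leftover = self.leftover[self.chunklen:]
            raw = struct.pack("<%dd" % self.chunklen, *head)
            self.chunks.append(zlib.compress(raw))

    def __len__(self):
        return len(self.chunks) * self.chunklen + len(self.leftover)

    def __getitem__(self, i):
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError(i)
        nchunk, offset = divmod(i, self.chunklen)
        if nchunk == len(self.chunks):
            return self.leftover[offset]
        # Decompress just the chunk that holds the requested item.
        raw = zlib.decompress(self.chunks[nchunk])
        return struct.unpack_from("<d", raw, offset * 8)[0]

c = ChunkedFloats(chunklen=4)
c.append(range(10))         # two full chunks plus a tail of two items
first_chunk = c.chunks[0]
c.append([10.0])            # the existing chunks are left untouched
```

By contrast, `np.concatenate` has to allocate a new array and copy both inputs, which matches the timings shown above.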
You can also enlarge your arrays by using the `resize()` method::
>>> b = blz.arange(10)
>>> b.resize(20)
>>> b
barray((20,), int64) nbytes: 160; cbytes: 4.00 KB; ratio: 0.04
bparams := bparams(clevel=5, shuffle=True)
[0 1 2 3 4 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0]
Note how the appended values are filled with zeros. This is because the
default value for filling is 0. But you can choose a different value
too::
>>> b = blz.arange(10, dflt=1)
>>> b.resize(20)
>>> b
barray((20,), int64) nbytes: 160; cbytes: 4.00 KB; ratio: 0.04
bparams := bparams(clevel=5, shuffle=True)
[0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1]
Also, you can trim barrays::
>>> b = blz.arange(10)
>>> b.resize(5)
>>> b
barray((5,), int64) nbytes: 40; cbytes: 4.00 KB; ratio: 0.01
bparams := bparams(clevel=5, shuffle=True)
[0 1 2 3 4]
You can even set the size to 0::
>>> b.resize(0)
>>> len(b)
0
Definitely, resizing is one of the strongest points of BLZ
objects, so do not be afraid to use that feature extensively.
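The resizing semantics shown above (pad with the default value when growing, drop trailing items when shrinking) can be summarized with a plain-Python sketch. The function `resize` below is a hypothetical stand-in that works on lists; it is not BLZ code and says nothing about how BLZ implements the operation.

```python
def resize(values, nitems, dflt=0):
    """Return a copy of `values` with exactly `nitems` elements.

    Growing pads with `dflt`; shrinking drops the trailing elements.
    """
    if nitems < 0:
        raise ValueError("nitems must be non-negative")
    if nitems <= len(values):
        return values[:nitems]
    return values + [dflt] * (nitems - len(values))

grown = resize(list(range(10)), 20)            # padded with zeros
grown_ones = resize(list(range(10)), 20, 1)    # padded with ones
trimmed = resize(list(range(10)), 5)           # first five items only
empty = resize(list(range(10)), 0)             # nothing left
```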
Compression level and shuffle filter
------------------------------------
BLZ uses Blosc as the internal compressor, and Blosc can be directed
to use different compression levels and to use (or not) its internal
shuffle filter. The shuffle filter is a way to improve compression
when using items that have type sizes > 1 byte, although it might be
counter-productive (very rarely) for some data distributions.
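To see why shuffling helps with multi-byte items, here is a minimal sketch of a byte-shuffle written with the standard library only. It uses `zlib` as a stand-in compressor, so the sizes it produces are not comparable with Blosc's. The functions `shuffle` and `unshuffle` are hypothetical helpers, not Blosc or BLZ APIs.

```python
import struct
import zlib

def shuffle(data, itemsize):
    """Group the i-th byte of every item together (byte transposition)."""
    return b"".join(data[i::itemsize] for i in range(itemsize))

def unshuffle(data, itemsize):
    """Invert shuffle(): interleave the byte planes back into items."""
    nitems = len(data) // itemsize
    planes = [data[i * nitems:(i + 1) * nitems] for i in range(itemsize)]
    return bytes(bytearray(b for item in zip(*planes) for b in item))

# 10000 consecutive little-endian int64 values: most bytes are zero,
# but in the raw layout the zeros are interleaved with changing bytes.
raw = struct.pack("<10000q", *range(10000))
shuffled = shuffle(raw, 8)

plain_size = len(zlib.compress(raw))
shuffled_size = len(zlib.compress(shuffled))
print(plain_size, shuffled_size)
```

After shuffling, the bytes that rarely change end up in long uniform runs, which a general-purpose compressor handles much better.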
By default barrays are compressed using Blosc with compression level 5
with shuffle active. But depending on your needs, you can use other
compression levels too::
>>> blz.barray(a, blz.bparams(clevel=1))
barray((10000000,), float64) nbytes: 76.29 MB; cbytes: 9.88 MB; ratio: 7.72
bparams := bparams(clevel=1, shuffle=True)
[0.0, 1.0, 2.0, ..., 9999997.0, 9999998.0, 9999999.0]
>>> blz.barray(a, blz.bparams(clevel=9))
barray((10000000,), float64) nbytes: 76.29 MB; cbytes: 1.11 MB; ratio: 68.60
bparams := bparams(clevel=9, shuffle=True)
[0.0, 1.0, 2.0, ..., 9999997.0, 9999998.0, 9999999.0]
Also, you can decide if you want to disable the shuffle filter that
comes with Blosc::
>>> blz.barray(a, blz.bparams(shuffle=False))
barray((10000000,), float64) nbytes: 80000000; cbytes: 38203113; ratio: 2.09
bparams := bparams(clevel=5, shuffle=False)
[0.0, 1.0, 2.0... 9999997.0, 9999998.0, 9999999.0]
but, as can be seen, the compression ratio is much worse in this case.
In general it is recommended to leave shuffle active (unless you are
fine-tuning the performance for a specific size of barray).
See the :ref:`opt-tips` chapter for info on how you can change other
internal parameters like the size of the chunk.
Accessing BLZ objects data
--------------------------
The way to access BLZ data is very similar to the NumPy indexing
scheme, and in fact, supports all the indexing methods supported by
NumPy.
Specifying an index or slice::
>>> a = np.arange(10)
>>> b = blz.barray(a)
>>> b[0]
0
>>> b[-1]
9
>>> b[2:4]
array([2, 3])
>>> b[::2]
array([0, 2, 4, 6, 8])
>>> b[3:9:3]
array([3, 6])
Note that NumPy objects are returned as the result of an indexing
operation. This is on purpose because normally NumPy objects are more
feature-rich and flexible (especially if they are small). In fact, a
handy way to get a NumPy array out of a barray object is asking for
the complete range::
>>> b[:]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Fancy indexing is supported too. For example, indexing with boolean
arrays gives::
>>> barr = np.array([True]*5+[False]*5)
>>> b[barr]
array([0, 1, 2, 3, 4])
>>> b[blz.barray(barr)]
array([0, 1, 2, 3, 4])
Or, with a list of indices::
>>> b[[2,3,0,2]]
array([2, 3, 0, 2])
>>> b[blz.barray([2,3,0,2])]
array([2, 3, 0, 2])
Querying barrays
----------------
barrays can be queried in different ways. The easiest (yet
powerful) way is by using their set of iterators::
>>> a = np.arange(1e7)
>>> b = blz.barray(a)
>>> %time sum(v for v in a if v < 10)
CPU times: user 7.44 s, sys: 0.00 s, total: 7.45 s
Wall time: 7.57 s
45.0
>>> %time sum(v for v in b if v < 10)
CPU times: user 0.89 s, sys: 0.00 s, total: 0.90 s
Wall time: 0.93 s # 8x faster than NumPy
45.0
The iterator also has support for looking into slices of the array::
>>> %time sum(v for v in b.iter(start=2, stop=20, step=3) if v < 10)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
15.0
>>> %timeit sum(v for v in b.iter(start=2, stop=20, step=3) if v < 10)
10000 loops, best of 3: 121 µs per loop
Note that the time taken in this case is much shorter because the
slice being scanned is much shorter too.
Also, you can quickly retrieve the indices of a boolean barray that
have a true value::
>>> barr = blz.eval("b<10") # see 'Operating with barrays' section below
>>> [i for i in barr.wheretrue()]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> %timeit [i for i in barr.wheretrue()]
1000 loops, best of 3: 1.06 ms per loop
And get the values where a boolean array is true::
>>> [i for i in b.where(barr)]
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
>>> %timeit [i for i in b.where(barr)]
1000 loops, best of 3: 1.59 ms per loop
Note how `wheretrue` and `where` iterators are really fast. They are
also very powerful. For example, they support `limit` and `skip`
parameters for limiting the number of elements returned and skipping
the leading elements respectively::
>>> [i for i in barr.wheretrue(limit=5)]
[0, 1, 2, 3, 4]
>>> [i for i in barr.wheretrue(skip=3)]
[3, 4, 5, 6, 7, 8, 9]
>>> [i for i in barr.wheretrue(limit=5, skip=3)]
[3, 4, 5, 6, 7]
The advantage of the barray iterators is that you can use them in
generator contexts and hence, you don't need to waste memory for
creating temporaries, which can be important when dealing with large
arrays.
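The `limit` and `skip` semantics described above can be mimicked with plain generators. The sketch below is not BLZ code: `wheretrue` and `where` are hypothetical free functions over ordinary Python iterables, built on `itertools.islice`, and are meant only to show how such iterators avoid temporaries.

```python
from itertools import islice

def wheretrue(mask, limit=None, skip=0):
    """Lazily yield the indices at which `mask` is true."""
    hits = (i for i, flag in enumerate(mask) if flag)
    stop = None if limit is None else skip + limit
    return islice(hits, skip, stop)

def where(values, mask, limit=None, skip=0):
    """Lazily yield the items of `values` at which `mask` is true."""
    hits = (v for v, flag in zip(values, mask) if flag)
    stop = None if limit is None else skip + limit
    return islice(hits, skip, stop)

values = [float(i) for i in range(1000)]
mask = (v < 10 for v in values)   # a generator, so no temporary list
print(list(wheretrue(mask, limit=5, skip=3)))
```

Because everything is lazy, iteration stops as soon as `limit` hits have been produced, without scanning the rest of the data.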
We have seen that this iterator toolset is very fast, so try to
express your problems in a way that lets you use it extensively.
Modifying barrays
-----------------
Although it is a somewhat slow operation, barrays can be modified too.
You can do it by specifying scalar or slice indices::
>>> a = np.arange(10)
>>> b = blz.arange(10)
>>> b[1] = 10
>>> print b
[ 0 10 2 3 4 5 6 7 8 9]
>>> b[1:4] = 10
>>> print b
[ 0 10 10 10 4 5 6 7 8 9]
>>> b[1::3] = 10
>>> print b
[ 0 10 10 10 10 5 6 10 8 9]
Modification by using fancy indexing is supported too::
>>> barr = np.array([True]*5+[False]*5)
>>> b[barr] = -5
>>> print b
[-5 -5 -5 -5 -5 5 6 10 8 9]
>>> b[[1,2,4,1]] = -10
>>> print b
[ -5 -10 -10 -5 -10 5 6 10 8 9]
However, you must be aware that modifying a barray is expensive::
>>> a = np.arange(1e7)
>>> b = blz.barray(a)
>>> %timeit a[2] = 3
10000000 loops, best of 3: 101 ns per loop
>>> %timeit b[2] = 3
10000 loops, best of 3: 161 µs per loop # 1600x slower than NumPy
although modifying values in the last chunk is somewhat cheaper::
>>> %timeit a[-1] = 3
10000000 loops, best of 3: 102 ns per loop
>>> %timeit b[-1] = 3
10000 loops, best of 3: 42.9 µs per loop # 420x slower than NumPy
In general, you should avoid modifications (if you can) when using
barrays.
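A rough intuition for this cost: if the data lives in compressed chunks, changing a single item means decompressing the chunk that holds it, patching it, and compressing it again. The sketch below illustrates that assumption with `zlib` and tiny chunks. The helpers `compress_chunk`, `decompress_chunk` and `set_item` are hypothetical and do not reflect BLZ's real implementation.

```python
import struct
import zlib

CHUNKLEN = 4  # items per chunk (tiny, for illustration)

def compress_chunk(values):
    """Pack float64 values and compress them into one chunk."""
    return zlib.compress(struct.pack("<%dd" % len(values), *values))

def decompress_chunk(blob):
    """Return the float64 values stored in a compressed chunk."""
    raw = zlib.decompress(blob)
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))

def set_item(chunks, index, value):
    """Set one item: decompress, patch and recompress a whole chunk."""
    nchunk, offset = divmod(index, CHUNKLEN)
    values = decompress_chunk(chunks[nchunk])
    values[offset] = float(value)
    chunks[nchunk] = compress_chunk(values)

chunks = [compress_chunk([0, 1, 2, 3]), compress_chunk([4, 5, 6, 7])]
untouched = chunks[0]
set_item(chunks, 5, 10)   # rewrites the second chunk only
```

A plain NumPy assignment writes eight bytes in place, so a gap of several orders of magnitude like the one measured above is not surprising.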
Multidimensional barrays
------------------------
You can create multidimensional barrays too. Look at this example::
>>> a = blz.zeros((2,3))
>>> a
barray((2, 3), float64) nbytes: 48; cbytes: 3.98 KB; ratio: 0.01
bparams := bparams(clevel=5, shuffle=True)
[[ 0. 0. 0.]
[ 0. 0. 0.]]
So, you can access any element in any dimension::
>>> a[1]
array([ 0., 0., 0.])
>>> a[1,::2]
array([ 0., 0.])
>>> a[1,1]
0.0
As you can see, multidimensional barrays support the same
multidimensional indexes as their NumPy counterparts.
Also, you can use the `reshape()` method to set your desired shape to
an existing barray::
>>> b = blz.arange(12).reshape((3,4))
>>> b
barray((3,), ('int64',(4,))) nbytes: 96; cbytes: 4.00 KB; ratio: 0.02
bparams := bparams(clevel=5, shuffle=True)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
Iterators loop over the leading dimension::
>>> [r for r in b]
[array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([ 8, 9, 10, 11])]
And you can select columns there by using another indirection level::
>>> [r[2] for r in b]
[2, 6, 10]
Above, the third column has been selected. Although for this case the
indexing is easier::
>>> b[:,2]
array([ 2, 6, 10])
the iterator approach typically consumes less memory.
Operating with barrays
----------------------
Right now, you cannot operate with barrays directly (although that
might be implemented in Blaze itself)::
>>> x = blz.arange(1e7)
>>> x + x
TypeError: unsupported operand type(s) for +:
'blz.blz_ext.barray' and 'blz.blz_ext.barray'
Rather, you should use the `eval` function::
>>> y = blz.eval("x + x")
>>> y
barray((10000000,), float64) nbytes: 76.29 MB; cbytes: 2.64 MB; ratio: 28.88
bparams := bparams(clevel=5, shuffle=True)
[0.0, 2.0, 4.0, ..., 19999994.0, 19999996.0, 19999998.0]
You can also compute arbitrarily complex expressions in one shot::
>>> y = blz.eval(".5*x**3 + 2.1*x**2")
>>> y
barray((10000000,), float64) nbytes: 76.29 MB; cbytes: 38.00 MB; ratio: 2.01
bparams := bparams(clevel=5, shuffle=True)
[0.0, 2.6, 12.4, ..., 4.9999976e+20, 4.9999991e+20, 5.0000006e+20]
Note how the output of `eval()` is also a barray object. You can pass
other parameters of the barray constructor too. Let's force maximum
compression for the output::
>>> y = blz.eval(".5*x**3 + 2.1*x**2", bparams=blz.bparams(9))
>>> y
barray((10000000,), float64) nbytes: 76.29 MB; cbytes: 35.66 MB; ratio: 2.14
bparams := bparams(clevel=9, shuffle=True)
[0.0, 2.6, 12.4, ..., 4.9999976e+20, 4.9999991e+20, 5.0000006e+20]
By default, `eval` will use the Numexpr virtual machine if it is
installed and, if not, it will fall back to the Python one (via
NumPy). You can use the `vm` parameter to select the desired virtual
machine ("numexpr" or "python")::
>>> %timeit blz.eval(".5*x**3 + 2.1*x**2", vm="numexpr")
10 loops, best of 3: 303 ms per loop
>>> %timeit blz.eval(".5*x**3 + 2.1*x**2", vm="python")
10 loops, best of 3: 1.9 s per loop
As can be seen, using the "numexpr" virtual machine is generally
(much) faster, but there are situations where the "python" one is
desirable because it offers much more functionality::
>>> blz.eval("diff(x)", vm="numexpr")
NameError: variable name ``diff`` not found
>>> blz.eval("np.diff(x)", vm="python")
barray((9999389,), float64) nbytes: 76.29 MB; cbytes: 814.25 KB; ratio: 95.94
bparams := bparams(clevel=5, shuffle=True)
[1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0]
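The default selection rule described earlier (Numexpr if it is installed, otherwise Python) can be sketched with the standard library. The function `default_vm` is a hypothetical helper written for illustration; it is not how BLZ is known to implement the check.

```python
import importlib.util

def default_vm():
    """Pick "numexpr" when the package can be imported, else "python"."""
    if importlib.util.find_spec("numexpr") is not None:
        return "numexpr"
    return "python"

print(default_vm())
```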
Finally, `eval` lets you select the type of the outcome to be a NumPy
array by using the `out_flavor` argument::
>>> blz.eval("x**3", out_flavor="numpy")
array([ 0.00000000e+00, 1.00000000e+00, 8.00000000e+00, ...,
9.99999100e+20, 9.99999400e+20, 9.99999700e+20])
To permanently set your own defaults for `vm` and `out_flavor`, see
the :ref:`blz-defaults` chapter.
barray metadata
---------------
barray implements several attributes, like `dtype`, `shape` and `ndim`,
that make it 'quack' like a NumPy array::
>>> a = np.arange(1e7)
>>> b = blz.barray(a)
>>> b.dtype
dtype('float64')
>>> b.shape
(10000000,)
In addition, it implements the `cbytes` attribute that tells how many
bytes the barray object uses in memory (or on disk)::
>>> b.cbytes
2691722
This figure is approximate and is generally lower than the original
(uncompressed) data size, which can be accessed by using the `nbytes`
attribute::
>>> b.nbytes
80000000
which is the same as for the equivalent NumPy array::
>>> a.size*a.dtype.itemsize
80000000
To find out the compression level used and other optional filters, use
the `bparams` read-only attribute::
>>> b.bparams
bparams(clevel=5, shuffle=True)
Also, you can check what the default value is (remember, it is used
when resizing the barray with `resize`)::
>>> b.dflt
0.0
You can access the `chunklen` (the length for each chunk) for this
barray::
>>> b.chunklen
16384
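The numbers reported by these attributes are related by simple arithmetic, which the following sketch spells out using the values printed above. The variable names are illustrative only, and the count of chunk-sized pieces is a back-of-the-envelope figure, not something read from BLZ.

```python
import struct

nitems = 10 * 1000 * 1000         # b.shape[0]
itemsize = struct.calcsize("<d")  # 8 bytes per float64 item
nbytes = nitems * itemsize        # uncompressed size, as in b.nbytes
cbytes = 2691722                  # value reported by b.cbytes above
chunklen = 16384                  # value reported by b.chunklen above

ratio = nbytes / float(cbytes)
# Chunk-sized pieces needed to hold all items (the last one is partial).
nchunks = -(-nitems // chunklen)
# Uncompressed size of one full chunk, in bytes.
chunk_nbytes = chunklen * itemsize

print("nbytes=%d ratio=%.2f nchunks=%d" % (nbytes, ratio, nchunks))
```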
For a complete list of public attributes of barray, see section on
:ref:`barray-attributes`.
.. _barray-attrs:
barray user attrs
-----------------
Besides the regular attributes like `shape`, `dtype` or `chunklen`,
there is another set of attributes that can be added (and removed) by
the user in another name space. This space is accessible via the
special `attrs` attribute::
>>> a = blz.barray([1,2], rootdir='mydata')
>>> a.attrs
*no attrs*
As you can see, by default there are no attributes attached to
`attrs`. Also, notice that the barray that we have created is
persistent and stored in the 'mydata' directory. Let's add one
attribute here::
>>> a.attrs['myattr'] = 234
>>> a.attrs
myattr : 234
So, we have attached the 'myattr' attribute with the value 234. Let's
add a couple more attributes::
>>> a.attrs['temp'] = 23
>>> a.attrs['unit'] = 'Celsius'
>>> a.attrs
unit : 'Celsius'
myattr : 234
temp : 23
Good, we have three of them now. You can attach as many as you want,
and the only current limitation is that they have to be serializable
via JSON.
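If you are unsure whether a value meets the JSON requirement, you can test it with the standard `json` module before attaching it. The helper `is_json_serializable` below is a hypothetical convenience function, not part of BLZ.

```python
import json

def is_json_serializable(value):
    """Return True if `value` can be encoded with the json module."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True

attrs = {"myattr": 234, "temp": 23, "unit": "Celsius"}
roundtrip = json.loads(json.dumps(attrs))

print(is_json_serializable(attrs))          # plain numbers and strings
print(is_json_serializable({1, 2, 3}))      # sets are not JSON types
```

Keep in mind that a JSON round trip may change the Python type of a value; for instance, a tuple comes back as a list.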
As the 'a' barray is persistent, it can be re-opened in another Python
session::
>>> a.flush()
>>> ^D
$ python
Python 2.7.3rc2 (default, Apr 22 2012, 22:30:17)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import blz
>>> a = blz.open(rootdir="mydata")
>>> a # yeah, our data is back
barray((2,), int64)
nbytes: 16; cbytes: 4.00 KB; ratio: 0.00
bparams := bparams(clevel=5, shuffle=True)
rootdir := 'mydata'
[1 2]
>>> a.attrs # and so are the user attrs!
temp : 23
myattr : 234
unit : u'Celsius'
Now, let's remove a couple of user attrs::
>>> del a.attrs['myattr']
>>> del a.attrs['unit']
>>> a.attrs
temp : 23
So, it is really easy to make use of this feature so as to complement
your data with (potentially persistent) metadata of your choice. Of
course, the `btable` object also has this capability.
Tutorial on btable objects
==========================
The BLZ package comes with a handy object that arranges data by column
(and not by row, as in NumPy's structured arrays). This allows for
much better performance for walking tabular data by column and also
for adding and deleting columns.
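The idea of a column-wise layout can be illustrated with a tiny pure-Python model in which every column is stored separately, so adding or removing a column never touches the others. The class `ColumnTable` below is a hypothetical toy, not the btable implementation.

```python
class ColumnTable:
    """Toy column-wise table: one Python list per column."""

    def __init__(self, **columns):
        lengths = {len(col) for col in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = dict(columns)

    def __len__(self):
        for col in self.columns.values():
            return len(col)
        return 0

    def addcol(self, name, values):
        # Only the new column is stored; existing columns are untouched.
        if self.columns and len(values) != len(self):
            raise ValueError("new column has the wrong length")
        self.columns[name] = list(values)

    def delcol(self, name):
        # Dropping a column is a single dictionary deletion.
        del self.columns[name]

    def row(self, nrow):
        # Rows are assembled on demand from the individual columns.
        return tuple(col[nrow] for col in self.columns.values())

t = ColumnTable(f0=[0, 1, 2], f1=[0.0, 1.0, 4.0])
t.addcol("f2", [0.0, 0.5, 1.0])
t.delcol("f1")
```

In a row-wise layout such as a NumPy structured array, the same operations would require rewriting every record.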
Creating a btable
-----------------
You can build btable objects in many different ways, but perhaps the
easiest one is using the `fromiter` constructor::
>>> N = 100*1000
>>> ct = blz.fromiter(((i,i*i) for i in xrange(N)), dtype="i4,f8", count=N)
>>> ct
btable((100000,), |V12) nbytes: 1.14 MB; cbytes: 283.27 KB; ratio: 4.14
bparams := bparams(clevel=5, shuffle=True)
[(0, 0.0), (1, 1.0), (2, 4.0), ...,
(99997, 9999400009.0), (99998, 9999600004.0), (99999, 9999800001.0)]
You can also build an empty btable first and then append data::
>>> ct = blz.btable(np.empty(0, dtype="i4,f8"))
>>> for i in xrange(N):
...: ct.append((i, i**2))
...:
>>> ct
btable((100000,), |V12) nbytes: 1.14 MB; cbytes: 355.48 KB; ratio: 3.30
bparams := bparams(clevel=5, shuffle=True)
[(0, 0.0), (1, 1.0), (2, 4.0), ...,
(99997, 9999400009.0), (99998, 9999600004.0), (99999, 9999800001.0)]
However, we can see that the latter approach does not compress as
well. Why? Well, BLZ has machinery for computing 'optimal' chunksizes
depending on the number of entries. For the first case, BLZ can
figure out the number of entries in the final array, but not for the
loop case. You can solve this by passing the final length with the
`expectedlen` argument to the btable constructor::
>>> ct = blz.btable(np.empty(0, dtype="i4,f8"), expectedlen=N)
>>> for i in xrange(N):
...: ct.append((i, i**2))
...:
>>> ct
btable((100000,), |V12) nbytes: 1.14 MB; cbytes: 283.27 KB; ratio: 4.14
bparams := bparams(clevel=5, shuffle=True)
[(0, 0.0), (1, 1.0), (2, 4.0), ...,
(99997, 9999400009.0), (99998, 9999600004.0), (99999, 9999800001.0)]
Okay, the compression ratio is the same now.
Accessing and setting rows
--------------------------
The btable object supports the most common indexing operations in
NumPy::
>>> ct[1]
(1, 1.0)
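>>> type(ct[1])
<type 'numpy.void'>
>>> ct[1:6]
array([(1, 1.0), (2, 4.0), (3, 9.0), (4, 16.0), (5, 25.0)],
dtype=[('f0', '<i4'), ('f1', '<f8')])
Note that, as with barrays, the result of an indexing operation is a
native NumPy object (here, a scalar and a structured array).
Fancy indexing and string conditions over columns are supported too::
>>> ct[[1,6,13]]
array([(1, 1.0), (6, 36.0), (13, 169.0)],
dtype=[('f0', '<i4'), ('f1', '<f8')])
>>> ct["(f0>0) & (f1<10)"]
array([(1, 1.0), (2, 4.0), (3, 9.0)],
dtype=[('f0', '<i4'), ('f1', '<f8')])
Rows can also be set::
>>> ct[1] = (0,0)
>>> ct
btable((100000,), |V12) nbytes: 1.14 MB; cbytes: 279.89 KB; ratio: 4.19
bparams := bparams(clevel=5, shuffle=True)
[(0, 0.0), (0, 0.0), (2, 4.0), ...,
(99997, 9999400009.0), (99998, 9999600004.0), (99999, 9999800001.0)]
>>> ct[1:6] = (0,0)
>>> ct[1:6]
array([(0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0)],
dtype=[('f0', '<i4'), ('f1', '<f8')])
>>> ct[[1,6,13]] = (1,1)
>>> ct[[1,6,13]]
array([(1, 1.0), (1, 1.0), (1, 1.0)],
dtype=[('f0', '<i4'), ('f1', '<f8')])
>>> ct["(f0>=0) & (f1<10)"] = (2,2)
>>> ct[:7]
array([(2, 2.0), (2, 2.0), (2, 2.0), (2, 2.0), (2, 2.0), (2, 2.0),
(6, 36.0)],
dtype=[('f0', '<i4'), ('f1', '<f8')])
Adding and deleting columns
---------------------------
Adding and deleting columns is easy and, thanks to the column-wise
data arrangement, very efficient. Let's add a new column to an
existing btable::
>>> N = 100*1000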
>>> ct = blz.fromiter(((i,i*i) for i in xrange(N)), dtype="i4,f8", count=N)
>>> new_col = np.linspace(0, 1, 100*1000)
>>> ct.addcol(new_col)
>>> ct
btable((100000,), |V20) nbytes: 1.91 MB; cbytes: 528.83 KB; ratio: 3.69
bparams := bparams(clevel=5, shuffle=True)
[(0, 0.0, 0.0), (1, 1.0, 1.000010000100001e-05),
(2, 4.0, 2.000020000200002e-05), ...,
(99997, 9999400009.0, 0.99997999979999797),
(99998, 9999600004.0, 0.99998999989999904), (99999, 9999800001.0, 1.0)]
Now, remove the already existing 'f1' column::
>>> ct.delcol('f1')
>>> ct
btable((100000,), |V12) nbytes: 1.14 MB; cbytes: 318.68 KB; ratio: 3.68
bparams := bparams(clevel=5, shuffle=True)
[(0, 0.0), (1, 1.000010000100001e-05), (2, 2.000020000200002e-05), ...,
(99997, 0.99997999979999797), (99998, 0.99998999989999904), (99999, 1.0)]
As noted, adding and deleting columns is very cheap, so don't be
afraid of using these operations extensively.
Iterating over btable data
--------------------------
You can make use of the `iter()` method in order to easily iterate
over the values of a btable. `iter()` has support for start, stop and
step parameters::
>>> N = 100*1000
>>> ct = blz.fromiter(((i,i*i) for i in xrange(N)), dtype="i4,f8", count=N)
>>> [row for row in ct.iter(1,10,3)]
[row(f0=1, f1=1.0), row(f0=4, f1=16.0), row(f0=7, f1=49.0)]
Note how the data is returned as `namedtuple` objects of type
``row``. This allows you to iterate the fields more easily by using
field names::
>>> [(f0,f1) for f0,f1 in ct.iter(1,10,3)]
[(1, 1.0), (4, 16.0), (7, 49.0)]
You can also use the ``[:]`` accessor to get rid of the ``row``
namedtuple, and return just bare tuples::
>>> [row[:] for row in ct.iter(1,10,3)]
[(1, 1.0), (4, 16.0), (7, 49.0)]
Also, you can select specific fields to be read via the `outcols`
parameter::
>>> [row for row in ct.iter(1,10,3, outcols='f0')]
[row(f0=1), row(f0=4), row(f0=7)]
>>> [(nr,f0) for nr,f0 in ct.iter(1,10,3, outcols='nrow__,f0')]
[(1, 1), (4, 4), (7, 7)]
Please note the use of the special 'nrow__' label for referring to
the current row.
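The behaviour of `iter()` described above (start/stop/step, `outcols`, and the special 'nrow__' label) can be modelled in a few lines of plain Python. The function `iter_rows` below is a hypothetical sketch that operates on a dict of lists; it is not the btable implementation.

```python
from collections import namedtuple

def iter_rows(columns, start=0, stop=None, step=1, outcols=None):
    """Yield namedtuple rows from a dict of equally long columns."""
    if outcols is None:
        names = list(columns)
    else:
        names = [name.strip() for name in outcols.split(",")]
    nrows = len(next(iter(columns.values())))
    Row = namedtuple("row", names)
    for nrow in range(*slice(start, stop, step).indices(nrows)):
        # 'nrow__' is the special label that stands for the row number.
        yield Row(*(nrow if name == "nrow__" else columns[name][nrow]
                    for name in names))

cols = {"f0": list(range(20)), "f1": [float(i * i) for i in range(20)]}
rows = list(iter_rows(cols, 1, 10, 3))
only_f0 = list(iter_rows(cols, 1, 10, 3, outcols="f0"))
numbered = list(iter_rows(cols, 1, 10, 3, outcols="nrow__,f0"))
print(rows[0])
```

Since rows are namedtuples, they unpack like plain tuples and also allow access by field name, as in `rows[0].f1`.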
Iterating over the output of conditions along columns
-----------------------------------------------------
One of the most powerful capabilities of the btable is the ability to
iterate over the rows whose fields fulfill some conditions (without
the need to put the results in a NumPy container, as described in the
"Accessing and setting rows" section above). This can be very useful
for performing operations on very large btables without consuming lots
of storage space.
Here is an example of use::
>>> N = 1000*1000
>>> ct = blz.fromiter(((i,i*i) for i in xrange(N)), dtype="i4,f8", count=N)
>>> [row for row in ct.where("(f0>0) & (f1<10)")]
[row(f0=1, f1=1.0), row(f0=2, f1=4.0), row(f0=3, f1=9.0)]
>>> sum([row.f1 for row in ct.where("(f1>10)")])
3.3333283333312755e+17
And by using the `outcols` parameter, you can specify the fields that
you want to be returned::
>>> [row for row in ct.where("(f0>0) & (f1<10)", "f1")]
[row(f1=1.0), row(f1=4.0), row(f1=9.0)]
You can even specify the row number fulfilling the condition::
>>> [(f1,nr) for f1,nr in ct.where("(f0>0) & (f1<10)", "f1,nrow__")]
[(1.0, 1), (4.0, 2), (9.0, 3)]
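To make the semantics of string conditions concrete, here is a deliberately naive sketch that evaluates the expression row by row with Python's built-in `eval`. This is a swapped-in technique for illustration: the function `where` below is hypothetical, works on a dict of lists, should only be fed trusted expressions, and does not reflect how BLZ evaluates conditions internally.

```python
def where(columns, expression, outcols=None):
    """Yield tuples for the rows whose fields satisfy `expression`."""
    names = list(columns)
    if outcols is None:
        out = names
    else:
        out = [name.strip() for name in outcols.split(",")]
    code = compile(expression, "<condition>", "eval")
    nrows = len(columns[names[0]])
    for nrow in range(nrows):
        fields = {name: columns[name][nrow] for name in names}
        # Column names act as variables; builtins are disabled.
        if eval(code, {"__builtins__": {}}, fields):
            fields["nrow__"] = nrow
            yield tuple(fields[name] for name in out)

cols = {"f0": list(range(100)), "f1": [float(i * i) for i in range(100)]}
hits = list(where(cols, "(f0>0) & (f1<10)"))
numbered = list(where(cols, "(f0>0) & (f1<10)", "f1,nrow__"))
print(hits)
```

Because the function is a generator, matching rows are produced one at a time and no intermediate container is built.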
Performing operations on btable columns
---------------------------------------
The btable object also has an `eval()` method that is handy for
carrying out operations among columns::
>>> ct.eval("cos((3+f0)/sqrt(2*f1))")
barray((1000000,), float64) nbytes: 7.63 MB; cbytes: 2.23 MB; ratio: 3.42
bparams := bparams(clevel=5, shuffle=True)
[nan, -0.951363128126, -0.195699435691, ...,
0.760243218982, 0.760243218983, 0.760243218984]
Here, one can see an exception in the behaviour of btable methods:
the resulting output is a barray, and not a NumPy structured array.
This is because the output of `eval()` has the same length as the
btable, and thus it can be pretty large, so compression may help to
reduce its storage needs.