--------- Tutorials --------- Tutorial on barray objects ========================== Creating barrays ---------------- A barray can be created from any NumPy ndarray by using its `barray` constructor:: >>> import numpy as np >>> a = np.arange(10) >>> import blz >>> b = blz.barray(a) # for in-memory storage >>> c = blz.barray(a, rootdir='mydir') # for on-disk storage Or, you can also create it by using one of its multiple constructors (see :ref:`top-level-constructors` for the complete list):: >>> d = blz.arange(10, rootdir='mydir') Please note that BLZ allows to create disk-based arrays by just specifying the `rootdir` parameter in all its constructors. Disk-based arrays fully support all the operations of in-memory counterparts, so depending on your needs, you may want to use one or another (or even a combination of both). Now, `b` is a barray object. Just check this:: >>> type(b) You can have a peek at it by using its string form:: >>> print b [0, 1, 2... 7, 8, 9] And get more info about uncompressed size (nbytes), compressed (cbytes) and the compression ratio (ratio = nbytes/cbytes), by using its representation form:: >>> b # <==> print repr(b) barray((10,), int64) nbytes: 80; cbytes: 4.00 KB; ratio: 0.02 bparams := bparams(clevel=5, shuffle=True) [0 1 2 3 4 5 6 7 8 9] As you can see, the compressed size is much larger than the uncompressed one. How this can be? Well, it turns out that barray wears an I/O buffer for accelerating some internal operations. So, for small arrays (typically those taking less than 1 MB), there is little point in using a barray. However, when creating barrays larger than 1 MB (its natural scenario), the size of the I/O buffer is generally negligible in comparison:: >>> b = blz.arange(1e8) >>> b barray((100000000,), float64) nbytes: 762.94 MB; cbytes: 23.38 MB; ratio: 32.63 bparams := bparams(clevel=5, shuffle=True) [0.0, 1.0, 2.0, ..., 99999997.0, 99999998.0, 99999999.0] The barray consumes less than 24 MB, while the original data would have taken more than 760 MB; that's a huge gain. You can always get a hint on how much space it takes your barray by using `sys.getsizeof()`:: >>> import sys >>> sys.getsizeof(b) 24520482 That moral here is that you can create very large arrays without the need to create a NumPy array first (that may not fit in memory). Finally, you can get a copy of your created barrays by using the `copy()` method:: >>> c = b.copy() >>> c barray((100000000,), float64) nbytes: 762.94 MB; cbytes: 23.38 MB; ratio: 32.63 bparams := bparams(clevel=5, shuffle=True) [0.0, 1.0, 2.0, ..., 99999997.0, 99999998.0, 99999999.0] and you can control parameters for the newly created copy:: >>> b.copy(bparams=blz.bparams(clevel=9)) barray((100000000,), float64) nbytes: 762.94 MB; cbytes: 8.22 MB; ratio: 92.78 bparams := bparams(clevel=9, shuffle=True) [0.0, 1.0, 2.0, ..., 99999997.0, 99999998.0, 99999999.0] Enlarging your barray --------------------- One of the nicest features of barray objects is that they can be enlarged very efficiently. This can be done via the `barray.append()` method. For example, if `b` is a barray with 10 million elements:: >>> b barray((10000000,), float64) nbytes: 80000000; cbytes: 2691722; ratio: 29.72 bparams := bparams(clevel=5, shuffle=True) [0.0, 1.0, 2.0... 9999997.0, 9999998.0, 9999999.0] it can be enlarged by 10 elements with:: >>> b.append(np.arange(10.)) >>> b barray((10000010,), float64) nbytes: 80000080; cbytes: 2691722; ratio: 29.72 bparams := bparams(clevel=5, shuffle=True) [0.0, 1.0, 2.0... 7.0, 8.0, 9.0] Let's check how fast appending can be:: >>> a = np.arange(1e7) >>> b = blz.arange(1e7) >>> %time b.append(a) CPU times: user 0.06 s, sys: 0.00 s, total: 0.06 s Wall time: 0.06 s >>> %time np.concatenate((a, a)) CPU times: user 0.08 s, sys: 0.04 s, total: 0.12 s Wall time: 0.12 s # 2x slower than BLZ array([ 0.00000000e+00, 1.00000000e+00, 2.00000000e+00, ..., 9.99999700e+06, 9.99999800e+06, 9.99999900e+06]) This is specially true when appending small bits to large arrays:: >>> b = blz.barray(a) >>> %timeit b.append(np.arange(1e1)) 100000 loops, best of 3: 3.17 µs per loop >>> %timeit np.concatenate((a, np.arange(1e1))) 10 loops, best of 3: 64 ms per loop # 2000x slower than BLZ You can also enlarge your arrays by using the `resize()` method:: >>> b = blz.arange(10) >>> b.resize(20) >>> b barray((20,), int64) nbytes: 160; cbytes: 4.00 KB; ratio: 0.04 bparams := bparams(clevel=5, shuffle=True) [0 1 2 3 4 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0] Note how the append values are filled with zeros. This is because the default value for filling is 0. But you can choose a different value too:: >>> b = blz.arange(10, dflt=1) >>> b.resize(20) >>> b barray((20,), int64) nbytes: 160; cbytes: 4.00 KB; ratio: 0.04 bparams := bparams(clevel=5, shuffle=True) [0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1] Also, you can trim barrays:: >>> b = blz.arange(10) >>> b.resize(5) >>> b barray((5,), int64) nbytes: 40; cbytes: 4.00 KB; ratio: 0.01 bparams := bparams(clevel=5, shuffle=True) [0 1 2 3 4] You can even set the size to 0: >>> b.resize(0) >>> len(b) 0 Definitely, resizing is one of the strongest points of BLZ objects, so do not be afraid to use that feature extensively. Compression level and shuffle filter ------------------------------------ BLZ uses Blosc as the internal compressor, and Blosc can be directed to use different compression levels and to use (or not) its internal shuffle filter. The shuffle filter is a way to improve compression when using items that have type sizes > 1 byte, although it might be counter-productive (very rarely) for some data distributions. By default barrays are compressed using Blosc with compression level 5 with shuffle active. But depending on you needs, you can use other compression levels too:: >>> blz.barray(a, blz.bparams(clevel=1)) barray((10000000,), float64) nbytes: 76.29 MB; cbytes: 9.88 MB; ratio: 7.72 bparams := bparams(clevel=1, shuffle=True) [0.0, 1.0, 2.0, ..., 9999997.0, 9999998.0, 9999999.0] >>> blz.barray(a, blz.bparams(clevel=9)) barray((10000000,), float64) nbytes: 76.29 MB; cbytes: 1.11 MB; ratio: 68.60 bparams := bparams(clevel=9, shuffle=True) [0.0, 1.0, 2.0, ..., 9999997.0, 9999998.0, 9999999.0] Also, you can decide if you want to disable the shuffle filter that comes with Blosc:: >>> blz.barray(a, blz.bparams(shuffle=False)) barray((10000000,), float64) nbytes: 80000000; cbytes: 38203113; ratio: 2.09 bparams := bparams(clevel=5, shuffle=False) [0.0, 1.0, 2.0... 9999997.0, 9999998.0, 9999999.0] but, as can be seen, the compression ratio is much worse in this case. In general it is recommend to let shuffle active (unless you are fine-tuning the performance for an specific size of a barray). See :ref:`opt-tips` chapter for info on how you can change other internal parameters like the size of the chunk. Accessing BLZ objects data -------------------------- The way to access BLZ data is very similar to the NumPy indexing scheme, and in fact, supports all the indexing methods supported by NumPy. Specifying an index or slice:: >>> a = np.arange(10) >>> b = blz.barray(a) >>> b[0] 0 >>> b[-1] 9 >>> b[2:4] array([2, 3]) >>> b[::2] array([0, 2, 4, 6, 8]) >>> b[3:9:3] array([3, 6]) Note that NumPy objects are returned as the result of an indexing operation. This is on purpose because normally NumPy objects are more featured and flexible (specially if they are small). In fact, a handy way to get a NumPy array out of a barray object is asking for the complete range:: >>> b[:] array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) Fancy indexing is supported too. For example, indexing with boolean arrays gives:: >>> barr = np.array([True]*5+[False]*5) >>> b[barr] array([0, 1, 2, 3, 4]) >>> b[blz.barray(barr)] array([0, 1, 2, 3, 4]) Or, with a list of indices:: >>> b[[2,3,0,2]] array([2, 3, 0, 2]) >>> b[blz.barray([2,3,0,2])] array([2, 3, 0, 2]) Querying barrays ---------------- barrays can be queried in different ways. The most easy (yet powerful) way is by using its set of iterators:: >>> a = np.arange(1e7) >>> b = blz.barray(a) >>> %time sum(v for v in a if v < 10) CPU times: user 7.44 s, sys: 0.00 s, total: 7.45 s Wall time: 7.57 s 45.0 >>> %time sum(v for v in b if v < 10) CPU times: user 0.89 s, sys: 0.00 s, total: 0.90 s Wall time: 0.93 s # 8x faster than NumPy 45.0 The iterator also has support for looking into slices of the array:: >>> %time sum(v for v in b.iter(start=2, stop=20, step=3) if v < 10) CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s Wall time: 0.00 s 15.0 >>> %timeit sum(v for v in b.iter(start=2, stop=20, step=3) if v < 10) 10000 loops, best of 3: 121 µs per loop See that the time taken in this case is much shorter because the slice to do the lookup is much shorter too. Also, you can quickly retrieve the indices of a boolean barray that have a true value:: >>> barr = blz.eval("b<10") # see 'Operating with barrays' section below >>> [i for i in barr.wheretrue()] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] >>> %timeit [i for i in barr.wheretrue()] 1000 loops, best of 3: 1.06 ms per loop And get the values where a boolean array is true:: >>> [i for i in b.where(barr)] [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0] >>> %timeit [i for i in b.where(barr)] 1000 loops, best of 3: 1.59 ms per loop Note how `wheretrue` and `where` iterators are really fast. They are also very powerful. For example, they support `limit` and `skip` parameters for limiting the number of elements returned and skipping the leading elements respectively:: >>> [i for i in barr.wheretrue(limit=5)] [0, 1, 2, 3, 4] >>> [i for i in barr.wheretrue(skip=3)] [3, 4, 5, 6, 7, 8, 9] >>> [i for i in barr.wheretrue(limit=5, skip=3)] [3, 4, 5, 6, 7] The advantage of the barray iterators is that you can use them in generator contexts and hence, you don't need to waste memory for creating temporaries, which can be important when dealing with large arrays. We have seen that this iterator toolset is very fast, so try to express your problems in a way that you can use them extensively. Modifying barrays ----------------- Although it is a somewhat slow operation, barrays can be modified too. You can do it by specifying scalar or slice indices:: >>> a = np.arange(10) >>> b = blz.arange(10) >>> b[1] = 10 >>> print b [ 0 10 2 3 4 5 6 7 8 9] >>> b[1:4] = 10 >>> print b [ 0 10 10 10 4 5 6 7 8 9] >>> b[1::3] = 10 >>> print b [ 0 10 10 10 10 5 6 10 8 9] Modification by using fancy indexing is supported too:: >>> barr = np.array([True]*5+[False]*5) >>> b[barr] = -5 >>> print b [-5 -5 -5 -5 -5 5 6 10 8 9] >>> b[[1,2,4,1]] = -10 >>> print b [ -5 -10 -10 -5 -10 5 6 10 8 9] However, you must be aware that modifying a barray is expensive:: >>> a = np.arange(1e7) >>> b = blz.barray(a) >>> %timeit a[2] = 3 10000000 loops, best of 3: 101 ns per loop >>> %timeit b[2] = 3 10000 loops, best of 3: 161 µs per loop # 1600x slower than NumPy although modifying values in latest chunk is somewhat more cheaper:: >>> %timeit a[-1] = 3 10000000 loops, best of 3: 102 ns per loop >>> %timeit b[-1] = 3 10000 loops, best of 3: 42.9 µs per loop # 420x slower than NumPy In general, you should avoid modifications (if you can) when using barrays. Multidimensional barrays ------------------------ You can create multidimensional barrays too. Look at this example:: >>> a = blz.zeros((2,3)) barray((2, 3), float64) nbytes: 48; cbytes: 3.98 KB; ratio: 0.01 bparams := bparams(clevel=5, shuffle=True) [[ 0. 0. 0.] [ 0. 0. 0.]] So, you can access any element in any dimension:: >>> a[1] array([ 0., 0., 0.]) >>> a[1,::2] array([ 0., 0.]) >>> a[1,1] 0.0 As you see, multidimensional barrays support the same multidimensional indexes than its NumPy counterparts. Also, you can use the `reshape()` method to set your desired shape to an existing barray:: >>> b = blz.arange(12).reshape((3,4)) >>> b barray((3,), ('int64',(4,))) nbytes: 96; cbytes: 4.00 KB; ratio: 0.02 bparams := bparams(clevel=5, shuffle=True) [[ 0 1 2 3] [ 4 5 6 7] [ 8 9 10 11]] Iterators loop over the leading dimension:: >>> [r for r in b] [array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([ 8, 9, 10, 11])] And you can select columns there by using another indirection level:: >>> [r[2] for r in b] [2, 6, 10] Above, the third column has been selected. Although for this case the indexing is easier:: >>> b[:,2] array([ 2, 6, 10]) the iterator approach typically consumes less memory resources. Operating with barrays ---------------------- Right now, you cannot operate with barrays directly (although that might be implemented in Blaze itself):: >>> x = blz.arange(1e7) >>> x + x TypeError: unsupported operand type(s) for +: 'blz.blz_ext.barray' and 'blz.blz_ext.barray' Rather, you should use the `eval` function:: >>> y = blz.eval("x + x") >>> y barray((10000000,), float64) nbytes: 76.29 MB; cbytes: 2.64 MB; ratio: 28.88 bparams := bparams(clevel=5, shuffle=True) [0.0, 2.0, 4.0, ..., 19999994.0, 19999996.0, 19999998.0] You can also compute arbitrarily complex expressions in one shot:: >>> y = blz.eval(".5*x**3 + 2.1*x**2") >>> y barray((10000000,), float64) nbytes: 76.29 MB; cbytes: 38.00 MB; ratio: 2.01 bparams := bparams(clevel=5, shuffle=True) [0.0, 2.6, 12.4, ..., 4.9999976e+20, 4.9999991e+20, 5.0000006e+20] Note how the output of `eval()` is also a barray object. You can pass other parameters of the barray constructor too. Let's force maximum compression for the output:: >>> y = blz.eval(".5*x**3 + 2.1*x**2", bparams=blz.bparams(9)) >>> y barray((10000000,), float64) nbytes: 76.29 MB; cbytes: 35.66 MB; ratio: 2.14 bparams := bparams(clevel=9, shuffle=True) [0.0, 2.6, 12.4, ..., 4.9999976e+20, 4.9999991e+20, 5.0000006e+20] By default, `eval` will use Numexpr virtual machine if it is installed and if not, it will default to use the Python one (via NumPy). You can use the `vm` parameter to select the desired virtual machine ("numexpr" or "python"):: >>> %timeit blz.eval(".5*x**3 + 2.1*x**2", vm="numexpr") 10 loops, best of 3: 303 ms per loop >>> %timeit blz.eval(".5*x**3 + 2.1*x**2", vm="python") 10 loops, best of 3: 1.9 s per loop As can be seen, using the "numexpr" virtual machine is generally (much) faster, but there are situations that the "python" one is desirable because it offers much more functionality:: >>> blz.eval("diff(x)", vm="numexpr") NameError: variable name ``diff`` not found >>> blz.eval("np.diff(x)", vm="python") barray((9999389,), float64) nbytes: 76.29 MB; cbytes: 814.25 KB; ratio: 95.94 bparams := bparams(clevel=5, shuffle=True) [1.0, 1.0, 1.0, ..., 1.0, 1.0, 1.0] Finally, `eval` lets you select the type of the outcome to be a NumPy array by using the `out_flavor` argument:: >>> blz.eval("x**3", out_flavor="numpy") array([ 0.00000000e+00, 1.00000000e+00, 8.00000000e+00, ..., 9.99999100e+20, 9.99999400e+20, 9.99999700e+20]) For setting permanently your own defaults for the `vm` and `out_flavors`, see :ref:`blz-defaults` chapter. barray metadata --------------- barray implements several attributes, like `dtype`, `shape` and `ndim` that makes it to 'quack' like a NumPy array:: >>> a = np.arange(1e7) >>> b = blz.barray(a) >>> b.dtype dtype('float64') >>> b.shape (10000000,) In addition, it implements the `cbytes` attribute that tells how many bytes in memory (or on-disk) uses the barray object:: >>> b.cbytes 2691722 This figure is approximate and it is generally lower than the original (uncompressed) datasize can be accessed by using `nbytes` attribute:: >>> b.nbytes 80000000 which is the same than the equivalent NumPy array:: >>> a.size*a.dtype.itemsize 80000000 For knowing the compression level used and other optional filters, use the `bparams` read-only attribute:: >>> b.bparams bparams(clevel=5, shuffle=True) Also, you can check which the default value is (remember, used when `resize` -ing the barray):: >>> b.dflt 0.0 You can access the `chunklen` (the length for each chunk) for this barray:: >>> b.chunklen 16384 For a complete list of public attributes of barray, see section on :ref:`barray-attributes`. .. _barray-attrs: barray user attrs ----------------- Besides the regular attributes like `shape`, `dtype` or `chunklen`, there is another set of attributes that can be added (and removed) by the user in another name space. This space is accessible via the special `attrs` attribute:: >>> a = blz.barray([1,2], rootdir='mydata') >>> a.attrs *no attrs* As you see, by default there are no attributes attached to `attrs`. Also, notice that the barray that we have created is persistent and stored on the 'mydata' directory. Let's add one attribute here:: >>> a.attrs['myattr'] = 234 >>> a.attrs myattr : 234 So, we have attached the 'myattr' attribute with the value 234. Let's add a couple of attributes more:: >>> a.attrs['temp'] = 23 >>> a.attrs['unit'] = 'Celsius' >>> a.attrs unit : 'Celsius' myattr : 234 temp : 23 good, we have three of them now. You can attach as many as you want, and the only current limitation is that they have to be serializable via JSON. As the 'a' barray is persistent, it can re-opened in other Python session:: >>> a.flush() >>> ^D $ python Python 2.7.3rc2 (default, Apr 22 2012, 22:30:17) [GCC 4.6.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import blz >>> a = blz.open(rootdir="mydata") >>> a # yeah, our data is back barray((2,), int64) nbytes: 16; cbytes: 4.00 KB; ratio: 0.00 bparams := bparams(clevel=5, shuffle=True) rootdir := 'mydata' [1 2] >>> a.attrs # and so is user attrs! temp : 23 myattr : 234 unit : u'Celsius' Now, let's remove a couple of user attrs:: >>> del a.attrs['myattr'] >>> del a.attrs['unit'] >>> a.attrs temp : 23 So, it is really easy to make use of this feature so as to complement your data with (potentially persistent) metadata of your choice. Of course, the `btable` object also wears this capability. Tutorial on btable objects ========================== The BLZ package comes with a handy object that arranges data by column (and not by row, as in NumPy's structured arrays). This allows for much better performance for walking tabular data by column and also for adding and deleting columns. Creating a btable ----------------- You can build btable objects in many different ways, but perhaps the easiest one is using the `fromiter` constructor:: >>> N = 100*1000 >>> ct = blz.fromiter(((i,i*i) for i in xrange(N)), dtype="i4,f8", count=N) >>> ct btable((100000,), |V12) nbytes: 1.14 MB; cbytes: 283.27 KB; ratio: 4.14 bparams := bparams(clevel=5, shuffle=True) [(0, 0.0), (1, 1.0), (2, 4.0), ..., (99997, 9999400009.0), (99998, 9999600004.0), (99999, 9999800001.0)] You can also build an empty btable first and the append data:: >>> ct = blz.btable(np.empty(0, dtype="i4,f8")) >>> for i in xrange(N): ...: ct.append((i, i**2)) ...: >>> ct btable((100000,), |V12) nbytes: 1.14 MB; cbytes: 355.48 KB; ratio: 3.30 bparams := bparams(clevel=5, shuffle=True) [(0, 0.0), (1, 1.0), (2, 4.0), ..., (99997, 9999400009.0), (99998, 9999600004.0), (99999, 9999800001.0)] However, we can see how the latter approach does not compress as well. Why? Well, BLZ has machinery for computing 'optimal' chunksizes depending on the number of entries. For the first case, BLZ can figure out the number of entries in final array, but not for the loop case. You can solve this by passing the final length with the `expectedlen` argument to the btable constructor:: >>> ct = blz.btable(np.empty(0, dtype="i4,f8"), expectedlen=N) >>> for i in xrange(N): ...: ct.append((i, i**2)) ...: >>> ct btable((100000,), |V12) nbytes: 1.14 MB; cbytes: 283.27 KB; ratio: 4.14 bparams := bparams(clevel=5, shuffle=True) [(0, 0.0), (1, 1.0), (2, 4.0), ..., (99997, 9999400009.0), (99998, 9999600004.0), (99999, 9999800001.0)] Okay, the compression ratio is the same now. Accessing and setting rows -------------------------- The btable object supports the most common indexing operations in NumPy:: >>> ct[1] (1, 1.0) >>> type(ct[1]) >>> ct[1:6] array([(1, 1.0), (2, 4.0), (3, 9.0), (4, 16.0), (5, 25.0)], dtype=[('f0', '>> ct[[1,6,13]] array([(1, 1.0), (6, 36.0), (13, 169.0)], dtype=[('f0', '>> ct["(f0>0) & (f1<10)"] array([(1, 1.0), (2, 4.0), (3, 9.0)], dtype=[('f0', '>> ct[1] = (0,0) >>> ct btable((100000,), |V12) nbytes: 1.14 MB; cbytes: 279.89 KB; ratio: 4.19 bparams := bparams(clevel=5, shuffle=True) [(0, 0.0), (0, 0.0), (2, 4.0), ..., (99997, 9999400009.0), (99998, 9999600004.0), (99999, 9999800001.0)] >>> ct[1:6] array([(0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0)], dtype=[('f0', '>> ct[[1,6,13]] = (1,1) >>> ct[[1,6,13]] array([(1, 1.0), (1, 1.0), (1, 1.0)], dtype=[('f0', '>> ct["(f0>=0) & (f1<10)"] = (2,2) >>> ct[:7] array([(2, 2.0), (2, 2.0), (2, 2.0), (2, 2.0), (2, 2.0), (2, 2.0), (6, 36.0)], dtype=[('f0', '>> N = 100*1000 >>> ct = blz.fromiter(((i,i*i) for i in xrange(N)), dtype="i4,f8", count=N) >>> new_col = np.linspace(0, 1, 100*1000) >>> ct.addcol(new_col) >>> ct btable((100000,), |V20) nbytes: 1.91 MB; cbytes: 528.83 KB; ratio: 3.69 bparams := bparams(clevel=5, shuffle=True) [(0, 0.0, 0.0), (1, 1.0, 1.000010000100001e-05), (2, 4.0, 2.000020000200002e-05), ..., (99997, 9999400009.0, 0.99997999979999797), (99998, 9999600004.0, 0.99998999989999904), (99999, 9999800001.0, 1.0)] Now, remove the already existing 'f1' column:: >>> ct.delcol('f1') >>> ct btable((100000,), |V12) nbytes: 1.14 MB; cbytes: 318.68 KB; ratio: 3.68 bparams := bparams(clevel=5, shuffle=True) [(0, 0.0), (1, 1.000010000100001e-05), (2, 2.000020000200002e-05), ..., (99997, 0.99997999979999797), (99998, 0.99998999989999904), (99999, 1.0)] As said, adding and deleting columns is very cheap, so don't be afraid of using them extensively. Iterating over btable data -------------------------- You can make use of the `iter()` method in order to easily iterate over the values of a btable. `iter()` has support for start, stop and step parameters:: >>> N = 100*1000 >>> t = blz.fromiter(((i,i*i) for i in xrange(N)), dtype="i4,f8", count=N) >>> [row for row in ct.iter(1,10,3)] [row(f0=1, f1=1.0), row(f0=4, f1=16.0), row(f0=7, f1=49.0)] Note how the data is returned as `namedtuple` objects of type ``row``. This allows you to iterate the fields more easily by using field names:: >>> [(f0,f1) for f0,f1 in ct.iter(1,10,3)] [(1, 1.0), (4, 16.0), (7, 49.0)] You can also use the ``[:]`` accessor to get rid of the ``row`` namedtuple, and return just bare tuples:: >>> [row[:] for row in ct.iter(1,10,3)] [(1, 1.0), (4, 16.0), (7, 49.0)] Also, you can select specific fields to be read via the `outcols` parameter:: >>> [row for row in ct.iter(1,10,3, outcols='f0')] [row(f0=1), row(f0=4), row(f0=7)] >>> [(nr,f0) for nr,f0 in ct.iter(1,10,3, outcols='nrow__,f0')] [(1, 1), (4, 4), (7, 7)] Please note the use of the special 'nrow__' label for referring to the current row. Iterating over the output of conditions along columns ----------------------------------------------------- One of the most powerful capabilities of the btable is the ability to iterate over the rows whose fields fulfill some conditions (without the need to put the results in a NumPy container, as described in the "Accessing and setting rows" section above). This can be very useful for performing operations on very large btables without consuming lots of storage space. Here it is an example of use:: >>> N = 100*1000 >>> t = blz.fromiter(((i,i*i) for i in xrange(N)), dtype="i4,f8", count=N) >>> [row for row in ct.where("(f0>0) & (f1<10)")] [row(f0=1, f1=1.0), row(f0=2, f1=4.0), row(f0=3, f1=9.0)] >>> sum([row.f1 for row in ct.where("(f1>10)")]) 3.3333283333312755e+17 And by using the `outcols` parameter, you can specify the fields that you want to be returned:: >>> [row for row in ct.where("(f0>0) & (f1<10)", "f1")] [row(f1=1.0), row(f1=4.0), row(f1=9.0)] You can even specify the row number fulfilling the condition:: >>> [(f1,nr) for f1,nr in ct.where("(f0>0) & (f1<10)", "f1,nrow__")] [(1.0, 1), (4.0, 2), (9.0, 3)] Performing operations on btable columns --------------------------------------- The btable object also wears an `eval()` method that is handy for carrying out operations among columns:: >>> ct.eval("cos((3+f0)/sqrt(2*f1))") barray((1000000,), float64) nbytes: 7.63 MB; cbytes: 2.23 MB; ratio: 3.42 bparams := bparams(clevel=5, shuffle=True) [nan, -0.951363128126, -0.195699435691, ..., 0.760243218982, 0.760243218983, 0.760243218984] Here, one can see an exception in btable methods behaviour: the resulting output is a btable, and not a NumPy structured array. This is so because the output of `eval()` is of the same length than the btable, and thus it can be pretty large, so compression maybe of help to reduce its storage needs. ## Local Variables: ## fill-column: 72 ## End: