Optimization tips

Changing explicitly the length of chunks

You may want to use explicitly the chunklen parameter to fine-tune your compression levels:

>>> a = np.arange(1e7)
>>> blz.barray(a)
barray((10000000,), float64)  nbytes: 76.29 MB; cbytes: 2.57 MB; ratio: 29.72
  cparams := cparams(clevel=5, shuffle=True)
[0.0, 1.0, 2.0, ..., 9999997.0, 9999998.0, 9999999.0]
>>> blz.barray(a).chunklen
16384   # 128 KB = 16384 * 8 is the default chunk size for this barray
>>> blz.barray(a, chunklen=512)
barray((10000000,), float64)  nbytes: 76.29 MB; cbytes: 10.20 MB; ratio: 7.48
  cparams := cparams(clevel=5, shuffle=True)
[0.0, 1.0, 2.0, ..., 9999997.0, 9999998.0, 9999999.0]
>>> blz.barray(a, chunklen=8*1024)
barray((10000000,), float64)  nbytes: 76.29 MB; cbytes: 1.50 MB; ratio: 50.88
  cparams := cparams(clevel=5, shuffle=True)
[0.0, 1.0, 2.0, ..., 9999997.0, 9999998.0, 9999999.0]

You see, the length of the chunk affects very much compression levels and the performance of I/O to barrays too.

In general, however, it is safer (and quicker!) to use the expectedlen parameter (see next section).

Informing about the length of your barrays

If you are going to add a lot of rows to your barrays, be sure to use the expectedlen parameter in creating time to inform the constructor about the expected length of your final barray; this allows barray to fine-tune the length of its chunks more easily. For example:

>>> a = np.arange(1e7)
>>> blz.barray(a, expectedlen=10).chunklen
512
>>> blz.barray(a, expectedlen=10*1000).chunklen
4096
>>> blz.barray(a, expectedlen=10*1000*1000).chunklen
16384
>>> blz.barray(a, expectedlen=10*1000*1000*1000).chunklen
131072

Table Of Contents

Previous topic

Library Reference

Next topic

Defaults for BLZ operation

This Page