==========================
Persistent format for BLZ
==========================

Introduction
============

BLZ is designed to work with data that is both in memory and on disk in
a transparent way.  BLZ is also the format that is used internally to
persist data on disk (although it supports in-memory storage too).

The goals of the BLZ format are:

1. Allow working with data directly on disk, in exactly the same way as
   data in memory.

2. The persistence layer should support the following access
   capabilities: modifying, appending and removing data, as well as
   direct access to data (in the same way as RAM).

3. Transparent data compression must be possible.

4. User metadata addition must be possible too.

5. Data objects are allowed to be enlarged or shrunk.

6. Data is not allowed to be modified.

7. And last but not least, the data should be easily 'shardable' for
   optimal behavior in distributed storage.  Providing a format that is
   already 'sharded' by default represents a big advantage when
   spreading a BLZ object among different nodes.

These points, in combination with a distributed filesystem and a system
that is aware of the physical topology of the underlying
infrastructure, largely avoid the need for a Disco/Hadoop
infrastructure, while permitting much better flexibility and
performance.

The low-level description of the BLZ format follows.  It must be noted
that the implementation is almost complete, except that superchunks are
not there yet.

The BLZ format
==============

The data files are made of a series of chunks put together using the
Blosc metacompressor by default.  Being a metacompressor, Blosc can use
different compressors and filters, while leveraging its blocking and
multithreading capabilities.

The layout
----------

For every dataset, a directory will be created, with a user-provided
name that, for generality, we will call `root` here.  The root will
contain another couple of subdirectories, named `data` and `meta`::

        root  (the name of the dataset)
         /  \
      data  meta

The `data` directory will contain the actual data of the dataset, while
`meta` will contain the metainformation (dtype, shape, chunkshape,
compression level, filters...).

The `data` layout
-----------------

Data will be stored in what is called a `superchunk`, and each
superchunk will use exactly one file.  The size of each superchunk will
be decided automatically by default, but it can be specified by the
user too.

The `data` directory (and each of the subdirectories described below)
will contain one or more superchunks for storing the actual data.
Every data superchunk will be named after its sequential number.  For
example::

    $ ls data
    __1__.bin __2__.bin __3__.bin __4__.bin ... __1030__.bin

This structure of separate superchunk files allows for two things:

1. Datasets can be enlarged and shrunk very easily.

2. Horizontal sharding in a distributed system is possible (and
   cheap!).

In turn, the `data` directory might contain other subdirectories that
are meant for storing the components of a 'nested' dtype (i.e. a
structured array, stored in column-wise order)::

      data  (the root for a nested datatype)
      /   \     \
   col1   col2   col3
   /  \
 sc1   sc3

This structure allows for quick access to specific chunks of columns
without the need to load the complete dataset in memory.
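
For illustration purposes only, the following Python sketch shows how
the sequential naming scheme makes locating a chunk a matter of pure
arithmetic.  It assumes, for simplicity, that every superchunk holds
the same number of chunks; the ``superchunk_path`` helper and its
``chunks_per_superchunk`` parameter are hypothetical, not part of
BLZ::

    import os.path

    def superchunk_path(root, nchunk, chunks_per_superchunk):
        # Superchunk files are named after their sequential number
        # (__1__.bin, __2__.bin, ...), so no index needs to be read.
        sc_number = nchunk // chunks_per_superchunk + 1  # numbering starts at 1
        local_chunk = nchunk % chunks_per_superchunk     # chunk within the superchunk
        fname = os.path.join(root, "data", "__%d__.bin" % sc_number)
        return fname, local_chunk

    # With 512 chunks per superchunk, global chunk 1000 lives in
    # data/__2__.bin as local chunk 488:
    print(superchunk_path("myarray", 1000, 512))

Because the mapping needs no central index, a distributed system can
place the ``__N__.bin`` files on different nodes and still locate any
chunk cheaply.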

The `superchunk` layout
-----------------------

Here is how the superchunks are laid out.  It is worth mentioning that
this format is based on the Bloscpack format [1]_ and that it will
continue to evolve in the near future.

.. [1] https://github.com/esc/bloscpack

Header format
~~~~~~~~~~~~~

The design goals of the header format are to contain as much
information as possible to achieve interesting things in the future and
to be as general as possible so as to ensure compatibility with the
chunked persistence format.

The following ASCII representation shows the layout of the header::

    |-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
    | b   l   p   k | ^ | ^ | ^ | ^ |   chunk-size  |  last-chunk   |
                      |   |   |   |
          version ----+   |   |   |
          options --------+   |   |
          checksum -----------+   |
          typesize ---------------+

    |-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
    |            nchunks            |   meta-size   |   RESERVED    |

The first 4 bytes are the magic string ``blpk``.  Then there are 4
bytes holding the ``version``, ``options``, ``checksum`` and
``typesize`` entries, which are described below.  This is followed by 4
bytes for the ``chunk-size``, another 4 bytes for the
``last-chunk-size`` and 8 bytes for the number of chunks.  Then the
``meta-size`` accounts for the number of bytes that the stored metadata
takes.  The last 4 bytes are reserved for use in future versions of the
format.

Effectively, storing the number of chunks as a signed 8-byte integer
limits the number of chunks to ``2**63-1 = 9223372036854775807``, but
this should not be relevant in practice since, even with the moderate
default value of ``1MB`` for the chunk-size, we can still store files
as large as ``8ZB`` (!).  Given that in 2012 the maximum size of a
single file in the Zettabyte File System (ZFS) is ``16EB``, Bloscpack
should be safe for a few more years.

Description of the header entries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All entries are little-endian.

:version: (``uint8``) Format version of the Bloscpack header, to
    ensure exceptions in case of forward incompatibilities.

:options: (``bitfield``) A bitfield which allows for setting certain
    options in this file.

    :``bit 0 (0x01)``: If the offsets to the chunks are present in
        this file.
    :``bit 1 (0x02)``: If metadata is present in this file.

:checksum: (``uint8``) The checksum used.  The following checksums,
    available in the Python standard library, should be supported.
    The checksum is always computed on the compressed data and placed
    after the chunk.

    :``0``: ``no checksum``
    :``1``: ``zlib.adler32``
    :``2``: ``zlib.crc32``
    :``3``: ``hashlib.md5``
    :``4``: ``hashlib.sha1``
    :``5``: ``hashlib.sha224``
    :``6``: ``hashlib.sha256``
    :``7``: ``hashlib.sha384``
    :``8``: ``hashlib.sha512``

:typesize: (``uint8``) The typesize of the data in the chunks.
    Currently, the typesize is assumed to be uniform.  The space
    allocated is the same as in the Blosc header.

:chunk-size: (``int32``) Denotes the chunk-size.  Since the maximum
    buffer size of Blosc is 2GB (``2GB = 2**31 bytes``), a signed
    32-bit int is enough.  The special value ``-1`` denotes that the
    chunk-size is unknown or possibly non-uniform.

:last-chunk: (``int32``) Denotes the size of the last chunk.  As with
    ``chunk-size``, an ``int32`` is enough.  Again, ``-1`` denotes
    that this value is unknown.

:nchunks: (``int64``) The total number of chunks used in the file.
    Given a chunk-size of one byte, the total number of chunks is
    ``2**63``.  This amounts to a maximum file-size of 8EB (``8EB =
    2**63 bytes``), which should be enough for the next couple of
    years.  Again, ``-1`` denotes that the number of chunks is
    unknown.

The overall file-size can be computed as ``chunk-size * (nchunks - 1) +
last-chunk-size``.  In a streaming scenario, ``-1`` can be used as a
placeholder, for example when the total number of chunks or the size of
the last chunk is not known at the time the header is created.
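
To make the byte layout concrete, here is a minimal Python sketch,
using only the standard ``struct`` module, of how such a header could
be packed and unpacked.  It is just an illustration of the description
above, not the reference implementation::

    import struct

    MAGIC = b'blpk'
    # Little-endian: magic, version, options, checksum, typesize,
    # chunk-size (int32), last-chunk (int32), nchunks (int64),
    # meta-size (int32) and 4 reserved (pad) bytes.
    HEADER_FMT = '<4s4Biiqi4x'
    HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 32 bytes

    def pack_header(version=1, options=0, checksum=0, typesize=8,
                    chunk_size=-1, last_chunk=-1, nchunks=-1, meta_size=0):
        return struct.pack(HEADER_FMT, MAGIC, version, options, checksum,
                           typesize, chunk_size, last_chunk, nchunks,
                           meta_size)

    def unpack_header(buf):
        (magic, version, options, checksum, typesize, chunk_size,
         last_chunk, nchunks, meta_size) = struct.unpack(HEADER_FMT, buf)
        if magic != MAGIC:
            raise ValueError("not a Bloscpack header: %r" % magic)
        return dict(version=version, options=options, checksum=checksum,
                    typesize=typesize, chunk_size=chunk_size,
                    last_chunk=last_chunk, nchunks=nchunks,
                    meta_size=meta_size)

Note how the ``-1`` placeholders can later be filled in by overwriting
the corresponding bytes in place, which is what makes the streaming
scenario above possible.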

Description of the metadata section
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section goes after the header, and it is just a JSON-serialized
version of the metadata that is to be saved.  As JSON, like any other
serializer, has its limitations, only a subset of Python structures can
be stored, so some additional object handling may have to be done prior
to serializing some metadata.

Example of stored metadata::

    {"dtype": "float64", "shape": [1024], "others": []}

Description of the offsets entries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Offsets of the chunks into the file are to be used for accelerated
seeking.  The offsets (if activated) follow the metadata section.  Each
offset is a 64-bit signed little-endian integer (``int64``).  A value
of ``-1`` denotes an unknown offset.  Initially, all offsets should be
initialized to ``-1`` and filled in after writing all chunks.  Thus, if
the compression of the file fails prematurely or is aborted, all
offsets should have the value ``-1``.  Each offset denotes the exact
position of its chunk in the file, such that seeking to the offset
positions the file pointer at the start of the desired chunk, where
reading the next 16 bytes gives the Blosc header.

The layout of the file is then::

    |-bloscpack-header-|-metadata-|-offset-|-offset-|...|-chunk-|-chunk-|...|

Description of the chunk format
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The header of a Blosc chunk has this format (Blosc 1.0 on)::

    |-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
      ^   ^   ^   ^ |     nbytes    |   blocksize   |    ctbytes    |
      |   |   |   |
      |   |   |   +--typesize
      |   |   +------flags
      |   +----------blosclz version
      +--------------blosc version

Following the header comes the compressed data itself.  Blosc ensures
that the compressed buffer will not take more space than the original
one plus 16 bytes (the length of the header).

At the end of each Blosc chunk some empty space can be added (this can
be parametrized) in order to allow the modification of some data
elements inside each block.  The reason for this additional space is
that, as these chunks will typically be compressed, when modifying some
element of the chunk it is not guaranteed that the resulting chunk will
fit in the same space as the old one.  Having this provision of a small
empty space at the end of each chunk allows storing the modified chunk
in place in many cases, without the need to save the entire file on a
different part of the disk.

Overhead
~~~~~~~~

Depending on which configuration is used for the file, a constant or
linear overhead may be added to it.  The Bloscpack header adds 32 bytes
in any case.  If the data is non-compressible, Blosc will add 16 bytes
of header to each chunk.

If used, both the checksum and the offsets will add overhead to the
file.  The offsets add 8 bytes per chunk, and the checksum adds a fixed
number of bytes to each chunk which depends on the checksum used; for
example, 4 bytes (32 bits) for the ``adler32`` checksum.  Also, any
bytes reserved at the end of each chunk (the default is not to reserve
any) will add further overhead to the final size.
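
To show how these pieces fit together, the following Python sketch
reads the compressed chunk number ``nchunk`` from an open superchunk
file by following the offsets section.  It assumes that offsets are
present (``bit 0`` of ``options``), that ``meta_size`` has been taken
from the Bloscpack header, and it ignores any per-chunk checksum; the
function is illustrative only, not BLZ code::

    import struct

    BLOSCPACK_HEADER_SIZE = 32  # constant overhead (see above)
    BLOSC_HEADER_SIZE = 16      # per-chunk Blosc header

    def read_compressed_chunk(f, nchunk, meta_size):
        # The offsets table follows the metadata section, which in
        # turn follows the 32-byte Bloscpack header.
        f.seek(BLOSCPACK_HEADER_SIZE + meta_size + 8 * nchunk)
        (offset,) = struct.unpack('<q', f.read(8))
        if offset == -1:
            raise ValueError("offset for chunk %d is unknown" % nchunk)
        # Seek straight to the chunk.  The last 4 bytes of its Blosc
        # header (ctbytes) give the total compressed size, header
        # included.
        f.seek(offset)
        header = f.read(BLOSC_HEADER_SIZE)
        (ctbytes,) = struct.unpack('<I', header[12:16])
        return header + f.read(ctbytes - BLOSC_HEADER_SIZE)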

The `meta` files
----------------

Here there can be as many files as necessary.  The format of every file
will be JSON, so caution should be taken to ensure that all the
metadata can be serialized and deserialized in this format.

There could be three (or more, in the future) files:

The `sizes` file
~~~~~~~~~~~~~~~~

This contains the shape of the dataset, as well as its uncompressed
size (``nbytes``) and compressed size (``cbytes``).  For example::

    $ cat meta/sizes
    {"shape": [10000000], "nbytes": 80000000, "cbytes": 17316745}

The `storage` file
~~~~~~~~~~~~~~~~~~

Here comes the information about the data type, defaults and how data
is being stored.  Example::

    $ cat myarray/meta/storage
    {"dtype": "float64", "cparams": {"shuffle": true, "clevel": 5}, "chunklen": 16384, "dflt": 0.0, "expectedlen": 10000000}

The `attributes` file
~~~~~~~~~~~~~~~~~~~~~

This file carries additional user information.  Example::

    $ cat myarray/meta/attributes
    {"temperature": 11.4, "scale": "Celsius", "coords": {"lat": 40.1, "lon": 0.5}}
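
Since all the `meta` files are plain JSON, the metainformation of a
dataset can be recovered with the standard library alone.  A minimal
sketch follows (the ``read_meta`` helper is hypothetical, not part of
the BLZ API)::

    import json
    import os.path

    def read_meta(root):
        # Load whichever meta files are present under root/meta.
        meta = {}
        for name in ("sizes", "storage", "attributes"):
            path = os.path.join(root, "meta", name)
            if os.path.exists(path):  # e.g. `attributes` may be absent
                with open(path) as f:
                    meta[name] = json.load(f)
        return meta

    # e.g. read_meta("myarray")["sizes"]["shape"]  ->  [10000000]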