Re: bcolz for Debian Science
On 14.08.2016 20:59, Daniel Stender wrote:
> Hi,
>
> I've just pushed a new package of bcolz (https://github.com/Blosc/bcolz) to a new
> team repo (packages/bcolz.git), the ITP is https://bugs.debian.org/831408.
>
> bcolz is a chunked, compressed data container built on top of NumPy, which can be
> used on disk as well as in memory. For compression it uses blosc (c-blosc). I would
> say it belongs in the Numerical Computation task, where NumPy is.
>
> The package builds and works all right but can't be uploaded right now because it
> temporarily (#834318) builds against the vendored blosc and its sublibraries (so, among
> other things, lots of copyright entries are missing).
>
> Best,
> DS
If you want to test-drive this, here's an example of an in-memory operation (adapted from the
presentation "Open source tools for financial time series analysis" by Yves Hilpisch at PyData London 2015 [1]):
<cut>
In [1]: import bcolz
In [2]: N = 100000 * 100
In [4]: %%time
...: ct = bcolz.fromiter(((i, i ** 2) for i in range(N)), dtype="i4, i8", count=N, cparams=bcolz.cparams(clevel=9))
...:
CPU times: user 5.48 s, sys: 20 ms, total: 5.5 s
Wall time: 5.5 s
In [5]: ct
Out[5]:
ctable((10000000,), [('f0', '<i4'), ('f1', '<i8')])
nbytes: 114.44 MB; cbytes: 10.86 MB; ratio: 10.54
cparams := cparams(clevel=9, shuffle=1, cname='lz4', quantize=0)
[(0, 0) (1, 1) (2, 4) ..., (9999997, 99999940000009)
(9999998, 99999960000004) (9999999, 99999980000001)]
In [6]: %time ct.eval('f0 ** 2 + sqrt(f1)')
CPU times: user 328 ms, sys: 20 ms, total: 348 ms
Wall time: 164 ms
Out[6]:
carray((10000000,), float64)
nbytes := 76.29 MB; cbytes := 30.57 MB; ratio: 2.50
cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
chunklen := 65536; chunksize: 524288; blocksize: 32768
[ 0.00000000e+00 2.00000000e+00 6.00000000e+00 ..., 2.26447238e+08
2.46447234e+08 2.66447232e+08]
In [7]: bcolz.print_versions()
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
bcolz version: 1.1.0
NumPy version: 1.11.1rc1
Blosc version: 1.9.2 ($Date:: 2016-06-08 #$)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Numexpr version: 2.6.0
Dask version: not available (version >= 0.9.0 not detected)
Python version: 3.5.2 (default, Jul 5 2016, 11:33:36)
[GCC 5.4.0 20160609]
Platform: linux-x86_64
Byte-ordering: little
Detected cores: 4
</cut>
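A note on Out[6], in case the trailing values look too small: i ** 2 + i for i = 9999999 is
about 1e14, not 2.66e8. The printed values match what one gets if f0 ** 2 is evaluated in
32-bit integer arithmetic (f0 is i4) and wraps around before the float64 sqrt(f1) term is
added. The NumPy-only sketch below reproduces the three trailing values under that
assumption. It does not use bcolz or numexpr, and that this is what happens inside eval()
is an inference from the numbers, not something verified in the numexpr sources. The names
tail, wrapped, f1 and result are made up for illustration.

```python
import numpy as np

# Last three values of f0, stored as 32-bit integers like the 'f0' column.
tail = np.array([9999997, 9999998, 9999999], dtype=np.int32)

# int32 * int32 stays int32 in NumPy and wraps silently on overflow.
wrapped = tail * tail

# f1 holds the exact squares as 64-bit integers; sqrt is taken in float64.
f1 = tail.astype(np.int64) ** 2
result = wrapped.astype(np.float64) + np.sqrt(f1.astype(np.float64))

print(result)
```

The three numbers printed are 226447238, 246447234 and 266447232, the same as the tail of
Out[6].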
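Compression ratios and access times really aren't bad (on a Core i5 notebook): the ctable
compresses by a factor of about 10.5 at clevel=9, and the eval over 10 million rows takes
164 ms wall time. Dask, reported as not available above, has an ITP filed
(https://bugs.debian.org/817777).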
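For anyone who wants to check the nbytes and ratio figures in the session above, here is a
small sketch of the arithmetic using NumPy only. It assumes that the "MB" printed by bcolz
means 2**20 bytes, which is what the numbers are consistent with. The variable names are
made up for illustration.

```python
import numpy as np

MIB = 1024 ** 2                 # assumption: bcolz's "MB" is 2**20 bytes
n_rows = 100000 * 100           # N from In [2]

# One ctable row is an int32 plus an int64, i.e. 12 bytes uncompressed.
row_dtype = np.dtype("i4, i8")
ctable_mib = n_rows * row_dtype.itemsize / MIB

# The eval() result is a float64 carray, 8 bytes per element.
carray_mib = n_rows * np.dtype("f8").itemsize / MIB

# Ratios recomputed from the rounded nbytes/cbytes figures in the output.
ctable_ratio = 114.44 / 10.86
carray_ratio = 76.29 / 30.57

print(round(ctable_mib, 2), round(carray_mib, 2))
print(round(ctable_ratio, 2), round(carray_ratio, 2))
```

This gives 114.44 and 76.29 for the uncompressed sizes and 10.54 and 2.5 for the ratios,
in line with Out[5] and Out[6].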
DS
[1] https://www.youtube.com/watch?v=isDJVYF2F54
--
4096R/DF5182C8
http://www.danielstender.com/blog/