
Re: bcolz for Debian Science



On 14.08.2016 20:59, Daniel Stender wrote:
> Hi,
> 
> I've just pushed a new package of bcolz (https://github.com/Blosc/bcolz) to a new
> team repo (packages/bcolz.git), the ITP is https://bugs.debian.org/831408.
> 
> bcolz is a chunked, compressed data container built on top of NumPy, which can be
> used on disk as well as in memory. For compression it uses Blosc (c-blosc). I would
> say it belongs in the Numerical Computation task, where NumPy is.
> 
> The package builds and works fine, but it can't be uploaded right now because, as a
> temporary measure (#834318), it builds against the vendored blosc and its bundled
> sublibraries (so, for example, lots of copyright entries are still missing).
> 
> Best,
> DS
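
For anyone unfamiliar with the idea of a "chunked compressed data container", here is a
rough sketch in plain Python. It is not how bcolz is implemented: zlib stands in for Blosc,
and the ChunkedInts class with its attributes is made up for this illustration (nbytes and
cbytes only mirror the labels bcolz prints). The point is that values are compressed chunk
by chunk, so reading one item only requires decompressing the chunk that holds it.

```python
# Illustration only, NOT bcolz's implementation. zlib stands in for Blosc,
# and ChunkedInts is a hypothetical name invented for this sketch.
import zlib
from array import array


class ChunkedInts:
    """Append-only container that keeps 64-bit ints in compressed chunks."""

    def __init__(self, chunklen=65536, level=6):
        self.chunklen = chunklen
        self.level = level
        self.chunks = []         # compressed full chunks
        self.tail = array("q")   # not yet compressed leftover values
        self.length = 0

    def extend(self, values):
        for v in values:
            self.tail.append(v)
            self.length += 1
            if len(self.tail) == self.chunklen:
                # The chunk is full: compress it and start a new tail.
                self.chunks.append(zlib.compress(self.tail.tobytes(), self.level))
                self.tail = array("q")

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        chunk_no, offset = divmod(i, self.chunklen)
        if chunk_no == len(self.chunks):
            return self.tail[offset]
        # Only the chunk holding item i is decompressed.
        chunk = array("q")
        chunk.frombytes(zlib.decompress(self.chunks[chunk_no]))
        return chunk[offset]

    @property
    def nbytes(self):
        """Size of the data if it were stored uncompressed."""
        return self.length * self.tail.itemsize

    @property
    def cbytes(self):
        """Size actually held: compressed chunks plus the raw tail."""
        return sum(len(c) for c in self.chunks) + len(self.tail) * self.tail.itemsize


squares = ChunkedInts(chunklen=4096)
squares.extend(i ** 2 for i in range(100000))
print(len(squares), squares[99999], squares.nbytes, squares.cbytes)
```

A real container like bcolz adds typed columns, on-disk storage and a much faster
compressor on top of this basic layout.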

If you want to take this for a test drive, here's an example of an in-memory operation (adapted
from the presentation "Open source tools for financial time series analysis" by Yves Hilpisch at
PyData London 2015 [1]):

<cut>
In [1]: import bcolz

In [2]: N = 100000 * 100

In [4]: %%time
   ...: ct = bcolz.fromiter(((i, i ** 2) for i in range(N)), dtype="i4, i8", count=N, cparams=bcolz.cparams(clevel=9))
   ...: 
CPU times: user 5.48 s, sys: 20 ms, total: 5.5 s
Wall time: 5.5 s

In [5]: ct
Out[5]: 
ctable((10000000,), [('f0', '<i4'), ('f1', '<i8')])
  nbytes: 114.44 MB; cbytes: 10.86 MB; ratio: 10.54
  cparams := cparams(clevel=9, shuffle=1, cname='lz4', quantize=0)
[(0, 0) (1, 1) (2, 4) ..., (9999997, 99999940000009)
 (9999998, 99999960000004) (9999999, 99999980000001)]

In [6]: %time ct.eval('f0 ** 2 + sqrt(f1)')
CPU times: user 328 ms, sys: 20 ms, total: 348 ms
Wall time: 164 ms
Out[6]: 
carray((10000000,), float64)
  nbytes := 76.29 MB; cbytes := 30.57 MB; ratio: 2.50
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
  chunklen := 65536; chunksize: 524288; blocksize: 32768
[  0.00000000e+00   2.00000000e+00   6.00000000e+00 ...,   2.26447238e+08
   2.46447234e+08   2.66447232e+08]

In [7]: bcolz.print_versions()
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
bcolz version:     1.1.0
NumPy version:     1.11.1rc1
Blosc version:     1.9.2 ($Date:: 2016-06-08 #$)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Numexpr version:   2.6.0
Dask version:      not available (version >= 0.9.0 not detected)
Python version:    3.5.2 (default, Jul  5 2016, 11:33:36) 
[GCC 5.4.0 20160609]
Platform:          linux-x86_64
Byte-ordering:     little
Detected cores:    4
</cut>

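The compression ratio (about 10.5 for the ctable at clevel=9) and the evaluation time (164 ms
wall time over 10 million rows) really aren't bad for a Core i5 notebook. Dask, which isn't
available on this system yet, has an ITP (https://bugs.debian.org/817777).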
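
The sizes in the output above can be cross-checked with a few lines of Python. This is only
a sanity check of the arithmetic, assuming that the "MB" bcolz prints means 2**20 bytes
(the numbers only match under that assumption); the variable names are mine.

```python
# Sanity check of the sizes shown in the transcript.
# Assumption: "MB" in the bcolz output means 2**20 bytes.
MIB = 1024 ** 2
rows = 100000 * 100                  # N from the session

ctable_nbytes = rows * (4 + 8)       # one i4 and one i8 column
result_nbytes = rows * 8             # float64 result of ct.eval()
chunksize = 65536 * 8                # chunklen times float64 item size

print(round(ctable_nbytes / MIB, 2))   # 114.44
print(round(result_nbytes / MIB, 2))   # 76.29
print(chunksize)                       # 524288
print(round(114.44 / 10.86, 2))        # 10.54
```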

DS

[1] https://www.youtube.com/watch?v=isDJVYF2F54

-- 
4096R/DF5182C8
http://www.danielstender.com/blog/

