
Re: [RFC] An rsync friendly gzip



Dylan Thurston wrote:

[..snip..]

> For modifying gzip or zlib, the key functions are ct_tally and
> _tr_tally, respectively.  For gzip (version 1.2.4), look at trees.c,
> lines 958-1006; specifically, replace the code
> 
>     /* Try to guess if it is profitable to stop the current block here */
>     if (level > 2 && (last_lit & 0xfff) == 0) {
>         ...
>     }
>     return (last_lit == LIT_BUFSIZE-1 || last_dist == DIST_BUFSIZE);
>     /* We avoid equality with LIT_BUFSIZE because of wraparound at 64K
>      * on 16 bit machines and because stored blocks are restricted to
>      * 64K-1 bytes.
>      */
> 
> with an appropriate version of your own.  The analogous code is at line
> 1044 of trees.c in zlib-1.1.3.

Thanks for that, although it might be a bit more complicated than that.
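
To make the shape of the change concrete, here is roughly the kind
of cut decision I have in mind in place of that fixed-interval
heuristic. This is only a sketch: rsync_boundary(), the window size
and the residue test are my own names and choices, nothing from
gzip's source.

    /* Sketch: a content-defined cut decision.  Keep a running sum
     * of the last RSYNC_WIN input bytes and end the block whenever
     * the sum hits a fixed residue.  Names and window size are
     * illustrative, not gzip's. */
    #define RSYNC_WIN 4096

    static unsigned char rsync_win[RSYNC_WIN]; /* sliding window  */
    static unsigned long rsync_sum;            /* sum over window */
    static unsigned long rsync_n;              /* bytes seen      */
    static unsigned rsync_pos;

    /* Feed one input byte; return 1 when the block should end. */
    static int rsync_boundary(unsigned char c)
    {
        rsync_sum += c;
        rsync_sum -= rsync_win[rsync_pos];
        rsync_win[rsync_pos] = c;
        rsync_pos = (rsync_pos + 1) % RSYNC_WIN;
        rsync_n++;
        return rsync_n >= RSYNC_WIN && rsync_sum % RSYNC_WIN == 0;
    }

The condition depends only on the last RSYNC_WIN bytes of input,
never on counters like last_lit, which is the whole point below.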

> You also have to change the flushes so that they end up byte-aligned.
> The way to do this within the gzip file format is to output a 0-length
> literal block; unfortunately, this wastes 5 bytes.  It's a trivial (and
> otherwise useful) extension to the file format to avoid wasting the 5
> bytes (add a byte-aligned block type), but I think it's best to avoid that.

It does not help to simply cut off a block whenever you feel like
it and start a new one. The point at which you cut the block has to
depend on the data itself; that's why the rolling hash is there.
The idea is that even if you add a few bytes near the beginning,
later in the file you still want to be compressing the exact same
blocks.
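
A toy program shows that property in isolation (not the real patch;
WIN and the 0-mod-WIN test are arbitrary choices for the demo).
Cutting wherever the rolling sum of the last WIN bytes is 0 mod WIN,
inserting one byte at the front shifts every later cut by exactly
one position instead of changing them, because once the inserted
byte has left the window the sums coincide again:

    /* Demo: content-defined cut points resynchronise after an
     * insertion.  Prints the cut offsets of a random buffer and
     * of the same buffer with one byte prepended. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define WIN 64  /* tiny window so the effect shows on small input */

    static void cuts(const unsigned char *p, size_t len, const char *tag)
    {
        unsigned long sum = 0;
        size_t i;

        printf("%s:", tag);
        for (i = 0; i < len; i++) {
            sum += p[i];                     /* byte enters window */
            if (i >= WIN) sum -= p[i - WIN]; /* byte leaves window */
            if (i >= WIN && sum % WIN == 0)
                printf(" %lu", (unsigned long)i);
        }
        printf("\n");
    }

    int main(void)
    {
        unsigned char a[4096], b[4097];
        size_t i;

        srand(1);
        for (i = 0; i < sizeof a; i++)
            a[i] = rand() & 0xff;
        b[0] = 'X';                  /* insert one byte at the front */
        memcpy(b + 1, a, sizeof a);

        cuts(a, sizeof a, "original");
        cuts(b, sizeof b, "shifted ");  /* same cuts, one position later */
        return 0;
    }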

Also, when you cut a block off, you have to do it in such a way
that it compresses the same way as it would have before. This
generally means that you have to clear the dictionary, which can
have an adverse effect on the compression ratio.

> Alternatively, you could use the published interface to zlib, telling
> it to flush appropriately.

Working on that.
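
For the record, a first cut might look like the following, assuming
zlib-1.1.3. Z_FULL_FLUSH both byte-aligns the output and resets the
dictionary, which is exactly the cost described above; Z_SYNC_FLUSH
would align without clearing it. Error handling is stripped, the
boundary test is the same rolling sum as before, and this writes a
raw zlib stream rather than a gzip file, which is enough to test
the blocking behaviour:

    /* Sketch: stdin -> deflate stream on stdout, with a full
     * flush at every content-defined boundary. */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    #define WIN 4096                       /* rolling-hash window */

    static unsigned char win[WIN];
    static unsigned long sum, n;
    static unsigned pos;

    static int boundary(unsigned char c)   /* same test as before */
    {
        sum += c;
        sum -= win[pos];
        win[pos] = c;
        pos = (pos + 1) % WIN;
        return ++n >= WIN && sum % WIN == 0;
    }

    int main(void)
    {
        z_stream zs;
        unsigned char in, out[16384];
        int c, flush;

        memset(&zs, 0, sizeof zs);         /* zalloc/zfree = Z_NULL */
        deflateInit(&zs, Z_DEFAULT_COMPRESSION);

        while ((c = getchar()) != EOF) {
            in = (unsigned char)c;
            flush = boundary(in) ? Z_FULL_FLUSH : Z_NO_FLUSH;
            zs.next_in = &in;
            zs.avail_in = 1;
            do {                           /* drain any output */
                zs.next_out = out;
                zs.avail_out = sizeof out;
                deflate(&zs, flush);
                fwrite(out, 1, sizeof out - zs.avail_out, stdout);
            } while (zs.avail_out == 0);
        }
        do {                               /* finish the stream */
            zs.next_out = out;
            zs.avail_out = sizeof out;
            deflate(&zs, Z_FINISH);
            fwrite(out, 1, sizeof out - zs.avail_out, stdout);
        } while (zs.avail_out == 0);

        deflateEnd(&zs);
        return 0;
    }

Feeding one byte at a time is slow, of course; a real version would
scan a buffer for the next boundary and hand deflate() whole runs.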

> What block size were you using in your tests?

With the rolling hash the block size varies somewhat, but the
average is about 27K. I'm not sure what the optimal size would be
for Debian packages.
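
(For calibration: if a cut fires whenever the rolling sum is 0
modulo M, then on typical data a cut occurs with probability
roughly 1/M per byte, so the gaps between cuts are roughly
geometric with mean M. An average block of about 27K would then
correspond to a modulus somewhere around 2^14 to 2^15.)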

> There was a thread a few weeks ago on deity where I talked about this
> with Jason Gunthorpe.  He was sceptical that there would be good
> savings for real .debs; have you tried it?  If not, I will.  I've been
> collecting the .debs I download for the past few weeks.

I must admit I haven't tried it on many files. It seems a lot of
files are much less than 27K, so it wouldn't make much difference.
It should be easy to repack them (ar -x ; gzip -d <control.tar.gz ...).
I'll run some tests today or tomorrow on some 'real' packages.

Martijn van Oosterhout
Australia


