
efficient mirroring: tje, mirroring only (binary) diffs



OK, here it is: 'tje', an attempt at more efficient mirroring of
the Debian (and other) archives.

The basic idea is that the `tjeserver' only sends (binary) diffs
of packages to the client, not whole files. For this purpose
I've written a small binary diff utility (I tried xdelta, but
it gave very bad results).

Another design goal of `tjeserver' was to make it as light
as possible (CPU-wise) for the server. 
Thus, the server will 
 - never calculate binary diffs (they are all stored)
 - never compress/uncompress files
 - never calculate md5sums (they are used heavily,
   but only stored ones).
Basically, the server just spits out data already on the hard disk.


To make the listing of directories more efficient, listings are
handled just like files (and thus if the md5 sum of only one file
changed, the server will send a diff of the directory listing
rather than the whole listing again).
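
As a rough illustration of such a listing, here is a minimal Python
sketch. The name ".tjelisting" is from this post; the line format
("f"/"d", md5, name) and the function names are assumptions made up
for the example, not the format tje actually uses. Note that this
sketch hashes files itself, which is the job of whoever prepares the
archive, not of the running server.

```python
import hashlib
import os

LISTING_NAME = ".tjelisting"  # name from the post; the line format below is assumed


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def build_listing(directory: str) -> bytes:
    """Return a hypothetical listing for `directory`, built depth first."""
    lines = []
    for name in sorted(os.listdir(directory)):
        if name == LISTING_NAME:
            continue  # a listing does not describe itself
        path = os.path.join(directory, name)
        if os.path.isdir(path):
            # A subdirectory is represented by the md5 of its own listing,
            # so a change deep in the tree changes every listing above it.
            digest = md5_hex(build_listing(path))
            lines.append(f"d {digest} {name}")
        else:
            with open(path, "rb") as fh:
                digest = md5_hex(fh.read())
            lines.append(f"f {digest} {name}")
    return ("\n".join(lines) + "\n").encode()
```

Because each subdirectory entry carries the md5 of the listing below
it, comparing the md5 of the two top-level listings is enough to tell
whether anything in the tree differs.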


To make this somewhat clearer, here is an example session.


Client: Hello, I've got "/.tjelisting", md5=872588436ed4bcf8ec
  (note, the ".tjelisting" file is basically a listing of all
   entries in a directory, with an md5 sum for each file and,
   for each subdir, the md5 of the .tjelisting in that subdir).
Client: I'd like to have "/.tjelisting", with label "latest"

Server: OK, just apply this patch to your /.tjelisting:
  <patch follows>

  Client receives patch, applies it to its own /.tjelisting, and
  compares the old and new .tjelisting files. Client notices
  that file "foo" and the ".tjelisting" in directory "sub"
  have changed, but the 50 other entries didn't. So:

Client: I've got "foo" with md5=872588436ed4b
Client: and "sub/.tjelisting" with md5=fe590983d23a9a1b43
Client: I'd like to have "foo" with md5=23dc9bf02ff0b46d2
Client: I'd like to have "sub/.tjelisting" with md5=02af6da05001f1918
  

Server: for "foo:23dc9bf02ff0b46d2", just apply the following patch
  <patch follows>
Server: And as for "sub/.tjelisting", I don't know the md5 sum you
Server: say you have. But I do know the file with the md5 sum you
Server: want, so here you've got the full file:
  <file follows>
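
The two decisions in this session (what the client asks for, and
whether the server answers with a patch or a full file) can be
sketched roughly as follows. This is only an illustration: the
one-"md5 name"-pair-per-line listing format, the function names and
the dictionaries standing in for the server's stored patches and
files are all assumptions, not tje's real data layout.

```python
def parse_listing(text):
    """Map entry name -> md5, assuming one 'md5 name' pair per line."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            digest, name = line.split(" ", 1)
            entries[name] = digest
    return entries


def changed_entries(old_text, new_text):
    """Return (name, have_md5, want_md5) for each new or changed entry.

    have_md5 is None when the client has no copy at all. Entries that
    disappeared from the new listing are not reported in this sketch.
    """
    old = parse_listing(old_text)
    new = parse_listing(new_text)
    return [
        (name, old.get(name), want)
        for name, want in sorted(new.items())
        if old.get(name) != want
    ]


def server_reply(stored_patches, stored_files, have, want):
    """Choose a reply using lookups only: nothing is diffed or hashed here.

    stored_patches maps (have_md5, want_md5) -> patch bytes,
    stored_files maps md5 -> full file bytes.
    """
    patch = stored_patches.get((have, want))
    if patch is not None:
        return ("patch", patch)
    # Unknown 'have' md5 (or no stored patch): fall back to the full file.
    return ("file", stored_files[want])
```

The client would call changed_entries() on its old and freshly patched
listing and send one "I've got / I'd like" pair per result; the server
side stays a pair of table lookups.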


One nice thing about this is that if no file at all changed on
the server, all that needs to be communicated is one md5sum
of the top-level .tjelisting, and that's it.


Notice that the "patches" I speak about are against "uncompressed"
(.deb) files. That is, I un-ar a .deb, then uncompress the *.gz 
files in there, and then put them together in one file again.
All diffs are against these. Naturally, the client later makes proper
.deb archives again.
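
For illustration, the "un-ar, then uncompress" step could look
roughly like this in Python. It is a sketch, not the code in tje:
the function names are made up, GNU-style long member names are not
handled, and since the way the pieces are put back into one file is
not described here, the sketch just returns the unpacked members.

```python
import gzip

AR_MAGIC = b"!<arch>\n"


def read_ar_members(blob: bytes):
    """Yield (name, data) for each member of a common-format ar archive."""
    if not blob.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    pos = len(AR_MAGIC)
    while pos + 60 <= len(blob):
        header = blob[pos:pos + 60]
        if header[58:60] != b"`\n":
            raise ValueError("bad ar member header")
        # Name field is 16 bytes, space padded; some ar variants append "/".
        name = header[0:16].decode("ascii").rstrip().rstrip("/")
        size = int(header[48:58].decode("ascii"))
        start = pos + 60
        yield name, blob[start:start + size]
        pos = start + size + (size % 2)  # member data is padded to an even length


def unpacked_members(deb_blob: bytes):
    """Return [(name, data)] with every *.gz member decompressed."""
    out = []
    for name, data in read_ar_members(deb_blob):
        if name.endswith(".gz"):
            name, data = name[:-3], gzip.decompress(data)
        out.append((name, data))
    return out
```

Diffing the decompressed tar data rather than the .gz streams is what
makes small diffs possible, since a small change in the input can
change most of the compressed bytes.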

Drawback: both ar and gzip record something like a timestamp of
when an .ar or .gz was created. Because of this, the .deb files
received will have different checksums than the original .debs.
However, the control.tar.gz and data.tar.gz files in the debs will
be identical. It should be possible to get around this in future
releases. Also note that it would be best if any signing of
packages happened (at least additionally) against the uncompressed
packages; that would also allow other compression methods.
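
The gzip half of the timestamp effect is easy to reproduce. The
small Python sketch below (the payload and the two timestamps are
arbitrary values chosen for the example) compresses the same data
twice with different header timestamps: the decompressed contents
match, the archives and their md5 sums do not.

```python
import gzip
import hashlib

payload = b"contents of data.tar" * 100

# Same input, two different header timestamps (seconds since the epoch).
first = gzip.compress(payload, mtime=1_000_000_000)
second = gzip.compress(payload, mtime=1_000_000_001)

# The payload survives unchanged ...
same_payload = gzip.decompress(first) == gzip.decompress(second)
# ... but the compressed files, and hence their checksums, differ.
same_archive = hashlib.md5(first).digest() == hashlib.md5(second).digest()

print("same payload:", same_payload)
print("same archive:", same_archive)
```

A checksum or signature taken over the uncompressed data is not
affected by this header field.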


Please see
  http://joostje.op.het.net/tje/index.html
for more info and the tje_0.0.deb package.

Note: Please CC me on your replies.
      I'm subscribed to debian-devel, but I don't read it regularly.
Note: this could also be used for the package pools etc.
      At the moment I've only tested this with .deb files, but
      it should work equally well with other formats.

Thanks,
-- 
joostje

