
On adding size info to Packages files [very long]



Hello everyone,

I've been working on the 'du -S' stuff over the last week (approx),
and I think it's time I 'went public' with it.  I'm afraid there's a
lot of stuff here; I thought it worth presenting the supporting data to
back up my arguments...  Put it down to practice for the PhD thesis if
you want.  ;-)

I've written a few simple tools to help analyse things.  They are included
in the uuencoded gzipped tar at the end of this message.

Here are the sizes of hamm's Packages files:

compressed  uncompr. ratio uncompressed_name
    21390     77711  72.4% Packages.hamm.contrib.orig
   332143   1108583  70.0% Packages.hamm.main.orig
    58041    184713  68.5% Packages.hamm.non-free.orig

Now, if we run ./gen-du on each, which downloads each package, extracts
its contents, runs 'du -S' on it, and adds the output as a 'Du:' entry
in the Packages file, we get these file sizes:

compressed  uncompr. ratio uncompressed_name
   130250    871106  85.0% Packages.hamm.contrib.du
   409086   1482499  72.4% Packages.hamm.main.du
    70646    243962  71.0% Packages.hamm.non-free.du

Here's a sample of the output:

| Package: stow
| Version: 1.3.2-9
[...]
| installed-size: 140
| Du: 1	usr
|  16	usr/bin
|  1	usr/doc
|  13	usr/doc/stow
|  74	usr/doc/stow/html
|  19	usr/info
|  1	usr/lib
|  2	usr/lib/menu
|  1	usr/man
|  7	usr/man/man8
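
As a rough illustration of what produces a 'Du:' entry like the one above,
here is a minimal Python sketch. It is not the ./gen-du script from the
attached tarball; the function names are made up for this example, and it
assumes each file simply rounds up to whole 1 KiB blocks, whereas real
'du -S' reports allocated disk blocks and counts the directory entries too.

```python
import os

def du_separate(root):
    """Return (kilobytes, relative_dir) pairs, one per directory under root.

    Like 'du -S', each directory's figure excludes its subdirectories.
    Assumption: every file is rounded up to whole 1 KiB blocks.
    """
    rows = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        if rel == ".":
            continue  # the extraction root itself is not listed
        kib = 0
        for name in filenames:
            size = os.lstat(os.path.join(dirpath, name)).st_size
            kib += (size + 1023) // 1024
        rows.append((kib, rel.replace(os.sep, "/")))
    return sorted(rows, key=lambda row: row[1])

def format_du_field(rows):
    """Render rows as a 'Du:' field; continuation lines start with a space."""
    lines = ["%d\t%s" % row for row in rows]
    return "Du: " + "\n ".join(lines)
```

Calling format_du_field(du_separate(path)) on an unpacked package tree gives
text that can be appended to that package's stanza.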

A couple of points.  Firstly, the size of the 'contrib' section's
Packages.du file is large because of the picon-* packages.  These packages
contain a very deep directory hierarchy with a few small files in each
leaf directory.  That gives us a very large 'du' output, even wrt the
size of the package, and the packages are large anyway.  We might want
to prune some of those directories, despite the inelegance of doing so.
'./fnfilter' does that.

Secondly, while the 'du' info will gzip down quite small, it will be
ungzipped on the user's machine and stuffed into the 'available' file,
which is not stored compressed.  Since this thing is intended to help
those with smaller disks (if you have plenty of space, you don't need an
exact tally of the space which will be used), nearly doubling the size
of the available file isn't too clever, IMO.  (Although, since we're already
wasting a lot of space in /var/lib/dpkg/info by keeping lots of small
files there, this may not be considered important.  The machines I use
normally have only a couple of meg or so free on /.)

    1363984  /var/lib/dpkg/available  (ordinary Packages files)
    2587131  /var/lib/dpkg/available  (Packages files with 'Du:' entries)

Therefore, I've also tried a few different forms of compression for
the 'Du:' entry.  The script ./package-component-filter will take a
Packages file and filter all instances of a given component through a
given command.  E.g.

  ./package-component-filter du 'gzip -9n | uuencode -m -' \
					     Packages.hamm.contrib.du

gives output looking like this:

| Package: stow
| Version: 1.3.2-9
[...]
| installed-size: 140
| Du: begin-base64 644 -
|  H4sIAAAAAAACAzPkLC0u4jI0A1H6SZl5XIZgVkp+MpehMYypX1ySX85lboLC
|  188oyc3hMrQEC2bmpeVDteZkJnEZwVj6ual5pVCJ3MQ8LnMYC4QtuACLQ28e
|  fgAAAA==
|  ====
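
The idea of filtering one field of every stanza can be sketched in Python as
follows. This is only an illustration, not the ./package-component-filter
script itself: it takes a Python function instead of a shell command, and
gzip_base64 is a hypothetical stand-in for 'gzip -9n | uuencode -m' that
emits bare base64 without the begin/end framing lines. The mtime argument
to gzip.compress needs Python 3.8 or later.

```python
import base64
import gzip

def filter_component(packages_text, field, transform):
    """Apply transform() to the value of `field` in every stanza.

    A field starts at 'Name:' and continues over following lines that begin
    with a space or tab.  Field names are matched case-insensitively and the
    original spelling of the name is kept in the output.
    """
    out = []
    prefix = field.lower() + ":"
    state = {"name": None, "value": None}  # the field currently being collected

    def flush():
        if state["value"] is not None:
            new = transform("\n".join(state["value"])).split("\n")
            out.append("%s: %s" % (state["name"], new[0]))
            out.extend(" " + line for line in new[1:])
            state["value"] = None

    for line in packages_text.split("\n"):
        if state["value"] is not None and line[:1] in (" ", "\t"):
            state["value"].append(line[1:])  # continuation of the target field
            continue
        flush()
        if line.lower().startswith(prefix):
            state["name"] = line[:len(prefix) - 1]
            state["value"] = [line[len(prefix):].lstrip()]
        else:
            out.append(line)
    flush()
    return "\n".join(out)

def gzip_base64(text):
    """Compress with gzip (no timestamp) and wrap the base64 at 60 columns."""
    raw = gzip.compress(text.encode("utf-8"), compresslevel=9, mtime=0)
    encoded = base64.b64encode(raw).decode("ascii")
    return "\n".join(encoded[i:i + 60] for i in range(0, len(encoded), 60))

def gunzip_base64(text):
    """Inverse of gzip_base64."""
    return gzip.decompress(base64.b64decode(text.replace("\n", ""))).decode("utf-8")
```

For example, filter_component(text, "du", gzip_base64) encodes every 'Du:'
entry, and running the result through gunzip_base64 the same way restores
the original text.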

I looked at several possibilities (percentage expansion is relative to
the original Packages files):

 - Leave the 'Du' entry unencoded:
   Human-readable, very compressible, but takes up a huge amount of
   space in the available file.
   Overall expansion: 89% uncompressed, 48% compressed

 - Encode using 'sort +1 | ./myfrcode':
   Still fairly human-readable, quite a lot shorter, and even compresses
   a bit better.
   Overall expansion: 35% uncompressed, 37% compressed

   Alternatively, using 'sort +1 | ./myfrcode2':
   Overall expansion: 40% uncompressed, 37% compressed

 - Encode with 'gzip -9n | uuencode -':
   Doesn't cut down the sizes much, except for the picon-* packages,
   where it makes a big difference.  Compresses very badly.  Completely
   unreadable.
   Overall expansion: 34% uncompressed, 76% compressed

 - Encode with './squish':
   Tiny impact on the 'available' file, but lousy compression.
   Completely unreadable.
   Overall expansion: 25% uncompressed, 63% compressed

compressed  uncompr. ratio uncompressed_name
   100981    318882  68.3% Packages.hamm.contrib.du.myfrcode
   394917   1318643  70.0% Packages.hamm.main.du.myfrcode
    68389    216575  68.4% Packages.hamm.non-free.du.myfrcode

   102745    393981  73.9% Packages.hamm.contrib.du.myfrcode2
   394318   1321474  70.1% Packages.hamm.main.du.myfrcode2
    68274    217545  68.6% Packages.hamm.non-free.du.myfrcode2

   142466    236685  39.8% Packages.hamm.contrib.du.uugz
   499540   1378849  63.7% Packages.hamm.main.du.uugz
    84137    226083  62.7% Packages.hamm.non-free.du.uugz

   113210    195584  42.1% Packages.hamm.contrib.du.squish
   479933   1314473  63.4% Packages.hamm.main.du.squish
    80862    216404  62.6% Packages.hamm.non-free.du.squish
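
For anyone wanting to check the 'overall expansion' figures against the
tables, the arithmetic can be reproduced with a few lines of Python. The
sizes are copied from this message; the names SIZES, totals and expansion
are just for this sketch. Percentages are truncated rather than rounded,
since that is what matches the figures quoted above.

```python
# (compressed, uncompressed) byte counts per encoding, copied from the tables
# in this message, in the order contrib, main, non-free.
SIZES = {
    "orig":      [(21390, 77711), (332143, 1108583), (58041, 184713)],
    "du":        [(130250, 871106), (409086, 1482499), (70646, 243962)],
    "myfrcode":  [(100981, 318882), (394917, 1318643), (68389, 216575)],
    "myfrcode2": [(102745, 393981), (394318, 1321474), (68274, 217545)],
    "uugz":      [(142466, 236685), (499540, 1378849), (84137, 226083)],
    "squish":    [(113210, 195584), (479933, 1314473), (80862, 216404)],
}

def totals(name):
    """Sum the three sections: returns (compressed, uncompressed)."""
    comp = sum(c for c, _ in SIZES[name])
    uncomp = sum(u for _, u in SIZES[name])
    return comp, uncomp

def expansion(name):
    """Growth over the original files as (uncompressed %, compressed %)."""
    base_c, base_u = totals("orig")
    c, u = totals(name)
    return ((u - base_u) * 100 // base_u, (c - base_c) * 100 // base_c)

for name in ("du", "myfrcode", "myfrcode2", "uugz", "squish"):
    print("%-10s %d%% uncompressed, %d%% compressed" % ((name,) + expansion(name)))
```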

(Note: ./myfrcode and ./myfrcode2 require their input sorted in a
particular way, if their output is to be properly decompressible.
It's quite possible to do that, but I haven't got round to doing it yet.
'sort +1' is a close approximation, but it's not right.  The data I
generated will not extract correctly, but should demonstrate the right
compression characteristics.)
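
To make the front-coding idea concrete, here is a generic Python sketch. It
is not the format used by ./myfrcode or ./myfrcode2, which I am not
reproducing here. Each path is stored as the number of leading characters
it shares with the previous path plus the remaining suffix, so the more
closely the sort order groups related paths, the shorter the suffixes get.
The function names are invented for this example.

```python
def front_encode(paths):
    """Front-code a list of strings as (shared_prefix_length, suffix) pairs."""
    encoded = []
    prev = ""
    for path in paths:
        n = 0
        limit = min(len(prev), len(path))
        while n < limit and prev[n] == path[n]:
            n += 1  # count characters shared with the previous entry
        encoded.append((n, path[n:]))
        prev = path
    return encoded

def front_decode(encoded):
    """Rebuild the original list from (shared_prefix_length, suffix) pairs."""
    paths = []
    prev = ""
    for n, suffix in encoded:
        prev = prev[:n] + suffix
        paths.append(prev)
    return paths

# The directory names from the stow sample earlier in this message.
STOW_DIRS = [
    "usr", "usr/bin", "usr/doc", "usr/doc/stow", "usr/doc/stow/html",
    "usr/info", "usr/lib", "usr/lib/menu", "usr/man", "usr/man/man8",
]

for shared, suffix in front_encode(STOW_DIRS):
    print(shared, suffix)
```

On the stow list the ten paths shrink to suffixes such as '/bin', 'doc' and
'/stow', each preceded by a small integer.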

Summarising:

If network bandwidth is the only factor, we should do some form of
front-coding on the du entries (preferably a non-broken form), because
this gives the smallest compressed size.

If the size of the 'available' file is the only factor, we could use
'./squish', because this gives the smallest uncompressed size.

On balance, I'd say that front-coding gives a reasonable compromise
between network bandwidth, size of available file, and human-readability.
If human-readability is more important than size of 'available' file, then
we should simply put the un-encoded 'du' output into the Packages files.

--

For some idea of what front-coding is, see:

  info '(find)New Database Format'

--

Please feel free to comment on any or all of this on the lists, but
note that, due to the arrangements by which I get Debian-related email,
there is likely to be a day or two's delay before I can reply.

--
Charles Briscoe-Smith
White pages entry, with PGP key: <URL:http://alethea.ukc.ac.uk/wp?95cpb4>
PGP public keyprint: 74 68 AB 2E 1C 60 22 94  B8 21 2D 01 DE 66 13 E2

begin 644 scripts.tar.gz
M'XL(```````"`^U:6W?;N!'VJ_DK)HH:2K))D-2M:X5NW>ZE>6K.[MGV(;)K
MB@1%'HN@RHLOB;V_O3,@*=%*LNM-&SNGP>=C$QP,!B"!;S!#>,F%$91[GQ6V
M94W'8]B#B66-;+S">#AVZ"KAV),IP&0\'(ZGDQ$J@FW9H]$>6'N/@#(OO`Q@
MSU\O1K^FYY?+O?]#/'\&;!$+ED?:<_@N6?``]*`$XR<=TK)8EP7$HDCAM>=?
M>$N>0QBO>*YI7N9'\27_U\++N1L5Q?J(L5+$UV;$1>Z9GF^6%RR)LRS-<A;P
M1>P)+8CSPJ4_.8N\)&%^*HHL7E#W7G9CQ,,_3C0MY(4?]?KP3@-X#C_GV.D1
M2"'D:<+9VBLBEHK:MHFF(;D17L*IB&V$<,]I4"2"KGV.HBP!(X2N$%B^6O("
M#'$)W?83L*Z-=<DE*8%9\&1-CVEVNY6T+8&NH]W5PX0N/0UKWHVY?`NMLK9\
M&Z_!",)[0G^]?9<;>9K%2TV[BK`+*'B.(]Q6@N%!YSSB7@"&O9&>=\"%3F<&
M08IC++QX!0?.MLWQUK;@5]53M"6;&RU(!=>THW:+H/S86.KN<EHD[*S+#KN!
M#B_;W191G!OKZK[1M`])=U?SH0,$"(7;.5]F'%]F#&??X\!H<H_@9;LSN*TZ
MRYDY.((!8_IYA]I6\Q0*H#FLUP@MB"R$+$T+O`O6%TN#UI%QO5%JZE"SU:[G
M![("7P,Q!,Y7.#LGYWV8:^3'<`1I5L"!O;GO<3]*<;F!_FUYI,.L&>(9`Z;W
MX?B]]W5_9+Y7W%.@!NU9`I#V=Z2TRC]Y)3UD+?WJ9,GU5"VJBG<;._<>=4_A
MRT`](8:?)FN<-5$8Z.@*GCWB_@_3B=7L_Z@WPOU_.'%LM?\_UOY?YIF,`=8\
M6V$4\+$E`88!=8D<<``;A1S2\'Z(H*&=O_S\PQ'Z`YX+O8",^^E2Q#G'>`+?
MN/"Y;%1$?-<8>2_TV=YZS;T,S00QMBU6-^"%U+6W-15L;'W$E*EIW5O7GN%E
M:]Z%/(K#`OY\\N,/_YAIW766+C,OV9776V#OY3$%(_NEP+@GAU["SGI;8T=]
M^%//'/19+)7VUQE&2S,L"'Y-USNM$F$D@A;WTS47\/H0.K=-K[?M':$S:]1?
M0Z?KS`4)[@]C?^6A4Z\'@V,!V?NLZ5HVM*N&V+>_2O&%OV[,SC!RV9W_4/SO
M^?Z[^6]/-ORWQS;%_^/11/'_:?B/E/L;S[B>(Q'Q"E<<`NZMX"HN(B09K:@0
M]_QU0_>*KI%WR2$J,4@1)68168YFD)05>3&^Q4@!X)^\Y0<HG(<+D5X)0#(L
M5CQI:X,G`O)%62DX43O!YB?5;0!%QCE&1P(6O.T-%N@CT/L('&R'F1USE\,)
M([;,BX8SO6X>O^6'&/6CV^B[O:Z-9:>/-5("[B^0N_1R5IBEK&/,5U@O2!,/
MO<XMACWY;8EY1GQ-ERSOL]Z;,W:*ELW!3J.NS;H.,UTR7*2%MWHG[=\=N'(`
M,YE0I!GW*%25/?=D)'G!;W+X@VPAQU]1O'//QKR05TGYNT\(JY*;,//3@#\E
M_X>.4_%_8HTF]DCR?ZCV_Z?:_U\)2OOS*"U7`3&,UB*R*Q68ON!RIOUU529B
MAUV8%2`#+..;TX,MQ3!GNL<QW&2W+`/HQ@(I+X4-X8K,9?A356]9:`Z8ZY*L
MX0!9Q;4O#?RW'/B:T?#?>4K^#^V&_Q-G:A'_)_90\?_KX;^L;4MJ1^#ZP>]Q
M!,H-?`+_2U%Y`-/_C/R?C$8?Y;]MC\9;_E,L8-O#H>+_HX`-H+T$*,<O19#*
M;)J'(8;C,D^OZF'`,+&/A;\J\>9E7@1Q:D;']T1(RR7)-*(GQ<H]*GC^(?@1
MONB!=_GFM*]A-$MB/\*86,H79?C&MIS1*>7)=&.=NOK<TK<9<,^/W"5]SO6R
M7K__S/WN[]]7*7%E=Y7Z+K:C5)@LIV'H6F2K:>Y'S]!@H<.+%T#E3?-]]';2
MJ!]1Z$_VMOV0X$[;T=D:W1W3ASN@L>#OP+8._,C0Y4.AU:T9K#2,8ZM6KX7T
M1&2+KFB9/AC+ZLU8J.+@H!KSG?9>A6RRV]/[`Q8/>".U03E;LJMM#[/ZKIFL
M35NRW)>?(90O_H*1_[N,\^A)SW]A.*[\_PC3O_%P0O&?,YHJ___HY[\GRV7&
M\SR^Y/(;JBQCV!>F&1T*-R?"A_7Y4E[$JQ4LN>"95Z#7AY.?_OKJ%7WWD6JF
MIM4?-^<8#<[[^]6%S9W]N<UTF&NW5>A)'TK0@W@%9_4V<UN=0AG^@>U(O>HH
M]1O1KK%E35ER(=L8"5S?O&T4#APL56>&EZ"?N8BNKOS0!Y:_^/P>X+?X/W(F
M3?PWMFU'?O^9*OX_/O]_Y)<\RWDK^*/8KUH@&S:;`V`U?:OS;7W!E[$PZ!\I
M)B.:8V`!OV08':(CT&=TC#VK3JIUXJ'>;[4U!'@+/R#PD,B;R7LL8SPQG,X=
M>SBW;&MN6?3KX.\0+988:Q+I=RSI^"--4$'&([7C""I%DS6![O:_!3[HF[2O
MB?^?._O[S?P/1L/Z^R^J3>PI\7\Z5M]_'X?_[V=S#\_<6HG6![*X!^=ME#:@
MF0.WE\=+.M^AFG[5Q3[E%B_(+FJ<D@"3TW<?2M.:I*37;GT'K?0/>VYR'?(.
M_4WN8M6)2][#CE32HJ"@H*"@H*"@H*"@H*"@H*"@H*"@H*"@H*"@H*"@H*"@
0H*"@H/#EXS_\]H]R`%``````
`
end

