On adding size info to Packages files [very long]
Hello everyone,
I've been working on the 'du -S' stuff over the last week (approx),
and I think it's time I 'went public' with it. I'm afraid there's a
lot of stuff here; I thought it worth presenting the supporting data to
back up my arguments... Put it down to practice for the PhD thesis if
you want. ;-)
I've written a few simple tools to help analyse things. They are included
in the uuencoded gzipped tar at the end of this message.
Here are the sizes of hamm's Packages files:
compressed  uncompr. ratio uncompressed_name
     21390     77711 72.4% Packages.hamm.contrib.orig
    332143   1108583 70.0% Packages.hamm.main.orig
     58041    184713 68.5% Packages.hamm.non-free.orig
Now, if we run ./gen-du on each, which downloads each package, extracts
its contents, runs 'du -S' on it, and adds the output as a 'Du:' entry
in the Packages file, we get these file sizes:
compressed  uncompr. ratio uncompressed_name
    130250    871106 85.0% Packages.hamm.contrib.du
    409086   1482499 72.4% Packages.hamm.main.du
     70646    243962 71.0% Packages.hamm.non-free.du
Here's a sample of the output:
| Package: stow
| Version: 1.3.2-9
[...]
| installed-size: 140
| Du: 1 usr
| 16 usr/bin
| 1 usr/doc
| 13 usr/doc/stow
| 74 usr/doc/stow/html
| 19 usr/info
| 1 usr/lib
| 2 usr/lib/menu
| 1 usr/man
| 7 usr/man/man8
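As an illustration only (this is not the ./gen-du from the attached tarball), here is a minimal Python sketch of how a per-directory listing like the sample above could be produced from an unpacked package tree. The names `du_separate` and `format_du_field` are hypothetical, and the sizes are an approximation: each file's apparent size is rounded up to 1 KiB and each directory is charged 1 KiB for itself, whereas a real 'du -S' counts allocated disk blocks.

```python
import math
import os


def du_separate(root):
    """Return (kblocks, relative_dir) pairs, one per directory under root.

    Like 'du -S', each directory's figure excludes its subdirectories.
    Assumption: sizes come from apparent file sizes rounded up to 1 KiB,
    plus 1 KiB for the directory itself; real 'du' output will differ.
    """
    rows = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        if rel == ".":
            continue  # skip the extraction root itself
        kblocks = 1  # charge for the directory entry
        for name in filenames:
            size = os.lstat(os.path.join(dirpath, name)).st_size
            kblocks += math.ceil(size / 1024)
        rows.append((kblocks, rel))
    # Sort by path so parents come before their children.
    return sorted(rows, key=lambda row: row[1])


def format_du_field(rows):
    """Format rows as a multi-line 'Du:' field (continuation lines indented)."""
    lines = ["%d %s" % row for row in rows]
    return "Du: " + "\n ".join(lines)
```

Calling `format_du_field(du_separate("/tmp/unpacked"))` on an extracted package (the path is just an example) yields text of the same shape as the 'Du:' entry shown above.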
A couple of points. Firstly, the size of the 'contrib' section's
Packages.du file is large because of the picon-* packages. These packages
contain a very deep directory hierarchy with a few small files in each
leaf directory. That gives us a very large 'du' output, even relative to the
size of the package, and the packages are large anyway. We might want
to prune some of those directories, despite the inelegance of doing so.
'./fnfilter' does that.
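How ./fnfilter actually chooses what to prune is not described in this message, so the following Python sketch shows just one possible rule, under that stated assumption: every directory deeper than a chosen depth is folded into its ancestor at that depth, and the sizes are summed so the package total is preserved. The function name `prune_depth` is hypothetical.

```python
def prune_depth(rows, max_depth):
    """Fold directories deeper than max_depth into their ancestor.

    rows is a list of (kblocks, path) pairs as in a 'du -S' listing.
    Assumption: this depth-based rule is only an illustration; it is not
    necessarily what ./fnfilter does.
    """
    totals = {}
    for kblocks, path in rows:
        parts = path.split("/")
        key = "/".join(parts[:max_depth])  # ancestor at the cut-off depth
        totals[key] = totals.get(key, 0) + kblocks
    return [(size, path) for path, size in sorted(totals.items())]


rows = [
    (1, "usr"),
    (1, "usr/lib"),
    (1, "usr/lib/picons"),
    (2, "usr/lib/picons/a"),
    (3, "usr/lib/picons/a/b"),
    (4, "usr/lib/picons/c"),
]
for size, path in prune_depth(rows, 3):
    print(size, path)
```

The trade-off is the one mentioned above: the listing gets much shorter for deep trees, at the cost of no longer saying exactly where below the cut-off the space goes.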
Secondly, while the 'du' info will gzip down quite small, it will be
ungzipped on the user's machine and stuffed into the 'available' file,
which is not stored compressed. Since this thing is intended to help
those with smaller disks (if you have plenty of space, you don't need
an exact tally of the space which will be used), nearly doubling the
size of the available file isn't too clever, IMO. (Although, since we're
already wasting a lot of space in /var/lib/dpkg/info by keeping lots of
small files there, this may not be considered important. The machines I
use normally have only a couple of meg or so free on /.)
1363984 /var/lib/dpkg/available (ordinary Packages files)
2587131 /var/lib/dpkg/available (Packages files with 'Du:' entries)
Therefore, I've also tried a few different forms of compression for
the 'Du:' entry. The script ./package-component-filter will take a
Packages file and filter all instances of a given component through a
given command. E.g.
    ./package-component-filter du 'gzip -9n | uuencode -m -' \
        Packages.hamm.contrib.du
gives output looking like this:
| Package: stow
| Version: 1.3.2-9
[...]
| installed-size: 140
| Du: begin-base64 644 -
| H4sIAAAAAAACAzPkLC0u4jI0A1H6SZl5XIZgVkp+MpehMYypX1ySX85lboLC
| 188oyc3hMrQEC2bmpeVDteZkJnEZwVj6ual5pVCJ3MQ8LnMYC4QtuACLQ28e
| fgAAAA==
| ====
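To make the idea of filtering one component concrete, here is a rough Python sketch. It is not the ./package-component-filter script from the tarball: it takes a Python callable instead of a shell command, and the names `filter_component` and `gzip_base64` are hypothetical. It assumes field names match case-insensitively and that continuation lines start with a space or tab. The `gzip_base64` helper only approximates 'gzip -9n | uuencode -m' (it emits the base64 body without the begin/==== framing).

```python
import base64
import gzip


def filter_component(packages_text, field, transform):
    """Pass the value of every `field` entry through `transform`."""
    out = []
    lines = packages_text.split("\n")
    prefix = field.lower() + ":"
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.lower().startswith(prefix):
            name = line[:len(field)]  # keep the original capitalisation
            value = [line[len(prefix):].strip()]
            i += 1
            # Collect continuation lines (they begin with whitespace).
            while i < len(lines) and lines[i][:1] in (" ", "\t"):
                value.append(lines[i].strip())
                i += 1
            new_lines = transform("\n".join(value)).split("\n")
            out.append(name + ": " + new_lines[0])
            out.extend(" " + extra for extra in new_lines[1:])
        else:
            out.append(line)
            i += 1
    return "\n".join(out)


def gzip_base64(value):
    """Compress a field value and encode it as base64 text lines."""
    packed = gzip.compress(value.encode("utf-8"), compresslevel=9)
    return base64.encodebytes(packed).decode("ascii").rstrip("\n")
```

For example, `filter_component(text, "du", gzip_base64)` would rewrite every 'Du:' entry in `text` while leaving all other fields untouched.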
I looked at several possibilities (percentage expansion is relative to
the original Packages files):
- Leave the 'Du' entry unencoded:
    Human-readable, very compressible, but takes up a huge amount of
    space in the available file.
    Overall expansion: 89% uncompressed, 48% compressed
- Encode using 'sort +1 | ./myfrcode':
    Still fairly human-readable, quite a lot shorter, and even compresses
    a bit better.
    Overall expansion: 35% uncompressed, 37% compressed
  Alternatively, using 'sort +1 | ./myfrcode2':
    Overall expansion: 40% uncompressed, 37% compressed
- Encode with 'gzip -9n | uuencode -':
    Doesn't cut down the sizes much, except for the picon-* packages,
    where it makes a big difference. Compresses very badly. Completely
    unreadable.
    Overall expansion: 34% uncompressed, 76% compressed
- Encode with './squish':
    Tiny impact on the 'available' file, but lousy compression.
    Completely unreadable.
    Overall expansion: 25% uncompressed, 63% compressed
compressed  uncompr. ratio uncompressed_name
    100981    318882 68.3% Packages.hamm.contrib.du.myfrcode
    394917   1318643 70.0% Packages.hamm.main.du.myfrcode
     68389    216575 68.4% Packages.hamm.non-free.du.myfrcode
    102745    393981 73.9% Packages.hamm.contrib.du.myfrcode2
    394318   1321474 70.1% Packages.hamm.main.du.myfrcode2
     68274    217545 68.6% Packages.hamm.non-free.du.myfrcode2
    142466    236685 39.8% Packages.hamm.contrib.du.uugz
    499540   1378849 63.7% Packages.hamm.main.du.uugz
     84137    226083 62.7% Packages.hamm.non-free.du.uugz
    113210    195584 42.1% Packages.hamm.contrib.du.squish
    479933   1314473 63.4% Packages.hamm.main.du.squish
     80862    216404 62.6% Packages.hamm.non-free.du.squish
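The "overall expansion" percentages quoted for each encoding can be reproduced from the per-file sizes in the tables. The short Python sketch below sums the three sections (contrib, main, non-free) for each variant and compares the totals with the original Packages files. Truncating to whole percent is an assumption about how the quoted figures were rounded; it happens to match all of them.

```python
# (compressed, uncompressed) byte counts copied from the tables,
# in the order contrib, main, non-free.
SIZES = {
    "orig":      [(21390, 77711), (332143, 1108583), (58041, 184713)],
    "du":        [(130250, 871106), (409086, 1482499), (70646, 243962)],
    "myfrcode":  [(100981, 318882), (394917, 1318643), (68389, 216575)],
    "myfrcode2": [(102745, 393981), (394318, 1321474), (68274, 217545)],
    "uugz":      [(142466, 236685), (499540, 1378849), (84137, 226083)],
    "squish":    [(113210, 195584), (479933, 1314473), (80862, 216404)],
}


def totals(name):
    """Sum compressed and uncompressed sizes over the three sections."""
    compressed = sum(c for c, _ in SIZES[name])
    uncompressed = sum(u for _, u in SIZES[name])
    return compressed, uncompressed


def expansion(name):
    """Return (uncompressed %, compressed %) growth over the originals.

    Assumption: percentages are truncated, not rounded.
    """
    base_c, base_u = totals("orig")
    new_c, new_u = totals(name)
    return ((new_u - base_u) * 100 // base_u,
            (new_c - base_c) * 100 // base_c)


for name in ("du", "myfrcode", "myfrcode2", "uugz", "squish"):
    print("%-10s %d%% uncompressed, %d%% compressed"
          % ((name,) + expansion(name)))
```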
(Note: ./myfrcode and ./myfrcode2 require their input sorted in a
particular way, if their output is to be properly decompressible.
It's quite possible to do that, but I haven't got round to doing it yet.
'sort +1' is a close approximation, but it's not right. The data I
generated will not extract correctly, but should demonstrate the right
compression characteristics.)
Summarising:
If network bandwidth is the only factor, we should do some form of
front-coding on the du entries (preferably a non-broken form), because
this gives the smallest compressed size.
If the size of the 'available' file is the only factor, we could use
'./squish', because this gives the smallest uncompressed size.
On balance, I'd say that front-coding gives a reasonable compromise
between network bandwidth, size of available file, and human-readability.
If human-readability is more important than size of 'available' file, then
we should simply put the un-encoded 'du' output into the Packages files.
--
For some idea of what front-coding is, see:
    info '(find)New Database Format'
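For readers without the findutils info pages to hand, here is a minimal Python sketch of the general front-coding idea: each path is stored as the number of leading characters it shares with the previous path, plus the remaining suffix. This is not ./myfrcode or ./myfrcode2, and the locate database format referenced above differs in its details; the names `front_encode` and `front_decode` are hypothetical.

```python
def front_encode(paths):
    """Encode paths as (shared_prefix_length, suffix) pairs.

    Works on any order, but only saves space when neighbouring paths
    share long prefixes, i.e. when the input is sorted by path.
    """
    encoded = []
    previous = ""
    for path in paths:
        shared = 0
        limit = min(len(path), len(previous))
        while shared < limit and path[shared] == previous[shared]:
            shared += 1
        encoded.append((shared, path[shared:]))
        previous = path
    return encoded


def front_decode(encoded):
    """Rebuild the original paths from (shared_prefix_length, suffix) pairs."""
    paths = []
    previous = ""
    for shared, suffix in encoded:
        path = previous[:shared] + suffix
        paths.append(path)
        previous = path
    return paths


paths = ["usr", "usr/bin", "usr/doc", "usr/doc/stow",
         "usr/doc/stow/html", "usr/info"]
for shared, suffix in front_encode(paths):
    print(shared, suffix)
```

On deep hierarchies such as the picon-* packages, most of each path is repeated from the line before, which is why this kind of encoding shortens the listing while keeping it fairly readable.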
--
Please feel free to comment on any or all of this on the lists, but
note that, due to the arrangements by which I get Debian-related email,
there is likely to be a day or two's delay before I can reply.
--
Charles Briscoe-Smith
White pages entry, with PGP key: <URL:http://alethea.ukc.ac.uk/wp?95cpb4>
PGP public keyprint: 74 68 AB 2E 1C 60 22 94 B8 21 2D 01 DE 66 13 E2
begin 644 scripts.tar.gz
M'XL(```````"`^U:6W?;N!'VJ_DK)HH:2K))D-2M:X5NW>ZE>6K.[MGV(;)K
MB@1%'HN@RHLOB;V_O3,@*=%*LNM-&SNGP>=C$QP,!B"!;S!#>,F%$91[GQ6V
M94W'8]B#B66-;+S">#AVZ"KAV),IP&0\'(ZGDQ$J@FW9H]$>6'N/@#(OO`Q@
MSU\O1K^FYY?+O?]#/'\&;!$+ED?:<_@N6?``]*`$XR<=TK)8EP7$HDCAM>=?
M>$N>0QBO>*YI7N9'\27_U\++N1L5Q?J(L5+$UV;$1>Z9GF^6%RR)LRS-<A;P
M1>P)+8CSPJ4_.8N\)&%^*HHL7E#W7G9CQ,,_3C0MY(4?]?KP3@-X#C_GV.D1
M2"'D:<+9VBLBEHK:MHFF(;D17L*IB&V$<,]I4"2"KGV.HBP!(X2N$%B^6O("
M#'$)W?83L*Z-=<DE*8%9\&1-CVEVNY6T+8&NH]W5PX0N/0UKWHVY?`NMLK9\
M&Z_!",)[0G^]?9<;>9K%2TV[BK`+*'B.(]Q6@N%!YSSB7@"&O9&>=\"%3F<&
M08IC++QX!0?.MLWQUK;@5]53M"6;&RU(!=>THW:+H/S86.KN<EHD[*S+#KN!
M#B_;W191G!OKZK[1M`])=U?SH0,$"(7;.5]F'%]F#&??X\!H<H_@9;LSN*TZ
MRYDY.((!8_IYA]I6\Q0*H#FLUP@MB"R$+$T+O`O6%TN#UI%QO5%JZE"SU:[G
M![("7P,Q!,Y7.#LGYWV8:^3'<`1I5L"!O;GO<3]*<;F!_FUYI,.L&>(9`Z;W
MX?B]]W5_9+Y7W%.@!NU9`I#V=Z2TRC]Y)3UD+?WJ9,GU5"VJBG<;._<>=4_A
MRT`](8:?)FN<-5$8Z.@*GCWB_@_3B=7L_Z@WPOU_.'%LM?\_UOY?YIF,`=8\
M6V$4\+$E`88!=8D<<``;A1S2\'Z(H*&=O_S\PQ'Z`YX+O8",^^E2Q#G'>`+?
MN/"Y;%1$?-<8>2_TV=YZS;T,S00QMBU6-^"%U+6W-15L;'W$E*EIW5O7GN%E
M:]Z%/(K#`OY\\N,/_YAIW766+C,OV9776V#OY3$%(_NEP+@GAU["SGI;8T=]
M^%//'/19+)7VUQE&2S,L"'Y-USNM$F$D@A;WTS47\/H0.K=-K[?M':$S:]1?
M0Z?KS`4)[@]C?^6A4Z\'@V,!V?NLZ5HVM*N&V+>_2O&%OV[,SC!RV9W_4/SO
M^?Z[^6]/-ORWQS;%_^/11/'_:?B/E/L;S[B>(Q'Q"E<<`NZMX"HN(B09K:@0
M]_QU0_>*KI%WR2$J,4@1)68168YFD)05>3&^Q4@!X)^\Y0<HG(<+D5X)0#(L
M5CQI:X,G`O)%62DX43O!YB?5;0!%QCE&1P(6O.T-%N@CT/L('&R'F1USE\,)
M([;,BX8SO6X>O^6'&/6CV^B[O:Z-9:>/-5("[B^0N_1R5IBEK&/,5U@O2!,/
MO<XMACWY;8EY1GQ-ERSOL]Z;,W:*ELW!3J.NS;H.,UTR7*2%MWHG[=\=N'(`
M,YE0I!GW*%25/?=D)'G!;W+X@VPAQU]1O'//QKR05TGYNT\(JY*;,//3@#\E
M_X>.4_%_8HTF]DCR?ZCV_Z?:_U\)2OOS*"U7`3&,UB*R*Q68ON!RIOUU529B
MAUV8%2`#+..;TX,MQ3!GNL<QW&2W+`/HQ@(I+X4-X8K,9?A356]9:`Z8ZY*L
MX0!9Q;4O#?RW'/B:T?#?>4K^#^V&_Q-G:A'_)_90\?_KX;^L;4MJ1^#ZP>]Q
M!,H-?`+_2U%Y`-/_C/R?C$8?Y;]MC\9;_E,L8-O#H>+_HX`-H+T$*,<O19#*
M;)J'(8;C,D^OZF'`,+&/A;\J\>9E7@1Q:D;']T1(RR7)-*(GQ<H]*GC^(?@1
MONB!=_GFM*]A-$MB/\*86,H79?C&MIS1*>7)=&.=NOK<TK<9<,^/W"5]SO6R
M7K__S/WN[]]7*7%E=Y7Z+K:C5)@LIV'H6F2K:>Y'S]!@H<.+%T#E3?-]]';2
MJ!]1Z$_VMOV0X$[;T=D:W1W3ASN@L>#OP+8._,C0Y4.AU:T9K#2,8ZM6KX7T
M1&2+KFB9/AC+ZLU8J.+@H!KSG?9>A6RRV]/[`Q8/>".U03E;LJMM#[/ZKIFL
M35NRW)>?(90O_H*1_[N,\^A)SW]A.*[\_PC3O_%P0O&?,YHJ___HY[\GRV7&
M\SR^Y/(;JBQCV!>F&1T*-R?"A_7Y4E[$JQ4LN>"95Z#7AY.?_OKJ%7WWD6JF
MIM4?-^<8#<[[^]6%S9W]N<UTF&NW5>A)'TK0@W@%9_4V<UN=0AG^@>U(O>HH
M]1O1KK%E35ER(=L8"5S?O&T4#APL56>&EZ"?N8BNKOS0!Y:_^/P>X+?X/W(F
M3?PWMFU'?O^9*OX_/O]_Y)<\RWDK^*/8KUH@&S:;`V`U?:OS;7W!E[$PZ!\I
M)B.:8V`!OV08':(CT&=TC#VK3JIUXJ'>;[4U!'@+/R#PD,B;R7LL8SPQG,X=
M>SBW;&MN6?3KX.\0+988:Q+I=RSI^"--4$'&([7C""I%DS6![O:_!3[HF[2O
MB?^?._O[S?P/1L/Z^R^J3>PI\7\Z5M]_'X?_[V=S#\_<6HG6![*X!^=ME#:@
MF0.WE\=+.M^AFG[5Q3[E%B_(+FJ<D@"3TW<?2M.:I*37;GT'K?0/>VYR'?(.
M_4WN8M6)2][#CE32HJ"@H*"@H*"@H*"@H*"@H*"@H*"@H*"@H*"@H*"@H*"@
0H*"@H/#EXS_\]H]R`%``````
`
end