Chapter 10. Data management

Table of Contents

10.1. Sharing, copying, and archiving
10.1.1. Archive and compression tools
10.1.2. Copy and synchronization tools
10.1.3. Idioms for the archive
10.1.4. Idioms for the copy
10.1.5. Idioms for the selection of files
10.1.6. Archive media
10.1.7. Removable storage device
10.1.8. Filesystem choice for sharing data
10.1.9. Sharing data via network
10.2. Backup and recovery
10.2.1. Backup utility suites
10.2.2. An example script for the system backup
10.2.3. A copy script for the data backup
10.3. Data security infrastructure
10.3.1. Key management for GnuPG
10.3.2. Using GnuPG on files
10.3.3. Using GnuPG with Mutt
10.3.4. Using GnuPG with Vim
10.3.5. The MD5 sum
10.4. Source code merge tools
10.4.1. Extracting differences for source files
10.4.2. Merging updates for source files
10.4.3. Updating via 3-way-merge
10.5. Version control systems
10.5.1. Comparison of VCS commands
10.6. Git
10.6.1. Configuration of Git client
10.6.2. Git references
10.6.3. Git commands
10.6.4. Git for the Subversion repository
10.6.5. Git for recording configuration history
10.7. CVS
10.7.1. Configuration of CVS repository
10.7.2. Local access to CVS
10.7.3. Remote access to CVS with pserver
10.7.4. Remote access to CVS with ssh
10.7.5. Importing a new source to CVS
10.7.6. File permissions in CVS repository
10.7.7. Work flow of CVS
10.7.8. Latest files from CVS
10.7.9. Administration of CVS
10.7.10. Execution bit for CVS checkout
10.8. Subversion
10.8.1. Configuration of Subversion repository
10.8.2. Access to Subversion via Apache2 server
10.8.3. Local access to Subversion by group
10.8.4. Remote access to Subversion via SSH
10.8.5. Subversion directory structure
10.8.6. Importing a new source to Subversion
10.8.7. Work flow of Subversion

Tools and tips for managing binary and text data on the Debian system are described.

[Warning] Warning

The uncoordinated write access to actively accessed devices and files from multiple processes must not be done to avoid the race condition. File locking mechanisms using flock(1) may be used to avoid it.

The security of the data and its controlled sharing have several aspects.

  • The creation of data archive

  • The remote storage access

  • The duplication

  • The tracking of the modification history

  • The facilitation of data sharing

  • The prevention of unauthorized file access

  • The detection of unauthorized file modification

These can be realized by using some combination of tools.

  • Archive and compression tools

  • Copy and synchronization tools

  • Network filesystems

  • Removable storage media

  • The secure shell

  • The authentication system

  • Version control system tools

  • Hash and cryptographic encryption tools

Here is a summary of archive and compression tools available on the Debian system.

Table 10.1. List of archive and compression tools

package popcon size extension command comment
tar V:565, I:999 2614 .tar tar(1) the standard archiver (de facto standard)
cpio V:333, I:999 844 .cpio cpio(1) Unix System V style archiver, use with find(1)
binutils V:300, I:761 19469 .ar ar(1) archiver for the creation of static libraries
fastjar V:8, I:82 191 .jar fastjar(1) archiver for Java (zip like)
pax V:19, I:71 147 .pax pax(1) new POSIX standard archiver, compromise between tar and cpio
gzip V:862, I:999 239 .gz gzip(1), zcat(1), … GNU LZ77 compression utility (de facto standard)
bzip2 V:405, I:904 119 .bz2 bzip2(1), bzcat(1), … Burrows-Wheeler block-sorting compression utility with higher compression ratio than gzip(1) (slower than gzip with similar syntax)
lzma V:10, I:127 144 .lzma lzma(1) LZMA compression utility with higher compression ratio than gzip(1) (deprecated)
xz-utils V:212, I:946 472 .xz xz(1), xzdec(1), … XZ compression utility with higher compression ratio than bzip2(1) (slower than gzip but faster than bzip2; replacement for LZMA compression utility)
p7zip V:9, I:91 966 .7z 7zr(1), p7zip(1) 7-Zip file archiver with high compression ratio (LZMA compression)
p7zip-full V:220, I:516 3810 .7z 7z(1), 7za(1) 7-Zip file archiver with high compression ratio (LZMA compression and others)
lzop V:4, I:41 112 .lzo lzop(1) LZO compression utility with higher compression and decompression speed than gzip(1) (lower compression ratio than gzip with similar syntax)
zip V:51, I:355 593 .zip zip(1) InfoZIP: DOS archive and compression tool
unzip V:287, I:789 377 .zip unzip(1) InfoZIP: DOS unarchive and decompression tool

[Warning] Warning

Do not set the "$TAPE" variable unless you know what to expect. It changes tar(1) behavior.

[Note] Note

The gzipped tar(1) archive uses the file extension ".tgz" or ".tar.gz".

[Note] Note

The xz-compressed tar(1) archive uses the file extension ".txz" or ".tar.xz".

[Note] Note

Popular compression method in FOSS tools such as tar(1) has been moving as follows: gzipbzip2xz

[Note] Note

cp(1), scp(1) and tar(1) may have some limitation for special files. cpio(1) is most versatile.

[Note] Note

cpio(1) is designed to be used with find(1) and other commands and suitable for creating backup scripts since the file selection part of the script can be tested independently.

[Note] Note

Internal structure of Libreoffice data files are ".jar" file.

Here is a summary of simple copy and backup tools available on the Debian system.


Copying files with rsync(8) offers richer features than others.

  • delta-transfer algorithm that sends only the differences between the source files and the existing files in the destination

  • quick check algorithm (by default) that looks for files that have changed in size or in last-modified time

  • "--exclude" and "--exclude-from" options similar to tar(1)

  • "a trailing slash on the source directory" syntax that avoids creating an additional directory level at the destination.

[Tip] Tip

Execution of the bkup script mentioned in Section 10.2.3, “A copy script for the data backup” with the "-gl" option under cron(8) should provide very similar functionality as Plan9's dumpfs for the static data archive.

[Tip] Tip

Version control system (VCS) tools in Table 10.11, “List of version control system tools” can function as the multi-way copy and synchronization tools.

Here are several ways to copy the entire content of the directory "./source" using different tools.

  • Local copy: "./source" directory → "/dest" directory

  • Remote copy: "./source" directory at local host → "/dest" directory at "user@host.dom" host

rsync(8):

# cd ./source; rsync -aHAXSv . /dest
# cd ./source; rsync -aHAXSv . user@host.dom:/dest

You can alternatively use "a trailing slash on the source directory" syntax.

# rsync -aHAXSv ./source/ /dest
# rsync -aHAXSv ./source/ user@host.dom:/dest

Alternatively, by the following.

# cd ./source; find . -print0 | rsync -aHAXSv0 --files-from=- . /dest
# cd ./source; find . -print0 | rsync -aHAXSv0 --files-from=- . user@host.dom:/dest

GNU cp(1) and openSSH scp(1):

# cd ./source; cp -a . /dest
# cd ./source; scp -pr . user@host.dom:/dest

GNU tar(1):

# (cd ./source && tar cf - . ) | (cd /dest && tar xvfp - )
# (cd ./source && tar cf - . ) | ssh user@host.dom '(cd /dest && tar xvfp - )'

cpio(1):

# cd ./source; find . -print0 | cpio -pvdm --null --sparse /dest

You can substitute "." with "foo" for all examples containing "." to copy files from "./source/foo" directory to "/dest/foo" directory.

You can substitute "." with the absolute path "/path/to/source/foo" for all examples containing "." to drop "cd ./source;". These copy files to different locations depending on tools used as follows.

  • "/dest/foo": rsync(8), GNU cp(1), and scp(1)

  • "/dest/path/to/source/foo": GNU tar(1), and cpio(1)

[Tip] Tip

rsync(8) and GNU cp(1) have option "-u" to skip files that are newer on the receiver.

find(1) is used to select files for archive and copy commands (see Section 10.1.3, “Idioms for the archive” and Section 10.1.4, “Idioms for the copy”) or for xargs(1) (see Section 9.3.9, “Repeating a command looping over files”). This can be enhanced by using its command arguments.

Basic syntax of find(1) can be summarized as the following.

  • Its conditional arguments are evaluated from left to right.

  • This evaluation stops once its outcome is determined.

  • "Logical OR" (specified by "-o" between conditionals) has lower precedence than "logical AND" (specified by "-a" or nothing between conditionals).

  • "Logical NOT" (specified by "!" before a conditional) has higher precedence than "logical AND".

  • "-prune" always returns logical TRUE and, if it is a directory, searching of file is stopped beyond this point.

  • "-name" matches the base of the filename with shell glob (see Section 1.5.6, “Shell glob”) but it also matches its initial "." with metacharacters such as "*" and "?". (New POSIX feature)

  • "-regex" matches the full path with emacs style BRE (see Section 1.6.2, “Regular expressions”) as default.

  • "-size" matches the file based on the file size (value precedented with "+" for larger, precedented with "-" for smaller)

  • "-newer" matches the file newer than the one specified in its argument.

  • "-print0" always returns logical TRUE and print the full filename (null terminated) on the standard output.

find(1) is often used with an idiomatic style as the following.

# find /path/to \
    -xdev -regextype posix-extended \
    -type f -regex ".*\.cpio|.*~" -prune -o \
    -type d -regex ".*/\.git" -prune -o \
    -type f -size +99M -prune -o \
    -type f -newer /path/to/timestamp -print0

This means to do following actions.

  1. Search all files starting from "/path/to"

  2. Globally limit its search within its starting filesystem and uses ERE (see Section 1.6.2, “Regular expressions”) instead

  3. Exclude files matching regex of ".*\.cpio" or ".*~" from search by stop processing

  4. Exclude directories matching regex of ".*/\.git" from search by stop processing

  5. Exclude files larger than 99 Megabytes (units of 1048576 bytes) from search by stop processing

  6. Print filenames which satisfy above search conditions and are newer than "/path/to/timestamp"

Please note the idiomatic use of "-prune -o" to exclude files in the above example.

[Note] Note

For non-Debian Unix-like system, some options may not be supported by find(1). In such a case, please consider to adjust matching methods and replace "-print0" with "-print". You may need to adjust related commands too.

When choosing computer data storage media for important data archive, you should be careful about their limitations. For small personal data backup, I use CD-R and DVD-R by the brand name company and store in a cool, shaded, dry, clean environment. (Tape archive media seem to be popular for professional use.)

[Note] Note

A fire-resistant safe are meant for paper documents. Most of the computer data storage media have less temperature tolerance than paper. I usually rely on multiple secure encrypted copies stored in multiple secure locations.

Optimistic storage life of archive media seen on the net (mostly from vendor info).

  • 100+ years : Acid free paper with ink

  • 100 years : Optical storage (CD/DVD, CD/DVD-R)

  • 30 years : Magnetic storage (tape, floppy)

  • 20 years : Phase change optical storage (CD-RW)

These do not count on the mechanical failures due to handling etc.

Optimistic write cycle of archive media seen on the net (mostly from vendor info).

  • 250,000+ cycles : Harddisk drive

  • 10,000+ cycles : Flash memory

  • 1,000 cycles : CD/DVD-RW

  • 1 cycles : CD/DVD-R, paper

[Caution] Caution

Figures of storage life and write cycle here should not be used for decisions on any critical data storage. Please consult the specific product information provided by the manufacture.

[Tip] Tip

Since CD/DVD-R and paper have only 1 write cycle, they inherently prevent accidental data loss by overwriting. This is advantage!

[Tip] Tip

If you need fast and frequent backup of large amount of data, a hard disk on a remote host linked by a fast network connection, may be the only realistic option.

Removable storage devices may be any one of the following.

They may be connected via any one of the following.

Modern desktop environments such as GNOME and KDE can mount these removable devices automatically without a matching "/etc/fstab" entry.

  • udisks package provides a daemon and associated utilities to mount and unmount these devices.

  • D-bus creates events to initiate automatic processes.

  • PolicyKit provides required privileges.

[Tip] Tip

Automounted devices may have the "uhelper=" mount option which is used by umount(8).

[Tip] Tip

Automounting under modern desktop environment happens only when those removable media devices are not listed in "/etc/fstab".

Mount point under modern desktop environment is chosen as "/media/<disk_label>" which can be customized by the following.

  • mlabel(1) for FAT filesystem

  • genisoimage(1) with "-V" option for ISO9660 filesystem

  • tune2fs(1) with "-L" option for ext2/ext3/ext4 filesystem

[Tip] Tip

The choice of encoding may need to be provided as mount option (see Section 8.3.6, “Filename encoding”).

[Tip] Tip

The use of the GUI menu to unmount a filesystem may remove its dynamically generated device node such as "/dev/sdc". If you wish to keep its device node, unmount it with the umount(8) command from the shell prompt.

When sharing data with other system via removable storage device, you should format it with common filesystem supported by both systems. Here is a list of filesystem choices.


[Tip] Tip

See Section 9.8.1, “Removable disk encryption with dm-crypt/LUKS” for cross platform sharing of data using device level encryption.

The FAT filesystem is supported by almost all modern operating systems and is quite useful for the data exchange purpose via removable hard disk like media.

When formatting removable hard disk like devices for cross platform sharing of data with the FAT filesystem, the following should be safe choices.

When using the FAT or ISO9660 filesystems for sharing data, the following should be the safe considerations.

  • Archiving files into an archive file first using tar(1), or cpio(1) to retain the long filename, the symbolic link, the original Unix file permission and the owner information.

  • Splitting the archive file into less than 2 GiB chunks with the split(1) command to protect it from the file size limitation.

  • Encrypting the archive file to secure its contents from the unauthorized access.

[Note] Note

For FAT filesystems by its design, the maximum file size is (2^32 - 1) bytes = (4GiB - 1 byte). For some applications on the older 32 bit OS, the maximum file size was even smaller (2^31 - 1) bytes = (2GiB - 1 byte). Debian does not suffer the latter problem.

[Note] Note

Microsoft itself does not recommend to use FAT for drives or partitions of over 200 MB. Microsoft highlights its short comings such as inefficient disk space usage in their "Overview of FAT, HPFS, and NTFS File Systems". Of course, we should normally use the ext4 filesystem for Linux.

[Tip] Tip

For more on filesystems and accessing filesystems, please read "Filesystems HOWTO".

We all know that computers fail sometime or human errors cause system and data damages. Backup and recovery operations are the essential part of successful system administration. All possible failure modes hit you some day.

[Tip] Tip

Keep your backup system simple and backup your system often. Having backup data is more important than how technically good your backup method is.

There are 3 key factors which determine actual backup and recovery policy.

  1. Knowing what to backup and recover.

    • Data files directly created by you: data in "~/"

    • Data files created by applications used by you: data in "/var/" (except "/var/cache/", "/var/run/", and "/var/tmp/")

    • System configuration files: data in "/etc/"

    • Local softwares: data in "/usr/local/" or "/opt/"

    • System installation information: a memo in plain text on key steps (partition, …)

    • Proven set of data: confirmed by experimental recovery operations in advance

  2. Knowing how to backup and recover.

    • Secure storage of data: protection from overwrite and system failure

    • Frequent backup: scheduled backup

    • Redundant backup: data mirroring

    • Fool proof process: easy single command backup

  3. Assessing risks and costs involved.

    • Value of data when lost

    • Required resources for backup: human, hardware, software, …

    • Failure mode and their possibility

[Note] Note

Do not back up the pseudo-filesystem contents found on /proc, /sys, /tmp, and /run (see Section 1.2.12, “procfs and sysfs” and Section 1.2.13, “tmpfs”). Unless you know exactly what you are doing, they are huge useless data.

As for secure storage of data, data should be at least on different disk partitions preferably on different disks and machines to withstand the filesystem corruption. Important data are best stored on a write-once media such as CD/DVD-R to prevent overwrite accidents. (See Section 9.7, “The binary data” for how to write to the storage media from the shell commandline. GNOME desktop GUI environment gives you easy access via menu: "Places→CD/DVD Creator".)

[Note] Note

You may wish to stop some application daemons such as MTA (see Section 6.3, “Mail transport agent (MTA)”) while backing up data.

[Note] Note

You should pay extra care to the backup and restoration of identity related data files such as "/etc/ssh/ssh_host_dsa_key", "/etc/ssh/ssh_host_rsa_key", "~/.gnupg/*", "~/.ssh/*", "/etc/passwd", "/etc/shadow", "/etc/fetchmailrc", "popularity-contest.conf", "/etc/ppp/pap-secrets", and "/etc/exim4/passwd.client". Some of these data can not be regenerated by entering the same input string to the system.

[Note] Note

If you run a cron job as a user process, you must restore files in "/var/spool/cron/crontabs" directory and restart cron(8). See Section 9.3.14, “Scheduling tasks regularly” for cron(8) and crontab(1).

Here is a select list of notable backup utility suites available on the Debian system.


Backup tools have their specialized focuses.

  • Mondo Rescue is a backup system to facilitate restoration of complete system quickly from backup CD/DVD etc. without going through normal system installation processes.

  • sbackup and keep packages provide easy GUI frontend for desktop users to make regular backups of user data. An equivalent function can be realized by a simple script (Section 10.2.2, “An example script for the system backup”) and cron(8).

  • Bacula, Amanda, and BackupPC are full featured backup suite utilities which are focused on regular backups over network.

Basic tools described in Section 10.1.1, “Archive and compression tools” and Section 10.1.2, “Copy and synchronization tools” can be used to facilitate system backup via custom scripts. Such script can be enhanced by the following.

  • The obnam package enables incremental (remote) backups.

  • The rdiff-backup package enables incremental (remote) backups.

  • The dump package helps to archive and restore the whole filesystem incrementally and efficiently.

[Tip] Tip

See files in "/usr/share/doc/dump/" and "Is dump really deprecated?" to learn about the dump package.

For a personal Debian desktop system running unstable suite, I only need to protect personal and critical data. I reinstall system once a year anyway. Thus I see no reason to backup the whole system or to install a full featured backup utility.

I use a simple script to make a backup archive and burn it into CD/DVD using GUI. Here is an example script for this.

#!/bin/sh -e
# Copyright (C) 2007-2008 Osamu Aoki <osamu@debian.org>, Public Domain
BUUID=1000; USER=osamu # UID and name of a user who accesses backup files
BUDIR="/var/backups"
XDIR0=".+/Mail|.+/Desktop"
XDIR1=".+/\.thumbnails|.+/\.?Trash|.+/\.?[cC]ache|.+/\.gvfs|.+/sessions"
XDIR2=".+/CVS|.+/\.git|.+/\.svn|.+/Downloads|.+/Archive|.+/Checkout|.+/tmp"
XSFX=".+\.iso|.+\.tgz|.+\.tar\.gz|.+\.tar\.bz2|.+\.cpio|.+\.tmp|.+\.swp|.+~"
SIZE="+99M"
DATE=$(date --utc +"%Y%m%d-%H%M")
[ -d "$BUDIR" ] || mkdir -p "BUDIR"
umask 077
dpkg --get-selections \* > /var/lib/dpkg/dpkg-selections.list
debconf-get-selections > /var/cache/debconf/debconf-selections

{
find /etc /usr/local /opt /var/lib/dpkg/dpkg-selections.list \
     /var/cache/debconf/debconf-selections -xdev -print0
find /home/$USER /root -xdev -regextype posix-extended \
  -type d -regex "$XDIR0|$XDIR1" -prune -o -type f -regex "$XSFX" -prune -o \
  -type f -size  "$SIZE" -prune -o -print0
find /home/$USER/Mail/Inbox /home/$USER/Mail/Outbox -print0
find /home/$USER/Desktop  -xdev -regextype posix-extended \
  -type d -regex "$XDIR2" -prune -o -type f -regex "$XSFX" -prune -o \
  -type f -size  "$SIZE" -prune -o -print0
} | cpio -ov --null -O $BUDIR/BU$DATE.cpio
chown $BUUID $BUDIR/BU$DATE.cpio
touch $BUDIR/backup.stamp

This is meant to be a script example executed from root.

I expect you to change and execute this as follows.

Keep it simple!

[Tip] Tip

You can recover debconf configuration data with "debconf-set-selections debconf-selections" and dpkg selection data with "dpkg --set-selection <dpkg-selections.list".

For the set of data under a directory tree, the copy with "cp -a" provides the normal backup.

For the set of large non-overwritten static data under a directory tree such as the one under the "/var/cache/apt/packages/" directory, hardlinks with "cp -al" provide an alternative to the normal backup with efficient use of the disk space.

Here is a copy script, which I named as bkup, for the data backup. This script copies all (non-VCS) files under the current directory to the dated directory on the parent directory or on a remote host.

#!/bin/sh -e
# Copyright (C) 2007-2008 Osamu Aoki <osamu@debian.org>, Public Domain
fdot(){ find . -type d \( -iname ".?*" -o -iname "CVS" \) -prune -o -print0;}
fall(){ find . -print0;}
mkdircd(){ mkdir -p "$1";chmod 700 "$1";cd "$1">/dev/null;}
FIND="fdot";OPT="-a";MODE="CPIOP";HOST="localhost";EXTP="$(hostname -f)"
BKUP="$(basename $(pwd)).bkup";TIME="$(date  +%Y%m%d-%H%M%S)";BU="$BKUP/$TIME"
while getopts gcCsStrlLaAxe:h:T f; do case $f in
g)  MODE="GNUCP";; # cp (GNU)
c)  MODE="CPIOP";; # cpio -p
C)  MODE="CPIOI";; # cpio -i
s)  MODE="CPIOSSH";; # cpio/ssh
t)  MODE="TARSSH";; # tar/ssh
r)  MODE="RSYNCSSH";; # rsync/ssh
l)  OPT="-alv";; # hardlink (GNU cp)
L)  OPT="-av";;  # copy (GNU cp)
a)  FIND="fall";; # find all
A)  FIND="fdot";; # find non CVS/ .???/
x)  set -x;; # trace
e)  EXTP="${OPTARG}";; # hostname -f
h)  HOST="${OPTARG}";; # user@remotehost.example.com
T)  MODE="TEST";; # test find mode
\?) echo "use -x for trace."
esac; done
shift $(expr $OPTIND - 1)
if [ $# -gt 0 ]; then
  for x in $@; do cp $OPT $x $x.$TIME; done
elif [ $MODE = GNUCP ]; then
  mkdir -p "../$BU";chmod 700 "../$BU";cp $OPT . "../$BU/"
elif [ $MODE = CPIOP ]; then
  mkdir -p "../$BU";chmod 700 "../$BU"
  $FIND|cpio --null --sparse -pvd ../$BU
elif [ $MODE = CPIOI ]; then
  $FIND|cpio -ov --null | ( mkdircd "../$BU"&&cpio -i )
elif [ $MODE = CPIOSSH ]; then
  $FIND|cpio -ov --null|ssh -C $HOST "( mkdircd \"$EXTP/$BU\"&&cpio -i )"
elif [ $MODE = TARSSH ]; then
  (tar cvf - . )|ssh -C $HOST "( mkdircd \"$EXTP/$BU\"&& tar xvfp - )"
elif [ $MODE = RSYNCSSH ]; then
  rsync -aHAXSv ./ "${HOST}:${EXTP}-${BKUP}-${TIME}"
else
  echo "Any other idea to backup?"
  $FIND |xargs -0 -n 1 echo
fi

This is meant to be command examples. Please read script and edit it by yourself before using it.

[Tip] Tip

I keep this bkup in my "/usr/local/bin/" directory. I issue this bkup command without any option in the working directory whenever I need a temporary snapshot backup.

[Tip] Tip

For making snapshot history of a source file tree or a configuration file tree, it is easier and space efficient to use git(7) (see Section 10.6.5, “Git for recording configuration history”).

The data security infrastructure is provided by the combination of data encryption tool, message digest tool, and signature tool.


See Section 9.8, “Data encryption tips” on dm-crypto and ecryptfs which implement automatic data encryption infrastructure via Linux kernel modules.

Here are GNU Privacy Guard commands for the basic key management.


Here is the meaning of the trust code.


The following uploads my key "1DD8D791" to the popular keyserver "hkp://keys.gnupg.net".

$ gpg --keyserver hkp://keys.gnupg.net --send-keys 1DD8D791

A good default keyserver set up in "~/.gnupg/gpg.conf" (or old location "~/.gnupg/options") contains the following.

keyserver hkp://keys.gnupg.net

The following obtains unknown keys from the keyserver.

$ gpg --list-sigs --with-colons | grep '^sig.*\[User ID not found\]' |\
  cut -d ':' -f 5| sort | uniq | xargs gpg --recv-keys

There was a bug in OpenPGP Public Key Server (pre version 0.9.6) which corrupted key with more than 2 sub-keys. The newer gnupg (>1.2.1-2) package can handle these corrupted subkeys. See gpg(1) under "--repair-pks-subkey-bug" option.

md5sum(1) provides utility to make a digest file using the method in rfc1321 and verifying each file with it.

$ md5sum foo bar >baz.md5
$ cat baz.md5
d3b07384d113edec49eaa6238ad5ff00  foo
c157a79031e1c40f85931829bc5fc552  bar
$ md5sum -c baz.md5
foo: OK
bar: OK
[Note] Note

The computation for the MD5 sum is less CPU intensive than the one for the cryptographic signature by GNU Privacy Guard (GnuPG). Usually, only the top level digest file is cryptographically signed to ensure data integrity.

There are many merge tools for the source code. Following commands caught my eyes.

Table 10.10. List of source code merge tools

package popcon size command description
diffutils V:814, I:945 1229 diff(1) compare files line by line
diffutils V:814, I:945 1229 diff3(1) compare and merges three files line by line
vim V:148, I:379 2063 vimdiff(1) compare 2 files side by side in vim
patch V:154, I:949 172 patch(1) apply a diff file to an original
dpatch V:2, I:32 237 dpatch(1) manage series of patches for Debian package
diffstat V:14, I:140 49 diffstat(1) produce a histogram of changes by the diff
patchutils V:12, I:128 186 combinediff(1) create a cumulative patch from two incremental patches
patchutils V:12, I:128 186 dehtmldiff(1) extract a diff from an HTML page
patchutils V:12, I:128 186 filterdiff(1) extract or excludes diffs from a diff file
patchutils V:12, I:128 186 fixcvsdiff(1) fix diff files created by CVS that patch(1) mis-interprets
patchutils V:12, I:128 186 flipdiff(1) exchange the order of two patches
patchutils V:12, I:128 186 grepdiff(1) show which files are modified by a patch matching a regex
patchutils V:12, I:128 186 interdiff(1) show differences between two unified diff files
patchutils V:12, I:128 186 lsdiff(1) show which files are modified by a patch
patchutils V:12, I:128 186 recountdiff(1) recompute counts and offsets in unified context diffs
patchutils V:12, I:128 186 rediff(1) fix offsets and counts of a hand-edited diff
patchutils V:12, I:128 186 splitdiff(1) separate out incremental patches
patchutils V:12, I:128 186 unwrapdiff(1) demangle patches that have been word-wrapped
wiggle V:0, I:0 189 wiggle(1) apply rejected patches
quilt V:7, I:57 794 quilt(1) manage series of patches
meld V:9, I:44 2833 meld(1) compare and merge files (GTK)
dirdiff V:0, I:5 191 dirdiff(1) display differences and merge changes between directory trees
docdiff V:0, I:0 582 docdiff(1) compare two files word by word / char by char
imediff2 V:0, I:0 58 imediff2(1) interactive full screen 2-way merge tool
makepatch V:0, I:0 148 makepatch(1) generate extended patch files
makepatch V:0, I:0 148 applypatch(1) apply extended patch files
wdiff V:10, I:131 891 wdiff(1) display word differences between text files

Here is a summary of the version control systems (VCS) on the Debian system.

[Note] Note

If you are new to VCS systems, you should start learning with Git, which is growing fast in popularity.


VCS is sometimes known as revision control system (RCS), or software configuration management (SCM).

Distributed VCS such as Git is the tool of choice these days. CVS and Subversion may still be useful to join some existing open source program activities.

Debian provides free VCS services via Debian Alioth service. It supports practically all VCSs. Its documentation can be found at http://wiki.debian.org/Alioth .

There are few basics for creating a shared access VCS archive.

Here is an oversimplified comparison of native VCS commands to provide the big picture. The typical command sequence may require options and arguments.


[Caution] Caution

Invoking a git subcommand directly as "git-xyz" from the command line has been deprecated since early 2006.

[Tip] Tip

If there is a executable file git-foo in the path specified by $PATH, entring "git foo" without hyphen to the command line invokes this git-foo. This is a feature of the git command.

[Tip] Tip

GUI tools such as tkcvs(1) and gitk(1) really help you with tracking revision history of files. The web interface provided by many public archives for browsing their repositories is also quite useful, too.

[Tip] Tip

Git can work directly with different VCS repositories such as ones provided by CVS and Subversion, and provides the local repository for local changes with git-cvs and git-svn packages. See git for CVS users, and Section 10.6.4, “Git for the Subversion repository”.

[Tip] Tip

Git has commands which have no equivalents in CVS and Subversion: "fetch", "rebase", "cherry-pick", …

Git can do everything for both local and remote source code management. This means that you can record the source code changes without needing network connectivity to the remote repository.

See the following.

git-gui(1) and gitk(1) commands make using Git very easy.

[Warning] Warning

Do not use the tag string with spaces in it even if some tools such as gitk(1) allow you to use it. It may choke some other git commands.

Even if your upstream uses different VCS, it may be a good idea to use git(1) for local activity since you can manage your local copy of source tree without the network connection to the upstream. Here are some packages and commands used with git(1).


[Tip] Tip

With git(1), you work on a local branch with many commits and use something like "git rebase -i master" to reorganize change history later. This enables you to make clean change history. See git-rebase(1) and git-cherry-pick(1).

[Tip] Tip

When you want to go back to a clean working directory without loosing the current state of the working directory, you can use "git stash". See git-stash(1).

You can manually record chronological history of configuration using Git tools. Here is a simple example for your practice to record "/etc/apt/" contents.

$ cd /etc/apt/
$ sudo git init
$ sudo chmod 700 .git
$ sudo git add .
$ sudo git commit -a

Commit configuration with description.

Make modification to the configuration files.

$ cd /etc/apt/
$ sudo git commit -a

Commit configuration with description and continue your life.

$ cd /etc/apt/
$ sudo gitk --all

You have full configuration history with you.

[Note] Note

sudo(8) is needed to work with any file permissions of configuration data. For user configuration data, you may skip sudo.

[Note] Note

The "chmod 700 .git" command in the above example is needed to protect archive data from unauthorized read access.

[Tip] Tip

For more complete setup for recording configuration history, please look for the etckeeper package: Section 9.2.10, “Recording changes in configuration files”.

See the following.

  • cvs(1)

  • "/usr/share/doc/cvs/html-cvsclient"

  • "/usr/share/doc/cvs/html-info"

  • "/usr/share/doc/cvsbook"

  • "info cvs"

Many public CVS servers provide read-only remote access to them with account name "anonymous" via pserver service. For example, Debian web site contents are maintained by webwml project via CVS at Debian alioth service. The following sets up "$CVSROOT" for the remote access to this CVS repository.

$ export CVSROOT=:pserver:anonymous@anonscm.debian.org:/cvs/webwml
$ cvs login
[Note] Note

Since pserver is prone to eavesdropping attack and insecure, write access is usually disable by server administrators.

The following sets up "$CVS_RSH" and "$CVSROOT" for the remote access to the CVS repository by webwml project with SSH.

$ export CVS_RSH=ssh
$ export CVSROOT=:ext:account@cvs.alioth.debian.org:/cvs/webwml

You can also use public key authentication for SSH which eliminates the remote password prompt.

Here is an example of typical work flow using CVS.

Check all available modules from CVS project pointed by "$CVSROOT" by the following.

$ cvs rls
CVSROOT
module1
module2
...

Checkout "module1" to its default directory "./module1" by the following.

$ cd ~/path/to
$ cvs co module1
$ cd module1

Make changes to the content as needed.

Check changes by making "diff -u [repository] [local]" equivalent by the following.

$ cvs diff -u

You find that you broke some file "file_to_undo" severely but other files are fine.

Overwrite "file_to_undo" file with the clean copy from CVS by the following.

$ cvs up -C file_to_undo

Save the updated local source tree to CVS by the following.

$ cvs ci -m "Describe change"

Create and add "file_to_add" file to CVS by the following.

$ vi file_to_add
$ cvs add file_to_add
$ cvs ci -m "Added file_to_add"

Merge the latest version from CVS by the following.

$ cvs up -d

Watch out for lines starting with "C filename" which indicates conflicting changes.

Look for unmodified code in ".#filename.version".

Search for "<<<<<<<" and ">>>>>>>" in files for conflicting changes.

Edit files to fix conflicts as needed.

Add a release tag "Release-1" by the following.

$ cvs ci -m "last commit for Release-1"
$ cvs tag Release-1

Edit further.

Remove the release tag "Release-1" by the following.

$ cvs tag -d Release-1

Check in changes to CVS by the following.

$ cvs ci -m "real last commit for Release-1"

Re-add the release tag "Release-1" to updated CVS HEAD of main by the following.

$ cvs tag Release-1

Create a branch with a sticky branch tag "Release-initial-bugfixes" from the original version pointed by the tag "Release-initial" and check it out to "~/path/to/old" directory by the following.

$ cvs rtag -b -r Release-initial Release-initial-bugfixes module1
$ cd ~/path/to
$ cvs co -r Release-initial-bugfixes -d old module1
$ cd old
[Tip] Tip

Use "-D 2005-12-20" (ISO 8601 date format) instead of "-r Release-initial" to specify particular date as the branch point.

Work on this local source tree having the sticky tag "Release-initial-bugfixes" which is based on the original version.

Work on this branch by yourself … until someone else joins to this "Release-initial-bugfixes" branch.

Sync with files modified by others on this branch while creating new directories as needed by the following.

$ cvs up -d

Edit files to fix conflicts as needed.

Check in changes to CVS by the following.

$ cvs ci -m "checked into this branch"

Update the local tree by HEAD of main while removing sticky tag ("-A") and without keyword expansion ("-kk") by the following.

$ cvs up -d -kk -A

Update the local tree (content = HEAD of main) by merging from the "Release-initial-bugfixes" branch and without keyword expansion by the following.

$ cvs up -d -kk -j Release-initial-bugfixes

Fix conflicts with editor.

Check in changes to CVS by the following.

$ cvs ci -m "merged Release-initial-bugfixes"

Make archive by the following.

$ cd ..
$ mv old old-module1-bugfixes
$ tar -cvzf old-module1-bugfixes.tar.gz old-module1-bugfixes
$ rm -rf old-module1-bugfixes
[Tip] Tip

"cvs up" command can take "-d" option to create new directories and "-P" option to prune empty directories.

[Tip] Tip

You can checkout only a sub directory of "module1" by providing its name as "cvs co module1/subdir".


Subversion is a recent-generation version control system replacing older CVS. It has most of CVS's features except tags and branches.

You need to install subversion, libapache2-svn and subversion-tools packages to set up a Subversion server.

Here is an example of typical work flow using Subversion with its native client.

[Tip] Tip

Client commands offered by the git-svn package may offer alternative work flow of Subversion using the git command. See Section 10.6.4, “Git for the Subversion repository”.

Check all available modules from Subversion project pointed by URL "file:///srv/svn/project" by the following.

$ svn list file:///srv/svn/project
module1
module2
...

Checkout "module1/trunk" to a directory "module1" by the following.

$ cd ~/path/to
$ svn co file:///srv/svn/project/module1/trunk module1
$ cd module1

Make changes to the content as needed.

Check changes by making "diff -u [repository] [local]" equivalent by the following.

$ svn diff

You find that you broke some file "file_to_undo" severely but other files are fine.

Overwrite "file_to_undo" file with the clean copy from Subversion by the following.

$ svn revert file_to_undo

Save the updated local source tree to Subversion by the following.

$ svn ci -m "Describe change"

Create and add "file_to_add" file to Subversion by the following.

$ vi file_to_add
$ svn add file_to_add
$ svn ci -m "Added file_to_add"

Merge the latest version from Subversion by the following.

$ svn up

Watch out for lines starting with "C filename" which indicates conflicting changes.

Look for unmodified code in, e.g., "filename.r6", "filename.r9", and "filename.mine".

Search for "<<<<<<<" and ">>>>>>>" in files for conflicting changes.

Edit files to fix conflicts as needed.

Add a release tag "Release-1" by the following.

$ svn ci -m "last commit for Release-1"
$ svn cp file:///srv/svn/project/module1/trunk file:///srv/svn/project/module1/tags/Release-1

Edit further.

Remove the release tag "Release-1" by the following.

$ svn rm file:///srv/svn/project/module1/tags/Release-1

Check in changes to Subversion by the following.

$ svn ci -m "real last commit for Release-1"

Re-add the release tag "Release-1" from updated Subversion HEAD of trunk by the following.

$ svn cp file:///srv/svn/project/module1/trunk file:///srv/svn/project/module1/tags/Release-1

Create a branch with a path "module1/branches/Release-initial-bugfixes" from the original version pointed by the path "module1/tags/Release-initial" and check it out to "~/path/to/old" directory by the following.

$ svn cp file:///srv/svn/project/module1/tags/Release-initial file:///srv/svn/project/module1/branches/Release-initial-bugfixes
$ cd ~/path/to
$ svn co file:///srv/svn/project/module1/branches/Release-initial-bugfixes old
$ cd old
[Tip] Tip

Use "module1/trunk@{2005-12-20}" (ISO 8601 date format) instead of "module1/tags/Release-initial" to specify particular date as the branch point.

Work on this local source tree pointing to branch "Release-initial-bugfixes" which is based on the original version.

Work on this branch by yourself … until someone else joins to this "Release-initial-bugfixes" branch.

Sync with files modified by others on this branch by the following.

$ svn up

Edit files to fix conflicts as needed.

Check in changes to Subversion by the following.

$ svn ci -m "checked into this branch"

Update the local tree with HEAD of trunk by the following.

$ svn switch file:///srv/svn/project/module1/trunk

Update the local tree (content = HEAD of trunk) by merging from the "Release-initial-bugfixes" branch by the following.

$ svn merge file:///srv/svn/project/module1/branches/Release-initial-bugfixes

Fix conflicts with editor.

Check in changes to Subversion by the following.

$ svn ci -m "merged Release-initial-bugfixes"

Make archive by the following.

$ cd ..
$ mv old old-module1-bugfixes
$ tar -cvzf old-module1-bugfixes.tar.gz old-module1-bugfixes
$ rm -rf old-module1-bugfixes
[Tip] Tip

You can replace URLs such as "file:///…" by any other URL formats such as "http://…" and "svn+ssh://…".

[Tip] Tip

You can checkout only a sub directory of "module1" by providing its name as "svn co file:///srv/svn/project/module1/trunk/subdir module1/subdir", etc.