Chapter 10. Data management

Table of Contents

10.1. Sharing, copying, and archiving
10.1.1. Archive and compression tools
10.1.2. Copy and synchronization tools
10.1.3. Idioms for the archive
10.1.4. Idioms for the copy
10.1.5. Idioms for the selection of files
10.1.6. Archive media
10.1.7. Removable storage device
10.1.8. Filesystem choice for sharing data
10.1.9. Sharing data via network
10.2. Backup and recovery
10.2.1. Backup utility suites
10.2.2. Personal backup
10.3. Data security infrastructure
10.3.1. Key management for GnuPG
10.3.2. Using GnuPG on files
10.3.3. Using GnuPG with Mutt
10.3.4. Using GnuPG with Vim
10.3.5. The MD5 sum
10.4. Source code merge tools
10.4.1. Extracting differences for source files
10.4.2. Merging updates for source files
10.4.3. Interactive merge
10.5. Git
10.5.1. Configuration of Git client
10.5.2. Basic Git commands
10.5.3. Git tips
10.5.4. Git references
10.5.5. Other version control systems

Tools and tips for managing binary and text data on the Debian system are described.

[Warning] Warning

Do not let multiple processes write to actively accessed devices and files without coordination, since this causes race conditions. File locking mechanisms such as flock(1) may be used to avoid them.
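As a minimal sketch of such locking, the following serializes updates to a shared counter file with flock(1) from util-linux. The temporary file names and the "increment" helper are made up for this illustration.

```shell
# Hypothetical scratch files for the demonstration.
lockfile=$(mktemp)
counter=$(mktemp)
echo 0 > "$counter"

# Read-modify-write the counter while holding an exclusive lock on fd 9.
increment() {
    (
        flock -x 9
        n=$(cat "$counter")
        echo $((n + 1)) > "$counter"
    ) 9>"$lockfile"
}

# Start five concurrent writers and wait for all of them.
for i in 1 2 3 4 5; do
    increment &
done
wait

result=$(cat "$counter")
echo "final counter: $result"
```

Without the flock call, two writers could read the same value and one update would be lost.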

The security of the data and its controlled sharing have several aspects.

  • The creation of data archive

  • The remote storage access

  • The duplication

  • The tracking of the modification history

  • The facilitation of data sharing

  • The prevention of unauthorized file access

  • The detection of unauthorized file modification

These can be realized by using some combination of tools.

  • Archive and compression tools

  • Copy and synchronization tools

  • Network filesystems

  • Removable storage media

  • The secure shell

  • The authentication system

  • Version control system tools

  • Hash and cryptographic encryption tools

Here is a summary of archive and compression tools available on the Debian system.

Table 10.1. List of archive and compression tools

package | popcon | size | extension | command | comment
tar | V:914, I:999 | 3152 | .tar | tar(1) | the standard archiver (de facto standard)
cpio | V:489, I:998 | 1144 | .cpio | cpio(1) | Unix System V style archiver, use with find(1)
binutils | V:164, I:673 | 97 | .ar | ar(1) | archiver for the creation of static libraries
fastjar | V:2, I:25 | 183 | .jar | fastjar(1) | archiver for Java (zip like)
pax | V:12, I:24 | 170 | .pax | pax(1) | new POSIX standard archiver, compromise between tar and cpio
gzip | V:891, I:999 | 242 | .gz | gzip(1), zcat(1), … | GNU LZ77 compression utility (de facto standard)
bzip2 | V:147, I:973 | 122 | .bz2 | bzip2(1), bzcat(1), … | Burrows-Wheeler block-sorting compression utility with higher compression ratio than gzip(1) (slower than gzip with similar syntax)
lzma | V:2, I:27 | 149 | .lzma | lzma(1) | LZMA compression utility with higher compression ratio than gzip(1) (deprecated)
xz-utils | V:453, I:980 | 612 | .xz | xz(1), xzdec(1), … | XZ compression utility with higher compression ratio than bzip2(1) (slower than gzip but faster than bzip2; replacement for LZMA compression utility)
zstd | V:4, I:24 | 1902 | .zst | zstd(1), zstdcat(1), … | Zstandard fast lossless compression utility
p7zip | V:79, I:454 | 987 | .7z | 7zr(1), p7zip(1) | 7-Zip file archiver with high compression ratio (LZMA compression)
p7zip-full | V:102, I:469 | 4664 | .7z | 7z(1), 7za(1) | 7-Zip file archiver with high compression ratio (LZMA compression and others)
lzop | V:10, I:85 | 164 | .lzo | lzop(1) | LZO compression utility with higher compression and decompression speed than gzip(1) (lower compression ratio than gzip with similar syntax)
zip | V:49, I:427 | 623 | .zip | zip(1) | InfoZIP: DOS archive and compression tool
unzip | V:132, I:792 | 385 | .zip | unzip(1) | InfoZIP: DOS unarchive and decompression tool
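As a rough illustration of how an archiver and a compressor combine, the following sketch archives a scratch directory with tar(1) and gzip(1) and then lists the result. The directory and file names are invented for the example.

```shell
# Hypothetical scratch tree to archive.
work=$(mktemp -d)
mkdir "$work/source"
printf 'example data\n' > "$work/source/file.txt"

# Archive and compress in one step (-z filters the archive through gzip).
# With GNU tar, -j selects bzip2 and -J selects xz instead.
tar -C "$work" -czf "$work/source.tar.gz" source

# Equivalent two-stage form: tar writes to stdout, gzip compresses the stream.
tar -C "$work" -cf - source | gzip > "$work/source-piped.tar.gz"

# List the archive members without extracting them.
listing=$(tar -tzf "$work/source.tar.gz")
echo "$listing"
```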

[Warning] Warning

Do not set the "$TAPE" variable unless you know what to expect. It changes tar(1) behavior.

Here are several ways to copy the entire content of the directory "./source" using different tools.

  • Local copy: "./source" directory → "/dest" directory

  • Remote copy: "./source" directory at local host → "/dest" directory at "user@host.dom" host

rsync(1):

# cd ./source; rsync -aHAXSv . /dest
# cd ./source; rsync -aHAXSv . user@host.dom:/dest

You can alternatively use "a trailing slash on the source directory" syntax.

# rsync -aHAXSv ./source/ /dest
# rsync -aHAXSv ./source/ user@host.dom:/dest

Alternatively, use the following.

# cd ./source; find . -print0 | rsync -aHAXSv0 --files-from=- . /dest
# cd ./source; find . -print0 | rsync -aHAXSv0 --files-from=- . user@host.dom:/dest

GNU cp(1) and openSSH scp(1):

# cd ./source; cp -a . /dest
# cd ./source; scp -pr . user@host.dom:/dest

GNU tar(1):

# (cd ./source && tar cf - . ) | (cd /dest && tar xvfp - )
# (cd ./source && tar cf - . ) | ssh user@host.dom '(cd /dest && tar xvfp - )'

cpio(1):

# cd ./source; find . -print0 | cpio -pvdm --null --sparse /dest

You can substitute "." with "foo" for all examples containing "." to copy files from "./source/foo" directory to "/dest/foo" directory.

You can substitute "." with the absolute path "/path/to/source/foo" for all examples containing "." to drop "cd ./source;". These copy files to different locations depending on tools used as follows.

  • "/dest/foo": rsync(1), GNU cp(1), and scp(1)

  • "/dest/path/to/source/foo": GNU tar(1), and cpio(1)
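The difference in destination paths can be checked with a small experiment such as the following sketch. It uses temporary directories in place of "/path/to/source" and "/dest", and the variable names are invented for the example.

```shell
# Hypothetical stand-ins for /path/to/source and two separate /dest trees.
src=$(mktemp -d)
dest_cp=$(mktemp -d)
dest_tar=$(mktemp -d)
mkdir "$src/foo"
echo data > "$src/foo/file.txt"

# cp keeps only the last path component: result is $dest_cp/foo
cp -a "$src/foo" "$dest_cp"

# tar stores the whole path (minus the leading "/"), so the result is
# $dest_tar/<path of $src>/foo. The warning about the stripped "/" is hidden.
tar cf - "$src/foo" 2>/dev/null | (cd "$dest_tar" && tar xpf -)

find "$dest_cp" "$dest_tar" -name file.txt
```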

[Tip] Tip

rsync(1) and GNU cp(1) have option "-u" to skip files that are newer on the receiver.

find(1) is used to select files for archive and copy commands (see Section 10.1.3, “Idioms for the archive” and Section 10.1.4, “Idioms for the copy”) or for xargs(1) (see Section 9.4.9, “Repeating a command looping over files”). This can be enhanced by using its command arguments.

Basic syntax of find(1) can be summarized as the following.

  • Its conditional arguments are evaluated from left to right.

  • This evaluation stops once its outcome is determined.

  • "Logical OR" (specified by "-o" between conditionals) has lower precedence than "logical AND" (specified by "-a" or nothing between conditionals).

  • "Logical NOT" (specified by "!" before a conditional) has higher precedence than "logical AND".

  • "-prune" always returns logical TRUE and, if the file is a directory, find(1) does not descend into it.

  • "-name" matches the base of the filename with shell glob (see Section 1.5.6, “Shell glob”) but it also matches its initial "." with metacharacters such as "*" and "?". (New POSIX feature)

  • "-regex" matches the full path with emacs style BRE (see Section 1.6.2, “Regular expressions”) as default.

  • "-size" matches the file based on the file size (value preceded by "+" for larger, preceded by "-" for smaller)

  • "-newer" matches the file newer than the one specified in its argument.

  • "-print0" always returns logical TRUE and prints the full filename (null terminated) on the standard output.

find(1) is often used with an idiomatic style as the following.

# find /path/to \
    -xdev -regextype posix-extended \
    -type f -regex ".*\.cpio|.*~" -prune -o \
    -type d -regex ".*/\.git" -prune -o \
    -type f -size +99M -prune -o \
    -type f -newer /path/to/timestamp -print0

This performs the following actions.

  1. Search all files starting from "/path/to"

  2. Globally limit the search to the starting filesystem and use ERE (see Section 1.6.2, “Regular expressions”) instead of the default emacs style BRE

  3. Exclude files matching the regex ".*\.cpio" or ".*~" from the search by stopping processing

  4. Exclude directories matching the regex ".*/\.git" from the search by stopping processing

  5. Exclude files larger than 99 Megabytes (units of 1048576 bytes) from the search by stopping processing

  6. Print filenames which satisfy the above search conditions and are newer than "/path/to/timestamp"

Please note the idiomatic use of "-prune -o" to exclude files in the above example.
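The idiom can be tried safely on a scratch tree, as in the following sketch. It assumes GNU find(1) and GNU touch(1) as shipped with Debian, and all file names are invented for the example.

```shell
# Hypothetical scratch tree with one file of each kind.
work=$(mktemp -d)
mkdir -p "$work/src/.git"
touch -d '2020-01-01' "$work/timestamp"     # reference timestamp
touch -d '2019-01-01' "$work/src/old.txt"   # older than the timestamp
: > "$work/src/new.txt"                     # newer: should be selected
: > "$work/src/new.cpio"                    # excluded by regex
: > "$work/src/new.txt~"                    # excluded by regex
: > "$work/src/.git/config"                 # inside a pruned directory

# Same structure as the idiom above; NULs become newlines for display only.
selected=$(find "$work/src" \
    -xdev -regextype posix-extended \
    -type f -regex ".*\.cpio|.*~" -prune -o \
    -type d -regex ".*/\.git" -prune -o \
    -type f -size +99M -prune -o \
    -type f -newer "$work/timestamp" -print0 | tr '\0' '\n')
echo "$selected"
```

Only "new.txt" should be reported, since every other file is either pruned or older than the timestamp file.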

[Note] Note

On non-Debian Unix-like systems, some options may not be supported by find(1). In such a case, consider adjusting the matching methods and replacing "-print0" with "-print". You may need to adjust related commands too.

When choosing computer data storage media for important data archives, you should be careful about their limitations. For small personal data backups, I use CD-R and DVD-R media from brand-name companies and store them in a cool, shaded, dry, clean environment. (Tape archive media seem to be popular for professional use.)

[Note] Note

Fire-resistant safes are meant for paper documents. Most computer data storage media have less temperature tolerance than paper. I usually rely on multiple secure encrypted copies stored in multiple secure locations.

Optimistic storage life of archive media seen on the net (mostly from vendor info).

  • 100+ years : Acid free paper with ink

  • 100 years : Optical storage (CD/DVD, CD/DVD-R)

  • 30 years : Magnetic storage (tape, floppy)

  • 20 years : Phase change optical storage (CD-RW)

These figures do not account for mechanical failures due to handling etc.

Optimistic write cycle of archive media seen on the net (mostly from vendor info).

  • 250,000+ cycles : Harddisk drive

  • 10,000+ cycles : Flash memory

  • 1,000 cycles : CD/DVD-RW

  • 1 cycle : CD/DVD-R, paper

[Caution] Caution

The storage life and write cycle figures here should not be used for decisions on any critical data storage. Please consult the specific product information provided by the manufacturer.

[Tip] Tip

Since CD/DVD-R and paper have only 1 write cycle, they inherently prevent accidental data loss by overwriting. This is an advantage!

[Tip] Tip

If you need fast and frequent backups of a large amount of data, a hard disk on a remote host linked by a fast network connection may be the only realistic option.

Removable storage devices may be any one of the following.

They may be connected via any one of the following.

Modern desktop environments such as GNOME and KDE can mount these removable devices automatically without a matching "/etc/fstab" entry.

  • The udisks package provides a daemon and associated utilities to mount and unmount these devices.

  • D-Bus creates events to initiate automatic processes.

  • PolicyKit provides the required privileges.

[Tip] Tip

Automounted devices may have the "uhelper=" mount option which is used by umount(8).

[Tip] Tip

Automounting under a modern desktop environment happens only when those removable media devices are not listed in "/etc/fstab".

The mount point under a modern desktop environment is chosen as "/media/disk_label", where the disk label can be customized by the following.

  • mlabel(1) for FAT filesystem

  • genisoimage(1) with "-V" option for ISO9660 filesystem

  • tune2fs(8) with "-L" option for ext2/ext3/ext4 filesystem

[Tip] Tip

The choice of encoding may need to be provided as mount option (see Section 8.1.3, “Filename encoding”).

[Tip] Tip

The use of the GUI menu to unmount a filesystem may remove its dynamically generated device node such as "/dev/sdc". If you wish to keep its device node, unmount it with the umount(8) command from the shell prompt.

When sharing data with other systems via a removable storage device, you should format it with a common filesystem supported by both systems. Here is a list of filesystem choices.


[Tip] Tip

See Section 9.9.1, “Removable disk encryption with dm-crypt/LUKS” for cross platform sharing of data using device level encryption.

The FAT filesystem is supported by almost all modern operating systems and is quite useful for the data exchange purpose via removable hard disk like media.

When formatting removable hard disk like devices for cross platform sharing of data with the FAT filesystem, the following should be safe choices.

When using the FAT or ISO9660 filesystems for sharing data, the following are safe practices.

  • Archive files into an archive file first using tar(1) or cpio(1) to retain long filenames, symbolic links, the original Unix file permissions, and the owner information.

  • Split the archive file into chunks smaller than 2 GiB with the split(1) command to protect it from the file size limitation.

  • Encrypt the archive file to secure its contents from unauthorized access.
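The archive-and-split steps might look like the following sketch. It uses a tiny 100 KiB chunk size and scratch directories so that it runs quickly. For real media a chunk size just under 2 GiB would be used, and the encryption step is left out here. All names are invented for the example.

```shell
# Hypothetical scratch tree with a regular file and a symbolic link.
work=$(mktemp -d)
mkdir "$work/data" "$work/chunks" "$work/restore"
head -c 300000 /dev/urandom > "$work/data/blob.bin"
ln -s blob.bin "$work/data/link-to-blob"

# Archive to stdout and split the stream into fixed-size chunks.
tar -C "$work" -cf - data | split -b 100k - "$work/chunks/data.tar.part-"

# On the receiving side, concatenate the chunks in order and extract.
cat "$work/chunks"/data.tar.part-* | tar -C "$work/restore" -xpf -

chunk_count=$(ls "$work/chunks" | wc -l)
echo "chunks written: $chunk_count"
```

The symbolic link survives the round trip because it travels inside the tar archive rather than as a file on the FAT filesystem.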

[Note] Note

For FAT filesystems by its design, the maximum file size is (2^32 - 1) bytes = (4GiB - 1 byte). For some applications on the older 32 bit OS, the maximum file size was even smaller (2^31 - 1) bytes = (2GiB - 1 byte). Debian does not suffer the latter problem.

[Note] Note

Microsoft itself does not recommend using FAT for drives or partitions of over 200 MB. Microsoft highlights its shortcomings, such as inefficient disk space usage, in its "Overview of FAT, HPFS, and NTFS File Systems". Of course, we should normally use the ext4 filesystem for Linux.

[Tip] Tip

For more on filesystems and accessing filesystems, please read "Filesystems HOWTO".

We all know that computers fail sometimes and that human errors cause system and data damage. Backup and recovery operations are an essential part of successful system administration. Every possible failure mode will hit you some day.

[Tip] Tip

Keep your backup system simple and back up your system often. Having backup data is more important than how technically good your backup method is.

There are 3 key factors which determine actual backup and recovery policy.

  1. Knowing what to back up and recover.

    • Data files directly created by you: data in "~/"

    • Data files created by applications used by you: data in "/var/" (except "/var/cache/", "/var/run/", and "/var/tmp/")

    • System configuration files: data in "/etc/"

    • Local software: data in "/usr/local/" or "/opt/"

    • System installation information: a memo in plain text on key steps (partition, …)

    • Proven set of data: confirmed by experimental recovery operations in advance

  2. Knowing how to back up and recover.

    • Secure storage of data: protection from overwrite and system failure

    • Frequent backup: scheduled backup

    • Redundant backup: data mirroring

    • Foolproof process: easy single command backup

  3. Assessing risks and costs involved.

    • Value of data when lost

    • Required resources for backup: human, hardware, software, …

    • Failure modes and their probability
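A "single command backup" could be as simple as the following sketch. The "backup_dir" function, the archive naming scheme, and the excluded ".cache" directory are assumptions made for this illustration, not a recommended policy.

```shell
# Hypothetical helper: archive directory $1 into a dated tarball under $2,
# skipping a cache directory, and print the archive path on success.
backup_dir() {
    src=$1
    dest=$2
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$dest/backup-$stamp.tar.gz"
    tar -C "$src" --exclude='./.cache' -czf "$archive" . && echo "$archive"
}

# Demonstration on scratch directories instead of a real home directory.
home=$(mktemp -d)
store=$(mktemp -d)
mkdir "$home/.cache"
echo note > "$home/memo.txt"
echo junk > "$home/.cache/junk"

archive=$(backup_dir "$home" "$store")
contents=$(tar -tzf "$archive")
echo "$contents"
```

Keeping the destination outside the source tree avoids archiving the backup into itself.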

[Note] Note

Do not back up the pseudo-filesystem contents found on /proc, /sys, /tmp, and /run (see Section 1.2.12, “procfs and sysfs” and Section 1.2.13, “tmpfs”). Unless you know exactly what you are doing, they are a huge amount of useless data.

As for secure storage of data, data should be at least on different disk partitions, preferably on different disks and machines, to withstand filesystem corruption. Important data are best stored on write-once media such as CD/DVD-R to prevent overwrite accidents. (See Section 9.8, “The binary data” for how to write to the storage media from the shell commandline. GNOME desktop GUI environment gives you easy access via menu: "Places→CD/DVD Creator".)

[Note] Note

You may wish to stop some application daemons such as MTA (see Section 6.2.4, “Mail transport agent (MTA)”) while backing up data.

[Note] Note

You should take extra care with the backup and restoration of identity related data files such as "/etc/ssh/ssh_host_dsa_key", "/etc/ssh/ssh_host_rsa_key", "~/.gnupg/*", "~/.ssh/*", "/etc/passwd", "/etc/shadow", "/etc/fetchmailrc", "popularity-contest.conf", "/etc/ppp/pap-secrets", and "/etc/exim4/passwd.client". Some of these data cannot be regenerated by entering the same input string to the system.

[Note] Note

If you run a cron job as a user process, you must restore files in "/var/spool/cron/crontabs" directory and restart cron(8). See Section 9.4.14, “Scheduling tasks regularly” for cron(8) and crontab(1).

Here is a select list of notable backup utility suites available on the Debian system.

Table 10.5. List of backup suite utilities

package | popcon | size | description
dump | V:1, I:6 | 352 | 4.4 BSD dump(8) and restore(8) for ext2/ext3/ext4 filesystems
xfsdump | V:0, I:9 | 854 | dump and restore with xfsdump(8) and xfsrestore(8) for XFS filesystem on GNU/Linux and IRIX
backupninja | V:3, I:4 | 367 | lightweight, extensible meta-backup system
bacula-common | V:10, I:14 | 2158 | Bacula: network backup, recovery and verification - common support files
bacula-client | I:3 | 183 | Bacula: network backup, recovery and verification - client meta-package
bacula-console | V:1, I:4 | 107 | Bacula: network backup, recovery and verification - text console
bacula-server | I:1 | 183 | Bacula: network backup, recovery and verification - server meta-package
amanda-common | V:1, I:2 | 10030 | Amanda: Advanced Maryland Automatic Network Disk Archiver (Libs)
amanda-client | V:1, I:2 | 1088 | Amanda: Advanced Maryland Automatic Network Disk Archiver (Client)
amanda-server | V:0, I:0 | 1075 | Amanda: Advanced Maryland Automatic Network Disk Archiver (Server)
backup-manager | V:1, I:1 | 571 | command-line backup tool
backup2l | V:0, I:1 | 114 | low-maintenance backup/restore tool for mountable media (disk based)
backuppc | V:3, I:3 | 3183 | BackupPC is a high-performance, enterprise-grade system for backing up PCs (disk based)
duplicity | V:9, I:19 | 1834 | (remote) incremental backup
flexbackup | V:0, I:0 | 243 | (remote) incremental backup
rdiff-backup | V:6, I:14 | 733 | (remote) incremental backup
restic | V:1, I:4 | 22540 | (remote) incremental backup
slbackup | V:0, I:0 | 151 | (remote) incremental backup

Backup tools have their specialized focuses.

  • Mondo Rescue is a backup system to facilitate restoration of complete system quickly from backup CD/DVD etc. without going through normal system installation processes.

  • Bacula, Amanda, and BackupPC are full featured backup suite utilities which are focused on regular backups over network.

  • Regular backups of user data can be realized by a simple script (Section 10.2.2, “Personal backup”).

Basic tools described in Section 10.1.1, “Archive and compression tools” and Section 10.1.2, “Copy and synchronization tools” can be used to facilitate system backup via custom scripts. Such scripts can be enhanced by the following.

  • The restic package enables incremental (remote) backups.

  • The rdiff-backup package enables incremental (remote) backups.

  • The dump package helps to archive and restore the whole filesystem incrementally and efficiently.

[Tip] Tip

See files in "/usr/share/doc/dump/" and "Is dump really deprecated?" to learn about the dump package.

The data security infrastructure is provided by the combination of data encryption tool, message digest tool, and signature tool.


See Section 9.9, “Data encryption tips” on dm-crypt and ecryptfs which implement automatic data encryption infrastructure via Linux kernel modules.

Here are GNU Privacy Guard commands for the basic key management.


Here is the meaning of the trust code.


The following uploads my key "1DD8D791" to the popular keyserver "hkp://keys.gnupg.net".

$ gpg --keyserver hkp://keys.gnupg.net --send-keys 1DD8D791

A good default keyserver set up in "~/.gnupg/gpg.conf" (or old location "~/.gnupg/options") contains the following.

keyserver hkp://keys.gnupg.net

The following obtains unknown keys from the keyserver.

$ gpg --list-sigs --with-colons | grep '^sig.*\[User ID not found\]' |\
  cut -d ':' -f 5| sort | uniq | xargs gpg --recv-keys

There was a bug in the OpenPGP Public Key Server (before version 0.9.6) which corrupted keys with more than 2 subkeys. The newer gnupg (>1.2.1-2) package can handle these corrupted subkeys. See gpg(1) under the "--repair-pks-subkey-bug" option.

md5sum(1) provides a utility to make a digest file using the method in RFC 1321 and to verify each file with it.

$ md5sum foo bar >baz.md5
$ cat baz.md5
d3b07384d113edec49eaa6238ad5ff00  foo
c157a79031e1c40f85931829bc5fc552  bar
$ md5sum -c baz.md5
foo: OK
bar: OK

[Note] Note

The computation for the MD5 sum is less CPU intensive than the one for the cryptographic signature by GNU Privacy Guard (GnuPG). Usually, only the top level digest file is cryptographically signed to ensure data integrity.
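To see how the digest file detects a modification, one can alter a file after the digest is taken, as in this sketch. It assumes GNU md5sum(1) with its "--quiet" option, and the working directory and variable names are invented for the example.

```shell
# Work in a hypothetical scratch directory.
work=$(mktemp -d)
cd "$work"
echo foo > foo
echo bar > bar
md5sum foo bar > baz.md5

# Verification succeeds while the files are untouched.
md5sum -c --quiet baz.md5 && first_check=ok

# After modifying one file, verification reports a failure for it.
echo tampered >> bar
if md5sum -c --quiet baz.md5 2>/dev/null; then
    second_check=ok
else
    second_check=failed
fi
echo "before: $first_check, after: $second_check"
```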

There are many merge tools for source code. The following commands caught my eye.

Table 10.10. List of source code merge tools

package | popcon | size | command | description
patch | V:123, I:721 | 248 | patch(1) | apply a diff file to an original
vim | V:102, I:404 | 3286 | vimdiff(1) | compare 2 files side by side in vim
imediff | V:0, I:0 | 170 | imediff(1) | interactive full screen 2/3-way merge tool
meld | V:14, I:38 | 3065 | meld(1) | compare and merge files (GTK)
wiggle | V:0, I:0 | 174 | wiggle(1) | apply rejected patches
diffutils | V:883, I:993 | 1598 | diff(1) | compare files line by line
diffutils | V:883, I:993 | 1598 | diff3(1) | compare and merge three files line by line
quilt | V:3, I:32 | 788 | quilt(1) | manage series of patches
wdiff | V:8, I:69 | 644 | wdiff(1) | display word differences between text files
diffstat | V:14, I:146 | 81 | diffstat(1) | produce a histogram of changes by the diff
patchutils | V:16, I:143 | 232 | combinediff(1) | create a cumulative patch from two incremental patches
patchutils | V:16, I:143 | 232 | dehtmldiff(1) | extract a diff from an HTML page
patchutils | V:16, I:143 | 232 | filterdiff(1) | extract or exclude diffs from a diff file
patchutils | V:16, I:143 | 232 | fixcvsdiff(1) | fix diff files created by CVS that patch(1) mis-interprets
patchutils | V:16, I:143 | 232 | flipdiff(1) | exchange the order of two patches
patchutils | V:16, I:143 | 232 | grepdiff(1) | show which files are modified by a patch matching a regex
patchutils | V:16, I:143 | 232 | interdiff(1) | show differences between two unified diff files
patchutils | V:16, I:143 | 232 | lsdiff(1) | show which files are modified by a patch
patchutils | V:16, I:143 | 232 | recountdiff(1) | recompute counts and offsets in unified context diffs
patchutils | V:16, I:143 | 232 | rediff(1) | fix offsets and counts of a hand-edited diff
patchutils | V:16, I:143 | 232 | splitdiff(1) | separate out incremental patches
patchutils | V:16, I:143 | 232 | unwrapdiff(1) | demangle patches that have been word-wrapped
dirdiff | V:0, I:2 | 166 | dirdiff(1) | display differences and merge changes between directory trees
docdiff | V:0, I:0 | 555 | docdiff(1) | compare two files word by word / char by char
makepatch | V:0, I:0 | 102 | makepatch(1) | generate extended patch files
makepatch | V:0, I:0 | 102 | applypatch(1) | apply extended patch files
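The two most basic tools in this list, diff(1) and patch(1), can be exercised as in the following sketch. The file names are invented for the example.

```shell
# Hypothetical old and new versions of a file in a scratch directory.
work=$(mktemp -d)
cd "$work"
printf 'line 1\nline 2\nline 3\n' > file.old
printf 'line 1\nline two\nline 3\n' > file.new

# Extract the differences as a unified diff.
# diff exits with status 1 when the files differ, which is not an error here.
diff -u file.old file.new > file.patch || true

# Apply the diff to a copy of the old file; the copy should then match file.new.
cp file.old file.patched
patch -s file.patched < file.patch

cmp file.patched file.new && echo "patched copy matches file.new"
```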

Git is the tool of choice these days for the version control system (VCS) since Git can do everything for both local and remote source code management.

Debian provides free Git services via Debian Salsa service. Its documentation can be found at https://wiki.debian.org/Salsa .

Here are some Git related packages.


Git operations involve several kinds of data.

  • The working tree, which holds user-facing files that you edit.

    • The changes to be recorded must be explicitly selected and staged to the index. This is done with the git add and git rm commands.

  • The index, which holds staged files.

    • Staged files are committed to the local repository upon the subsequent request. This is done with the git commit command.

  • The local repository, which holds committed files.

    • Git records the linked history of the committed data and organizes it as branches in the repository.

    • The local repository can send data to the remote repository with the git push command.

    • The local repository can receive data from the remote repository with the git fetch and git pull commands.

      • The git pull command performs the git merge or git rebase command after the git fetch command.

      • Here, git merge joins the ends of two separate branches of history at a single point. (This is the default of git pull without customization and may be good for upstream people who publish a branch to many people.)

      • Here, git rebase creates one single branch of sequential history: the remote branch history followed by the local branch history. (This is the "pull.rebase true" customization case and may be good for the rest of us.)

  • The remote repository which holds committed files.

    • The communication to the remote repository uses secure communication protocols such as SSH or HTTPS.

The working tree consists of the files outside of the .git/ directory. Files inside of the .git/ directory hold the index, the local repository data, and some Git configuration text files.
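The flow from working tree to index to local repository to remote repository can be traced with the following sketch. It uses a local bare repository as a stand-in for a real remote reached over SSH or HTTPS. The paths, user name, and e-mail address are placeholders.

```shell
# Hypothetical "remote" (a bare repository) and a local repository.
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git init -q "$work/local"
cd "$work/local"
git config user.name "Example User"
git config user.email "user@example.org"

# Working tree -> index -> local repository.
echo hello > file.txt
git add file.txt
git commit -q -m "Add file.txt"

# Local repository -> remote repository.
git remote add origin "$work/remote.git"
branch=$(git symbolic-ref --short HEAD)
git push -q -u origin "$branch"

# Both repositories should now point at the same commit.
local_head=$(git rev-parse HEAD)
remote_head=$(git --git-dir="$work/remote.git" rev-parse "$branch")
echo "local: $local_head"
echo "remote: $remote_head"
```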

Here is an overview of main Git commands.


Here are some Git tips.

Table 10.13. Git tips

Git command line | function
gitk --all | see complete Git history and operate on it, such as resetting HEAD to another commit, cherry-picking patches, creating tags and branches ...
git stash | get the clean working tree without losing data
git remote -v | check settings for remote
git branch -vv | check settings for branch
git status | show working tree status
git config -l | list git settings
git reset --hard HEAD; git clean -x -d -f | revert all working tree changes and clean them up completely
git rm --cached filename | revert staged index changed by git add filename
git reflog | get reference log (useful for recovering commits from the removed branch)
git branch new_branch_name HEAD@{6} | create a new branch from reflog information
git remote add new_remote URL | add a new_remote remote repository pointed by URL
git remote rename origin upstream | rename the remote repository name from origin to upstream
git branch -u upstream/branch_name | set the remote tracking to the remote repository upstream and its branch name branch_name
git remote set-url origin https://foo/bar.git | change URL of origin
git remote set-url --push upstream DISABLED | disable push to upstream (Edit .git/config to re-enable)
git checkout -b topic_branch ; git push -u origin topic_branch | make a new topic_branch and push it to origin
git branch -m oldname newname | rename local branch name
git push -d origin branch_to_be_removed | remove remote branch (new method)
git push origin :branch_to_be_removed | remove remote branch (old method)
git checkout --orphan unconnected | create a new unconnected branch
git rebase -i origin/main | reorder/drop/squash commits from origin/main to clean branch history
git reset --soft HEAD^; git commit --amend | squash last 2 commits into one
git checkout main ; git merge --squash topic_branch | squash entire topic_branch into a single set of staged changes on main, ready to commit
git ime | split the last commit into a series of file-by-file smaller commits etc. (imediff package required)
git repack -a -d; git prune | repack the local repository into single pack (this may limit chance of lost data recovery from erased branch etc.)
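As one worked example from these tips, the following sketch squashes the last two commits of a scratch repository into one. It uses "git reset --soft" so that the changes of the dropped commit stay staged for "git commit --amend" to pick up. The repository path, file names, and identity are placeholders.

```shell
# Hypothetical scratch repository with three commits.
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Example User"
git config user.email "user@example.org"
for n in 1 2 3; do
    echo "content $n" > "file$n.txt"
    git add "file$n.txt"
    git commit -q -m "Add file$n.txt"
done

# Move HEAD back one commit but keep the third commit's changes staged,
# then fold them into the second commit.
git reset -q --soft HEAD^
git commit -q --amend -m "Add file2.txt and file3.txt"

commit_count=$(git rev-list --count HEAD)
echo "commits after squash: $commit_count"
git ls-files
```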

[Warning] Warning

Do not use the tag string with spaces in it even if some tools such as gitk(1) allow you to use it. It may choke some other git commands.

[Caution] Caution

If a local branch which has been pushed to the remote repository is rebased or squashed, pushing this branch again has risks and requires the --force option. This is usually not acceptable for the main branch but may be acceptable for a topic branch before merging it to the main branch.

[Caution] Caution

Invoking a git subcommand directly as "git-xyz" from the command line has been deprecated since early 2006.

[Tip] Tip

If there is an executable file git-foo in a directory listed in $PATH, entering "git foo" without the hyphen on the command line invokes this git-foo. This is a feature of the git command.
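This can be tried with a throwaway script, as in the following sketch. The "git-hello" name and the temporary directory are invented for the example, and the directory must allow executing files.

```shell
# Hypothetical directory holding a custom git subcommand.
bindir=$(mktemp -d)
cat > "$bindir/git-hello" <<'EOF'
#!/bin/sh
echo "hello from git-hello"
EOF
chmod +x "$bindir/git-hello"

# Put the directory on $PATH, then call the script as a git subcommand.
PATH="$bindir:$PATH"
output=$(git hello)
echo "$output"
```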

See the following.



[4] If you use "~/.vimrc" instead of "~/.vim/vimrc", please substitute accordingly.