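Chapter 12. Advanced Administration

This chapter revisits some aspects we already described, from a different perspective: instead of installing a single computer, we will study mass-deployment systems; instead of creating RAID or LVM volumes at install time, we will learn to do it by hand, so that the initial choices can be revised later. Finally, we will discuss monitoring tools and virtualization techniques. As a consequence, this chapter targets more experienced administrators rather than individuals responsible for their home network.

12.1. RAID and LVM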

Chapter 4, “Installation”, presented these technologies from the point of view of the installer, and how it integrated them to make their deployment easy from the start. After the initial installation, an administrator must be able to handle evolving storage space needs without having to resort to an expensive re-installation. They must therefore understand the required tools for manipulating RAID and LVM volumes.

RAID and LVM are both techniques to abstract the mounted volumes from their physical counterparts (actual hard-disk drives or partitions thereof); the former ensures the security and availability of the data in case of hardware failure by introducing redundancy, the latter makes volume management more flexible and independent of the actual size of the underlying disks. In both cases, the system ends up with new block devices, which can be used to create filesystems or swap space, without necessarily having them mapped to one physical disk. RAID and LVM come from quite different backgrounds, but their functionality can overlap somewhat, which is why they are often mentioned together.

PERSPECTIVE Btrfs combines LVM and RAID

While LVM and RAID are two distinct kernel subsystems that come between the disk block devices and their filesystems, btrfs is a filesystem, initially developed at Oracle, that purports to combine the feature sets of LVM and RAID and much more.

→ https://btrfs.wiki.kernel.org/

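Among the noteworthy features of btrfs is the ability to take a snapshot of a filesystem tree at any point in time. This snapshot copy doesn't initially use any disk space; data are only duplicated when one of the copies is modified. The filesystem also handles transparent compression of files, and checksums ensure the integrity of all stored data.

In both the RAID and LVM cases, the kernel provides a block device file, similar to the ones corresponding to a hard disk drive or a partition. When an application, or another part of the kernel, requests access to a block of such a device, the appropriate subsystem routes the request to the relevant block of the physical layer. Depending on the configuration, a block seen by the application can be stored on one or several physical disks, and its physical location may not be directly correlated to the position of the block in the logical device.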
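
To make the idea of routing a logical block to a physical one more concrete, here is a minimal sketch of two simple mapping policies: end-to-end concatenation and round-robin striping. The function names and the tiny disk sizes are illustrative assumptions; the kernel's real mapping works on larger chunks and is considerably more involved.

```python
def map_linear(block, disk_sizes):
    """Concatenation: fill disk 0 completely, then disk 1, and so on."""
    for disk, size in enumerate(disk_sizes):
        if block < size:
            return disk, block
        block -= size
    raise IndexError("block is beyond the end of the logical device")


def map_striped(block, n_disks):
    """Round-robin striping: consecutive blocks alternate between disks."""
    return block % n_disks, block // n_disks


# Two hypothetical disks of 4 blocks each, exposed as one 8-block device.
for b in range(8):
    print(b, map_linear(b, [4, 4]), map_striped(b, 2))
```

Each result is a (disk index, block index on that disk) pair. With striping, neighbouring logical blocks land on different disks, which is why a logical position says little about the physical one.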

12.1.1. Software RAID

RAID means Redundant Array of Independent Disks. The goal of this system is to prevent data loss and ensure availability in case of hard disk failure. The general principle is quite simple: data are stored on several physical disks instead of only one, with a configurable level of redundancy. Depending on this amount of redundancy, and even in the event of an unexpected disk failure, data can be losslessly reconstructed from the remaining disks.

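RAID can be implemented either by dedicated hardware (RAID modules integrated into SCSI or SATA controller cards) or by software abstraction (the kernel). Whether hardware or software, a RAID system with enough redundancy can transparently stay operational when a disk fails; the upper layers of the stack (applications) can keep accessing the data in spite of the failure. Of course, this “degraded mode” can have an impact on performance, and redundancy is reduced, so a further disk failure can lead to data loss. In practice, therefore, administrators strive to stay in degraded mode only for as long as it takes to replace the failed disk. Once the new disk is in place, the RAID system can reconstruct the required data and return to a safe mode. Applications won't notice anything, apart from potentially reduced access speed, while the array is in degraded mode or during the reconstruction phase.

When RAID is implemented by hardware, its configuration generally happens within the BIOS setup tool, and the kernel sees the RAID array as a single disk that works like a standard physical disk, although the device name may be different (depending on the driver).

We only focus on software RAID in this book.

12.1.1.1. Different RAID Levels

RAID is actually not a single system, but a range of systems identified by their levels; the levels differ by their layout and the amount of redundancy they provide. The more redundant, the more failure-proof, since the system will be able to keep working with more failed disks. The counterpart is that the usable space shrinks for a given set of disks; seen the other way, more disk space is needed to store a given amount of data.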

Linear RAID: Even though the kernel's RAID subsystem allows creating “linear RAID”, this is not proper RAID, since this setup doesn't involve any redundancy. The kernel merely aggregates several disks end-to-end and provides the resulting aggregated volume as one virtual disk (one block device). That is about its only function. This setup is rarely used by itself (see later for the exceptions), especially since the lack of redundancy means that one disk failing makes the whole aggregate, and therefore all the data, unavailable.
RAID-0: This level doesn't provide any redundancy either, but the disks aren't simply stuck end to end: they are divided into stripes, and the blocks of the virtual device are stored on stripes on alternating physical disks. In a two-disk RAID-0 setup, for instance, even-numbered blocks of the virtual device are stored on the first physical disk, while odd-numbered blocks end up on the second physical disk.
This system doesn't aim at increasing reliability, since (as in the linear case) the availability of all the data is jeopardized as soon as one disk fails, but at increasing performance: during sequential access to large amounts of contiguous data, the kernel will be able to read from both disks (or write to them) in parallel, which increases the data transfer rate. The disks are utilized entirely by the RAID device, so they should have the same size not to lose performance.
RAID-0 use is shrinking, its niche being filled by LVM (see later).
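RAID-1: This level, also known as “RAID mirroring”, is both the simplest and the most widely used setup. In its standard form, it uses two physical disks of the same size and provides a logical volume of that same size. Data are stored identically on both disks, hence the “mirror” nickname. When one disk fails, the data are still available on the other. For really critical data, RAID-1 can of course be set up on more than two disks, with a direct impact on the ratio of hardware cost versus available storage space.
NOTE Disks and cluster sizes
If two disks of different sizes are set up in a mirror, the bigger one will not be fully used, since it will contain the same data as the smallest one and nothing more. The usable space provided by a RAID-1 volume therefore matches the size of the smallest disk in the array. This still holds for RAID volumes with a higher RAID level, even though redundancy is stored differently.
It is therefore important, when setting up RAID arrays (except for RAID-0 and “linear RAID”), to only assemble disks of identical, or very close, sizes, to avoid wasting resources.
NOTE Spare disks
RAID levels that include redundancy allow assigning more disks than required to an array. The extra disks are used as spares when one of the main disks fails. For instance, in a mirror of two disks plus one spare, if one of the first two disks fails, the kernel automatically (and immediately) reconstructs the mirror using the spare disk, so that redundancy is restored once the reconstruction is complete. Spare disks can thus be used as another kind of safeguard for critical data.
One would be forgiven for wondering how this is better than simply mirroring on three disks to start with. The advantage of the “spare disk” configuration is that the spare disk can be shared across several RAID volumes. For instance, one can have three mirrored volumes, with redundancy assured even in the event of one disk failure, with only seven disks (three pairs, plus one shared spare), instead of the nine disks that would be required by three triplets.
This RAID level, although expensive (since only half of the physical storage space, at best, is usable), is widely used in practice. It is simple to understand, and it allows very simple backups: since both disks have identical contents, one of them can be temporarily extracted with no impact on the working system. Read performance is often increased, since the kernel can read half of the data from each disk in parallel, while write performance isn't too severely degraded. In a RAID-1 array of N disks, the data stay available even with N-1 disk failures.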
CAUTION RAID is not Backup
RAID systems are not backup mechanisms. While RAID increases the redundancy - and therefore the availability of a system - and protects against disk failures, backups are done to protect data from being altered, deleted, getting corrupted, etc., and to be able to restore them if necessary. To demonstrate this: If you remove one or all files by accident, a RAID will mirror this change, but it will not provide the means to restore the file(s). So while there is clearly an overlap, they are not the same and should be used in conjunction with each other.
RAID-4: This RAID level, not widely used, uses N disks to store useful data, and an extra “parity” disk to store redundancy information. If the parity disk fails, the system can reconstruct its contents from the other N disks. If one of the N data disks fails, the remaining N-1 disks combined with the parity disk contain enough information to reconstruct the required data.
RAID-4 isn't too expensive, since it only involves a one-in-N increase in costs, and it has no noticeable impact on read performance, but writes are slowed down considerably. Furthermore, since a write to any of the N data disks also involves a write to the parity disk, the latter sees many more writes than the former, and its lifespan can shorten dramatically as a consequence. Data on a RAID-4 array are safe only up to one failed disk (of the N+1).
RAID-5: RAID-5 addresses the asymmetry issue of RAID-4: parity blocks are spread over all of the N+1 disks, with no single disk having a particular role.
Read and write performance are the same as for RAID-4. Here again, the system stays functional with up to one failed disk (of the N+1), but no more.
RAID-6: RAID-6 can be considered an extension of RAID-5, where each series of N blocks involves two redundancy blocks, and each such series of N+2 blocks is spread over N+2 disks.
This RAID level is slightly more expensive than RAID-4 and RAID-5, but it brings some extra safety, since up to two drives (of the N+2) can fail without compromising data availability. The counterpart is that each write operation now involves writing one data block and two redundancy blocks, which makes writes even slower.
RAID-1+0: This isn't, strictly speaking, a RAID level, but a stacking of two RAID groupings. Starting from 2×N disks, one first sets them up by pairs into N RAID-1 volumes; these N volumes are then aggregated into one, either by “linear RAID” or (increasingly) by LVM. This last case goes farther than pure RAID, but there is no problem with that.
RAID-1+0 can survive multiple disk failures: up to N in the 2×N array described above, provided that at least one disk keeps working in each of the RAID-1 pairs.
GOING FURTHER RAID-10
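RAID-10 is generally considered a synonym of RAID-1+0, but Linux specifically defines RAID-10 as a more general setup. It allows each block to be stored on two different disks, even with an odd number of disks, with the copies spread out according to a configurable model.
Performance will vary depending on the chosen distribution model and redundancy level, and on the workload of the logical volume.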
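
The parity mechanism described for RAID-4 and RAID-5 can be illustrated in a few lines. The sketch below assumes the usual XOR parity (the text above does not name the operation) and uses made-up four-byte "blocks"; it shows that any single missing block of a stripe can be recomputed from the others.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# N = 3 hypothetical data blocks belonging to one stripe.
data = [b"DEBI", b"AN_R", b"AID5"]
parity = xor_blocks(data)  # the redundancy block stored alongside them

# Simulate the loss of the second data block, then rebuild it
# from the surviving data blocks and the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # prints b'AN_R'
```

The same computation also explains the write penalty: changing one data block means the parity block of that stripe must be updated as well.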

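Obviously, the RAID level should be chosen according to the constraints and requirements of each application. Note that a single computer can have several distinct RAID arrays with different configurations.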
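
When weighing the levels against each other, it can help to tabulate usable space and guaranteed fault tolerance. The following is a rough sketch based only on the descriptions above; the level names are labels chosen for this example, spares and RAID-10 layouts are not modelled, and no minimum device counts are checked.

```python
def raid_summary(level, disk_sizes):
    """Return (usable_size, guaranteed_tolerated_failures) for one array.

    Sizes are in arbitrary units. Redundant levels are limited by the
    smallest disk, as explained in the note on disk sizes.
    """
    n, smallest = len(disk_sizes), min(disk_sizes)
    if level in ("linear", "raid0"):
        return sum(disk_sizes), 0
    if level == "raid1":
        return smallest, n - 1
    if level in ("raid4", "raid5"):
        return (n - 1) * smallest, 1
    if level == "raid6":
        return (n - 2) * smallest, 2
    raise ValueError(f"level not handled in this sketch: {level}")


for level, disks in [("raid0", [4, 4]), ("raid1", [4, 4]),
                     ("raid5", [4, 4, 4, 4]), ("raid6", [4] * 6)]:
    print(level, raid_summary(level, disks))
```

The "guaranteed" figure is the worst case: a RAID-1+0 array, for instance, may survive more failures than any single pair does, depending on which disks fail.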

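12.1.1.2. Setting up RAID

Setting up RAID volumes requires the mdadm package; it provides the mdadm command, which allows creating and manipulating RAID arrays, as well as scripts and tools integrating it with the rest of the system, including the monitoring system.

Our example will be a server with a number of disks, some of which are already used, the rest being available to set up RAID. We initially have the following disks and partitions:

the sdb disk, 4 GB, is entirely available;
the sdc disk, 4 GB, is also entirely available;
on the sdd disk, only partition sdd2 (about 4 GB) is available;
finally, the sde disk, 4 GB, is entirely available.

We are going to use these physical elements to build two volumes, one RAID-0 and one mirror (RAID-1). Let's start with the RAID-0 volume: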

# mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
# mdadm --query /dev/md0
/dev/md0: 7.99GiB raid0 2 devices, 0 spares. Use mdadm --detail for more detail.
# mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Mon Feb 28 01:54:24 2022
        Raid Level : raid0
        Array Size : 8378368 (7.99 GiB 8.58 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Mon Feb 28 01:54:24 2022
             State : clean 
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

            Layout : -unknown-
        Chunk Size : 512K

Consistency Policy : none

              Name : debian:0  (local to host debian)
              UUID : a75ac628:b384c441:157137ac:c04cd98c
            Events : 0

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sdb
       1       8       16        1      active sync   /dev/sdc
# mkfs.ext4 /dev/md0
mke2fs 1.46.2 (28-Feb-2021)
Discarding device blocks: done                            
Creating filesystem with 2094592 4k blocks and 524288 inodes
Filesystem UUID: ef077204-c477-4430-bf01-52288237bea0
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done 

# mkdir /srv/raid-0
# mount /dev/md0 /srv/raid-0
# df -h /srv/raid-0
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        7.8G   24K  7.4G   1% /srv/raid-0

The mdadm --create command requires several parameters: the name of the volume to create (/dev/md*, with MD standing for Multiple Device), the RAID level, the number of disks (which is compulsory despite being mostly meaningful only with RAID-1 and above), and the physical drives to use. Once the device is created, we can use it like we'd use a normal partition, create a filesystem on it, mount that filesystem, and so on. Note that our creation of a RAID-0 volume on md0 is nothing but coincidence, and the numbering of the array doesn't need to be correlated to the chosen amount of redundancy. It is also possible to create named RAID arrays, by giving mdadm parameters such as /dev/md/linear instead of /dev/md0.
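
As a small illustration of how these parameters fit together, the sketch below assembles the argument list for mdadm --create without running anything. The helper name is an assumption made for this example; only options that appear in the listings above are used.

```python
import shlex


def mdadm_create_argv(array, level, devices):
    """Build (but do not run) the argument list for creating an md array."""
    if not devices:
        raise ValueError("at least one component device is required")
    return [
        "mdadm", "--create", array,
        f"--level={level}",
        # The device count is mandatory, so derive it from the list itself.
        f"--raid-devices={len(devices)}",
        *devices,
    ]


argv = mdadm_create_argv("/dev/md0", 0, ["/dev/sdb", "/dev/sdc"])
print(shlex.join(argv))
```

Deriving --raid-devices from the device list avoids the two getting out of step; actually running the command requires root privileges and destroys whatever is on the listed devices.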

Creating a RAID-1 volume follows the same pattern; the differences worth noting are explained after the creation:

# mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdd2 /dev/sde
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: largest drive (/dev/sdd2) exceeds size (4189184K) by more than 1%
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
# mdadm --query /dev/md1
/dev/md1: 4.00GiB raid1 2 devices, 0 spares. Use mdadm --detail for more detail.
# mdadm --detail /dev/md1
/dev/md1:
           Version : 1.2
     Creation Time : Mon Feb 28 02:07:48 2022
        Raid Level : raid1
        Array Size : 4189184 (4.00 GiB 4.29 GB)
     Used Dev Size : 4189184 (4.00 GiB 4.29 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Mon Feb 28 02:08:09 2022
             State : clean, resync
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

    Rebuild Status : 13% complete

              Name : debian:1  (local to host debian)
              UUID : 2dfb7fd5:e09e0527:0b5a905a:8334adb8
            Events : 17

    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdd2
       1       8       48        1      active sync   /dev/sde
# mdadm --detail /dev/md1
/dev/md1:
[...]
          State : clean
[...]

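A few remarks are in order. First, mdadm notices that the physical elements have different sizes; since this implies that some space will be lost on the bigger element, a confirmation is required.

More importantly, note the state of the mirror. The normal state of a RAID mirror is that both disks have exactly the same contents. However, nothing guarantees this is the case when the volume is first created. The RAID subsystem therefore provides that guarantee itself, and a synchronization phase starts as soon as the RAID device is created. After some time (the exact amount depends on the actual size of the disks), the RAID array switches to the “active” or “clean” state. Note that during this synchronization phase, the mirror is in degraded mode, and redundancy isn't assured. A disk failing during that window could lead to losing all the data. However, large amounts of critical data are rarely stored on a freshly created RAID array before its initial synchronization. Note also that even in degraded mode, /dev/md1 is usable: a filesystem can be created on it, and data can be copied to it.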
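
To check from a script whether the initial synchronization has finished, one option is to read the State and Rebuild Status lines of the mdadm --detail output shown above. This is a minimal sketch that parses a text excerpt; the function name is hypothetical, and a real script would obtain the text by running mdadm itself.

```python
import re

SAMPLE = """\
             State : clean, resync
Consistency Policy : resync
    Rebuild Status : 13% complete
"""


def array_state(detail_text):
    """Extract the state flags and rebuild progress from `mdadm --detail` text."""
    state = re.search(r"^[ \t]*State[ \t]*:[ \t]*(.*\S)", detail_text, re.MULTILINE)
    rebuild = re.search(r"^[ \t]*Rebuild Status[ \t]*:[ \t]*(\d+)%",
                        detail_text, re.MULTILINE)
    flags = [flag.strip() for flag in state.group(1).split(",")] if state else []
    progress = int(rebuild.group(1)) if rebuild else None
    return flags, progress


print(array_state(SAMPLE))
```

An array whose flags are exactly ["clean"] and whose progress is None has completed its synchronization.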

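TIP Starting a mirror in degraded mode

Sometimes two disks are not immediately available when one wants to start a RAID-1 mirror, for instance because one of the disks one plans to include already stores the data one wants to move to the array. In such circumstances, it is possible to deliberately create a degraded RAID-1 array by passing missing instead of a device file as one of the arguments to mdadm. Once the data have been copied to the “mirror”, the old disk can be added to the array. A synchronization then takes place, giving us the redundancy that was wanted in the first place.

TIP Setting up a mirror without synchronization

RAID-1 volumes are often created to be used as a new disk, considered blank right after creation. The initial contents of the volume are therefore of no value; the data that matter are those written after the volume has been created.

One might therefore wonder about the point of synchronizing both disks at creation time. Why care whether the contents are identical in areas of the volume that will only ever be read after they have been written?

Fortunately, this synchronization phase can be avoided by passing the --assume-clean option to mdadm. However, this option can lead to surprises in cases where the initial data will be read (for instance if a filesystem is already present on the physical disks), which is why it isn't enabled by default.

Now let's see what happens when one of the elements of the RAID-1 array fails. mdadm, in particular its --fail option, allows simulating such a disk failure: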

# mdadm /dev/md1 --fail /dev/sde
mdadm: set /dev/sde faulty in /dev/md1
# mdadm --detail /dev/md1
/dev/md1:
           Version : 1.2
     Creation Time : Mon Feb 28 02:07:48 2022
        Raid Level : raid1
        Array Size : 4189184 (4.00 GiB 4.29 GB)
     Used Dev Size : 4189184 (4.00 GiB 4.29 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Mon Feb 28 02:15:34 2022
             State : clean, degraded 
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 1
     Spare Devices : 0

Consistency Policy : resync

              Name : debian:1  (local to host debian)
              UUID : 2dfb7fd5:e09e0527:0b5a905a:8334adb8
            Events : 19

    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdd2
       -       0        0        1      removed

       1       8       48        -      faulty   /dev/sde
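
The device table at the end of this output is the part to watch when looking for failed components. As a rough sketch (the helper name and the parsing rules are assumptions based on the layout shown here, not a stable interface), the rows can be grouped by state like this:

```python
SAMPLE = """\
    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdd2
       -       0        0        1      removed

       1       8       48        -      faulty   /dev/sde
"""


def components_by_state(detail_text):
    """Group component devices by the state shown in the device table."""
    result = {}
    in_table = False
    for line in detail_text.splitlines():
        fields = line.split()
        if fields[:2] == ["Number", "Major"]:
            in_table = True  # header row: device rows follow
            continue
        # Device rows end with a /dev path; "removed" slots have none.
        if in_table and len(fields) >= 6 and fields[-1].startswith("/dev/"):
            state = " ".join(fields[4:-1])
            result.setdefault(state, []).append(fields[-1])
    return result


print(components_by_state(SAMPLE))
```

Any device listed under "faulty" is a candidate for replacement.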

The contents of the volume are still accessible (and, if it is mounted, applications don't notice a thing), but data safety isn't assured anymore: should the sdd disk fail in turn, the data would be lost. We want to avoid that risk, so we replace the failed disk with a new one, sdf:

# mdadm /dev/md1 --add /dev/sdf
mdadm: added /dev/sdf
# mdadm --detail /dev/md1
/dev/md1:
           Version : 1.2
     Creation Time : Mon Feb 28 02:07:48 2022
        Raid Level : raid1
        Array Size : 4189184 (4.00 GiB 4.29 GB)
     Used Dev Size : 4189184 (4.00 GiB 4.29 GB)
      Raid Devices : 2
     Total Devices : 3
       Persistence : Superblock is persistent

       Update Time : Mon Feb 28 02:25:34 2022
             State : clean, degraded, recovering 
    Active Devices : 1
   Working Devices : 2
    Failed Devices : 1
     Spare Devices : 1

Consistency Policy : resync

    Rebuild Status : 47% complete

              Name : debian:1  (local to host debian)
              UUID : 2dfb7fd5:e09e0527:0b5a905a:8334adb8
            Events : 39

    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdd2
       2       8       64        1      spare rebuilding   /dev/sdf

       1       8       48        -      faulty   /dev/sde
# [...]
[...]
# mdadm --detail /dev/md1
/dev/md1:
           Version : 1.2
     Creation Time : Mon Feb 28 02:07:48 2022
        Raid Level : raid1
        Array Size : 4189184 (4.00 GiB 4.29 GB)
     Used Dev Size : 4189184 (4.00 GiB 4.29 GB)
      Raid Devices : 2
     Total Devices : 3
       Persistence : Superblock is persistent

       Update Time : Mon Feb 28 02:25:34 2022
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 1
     Spare Devices : 0

Consistency Policy : resync

              Name : debian:1  (local to host debian)
              UUID : 2dfb7fd5:e09e0527:0b5a905a:8334adb8
            Events : 41

    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdd2
       2       8       64        1      active sync   /dev/sdf

       1       8       48        -      faulty   /dev/sde

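Here again, the kernel automatically triggers a reconstruction phase during which the volume, although still accessible, is in degraded mode. Once the reconstruction is over, the RAID array is back to a normal state. One can then tell the system to remove the sde disk from the array, so as to end up with a classical RAID mirror on two disks: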

# mdadm /dev/md1 --remove /dev/sde
mdadm: hot removed /dev/sde from /dev/md1
# mdadm --detail /dev/md1
/dev/md1:
[...]
    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdd2
       2       8       64        1      active sync   /dev/sdf

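From then on, the drive can be physically removed when the server is next switched off, or even hot-removed when the hardware configuration allows hot-swap. Such configurations include some SCSI controllers, most SATA disks, and external drives connected through USB or Firewire.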
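
The replacement procedure walked through above can be summarized as an ordered list of commands. The sketch below only builds that list (the function name is an assumption for this example) and uses the same mdadm options as the listings; it deliberately runs nothing.

```python
def replacement_plan(array, failed, replacement):
    """Return the mdadm invocations used above to swap out one component."""
    return [
        # Mark the component as faulty (a real failure is flagged by the kernel).
        ["mdadm", array, "--fail", failed],
        # Add the new disk; the kernel then starts the reconstruction.
        ["mdadm", array, "--add", replacement],
        # Only once the array is back to "clean": drop the faulty component.
        ["mdadm", array, "--remove", failed],
    ]


for step in replacement_plan("/dev/md1", "/dev/sde", "/dev/sdf"):
    print(" ".join(step))
```

The last step should wait until mdadm --detail reports the array as clean again, since the volume has no redundancy while the reconstruction is running.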

12.1.1.3. Backing up the Configuration

Most of the meta-data concerning RAID volumes are saved directly on the disks that make up these arrays, so that the kernel can detect the arrays and their components and assemble them automatically when the system starts up. However, backing up this configuration is encouraged, because this detection isn't fail-proof, and it can be expected to fail precisely in sensitive circumstances. In our example, if the sde disk failure had been real (instead of simulated) and the system had been restarted without removing this sde disk, this disk could start working again due to having been probed during the reboot. The kernel would then have three physical elements, each claiming to contain half of the same RAID volume. In practice, this leads to the RAID starting from one disk or the other on different boots, with new data landing on whichever disk started the array in degraded mode. Another source of confusion can come when RAID volumes from two servers are consolidated onto one server only. If these arrays were running normally before the disks were moved, the kernel would be able to detect and reassemble the pairs properly; but if the moved disks had been aggregated into an md1 on the old server, and the new server already has an md1, one of the mirrors would be renamed.

Backing up the configuration is therefore important, if only for reference. The standard way to do it is by editing the /etc/mdadm/mdadm.conf file, an example of which is listed here:

Example 12.1. mdadm configuration file

# mdadm.conf
#
# !NB! Run update-initramfs -u after updating this file.
# !NB! This will ensure that initramfs has an uptodate copy.
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default (built-in), scan all partitions (/proc/partitions) and all
# containers for MD superblocks. alternatively, specify devices to scan, using
# wildcards if desired.
DEVICE /dev/sd*

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md/0  metadata=1.2 UUID=a75ac628:b384c441:157137ac:c04cd98c name=debian:0
ARRAY /dev/md/1  metadata=1.2 UUID=2dfb7fd5:e09e0527:0b5a905a:8334adb8 name=debian:1
# This configuration was auto-generated on Mon, 28 Feb 2022 01:53:48 +0100 by mkconf

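One of the most useful details is the DEVICE option, which lists the devices where the system automatically looks for components of RAID volumes at start-up time. In our example, we replaced the default value, partitions containers, with an explicit list of device files, since we chose to use entire disks, and not only partitions, for some volumes.

The two ARRAY lines at the end of our example are those allowing the kernel to safely pick which volume number to assign to which array. The metadata stored on the disks themselves are enough to re-assemble the volumes, but not to determine the volume number (and the matching /dev/md* device name).

Fortunately, these lines can be generated automatically: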

# mdadm --misc --detail --brief /dev/md?
ARRAY /dev/md/0  metadata=1.2 UUID=a75ac628:b384c441:157137ac:c04cd98c name=debian:0
ARRAY /dev/md/1  metadata=1.2 UUID=2dfb7fd5:e09e0527:0b5a905a:8334adb8 name=debian:1

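The contents of these two ARRAY lines don't depend on the list of disks included in the volume. It is therefore not necessary to regenerate them when replacing a failed disk with a new one. On the other hand, care must be taken to update the file whenever a RAID array is created or deleted.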
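
Since the ARRAY lines identify each array by UUID rather than by its member disks, they are easy to compare against what is currently running. Here is a minimal sketch of a parser for those lines; the function name is hypothetical, and the parsing covers only the simple key=value form shown in the example file.

```python
CONF = """\
# definitions of existing MD arrays
ARRAY /dev/md/0  metadata=1.2 UUID=a75ac628:b384c441:157137ac:c04cd98c name=debian:0
ARRAY /dev/md/1  metadata=1.2 UUID=2dfb7fd5:e09e0527:0b5a905a:8334adb8 name=debian:1
"""


def parse_array_lines(conf_text):
    """Return {device: {key: value}} for each ARRAY line of an mdadm.conf."""
    arrays = {}
    for line in conf_text.splitlines():
        words = line.split()
        if len(words) < 2 or words[0] != "ARRAY":
            continue  # comments, blank lines and other keywords
        device, options = words[1], words[2:]
        arrays[device] = dict(opt.split("=", 1) for opt in options if "=" in opt)
    return arrays


for device, options in parse_array_lines(CONF).items():
    print(device, options["UUID"])
```

Comparing these UUIDs with the ones reported by mdadm --detail is one way to notice that the file was not updated after an array was created or deleted.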

12.1.2. LVM

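LVM, the Logical Volume Manager, is another approach to abstracting logical volumes from their physical supports, which focuses on increasing flexibility rather than increasing reliability. LVM allows changing a logical volume transparently as far as the applications are concerned; for instance, it is possible to add new disks, migrate the data to them, and remove the old disks, without unmounting the volume.

12.1.2.1. LVM Concepts

This flexibility is attained by a level of abstraction involving three concepts.

First, the PV (Physical Volume) is the entity closest to the hardware: it can be a partition on a disk, a full disk, or any other block device (including, for instance, a RAID array). Note that when a physical element is set up to be a PV for LVM, it should only be accessed via LVM, otherwise the system will get confused.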

Second, a number of PVs can be clustered in a VG (Volume Group), which can be compared to disks both virtual and extensible. VGs are abstract, and don't appear in a device file in the /dev hierarchy, so there is no risk of using them directly.

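The third kind of object is the LV (Logical Volume), which is a chunk of a VG; if we keep the VG-as-disk analogy, the LV compares to a partition. The LV appears as a block device with an entry in /dev, and it can be used like any other physical partition (most commonly, to host a filesystem or swap space).

The important thing is that the splitting of a VG into LVs is entirely independent of its physical components (the PVs). A VG with only a single physical component (a disk, for instance) can be split into several logical volumes; similarly, a VG built from several PVs can appear as a single large logical volume. The only constraint, obviously, is that the total size allocated to LVs can't be bigger than the total capacity of the PVs in the volume group.

It often makes sense, however, to have some kind of homogeneity among the physical components of a VG, and to split the VG into logical volumes that have similar usage patterns. For instance, if the available hardware includes fast disks and slower disks, the fast ones could be clustered into one VG and the slower ones into another; chunks of the first one can then be assigned to applications requiring fast data access, while the second one is kept for less demanding tasks.

In any case, keep in mind that an LV isn't particularly attached to any one PV. It is possible to influence where the data from an LV are physically stored, but this isn't required for day-to-day use. On the contrary: when the set of physical components of a VG evolves, the physical storage locations corresponding to a particular LV can be migrated across disks (while staying within the PVs assigned to the VG, of course).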

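12.1.2.2. Setting up LVM

Let us now follow, step by step, the process of setting up LVM for a typical use case: we want to simplify a complex storage situation. Such a situation usually arises after a long and convoluted history of accumulated temporary measures. For the purposes of illustration, we consider a server where the storage needs have changed over time, ending up with available partitions scattered over several partially used disks. In more concrete terms, the following partitions are available:

on the sdb disk, a sdb2 partition, 4 GB;
on the sdc disk, a sdc3 partition, 3 GB;
the sdd disk, 4 GB, is fully available;
on the sdf disk, a sdf1 partition, 4 GB, and a sdf2 partition, 5 GB.

In addition, let's assume that disks sdb and sdf are faster than the other two.

Our goal is to set up three logical volumes for three different applications: a file server requiring 5 GB of storage space, a database (1 GB), and some space for backups (12 GB). The first two need good performance, but backups are less critical in terms of access speed. These requirements determine which PVs should back the LV of each application. Since LVM abstracts away the physical size of the individual devices, the only limit is the total available space.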
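
Before creating anything, it is worth checking that such a plan fits. The sketch below uses the nominal sizes listed above and a hypothetical split into a "critical" pool and a "normal" pool, mirroring the two volume groups that the rest of this section goes on to create; real usable sizes are slightly smaller than the nominal ones, as the later listings show.

```python
# Nominal sizes in GB, as listed above.
pv_sizes = {"sdb2": 4, "sdc3": 3, "sdd": 4, "sdf1": 4, "sdf2": 5}

# Hypothetical grouping: one pool for demanding services, one for the rest.
pools = {
    "critical": ["sdb2", "sdf1"],
    "normal": ["sdc3", "sdd", "sdf2"],
}

# (pool, requested size in GB) for each planned logical volume.
requests = {
    "files": ("critical", 5),
    "database": ("critical", 1),
    "backups": ("normal", 12),
}


def remaining_space(pv_sizes, pools, requests):
    """Return the space left in each pool once every request is served."""
    left = {pool: sum(pv_sizes[pv] for pv in pvs) for pool, pvs in pools.items()}
    for pool, size in requests.values():
        left[pool] -= size
    return left


left = remaining_space(pv_sizes, pools, requests)
print(left)
assert all(space >= 0 for space in left.values()), "the plan does not fit"
```

None of the individual partitions is large enough for the 5 GB or 12 GB volumes, yet the plan fits once the partitions are pooled; this is exactly the constraint on total size described earlier.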

The required tools are in the lvm2 package and its dependencies. Once they are installed, setting up LVM takes three steps, matching the three levels of concepts.

First, we prepare the physical volumes using pvcreate:

# pvcreate /dev/sdb2
  Physical volume "/dev/sdb2" successfully created.
# pvdisplay
  "/dev/sdb2" is a new physical volume of "4.00 GiB"
  --- NEW Physical volume ---
  PV Name               /dev/sdb2
  VG Name               
  PV Size               4.00 GiB
  Allocatable           NO
  PE Size               0   
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               yK0K6K-clbc-wt6e-qk9o-aUh9-oQqC-k1T71B

# for i in sdc3 sdd sdf1 sdf2 ; do pvcreate /dev/$i ; done
  Physical volume "/dev/sdc3" successfully created.
  Physical volume "/dev/sdd" successfully created.
  Physical volume "/dev/sdf1" successfully created.
  Physical volume "/dev/sdf2" successfully created.
# pvdisplay -C
  PV         VG Fmt  Attr PSize PFree
  /dev/sdb2     lvm2 ---  4.00g 4.00g
  /dev/sdc3     lvm2 ---  3.00g 3.00g
  /dev/sdd      lvm2 ---  4.00g 4.00g
  /dev/sdf1     lvm2 ---  4.00g 4.00g
  /dev/sdf2     lvm2 ---  5.00g 5.00g

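So far, so good; note that a PV can be set up on a full disk as well as on individual partitions of it. As shown above, the pvdisplay command lists the existing PVs, with two possible output formats.

Now let's assemble these physical elements into VGs using vgcreate. We gather PVs from the fast disks into a vg_critical VG; the other VG, vg_normal, is built from the remaining PVs, including those on the slower disks.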

# vgcreate vg_critical /dev/sdb2 /dev/sdf1
  Volume group "vg_critical" successfully created
# vgdisplay
  --- Volume group ---
  VG Name               vg_critical
  System ID             
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               7.99 GiB
  PE Size               4.00 MiB
  Total PE              2046
  Alloc PE / Size       0 / 0   
  Free  PE / Size       2046 / 7.99 GiB
  VG UUID               JgFWU3-emKg-9QA1-stPj-FkGX-mGFb-4kzy1G

# vgcreate vg_normal /dev/sdc3 /dev/sdd /dev/sdf2
  Volume group "vg_normal" successfully created
# vgdisplay -C
  VG          #PV #LV #SN Attr   VSize   VFree  
  vg_critical   2   0   0 wz--n-   7.99g   7.99g
  vg_normal     3   0   0 wz--n- <11.99g <11.99g

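Here again, the commands are rather straightforward (and vgdisplay offers two output formats). Note that it is quite possible to use two PVs located on the same physical disk in two different VGs. Note also that we used a vg_ prefix to name our VGs, but this is nothing more than a convention.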
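
The VG sizes reported above (7.99 GiB rather than 8) follow from how space is divided into physical extents. The sketch below reproduces the extent counts under two assumptions that are not stated in the text: each partition is exactly its nominal size in GiB, and LVM reserves about 1 MiB at the start of each PV for its metadata.

```python
MIB_PER_GIB = 1024


def usable_extents(pv_size_mib, pe_size_mib=4, reserved_mib=1):
    """Whole extents that fit on a PV after reserving a metadata area."""
    return (pv_size_mib - reserved_mib) // pe_size_mib


vg_critical = sum(usable_extents(size * MIB_PER_GIB) for size in (4, 4))
vg_normal = sum(usable_extents(size * MIB_PER_GIB) for size in (3, 4, 5))

print(vg_critical, vg_critical * 4 / MIB_PER_GIB)  # extents, size in GiB
print(vg_normal, vg_normal * 4 / MIB_PER_GIB)
```

Under these assumptions vg_critical gets 2046 extents, matching the Total PE line of the vgdisplay output, and vg_normal gets 3069 extents, which is just under 11.99 GiB.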

We now have two “virtual disks”, sized about 8 GB and 12 GB respectively. Let's now carve them up into “virtual partitions” (LVs). This involves the lvcreate command, and a slightly more complex syntax:

# lvdisplay
# lvcreate -n lv_files -L 5G vg_critical
  Logical volume "lv_files" created.
# lvdisplay
  --- Logical volume ---
  LV Path                /dev/vg_critical/lv_files
  LV Name                lv_files
  VG Name                vg_critical
  LV UUID                Nr62xe-Zu7d-0u3z-Yyyp-7Cj1-Ej2t-gw04Xd
  LV Write Access        read/write
  LV Creation host, time debian, 2022-03-01 00:17:46 +0100
  LV Status              available
  # open                 0
  LV Size                5.00 GiB
  Current LE             1280
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0

# lvcreate -n lv_base -L 1G vg_critical
  Logical volume "lv_base" created.
# lvcreate -n lv_backups -L 11.98G vg_normal
  Rounding up size to full physical extent 11.98 GiB
  Rounding up size to full physical extent 11.98 GiB
  Logical volume "lv_backups" created.
# lvdisplay -C
  LV         VG          Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_base    vg_critical -wi-a-----  1.00g                                                    
  lv_files   vg_critical -wi-a-----  5.00g                                                    
  lv_backups vg_normal   -wi-a----- 11.98g

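When creating logical volumes, two parameters are required; they must be passed to lvcreate as options. The name of the LV to be created is specified with the -n option, and its size is generally given using the -L option. We also need to tell the command which VG to operate on, of course, hence the last parameter on the command line.

GOING FURTHER lvcreate options

The lvcreate command has several options that allow fine-tuning how the LV is created.

Let's first describe the -l option, with which the LV's size can be given as a number of blocks (as opposed to the “human” units we used above). These blocks (called PEs, physical extents, in LVM terms) are contiguous units of storage space in PVs, and an extent cannot be split across LVs. When one wants to define the storage space for an LV with some precision, for instance to use the full available space, the -l option will probably be preferred over -L.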

It is also possible to hint at the physical location of an LV, so that its extents are stored on a particular PV (while staying within the ones assigned to the VG, of course). Since we know that sdb is faster than sdf, we may want to store the lv_base there if we want to give an advantage to the database server compared to the file server. The command line becomes: lvcreate -n lv_base -L 1G vg_critical /dev/sdb2. Note that this command can fail if the PV doesn't have enough free extents. In our example, we would probably have to create lv_base before lv_files to avoid this situation – or free up some space on sdb2 with the pvmove command.
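
The relationship between the -L sizes used earlier and the extent counts shown by LVM is simple arithmetic. This sketch assumes the 4 MiB extent size reported by vgdisplay and rounds up to a whole number of extents, as the "Rounding up size to full physical extent" message suggests; the function name is made up for the example.

```python
import math


def size_to_extents(size_gib, pe_size_mib=4):
    """Number of physical extents needed for a requested size in GiB."""
    return math.ceil(size_gib * 1024 / pe_size_mib)


for requested in (5, 1, 11.98):
    print(requested, size_to_extents(requested))
```

A 5 GiB request yields 1280 extents, which is the Current LE value shown by lvdisplay for lv_files, and 11.98 GiB is rounded up to 3067 extents.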

Logical volumes, once created, end up as block device files in /dev/mapper/:

# ls -l /dev/mapper
total 0
crw------- 1 root root 10, 236 Mar  1 00:17 control
lrwxrwxrwx 1 root root       7 Mar  1 00:19 vg_critical-lv_base -> ../dm-1
lrwxrwxrwx 1 root root       7 Mar  1 00:17 vg_critical-lv_files -> ../dm-0
lrwxrwxrwx 1 root root       7 Mar  1 00:19 vg_normal-lv_backups -> ../dm-2 
# ls -l /dev/dm-*
brw-rw---- 1 root disk 253, 0 Mar  1 00:17 /dev/dm-0
brw-rw---- 1 root disk 253, 1 Mar  1 00:19 /dev/dm-1
brw-rw---- 1 root disk 253, 2 Mar  1 00:19 /dev/dm-2

NOTE Auto-detecting LVM volumes

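When the computer boots, the lvm2-activation systemd service unit executes vgchange -aay to “activate” the volume groups: it scans the available devices; those that have been initialized as physical volumes for LVM are registered into the LVM subsystem, the volume groups they belong to are assembled, and the relevant logical volumes are started and made available. There is therefore no need to edit configuration files when creating or modifying LVM volumes.

Note, however, that the layout of the LVM elements (physical volumes, volume groups and logical volumes) is backed up in /etc/lvm/backup, which can be useful in case of a problem (or just to take a look under the hood).

To make things easier, convenience symbolic links are also created in directories matching the VGs: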

# ls -l /dev/vg_critical
total 0
lrwxrwxrwx 1 root root 7 Mar  1 00:19 lv_base -> ../dm-1
lrwxrwxrwx 1 root root 7 Mar  1 00:17 lv_files -> ../dm-0 
# ls -l /dev/vg_normal
total 0
lrwxrwxrwx 1 root root 7 Mar  1 00:19 lv_backups -> ../dm-2
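
Both naming schemes point to the same dm-N device, and one can be derived from the other. The sketch below builds the two paths for a given VG and LV; the helper names are hypothetical, and the doubling of hyphens inside names (a device-mapper convention not shown in the listings above, since our names contain none) is included as an assumption worth verifying on a real system.

```python
def vg_path(vg, lv):
    """Path of the convenience symlink in the per-VG directory."""
    return f"/dev/{vg}/{lv}"


def mapper_path(vg, lv):
    """Path under /dev/mapper; hyphens inside a name are doubled so that
    the single hyphen separating VG and LV stays unambiguous."""
    def escape(name):
        return name.replace("-", "--")
    return f"/dev/mapper/{escape(vg)}-{escape(lv)}"


print(vg_path("vg_critical", "lv_files"))
print(mapper_path("vg_critical", "lv_files"))
```

Using underscores in VG and LV names, as this chapter does, sidesteps the escaping question altogether.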

The LVs can then be used exactly like standard partitions:

# mkfs.ext4 /dev/vg_normal/lv_backups
mke2fs 1.46.2 (28-Feb-2021)
Discarding device blocks: done                            
Creating filesystem with 3140608 4k blocks and 786432 inodes
Filesystem UUID: 7eaf0340-b740-421e-96b2-942cdbf29cb3
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done 

# mkdir /srv/backups
# mount /dev/vg_normal/lv_backups /srv/backups
# df -h /srv/backups
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/vg_normal-lv_backups   12G   24K   12G   1% /srv/backups
# [...]
[...]
# cat /etc/fstab
[...]
/dev/vg_critical/lv_base    /srv/base       ext4 defaults 0 2
/dev/vg_critical/lv_files   /srv/files      ext4 defaults 0 2
/dev/vg_normal/lv_backups   /srv/backups    ext4 defaults 0 2

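From the applications' point of view, the myriad small partitions have now been abstracted into one large 12 GB volume, with a friendlier name.

12.1.2.3. LVM Over Time

Even though the ability to aggregate partitions or physical disks is convenient, this is not the main advantage brought by LVM. The flexibility it brings is especially noticeable as time passes and an LV needs to grow. In our example, let's assume that new large files must be stored, and that the LV dedicated to the file server is too small to contain them. Since we haven't used the whole space available in vg_critical, we can grow lv_files. For that purpose, we use the lvresize command to enlarge the LV, then resize2fs to adapt the filesystem to the new size: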

# df -h /srv/files/
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/vg_critical-lv_files  4.9G  4.2G  485M  90% /srv/files
# lvdisplay -C vg_critical/lv_files
  LV       VG          Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_files vg_critical -wi-ao---- 5.00g                                                    
# vgdisplay -C vg_critical
  VG          #PV #LV #SN Attr   VSize VFree
  vg_critical   2   2   0 wz--n- 7.99g 1.99g
# lvresize -L 6G vg_critical/lv_files
  Size of logical volume vg_critical/lv_files changed from 5.00 GiB (1280 extents) to 6.00 GiB (1536 extents).
  Logical volume vg_critical/lv_files successfully resized.
# lvdisplay -C vg_critical/lv_files
  LV       VG          Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_files vg_critical -wi-ao---- 6.00g                                                    
# resize2fs /dev/vg_critical/lv_files
resize2fs 1.46.2 (28-Feb-2021)
Filesystem at /dev/vg_critical/lv_files is mounted on /srv/files; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 1
The filesystem on /dev/vg_critical/lv_files is now 1572864 (4k) blocks long.

# df -h /srv/files/
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/vg_critical-lv_files  5.9G  4.2G  1.5G  75% /srv/files
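
The figures in this transcript can be cross-checked with a little arithmetic. The sketch below assumes the 4 MiB extent size implied by the lvresize output (5.00 GiB for 1280 extents) and the 4 KiB block size reported by resize2fs. The function names are illustrative only and are not part of LVM.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names, not LVM commands).
# Assumption: 4 MiB extents, as implied by "5.00 GiB (1280 extents)".
EXTENT_MIB=4

# Number of extents needed for a volume of N GiB.
gib_to_extents() {
    echo $(( $1 * 1024 / EXTENT_MIB ))
}

# Number of 4 KiB filesystem blocks that fit in N GiB.
gib_to_4k_blocks() {
    echo $(( $1 * 1024 * 1024 / 4 ))
}

gib_to_extents 5      # 1280, the extent count before the resize
gib_to_extents 6      # 1536, the extent count after the resize
gib_to_4k_blocks 6    # 1572864, the block count printed by resize2fs
```

The three results match the values shown by lvresize and resize2fs above, which confirms that the filesystem now fills the whole logical volume.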

CAUTION Resizing filesystems

Not all filesystems can be resized online; resizing a volume can therefore require unmounting the filesystem first and remounting it afterwards. Of course, if one wants to shrink the space allocated to an LV, the filesystem must be shrunk first; the order is reversed when the resizing goes in the other direction: the logical volume must be grown before the filesystem on it. It is rather straightforward, since at no time must the filesystem size be larger than the block device where it resides (whether that device is a physical partition or a logical volume).

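The ext3, ext4 and xfs filesystems can be grown online, that is, without being unmounted; shrinking them, however, requires an unmount. The reiserfs filesystem can be both grown and shrunk online. The ext2 filesystem can be grown and shrunk as well, but always requires unmounting.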
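
The ordering rule described in this sidebar can be sketched as a small planner for the ext4 case. resize_plan is a hypothetical helper that only prints the commands in the order they would be run and executes nothing; sizes are assumed to be whole gibibytes, and the offline shrink path assumes a filesystem check is run before resize2fs.

```shell
#!/bin/sh
# Print (do not run) the steps for resizing an ext4 filesystem on an LV.
# resize_plan is a hypothetical helper; sizes are whole GiB.
# Usage: resize_plan VG/LV MOUNTPOINT CURRENT_GIB TARGET_GIB
resize_plan() {
    lv=$1 mnt=$2 cur=$3 new=$4
    if [ "$new" -gt "$cur" ]; then
        # Growing: enlarge the block device first, then the filesystem.
        echo "lvresize -L ${new}G $lv"
        echo "resize2fs /dev/$lv"
    elif [ "$new" -lt "$cur" ]; then
        # Shrinking: the filesystem must shrink before the device, offline.
        echo "umount $mnt"
        echo "e2fsck -f /dev/$lv"
        echo "resize2fs /dev/$lv ${new}G"
        echo "lvresize -L ${new}G $lv"
        echo "mount /dev/$lv $mnt"
    else
        echo "# nothing to do"
    fi
}

resize_plan vg_critical/lv_files /srv/files 5 6
```

Printing the plan instead of executing it leaves room to review each step, which matters most in the shrinking direction where a mistake in the order can truncate the filesystem.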

We could proceed in the same way to extend lv_base, the volume hosting the database. As shown below, however, the free space in vg_critical, the VG it is carved from, is almost exhausted:

# df -h /srv/base/
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/vg_critical-lv_base  974M  883M   25M  98% /srv/base
# vgdisplay -C vg_critical
  VG          #PV #LV #SN Attr   VSize VFree   
  vg_critical   2   2   0 wz--n- 7.99g 1016.00m

No matter, since LVM allows adding physical volumes to existing volume groups. For instance, maybe we've noticed that the sdb3 partition, which was so far used outside of LVM, only contained archives that could be moved to lv_backups. We can now recycle it and integrate it into the volume group, and thereby reclaim some available space. This is the purpose of the vgextend command. Of course, the partition must be prepared as a physical volume beforehand. Once the VG has been extended, we can use the same commands as before to grow the logical volume and then the filesystem:

# pvcreate /dev/sdb3
  Physical volume "/dev/sdb3" successfully created.
# vgextend vg_critical /dev/sdb3
  Volume group "vg_critical" successfully extended
# vgdisplay -C vg_critical
  VG          #PV #LV #SN Attr   VSize   VFree 
  vg_critical   3   2   0 wz--n- <12.99g <5.99g 
# lvresize -L 2G vg_critical/lv_base
[...]
# resize2fs /dev/vg_critical/lv_base
[...]
# df -h /srv/base/
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/vg_critical-lv_base  2.0G  886M  991M  48% /srv/base
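
The decision to add a physical volume comes down to comparing the requested growth with the VFree column. The following sketch uses a hypothetical helper, needs_vgextend, and assumes that lv_base was 1 GiB before the resize (the transcript only shows the 974M filesystem size, so this figure is an assumption).

```shell
#!/bin/sh
# needs_vgextend is a hypothetical helper, not an LVM command.
# Usage: needs_vgextend VFREE_MIB CURRENT_LV_MIB TARGET_LV_MIB
# Succeeds (exit status 0) when the VG lacks room for the requested growth.
needs_vgextend() {
    growth=$(( $3 - $2 ))
    [ "$growth" -gt "$1" ]
}

# Before vgextend: 1016 MiB free, lv_base assumed to be 1 GiB, target 2 GiB.
if needs_vgextend 1016 1024 2048; then
    echo "add a PV first"
else
    echo "enough free space"
fi
```

With 1016 MiB free and 1024 MiB of growth requested, the check reports that a PV must be added first, which is the situation that vgextend resolves above.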

GOING FURTHER Advanced LVM

LVM also caters for more advanced uses, where many details can be specified by hand. For instance, an administrator can tweak the size of the blocks that make up physical and logical volumes, as well as their physical layout. It is also possible to move blocks across PVs, for instance to fine-tune performance or, in a more mundane way, to free a PV when the corresponding physical disk needs to be taken out of the VG (whether to assign it to another VG or to remove it from LVM altogether). The manual pages describing the commands are generally clear and detailed. A good entry point is the lvm(8) manual page.

12.1.3. RAID or LVM?

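RAID and LVM both bring indisputable advantages as soon as one leaves the simple case of a desktop computer with a single hard disk whose usage does not change over time. However, RAID and LVM pursue diverging goals and go in different directions, so it is legitimate to wonder which one to adopt. The most appropriate answer depends, of course, on current and foreseeable requirements.

In a few situations the answer is simple enough that the question hardly arises. If the requirement is to safeguard data against hardware failure, then RAID will obviously be set up on a redundant array of disks, since LVM does not address this problem at all. If, on the other hand, the need is for a flexible storage scheme in which volumes are independent of the physical layout of the disks, RAID does not help much and LVM is the natural choice.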

NOTE If performance matters

If input/output speed is of the essence, especially in terms of access times, using LVM and/or RAID in one of the many combinations may have some impact on performance, and this may influence the decision as to which to pick. However, these differences in performance are really minor, and will only be measurable in a few use cases. If performance matters, the best gain would come from using non-rotating storage media (solid-state drives or SSDs); their cost per megabyte is higher than that of standard hard disk drives, and their capacity is usually smaller, but they provide excellent performance for random accesses. If the usage pattern includes many input/output operations scattered all around the filesystem, for instance for databases where complex queries are routinely run, then the advantage of running them on an SSD far outweighs whatever could be gained by picking LVM over RAID or the reverse. In these situations, the choice should be determined by considerations other than pure speed, since the performance aspect is most easily handled by using SSDs.

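The third notable use case is when one simply wants to aggregate two disks into one volume, either for performance reasons or to obtain a single filesystem larger than any of the available disks. This case can be addressed with RAID-0 (or linear RAID) as well as with an LVM volume. In this situation, and barring extra constraints (for instance, having to use RAID because the other computers use only RAID), LVM is usually the better choice. The initial setup of LVM is more complex than that of RAID, but this slight increase in complexity buys a great deal of extra flexibility when requirements change or new disks need to be added.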

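Then of course there is the really interesting use case, where the storage system must be both resistant to hardware failure and flexible in how volumes are allocated. Neither RAID nor LVM can meet both requirements on its own. No matter: this is where both are used at the same time or, more precisely, one on top of the other. The scheme that has all but become a standard, thanks to the maturity of RAID and LVM, is to ensure data redundancy first by grouping the disks into a small number of large RAID arrays, and then to use these RAID arrays as LVM physical volumes. The logical volumes carved from the resulting VGs then serve as the partitions that hold the filesystems. The selling point of this scheme is that when a disk fails, only a small number of RAID arrays need to be reconstructed, which limits the time the administrator spends on recovery.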

Let's take a concrete example: the public relations department at Falcot Corp needs a workstation for video editing, but the department's budget doesn't allow investing in high-end hardware across the board. A decision is made to favor the hardware that is specific to the graphic nature of the work (monitor and video card), and to stay with generic hardware for storage. However, as is widely known, digital video does have some particular requirements for its storage: the amount of data to store is large, and the throughput rate for reading and writing this data is important for the overall system performance (more than typical access time, for instance). These constraints need to be fulfilled with generic hardware, in this case two 960 GB SATA drives; the system data must also be made resistant to hardware failure, as well as some of the user data. Edited video clips must indeed be safe, but video rushes pending editing are less critical, since they're still on the videotapes.

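RAID-1 and LVM are combined to satisfy these constraints. The disks are attached to two different SATA controllers, both to optimize parallel access and to reduce the risk of a simultaneous failure; they therefore appear as sda and sdc. Both disks are partitioned identically, according to the following scheme: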

# sfdisk -l /dev/sda
Disk /dev/sda: 894.25 GiB, 960197124096 bytes, 1875385008 sectors
Disk model: SAMSUNG MZ7LM960
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: BB14C130-9E9A-9A44-9462-6226349CA012

Device         Start        End   Sectors   Size Type
/dev/sda1        2048       4095      2048     1M BIOS boot
/dev/sda2        4096  100667391 100663296    48G Linux RAID
/dev/sda3   100667392  134221823  33554432    16G Linux RAID
/dev/sda4   134221824  763367423 629145600   300G Linux RAID
/dev/sda5   763367424 1392513023 629145600   300G Linux RAID
/dev/sda6  1392513024 1875384974 482871951 230.3G Linux LVM

The first partitions of both disks are BIOS boot partitions.
The next two partitions sda2 and sdc2 (about 48 GB) are assembled into a RAID-1 volume, md0. This mirror is directly used to store the root filesystem.
The sda3 and sdc3 partitions are assembled into a RAID-0 volume, md1, and used as swap space, providing a total of 32 GB of swap. Modern systems can provide plenty of RAM and our system won't need hibernation, so with this amount added the system is unlikely to run out of memory.
The sda4 and sdc4 partitions, as well as sda5 and sdc5, are assembled into two new RAID-1 volumes of about 300 GB each, md2 and md3. Both these mirrors are initialized as physical volumes for LVM, and assigned to the vg_raid volume group. This VG thus contains about 600 GB of safe space.
The remaining partitions, sda6 and sdc6, are directly used as physical volumes, and assigned to another VG called vg_bulk, which therefore ends up with roughly 460 GB of space.
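
The layout described above could be built with a command sequence along the following lines. This is only a sketch: layout_plan is a hypothetical helper that prints the commands without running them, the device names are those used in the text, and the creation of the filesystem on md0 and of the swap area on md1 is left out.

```shell
#!/bin/sh
# Print (do not run) one possible command sequence for the layout above.
# layout_plan is a hypothetical helper; check every device name before
# running any of the printed commands as root.
layout_plan() {
    cat <<'EOF'
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdc2
mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sda3 /dev/sdc3
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda4 /dev/sdc4
mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sda5 /dev/sdc5
pvcreate /dev/md2 /dev/md3
vgcreate vg_raid /dev/md2 /dev/md3
pvcreate /dev/sda6 /dev/sdc6
vgcreate vg_bulk /dev/sda6 /dev/sdc6
EOF
}

layout_plan
```

Only md2 and md3 become physical volumes of vg_raid; md0 and md1 stay outside LVM, and the two sda6/sdc6 partitions enter vg_bulk without any RAID layer underneath.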

Once the VGs are created, they can be split into LVs in a very flexible way. Keep in mind that LVs created in vg_raid survive the failure of one disk, whereas LVs created in vg_bulk do not. On the other hand, vg_bulk spans both disks, which allows higher read and write speeds for large files stored on its LVs.

We will therefore create the lv_var and lv_home LVs on vg_raid, to host the matching filesystems; another large LV, lv_movies, will be used to host the definitive versions of movies after editing. The other VG will be split into a large lv_rushes, for data straight out of the digital video cameras, and a lv_tmp for temporary files. The location of the work area is a less straightforward choice to make: while good performance is needed for that volume, is it worth risking losing work if a disk fails during an editing session? Depending on the answer to that question, the relevant LV will be created on one VG or the other.

We now have both some redundancy for important data and much flexibility in how the available space is split across the applications.
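
The volumes named above could be created along these lines. lv_plan is a hypothetical helper that only prints lvcreate commands; the sizes in the example call are placeholders chosen to fit within the roughly 600 GB and 460 GB of the two VGs, not values taken from the text. The -i 2 option is an assumption about how the bulk volumes would be laid out: it asks LVM to stripe the LV across two PVs, which is one way to obtain parallel access to both disks.

```shell
#!/bin/sh
# lv_plan is a hypothetical helper that only prints lvcreate commands.
# Usage: lv_plan VAR_SIZE HOME_SIZE MOVIES_SIZE RUSHES_SIZE TMP_SIZE
# Sizes are passed through to lvcreate -L unchanged (for example 20G).
lv_plan() {
    [ "$#" -eq 5 ] || { echo "lv_plan: expected 5 sizes" >&2; return 1; }
    echo "lvcreate -n lv_var -L $1 vg_raid"
    echo "lvcreate -n lv_home -L $2 vg_raid"
    echo "lvcreate -n lv_movies -L $3 vg_raid"
    # -i 2 asks LVM to stripe the LV across the two PVs of vg_bulk.
    echo "lvcreate -n lv_rushes -i 2 -L $4 vg_bulk"
    echo "lvcreate -n lv_tmp -i 2 -L $5 vg_bulk"
}

# Placeholder sizes, chosen only to make the example concrete.
lv_plan 20G 100G 400G 400G 20G
```

The LV for the work area is deliberately absent, since the text leaves open which VG it belongs to.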

NOTE Why three RAID-1 volumes?

We could have created a single RAID-1 volume, used it as the physical volume for vg_raid, and carved the LVs for everything that needs protection out of that VG. Why create three RAID-1 volumes, then?

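The rationale for the first split (md0 vs. the others) is data safety: the data written to each element of a RAID-1 mirror is exactly the same, so it is possible to bypass the RAID layer and mount one of the disks of the mirror directly. If there is a kernel bug, for instance, or if the LVM metadata becomes corrupted, it is still possible to boot a minimal system and access critical data such as the layout of the disks in the RAID and LVM volumes. The metadata can then be reconstructed and the files accessed again, so that the system can be brought back to its nominal state.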

The rationale for the second split (md2 vs. md3) is less clear-cut, and more related to acknowledging that the future is uncertain. When the workstation is first assembled, the exact storage requirements are not necessarily known with perfect precision; they can also evolve over time. In our case, we can't know in advance the actual storage space requirements for video rushes and complete video clips. If one particular clip needs a very large amount of rushes, and the VG dedicated to redundant data is less than halfway full, we can re-use some of its unneeded space. We can remove one of the physical volumes, say md3, from vg_raid and either assign it to vg_bulk directly (if the expected duration of the operation is short enough that we can live with the temporary drop in performance), or undo the RAID setup on md3 and integrate its components sda5 and sdc5 into the bulk VG (which grows by 600 GB instead of 300 GB); the lv_rushes logical volume can then be grown according to requirements.
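
The two options described above can be sketched as command sequences. repurpose_md3 is a hypothetical helper that only prints the commands; it assumes that md2 has enough free extents to receive the data that pvmove evacuates from md3, and it leaves out any follow-up change to the mdadm configuration.

```shell
#!/bin/sh
# repurpose_md3 is a hypothetical helper that only prints commands.
# Usage: repurpose_md3 keep-raid | drop-raid
repurpose_md3() {
    case $1 in
        keep-raid|drop-raid) ;;
        *) echo "usage: repurpose_md3 keep-raid|drop-raid" >&2; return 1 ;;
    esac
    # Either way, first empty md3 and take it out of vg_raid.
    echo "pvmove /dev/md3"
    echo "vgreduce vg_raid /dev/md3"
    if [ "$1" = keep-raid ]; then
        # Keep the mirror and hand it to vg_bulk as a single 300 GB PV.
        echo "vgextend vg_bulk /dev/md3"
    else
        # Dismantle the mirror and give both halves to vg_bulk (600 GB).
        echo "pvremove /dev/md3"
        echo "mdadm --stop /dev/md3"
        echo "mdadm --zero-superblock /dev/sda5 /dev/sdc5"
        echo "pvcreate /dev/sda5 /dev/sdc5"
        echo "vgextend vg_bulk /dev/sda5 /dev/sdc5"
    fi
}

repurpose_md3 drop-raid
```

In both variants lv_rushes can be grown afterwards with lvresize and resize2fs, as shown earlier for lv_files.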
