
Bug#908216: btrfs blocked for more than 120 seconds



On Monday, February 25, 2019 10:17pm, "Nicholas D Steeves" <nsteeves@gmail.com> said:

> Control: tags -1 -unreproducible
>
> Hi Russell,
>
> Thank you for providing more info. Now I see where you're running
> into known limitations with btrfs (all versions). Reply follows inline.
>
> BTW, you're not using SMR and/or USB disks, right?

 

No SMR or USB disks are being used. Each btrfs filesystem is either a single partition on a dedicated hard drive or a partition on a hardware RAID array.


> On Mon, Feb 25, 2019 at 12:33:51PM -0600, Russell Mosemann wrote:
> > Steps to reproduce
> >
> > Simply copying a file into the file system can cause things to lock up.
> In
> > this case, the files will usually be thin-provisioned qcow2 disks for kvm
> > vm's. There is no detailed formula to force the lockup to occur, but it
> > happens regularly, sometimes multiple times in one day.
> >
>
> Have you read https://wiki.debian.org/Btrfs ? Specifically "COW on
> COW: Don't do it!" ? If you did read it, maybe the document needs to
> be more firm about this... eg: "take care to use raw images" should
> be "under no circumstances use non-raw images". P.S. Yes, I know that
> page would benefit from a reorganisation... Sorry about its current
> state.

 

In every case, the btrfs partition is used exclusively as an archive for backups. In no circumstance is a vm or something like a database run on the partition. Consequently, it is not possible for CoW on CoW to happen. The partition is simply storing files.
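Concretely, the nightly job on the archive partition is nothing more than a reflink copy of a master image plus pruning of copies older than 45 days; roughly like the following sketch (the paths and names are only illustrative, not the real script):

    # nightly backup into the btrfs archive (sketch; real paths differ)
    DEST=/usr/local/data/datastore2
    TODAY=$(date +%F)

    # reflink copy of the master image; no data is duplicated at copy time
    cp --reflink=always "$DEST/master/vm.qcow2" "$DEST/vm-$TODAY.qcow2"

    # prune daily copies older than 45 days
    find "$DEST" -maxdepth 1 -name 'vm-*.qcow2' -mtime +45 -delete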


> > Files are often copied from a master by reference (cp --reflink), one per
> > day to perform a daily backup for up to 45 days. Removing older files is
> a
> > painfully slow process, even though there are only 45 files in the
> > directory. Doing a scrub is almost a sure way to lock up the system,
> > especially if a copy or delete operation is in progress. On two systems,
> > crashes occur with 4.18 and 4.19 but not 4.17. On the other systems that
> > crash, it does not seem to matter if it is 4.17, 4.18 or 4.19.
> >
>
> It might be that >4.17 fixed some corner-case corruption issue, for
> example by adding an additional check during each step of a backref
> walk, and that this makes the timeout more frequent and severe. eg:
> 4.17 works because it is less strict.
>
> By the way, is it your VM host that locks up, or your VM guests? Do[es]
> they[it] recover if you leave it alone for many hours? I didn't see
> any oopses or panics in your kernel logs.

 

It is the host that locks up. This does not involve vm's in any way. If vm's are present, they are running on different drives; some of them even use btrfs partitions themselves. None of the vm's experience issues with their btrfs volumes, and none of them are affected by the hung btrfs tasks on the host, because the issue exclusively involves the separate, dedicated archive partition used by the host. For all practical purposes, vm's are not part of this picture.

 

As far as I am aware, a hung task does not recover, even after many hours. A number of times it has hung at night, and when I check hours later in the morning, it is still hung. In many cases the server must be forcibly rebooted, because the hung task also hangs the reboot process.
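For what it is worth, this is roughly how I look at a hang when I find one (a sketch; the exact messages differ between hosts):

    # find the 120-second warnings and the call traces that follow them
    journalctl -k | grep -A 20 'blocked for more than 120 seconds'

    # dump the stacks of all blocked tasks into the kernel log
    # (requires the sysrq interface to be enabled)
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 100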


> Reflinked files are like snapshots: any I/O on a file must walk every branch
> of the backref tree that is relevant to that file. For more info see:
> https://btrfs.wiki.kernel.org/index.php/Resolving_Extent_Backrefs
>
> As the tree grows and becomes more complex, a COW fs will get slower.
> You've hit the >120sec threshold, due to one or more of the issues
> discussed in this email. eg: a scrub, even during a file copy/delete
> should never cause this timeout. I haven't experienced one since
> linux-4.4.x or 4.9.x...
>
> To get a figure that will provide a sense of scale to how many
> operations it takes to do anything other than reflink or snapshot you
> can consult the output of:
>
> filefrag each_live_copy_vm_image
>
> I expect the number of extents will exceed tens of thousands. BTW,
> you can use btrfsmaintenance to periodically defrag the source (and
> only the source) images. Note that this will break reflinks between
> SOURCE and each of the 45 REFLINKED-COPIES, but not between
> REFLINK-COPY1 and REFLINK-COPY2. Defragging weekly strikes a nice
> balance between lost space efficiency (due to fewer shared references
> between today's backup and yesterday's) and avoiding the performance
> issue you've encountered. Mounting with autodefrag is the least space
> efficient. (P.S. Also, I don't trust autodefrag)
>
> IIRC btrfs-debug-tree can accurately count references.

 

This is useful information, but it does not seem directly related to the hung tasks. The btrfs tasks hang when a file is being copied into the btrfs partition. No reflinks or vm's are involved in that step; it is a simple file copy.
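For reference, running the check Nicholas suggests against my layout would look roughly like this (file names are illustrative):

    # count the extents of the master image and of one reflinked daily copy
    filefrag /usr/local/data/datastore2/master/vm.qcow2
    filefrag /usr/local/data/datastore2/vm-2019-02-24.qcow2

    # defragment only the master, as suggested
    # (this breaks the reflinks between the master and the daily copies)
    btrfs filesystem defragment -v /usr/local/data/datastore2/master/vm.qcow2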


> > Unless otherwise indicated
> >
> > using qgroups: No
> >
>
> Whew, thank you for not! :-) Qgroups make this kind of issue worse.
>
> > using compression: Yes, compress-force=zstd
> >
>
> If userspace CPU usage is already high then compression may introduce
> additional latency and contribute to the >120sec warning.

 

There is more than enough CPU. Depending on the server, there are 12 to 16 threads running at 2.67GHz to 3.5GHz, and each server has 64GB or more of memory. The servers are not loaded during the day, and the backups take place at night, when not much else is happening. It is difficult to imagine a scenario where it would take longer than 120 seconds to compress a block.

 

When I look at the kernel errors, they involve btrfs cleanup transactions, dirty blocks, caching and extents. I don't recall ever seeing a reference to compression in a call trace.
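For the record, collecting those traces involves nothing more elaborate than something like:

    # pull the btrfs-related kernel messages from around a hang
    journalctl -k --since yesterday | grep -iE 'btrfs|blocked for more than'
    # the traces mention cleanup transactions, dirty blocks, caching and
    # extents, but never the compression code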


> > number of snapshots: Zero
> >
>
> But 45 reflinked copies per VM.
>
> > number of subvolumes: top level subvolume only
> >
>
> I believe it was Chris Murphy who wrote (on linux-btrfs) about how
> segmenting different functions/datasets into different
> non-hierarchically structured (eg: flat layout) subvolumes reduces
> lock contention during backref walks. This is a performance tuning
> tip that needs to be investigated and integrated into the wiki
> article. Eg:
>
>            _____id:5 top-level_____        <- either unmounted, or
>           /      |     |     |     \          mounted somewhere like
>          /       |     |     |      \         /.btrfs-admin, /.volume, etc.
>         /        |     |     |       \
>  host_rootfs    VM0   VM1   VM2   data_shared_between_VMs

 

This is an interesting idea, but it implies that btrfs does not handle large files or large file systems very well. The trick is to make it look like multiple small file systems.
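If I understand the suggestion correctly, a flat layout on the archive disk would be set up roughly like this (a sketch only; I am not convinced it buys anything for a plain backup store):

    # mount the top-level subvolume (id 5) somewhere out of the way
    mkdir -p /.volume
    mount -o subvolid=5,noatime /dev/sdc1 /.volume

    # create sibling ("flat") subvolumes for separate datasets
    btrfs subvolume create /.volume/backups-a
    btrfs subvolume create /.volume/backups-b

    # mount each subvolume where it is actually used
    mount -o subvol=backups-a,noatime,compress-force=zstd /dev/sdc1 /usr/local/data/datastore2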


> > raid profile: None
> >
> > using bcache: No
> >
>
> Thanks.
>
> > layered on md or lvm: No
> >
>
> But layered on hardware raid?

 

lxc008, lxc009, vhost003 and vhost004 use hardware RAID. All of the other hosts access a dedicated hard drive with a single partition. In all cases, the btrfs partition is only used to hold backup files.


> > vhost002
> >
> > # grep btrfs /etc/mtab
> > /dev/sdc1 /usr/local/data/datastore2 btrfs
> > rw,noatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
> >
>
> [1] Thank you for using noatime. Please explain this configuration.
> eg: does each VM have three virtual disks (containing btrfs volumes),
> backed by qcow2 images, backed by a btrfs volume on the VM host? I
> thought you were using qcow2 images, but this looks like passthrough
> of some kind.
>
> If the former:
>
> Every write causes a COW operation in the inner btrfs, and the qcow2,
> and the outer btrfs volume. The inner btrfs volume compresses once,
> and then the outer btrfs volume will try to compress again.
>
> > # smartctl -l scterc /dev/sdc
> > smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local
> > build)
> > Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> > www.smartmontools.org
> >
> > SCT Error Recovery Control:
> > Read: Disabled
> > Write: Disabled
> >
>
> [2] Have you seen any SATA resets in your logs? The default kernel
> timeout is 30sec, and drives without SCT ERC can sometimes take an
> undefined (though generally under 180sec) amount of time to reattempt
> to read a block...and if it's an SMR drive with writing I/O the delay
> to successful read can be even worse.

 

I have carefully checked the logs for months for an explanation of why btrfs is hanging, and I have never seen any error message other than the hung-task warnings themselves. If this were a single host, a SATA reset might be among the possibilities, but since this involves multiple hosts on different architectures, with and without RAID, a SATA reset is an improbable explanation for all of the hangs.
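For completeness, the sort of check I run for link problems looks roughly like this (device names vary per host):

    # look for SATA link resets, timeouts, or I/O errors in the kernel log
    journalctl -k | grep -iE 'ata[0-9]+.*(reset|error|timeout)|I/O error'

    # cross-check the btrfs device counters (all zero on these hosts)
    btrfs dev stats /usr/local/data/datastore2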

 

We briefly experimented with SMR drives, and the performance was abysmal.


> > vhost003
> >
> > # grep btrfs /etc/mtab
> > /dev/sdb4 /usr/local/data/datastore2 btrfs
> > rw,relatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
> >
>
> [1] Also, why aren't you using noatime here too?

 

noatime is a more recent change, added as an experiment to see whether it affects the hangs. It has not been rolled out to all hosts yet. The presence or absence of noatime does not appear to make a difference, which makes sense, because the hangs happen during writes, not reads.
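The change itself is trivial on the hosts where it has been applied; the sketch below shows the relevant fstab line and a live remount (device and mount point differ per host):

    # /etc/fstab entry for the archive partition (relatime -> noatime)
    # /dev/sdc1  /usr/local/data/datastore2  btrfs  noatime,compress-force=zstd  0  0

    # apply without a reboot
    mount -o remount,noatime /usr/local/data/datastore2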


> > (RAID controller)
> >
> > # smartctl -l scterc /dev/sdb
> > smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.17.0-0.bpo.1-amd64] (local
> > build)
> > Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> > www.smartmontools.org
> >
>
> [4] Ok, so sdb is the raid controller on the host? And sdb is passed
> through, and you shutdown one VM before mounting the same
> btrfs-on-hardware_RAID partition in another VM?

 

Vm's are not involved. On all hosts, there is one btrfs partition that is used exclusively for backup files. On vhost003, vhost004, lxc008 and lxc009, that partition is on hardware RAID. On all other hosts, the partition is on a dedicated hard drive that only has one btrfs partition on it.


> > vhost004
> >
> > # grep btrfs /etc/mtab
> > /dev/sdb4 /usr/local/data/datastore2 btrfs
> > rw,relatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
> >
>
> [3] [1]. Also, why aren't you using noatime here too?
>
> > (RAID controller)
> >
> > # smartctl -l scterc /dev/sdb
> > smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.17.0-0.bpo.1-amd64] (local
> > build)
> > Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> > www.smartmontools.org
> >
> >
> >
> > # btrfs dev stats /usr/local/data/datastore2
> > [/dev/sdb4].write_io_errs 0
> > [/dev/sdb4].read_io_errs 0
> > [/dev/sdb4].flush_io_errs 0
> > [/dev/sdb4].corruption_errs 0
> > [/dev/sdb4].generation_errs 0
> >
>
> [4] So vhost03 and vhost04 mount the same partition from the raid
> controller on the host via passthrough? At the same time?

 

All hosts listed here are separate, physical computers. Each host either has its own dedicated hard drive with the btrfs partition, or its own RAID array with the btrfs partition. The btrfs partition is not shared or remounted between hosts.


> > vhost031
> >
> > # grep btrfs /etc/mtab
> > /dev/sdc1 /usr/local/data/datastore2 btrfs
> > rw,relatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
> >
>
> [3] [4]
>
> > # smartctl -l scterc /dev/sdc
> > smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local
> > build)
> > Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> > www.smartmontools.org
> >
> > SCT Error Recovery Control:
> > Read: Disabled
> > Write: Disabled
> >
>
> [2]
>
> > vhost032
> >
> > # grep btrfs /etc/mtab
> > /dev/sdc1 /usr/local/data/datastore2 btrfs
> > rw,relatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
> >
>
> [3] [4]
>
> > # smartctl -l scterc /dev/sdc
> > smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local
> > build)
> > Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> > www.smartmontools.org
> >
> > SCT Error Recovery Control:
> > Read: Disabled
> > Write: Disabled
> >
>
> Is this sdc a qcow2 image or a passed through megaraid partition?

>

> > vhost182
> >
> > # grep btrfs /etc/mtab
> > /dev/sdc1 /usr/local/data/datastore2 btrfs
> > rw,noatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
> >
>
> [1]
>
> [snip]
>
> > lxc008
> >
> > number of subvolumes: 1416
>
> That's *way* too many. This is a major contributing factor to the
> timeouts...

 

lxc008 does not experience btrfs transaction hangs with 4.17. It does experience hangs with 4.18 and 4.19. Those hangs happen shortly after a copy starts. From that perspective, a hang is easily reproducible.
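The trigger on lxc008 is nothing more elaborate than a copy into the archive; something like the following is enough under 4.18/4.19 (the source path is illustrative):

    # copy one backup image into the btrfs archive partition;
    # the hung-task warning appears shortly after the copy starts
    # on 4.18/4.19, and never on 4.17
    cp /some/source/vm.qcow2 /usr/local/data2/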


> [snip]
>
> > # grep btrfs /etc/mtab
> > /dev/sdc1 /usr/local/data2 btrfs
> > rw,noatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
> >
>
> Is this in a container rather than a VM?

 

lxc008 is a physical host that runs containers, rather than vm's. The btrfs partition is a separate partition on the RAID array. The btrfs partition is only used to store backup files.


> > (RAID controller)
> >
> > # smartctl -d megaraid,0 -l scterc /dev/sdc
> > smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local
> > build)
> > Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> > www.smartmontools.org
> >
> > Write SCT (Get) Error Recovery Control Command failed: ATA return
> > descriptor not supported by controller firmware
> > SCT (Get) Error Recovery Control command failed
> >
>
> A different raid controller? Aiie, this is a complex setup...

 

They are all simple setups. Either the host has a dedicated hard drive for the btrfs partition, or the host has a RAID array where the btrfs partition is located.


> > lxc009
> >
> > # grep btrfs /etc/mtab
> > /dev/sda3 / btrfs rw,relatime,space_cache,subvolid=5,subvol=/ 0 0
> > /dev/sdb1 /usr/local/data2 btrfs
> > rw,noatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
> >
> >
> >
> > (RAID controller)
> >
> > # smartctl -l scterc /dev/sdb
> > smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local
> > build)
> > Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> > www.smartmontools.org
> >
> >
> >
> > # btrfs dev stats /usr/local/data2
> > [/dev/sdb1].write_io_errs 0
> > [/dev/sdb1].read_io_errs 0
> > [/dev/sdb1].flush_io_errs 0
> > [/dev/sdb1].corruption_errs 0
> > [/dev/sdb1].generation_errs 0
> >
>
> Ok, first decide where you want to reflink/snapshot, either inside the
> VMs or outside.
>
> If inside:
> * Host your VM images on an ext4 or xfs partition.
> * Use btrfs inside the VM.
> - Use noatime inside the VM.
> * To get backups onto the host, use the network or a
> passed-through partition.
>
> If outside:
> * Host raw VM images on btrfs (noatime).
> * Use ext4 or xfs inside the VM.
> - Everything is already COWed, checksummed, and compressed on the
> VM host, so it's absolutely not needed here.
> - Also use noatime inside the VM.
> * Periodically defrag the live copy of your VM images.
> ! Note that many on the linux-btrfs mailing list do not recommend
> btrfs for this type of workload, if performance is important.
> ? Maybe the partition pass through is how you're getting around
> this issue?
>
> Hacky "it's too late to rethink this server": Use chattr +C on the VM
> images. Note that the images will no longer be checksummed (see
> wiki).
>
> Maybe I've misunderstood, but it looks like you're running btrfs
> volumes, on top of qcow2 images, on top of a btrfs host volume.
> That's an easy to reproduce recipe for problems of this kind.
>
>
> Sincerely,
> Nicholas
>

 

This last part kind of went off the rails. We are only talking about one btrfs partition per physical host, used solely to store backups. It is the simplest, most vanilla situation, and it should be perfectly suited to a file system.

 

--
Russell Mosemann

