
Bug#1050446: nfs over rdma between debian12 client and debian11 server can cause data corruption



Package: nfs-common,rdma-core

I've been testing the upgrade of a compute node from Debian 11 to Debian 12.
That node mounts its storage over NFS, using the RDMA transport, from a storage server running Debian 11.
The compute node and the storage server are part of a high-performance compute cluster, connected over InfiniBand.
I'm not sure whether this matters, but the storage server uses ZFS.

After the upgrade of the compute node (the NFS client) to Debian 12, this machine could not correctly read a few (small) files. The files were listed correctly by "ls", and their sizes matched as well.
However, the content was corrupted (it looked like random garbage). In one case .ssh/authorized_keys was corrupted; in another, the "version.lua" file from the lmod system was affected, rendering lmod unusable.
Interestingly, only very few files seemed to be affected. Most files were read correctly.

So this is a very subtle error, and not an obvious one.
When reading these files, no error was reported, and data of the expected size was returned.
The returned data was nevertheless corrupted, which could lead to data loss.
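
Since no error is reported and the sizes match, the only way I can see to find affected files is to compare content checksums against a copy known to be good. The following is a minimal sketch of such a check, not something I ran for this report. The function names and both directory arguments are hypothetical: `reference_root` would be a trusted copy (for example the same export mounted with a working transport, or a local copy made on the server), and `suspect_root` the mount being tested. Client-side caching may hide or shift the effect, so treat the result as a hint only.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(reference_root, suspect_root):
    """List relative paths whose content differs between the two trees.

    Files present under reference_root but missing under suspect_root
    are reported as mismatches as well.
    """
    mismatches = []
    for dirpath, _dirnames, filenames in os.walk(reference_root):
        for name in filenames:
            ref_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(ref_path, reference_root)
            sus_path = os.path.join(suspect_root, rel_path)
            if not os.path.isfile(sus_path):
                mismatches.append(rel_path)
            elif sha256_of(ref_path) != sha256_of(sus_path):
                mismatches.append(rel_path)
    return sorted(mismatches)
```

Calling `find_mismatches("/mnt/reference", "/mnt/suspect")` (both paths are placeholders) would return the files whose content differs even though their sizes agree.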

The compute node on Debian 12 had the following packages installed:

ii  libnfsidmap1:amd64                    1:2.6.2-4                               amd64        NFS idmapping library
ii  nfs-common                            1:2.6.2-4                               amd64        NFS support files common to client and server
ii  librdmacm1:amd64                      44.0-2                                  amd64        Library for managing RDMA connections
ii  rdma-core                             44.0-2                                  amd64        RDMA core userspace infrastructure and documentation
ii  rdmacm-utils                          44.0-2                                  amd64        Examples for the librdmacm library


The storage server on Debian 11 had the following packages installed:

ii  nfs-common                         1:1.3.4-6                      amd64        NFS support files common to client and server
ii  nfs-kernel-server                  1:1.3.4-6                      amd64        support for NFS kernel server
ii  librdmacm1:amd64                   33.2-1                         amd64        Library for managing RDMA connections


The problem went away when I changed the NFS mount protocol from proto=rdma to proto=tcp.
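
To check which transport each NFS mount on a client is actually using, the `proto=` option can be read from the mount table. Below is a small sketch that parses text in the format of /proc/mounts. The function name is my own, and the sample lines in the comment are made up for illustration; they are not taken from the affected machines.

```python
def nfs_transports(mounts_text):
    """Map each NFS mount point to the value of its proto= option.

    mounts_text is expected in /proc/mounts format:
    device mount_point fs_type options dump pass
    Mounts without a proto= option map to None.
    """
    transports = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type not in ("nfs", "nfs4"):
            continue
        proto = None
        for option in options.split(","):
            # "mountproto=..." does not match because it does not start with "proto=".
            if option.startswith("proto="):
                proto = option[len("proto="):]
        transports[mount_point] = proto
    return transports


# Illustrative input (hypothetical host and paths):
#   storage:/tank/home /home nfs4 rw,relatime,vers=4.2,proto=rdma,port=20049 0 0
#   storage:/tank/scratch /scratch nfs rw,relatime,vers=3,proto=tcp,mountproto=udp 0 0
#
# Usage on a live system:
#   with open("/proc/mounts") as handle:
#       print(nfs_transports(handle.read()))
```

Any mount reported with `rdma` would be a candidate for the corruption described above; remounting it with proto=tcp is the workaround that helped here.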

I tried to find information about this incompatibility, but did not find any.
I'm also curious whether an nfs 2.6 server would talk correctly to an nfs 1.3 client over RDMA.
Can anyone provide more information on this topic?

