
Re: Kerberos-secured NFSv4: nss_getpwnam: name '8' does not map into domain



[I already sent this mail on the 28th of July, but apparently it didn't
reach the mailing list because of mail problems on my side, so I'm
sending it again now.]

On 2015-07-22 01:31, Christian Seiler wrote:
Hi there,

Hi Christian,

first, thanks again for this very extensive response. It helped me a lot
in better understanding the issue.

I did some further debugging in the meantime. In short, I'm able to
reproduce the issue on two of my systems. Please see below.

On 07/10/2015 01:02 PM, Jonas Meurer wrote:
On 2015-07-08 15:34, Jonas Meurer wrote:
I've another annoying issue with my new Kerberos-secured NFSv4 setup.
Sometimes when Exim4 writes to the mounted NFS share, it fails to set
owner and permissions on the written file. Exim4 runs as local user
Debian-exim:Debian-exim but tries to set owner of created files on
the NFS share to 'mail:mail'. Both the local user Debian-exim and
the local user mail are authenticated against the Kerberos server and
principals 'Debian-exim@DOMAIN.ORG' as well as 'mail@DOMAIN.ORG' do
exist.

Below's some debugging output of rpc.idmapd on the Kerberos-/NFS-
Server.

Ok, these long entries explain what's happening from the server
perspective - and the server does everything correctly here. First of
all, this has nothing to do with Kerberos, but with NFSv4's idmapping.

Good to know that the problem is a client-side one :)

Let me back up a little. [...]

Note that Linux always uses uids internally in the kernel to store user
credentials, so what happens is that the sender of a NFSv4 packet takes
the uid, translates it into a string representation containing an @,
and then sends that over the network - the receiver translates it back
into a uid and uses that in its own data structures. (Btw. if you run
ls to show a directory listing, ls will translate it back into a string
representation, possibly a different one, so you may see a username
instead of just a number. So on NFS three different translations will
happen if you type ls -l somewhere: 1. server: uid -> nfs-name,
2. client (kernel): nfs-name -> uid and 3. client (ls): uid -> name.
nfs-name is typically the same as name with just an @nfs4-domain at
the end, but doesn't have to be.)

This is what the idmapping on both the server and the client is about:
to tell the NFSv4 server and client implementations how to do this type
of translation.
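
The three translations described above can be sketched roughly as
follows. This only illustrates the flow and is not how the kernel or
idmapd implement it: the dictionaries stand in for /etc/passwd on each
machine, and all function names are made up for this example.

```python
# Illustrative stand-ins for /etc/passwd on each host (assumed data).
NFS4_DOMAIN = "freesources.org"
SERVER_USERS = {8: "mail"}  # uid -> name on the server
CLIENT_USERS = {8: "mail"}  # uid -> name on the client


def uid_to_nfs_name(uid, users, domain=NFS4_DOMAIN):
    """Step 1 (sender): uid -> 'name@domain' string sent over the wire."""
    return f"{users[uid]}@{domain}"


def nfs_name_to_uid(nfs_name, users, domain=NFS4_DOMAIN):
    """Step 2 (receiver): 'name@domain' -> local uid, or None if unmappable."""
    name, sep, name_domain = nfs_name.rpartition("@")
    if not sep or name_domain != domain:
        return None
    for uid, user in users.items():
        if user == name:
            return uid
    return None


def uid_to_display_name(uid, users):
    """Step 3 (ls): uid -> name, falling back to the bare number."""
    return users.get(uid, str(uid))


wire_name = uid_to_nfs_name(8, SERVER_USERS)
client_uid = nfs_name_to_uid(wire_name, CLIENT_USERS)
shown = uid_to_display_name(client_uid, CLIENT_USERS)
print(wire_name, client_uid, shown)
```

Note that step 2 has no sensible answer for a string without an @,
which is the situation in the log messages discussed below.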

Ok, understood. So something goes wrong with idmapping on the client.

Now let's get back to those log messages:
Jul 10 10:46:59 nfs1 rpc.idmapd[4946]: nfsdcb: authbuf=gss/krb5i
authtype=user
Jul 10 10:46:59 nfs1 rpc.idmapd[4946]: nfs4_name_to_uid: calling
nsswitch->name_to_uid
Jul 10 10:46:59 nfs1 rpc.idmapd[4946]: nss_getpwnam: name '8' domain
'freesources.org': resulting localname '(null)'

Here we have to look at two code bases: idmapd (part of nfs-utils)
and libnfsidmap (its own package). idmapd is just a wrapper around that
library and implements a simple event loop around a couple of pipes it
uses to communicate with the kernel. The nfsdcb function is the handler
that processes idmapping requests and replies to them (set up in [1],
function body in [2]). Note that it just reads the request, parses it
and then calls a function to do the actual mapping (imconv), which
itself then calls into the libnfsidmap library.

Since you can set up multiple different idmappings, those are
implemented in a similar manner to plugins in libnfsidmap. In your
case, you are using the 'nss' plugin (name service switch, the libc
mechanism normal programs use to lookup users and groups, typically
in /etc/passwd and /etc/group). That contains the function nss_getpwnam
that is called (as you can see from the debug message), whose body is
found in [3].

There you can see where the log message comes from:

	IDMAP_LOG(4, ("nss_getpwnam: name '%s' domain '%s': "
		  "resulting localname '%s'\n", name, domain, localname));

If you backtrace everything through the call stack of both libnfsidmap
and idmapd, the 'name' variable contains the data idmapd got from the
kernel, and if you went digging into the kernel source, you'd see that
it just takes the raw data it got from the NFS client and passes it on
to the idmapper.

So that means that if you look at your log message, the NFS client
transmitted the string '8' (it's a 1-byte UTF-8 string with a single
digit, it's not a binary representation of that number!) to the server;
the server passed that string on to the idmapper, the idmapper then
notices "oh dear, it doesn't contain an @", and so it says "nope,
sorry, can't translate that". See the quoted RFC as to why that
behavior is correct.

Compare that to the case where everything worked:
Jul 10 10:50:34 nfs1 rpc.idmapd[4946]: nss_getpwnam: name
'mail@freesources.org' domain 'freesources.org': resulting localname
'mail'

Here you can see that the owner name that was transmitted by the NFS
client was 'mail@freesources.org' (and not simply '8'), so that does
contain an @; nss_getpwnam can see that the domain name matches
and just strips it, resulting in a user name 'mail', which it looks
up in /etc/passwd, returns the user id (in this case, 8, because it's
the same on client and server) and the server is perfectly happy.
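
The difference between the two log entries can be reproduced with a
small sketch of the check described above. It is not the library's
actual code, only a restatement of the behavior: the function name
strip_domain is chosen for this example, and the case-insensitive
domain comparison is an assumption.

```python
def strip_domain(name, domain):
    """Return the local part of 'user@domain', or None if it does not map."""
    local, sep, name_domain = name.rpartition("@")
    if not sep:
        return None  # no '@' at all, e.g. the bare string '8'
    # Assumption: domains are compared case-insensitively.
    if name_domain.casefold() != domain.casefold():
        return None
    return local


for name in ("8", "mail@freesources.org"):
    localname = strip_domain(name, "freesources.org")
    print(f"name {name!r} -> localname {localname!r}")
```

Only when a local name comes back can it be looked up in /etc/passwd
to obtain the uid.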


So to summarize your problem from this perspective so far:

 - Nothing to do with Kerberos, since that's only for authenticating
   the packets.

 - The NFS server appears to work properly throughout the whole thing.

Makes sense so far :)

 - The NFS client sends just the uid converted to a string in some
   cases instead of the properly translated NFS username, which the
   server then rejects.

So why does the client send the wrong username?

I'll skip the very helpful and extensive explanation here and jump
immediately to the point where I can add valuable information.

So where's the bug?

  - I don't know yet. On a Wheezy kernel I didn't experience the
    mapping failure I described (or at least not even remotely as
    often), whereas the same idmapd/libnfsidmap versions have this
    issue with a wheezy-backports kernel (i.e. Jessie kernel), and
    the slightly newer Jessie versions of idmapd/libnfsidmap have
    the same trouble under Jessie. So my guess is that there's
    something in the kernel that in some circumstances fails and
    then caches the "doesn't have a proper mapping" value that only
    expires after some time.

  - Since you are using just regular nsswitch via /etc/passwd,
    nss_getpwnam should *never* fail in your case, unless you do
    some weird stuff with /etc/passwd at the same time. (In my case,
    users are stored in LDAP, so that introduces a potential further
    problem here, because there could be some problem with the
    network connection.)

  - I had mixed results when it comes to comparing nfsidmap and idmapd.
    The problem in my case is that the error occurs only sporadically
    and was therefore quite hard to pin down for me. I had the
    impression that it kind of depends on the kernel version which of
    these mechanisms works better, but that may only be anecdotal,
    because of the rarity of the problem in my case.

    (I really need to find an easy way to reproduce and specifically
    trigger this issue in a set of VMs - that would really help with
    debugging.)

I guess that I can help here. I have two systems (both in production
mode, but redundant mailservers, so no problem to take them down for a
few minutes), on which the error occurs reliably several times a day, if
not several times an hour. It happens when exim4 tries to write a mail
from its queue to a user's mailbox.

So where does that leave you?

  - You can try to use the nfsidmap mechanism instead of idmapd. It's
    actually really trivial to do that: install the keyutils package
    on the client and that's it. (You may stop idmapd or leave it
    running, shouldn't make a difference if /sbin/request-key exists
    and it supports upcalls via /usr/sbin/nfsidmap.) To disable it
    again, remove the package. (Maybe it's sufficient to comment out
    the line in /etc/request-key.d/id_resolver.conf to disable it but
    keep keyutils installed, but I'm not sure. The package is small
    enough that removing / reinstalling it should be easy enough.)

    Note that /etc/idmapd.conf is used by both tools, so you don't need
    to configure anything.

    Might help, might make the problem worse. As I said above: Since
    my (related) problem wasn't so easy to trigger for me, I'm not
    completely sure.
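
    To see which state a client is in, something like the following
    rough check could be used. It is only a heuristic based on the two
    paths named above; the function name is made up, and treating any
    uncommented line that mentions nfsidmap as "enabled" is an
    assumption, not documented behavior.

```python
import os


def nfsidmap_upcall_configured(
    request_key="/sbin/request-key",
    resolver_conf="/etc/request-key.d/id_resolver.conf",
):
    """Heuristic: request-key exists and the conf has an active nfsidmap line."""
    if not os.path.exists(request_key):
        return False
    try:
        with open(resolver_conf) as conf:
            lines = conf.read().splitlines()
    except OSError:
        return False  # conf missing or unreadable
    return any(
        "nfsidmap" in line and not line.lstrip().startswith("#")
        for line in lines
    )


print(nfsidmap_upcall_configured())
```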

That was easy: I installed keyutils on both systems and rebooted them
(just to be sure). Result: the error vanished. One of the two systems
has had keyutils installed for three days, and the error hasn't occurred
a single time since then. So I would say: the situation improved a lot,
maybe the issue even got fixed completely that way :)

  - If it consistently occurs in your case, maybe you could increase
    the debugging level of the client's idmapd and/or nfsidmap
    (Verbosity in the configuration file is honored by both.) Then
    you can compare the log entries from server and client and try
    to correlate the lookups to see if you find something interesting
    on the client just before the server complaining. (But if it's
    really a kernel bug, debugging the userspace side will not likely
    yield results, but you never know - and maybe the bug is really
    not in the kernel.)
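
    For the server side of that comparison, a short script along these
    lines could pull the failed lookups out of the log. It is a sketch
    that assumes the message format quoted earlier in this thread and
    a classic syslog timestamp prefix; the function name is made up.

```python
import re

# Matches the nss_getpwnam debug line quoted earlier in this thread.
PATTERN = re.compile(
    r"nss_getpwnam: name '(?P<name>[^']*)' domain '(?P<domain>[^']*)': "
    r"resulting localname '(?P<localname>[^']*)'"
)


def failed_lookups(lines):
    """Yield (timestamp, name) for lookups whose localname is '(null)'."""
    for line in lines:
        match = PATTERN.search(line)
        if match and match.group("localname") == "(null)":
            # Assumption: classic syslog prefix, e.g. 'Jul 10 10:46:59'.
            yield line[:15], match.group("name")


sample = [
    "Jul 10 10:46:59 nfs1 rpc.idmapd[4946]: nss_getpwnam: name '8' "
    "domain 'freesources.org': resulting localname '(null)'",
    "Jul 10 10:50:34 nfs1 rpc.idmapd[4946]: nss_getpwnam: name "
    "'mail@freesources.org' domain 'freesources.org': resulting "
    "localname 'mail'",
]
print(list(failed_lookups(sample)))
```

    The resulting timestamps give the points in time to look for in
    the client's log.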

I can do this if you still think it helps, but see below.

  - You could try a newer kernel (e.g. 4.0.8-1 from Debian testing) on
    the client and see if the problem persists with that - maybe
    somebody fixed it in the meantime (possibly accidentally while
    cleaning up code or so - or fixing another issue that seemed
    unrelated or something). Haven't done that myself yet.

Just did that on the other system: I removed keyutils again,
rebooted, waited until the error reappeared, installed the stretch
(4.0.8-1) linux kernel and rebooted into that one. Result: the issue
vanished.

Sorry that I couldn't give you a better answer so far, I'd also really
like to find that bug and finally squash it.

That was one of the most elaborate and extensive replies I've read in
years. It helped me a lot to better understand the problem and even
provided two(!) solutions that work. So no need to apologize :)

Maybe with your deep knowledge and me being able to reproduce the bug in
a reliable way, we're able to find the root cause and squash the bug in
Jessie as well ...

As a first step, I tried to find the code changes to idmap.c between the
Jessie kernel and the current Stretch kernel. If I'm not wrong, then
this is the diff (limited to fs/nfs/idmap.c) between the two kernel
versions:

Changes between 3.16 and 4.0 in kernel code:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/diff/fs/nfs/idmap.c?id=d67ae825a59d639e4d8b82413af84d854617a87e&id2=0c7774abb41bd00d5836d9ba098825a40fa94133

And these are the related NFS patches:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch?id=0c7774abb41bd00d5836d9ba098825a40fa94133
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch?id=f9167789df53f22af771fb6690a3d36aa21d74c5
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch?id=633706a2ee81637be37b6bc02c5336950cc163b5
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch?id=c06cfb08b88dfbe13be44a69ae2fdc3a7c902d81
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch?id=d67ae825a59d639e4d8b82413af84d854617a87e

Cheers,
 jonas

