Bug#808953: linux: xhci regression in backported patch

To: Ben Hutchings <ben@decadent.org.uk>
Cc: Ron <ron@debian.org>, 808953@bugs.debian.org
Subject: Bug#808953: linux: xhci regression in backported patch
From: Ron <ron@debian.org>
Date: Sat, 2 Jan 2016 12:41:31 +1030
Message-id: <[🔎] 20160102021131.GO10797@hex.shelbyville.oz>
Reply-to: Ron <ron@debian.org>, 808953@bugs.debian.org
In-reply-to: <[🔎] 1451676034.28485.1.camel@decadent.org.uk>
References: <20151224175608.13653.11553.reportbug@hex.shelbyville.oz> <[🔎] 1451676034.28485.1.camel@decadent.org.uk>

On Fri, Jan 01, 2016 at 07:20:34PM +0000, Ben Hutchings wrote:
> Control: tag -1 moreinfo
> 
> On Fri, 2015-12-25 at 04:26 +1030, Ron wrote:
> > Source: linux
> > Version: 3.16.7-ckt20-1
> > Severity: important
> > 
> > Hi,
> > 
> > While testing the code from here: http://www.bitbabbler.org I ran into a
> > problem on recently updated Jessie systems where devices plugged into an
> > XHCI port would always time out between transfers when the transfer size
> > was larger than 16kB.  And since it wasn't doing that a few weeks ago, I
> > went digging :)
> > 
> > All the following was tested on amd64.
> > 
> > I can confirm that 3.16.7-ckt17-1 does not have this problem and it's a
> > newly introduced regression that's first seen with the ckt20 patch set.
> > I can also confirm that 4.2.0-0.bpo.1 from jessie-backports does not
> > have this problem.
> > 
> > The cause appears to be a botched backport of
> > e210c422b6fdd2dc123bedc588f399aefd8bf9de (from Linus' repo) because I can
> > also confirm that a locally built kernel from the 3.16.7-ckt20-1 source
> > with only that patch reverted also does not have this problem.
> 
> Are you seeing any error messages in the kernel log?  Do they look
> anything like those in <https://bugs.debian.org/809430>?  If not,
> please send them.

No, nothing outwardly complained to the logs.  The first symptom that I
saw, was after updating to ckt20, devices in XHCI ports dropped to a
ridiculously slow transfer rate (devices ECHI ports were unaffected).

Debug tracing the userspace code showed that transfers were frequently
just timing out (with a 5 second timeout set for libusb requests).  I'd
get one transfer succeed, then the next 2 requests would time out before
another successful transfer would occur.  lather, rinse, repeat.

This is only happening with 'large' read requests though.  If I cap the
transfer request size to 16kB or less, then it appears to work normally,
but anything larger than that hits this problem.

I tested that because I know there was previously a problem with XHCI
controllers and large transfers, which was addressed in this patch to
the kernel 19181bc50e1b8e92a7a3b3d78637c6dc5c0b5a1b, and corresponding
patches to libusb to deal with the fact it couldn't rely on the bulk
continuation flag working with those.  But those had fixed this up
until the ckt20 kernel.

> > If it's really important that we have that patch for some reason, then
> > it looks like (at least) 40a3b775f49c2784c96b19170fd2478e5e5511a1 would
> > also need to be backported too, and maybe some other changes that effect
> > this as well.
> [...]
> 
> Have you tried applying that on top of the current stable kernel
> source?  It doesn't look like a dependency to me.

I haven't no.  I figured the first question was whether we could simply
revert this one, since I couldn't find any report of a bug that it fixed
in 3.16 ...

My suspicion there came from the fact that the patch we have applied
triggers under different conditions to what it would if that prior
patch had also been backported.

In the original source, the patch involved code which looks like:

 if (trb_comp_code == COMP_STOP_SHORT) {
    ...
 } else if (event_trb == td->last_trb) {
+   if (td->urb_length_set && trb_comp_code == COMP_SHORT_TX)

While the backported version was massaged to look like:

 if (trb_comp_code == COMP_STOP_SHORT)
    ...
 if (event_trb == td->last_trb) {
+   if (td->urb_length_set && trb_comp_code == COMP_SHORT_TX)

With the difference coming from that earlier 40a3b7 patch.

I went over all the xhci related patches between ckt17 and ckt20, and
that was the most immediately suspicious thing that caught my eye, and
when reverting it fixed the problem, I stopped there to get your input.

There's a few other non-trivial patches in the upstream kernel tree
between those two, and without spending some serious time to get
familiar with this code and analyse how they are all related in detail,
I didn't feel confident to "just throw a few extra backported patches"
at it - since that could well break something else that I'm not testing
even if it fixes the bug I'm seeing.

I can test if sticking the 'else' back in there changes this behaviour
(which I suspect it might) - but I'm not yet comfortably convinced there
is nothing in the intervening patches this might not also rely on (or
even that doing that would really fix 'everything' here).

But it's an easy test and a small patch, so I'll run it and see what
happens.  (and I'll run some tests on 4.3.3-2 as well after that, since
the only machine I currently have running it has everything bound to
EHCI ports - but #809430, looks different to what I'm seeing.  I think
in my case, things are just stalling by not completing the TD when it
really is in fact completed ...  while that bug looks like a much more
complete failure of something somewhere).

  Cheers,
  Ron

Reply to:

References:
- Bug#808953: linux: xhci regression in backported patch
  - From: Ben Hutchings <ben@decadent.org.uk>

Prev by Date: Processed: forcibly merging 808293 809293
Next by Date: Processed: reassign 808602 to src:linux, found 808602 in linux/3.16.7-ckt20-1+deb8u1
Previous by thread: Processed: Re: Bug#808953: linux: xhci regression in backported patch
Next by thread: Processed: retitle 808916 to WARN in drm_atomic_check_only() during boot, Intel GM965 (gen4)
Index(es):
- Date
- Thread