Bug#808953: xhci regression for large transfers (commit e210c422b)

To: Ron <ron@debian.org>
Cc: linux-usb@vger.kernel.org, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, 808953@bugs.debian.org
Subject: Bug#808953: xhci regression for large transfers (commit e210c422b)
From: Mathias Nyman <mathias.nyman@intel.com>
Date: Mon, 11 Jan 2016 14:14:58 +0200
Message-id: <[🔎] 56939CC2.8070501@intel.com>
Reply-to: Mathias Nyman <mathias.nyman@intel.com>, 808953@bugs.debian.org
In-reply-to: <[🔎] 20160108151916.GB10797@hex.shelbyville.oz>
References: <[🔎] 20160102063255.GQ10797@hex.shelbyville.oz> <[🔎] 568E8661.9040202@intel.com> <[🔎] 20160107162228.GX10797@hex.shelbyville.oz> <[🔎] 20160107163920.GY10797@hex.shelbyville.oz> <[🔎] 20160107184724.GZ10797@hex.shelbyville.oz> <[🔎] 568FC341.8070202@intel.com> <[🔎] 20160108151916.GB10797@hex.shelbyville.oz>

On 08.01.2016 17:19, Ron wrote:

On Fri, Jan 08, 2016 at 04:10:09PM +0200, Mathias Nyman wrote:

On 07.01.2016 20:47, Ron wrote:

On Fri, Jan 08, 2016 at 03:09:20AM +1030, Ron wrote:

On Fri, Jan 08, 2016 at 02:52:28AM +1030, Ron wrote:

On Thu, Jan 07, 2016 at 05:38:09PM +0200, Mathias Nyman wrote:

Hi

On 02.01.2016 08:32, Ron wrote:


Hi,

It appears the commit e210c422b6fdd2dc123bedc588f399aefd8bf9de
"xhci: don't finish a TD if we get a short transfer event mid TD"
is causing transfers larger than 16kB to be unreliable.

If I limit transfers to be no larger than 16kB, then it also works as
expected in an XHCI port with an unmodified build of Linus' current
head (v4.4-rc7-76-g9c982e8), but transfers larger than that do not.
I see an alternating cycle of a successful transfer, followed by two
that will time out waiting in libusb (with a 5 second timeout set),
before getting another successful transfer and the cycle repeating.

I can run more tests and dig into this deeper if the reason for it
isn't immediately obvious in hindsight.


Thanks for the info,
I can't spot anything obvious, but my brain might still be in vacation mode.

Could you reproduce it with the attached patch, it only adds extra debugging?

We should either see no output, or the following sequence:

  1. "mid bulk/intr SP, wait for last TRB event"
  2. "last trb has length set"
  3. "and last trb is SHORT_TX, OK"



I guess one out of 3 ain't good ...  all I see logged is:

  [   60.015708] xhci_hcd 0000:04:00.0: mid bulk/intr SP, wait for last TRB event
  [   65.017374] xhci_hcd 0000:04:00.0: mid bulk/intr SP, wait for last TRB event
  [   70.455451] xhci_hcd 0000:04:00.0: mid bulk/intr SP, wait for last TRB event
  [   75.456248] xhci_hcd 0000:04:00.0: mid bulk/intr SP, wait for last TRB event

I'm passing 5 seconds to libusb as the requested timeout.


And if I limit the maximum transfer size to 16kB (the above was with
64kB transfers), then I see nothing logged at all.

So if that code does indeed look sane, perhaps the issue is really
in the code that's splitting large transfers doing something funny?


So, tracing this a bit deeper, what I'm seeing is that in the case of
a transfer of 16kB or less, there's only one TRB in the TD, so it's
always the last_trb and the wait code never triggers (ie. basically
the same code path as with this patch reverted).

In the case of a larger transfer, there are N / 16kB TRBs in the TD,
and in the case where it stalls, I get a short transfer on the first
TRB, and then it just hangs waiting until it times out, reporting:

  xhci_hcd ep 0x81 - asked for 65536 bytes, 16382 bytes untransferred

with event_trb == td->first_trb.


What's still not clear to me yet is which assumption is broken.
Clearly we aren't actually getting more TRB events after the short
transfer.

And the other thing which possibly makes this odd is that the FTDI will
always return 2 bytes in a packet, even when there "no data" in it,
since it prefixes each packet with 2 "modem status" bytes.  Which means
the 16382 untransferred above is it returning 2 status bytes and none
of the data that we actually requested from it (which is a normal thing
for it to do sometimes, and if we issue another read the data will
eventually come - but I don't yet know if that's also contributing here).


Sorry for the trickle of small facts, I'm still wrapping my head around
everything I need to know to understand this properly.


This is all good information.

We mark all TRBs for incoming bulk transfers as "interrupt on short transfer"
and the last TRB of the TD as "interrupt on completion"

If the we get a short packet event mid TD xhc should create two events, the first
one for the trb that was short, and a second one one the last trb, with the "interrupt on completion"


There is one thing we need to be careful of here, with the FTDI at least.
Once we get a short packet, we do actually need to end the transaction
there and not append any further data to it.  The reason for that is
those modem status bytes.  The only way to find them and filter them from
the actual data you want is to split the received transfer back into
wMaxPacketSize'd blocks and remove the first 2 bytes from each packet,
since it prepends them to every packet.

If a subsequent trb after the short was to add more data to that, we'd
lose that framing and not be able to extract the data from it reliably
(since we wouldn't know where the short in the middle ended).

I'm not sure if the events we expect in this case already cover that,
but it seemed worth mentioning in case they might return more data.

There was an errata added to the specs specifying in more detail what this second events status should be,
but the second event should be there anyways. controller will "fast forward" through the remains of the TD after
a short packet, generating the event.


If I understand that part though, then we shouldn't actually get more
data, just a final completion event for the whole TD ...

It's possible that something goes wrong in updating the dequeue pointer of either the endpoint ring or
event ring, leading to handling the wrong event, or wrong trb for that endpoint (perhaps handling the same "mid TD" TRB twice?)

I have yet another patch. This one prints out the content of the short transfer event, and the event following a short transfer.


Ok, here's a longer dump of what that outputs, which I'll annotate to
show what's actually going on with the device at each point.

The extra values in parens of the "asked for" lines are
(td->last_trb - td->first_trb), and (event_trb - td->first_trb)
which was how I tracked which trb it was stalling on earlier.

The logs shows that we never get the second event for the TD,

The transfer ring and corresponding events:
EP TRB addr, event, comment

d207cce0, 01(Success), Previous event was a SHORT
d207ccf0, 0d(SHORT ), only TRB of 1 long TD, returned
-new TD-
d207cd00, 0d(SHORT ), 1st TRB of 4 long TD. Previous event was SHORT. wait
d207cd10, , no event
d207cd20, , no event
d207cd30, , no event
-new TD-
d207cd40, 0d(SHORT), 1st TRB of 4 long TD. Previous event was short. wait
d207cd50, , no event
d207cd60, , no event
d207cd70, , no event
-new TD-
d207cd80, , no event
d207cd90, , no event
d207cda0, , no event
d207cdb0, 01(Success), Previous event was short
- new TD -
d207cdc0, 0d(SHORT), only TRB in 1 long TD
- new TD -
...

There are no events for the TDs last TRB at d207cd30 or d207cd70.
An event would have been logged as the previous events were short transfers (at d207cd00 and d207cd40)

Looks like some controllers are not following the specs and we can't rely on the second event.

So the patch should be reverted, and instead handle the controllers that work according to specs
as a special case. So we need to return the TD after the first SHORT event, and then make sure a possible
second event doesn't mess up anything.

Just for reference, xhci specs 1.0 section 4.10.1.1 specify that:

"If the Short Packet occurred while processing a Transfer TRB with only an ISP flag set, then
two events shall be generated for the transfer; one for the Transfer TRB that the short
Packet occurred on, and a second for the last TRB with the IOC flag set.

Software shall not interpret an Short Packet Event as indicating that the TD that it is
associated with is “complete”, unless the TRB Pointer field of the Transfer Event references
the last TRB of the TD."

ISP = interrupt on short packet ( we set for all bulk in TRBs)
IOC = interupt on completion ( we set for last TRB in TD)

-Mathias

Reply to:

Follow-Ups:
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Ron <ron@debian.org>

References:
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Ron <ron@debian.org>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Mathias Nyman <mathias.nyman@intel.com>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Ron <ron@debian.org>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Ron <ron@debian.org>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Ron <ron@debian.org>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Mathias Nyman <mathias.nyman@intel.com>
- Bug#808953: xhci regression for large transfers (commit e210c422b)
  - From: Ron <ron@debian.org>

Prev by Date: linux_3.16.7-ckt20-1+deb8u2~bpo70+1_multi.changes ACCEPTED into wheezy-backports->backports-policy, wheezy-backports
Next by Date: Bug#810154: remove busybox hook, leave responsibility to busybox package
Previous by thread: Bug#808953: xhci regression for large transfers (commit e210c422b)
Next by thread: Bug#808953: xhci regression for large transfers (commit e210c422b)
Index(es):
- Date
- Thread