
Re: Navi 12 on Debian (was: August ROCm Package Testing Results)



Hi Cory,

On 2023-08-30 01:50, Cordell Bloor wrote:
> I updated the supported GPU list [1] and my logs [2] with the gfx1011
> results. I saw no problems, aside from a rocSPARSE test case that
> slightly exceeded the specified tolerance. That test failure is seen on
> gfx1030, too. We can probably just increase the tolerance slightly.

I've been meaning to initiate a discussion on how to best deal with test
failures. For example, rocsparse [1] fails one test on gfx1034, with a
result that ever so slightly exceeded the specified tolerance.

Ideally, we'd have some process for dealing with test failures on our
end and, more importantly, for reporting them back.

On our end, I have not yet done a deep dive on rocsparse, and I have not
found a trivial way to increase the tolerance, but that's mostly
because I've been far removed from C++ for ages now.
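
To illustrate what "slightly exceeding the tolerance" means, here is a
minimal sketch of a relative-tolerance check. This is not rocSPARSE's
actual comparison code; the helper name near() and all numbers are
made up for illustration.

```python
import math


def near(expected, actual, rel_tol):
    """Return True if actual is within rel_tol (relative error) of expected."""
    # abs_tol=0.0 so that only the relative tolerance decides the outcome.
    return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=0.0)


# Hypothetical values: the result is off by about 1.2e-6 in relative terms.
expected = 1.0
actual = 1.0 + 1.2e-6

strict = near(expected, actual, rel_tol=1e-6)   # fails: error exceeds 1e-6
relaxed = near(expected, actual, rel_tol=2e-6)  # passes with a looser bound
print(strict, relaxed)
```

Whether loosening the bound is the right fix depends on why the result
drifts, which is the part that would need the deep dive.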

On reporting them back, you once mentioned:

On 2023-07-12 22:53, Cordell Bloor wrote:
> I asked the developers of amdgpu/kfd for some tips on writing a good report.
> 
> Their consensus was that the two most important things are to provide a reproducer (i.e., a clearly described method to reproduce the problem) and the full dmesg log (as it contains lots of info like kernel version, VBIOS version, etc).

One method that would be very simple for us to implement would be to
extend the test runner (usually debian/tests/upstream-binaries) to
include dmesg output as a test artifact. That would give us a test
binary (from -tests) and the log.
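
A rough sketch of what such a helper could look like, assuming the
runner is executed under autopkgtest, which exports the
AUTOPKGTEST_ARTIFACTS directory. The function name save_dmesg() and the
file name dmesg.log are my own placeholders, and the real runner may
well be a shell script rather than Python.

```python
import os
import subprocess
from pathlib import Path


def save_dmesg(artifacts_dir, cmd=("dmesg",)):
    """Run cmd and store its output as dmesg.log in artifacts_dir.

    cmd is a parameter only so the helper can be exercised without
    access to the kernel log; by default it runs plain dmesg.
    """
    out = Path(artifacts_dir) / "dmesg.log"
    out.parent.mkdir(parents=True, exist_ok=True)
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, check=False
        )
        # Keep stderr too, so a permission error shows up in the artifact.
        text = result.stdout + result.stderr
    except OSError as exc:
        text = f"could not run {cmd[0]}: {exc}\n"
    out.write_text(text)
    return out


if __name__ == "__main__":
    # autopkgtest sets this variable; do nothing when run outside of it.
    artifacts = os.environ.get("AUTOPKGTEST_ARTIFACTS")
    if artifacts:
        save_dmesg(artifacts)
```

One caveat: on systems with kernel.dmesg_restrict=1, reading the kernel
log requires root, so the test might need the needs-root restriction.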

Creating a minimal test case manually would be far more effort, and
somewhat pointless -- after all, the failing test is already in the test
suite.

We could try reporting the rocsparse bug above, and see if some regular
process crystallizes from that.

Best,
Christian

[1] https://ci.rocm.debian.net/packages/r/rocsparse/unstable/amd64+gfx1030/

