
Bug#1064811: libhipsparse0-tests: HIPSPARSE_STATUS_INTERNAL_ERROR



Hi Christian,

On 2024-02-26 00:54, Christian Kastner wrote:
> On gfx1031/gfx1032/gfx1034, there are numerous occurrences of
> HIPSPARSE_STATUS_INTERNAL_ERROR, see [1] for a full log. Interestingly,
> only some of them lead to test failures (some examples below), and
> sometimes there is more than one occurrence per test.
>
> These passed on gfx900/gfx1030 so I don't immediately suspect my update
> to the optional-test-matrices resp. allow-missing-matrix-data-in-tests
> patches to be the cause.

If it works on gfx1030, but fails on gfx1031, gfx1032 and gfx1034, it is almost certainly caused by run-time dispatching in rocprim [2]. I would ignore this issue until rocprim passes its tests on those architectures.

We only build for gfx1030, so the problem is that when running on gfx103{1,2,4}, rocPRIM dynamically checks the current GPU architecture and dispatches to something other than the gfx1030 implementation. If you'd like to verify that this is the problem, you can run the tests with the environment variable HSA_OVERRIDE_GFX_VERSION=10.3.0 on gfx1031, gfx1032 or gfx1034 hardware.

The solution is tricky, though, because rocPRIM is a header-only library. We can't use a solution like in rocBLAS [3] where we force gfx1031 to use gfx1030 code objects, because librocprim-dev users might be building their code for gfx1031... unless maybe we hide that dispatch behaviour behind an #ifdef that rocsparse can define.
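
To make the #ifdef idea a bit more concrete, here is a rough sketch. It is not rocPRIM's actual code: the enum, the function name and the macro SKETCH_TREAT_GFX103X_AS_GFX1030 are all hypothetical, and a real patch would have to hook into whatever the dispatch in [2] does. The point is only that the default behaviour stays unchanged for librocprim-dev users, while a library that ships only gfx1030 code objects could opt in to the mapping.

```cpp
#include <string>

// Hypothetical stand-in for the architectures a header-only library can
// dispatch to. This is NOT rocPRIM's real enum.
enum class target_arch { gfx900, gfx1030, unknown };

// Hypothetical opt-in macro. A downstream library that only ships gfx1030
// code objects would define it before including the header:
// #define SKETCH_TREAT_GFX103X_AS_GFX1030

// Map the architecture name reported at run time to a dispatch target.
inline target_arch arch_from_name(const std::string& name)
{
    if (name == "gfx900")
        return target_arch::gfx900;
    if (name == "gfx1030")
        return target_arch::gfx1030;
#ifdef SKETCH_TREAT_GFX103X_AS_GFX1030
    // Opt-in only: treat the other gfx103x parts as gfx1030.
    if (name == "gfx1031" || name == "gfx1032" || name == "gfx1034")
        return target_arch::gfx1030;
#endif
    // Default: anything else is not recognised, which is where the
    // dispatch can end up away from the gfx1030 implementation.
    return target_arch::unknown;
}
```

With the macro undefined, arch_from_name("gfx1031") yields target_arch::unknown; with it defined, the same call yields target_arch::gfx1030.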

Sincerely,
Cory Bloor

[2]: https://salsa.debian.org/rocm-team/rocprim/-/blob/debian/5.7.1-1/rocprim/include/rocprim/device/config_types.hpp?ref_type=tags#L247
[3]: https://salsa.debian.org/rocm-team/rocblas/-/blob/debian/5.5.1+dfsg-4/debian/patches/0012-expand-isa-compatibility.patch?ref_type=tags

