
Re: Bug#1056170: libhsa-runtime64-1: ROCr must assume xnack is disabled
I've included some additional information on the ticket and have been discussing this with the upstream developers. I'll summarize the information here.

On 2023-11-18 00:39, Cordell Bloor wrote:
> Each time a HIP application is executed, the rocr-runtime prints the message:
>
>     KFD does not support xnack mode query.
>     ROCr must assume xnack is disabled.
>
> It is unclear to me whether something is actually wrong or not. This
> message is emitted from a debug_print statement in amd_topology.cpp. An
> example of this message can be found in the CI logs [1].

This is a debug message. It is guarded by NDEBUG, so it would not be printed if rocr were built in Release mode. There is some discussion upstream about whether the debug_print should be guarded by an environment variable rather than by a preprocessor definition.
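A minimal C++ sketch of the two guarding options under discussion. It assumes nothing about rocr's actual debug_print implementation: the function names and the EXAMPLE_ROCR_DEBUG environment variable are hypothetical, chosen only for illustration. The first function's body is removed by the preprocessor when NDEBUG is defined (CMake defines it for Release builds). The second is always compiled in but prints only when the user opts in at runtime.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Option A (compile-time guard): the body is removed by the preprocessor
// when NDEBUG is defined, as it is in a CMake Release build.
void debug_print_compile_time(const char* msg) {
#ifndef NDEBUG
  std::fputs(msg, stderr);
#else
  (void)msg;  // silence unused-parameter warnings in Release builds
#endif
}

// Returns true if the given environment variable value requests debug
// output: set, non-empty, and not "0".
bool runtime_debug_enabled(const char* env_value) {
  return env_value != nullptr && env_value[0] != '\0' &&
         std::strcmp(env_value, "0") != 0;
}

// Option B (runtime guard): always compiled in, but printed only when the
// hypothetical EXAMPLE_ROCR_DEBUG environment variable is set.
void debug_print_runtime(const char* msg) {
  if (runtime_debug_enabled(std::getenv("EXAMPLE_ROCR_DEBUG"))) {
    std::fputs(msg, stderr);
  }
}
```

With option B, a call such as `debug_print_runtime("KFD does not support xnack mode query.\n");` stays silent for ordinary users of HIP applications, while someone debugging a packaged build could still see the message by setting the variable.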

> If there's something wrong with KFD, then that problem should be
> reported to the kernel developers. If there's nothing wrong with KFD,
> then this message should be suppressed.

The Linux kernel on Debian is built without HSA_AMD_SVM enabled. That is the Kconfig option for "Enable HMM-based shared virtual memory manager", which is required for xnack+ operation. The xnack feature allows some AMD GPUs to retry memory accesses that fail due to a page fault; this is the mechanism used to migrate managed memory automatically from host to device. With xnack disabled, page faults in device code are not recoverable [1].
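To check whether a given kernel was built with this option, one can look for the line CONFIG_HSA_AMD_SVM=y in its build configuration. The C++ sketch below assumes the Debian convention of installing each packaged kernel's config as /boot/config-<release>; the function names are hypothetical. A disabled option appears either as a `# CONFIG_HSA_AMD_SVM is not set` comment or not at all, so only an exact `=y` line counts as enabled.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>
#include <sys/utsname.h>

// Returns true if the kernel config text contains the line "<option>=y".
bool kconfig_option_enabled(std::istream& config, const std::string& option) {
  const std::string wanted = option + "=y";
  std::string line;
  while (std::getline(config, line)) {
    if (line == wanted) return true;
  }
  return false;
}

// Path of the config for the running kernel, following the Debian layout
// /boot/config-<release>. Returns "" if uname() fails.
std::string running_kernel_config_path() {
  struct utsname info;
  if (uname(&info) != 0) return "";
  return std::string("/boot/config-") + info.release;
}

// Returns 1 if CONFIG_HSA_AMD_SVM=y, 0 if not, -1 if the config is unreadable.
int running_kernel_has_hsa_amd_svm() {
  std::ifstream config(running_kernel_config_path());
  if (!config) return -1;
  return kconfig_option_enabled(config, "CONFIG_HSA_AMD_SVM") ? 1 : 0;
}
```

On a stock Debian kernel, as described above, `running_kernel_has_hsa_amd_svm()` would be expected to return 0.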

Sincerely, Cory Bloor

[1]: https://niconiconi.neocities.org/tech-notes/xnack-on-amd-gpus/
