
Re: Added gfx1034 worker



On 2023-08-27 10:49, Christian Kastner wrote:
> I've added [1] a gfx1034 (RX 6500 XT) worker to the pool.

So all tests have run except for rocfft, and now I've found the issue:
on gfx1034, one of the tests triggers a soft lockup.

I had assumed that the non-completing tests were caused either by the
timeout issue I mentioned earlier or by job queue processing (which has
subtle gotchas), so I never looked at the worker itself.

The worker also didn't halt; it just rebooted [1], so I never noticed it
going away briefly [2].

I can reliably trigger the issue (here, reproduced in a container, to
rule out PCIe pass-through as the cause):

> [ RUN      ] pow2_2D/accuracy_test.vs_fftw/complex_forward_len_256_2048_double_op_batch_2_istride_2048_1_CP_ostride_2048_1_CP_idist_524288_odist_524288_ioffset_0_0_ooffset_0_0
> [       OK ] pow2_2D/accuracy_test.vs_fftw/complex_forward_len_256_2048_double_op_batch_2_istride_2048_1_CP_ostride_2048_1_CP_idist_524288_odist_524288_ioffset_0_0_ooffset_0_0 (19 ms)
> [ RUN      ] pow2_2D/accuracy_test.vs_fftw/complex_forward_len_256_2048_double_op_batch_2_istride_2048_1_CP_ostride_2048_1_CI_idist_524288_odist_524288_ioffset_0_0_ooffset_0_0
> [       OK ] pow2_2D/accuracy_test.vs_fftw/complex_forward_len_256_2048_double_op_batch_2_istride_2048_1_CP_ostride_2048_1_CI_idist_524288_odist_524288_ioffset_0_0_ooffset_0_0 (19 ms)
> [ RUN      ] pow2_2D/accuracy_test.vs_fftw/complex_forward_len_256_2048_double_op_batch_2_istride_2048_1_CI_ostride_2048_1_CP_idist_524288_odist_524288_ioffset_0_0_ooffset_0_0
> [       OK ] pow2_2D/accuracy_test.vs_fftw/complex_forward_len_256_2048_double_op_batch_2_istride_2048_1_CI_ostride_2048_1_CP_idist_524288_odist_524288_ioffset_0_0_ooffset_0_0 (16 ms)
> [ RUN      ] pow2_2D/accuracy_test.vs_fftw/complex_forward_len_256_2048_double_ip_batch_1_istride_2048_1_CI_ostride_2048_1_CI_idist_524288_odist_524288_ioffset_0_0_ooffset_0_0
> 
> Message from syslogd@ci-worker-ckk02 at Sep  5 16:46:30 ...
>  kernel:[ 3688.658804] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kworker/u64:1:126875]
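
A minimal sketch, assuming the test output and the kernel message end up
in the same captured text log, of how the hanging test could be picked
out automatically: it pairs every gtest "[ RUN ]" line with a later
"[ OK ]" or "[ FAILED ]" line and reports tests that never finished,
along with any soft-lockup messages. The function name, the sample log
and its test names are hypothetical and only serve as illustration.

```python
import re

# gtest markers as they appear in the log above, e.g. "[ RUN      ] name"
RUN_RE = re.compile(r"\[ RUN\s+\] (\S+)")
DONE_RE = re.compile(r"\[\s+(?:OK|FAILED)\s+\] (\S+)")
# Kernel watchdog message, e.g. "... soft lockup - CPU#1 stuck for 26s!"
LOCKUP_RE = re.compile(r"watchdog: BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!")


def summarize(log_text):
    """Return (unfinished_tests, lockups) for a captured log.

    unfinished_tests: tests with a RUN line but no OK/FAILED line.
    lockups: list of (cpu, seconds) tuples from soft-lockup messages.
    """
    started, finished, lockups = [], set(), []
    for line in log_text.splitlines():
        line = line.lstrip("> ")  # tolerate email-style quoting
        if (m := RUN_RE.search(line)):
            started.append(m.group(1))
        elif (m := DONE_RE.search(line)):
            finished.add(m.group(1))
        elif (m := LOCKUP_RE.search(line)):
            lockups.append((int(m.group(1)), int(m.group(2))))
    unfinished = [name for name in started if name not in finished]
    return unfinished, lockups


# Hypothetical, shortened log in the same shape as the one quoted above.
SAMPLE = """\
> [ RUN      ] suite/test.a
> [       OK ] suite/test.a (19 ms)
> [ RUN      ] suite/test.b
>
>  kernel:[ 3688.658804] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kworker/u64:1:126875]
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Run against the log above, the only unfinished test would be the last
"[ RUN ]" entry (the double_ip_batch_1 case), which is the one that
coincides with the lockup.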

I think this is a valuable find, and consider it motivation to extend
the GPU arch coverage even further.

Best,
Christian

[1] No idea if that's usual or watchdog-triggered, I've never
    encountered soft lockups before.
[2] That reminds me: none of the hosts have any sort of monitoring, so I
    need to add some.

