Re: Added gfx1034 worker

To: debian-ai@lists.debian.org
Subject: Re: Added gfx1034 worker
From: Christian Kastner <ckk@debian.org>
Date: Tue, 5 Sep 2023 20:16:33 +0200
Message-id: <5477e38a-45fa-b9f9-5d20-a59c01281141@debian.org>
In-reply-to: <71eaf374-0413-ba39-838c-7ae40956bd29@debian.org>
References: <1409e3fc-bc68-290c-223a-e53f2a34f974@debian.org> <71eaf374-0413-ba39-838c-7ae40956bd29@debian.org>

On 2023-08-27 10:49, Christian Kastner wrote:
> I've added [1] a gfx1034 (RX 6500 XT) worker to the pool.

So all tests have run except for rocfft, and now I've found the issue:
on gfx1034, one of the tests triggers a soft lockup.

I had been assuming that the non-completing tests had something to do
with either the timeout issue I mentioned earlier, or with job queue
processing (which has subtle gotchas), so I never looked at the worker.

And the worker didn't halt; it just rebooted [1], so I never noticed it
going away briefly [2].

I can reliably trigger the issue (here, reproduced in a container, to
rule out PCIe pass-through as the cause):

> [ RUN      ] pow2_2D/accuracy_test.vs_fftw/complex_forward_len_256_2048_double_op_batch_2_istride_2048_1_CP_ostride_2048_1_CP_idist_524288_odist_524288_ioffset_0_0_ooffset_0_0
> [       OK ] pow2_2D/accuracy_test.vs_fftw/complex_forward_len_256_2048_double_op_batch_2_istride_2048_1_CP_ostride_2048_1_CP_idist_524288_odist_524288_ioffset_0_0_ooffset_0_0 (19 ms)
> [ RUN      ] pow2_2D/accuracy_test.vs_fftw/complex_forward_len_256_2048_double_op_batch_2_istride_2048_1_CP_ostride_2048_1_CI_idist_524288_odist_524288_ioffset_0_0_ooffset_0_0
> [       OK ] pow2_2D/accuracy_test.vs_fftw/complex_forward_len_256_2048_double_op_batch_2_istride_2048_1_CP_ostride_2048_1_CI_idist_524288_odist_524288_ioffset_0_0_ooffset_0_0 (19 ms)
> [ RUN      ] pow2_2D/accuracy_test.vs_fftw/complex_forward_len_256_2048_double_op_batch_2_istride_2048_1_CI_ostride_2048_1_CP_idist_524288_odist_524288_ioffset_0_0_ooffset_0_0
> [       OK ] pow2_2D/accuracy_test.vs_fftw/complex_forward_len_256_2048_double_op_batch_2_istride_2048_1_CI_ostride_2048_1_CP_idist_524288_odist_524288_ioffset_0_0_ooffset_0_0 (16 ms)
> [ RUN      ] pow2_2D/accuracy_test.vs_fftw/complex_forward_len_256_2048_double_ip_batch_1_istride_2048_1_CI_ostride_2048_1_CI_idist_524288_odist_524288_ioffset_0_0_ooffset_0_0
> 
> Message from syslogd@ci-worker-ckk02 at Sep  5 16:46:30 ...
>  kernel:[ 3688.658804] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kworker/u64:1:126875]

I think this is a valuable find, and consider it motivation to extend
the GPU arch coverage even further.

Best,
Christian

[1] No idea if that's usual or watchdog-triggered; I've never
    encountered soft lockups before.
[2] That reminds me: none of the hosts have any sort of monitoring, so I
    need to add some.
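
As a starting point for such monitoring, the watchdog line quoted above
could be picked out of a kernel log with something like the following.
This is only a minimal sketch: the function name find_soft_lockups and
the dictionary keys are made up for illustration, and the pattern is
modelled solely on the one message seen here, so other kernel versions
may word it differently.

```python
import re

# Pattern modelled on the single watchdog message quoted in this mail.
# The group names are illustrative, not part of any kernel interface.
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per line that contains a soft lockup report."""
    hits = []
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            hits.append({
                "cpu": int(m["cpu"]),
                "seconds": int(m["secs"]),
                "task": m["task"],   # task name may itself contain ':'
                "pid": int(m["pid"]),
            })
    return hits


# The line from the log above, used as sample input.
sample = [
    "kernel:[ 3688.658804] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! "
    "[kworker/u64:1:126875]",
]
print(find_soft_lockups(sample))
```

Feeding it lines from the journal or a serial console capture would at
least make a lockup followed by a reboot visible after the fact.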
