
Re: Preparing Argo and Lyra for the CI (Was: Preparing Ursa and Lyra for the CI)



Hi Cory, all,

On 2023-11-01 03:46, Cordell Bloor wrote:
> One failing test suite is that of rocfft, which times out after five 
> hours [2]. These old servers have terrible single-thread performance,
> so it takes a long time to run the rocfft test suite.
FYI, the limit of 5h was arbitrarily chosen by me, so we can always
increase it.

The reason why I chose 5h, instead of 10h or 24h or whatever, is that
the limit is global, and we do have test suites where at least one test
hangs in an infinite loop, namely rocthrust [4]. (Incidentally, it
passes on gfx803, but in under 2 min, which must also be a bug.)

Making timeouts package-specific was recently discussed on the
debian-ci list [5]. I've created an issue to add this to our debci [6];
in fact, this shouldn't be much work.
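
To illustrate the idea, here is a minimal sketch of a per-package timeout
lookup with a global fallback. The table, the function name and the
override values are all hypothetical; this is not how debci is configured
today, only the 5h default is taken from our current setup.

```python
# Hypothetical per-package timeout table; names and values are
# illustrative, not debci's actual configuration.
DEFAULT_TIMEOUT_S = 5 * 60 * 60  # the current global 5h limit

PACKAGE_TIMEOUTS_S = {
    "rocfft": 10 * 60 * 60,    # slow suite on old hardware (example value)
    "rocthrust": 1 * 60 * 60,  # known to hang, so fail early (example value)
}


def timeout_for(package, overrides=PACKAGE_TIMEOUTS_S, default=DEFAULT_TIMEOUT_S):
    """Return the timeout in seconds for a package, falling back to the default."""
    return overrides.get(package, default)
```

With something like this, a long-running suite gets more time without
letting a hanging suite block a worker for just as long.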

> The rocsparse, rocblas, and rocsolver packages are also failing.
> Those tests crash with the error "Illegal instruction" [3]. I've not
> yet determined the cause of this problem, but it does not occur when
> the QEMU CPU model is configured as pass-through. It's not clear to
> me why this problem is not seen on the gfx1030 CI machine.

Found it: the autopkgtest QEMU machinery treats AMD and Intel CPUs
differently [7]. ci-worker-ckk{01,02} both have AMD CPUs, and these are
passed through as-is. Intel CPUs get a more complex configuration, to
enable nested KVM.

I think the nested KVM feature is used by the official ci.debian.net (at
least for some workers: Cloud -> Debian VM -> autopkgtest VM), so it
might be a model-specific thing.

In any case, making the CPU configurable seems like something that might
be worthwhile to add to src:autopkgtest. I could add that to our fork
for now. Would that help?
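
Roughly, I'm thinking of something along these lines. This is only a
sketch under assumptions, not the actual code in [7]: the function name,
the override parameter and the Intel placeholder model are made up for
illustration. The only real pieces are QEMU's "-cpu" option, its "host"
pass-through model, and the vendor strings from /proc/cpuinfo.

```python
# Placeholder only: NOT the model string autopkgtest really uses for Intel.
INTEL_PLACEHOLDER_MODEL = "kvm64"


def cpu_args(vendor_id, override=None):
    """Build the QEMU '-cpu' arguments, letting an explicit override win.

    vendor_id is the 'vendor_id' value from /proc/cpuinfo,
    e.g. 'AuthenticAMD' or 'GenuineIntel'.
    """
    if override:
        # Hypothetical new knob: the admin picks the CPU model explicitly.
        return ["-cpu", override]
    if vendor_id == "AuthenticAMD":
        # AMD hosts: pass the host CPU through unchanged.
        return ["-cpu", "host"]
    # Everything else: stand-in for the more complex nested-KVM setup.
    return ["-cpu", INTEL_PLACEHOLDER_MODEL]
```

On Argo and Lyra, the override could then be set to "host", which matches
the pass-through configuration you found to avoid the crash.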

> Argo and Lyra use a combined total of 450 W at idle, so I might shut 
> them down when the job queue is empty. I'm sure we can do something 
> clever with IPMI to only boot the systems when they're needed, but
> for now I'll handle it manually.
Wow, that's a lot!

I'll keep this in mind for the meta-worker I'm hacking on. The normal
debci worker is an unbounded listener (on one queue). The meta-worker is
a bounded listener (on multiple queues) and can perform actions when
boundaries are met.

In practice, that means that you could e.g. have the host wake up
periodically via a BIOS RTC alarm, and have the meta-worker initiate
shutdown once all its relevant queues are empty.
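
As a toy model of the bounded listener: the sketch below uses in-memory
queues from the Python standard library as a stand-in for the real job
queues, and the names run_bounded, handle and on_idle are hypothetical.
The actual meta-worker is still work in progress and will look different.

```python
import queue


def run_bounded(queues, handle, on_idle):
    """Serve jobs from several queues until all are empty, then call on_idle.

    Returns the number of jobs handled. Unlike an unbounded listener,
    this function returns once the boundary (all queues empty) is met.
    """
    handled = 0
    while True:
        progressed = False
        for q in queues:
            try:
                job = q.get_nowait()
            except queue.Empty:
                continue  # nothing pending here, try the next queue
            handle(job)
            handled += 1
            progressed = True
        if not progressed:
            # Boundary met: every queue was empty in this pass.
            on_idle()  # e.g. initiate host shutdown
            return handled
```

Here on_idle is where a shutdown would be triggered, with the RTC alarm
taking care of the next wake-up.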

Best,
Christian

> [1]: https://ci.rocm.debian.net/
> [2]: https://ci.rocm.debian.net/packages/r/rocfft/unstable/amd64+gfx900/
> [3]: https://ci.rocm.debian.net/packages/r/rocblas/unstable/amd64+gfx900/

[4] https://ci.rocm.debian.net/packages/r/rocthrust/
[5] https://lists.debian.org/debian-ci/2023/10/msg00008.html
[6] https://salsa.debian.org/rocm-team/rocm-team-infra/-/issues/5
[7] https://salsa.debian.org/ci-team/autopkgtest/-/blob/master/lib/autopkgtest_qemu.py#L145

