
Bug#1032899: unblock: rocm-hipamd/5.2.3-6



Hi all,

I feel responsible for several of the issues listed by Paul, as
my earlier activity matches the time frame of some of the
changes and problems.

Christian Kastner, on 2023-03-16:
> On 2023-03-16 10:31, Paul Gevers wrote:
> > Control: tags -1 moreinfo
> >
> > On 16-03-2023 00:16, Christian Kastner wrote:
> >
> > For next time, can you please contact us earlier? We could
> > have solved the earlier problems in testing-proposed-updates (in
> > January), then we would now be in a better position.
> 
> I didn't think of that solution as the RC-blocked dependency was only
> available in unstable, and admittedly because I thought this would
> resolve itself in time.
> 
> But in any case: yes, earlier contact would have been helpful, and I'll
> do so in future.

Acknowledged.  I must admit I had a similar perception of the
situation when I sloppily checked the migration status two
months ago, and it didn't immediately occur to me that it would
become an entangled migration problem during the hard freeze.
I'm sorry about that.

> > By the way, I checked, but none of the ci.d.n host will run any of 
> > your tests, as none of them has an amdgpu (is that a thing you could 
> > expect on non-amd architectures by the way?).
> 
> Correct! Tests will be skipped on official infra.
> 
> It's not just a matter of the missing hardware (we have it, but DSA has
> understandable concerns), it's also about how to even express that a
> package needs a GPU to run its tests (build-time or autopkgtest).

Some kernel and hardware combinations may cause the host to
hang.  For example, the rocm-hipamd version in testing doesn't
properly serialize its tests, which causes a number of bus
contention errors when running the test suite and eventually
leads to a hang.  I also have a more concerning case of a test
item running into a potential kernel bug on an RX 6800, which
I'm long overdue to investigate in depth with competent kernel
people (so far I'm unable to tell whether the hardware or the
kernel is at fault, as the crash occurs in the amdgpu ECC
functions).
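
To illustrate what serializing the tests means here, a minimal
Python sketch follows.  It is not how the package does it: the
lock file path, the function name and the use of an advisory
file lock are all assumptions made for the example.  It only
shows the general idea of letting one test process at a time
reach the GPU.

```python
import fcntl
import subprocess
import sys


def run_serialized(commands, lock_path="/tmp/gpu-tests-example.lock"):
    """Run each command one after the other while holding an exclusive lock.

    The lock path is a hypothetical example; any other runner taking the
    same lock waits until this one has finished with the GPU.
    """
    return_codes = []
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            for command in commands:
                # Strictly sequential: the next test starts only after
                # the previous one has exited.
                completed = subprocess.run(command)
                return_codes.append(completed.returncode)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return return_codes
```

For instance, run_serialized([[sys.executable, "-c", "pass"]])
runs a single trivial command under the lock and returns the
list of its exit codes.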

There are other technical concerns regarding the maintenance of
virtual machines and binding them to physical hardware, since
the GPU has to be passed through to them.  A third issue is that
passing the tests almost always requires non-free firmware,
which cannot be freely audited.

The current combination of skippable tests with a check on the
availability of the kfd device is the best we have managed to
achieve thus far.
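
As a rough illustration of that skip logic, here is a minimal
Python sketch.  It assumes the test is declared with
"Restrictions: skippable" in d/tests/control, in which case
autopkgtest treats exit status 77 as a skip.  The function names
and run_gpu_tests() are hypothetical placeholders, and the
actual tests in the package may be written differently.

```python
import os
import sys

KFD_DEVICE = "/dev/kfd"  # device node of the amdkfd kernel driver
SKIP_STATUS = 77         # reported as "skipped" for skippable tests


def kfd_available(path=KFD_DEVICE):
    """Return True if the kfd device exists and is readable and writable."""
    return os.path.exists(path) and os.access(path, os.R_OK | os.W_OK)


def exit_status(gpu_present, run_tests):
    """Return the skip status without a usable GPU, else the test result."""
    if not gpu_present:
        return SKIP_STATUS
    return run_tests()


def run_gpu_tests():
    # Hypothetical placeholder: the real test suite would be invoked here.
    return 0


def main():
    return exit_status(kfd_available(), run_gpu_tests)
```

A test script would end with sys.exit(main()), so that a host
without /dev/kfd reports the test as skipped rather than failed.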

> I recently initiated a discussion about this [3]. For now, the idea to
> run parallel debci infra with guaranteed GPU presence, gather
> experience, and to eventually share proposals on how a GPU dependency
> could be expressed in d/control and d/tests/control.

(I'm overdue in answering [3], but overall I was mostly fine
with the ideas and haven't spotted anything of concern yet.)

> > One thing I spotted along the way; the (Build-)Depends on llvm 
> > related packages use the *versioned* ones. Is there a reason not to 
> > use the unversioned ones from src:llvm-defaults? That would make llvm
> > transitions a bit easier.
> 
> I'd have to check with the co-maintainers who added it, but from what I
> gather so far, the ROCm stack needs a very recent llvm because of many
> changes being upstreamed there.

The ROCm stack is actually developed against a fork of llvm
(rocm-llvm).  To avoid having to package what is more or less a
code copy of the native llvm, we instead target the next
llvm-toolchain version, which contains the changes upstreamed
from rocm-llvm.  Even that requires extensive patching;
thankfully, we have benefited from substantial help from people
at AMD on that front thus far.

> > [1] https://lists.debian.org/debian-devel/2022/09/msg00105.html and follow-up 
> 
> [2] https://github.com/torvalds/linux/blob/v6.2/drivers/gpu/drm/amd/amdkfd/Kconfig#L6-L8
> [3] https://lists.debian.org/debian-ai/2023/03/msg00038.html

Thank you for your work on putting together Debian 12 bookworm!

Have a nice day,  :)
-- 
  .''`.  Étienne Mollier <emollier@debian.org>
 : :' :  gpg: 8f91 b227 c7d6 f2b1 948c  8236 793c f67e 8f0d 11da
 `. `'   sent from /dev/tty1, please excuse my verbosity
   `-    on air: Status Minor - Feel My Hunger
