
Re: A deep learning rig with 8 GPUs



On 2023-08-24 19:56, M. Zhou wrote:
> The same applies to the nvidia platform. I'm working with a bunch
> of 8-GPU servers (4U size), and managing them is no fun at all
> with any configuration that is not battle-tested. Even with
> server-grade solutions, we still have to reboot due to various
> kinds of problems, like driver bugs and hardware issues.
>
> On Thu, 2023-08-24 at 01:40 -0600, Cordell Bloor wrote:
>> In practice, I think the logistics will be significantly more
>> difficult than that. You can certainly stuff a bunch of AMD GPUs
>> into a box, but even with PCIe pass-through to isolate the GPUs,
>> you may find that sometimes the only reliable way to restore the
>> GPU to a known-good state is to power-cycle the system. Not all
>> hardware is as well-behaved as Navi 21.

All fair points, of course.

I'm not even sure QEMU can handle multiple cards; getting just one card
to work took a lot of trial-and-error, with issues such as setting up
the PCI devices in the guest so that interrupts get routed correctly. I
haven't tested multiple cards yet, which is why multi-GPU support is
disabled for now in the autopkgtest qemu+rocm backend.
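For reference, a single-card setup along those lines might look roughly
like the sketch below. This is only an illustration of the general
VFIO pass-through recipe, not the actual autopkgtest configuration; the
PCI address, VM sizing, and disk image are all placeholders.

```shell
# Unbind the GPU from amdgpu and hand it to vfio-pci. The address
# 0000:03:00.0 is illustrative; find yours with `lspci -nn`.
echo 0000:03:00.0 > /sys/bus/pci/devices/0000:03:00.0/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/0000:03:00.0/driver_override
echo 0000:03:00.0 > /sys/bus/pci/drivers_probe

# Launch the guest with the GPU behind a pcie-root-port on a q35
# machine, so the guest sees a proper PCIe topology and MSI/MSI-X
# interrupts can be routed correctly.
qemu-system-x86_64 \
    -machine q35,accel=kvm -cpu host -smp 4 -m 16G \
    -device pcie-root-port,id=rp1,bus=pcie.0,chassis=1,slot=1 \
    -device vfio-pci,host=0000:03:00.0,bus=rp1 \
    -drive file=guest.qcow2,if=virtio
```

Wiring the card under a pcie-root-port (rather than plugging it into
the root complex of an i440fx machine) is one of the details that can
make the difference for interrupt delivery in the guest.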

Best,
Christian
