Re: A deep learning rig with 8 GPUs
On 2023-08-24 19:56, M. Zhou wrote:
> The same applies to the nvidia platform. I'm working with a bunch
> of 8-GPU servers (4U size), and management is no fun at all
> if we use any configuration that is not battle-tested. Even with
> server-grade solutions, we still have to reboot due to various kinds
> of problems like driver bugs and hardware issues.
>
> On Thu, 2023-08-24 at 01:40 -0600, Cordell Bloor wrote:
>> In practice, I think the logistics will be significantly more
>> difficult than that. You can certainly stuff a bunch of AMD GPUs
>> into a box, but even with PCIe pass-through to isolate the GPUs,
>> you may find that sometimes the only reliable way to restore the
>> GPU to a known-good state is to power-cycle the system. Not all
>> hardware is as well-behaved as Navi 21.
All fair points, of course.
I'm not even sure QEMU can handle multiple cards in this setup;
getting just one card working took a lot of trial and error, with
issues such as arranging the PCI devices in the guest so that
interrupts are routed correctly. I haven't tested multiple cards yet,
which is why I've disabled that for now in the autopkgtest qemu+rocm
backend.
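For reference, single-card pass-through roughly follows the standard
vfio-pci flow; a minimal sketch, with a placeholder PCI address and
guest image path (not my actual configuration):

```shell
# Hypothetical GPU address; find the real one with `lspci -nn`
GPU=0000:03:00.0

# Detach the device from its host driver and hand it to vfio-pci
echo "$GPU" > "/sys/bus/pci/devices/$GPU/driver/unbind"
echo vfio-pci > "/sys/bus/pci/devices/$GPU/driver_override"
echo "$GPU" > /sys/bus/pci/drivers_probe

# Boot the guest with the GPU passed through; the q35 machine type
# gives the guest a PCIe topology, which affects interrupt routing
qemu-system-x86_64 \
  -machine q35,accel=kvm -cpu host -m 16G \
  -device vfio-pci,host="$GPU" \
  -drive file=guest.img,format=qcow2,if=virtio
```

If the GPU has a companion audio function (e.g. 0000:03:00.1), it
usually sits in the same IOMMU group and has to be passed through
alongside the graphics function.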
Best,
Christian