
Re: A deep learning rig with 8 GPUs



On 2023-08-24 19:56, M. Zhou wrote:
> The same applies to the nvidia platform. I'm working with a bunch
> of 8-GPU servers (4U size), and managing them is no fun at all
> with any configuration that is not battle-tested. Even with
> server-grade solutions, we still have to reboot due to various
> kinds of problems, like driver bugs and hardware issues.
>
> On Thu, 2023-08-24 at 01:40 -0600, Cordell Bloor wrote:
>> In practice, I think the logistics will be significantly more
>> difficult than that. You can certainly stuff a bunch of AMD GPUs
>> into a box, but even with PCIe pass-through to isolate the GPUs,
>> you may find that sometimes the only reliable way to restore the
>> GPU to a known-good state is to power-cycle the system. Not all
>> hardware is as well-behaved as Navi 21.

All fair points, of course.

I'm not even sure QEMU can handle multiple cards; getting just one card
to work took a lot of trial-and-error, with issues such as setting up
the PCI devices in the guest so that interrupts get routed correctly. I
haven't tested multiple cards yet, which is why multi-GPU support is
disabled for now in the autopkgtest qemu+rocm backend.
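For reference, a single-card setup along those lines might look roughly
like the sketch below. This is only an illustration of the general
VFIO pass-through recipe, not the actual autopkgtest configuration; the
PCI address, VM sizing, and disk image are all placeholders.

```shell
# Unbind the GPU from amdgpu and hand it to vfio-pci. The address
# 0000:03:00.0 is illustrative; find yours with `lspci -nn`.
echo 0000:03:00.0 > /sys/bus/pci/devices/0000:03:00.0/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/0000:03:00.0/driver_override
echo 0000:03:00.0 > /sys/bus/pci/drivers_probe

# Launch the guest with the GPU behind a pcie-root-port on a q35
# machine, so the guest sees a proper PCIe topology and MSI/MSI-X
# interrupts can be routed correctly.
qemu-system-x86_64 \
    -machine q35,accel=kvm -cpu host -smp 4 -m 16G \
    -device pcie-root-port,id=rp1,bus=pcie.0,chassis=1,slot=1 \
    -device vfio-pci,host=0000:03:00.0,bus=rp1 \
    -drive file=guest.qcow2,if=virtio
```

Wiring the card under a pcie-root-port (rather than plugging it into
the root complex of an i440fx machine) is one of the details that can
make the difference for interrupt delivery in the guest.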

Best,
Christian
