
Re: ROCm CI: Now also triggering on dependency changes



On 2023-11-22 12:54, Christian Kastner wrote:
> I'm happy to report that with the release of rocm-dev-tools 0.5.0, our
> custom scheduler now also triggers tests for our packages when one of
> their dependencies has been updated.

This seems to work fine. For example, yesterday's gcc-13 upload
triggered a re-test of most of the other libraries.

Incidentally, ci-worker-ckk02 hit some memory-related error while
testing rocblas (still ongoing), so I wonder how that will affect the test:

> [  939.527256] [Hardware Error]: Uncorrected, software restartable error.
> [  939.527261] [Hardware Error]: CPU:4 (19:21:2) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135
> [  939.527270] [Hardware Error]: Error Addr: 0x0000000564d86940
> [  939.527273] [Hardware Error]: IPID: 0x001000b000000000
> [  939.527276] [Hardware Error]: Load Store Unit Ext. Error Code: 1, An ECC error or L2 poison was detected on a data cache read by a load.
> [  939.527282] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
> [  939.527289] mce: Uncorrected hardware memory error in user-access at 564d86940
> [  939.528039] Memory failure: 0x564d86: recovery action for unsplit thp: Ignore
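
For reference, the page frame in the "Memory failure" line is just the error address with the low 12 bits dropped, assuming the usual 4 KiB base pages on x86-64. A quick sketch of that arithmetic (the function name is mine, purely for illustration):

```python
PAGE_SHIFT = 12  # assumption: 4 KiB base pages


def page_frame_number(phys_addr: int) -> int:
    """Return the number of the page frame that contains phys_addr."""
    return phys_addr >> PAGE_SHIFT


error_addr = 0x0000000564d86940  # "Error Addr" from the MCE record above
print(hex(page_frame_number(error_addr)))  # prints 0x564d86
```

That matches the 0x564d86 reported by the kernel, so both lines refer to the same page.
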

More importantly though, after swapping the CPU on ci-worker-ckk02
yesterday,
  * the gfx1030 card was not properly connected (my fault)
  * the gfx1034 card was assigned the PCI slot ID normally assigned to
    the aforementioned gfx1030, so the gfx1030 test results are bogus

Consequently, I will reschedule all tests for gfx1030 and gfx1034.
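
A check along these lines before each run might catch such a mix-up earlier. This is only a sketch: the slot IDs, the expected mapping and the function name are made up for illustration, and how a worker actually enumerates its cards is left open.

```python
# Hypothetical expected layout: PCI slot ID -> GPU architecture.
EXPECTED = {
    "0000:03:00.0": "gfx1030",
    "0000:0a:00.0": "gfx1034",
}


def slot_mismatches(expected, observed):
    """Return {slot: (expected_arch, observed_arch)} for each slot that differs.

    A card missing from `observed` is reported with observed_arch None.
    """
    return {
        slot: (arch, observed.get(slot))
        for slot, arch in expected.items()
        if observed.get(slot) != arch
    }


# Roughly the situation described above: the gfx1030 is absent and the
# gfx1034 shows up in the slot that normally holds the gfx1030.
observed = {"0000:03:00.0": "gfx1034"}
for slot, (want, got) in slot_mismatches(EXPECTED, observed).items():
    print(f"{slot}: expected {want}, found {got}")
```

A non-empty result would be a reason to refuse test jobs on that worker until the layout is fixed.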

I wonder whether we shouldn't keep a public log of all infrastructure
changes somewhere.

