
Bug#1050940: linux: Enable Correctable Errors Collector RAS_CEC feature



Source: linux
Version: 6.5~rc7-1~exp1
Severity: wishlist
Tags: patch
X-Debbugs-Cc: miguel.bernal.marin@linux.intel.com, jair.gonzalez@linux.intel.com

Dear Maintainer,

Please enable the Reliability, Availability and Serviceability (RAS)
Correctable Errors Collector (RAS_CEC) feature for the amd64 (x86_64)
architecture in Debian Trixie.

RAS_CEC introduces a simple data structure for collecting correctable
errors, along with its accessors.

This is a small cache which collects correctable memory errors per 4K
page PFN and counts their repeated occurrence. Once the counter for a
PFN overflows, we try to soft-offline that page as we take it to mean
that it has reached a relatively high error count and would probably
be best if we don't use it anymore.
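
As a rough illustration of the behaviour described above, here is a toy
user-space model in Python. It is not the kernel implementation: the
threshold, the capacity, the eviction policy and all names are
assumptions made up for this sketch.

```python
from dataclasses import dataclass, field

PAGE_SHIFT = 12  # 4K pages, as in the description above


@dataclass
class CorrectableErrorCache:
    """Toy user-space model of a per-PFN correctable error counter."""

    threshold: int = 4   # hypothetical action threshold
    capacity: int = 8    # hypothetical number of PFNs tracked at once
    counts: dict = field(default_factory=dict)
    offlined: list = field(default_factory=list)

    def record(self, phys_addr: int) -> bool:
        """Count one correctable error; return True if the page was offlined."""
        pfn = phys_addr >> PAGE_SHIFT
        if pfn not in self.counts and len(self.counts) >= self.capacity:
            # Assumed eviction policy: drop the PFN with the fewest errors.
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]
        self.counts[pfn] = self.counts.get(pfn, 0) + 1
        if self.counts[pfn] >= self.threshold:
            # Stand-in for soft-offlining: stop tracking and remember the PFN.
            del self.counts[pfn]
            self.offlined.append(pfn)
            return True
        return False
```

Every address within the same 4K page maps to the same PFN, so repeated
errors anywhere in that page increment a single counter.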

Error decoding is now done through the decoding chain.
mce_first_notifier() gets to see the error first, and the CEC decides
either to log the error itself, in which case the rest of the chain
does not hear about it (basically the main reason for the CE
collector), or to continue running the notifiers.

When the CEC hits the action threshold, it will try to soft-offline the
page containing the ECC error, and then the whole decoding chain gets to
see the error.
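
The flow through the chain can be pictured with the following Python
sketch. It only models the control flow described above; the return
codes, function names and threshold are hypothetical and do not
correspond to kernel symbols.

```python
from typing import Callable, Dict, List

STOP, CONTINUE = "stop", "continue"  # hypothetical return codes


def make_cec_notifier(threshold: int) -> Callable[[int], str]:
    """Build a first-in-chain notifier that absorbs errors below threshold."""
    counts: Dict[int, int] = {}

    def cec_notifier(pfn: int) -> str:
        counts[pfn] = counts.get(pfn, 0) + 1
        if counts[pfn] < threshold:
            return STOP      # collector kept the error; chain stays quiet
        del counts[pfn]      # stand-in for soft-offlining the page
        return CONTINUE      # threshold hit: let the whole chain see it

    return cec_notifier


def run_chain(chain: List[Callable[[int], str]], pfn: int) -> None:
    """Call each notifier in order until one of them stops the chain."""
    for notifier in chain:
        if notifier(pfn) == STOP:
            break
```

With a threshold of 3, a decoder placed after the collector would see
nothing for the first two errors on a page and would see the third.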

To disable the Correctable Errors Collector, a kernel parameter is used:
>  ras=cec_disable
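
To check whether a command line carries this parameter, a small helper
along these lines could be used. It is a sketch only: the function name
is made up, and accepting a comma-separated option list after "ras=" is
an assumption rather than something stated in this report.

```python
def cec_disabled(cmdline: str) -> bool:
    """Return True if the command line string contains ras=cec_disable."""
    for token in cmdline.split():
        key, _, value = token.partition("=")
        # Assumption: several ras options may be joined with commas.
        if key == "ras" and "cec_disable" in value.split(","):
            return True
    return False
```

On a running system, the string to pass in can be read from
/proc/cmdline.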

An MR with this proposal was created at:

https://salsa.debian.org/kernel-team/linux/-/merge_requests/827

Thanks,
Miguel Bernal Marin

