Bug#1050940: linux: Enable Correctable Errors Collector RAS_CEC feature
Source: linux
Version: 6.5~rc7-1~exp1
Severity: wishlist
Tags: patch
X-Debbugs-Cc: miguel.bernal.marin@linux.intel.com, jair.gonzalez@linux.intel.com
Dear Maintainer,
Please enable the Reliability, Availability and Serviceability (RAS)
Correctable Errors Collector (RAS_CEC) feature for the amd64 (x86_64)
architecture in Debian Trixie.
RAS_CEC introduces a simple data structure for collecting correctable
errors, along with accessors.
This is a small cache which collects correctable memory errors per 4K
page PFN and counts their repeated occurrence. Once the counter for a
PFN overflows, we try to soft-offline that page as we take it to mean
that it has reached a relatively high error count and would probably
be best if we don't use it anymore.
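To illustrate the idea described above, here is a minimal userspace sketch in C. It is not the kernel's implementation: the names (toy_cec, toy_cec_add), the cache size, and the threshold value are all hypothetical and chosen only to keep the example small. It counts errors per 4K page frame number and reports when a page reaches the threshold, at which point the real collector would try to soft-offline the page.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_CEC_SLOTS      8   /* hypothetical cache size */
#define TOY_CEC_THRESHOLD  3   /* hypothetical action threshold */
#define TOY_PAGE_SHIFT     12  /* 4K pages */

struct toy_cec_entry {
    uint64_t pfn;      /* page frame number of the affected page */
    unsigned count;    /* correctable errors seen on that page */
    int used;          /* non-zero if this slot holds an entry */
};

struct toy_cec {
    struct toy_cec_entry slot[TOY_CEC_SLOTS];
};

/*
 * Record one correctable error at physical address phys_addr.
 * Returns 1 when the page reaches the threshold (the caller would then
 * try to soft-offline it; the entry is dropped), 0 when the error was
 * only counted, and -1 when the cache has no free slot.
 */
static int toy_cec_add(struct toy_cec *c, uint64_t phys_addr)
{
    uint64_t pfn = phys_addr >> TOY_PAGE_SHIFT;
    struct toy_cec_entry *free_slot = NULL;

    for (size_t i = 0; i < TOY_CEC_SLOTS; i++) {
        struct toy_cec_entry *e = &c->slot[i];

        if (e->used && e->pfn == pfn) {
            if (++e->count >= TOY_CEC_THRESHOLD) {
                e->used = 0;   /* forget the page once acted upon */
                return 1;
            }
            return 0;
        }
        if (!e->used && !free_slot)
            free_slot = e;     /* remember the first empty slot */
    }

    if (!free_slot)
        return -1;

    free_slot->pfn = pfn;
    free_slot->count = 1;
    free_slot->used = 1;
    return 0;
}
```

With the assumed threshold of 3, two errors anywhere inside the same 4K page are only counted, and the third one makes toy_cec_add() return 1.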
Error decoding now goes through the decoding chain.
mce_first_notifier() sees the error first, and the CEC decides whether
to log it itself, in which case the rest of the chain does not hear
about it (basically the main reason for the CE collector), or to let
the remaining notifiers run.
When the CEC hits the action threshold, it will try to soft-offline the
page containing the ECC and then the whole decoding chain gets to see
the error.
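The flow described above can be sketched roughly as follows. This is a simplified userspace model, not kernel code: the types and functions (toy_error, toy_first_notifier, toy_decoder, toy_run_chain) are hypothetical, and the threshold_hit flag stands in for whatever the collector decides. The first notifier stops the chain while the collector is silently absorbing errors, and lets it continue once the threshold is reached.

```c
#include <stddef.h>

enum toy_verdict { TOY_CONTINUE, TOY_STOP };

struct toy_error {
    unsigned long long pfn;  /* page the error was reported on */
    int threshold_hit;       /* assumed to be set by the collector */
};

typedef enum toy_verdict (*toy_notifier)(const struct toy_error *err);

/* First in the chain: swallow the error unless the threshold was hit. */
static enum toy_verdict toy_first_notifier(const struct toy_error *err)
{
    return err->threshold_hit ? TOY_CONTINUE : TOY_STOP;
}

/* Stand-in for a later decoder; it only counts how often it runs. */
static int toy_decoder_calls;

static enum toy_verdict toy_decoder(const struct toy_error *err)
{
    (void)err;
    toy_decoder_calls++;
    return TOY_CONTINUE;
}

/* Call notifiers in order until one stops; return how many were called. */
static size_t toy_run_chain(const toy_notifier *chain, size_t n,
                            const struct toy_error *err)
{
    size_t called = 0;

    for (size_t i = 0; i < n; i++) {
        called++;
        if (chain[i](err) == TOY_STOP)
            break;
    }
    return called;
}
```

In this model, an error below the threshold reaches only toy_first_notifier(), while an error that hits the threshold is also passed on to toy_decoder().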
To disable the Correctable Errors Collector, a kernel parameter is used:
> ras=cec_disable
An MR with this proposal was created at:
https://salsa.debian.org/kernel-team/linux/-/merge_requests/827
Thanks,
Miguel Bernal Marin