
Bug#1036644:



I am also affected by this, running Arch Linux on an Intel NUC 8i3BEH. I have seen these same random MCE broadcast error kernel panics (capturable only via netconsole) ever since upgrading from the 5.15.x LTS kernel series to the 6.1.x series. The latest 6.1 kernel I have tried is 6.1.45, and I am currently back on the 5.15.x branch for stability.

I update my Arch Linux installation on a rolling weekly basis, so all packages, including intel-microcode, are up to date. As others have experienced, the problem seems more prominent (though not exclusively so) when the machine is idle.

>>Maybe lowering "check_interval" or "monarch_timeout" in machinecheck will cause the bug to strike more often, so a git bisect could be possible!? Or raising those values may workaround the problem!?

I had similar thoughts and stumbled upon

/sys/kernel/debug/mce/fake_panic

Writing 1 to this file causes a fake panic: the MCE event is logged to dmesg, but the panic and reboot do not occur.
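
For anyone who wants to script this, here is a minimal Python sketch. It assumes debugfs is mounted at /sys/kernel/debug and that it runs as root; the helper name set_fake_panic is my own invention, not part of any tool.

```python
from pathlib import Path

# Path quoted above; assumes debugfs is mounted at /sys/kernel/debug.
FAKE_PANIC = Path("/sys/kernel/debug/mce/fake_panic")


def set_fake_panic(enabled, path=FAKE_PANIC):
    """Write 1 or 0 to the fake_panic file and return the value read back.

    Hypothetical helper: `path` is a parameter only so the function can be
    tried against an ordinary file without touching the real system.
    """
    path.write_text("1\n" if enabled else "0\n")
    return path.read_text().strip()
```

Calling set_fake_panic(True) as root should be equivalent to echoing 1 into the file from a root shell. The setting does not survive a reboot, so it has to be reapplied after each boot.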

Interestingly, we then get a couple more messages suggesting that the core lockup may be related to i915, as others suspect:

[77775.848032] mce: CPUs not responding to MCE broadcast (may include false positives): 1,3
[77775.848032] mce: CPUs not responding to MCE broadcast (may include false positives): 1,3
[77775.848035] mce: [Hardware Error]: Fake kernel panic: Timeout: Not all CPUs entered broadcast exception handler
[77775.848039] Disabling lock debugging due to kernel taint
[77775.885355] mce: [Hardware Error]: Machine check events logged
[77775.888283] mce: [Hardware Error]: CPU 2: Machine Check Exception: 5 Bank 4: ba00000011000402
[77775.892145] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffffc071678d> {fwtable_read32+0x7d/0x220 [i915]}
[77775.897167] mce: [Hardware Error]: TSC d44e32bae41d 

It might be interesting to see whether the message

RIP !INEXACT! 10:<ffffffffc071678d> {fwtable_read32+0x7d/0x220 [i915]}

also occurs for others with fake_panic enabled.
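
To make comparing logs easier, here is a rough Python sketch that pulls the symbol and module out of any such RIP line in a captured log. The regular expression is only my guess at the general shape of the line, based on the single example above, and the function name find_mce_rips is hypothetical.

```python
import re

# Matches lines shaped like the one quoted above, e.g.
#   mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffffc071678d> {fwtable_read32+0x7d/0x220 [i915]}
# The "[module]" part is treated as optional, on the assumption that
# symbols built into the kernel are printed without it.
RIP_RE = re.compile(
    r"mce: \[Hardware Error\]: RIP (?:!INEXACT! )?"
    r"[0-9a-f]+:<(?P<addr>[0-9a-f]+)> "
    r"\{(?P<symbol>[^+\s}]+)\+[^\s}]+(?: \[(?P<module>[^\]]+)\])?\}"
)


def find_mce_rips(log_text):
    """Return a list of (symbol, module) tuples, one per MCE RIP line found.

    module is None when the line carries no "[module]" suffix.
    """
    hits = []
    for line in log_text.splitlines():
        match = RIP_RE.search(line)
        if match:
            hits.append((match.group("symbol"), match.group("module")))
    return hits
```

Feeding it the saved netconsole or dmesg output as a string gives a quick list of which functions and modules the machine checks landed in.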

Unfortunately, in my experience fake_panic is not a workaround: the cores reported in the MCE event remain locked up afterwards, so any task scheduled onto those cores locks up as well. For example, I ran the sensors command, which hung and eventually produced a soft lockup:

[77798.629123] watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [sensors:1229265]
[77798.631037] Modules linked in: coretemp drivetemp netconsole xt_conntrack ipt_REJECT nf_reject_ipv4 xt_connmark xt_mark iptable_mangle xt_comment xt_addrtype iptable_raw wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel rfcomm uinput xt_nat xt_tcpudp iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter veth ts2020 snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils soundwire_bus snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core intel_rapl_msr intel_rapl_common snd_soc_sst_ipc intel_tcc_cooling snd_soc_sst_dsp x86_pkg_temp_thermal intel_powerclamp snd_soc_acpi_intel_match snd_soc_acpi kvm_intel snd_soc_core snd_hda_codec_hdmi snd_compress kvm si2157 ac97_bus snd_hda_codec_realtek snd_pcm_dmaengine snd_hda_codec_generic ledtrig_audio
[77798.631090]  irqbypass si2168 crct10dif_pclmul snd_hda_intel crc32_pclmul polyval_clmulni snd_intel_dspcfg polyval_generic gf128mul ghash_clmulni_intel snd_intel_sdw_acpi snd_hda_codec sha512_ssse3 dvb_usb_dvbsky dvb_usb_v2 btusb m88ds3103 snd_hda_core dvb_core btrtl btbcm iTCO_wdt videobuf2_vmalloc snd_hwdep videobuf2_memops videobuf2_common aesni_intel btintel crypto_simd snd_pcm intel_pmc_bxt btmtk cryptd snd_timer rapl intel_cstate mei_pxp mei_hdcp iTCO_vendor_support ee1004 snd videodev intel_uncore bluetooth mei_me e1000e intel_wmi_thunderbolt i2c_i801 wmi_bmof soundcore pcspkr i2c_smbus mei mc i2c_mux ecdh_generic intel_pch_thermal ir_rc6_decoder rc_rc6_mce ite_cir acpi_pad acpi_tad mac_hid cfg80211 rfkill crypto_user loop fuse dm_mod bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 mmc_block i915 drm_buddy intel_gtt nvme rtsx_pci_sdmmc drm_display_helper mmc_core nvme_core crc32c_intel cec xhci_pci rtsx_pci nvme_common video xhci_pci_renesas ttm wmi
[77798.641974]  [last unloaded: i2c_dev]
[77798.656901] CPU: 2 PID: 1229265 Comm: sensors Tainted: G   M               6.1.39-1-lts-custom-015e51c #1 0c9d39d05dfd27e4ed0b0da78692e6ddc0d0b631
[77798.659471] Hardware name: Intel(R) Client Systems NUC8i3BEH/NUC8BEB, BIOS BECFL357.86A.0089.2021.0621.1343 06/21/2021
[77798.662012] RIP: 0010:smp_call_function_single+0xfe/0x140
[77798.664509] Code: 25 28 00 00 00 75 51 c9 c3 cc cc cc cc 48 89 e6 48 89 54 24 18 4c 89 44 24 10 e8 4d fe ff ff 8b 54 24 08 83 e2 01 74 0b f3 90 <8b> 54 24 08 83 e2 01 75 f5 eb b9 8b 05 89 b4 5d 02 85 c0 0f 85 65
[77798.667074] RSP: 0018:ffffad160582fcc0 EFLAGS: 00000202
[77798.669635] RAX: 0000000000000000 RBX: ffffad160582fd6c RCX: ffff8be27b8dc238
[77798.672205] RDX: 0000000000000001 RSI: ffffad160582fcc0 RDI: ffffad160582fcc0
[77798.674773] RBP: ffffad160582fd18 R08: ffffffff855f4fb0 R09: ffff8be3366090c0
[77798.677349] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8be2457889b0
[77798.679928] R13: ffffad160582fe30 R14: 0000000000000001 R15: ffffad160582fec8
[77798.682464] FS:  00007f19d4d3e740(0000) GS:ffff8be5add00000(0000) knlGS:0000000000000000
[77798.684997] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[77798.687489] CR2: 00007f7ef27cfc80 CR3: 000000032f4c2001 CR4: 00000000003706e0
[77798.689968] Call Trace:
[77798.692428]  <IRQ>
[77798.694886]  ? watchdog_timer_fn+0x1a8/0x200
[77798.697382]  ? lockup_detector_update_enable+0x50/0x50
[77798.699858]  ? __hrtimer_run_queues+0x10f/0x2b0
[77798.702340]  ? hrtimer_interrupt+0xf8/0x210
[77798.704812]  ? __sysvec_apic_timer_interrupt+0x5e/0x110
[77798.707300]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[77798.709803]  </IRQ>
[77798.712312]  <TASK>
[77798.714787]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[77798.717276]  ? pldmfw_flash_image+0xce0/0xce0
[77798.719755]  ? smp_call_function_single+0xfe/0x140
[77798.722241]  ? pldmfw_flash_image+0xce0/0xce0
[77798.724738]  rdmsr_on_cpu+0x5f/0x90
[77798.727216]  show_temp+0xc1/0xf0 [coretemp 2a9b54610668d110c724a01af9913aabfe08a40c]
[77798.729759]  dev_attr_show+0x19/0x40
[77798.732245]  sysfs_kf_seq_show+0xa8/0xf0
[77798.734672]  seq_read_iter+0x120/0x460
[77798.737042]  vfs_read+0x23d/0x310
[77798.739356]  ksys_read+0x6f/0xf0
[77798.741630]  do_syscall_64+0x5d/0x90
[77798.743906]  ? do_syscall_64+0x6c/0x90
[77798.746182]  ? do_syscall_64+0x6c/0x90
[77798.748428]  ? syscall_exit_to_user_mode+0x1b/0x40
[77798.750681]  ? do_syscall_64+0x6c/0x90
[77798.752923]  ? do_syscall_64+0x6c/0x90
[77798.755110]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[77798.757292] RIP: 0033:0x7f19d4f27b21
[77798.759440] Code: c5 fe ff ff 50 48 8d 3d 45 7d 0a 00 e8 e8 11 02 00 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d dd 99 0e 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 c3 66 0f 1f 44 00 00 48 83 ec 28 48 89 54
[77798.761685] RSP: 002b:00007ffc3a71ea08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[77798.763902] RAX: ffffffffffffffda RBX: 000055c53cb34340 RCX: 00007f19d4f27b21
[77798.766135] RDX: 0000000000001000 RSI: 000055c53cb4af30 RDI: 0000000000000003
[77798.768378] RBP: 00007f19d50065a0 R08: 0000000000000000 R09: 0000000000000001
[77798.770607] R10: 0000000000000003 R11: 0000000000000246 R12: 000055c53cb34340
[77798.772852] R13: 0000000000000a68 R14: 00007f19d5005ca0 R15: 0000000000000a68
[77798.775079]  </TASK>
[77798.777610] systemd-journald[236]: Compressed data object 996 -> 502 using ZSTD
[77798.780534] systemd-journald[236]: Compressed data object 988 -> 559 using ZSTD

So the machine still needs a reboot to return to normal operation.

Regards
James
