MemoryECCError
When a server prints warnings like the following to syslog:
Jun 8 04:46:54 array01 kernel: sbridge: HANDLING MCE MEMORY ERROR
Jun 8 04:46:54 array01 kernel: CPU 6: Machine Check Exception: 0 Bank 9: 8c00004e000800c1
Jun 8 04:46:54 array01 kernel: TSC 0 ADDR 1fb6d78000 MISC 1229410001001c8c PROCESSOR 0:206d7 TIME 1402195613 SOCKET 1 APIC 20
Jun 8 04:46:55 array01 kernel: EDAC MC1: CE row 1, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#1": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=6 Err=0008:00c1 (ch=1), addr = 0x1fb6d78000 => socket=1, Channel=0(mask=1), rank=5
Jun 8 04:46:55 array01 kernel:
Jun 8 04:57:45 array01 kernel: sbridge: HANDLING MCE MEMORY ERROR
Jun 8 04:57:45 array01 kernel: CPU 6: Machine Check Exception: 0 Bank 5: 8c00004000010091
Jun 8 04:57:45 array01 kernel: TSC 0 ADDR 1db6d78e40 MISC 4402a2a86 PROCESSOR 0:206d7 TIME 1402196265 SOCKET 1 APIC 20
Jun 8 04:57:45 array01 kernel: EDAC MC1: CE row 1, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#1": 1 Unknown error(s): memory read on FATAL area : cpu=6 Err=0001:0091 (ch=1), addr = 0x1db6d78e40 => socket=1, Channel=0(mask=1), rank=5
some of the memory modules are going bad. To locate the failing DIMM, run the following command:
grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
The output should match the warning in the system log. In the scenario above, it looks like this:
...
/sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:234
...
This means that the errors were reported by the second memory controller (mc1, counting from 0), on its second chip-select row (csrow1, also counting from 0), channel 0 (ch0). The counter currently shows 234 corrected ECC errors (CE) since the last system start.
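The same check can be scripted. The following is a minimal Python sketch, not a definitive tool: it assumes the sysfs layout used by the grep command above and the "EDAC MCn: CE row n, channel n" wording seen in the log excerpt. The function names nonzero_ce_counts and location_from_log are hypothetical helpers introduced only for this illustration.

```python
import re
from pathlib import Path

# Default sysfs location, the same one the grep command above reads.
EDAC_ROOT = "/sys/devices/system/edac/mc"

# Matches the tail of a counter path such as mc1/csrow1/ch0_ce_count.
COUNTER_RE = re.compile(r"mc(\d+)/csrow(\d+)/ch(\d+)_ce_count$")
# Matches the EDAC line format shown in the syslog excerpt (assumed stable).
LOG_RE = re.compile(r"EDAC MC(\d+): CE row (\d+), channel (\d+)")


def nonzero_ce_counts(root=EDAC_ROOT):
    """Return {(mc, csrow, channel): count} for every non-zero CE counter."""
    counts = {}
    for path in sorted(Path(root).glob("mc*/csrow*/ch*_ce_count")):
        match = COUNTER_RE.search(path.as_posix())
        if match is None:
            continue
        try:
            value = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed counter: skip it
        if value > 0:
            counts[tuple(int(g) for g in match.groups())] = value
    return counts


def location_from_log(line):
    """Extract (mc, csrow, channel) from an EDAC CE log line, or None."""
    match = LOG_RE.search(line)
    return tuple(int(g) for g in match.groups()) if match else None
```

Calling nonzero_ce_counts() on the affected server and comparing its keys with location_from_log() applied to the EDAC lines from syslog shows whether the counters and the log point at the same controller, row and channel.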
The mapping of this abstract structure to the physical DIMM slots depends on the BIOS settings and on the motherboard layout. In the example above, the server has all its RAM set up without mirroring or interleaving and with 64-bit addressing. The motherboard is a SuperMicro X9DRW. On this particular server, the csrows therefore correspond directly to DIMM slots on the motherboard (counting from 0: mc0/csrow0 is DIMMA1, mc0/csrow1 is DIMMA2, mc1/csrow0 is DIMME1, etc.). This mapping assumes that the memory controller is located in the CPU itself (so mc0 belongs to CPU0 and mc1 to CPU1) and that DIMMs A-D belong to CPU0 and DIMMs E-H belong to CPU1.
So the misbehaving module sits in slot DIMME2 (mc1/csrow1).
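The mapping for this particular server can be written down as a small lookup. The sketch below is only an illustration under the assumptions stated above (no mirroring or interleaving, mc0 serving DIMMs A-D and mc1 serving DIMMs E-H). The function name dimm_label is hypothetical, and the continuation beyond the listed examples (csrow2 is DIMMB1, csrow3 is DIMMB2, and so on, two slots per letter) is an extrapolation that should be verified against the board manual.

```python
# First slot letter per memory controller, as described above:
# mc0 (CPU0) serves DIMMs A-D, mc1 (CPU1) serves DIMMs E-H.
FIRST_LETTER = {0: "A", 1: "E"}

# ASSUMPTION: two slots per letter, extrapolated from the examples
# mc0/csrow0 = DIMMA1 and mc0/csrow1 = DIMMA2. Check the board manual.
SLOTS_PER_LETTER = 2
LETTERS_PER_CONTROLLER = 4  # A-D or E-H


def dimm_label(mc, csrow):
    """Map (mc, csrow) to a slot label such as 'DIMME2' for this server only."""
    if mc not in FIRST_LETTER:
        raise ValueError(f"unknown memory controller: mc{mc}")
    if csrow < 0:
        raise ValueError(f"invalid csrow: {csrow}")
    letter_offset, slot = divmod(csrow, SLOTS_PER_LETTER)
    if letter_offset >= LETTERS_PER_CONTROLLER:
        raise ValueError(f"csrow{csrow} is out of range for mc{mc}")
    letter = chr(ord(FIRST_LETTER[mc]) + letter_offset)
    return f"DIMM{letter}{slot + 1}"


print(dimm_label(1, 1))  # the failing location from the example: DIMME2
```

On a server with a different board or different BIOS memory settings, this table does not apply and has to be rebuilt from the board documentation.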