My laptop is working just fine. It’s from 2018 and it has an NVMe drive.

It has an EFI boot partition and a second partition with LUKS, with LVM on top of that.

Since this week, I have been seeing these log entries from time to time:

Mar 07 17:31:14 almendra kernel: pcieport 0000:00:1d.6: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 07 17:31:14 almendra kernel: pcieport 0000:00:1d.6:   device [8086:34b6] error status/mask=00000001/00002000
Mar 07 17:31:14 almendra kernel: pcieport 0000:00:1d.6:    [ 0] RxErr                  (First)
Mar 07 17:31:14 almendra kernel: pcieport 0000:00:1d.6: AER:   Error of this Agent is reported first
Mar 07 17:31:14 almendra kernel: nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 07 17:31:14 almendra kernel: nvme 0000:02:00.0:   device [8086:0975] error status/mask=00000001/00002000
Mar 07 17:31:14 almendra kernel: nvme 0000:02:00.0:    [ 0] RxErr                  (First)
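
For anyone trying to read the `error status/mask=` values, here is a minimal Python sketch that decodes them. It assumes the two hex numbers are the PCIe AER Correctable Error Status and Mask registers, and uses the short names the Linux kernel prints for each bit. The function names (`decode_bits`, `decode_line`) are made up for this example.

```python
import re

# Bit positions in the PCIe AER Correctable Error Status register,
# paired with the short names the Linux kernel prints for them.
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}

STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_bits(value):
    """Return the names of all known bits set in a 32-bit register value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if (value >> bit) & 1]


def decode_line(line):
    """Parse one 'error status/mask=' kernel log line.

    Returns a (reported, masked) pair of name lists,
    or None if the line does not contain a status/mask field.
    """
    m = STATUS_RE.search(line)
    if m is None:
        return None
    status, mask = (int(g, 16) for g in m.groups())
    return decode_bits(status), decode_bits(mask)


if __name__ == "__main__":
    sample = ("Mar 07 17:31:14 almendra kernel: nvme 0000:02:00.0:   "
              "device [8086:0975] error status/mask=00000001/00002000")
    print(decode_line(sample))
```

For the lines above, the status decodes to `RxErr` only, which matches the `[ 0] RxErr` line, and the mask has only the `NonFatalErr` bit set.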

The devices are:

$ lspci -vv | grep 1d.6
00:1d.6 PCI bridge: Intel Corporation Device 34b6 (rev 30) (prog-if 00 [Normal decode])

$ lspci -vv | grep 02:00.0
02:00.0 Non-Volatile memory controller: Intel Corporation Optane NVME SSD H10 with Solid State Storage [Teton Glacier] (prog-if 02 [NVM Express])

The laptop works as usual, but I have the impression that the NVMe drive is telling me something bad.

It happens from time to time:

$ journalctl --since yesterday | grep -c "nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical"
9
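
To see whether the rate changes over time, the same count can be broken down per day and per device. This is a rough Python sketch, assuming the default short `journalctl` output format shown above. The script name `aer_tally.py` and the function `tally` are just names picked for the example.

```python
import re
import sys
from collections import Counter

# Matches the first line of a corrected AER report in short journal output, e.g.
# "Mar 07 17:31:14 almendra kernel: nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, ..."
LINE_RE = re.compile(
    r"^(?P<day>\w{3} +\d+) [\d:]+ \S+ kernel: "
    r"(?P<driver>\S+) (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=Corrected"
)


def tally(lines):
    """Count corrected PCIe bus errors per (day, PCI address)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[(m["day"], m["bdf"])] += 1
    return counts


if __name__ == "__main__":
    # Read journal text from stdin and print one row per day and device.
    for (day, bdf), n in sorted(tally(sys.stdin).items()):
        print(f"{day}  {bdf}  {n}")
```

It would be run as `journalctl --since "1 week ago" | python3 aer_tally.py`. Only the "PCIe Bus Error" header line of each report is counted, so the follow-up `device [...]` and `RxErr` lines do not inflate the numbers.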

Do you know what it means?

  • Limonene@lemmy.world
    7 months ago

    The good news is that the error shown there is a PCIe bus error, which means the fault is somewhere on the link between the NVMe controller and the PCIe bridge it is attached to (00:1d.6 in your logs). More good news: the errors you experienced were fully corrected, so you probably lost no data.

    So the flash memory in the drive is probably not what’s failing. That’s good, because once flash memory starts failing, it usually only gets worse. In this case, your errors may be fixable: by replacing the motherboard, by replacing the processor, by reseating the NVMe drive in its slot, by verifying that your power supply is reliable…

    However, if your NVMe controller actually does fail, it will be little consolation that your data is all still there on the flash chips with no way to get at it. So now might be a good time to make a backup. Any time is a good time to make a backup, but now is an especially good time.

    If you keep getting these errors at the same rate, then you probably don’t need to do anything, since the errors are being corrected. If you’re worried, you could use Btrfs, which checksums data and metadata by default, so silent corruption would at least be detected.

    • vsis@feddit.clOP
      7 months ago

      […] by replacing the motherboard, by replacing the processor, by reseating the NVMe drive in its slot, by verifying that your power supply is reliable…

      I will start with the cheapest option 😅

      I assume the power supply is reliable. Having a battery should make it more stable, I guess.