Data corruption

Oh, yuck. ZFS-8000-8A. :(

-------------------
Mon May 29 16:23:47 [bash:5.2.15 jobs:0 error:0 time:35]
root@charm:/home/jj5
# zpool status -v
  pool: fast
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:

        NAME                                              STATE     READ WRITE CKSUM
        fast                                              ONLINE       0     0     0
          mirror-0                                        ONLINE       0     0     0
            nvme-Samsung_SSD_990_PRO_2TB_S6Z2NJ0W215171W  ONLINE       0     0     2
            nvme-Samsung_SSD_990_PRO_2TB_S6Z2NJ0W215164J  ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

        /fast/vbox/218-jj-wrk-8-charm-prod-vbox/218-jj-wrk-8-charm-prod-vbox.vdi
-------------------
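
For a long status report it can help to pull out just the damaged file list. What follows is a minimal sketch, not anything ZFS ships: the function name `list_damaged_files` is made up here, and it assumes the output layout shown above, where the file paths follow the "errors: Permanent errors" line.

```shell
# Hypothetical helper: print only the paths listed under
# "errors: Permanent errors ..." in 'zpool status -v' output.
# Reads the status text on stdin, so it also works on saved output.
list_damaged_files() {
  awk '
    /^errors: Permanent errors/ { in_list = 1; next }   # the list starts after this line
    /^ *pool:/                  { in_list = 0 }         # the next pool section ends the list
    in_list && NF               { sub(/^[ \t]+/, ""); print }
  '
}

# Usage (needs root and a real pool):
#   zpool status -v fast | list_damaged_files
```

Once the listed file has been restored from backup, the usual follow-up is `zpool clear fast` and then `zpool scrub fast` to check that the error is gone.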

Replacing ZFS disk

Over on "How to replace a failed disk in a ZFS mirror" I found this command:

# sudo zpool replace -f storage 18311740819329882151 /dev/disk/by-id/ata-WDC_WD80EFAX-68LHPN0_7HJSWL7F

I’m gonna need that later on.
The number is the GUID of the old device (zpool status shows it in place of the name once the disk is gone) and the path is the new device.

Update: this is done.

-------------------
Tue Feb 21 21:31:53 [bash:5.1.16 jobs:0 error:0 time:3]
root@order:/home/jj5
# zpool replace -f data 12987390290044433012 /dev/disk/by-id/ata-WDC_WD30EFZX-68AWUN0_WD-WX72D1273L06
-------------------
Tue Feb 21 21:55:32 [bash:5.1.16 jobs:0 error:0 time:1422]
root@order:/home/jj5
# zpool status
  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Feb 21 21:34:26 2023
        749G scanned at 605M/s, 88.2G issued at 71.1M/s, 5.71T total
        27.9G resilvered, 1.51% done, 23:01:31 to go
config:

        NAME                                             STATE     READ WRITE CKSUM
        data                                             DEGRADED     0     0     0
          raidz1-0                                       DEGRADED     0     0     0
            ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0D8E3C9     ONLINE       0     0     0
            replacing-1                                  DEGRADED     0     0     0
              12987390290044433012                       UNAVAIL      0     0     0  was /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N2473766-part1
              ata-WDC_WD30EFZX-68AWUN0_WD-WX72D1273L06   ONLINE       0     0     0  (resilvering)
            ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0D5506W     ONLINE       0     0     0
        cache
          ata-WDC_WDS120G1G0A-00SS50_171905A00F49-part4  ONLINE       0     0     0

errors: No known data errors
-------------------
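
A resilver this long is tempting to poll. Here is a small sketch for pulling the progress figures out of the status text. The function name `resilver_progress` is made up, and it assumes the "resilvered, N% done, H:M:S to go" wording shown above. Other OpenZFS versions may word this line differently, in which case it prints nothing.

```shell
# Hypothetical helper: extract "percent done" and "time to go" from
# 'zpool status' output read on stdin.
resilver_progress() {
  sed -n 's/.*resilvered, \([0-9.]*%\) done, \(.*\) to go.*/\1 done, \2 remaining/p'
}

# Usage (needs a real pool):
#   zpool status data | resilver_progress
```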

Living life on the edge

Whoosh! There goes my redundancy! Fingers crossed I don’t lose another disk before the replacements arrive!

-------------------
Tue Nov 29 17:01:01 [bash:5.1.16 jobs:0 error:0 time:6361]
root@love:/home/jj5
# zpool status
  pool: data
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0B in 1 days 10:20:21 with 0 errors on Mon Nov 14 10:44:22 2022
config:

        NAME                     STATE     READ WRITE CKSUM
        data                     DEGRADED     0     0     0
          mirror-0               DEGRADED     0     0     0
            sda                  ONLINE       0     0     0
            9460704850353196665  OFFLINE      0     0     0  was /dev/sdb1
          mirror-1               DEGRADED     0     0     0
            2467357469475118468  OFFLINE      0     0     0  was /dev/sdc1
            sdd                  ONLINE       0     0     0
        cache
          nvme0n1p4              ONLINE       0     0     0

errors: No known data errors

  pool: fast
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 2.70G in 00:01:35 with 0 errors on Tue Nov 29 13:28:02 2022
config:

        NAME         STATE     READ WRITE CKSUM
        fast         ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            sde      ONLINE       0     0     0
            sdf      ONLINE       0     0     1
        cache
          nvme0n1p3  ONLINE       0     0     0

errors: No known data errors

  pool: temp
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:21:59 with 0 errors on Sun Nov 13 00:46:02 2022
config:

        NAME         STATE     READ WRITE CKSUM
        temp         ONLINE       0     0     0
          nvme0n1p5  ONLINE       0     0     0

errors: No known data errors
-------------------
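
With several pools in one report it is easy to lose track of which members are actually missing. This is a rough sketch that lists them with their GUID, state, and old device path. The function name `list_missing_vdevs` is made up, and it assumes the column layout and the trailing "was /dev/..." note shown in the output above.

```shell
# Hypothetical helper: list pool members that are not available, with
# the device path they used to have. Reads 'zpool status' on stdin.
list_missing_vdevs() {
  awk '
    # Skip "state: ..." style header lines; look only at config rows.
    $1 !~ /:$/ && ($2 == "OFFLINE" || $2 == "UNAVAIL" || $2 == "FAULTED" || $2 == "REMOVED") {
      was = ""
      for (i = 3; i < NF; i++) if ($i == "was") was = $(i + 1)
      print $1, $2, was
    }
  '
}

# Usage (needs a real pool):
#   zpool status | list_missing_vdevs
```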

Resolving ZFS issue on ‘trick’

I’m gonna follow these instructions to replace a disk in one of my ZFS arrays on my workstation ‘trick’. I have ordered myself a new 6TB Seagate Barracuda hard drive for $179. I hope nobody minds if I make a few notes for myself here…

-------------------
Mon Sep 12 19:24:53 [bash:5.0.17 jobs:0 error:0 time:179]
root@trick:/home/jj5
# zpool status data
  pool: data
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub canceled on Mon Sep 12 19:24:53 2022
config:

        NAME                                     STATE     READ WRITE CKSUM
        data                                     DEGRADED     0     0     0
          mirror-0                               DEGRADED     0     0     0
            scsi-SATA_ST6000VN0041-2EL_ZA16N49H  DEGRADED 11.1K     0 54.6K  too many errors
            scsi-SATA_ST6000VN0041-2EL_ZA16N4ZH  ONLINE       0     0     0

errors: No known data errors
-------------------
Mon Sep 12 19:40:13 [bash:5.0.17 jobs:0 error:0 time:1099]
root@trick:/home/jj5
# zdb
data:
    version: 5000
    name: 'data'
    state: 0
    txg: 2685198
    pool_guid: 1339265133722772877
    errata: 0
    hostid: 727553668
    hostname: 'trick'
    com.delphix:has_per_vdev_zaps
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 1339265133722772877
        create_txg: 4
        children[0]:
            type: 'mirror'
            id: 0
            guid: 802431090802465148
            metaslab_array: 256
            metaslab_shift: 34
            ashift: 12
            asize: 6001160355840
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 129
            children[0]:
                type: 'disk'
                id: 0
                guid: 9301639020686187487
                path: '/dev/disk/by-id/scsi-SATA_ST6000VN0041-2EL_ZA16N49H-part1'
                devid: 'ata-ST6000VN0041-2EL11C_ZA16N49H-part1'
                phys_path: 'pci-0000:00:17.0-ata-3'
                whole_disk: 1
                DTL: 28906
                create_txg: 4
                com.delphix:vdev_zap_leaf: 130
                degraded: 1
                aux_state: 'err_exceeded'
            children[1]:
                type: 'disk'
                id: 1
                guid: 4734211194602915183
                path: '/dev/disk/by-id/scsi-SATA_ST6000VN0041-2EL_ZA16N4ZH-part1'
                devid: 'ata-ST6000VN0041-2EL11C_ZA16N4ZH-part1'
                phys_path: 'pci-0000:00:17.0-ata-4'
                whole_disk: 1
                DTL: 28905
                create_txg: 4
                com.delphix:vdev_zap_leaf: 131
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
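
The only parts of that dump needed for a replacement are the per-disk guid and path. A minimal sketch for pairing them up follows. The function name `list_disk_guids` is made up, and it assumes the field order shown above, with `type:` then `guid:` then `path:` inside each disk entry.

```shell
# Hypothetical helper: print "guid path" for each disk in zdb output
# read on stdin. Assumes paths contain no spaces.
list_disk_guids() {
  awk -v q="'" '
    $1 == "type:" { is_disk = ($2 == (q "disk" q)) }   # remember whether this entry is a disk
    $1 == "guid:" { guid = $2 }                        # latest guid seen
    $1 == "path:" && is_disk { gsub(q, "", $2); print guid, $2 }
  '
}

# Usage (needs root):
#   zdb | list_disk_guids
```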

I think the commands I’m gonna need are:

# zpool offline data 9301639020686187487
# zpool status data
# shutdown # and replace disk
# zpool replace data 9301639020686187487 /dev/disk/by-id/scsi-SATA_ST6000DM003-2CY1_WSB076SN
# zpool status data
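
The same plan can be written once with the pool, old GUID, and new device as parameters. This is a dry-run sketch only: the function name `print_replace_plan` is made up, and it just prints the commands listed above so they can be checked before anything is run by hand.

```shell
# Hypothetical helper: print (do not run) the disk replacement steps.
# Arguments: pool name, GUID of the old disk, path of the new disk.
print_replace_plan() {
  if [ "$#" -ne 3 ]; then
    echo "usage: print_replace_plan POOL OLD_GUID NEW_DEVICE" >&2
    return 1
  fi
  printf 'zpool offline %s %s\n' "$1" "$2"
  printf 'zpool status %s\n' "$1"
  printf '# shut down and swap the disk\n'
  printf 'zpool replace %s %s %s\n' "$1" "$2" "$3"
  printf 'zpool status %s\n' "$1"
}

# Usage:
#   print_replace_plan data 9301639020686187487 /dev/disk/by-id/scsi-SATA_ST6000DM003-2CY1_WSB076SN
```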