Lustre / LU-12836

target's degraded property not reset after drive replaced

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.2
    • Fix Version/s: None
    • Labels:
    • Environment:
      x86_64
      CentOS 7.6.1810
      Lustre 2.12.2
      ZFS 0.7.13
      HPE DL380 Gen 10 connected to D8000 via multipath
    • Severity: 2

      Description

      Summary

      Lustre 2.12 added the ZEDLET /etc/zfs/zed.d/statechange-lustre.sh, which is run when ZFS generates a state change event (resource.fs.zfs.statechange). The ZEDLET runs lctl set_param to change the degraded property (obdfilter.${service}.degraded) of the appropriate target: to 0 if the pool is ONLINE, and to 1 if the pool is DEGRADED. When a pool becomes DEGRADED because a drive is FAULTED or OFFLINE, the property is correctly set to 1. When the pool comes back ONLINE, however, the property is not always reset to 0; whether it is depends on how the drive was brought back ONLINE. While a target is incorrectly marked as degraded, filesystem performance is reduced, because the MDS avoids allocating new file stripes to a degraded target unless there is no alternative.
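
      The mapping the ZEDLET applies can be sketched as a small shell function. This is a simplified illustration of the logic described above, not the actual statechange-lustre.sh (which is driven by ZED event variables and looks up the targets backed by the pool):

```shell
#!/bin/sh
# Simplified sketch: map a pool's health to the value that should be
# written to obdfilter.${service}.degraded. Not the real ZEDLET.
degraded_value_for_health() {
    case "$1" in
        ONLINE)   echo 0 ;;  # healthy pool: clear the flag
        DEGRADED) echo 1 ;;  # degraded pool: set the flag
        *)        echo - ;;  # other states: leave the flag alone
    esac
}

# The ZEDLET then effectively runs, for each target backed by the pool:
#   lctl set_param obdfilter.${service}.degraded=<value>
degraded_value_for_health ONLINE    # prints 0
degraded_value_for_health DEGRADED  # prints 1
```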

      Background

      Given a pool which looks like this:

        pool: ost04
       state: ONLINE
      config:

              NAME                                    STATE     READ WRITE CKSUM
              ost04                                   ONLINE       0     0     0
                raidz2-0                              ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay041-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay042-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay043-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay044-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay045-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay046-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay047-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay048-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay049-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay101-0  ONLINE       0     0     0 

       

      When a drive is taken offline, ZFS offlines the drive, marks the pool DEGRADED, and generates a state change event. For example,

      zpool offline ost04 d8000_sep500C0FF03C1AC73E_bay101-0
      

      generates an event which looks like this:

      Oct  8 2019 09:22:43.109749814 resource.fs.zfs.statechange
              version = 0x0
              class = "resource.fs.zfs.statechange"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              vdev_guid = 0x4b6ac5c4c8d5cb1a
              vdev_state = "OFFLINE" (0x2)
              vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
              vdev_devid = "dm-uuid-mpath-35000c500a63e36f7"
              vdev_laststate = "ONLINE" (0x7)
              time = 0x5d9c9bb3 0x68aa636
              eid = 0x7a
      

      ZED will run statechange-lustre.sh, which will set the degraded property of the target:

      # lctl get_param obdfilter.lustrefs-OST0004.degraded
      obdfilter.lustrefs-OST0004.degraded=1
      

       

      When the offline drive is brought back online, ZFS generates a state change event, brings the drive online, resilvers it, and marks the pool ONLINE. For example,

      zpool online ost04 d8000_sep500C0FF03C1AC73E_bay101-0
      

      generates a series of events like this:

      Oct  8 2019 09:29:58.726502922 resource.fs.zfs.statechange
              version = 0x0
              class = "resource.fs.zfs.statechange"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              vdev_guid = 0x4b6ac5c4c8d5cb1a
              vdev_state = "ONLINE" (0x7)
              vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
              vdev_devid = "dm-uuid-mpath-35000c500a63e36f7"
              vdev_laststate = "OFFLINE" (0x2)
              time = 0x5d9c9d66 0x2b4d8e0a
              eid = 0x7c
      
      Oct  8 2019 09:29:59.261511291 sysevent.fs.zfs.vdev_online
              version = 0x0
              class = "sysevent.fs.zfs.vdev_online"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              vdev_guid = 0x4b6ac5c4c8d5cb1a
              vdev_state = "ONLINE" (0x7)
              vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
              vdev_devid = "dm-uuid-mpath-35000c500a63e36f7"
              time = 0x5d9c9d67 0xf96587b
              eid = 0x7d
      
      Oct  8 2019 09:29:59.341512542 sysevent.fs.zfs.resilver_start
              version = 0x0
              class = "sysevent.fs.zfs.resilver_start"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              time = 0x5d9c9d67 0x145b115e
              eid = 0x7e
      
      Oct  8 2019 09:29:59.341512542 sysevent.fs.zfs.history_event
              version = 0x0
              class = "sysevent.fs.zfs.history_event"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              history_hostname = "cv-g10-oss0.dev.net"
              history_internal_str = "func=2 mintxg=164314 maxtxg=164319"
              history_internal_name = "scan setup"
              history_txg = 0x28236
              history_time = 0x5d9c9d67
              time = 0x5d9c9d67 0x145b115e
              eid = 0x7f
      
      Oct  8 2019 09:29:59.372513027 sysevent.fs.zfs.history_event
              version = 0x0
              class = "sysevent.fs.zfs.history_event"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              history_hostname = "cv-g10-oss0.dev.net"
              history_internal_str = "errors=0"
              history_internal_name = "scan done"
              history_txg = 0x28237
              history_time = 0x5d9c9d67
              time = 0x5d9c9d67 0x16341903
              eid = 0x80
      
      Oct  8 2019 09:29:59.372513027 sysevent.fs.zfs.resilver_finish
              version = 0x0
              class = "sysevent.fs.zfs.resilver_finish"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              time = 0x5d9c9d67 0x16341903
              eid = 0x81
      

      ZED will run statechange-lustre.sh, which will reset the degraded property of the target:

      # lctl get_param obdfilter.lustrefs-OST0004.degraded
      obdfilter.lustrefs-OST0004.degraded=0
      

       
      When a drive fails in a pool, ZFS faults the drive, marks the pool as DEGRADED, and generates a state change event which looks like this:

      Oct  4 2019 09:17:51.637116237 resource.fs.zfs.statechange
              version = 0x0
              class = "resource.fs.zfs.statechange"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              vdev_guid = 0x12d754fcdac78110
              vdev_state = "FAULTED" (0x5)
              vdev_path = "/dev/disk/by-id/dm-uuid-mpath-35000c500a647974b"
              vdev_devid = "dm-uuid-mpath-35000c500a647974b"
              vdev_laststate = "ONLINE" (0x7)
              time = 0x5d97548f 0x25f99f4d
              eid = 0x39
      

      ZED will run statechange-lustre.sh, which will set the degraded property of the target:

      # lctl get_param obdfilter.lustrefs-OST0004.degraded
      obdfilter.lustrefs-OST0004.degraded=1

      Problem description

      The problem arises when a failed or offline drive is replaced. ZFS begins attaching the replacement drive, resilvers it, removes the failed or offline drive, finishes attaching the replacement, and finally detaches the old drive. For example:

      zpool replace ost04 d8000_sep500C0FF03C1AC73E_bay101-0 d8000_sep500C0FF03C1AC73E_bay050-0
      

      generates a series of events like these (with sysevent.fs.zfs.config_sync events removed):

      Oct  8 2019 10:50:47.193227758 sysevent.fs.zfs.vdev_attach
              version = 0x0
              class = "sysevent.fs.zfs.vdev_attach"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              vdev_guid = 0xb9f869eb9bd523ec
              vdev_state = "ONLINE" (0x7)
              vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay050-0"
              vdev_devid = "dm-uuid-mpath-35000c500a647974b"
              time = 0x5d9cb057 0xb846bee
              eid = 0x9b
      
      Oct  8 2019 10:50:47.274229010 sysevent.fs.zfs.resilver_start
              version = 0x0
              class = "sysevent.fs.zfs.resilver_start"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              time = 0x5d9cb057 0x10586712
              eid = 0x9c
      
      Oct  8 2019 10:50:47.274229010 sysevent.fs.zfs.history_event
              version = 0x0
              class = "sysevent.fs.zfs.history_event"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              history_hostname = "cv-g10-oss0.dev.net"
              history_internal_str = "func=2 mintxg=3 maxtxg=165396"
              history_internal_name = "scan setup"
              history_txg = 0x28614
              history_time = 0x5d9cb057
              time = 0x5d9cb057 0x10586712
              eid = 0x9d
      
      Oct  8 2019 10:50:47.520232812 sysevent.fs.zfs.history_event
              version = 0x0
              class = "sysevent.fs.zfs.history_event"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              history_hostname = "cv-g10-oss0.dev.net"
              history_internal_str = "errors=0"
              history_internal_name = "scan done"
              history_txg = 0x28615
              history_time = 0x5d9cb057
              time = 0x5d9cb057 0x1f021f6c
              eid = 0x9e
      
      Oct  8 2019 10:50:47.520232812 sysevent.fs.zfs.resilver_finish
              version = 0x0
              class = "sysevent.fs.zfs.resilver_finish"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              time = 0x5d9cb057 0x1f021f6c
              eid = 0x9f
      
      Oct  8 2019 10:50:47.582233770 sysevent.fs.zfs.vdev_remove
              version = 0x0
              class = "sysevent.fs.zfs.vdev_remove"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              vdev_guid = 0xdb315a1b87016f8c
              vdev_state = "OFFLINE" (0x2)
              vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
              vdev_devid = "dm-uuid-mpath-35000c500a63e36f7"
              time = 0x5d9cb057 0x22b42eaa
              eid = 0xa1
      
      Oct  8 2019 10:50:47.587233848 sysevent.fs.zfs.history_event
              version = 0x0
              class = "sysevent.fs.zfs.history_event"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              history_hostname = "cv-g10-oss0.dev.net"
              history_internal_str = "replace vdev=/dev/mapper/d8000_sep500C0FF03C1AC73E_bay050-0 for vdev=/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
              history_internal_name = "vdev attach"
              history_txg = 0x28616
              history_time = 0x5d9cb057
              time = 0x5d9cb057 0x23007a38
              eid = 0xa2
      
      Oct  8 2019 10:50:52.581311041 sysevent.fs.zfs.history_event
              version = 0x0
              class = "sysevent.fs.zfs.history_event"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              history_hostname = "cv-g10-oss0.dev.net"
              history_internal_str = "vdev=/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
              history_internal_name = "detach"
              history_txg = 0x28617
              history_time = 0x5d9cb05c
              time = 0x5d9cb05c 0x22a61a41
              eid = 0xa4
      

      Since ZFS did not generate a state change event, ZED does not run statechange-lustre.sh, and the degraded property of the target is not reset:

      # lctl get_param obdfilter.lustrefs-OST0004.degraded
      obdfilter.lustrefs-OST0004.degraded=1
      

      despite the pool being ONLINE:

      # zpool status ost04
         pool: ost04
       state: ONLINE
        scan: resilvered 2.62M in 0h0m with 0 errors on Tue Oct  8 10:50:47 2019
      config:

              NAME                                    STATE     READ WRITE CKSUM
              ost04                                   ONLINE       0     0     0
                raidz2-0                              ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay041-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay042-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay043-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay044-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay045-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay046-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay047-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay048-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay049-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay050-0  ONLINE       0     0     0
      
      errors: No known data errors
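
      Until the ZEDLETs handle this case, one possible mitigation would be a ZEDLET keyed to resilver completion that clears the flag once the pool is healthy again. The sketch below is hypothetical (the script name and the lack of a per-pool service lookup are assumptions; lustrefs-OST0004 is just the example target from this report). ZED exports ZEVENT_POOL to ZEDLETs:

```shell
#!/bin/sh
# Hypothetical resilver_finish ZEDLET sketch: if the pool is ONLINE
# after a resilver completes, clear the target's degraded flag.
# The mapping from pool to Lustre service(s) is omitted here.
should_clear_degraded() {
    # $1: pool health, as printed by `zpool list -H -o health <pool>`
    [ "$1" = "ONLINE" ]
}

if [ -n "${ZEVENT_POOL:-}" ]; then
    health=$(zpool list -H -o health "$ZEVENT_POOL" 2>/dev/null)
    if should_clear_degraded "$health"; then
        # example target name; a real ZEDLET would look this up
        lctl set_param obdfilter.lustrefs-OST0004.degraded=0
    fi
fi
```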

      Steps to reproduce

      1. Create a Lustre filesystem using ZFS
      2. Select a pool to test
      3. Select a drive to fail from the pool
      4. Select an unused drive to use as the replacement drive
      5. Verify the target (MGT, MDT, or OST) corresponding to the pool does not have the degraded property set
      6. Wipe the replacement drive so it looks like an unused drive
      7. Fail the selected drive
      8. Wait for the drive to report faulted
      9. Replace the failed drive
      10. Wait for resilvering to finish
      11. Check the degraded property for the target

      I have attached the script test-degraded-drive, which implements the above steps. While it makes a few assumptions (datasets are named after the pool they are in, e.g., ost04/ost04, and drives are given by their device-mapper names), it should be easy to adapt for use on another system.
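
      Steps 7-11 can be condensed to something like the sketch below (names are the examples from this report; the attached test-degraded-drive script is the complete, authoritative version). It prints the commands by default; set DRYRUN=0 to actually run them:

```shell
#!/bin/sh
# Hypothetical condensed reproduction of steps 7-11.
pool=ost04
target=lustrefs-OST0004
bad=d8000_sep500C0FF03C1AC73E_bay101-0
new=d8000_sep500C0FF03C1AC73E_bay050-0

# Print each command; only execute it when DRYRUN=0.
run() {
    echo "+ $*"
    if [ "${DRYRUN:-1}" = "0" ]; then "$@"; fi
}

run zpool offline -f "$pool" "$bad"              # step 7: fault the drive
run zpool replace "$pool" "$bad" "$new"          # step 9: replace it
run zpool status "$pool"                         # step 10: poll until resilver finishes
run lctl get_param "obdfilter.$target.degraded"  # step 11: check the flag
```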

            People

            • Assignee: Nathaniel Clark (utopiabound)
            • Reporter: Christopher Voltz (voltz)
            • Votes: 0
            • Watchers: 4
