Details
Type: Bug
Resolution: Fixed
Priority: Minor
Affects Version/s: Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.2
Environment: x86_64, CentOS 7.6.1810, Lustre 2.12.2, ZFS 0.7.13, HPE DL380 Gen 10 connected to D8000 via multipath
Description
Summary
Lustre 2.12 added the ZEDLET /etc/zfs/zed.d/statechange-lustre.sh, which is run when ZFS generates a state change event (resource.fs.zfs.statechange). The ZEDLET runs lctl set_param to change the degraded property (obdfilter.${service}.degraded) of the appropriate target: to 0 if the pool is ONLINE, and to 1 if the pool is DEGRADED. When a pool becomes DEGRADED because a drive is FAULTED or OFFLINE, the property is correctly set to 1. When the pool comes back ONLINE, however, the property is not always reset to 0; whether it is depends on how the drive was brought back ONLINE. While the target is incorrectly marked as degraded, the Lustre filesystem has reduced performance, since the target is not used unless stripes are explicitly assigned to it.
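For reference, the behavior described above boils down to logic along these lines. This is a simplified sketch, not the shipped statechange-lustre.sh: it assumes ZED exports the pool name in ZEVENT_POOL, and the target name ("service") is shown here as a placeholder, whereas the real ZEDLET resolves it from the pool's datasets itself.

#!/bin/sh
# Simplified sketch of the ZEDLET behavior (not the shipped script).
# ZEVENT_POOL is set by ZED for events that name a pool; "service" is a
# placeholder for the resolved Lustre target name.
pool="${ZEVENT_POOL}"
service="lustrefs-OST0004"

# Choose the new value of the degraded flag from the pool's current health.
case "$(zpool list -H -o health "${pool}")" in
    ONLINE)   value=0 ;;
    DEGRADED) value=1 ;;
    *)        exit 0 ;;   # ignore other states (FAULTED, UNAVAIL, ...)
esac

lctl set_param "obdfilter.${service}.degraded=${value}"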
Background
Given a pool which looks like this:
  pool: ost04
 state: ONLINE
config:

        NAME                                    STATE     READ WRITE CKSUM
        ost04                                   ONLINE       0     0     0
          raidz2-0                              ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay041-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay042-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay043-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay044-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay045-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay046-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay047-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay048-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay049-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay101-0  ONLINE       0     0     0
When a drive is taken offline, ZFS offlines the drive, marks the pool DEGRADED, and generates a state change event. For example,
zpool offline ost04 d8000_sep500C0FF03C1AC73E_bay101-0
generates an event which looks like this:
Oct 8 2019 09:22:43.109749814 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0x4b6ac5c4c8d5cb1a
        vdev_state = "OFFLINE" (0x2)
        vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
        vdev_devid = "dm-uuid-mpath-35000c500a63e36f7"
        vdev_laststate = "ONLINE" (0x7)
        time = 0x5d9c9bb3 0x68aa636
        eid = 0x7a
ZED will run statechange-lustre.sh, which will set the degraded property of the target:
# lctl get_param obdfilter.lustrefs-OST0004.degraded
obdfilter.lustrefs-OST0004.degraded=1
When the offline drive is brought online, ZFS generates a state change event, onlines the drive, resilvers the drive, and marks the pool ONLINE. For example,
zpool online ost04 d8000_sep500C0FF03C1AC73E_bay101-0
generates a series of events like this:
Oct 8 2019 09:29:58.726502922 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0x4b6ac5c4c8d5cb1a
        vdev_state = "ONLINE" (0x7)
        vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
        vdev_devid = "dm-uuid-mpath-35000c500a63e36f7"
        vdev_laststate = "OFFLINE" (0x2)
        time = 0x5d9c9d66 0x2b4d8e0a
        eid = 0x7c

Oct 8 2019 09:29:59.261511291 sysevent.fs.zfs.vdev_online
        version = 0x0
        class = "sysevent.fs.zfs.vdev_online"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0x4b6ac5c4c8d5cb1a
        vdev_state = "ONLINE" (0x7)
        vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
        vdev_devid = "dm-uuid-mpath-35000c500a63e36f7"
        time = 0x5d9c9d67 0xf96587b
        eid = 0x7d

Oct 8 2019 09:29:59.341512542 sysevent.fs.zfs.resilver_start
        version = 0x0
        class = "sysevent.fs.zfs.resilver_start"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5d9c9d67 0x145b115e
        eid = 0x7e

Oct 8 2019 09:29:59.341512542 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "cv-g10-oss0.dev.net"
        history_internal_str = "func=2 mintxg=164314 maxtxg=164319"
        history_internal_name = "scan setup"
        history_txg = 0x28236
        history_time = 0x5d9c9d67
        time = 0x5d9c9d67 0x145b115e
        eid = 0x7f

Oct 8 2019 09:29:59.372513027 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "cv-g10-oss0.dev.net"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x28237
        history_time = 0x5d9c9d67
        time = 0x5d9c9d67 0x16341903
        eid = 0x80

Oct 8 2019 09:29:59.372513027 sysevent.fs.zfs.resilver_finish
        version = 0x0
        class = "sysevent.fs.zfs.resilver_finish"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5d9c9d67 0x16341903
        eid = 0x81
ZED will run statechange-lustre.sh, which will reset the degraded property of the target:
# lctl get_param obdfilter.lustrefs-OST0004.degraded
obdfilter.lustrefs-OST0004.degraded=0
When a drive fails in a pool, ZFS faults the drive, marks the pool as DEGRADED, and generates a state change event which looks like this:
Oct 4 2019 09:17:51.637116237 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0x12d754fcdac78110
        vdev_state = "FAULTED" (0x5)
        vdev_path = "/dev/disk/by-id/dm-uuid-mpath-35000c500a647974b"
        vdev_devid = "dm-uuid-mpath-35000c500a647974b"
        vdev_laststate = "ONLINE" (0x7)
        time = 0x5d97548f 0x25f99f4d
        eid = 0x39
ZED will run statechange-lustre.sh, which will set the degraded property of the target:
# lctl get_param obdfilter.lustrefs-OST0004.degraded
obdfilter.lustrefs-OST0004.degraded=1
Problem description
The problem arises when a failed or offline drive is replaced. ZFS will start attaching the replacement drive, resilver it, remove the failed or offline drive, finish attaching the replacement drive, and detach the failed or offline drive. For example:
zpool replace ost04 d8000_sep500C0FF03C1AC73E_bay101-0 d8000_sep500C0FF03C1AC73E_bay050-0
will generate a series of events like this (with sysevent.fs.zfs.config_sync events removed):
Oct 8 2019 10:50:47.193227758 sysevent.fs.zfs.vdev_attach
        version = 0x0
        class = "sysevent.fs.zfs.vdev_attach"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0xb9f869eb9bd523ec
        vdev_state = "ONLINE" (0x7)
        vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay050-0"
        vdev_devid = "dm-uuid-mpath-35000c500a647974b"
        time = 0x5d9cb057 0xb846bee
        eid = 0x9b

Oct 8 2019 10:50:47.274229010 sysevent.fs.zfs.resilver_start
        version = 0x0
        class = "sysevent.fs.zfs.resilver_start"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5d9cb057 0x10586712
        eid = 0x9c

Oct 8 2019 10:50:47.274229010 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "cv-g10-oss0.dev.net"
        history_internal_str = "func=2 mintxg=3 maxtxg=165396"
        history_internal_name = "scan setup"
        history_txg = 0x28614
        history_time = 0x5d9cb057
        time = 0x5d9cb057 0x10586712
        eid = 0x9d

Oct 8 2019 10:50:47.520232812 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "cv-g10-oss0.dev.net"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x28615
        history_time = 0x5d9cb057
        time = 0x5d9cb057 0x1f021f6c
        eid = 0x9e

Oct 8 2019 10:50:47.520232812 sysevent.fs.zfs.resilver_finish
        version = 0x0
        class = "sysevent.fs.zfs.resilver_finish"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5d9cb057 0x1f021f6c
        eid = 0x9f

Oct 8 2019 10:50:47.582233770 sysevent.fs.zfs.vdev_remove
        version = 0x0
        class = "sysevent.fs.zfs.vdev_remove"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0xdb315a1b87016f8c
        vdev_state = "OFFLINE" (0x2)
        vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
        vdev_devid = "dm-uuid-mpath-35000c500a63e36f7"
        time = 0x5d9cb057 0x22b42eaa
        eid = 0xa1

Oct 8 2019 10:50:47.587233848 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "cv-g10-oss0.dev.net"
        history_internal_str = "replace vdev=/dev/mapper/d8000_sep500C0FF03C1AC73E_bay050-0 for vdev=/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
        history_internal_name = "vdev attach"
        history_txg = 0x28616
        history_time = 0x5d9cb057
        time = 0x5d9cb057 0x23007a38
        eid = 0xa2

Oct 8 2019 10:50:52.581311041 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "cv-g10-oss0.dev.net"
        history_internal_str = "vdev=/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
        history_internal_name = "detach"
        history_txg = 0x28617
        history_time = 0x5d9cb05c
        time = 0x5d9cb05c 0x22a61a41
        eid = 0xa4
Since ZFS did not generate a state change event, ZED does not run statechange-lustre.sh, and the degraded property of the target is not reset:
# lctl get_param obdfilter.lustrefs-OST0004.degraded
obdfilter.lustrefs-OST0004.degraded=1
despite the pool being ONLINE:
# zpool status ost04
  pool: ost04
 state: ONLINE
  scan: resilvered 2.62M in 0h0m with 0 errors on Tue Oct 8 10:50:47 2019
config:

        NAME                                    STATE     READ WRITE CKSUM
        ost04                                   ONLINE       0     0     0
          raidz2-0                              ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay041-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay042-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay043-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay044-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay045-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay046-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay047-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay048-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay049-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay050-0  ONLINE       0     0     0

errors: No known data errors
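Until this is fixed, the stale flag has to be cleared by hand. A minimal example, using the target name from above (substitute the appropriate target on another system):

# Manually clear the stale degraded flag on the affected target.
lctl set_param obdfilter.lustrefs-OST0004.degraded=0

# Confirm it took effect.
lctl get_param obdfilter.lustrefs-OST0004.degraded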
Steps to reproduce
- Create a Lustre filesystem using ZFS
- Select a pool to test
- Select a drive to fail from the pool
- Select an unused drive to use as the replacement drive
- Verify the target (MGT, MDT, or OST) corresponding to the pool does not have the degraded property set
- Wipe the replacement drive so it looks like an unused drive
- Fail the selected drive
- Wait for the drive to report faulted
- Replace the failed drive
- Wait for resilvering to finish
- Check the degraded property for the target
I have attached the script test-degraded-drive which implements the above steps. While it does make a few assumptions (datasets are named the same as the pool they are in [e.g., {{ost04/ost04}}] and drives are given by their device mapper names), it should be easy to repurpose it for use on another system.
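For reference, the sequence the script automates looks roughly like the following. This is only a sketch, not the attached test-degraded-drive script: the pool, target, and device names are the ones from the examples above, the wipe and wait steps are simplified, and zpool offline -f is used here as one way to fault the drive.

#!/bin/sh
# Rough sketch of the reproduction sequence (not the attached script).
# Names below come from the examples above; change them for another system.
POOL=ost04
TARGET=lustrefs-OST0004
FAIL_DEV=d8000_sep500C0FF03C1AC73E_bay101-0
SPARE_DEV=d8000_sep500C0FF03C1AC73E_bay050-0

# The target should start out non-degraded.
lctl get_param obdfilter.${TARGET}.degraded

# Wipe the replacement drive so it looks unused, then fault the selected drive.
wipefs -a /dev/mapper/${SPARE_DEV}
zpool offline -f ${POOL} ${FAIL_DEV}

# Replace the faulted drive and wait for resilvering to finish.
zpool replace ${POOL} ${FAIL_DEV} ${SPARE_DEV}
while zpool status ${POOL} | grep -q 'resilver in progress'; do sleep 5; done

# The pool is ONLINE again, but the degraded flag is still 1 (the bug).
zpool status ${POOL}
lctl get_param obdfilter.${TARGET}.degraded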