[LU-12836] target's degraded property not reset after drive replaced Created: 08/Oct/19  Updated: 09/Nov/21  Resolved: 02/Jun/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.2
Fix Version/s: Lustre 2.12.8, Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Christopher Voltz (Inactive) Assignee: Tony Hutter
Resolution: Fixed Votes: 0
Labels: HA, LTS12, zfs
Environment:

x86_64
CentOS 7.6.1810
Lustre 2.12.2
ZFS 0.7.13
HPE DL380 Gen 10 connected to D8000 via multipath


Attachments: HTML File test-degraded-drive    
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

Summary

Lustre 2.12 added the ZEDLET /etc/zfs/zed.d/statechange-lustre.sh, which is run when ZFS generates a state change event (resource.fs.zfs.statechange). The ZEDLET runs lctl set_param to set the degraded property (obdfilter.${service}.degraded) of the corresponding target to 0 if the pool is ONLINE, or to 1 if the pool is DEGRADED. When a pool becomes DEGRADED because a drive is FAULTED or OFFLINE, the property is correctly set to 1. When the pool comes back ONLINE, however, the property is not always reset to 0; it depends on how the drive was brought back ONLINE. While a target is incorrectly marked as degraded, filesystem performance is reduced, since the MDS avoids allocating new objects on a degraded target unless it is needed to satisfy the requested striping.
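
For reference, the ZEDLET's decision logic amounts to something like the minimal sketch below (this is not the shipped statechange-lustre.sh; lustre_service_for_pool is a hypothetical helper that maps a pool name to its Lustre service name, e.g. ost04 -> lustrefs-OST0004):

#!/bin/sh
# Sketch only: ZED exports each nvpair of the triggering event as a
# ZEVENT_* environment variable (ZEVENT_SUBCLASS, ZEVENT_POOL, ...).
[ "$ZEVENT_SUBCLASS" = "statechange" ] || exit 0
[ -n "$ZEVENT_POOL" ] || exit 0

# Hypothetical mapping from pool name to Lustre service name.
service=$(lustre_service_for_pool "$ZEVENT_POOL") || exit 0

# Drive the target's degraded flag from the pool's current health.
case "$(zpool list -H -o health "$ZEVENT_POOL")" in
ONLINE)   lctl set_param "obdfilter.${service}.degraded=0" ;;
DEGRADED) lctl set_param "obdfilter.${service}.degraded=1" ;;
esac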

Background

Given a pool which looks like this:

  pool: ost04
 state: ONLINE
config:

        NAME                                    STATE     READ WRITE CKSUM
        ost04                                   ONLINE       0     0     0
          raidz2-0                              ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay041-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay042-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay043-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay044-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay045-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay046-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay047-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay048-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay049-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay101-0  ONLINE       0     0     0 

 

When a drive is taken offline, ZFS offlines the drive, marks the pool DEGRADED, and generates a state change event. For example,

zpool offline ost04 d8000_sep500C0FF03C1AC73E_bay101-0

generates an event which looks like this:

Oct  8 2019 09:22:43.109749814 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0x4b6ac5c4c8d5cb1a
        vdev_state = "OFFLINE" (0x2)
        vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
        vdev_devid = "dm-uuid-mpath-35000c500a63e36f7"
        vdev_laststate = "ONLINE" (0x7)
        time = 0x5d9c9bb3 0x68aa636
        eid = 0x7a

ZED will run statechange-lustre.sh, which will set the degraded property of the target:

# lctl get_param obdfilter.lustrefs-OST0004.degraded
obdfilter.lustrefs-OST0004.degraded=1

 

When the offline drive is brought online, ZFS generates a state change event, onlines the drive, resilvers the drive, and marks the pool ONLINE. For example,

zpool online ost04 d8000_sep500C0FF03C1AC73E_bay101-0

generates a series of events like this:

Oct  8 2019 09:29:58.726502922 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0x4b6ac5c4c8d5cb1a
        vdev_state = "ONLINE" (0x7)
        vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
        vdev_devid = "dm-uuid-mpath-35000c500a63e36f7"
        vdev_laststate = "OFFLINE" (0x2)
        time = 0x5d9c9d66 0x2b4d8e0a
        eid = 0x7c

Oct  8 2019 09:29:59.261511291 sysevent.fs.zfs.vdev_online
        version = 0x0
        class = "sysevent.fs.zfs.vdev_online"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0x4b6ac5c4c8d5cb1a
        vdev_state = "ONLINE" (0x7)
        vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
        vdev_devid = "dm-uuid-mpath-35000c500a63e36f7"
        time = 0x5d9c9d67 0xf96587b
        eid = 0x7d

Oct  8 2019 09:29:59.341512542 sysevent.fs.zfs.resilver_start
        version = 0x0
        class = "sysevent.fs.zfs.resilver_start"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5d9c9d67 0x145b115e
        eid = 0x7e

Oct  8 2019 09:29:59.341512542 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "cv-g10-oss0.dev.net"
        history_internal_str = "func=2 mintxg=164314 maxtxg=164319"
        history_internal_name = "scan setup"
        history_txg = 0x28236
        history_time = 0x5d9c9d67
        time = 0x5d9c9d67 0x145b115e
        eid = 0x7f

Oct  8 2019 09:29:59.372513027 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "cv-g10-oss0.dev.net"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x28237
        history_time = 0x5d9c9d67
        time = 0x5d9c9d67 0x16341903
        eid = 0x80

Oct  8 2019 09:29:59.372513027 sysevent.fs.zfs.resilver_finish
        version = 0x0
        class = "sysevent.fs.zfs.resilver_finish"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5d9c9d67 0x16341903
        eid = 0x81

ZED will run statechange-lustre.sh, which will reset the degraded property of the target:

# lctl get_param obdfilter.lustrefs-OST0004.degraded
obdfilter.lustrefs-OST0004.degraded=0

 
When a drive fails in a pool, ZFS faults the drive, marks the pool as DEGRADED, and generates a state change event which looks like this:

Oct  4 2019 09:17:51.637116237 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0x12d754fcdac78110
        vdev_state = "FAULTED" (0x5)
        vdev_path = "/dev/disk/by-id/dm-uuid-mpath-35000c500a647974b"
        vdev_devid = "dm-uuid-mpath-35000c500a647974b"
        vdev_laststate = "ONLINE" (0x7)
        time = 0x5d97548f 0x25f99f4d
        eid = 0x39

ZED will run statechange-lustre.sh, which will set the degraded property of the target:

# lctl get_param obdfilter.lustrefs-OST0004.degraded
obdfilter.lustrefs-OST0004.degraded=1

Problem description

The problem arises when a failed or offline drive is replaced. ZFS starts attaching the replacement drive, resilvers it, removes the failed or offline drive, finishes attaching the replacement, and finally detaches the old drive. For example:

zpool replace ost04 d8000_sep500C0FF03C1AC73E_bay101-0 d8000_sep500C0FF03C1AC73E_bay050-0

generates a series of events like this (with sysevent.fs.zfs.config_sync events removed):

Oct  8 2019 10:50:47.193227758 sysevent.fs.zfs.vdev_attach
        version = 0x0
        class = "sysevent.fs.zfs.vdev_attach"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0xb9f869eb9bd523ec
        vdev_state = "ONLINE" (0x7)
        vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay050-0"
        vdev_devid = "dm-uuid-mpath-35000c500a647974b"
        time = 0x5d9cb057 0xb846bee
        eid = 0x9b

Oct  8 2019 10:50:47.274229010 sysevent.fs.zfs.resilver_start
        version = 0x0
        class = "sysevent.fs.zfs.resilver_start"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5d9cb057 0x10586712
        eid = 0x9c

Oct  8 2019 10:50:47.274229010 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "cv-g10-oss0.dev.net"
        history_internal_str = "func=2 mintxg=3 maxtxg=165396"
        history_internal_name = "scan setup"
        history_txg = 0x28614
        history_time = 0x5d9cb057
        time = 0x5d9cb057 0x10586712
        eid = 0x9d

Oct  8 2019 10:50:47.520232812 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "cv-g10-oss0.dev.net"
        history_internal_str = "errors=0"
        history_internal_name = "scan done"
        history_txg = 0x28615
        history_time = 0x5d9cb057
        time = 0x5d9cb057 0x1f021f6c
        eid = 0x9e

Oct  8 2019 10:50:47.520232812 sysevent.fs.zfs.resilver_finish
        version = 0x0
        class = "sysevent.fs.zfs.resilver_finish"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        time = 0x5d9cb057 0x1f021f6c
        eid = 0x9f

Oct  8 2019 10:50:47.582233770 sysevent.fs.zfs.vdev_remove
        version = 0x0
        class = "sysevent.fs.zfs.vdev_remove"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0xdb315a1b87016f8c
        vdev_state = "OFFLINE" (0x2)
        vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
        vdev_devid = "dm-uuid-mpath-35000c500a63e36f7"
        time = 0x5d9cb057 0x22b42eaa
        eid = 0xa1

Oct  8 2019 10:50:47.587233848 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "cv-g10-oss0.dev.net"
        history_internal_str = "replace vdev=/dev/mapper/d8000_sep500C0FF03C1AC73E_bay050-0 for vdev=/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
        history_internal_name = "vdev attach"
        history_txg = 0x28616
        history_time = 0x5d9cb057
        time = 0x5d9cb057 0x23007a38
        eid = 0xa2

Oct  8 2019 10:50:52.581311041 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "ost04"
        pool_guid = 0x8159dca79b3945a4
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "cv-g10-oss0.dev.net"
        history_internal_str = "vdev=/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
        history_internal_name = "detach"
        history_txg = 0x28617
        history_time = 0x5d9cb05c
        time = 0x5d9cb05c 0x22a61a41
        eid = 0xa4

Since ZFS does not generate a state change event for this sequence, ZED never runs statechange-lustre.sh, and the degraded property of the target is not reset:

# lctl get_param obdfilter.lustrefs-OST0004.degraded
obdfilter.lustrefs-OST0004.degraded=1

despite the pool being ONLINE:

# zpool status ost04
   pool: ost04
 state: ONLINE
  scan: resilvered 2.62M in 0h0m with 0 errors on Tue Oct  8 10:50:47 2019
config:

        NAME                                    STATE     READ WRITE CKSUM
        ost04                                   ONLINE       0     0     0
          raidz2-0                              ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay041-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay042-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay043-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay044-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay045-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay046-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay047-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay048-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay049-0  ONLINE       0     0     0
            d8000_sep500C0FF03C1AC73E_bay050-0  ONLINE       0     0     0

errors: No known data errors
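
As a manual workaround, the stale flag can be cleared by hand once the pool reports ONLINE again (the target name below is this system's OST; substitute the appropriate one):

# lctl set_param obdfilter.lustrefs-OST0004.degraded=0
obdfilter.lustrefs-OST0004.degraded=0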

Steps to reproduce

  1. Create a Lustre filesystem using ZFS
  2. Select a pool to test
  3. Select a drive to fail from the pool
  4. Select an unused drive to use as the replacement drive
  5. Verify the target (MGT, MDT, or OST) corresponding to the pool does not have the degraded property set
  6. Wipe the replacement drive so it looks like an unused drive
  7. Fail the selected drive
  8. Wait for the drive to report faulted
  9. Replace the failed drive
  10. Wait for resilvering to finish
  11. Check the degraded property for the target

I have attached the script test-degraded-drive, which implements the above steps. While it makes a few assumptions (datasets are named after the pool they are in, e.g. ost04/ost04, and drives are referenced by their device-mapper names), it should be easy to repurpose for another system; a condensed sketch of the same flow follows.
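
For a quick ad-hoc run, the core of the reproduction looks roughly like this (a sketch only: pool, service, and drive names are illustrative, and zpool offline -f is used here to force a FAULTED state in place of a real drive failure; the attached script is the complete version):

#!/bin/sh
set -e

POOL=ost04
SERVICE=lustrefs-OST0004                         # target backed by $POOL
FAIL_DRIVE=d8000_sep500C0FF03C1AC73E_bay101-0    # drive to fail
SPARE_DRIVE=d8000_sep500C0FF03C1AC73E_bay050-0   # replacement drive

lctl get_param "obdfilter.${SERVICE}.degraded"   # expect 0

# Wipe the replacement so ZFS treats it as an unused drive.
wipefs -a "/dev/mapper/${SPARE_DRIVE}"

# Force the selected drive into a FAULTED state and wait for ZFS to notice.
zpool offline -f "$POOL" "$FAIL_DRIVE"
until zpool status "$POOL" | grep -q FAULTED; do sleep 1; done

lctl get_param "obdfilter.${SERVICE}.degraded"   # expect 1

# Replace the faulted drive and wait for resilvering to finish.
zpool replace "$POOL" "$FAIL_DRIVE" "$SPARE_DRIVE"
while zpool status "$POOL" | grep -q "in progress"; do sleep 5; done

lctl get_param "obdfilter.${SERVICE}.degraded"   # bug: still 1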



 Comments   
Comment by Peter Jones [ 08/Oct/19 ]

Nathaniel

Could you please advise on this one?

Thanks

Peter

Comment by Christopher Voltz (Inactive) [ 09/Oct/19 ]

Verified the behavior is the same with ZFS 0.7.9.

Was there a test plan for this feature? If so, where is it?

Comment by Nathaniel Clark [ 09/Oct/19 ]

This looks like it's a ZFS issue. zfs::vdev_set_state() seems to not be posting the statechange event to ZED.
This block of code: https://github.com/zfsonlinux/zfs/blob/zfs-0.7-release/module/zfs/vdev.c#L3628-L3638
The code is the same on master: https://github.com/zfsonlinux/zfs/blob/master/module/zfs/vdev.c#L4548-L4562

I think the "filter out state change due to initial vdev_open" check is also filtering out the DEGRADED -> ONLINE state change.

Comment by Christopher Voltz (Inactive) [ 09/Oct/19 ]

The state change event is generated when the pool goes from DEGRADED to ONLINE if the drive in the vdev comes back from OFFLINE rather than FAULTED. I don't have any proof that the state change event was ever generated for the replace scenario, so I'm not convinced your theory is correct, but I'll go ahead and create a ZFS issue for the problem.

Comment by Christopher Voltz (Inactive) [ 09/Oct/19 ]

Created ZFS issue "statechange event not always generated when VDEV state changes" (#9437).

Comment by Christopher Voltz (Inactive) [ 25/Oct/19 ]

Created ZFS pull request 9513.
 

Comment by Christopher Voltz (Inactive) [ 28/Oct/19 ]

Replaced ZFS pull request 9513 with 9523.

Comment by Gerrit Updater [ 05/May/21 ]

Tony Hutter (hutter2@llnl.gov) uploaded a new patch: https://review.whamcloud.com/43552
Subject: LU-12836 osd-zfs: Catch all ZFS pool change events
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 524baf41eab96348d34ce70ba423392cdffc8c3d

Comment by Gerrit Updater [ 02/Jun/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43552/
Subject: LU-12836 osd-zfs: Catch all ZFS pool change events
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e11a47da71a2e2482e4c4cf582d663cd76a2ecab

Comment by Peter Jones [ 02/Jun/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 04/Jun/21 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43929
Subject: LU-12836 osd-zfs: Catch all ZFS pool change events
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: dd089ff7a49d4e2fb735adde206ab4b802727a2a

Comment by Gerrit Updater [ 07/Jul/21 ]

Gian-Carlo DeFazio (defazio1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/44171
Subject: LU-12836 osd-zfs: Catch all ZFS pool change events
Project: fs/lustre-release
Branch: b2_14
Current Patch Set: 1
Commit: 11929c1491c51aef21be1606b4b0a5d0035f750d

Comment by Gerrit Updater [ 10/Aug/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43929/
Subject: LU-12836 osd-zfs: Catch all ZFS pool change events
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 5b7627beef0c35a0fc47a416c5fbb9bbf2933f3a

Comment by Gerrit Updater [ 09/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44171/
Subject: LU-12836 osd-zfs: Catch all ZFS pool change events
Project: fs/lustre-release
Branch: b2_14
Current Patch Set:
Commit: bb5aa50947039cbe4217468d3a941d9883b63e81
