LU-12836: target's degraded property not reset after drive replaced

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.8, Lustre 2.15.0
    • Affects Version/s: Lustre 2.12.0, Lustre 2.12.1, Lustre 2.12.2
    • Environment: x86_64
      CentOS 7.6.1810
      Lustre 2.12.2
      ZFS 0.7.13
      HPE DL380 Gen 10 connected to D8000 via multipath
    • Severity: 2

    Description

      Summary

      Lustre 2.12 added the ZEDLET /etc/zfs/zed.d/statechange-lustre.sh, which is run whenever ZFS generates a state change event (resource.fs.zfs.statechange). The ZEDLET runs lctl set_param to set the degraded property (obdfilter.${service}.degraded) of the corresponding target to 0 if the pool is ONLINE and to 1 if the pool is DEGRADED. When a pool becomes DEGRADED because a drive is FAULTED or OFFLINE, the property is correctly set to 1. When the pool comes back ONLINE, however, the property is not always reset to 0; whether it is depends on how the drive was brought back ONLINE. A target that is incorrectly left marked as degraded reduces filesystem performance, because the MDS avoids placing new file stripes on a degraded target unless it is explicitly requested or there are not enough healthy targets.
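      For reference, the ZEDLET's job amounts to mapping the pool's overall health onto the degraded flag of the corresponding target. The following is only a minimal sketch of that logic, not the shipped statechange-lustre.sh; in particular, looking up the Lustre service name through the lustre:svname dataset property is an assumption about how the target is identified:

      #!/bin/sh
      # Sketch of the ZEDLET's logic (not the shipped statechange-lustre.sh).
      # ZED exports event fields to ZEDLETs as environment variables such as
      # ZEVENT_POOL (the pool the event refers to).
      [ -n "${ZEVENT_POOL}" ] || exit 1

      # Map the pool's overall health to the degraded flag.
      case "$(zpool list -H -o health "${ZEVENT_POOL}")" in
              ONLINE)   degraded=0 ;;
              DEGRADED) degraded=1 ;;
              *)        exit 0 ;;   # ignore other pool states
      esac

      # Assumption: each Lustre target records its service name (e.g.
      # lustrefs-OST0004) in the lustre:svname property of its dataset.
      zfs list -H -o lustre:svname -r "${ZEVENT_POOL}" 2>/dev/null |
      while read -r svname; do
              [ "${svname}" = "-" ] && continue
              lctl set_param obdfilter."${svname}".degraded="${degraded}"
      done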

      Background

      Given a pool which looks like this:

        pool: ost04
       state: ONLINE
      config:

              NAME                                    STATE     READ WRITE CKSUM
              ost04                                   ONLINE       0     0     0
                raidz2-0                              ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay041-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay042-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay043-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay044-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay045-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay046-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay047-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay048-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay049-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay101-0  ONLINE       0     0     0 

       

      When a drive is taken offline, ZFS offlines the drive, marks the pool DEGRADED, and generates a state change event. For example,

      zpool offline ost04 d8000_sep500C0FF03C1AC73E_bay101-0
      

      generates an event which looks like this:

      Oct  8 2019 09:22:43.109749814 resource.fs.zfs.statechange
              version = 0x0
              class = "resource.fs.zfs.statechange"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              vdev_guid = 0x4b6ac5c4c8d5cb1a
              vdev_state = "OFFLINE" (0x2)
              vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
              vdev_devid = "dm-uuid-mpath-35000c500a63e36f7"
              vdev_laststate = "ONLINE" (0x7)
              time = 0x5d9c9bb3 0x68aa636
              eid = 0x7a
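
      (The event listings in this ticket match the format printed by zpool events -v, which is presumably how they were captured; the same command can be used to watch for the state change event while reproducing the problem.)

      # Dump the events ZFS has recorded so far, with all nvpair details:
      zpool events -v
      # Or follow new events as they are generated:
      zpool events -vf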
      

      ZED will run statechange-lustre.sh, which will set the degraded property of the target:

      # lctl get_param obdfilter.lustrefs-OST0004.degraded
      obdfilter.lustrefs-OST0004.degraded=1
      

       

      When the offline drive is brought online, ZFS generates a state change event, onlines the drive, resilvers the drive, and marks the pool ONLINE. For example,

      zpool online ost04 d8000_sep500C0FF03C1AC73E_bay101-0
      

      generates a series of events like this:

      Oct  8 2019 09:29:58.726502922 resource.fs.zfs.statechange
              version = 0x0
              class = "resource.fs.zfs.statechange"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              vdev_guid = 0x4b6ac5c4c8d5cb1a
              vdev_state = "ONLINE" (0x7)
              vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
              vdev_devid = "dm-uuid-mpath-35000c500a63e36f7"
              vdev_laststate = "OFFLINE" (0x2)
              time = 0x5d9c9d66 0x2b4d8e0a
              eid = 0x7c
      
      Oct  8 2019 09:29:59.261511291 sysevent.fs.zfs.vdev_online
              version = 0x0
              class = "sysevent.fs.zfs.vdev_online"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              vdev_guid = 0x4b6ac5c4c8d5cb1a
              vdev_state = "ONLINE" (0x7)
              vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
              vdev_devid = "dm-uuid-mpath-35000c500a63e36f7"
              time = 0x5d9c9d67 0xf96587b
              eid = 0x7d
      
      Oct  8 2019 09:29:59.341512542 sysevent.fs.zfs.resilver_start
              version = 0x0
              class = "sysevent.fs.zfs.resilver_start"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              time = 0x5d9c9d67 0x145b115e
              eid = 0x7e
      
      Oct  8 2019 09:29:59.341512542 sysevent.fs.zfs.history_event
              version = 0x0
              class = "sysevent.fs.zfs.history_event"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              history_hostname = "cv-g10-oss0.dev.net"
              history_internal_str = "func=2 mintxg=164314 maxtxg=164319"
              history_internal_name = "scan setup"
              history_txg = 0x28236
              history_time = 0x5d9c9d67
              time = 0x5d9c9d67 0x145b115e
              eid = 0x7f
      
      Oct  8 2019 09:29:59.372513027 sysevent.fs.zfs.history_event
              version = 0x0
              class = "sysevent.fs.zfs.history_event"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              history_hostname = "cv-g10-oss0.dev.net"
              history_internal_str = "errors=0"
              history_internal_name = "scan done"
              history_txg = 0x28237
              history_time = 0x5d9c9d67
              time = 0x5d9c9d67 0x16341903
              eid = 0x80
      
      Oct  8 2019 09:29:59.372513027 sysevent.fs.zfs.resilver_finish
              version = 0x0
              class = "sysevent.fs.zfs.resilver_finish"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              time = 0x5d9c9d67 0x16341903
              eid = 0x81
      

      ZED will run statechange-lustre.sh, which will reset the degraded property of the target:

      # lctl get_param obdfilter.lustrefs-OST0004.degraded
      obdfilter.lustrefs-OST0004.degraded=0
      

       
      When a drive fails in a pool, ZFS faults the drive, marks the pool as DEGRADED, and generates a state change event which looks like this:

      Oct  4 2019 09:17:51.637116237 resource.fs.zfs.statechange
              version = 0x0
              class = "resource.fs.zfs.statechange"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              vdev_guid = 0x12d754fcdac78110
              vdev_state = "FAULTED" (0x5)
              vdev_path = "/dev/disk/by-id/dm-uuid-mpath-35000c500a647974b"
              vdev_devid = "dm-uuid-mpath-35000c500a647974b"
              vdev_laststate = "ONLINE" (0x7)
              time = 0x5d97548f 0x25f99f4d
              eid = 0x39
      

      ZED will run statechange-lustre.sh, which will set the degraded property of the target:

      # lctl get_param obdfilter.lustrefs-OST0004.degraded
      obdfilter.lustrefs-OST0004.degraded=1
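
      For testing, the FAULTED state can be reached without physically failing a drive. The commands below are an assumption about how to simulate the failure, not necessarily how the fault above was produced:

      # Force the drive into a faulted state instead of merely offlining it:
      zpool offline -f ost04 d8000_sep500C0FF03C1AC73E_bay101-0

      # Alternatively, inject a persistent vdev fault with the ZFS fault injector:
      zinject -d d8000_sep500C0FF03C1AC73E_bay101-0 -A fault ost04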

      Problem description

      The problem arises when a failed or offline drive is replaced. ZFS will start attaching the replacement drive, resilver it, remove the failed or offline drive, finish attaching the replacement drive, and detach the failed or offline drive. For example:

      zpool replace ost04 d8000_sep500C0FF03C1AC73E_bay101-0 d8000_sep500C0FF03C1AC73E_bay050-0
      

      will generate a series of events like this (with the sysevent.fs.zfs.config_sync events removed):

      Oct  8 2019 10:50:47.193227758 sysevent.fs.zfs.vdev_attach
              version = 0x0
              class = "sysevent.fs.zfs.vdev_attach"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              vdev_guid = 0xb9f869eb9bd523ec
              vdev_state = "ONLINE" (0x7)
              vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay050-0"
              vdev_devid = "dm-uuid-mpath-35000c500a647974b"
              time = 0x5d9cb057 0xb846bee
              eid = 0x9b
      
      Oct  8 2019 10:50:47.274229010 sysevent.fs.zfs.resilver_start
              version = 0x0
              class = "sysevent.fs.zfs.resilver_start"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              time = 0x5d9cb057 0x10586712
              eid = 0x9c
      
      Oct  8 2019 10:50:47.274229010 sysevent.fs.zfs.history_event
              version = 0x0
              class = "sysevent.fs.zfs.history_event"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              history_hostname = "cv-g10-oss0.dev.net"
              history_internal_str = "func=2 mintxg=3 maxtxg=165396"
              history_internal_name = "scan setup"
              history_txg = 0x28614
              history_time = 0x5d9cb057
              time = 0x5d9cb057 0x10586712
              eid = 0x9d
      
      Oct  8 2019 10:50:47.520232812 sysevent.fs.zfs.history_event
              version = 0x0
              class = "sysevent.fs.zfs.history_event"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              history_hostname = "cv-g10-oss0.dev.net"
              history_internal_str = "errors=0"
              history_internal_name = "scan done"
              history_txg = 0x28615
              history_time = 0x5d9cb057
              time = 0x5d9cb057 0x1f021f6c
              eid = 0x9e
      
      Oct  8 2019 10:50:47.520232812 sysevent.fs.zfs.resilver_finish
              version = 0x0
              class = "sysevent.fs.zfs.resilver_finish"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              time = 0x5d9cb057 0x1f021f6c
              eid = 0x9f
      
      Oct  8 2019 10:50:47.582233770 sysevent.fs.zfs.vdev_remove
              version = 0x0
              class = "sysevent.fs.zfs.vdev_remove"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              vdev_guid = 0xdb315a1b87016f8c
              vdev_state = "OFFLINE" (0x2)
              vdev_path = "/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
              vdev_devid = "dm-uuid-mpath-35000c500a63e36f7"
              time = 0x5d9cb057 0x22b42eaa
              eid = 0xa1
      
      Oct  8 2019 10:50:47.587233848 sysevent.fs.zfs.history_event
              version = 0x0
              class = "sysevent.fs.zfs.history_event"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              history_hostname = "cv-g10-oss0.dev.net"
              history_internal_str = "replace vdev=/dev/mapper/d8000_sep500C0FF03C1AC73E_bay050-0 for vdev=/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
              history_internal_name = "vdev attach"
              history_txg = 0x28616
              history_time = 0x5d9cb057
              time = 0x5d9cb057 0x23007a38
              eid = 0xa2
      
      Oct  8 2019 10:50:52.581311041 sysevent.fs.zfs.history_event
              version = 0x0
              class = "sysevent.fs.zfs.history_event"
              pool = "ost04"
              pool_guid = 0x8159dca79b3945a4
              pool_state = 0x0
              pool_context = 0x0
              history_hostname = "cv-g10-oss0.dev.net"
              history_internal_str = "vdev=/dev/mapper/d8000_sep500C0FF03C1AC73E_bay101-0"
              history_internal_name = "detach"
              history_txg = 0x28617
              history_time = 0x5d9cb05c
              time = 0x5d9cb05c 0x22a61a41
              eid = 0xa4
      

      Since ZFS did not generate a state change event, ZED does not run statechange-lustre.sh, and the degraded property of the target is not reset:

      # lctl get_param obdfilter.lustrefs-OST0004.degraded
      obdfilter.lustrefs-OST0004.degraded=1
      

      despite the pool being ONLINE:

      # zpool status ost04
         pool: ost04
       state: ONLINE
        scan: resilvered 2.62M in 0h0m with 0 errors on Tue Oct  8 10:50:47 2019
      config:

              NAME                                    STATE     READ WRITE CKSUM
              ost04                                   ONLINE       0     0     0
                raidz2-0                              ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay041-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay042-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay043-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay044-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay045-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay046-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay047-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay048-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay049-0  ONLINE       0     0     0
                  d8000_sep500C0FF03C1AC73E_bay050-0  ONLINE       0     0     0
      
      errors: No known data errors
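
      On affected versions the stale flag then has to be cleared by hand once the pool is healthy again (or by some additional ZEDLET hooked to the resilver_finish event). A minimal sketch of the manual reset, using the pool and target names from this example:

      # Once the pool reports healthy again, clear the stale degraded flag by hand.
      if [ "$(zpool list -H -o health ost04)" = "ONLINE" ]; then
              lctl set_param obdfilter.lustrefs-OST0004.degraded=0
      fi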

      Steps to reproduce

      1. Create a Lustre filesystem using ZFS
      2. Select a pool to test
      3. Select a drive to fail from the pool
      4. Select an unused drive to use as the replacement drive
      5. Verify the target (MGT, MDT, or OST) corresponding to the pool does not have the degraded property set
      6. Wipe the replacement drive so it looks like an unused drive
      7. Fail the selected drive
      8. Wait for the drive to report faulted
      9. Replace the failed drive
      10. Wait for resilvering to finish
      11. Check the degraded property for the target

      I have attached the script test-degraded-drive, which implements the above steps. While it makes a few assumptions (datasets are named the same as the pool they are in, e.g. ost04/ost04, and drives are given by their device-mapper names), it should be easy to repurpose for use on another system.
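
      The attached script is not reproduced here; the following is a minimal sketch of the same steps, with the pool, drive, and target names taken from the example above, and with the individual commands (wipefs for wiping, zpool offline -f for faulting, simple polling loops) being assumptions rather than what test-degraded-drive actually does:

      #!/bin/sh
      # Sketch of reproduction steps 5-11; steps 1-4 (creating the filesystem and
      # picking a pool, a drive to fail, and a spare) are done by hand. Names below
      # are from the example in this ticket.
      POOL=ost04
      TARGET=lustrefs-OST0004
      FAIL_DRIVE=d8000_sep500C0FF03C1AC73E_bay101-0
      SPARE_DRIVE=d8000_sep500C0FF03C1AC73E_bay050-0

      # Step 5: verify the target is not already marked degraded.
      echo "degraded before:"; lctl get_param obdfilter.${TARGET}.degraded

      # Step 6: wipe the replacement drive so it looks unused.
      wipefs -a /dev/mapper/${SPARE_DRIVE}

      # Steps 7-8: fail the selected drive and wait for it to report FAULTED.
      zpool offline -f ${POOL} ${FAIL_DRIVE}
      until zpool status ${POOL} | grep -q FAULTED; do sleep 1; done

      # Steps 9-10: replace the failed drive and wait for resilvering to finish.
      zpool replace ${POOL} ${FAIL_DRIVE} ${SPARE_DRIVE}
      while zpool status ${POOL} | grep -q "resilver in progress"; do sleep 5; done

      # Step 11: on affected versions the property is still 1 at this point.
      echo "degraded after:"; lctl get_param obdfilter.${TARGET}.degraded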


          People

            Assignee: Tony Hutter (hutter)
            Reporter: Christopher Voltz (voltz) (Inactive)