
[LU-14048] Unable to stop all OSTs due to obd_unlinked_exports

Details


    Description

      Since our Oak upgrade from Lustre 2.10.8 to 2.12.5, it is no longer always possible to stop OSTs (for example, when doing a manual failover).

      When stopping OSTs, we can see these messages on the servers:

      Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011a is waiting for obd_unlinked_exports more than 512 seconds. The obd refcount = 6. Is it stuck?
      Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011e: UNLINKED ffff887ca4215400 f2e15550-2e62-4 10.51.1.18@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
      Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011e: UNLINKED ffff88de7f82bc00 2be2f511-7fb2-4 10.51.1.43@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
      Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011e: UNLINKED ffff88de717a7c00 f04b9922-9d5a-4 10.51.1.63@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
      Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011e: UNLINKED ffff88deb8acd000 79a83f3f-8d8d-4 10.51.1.42@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
      Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST0112: UNLINKED ffff8833ab100c00 2be2f511-7fb2-4 10.51.1.43@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
      Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST0112: UNLINKED ffff88de6d629400 5f8df1bb-2246-4 10.51.1.39@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
      

      This was never an issue with 2.10. We tried to manually fail back OSTs this morning and waited about 10 minutes before rebooting the servers, because the delay was too long and had started to impact production. It would be nice to have a shorter timeout here. Restart and recovery of the OSTs went well (at worst, perhaps a single client was evicted).
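
      A quick pre-check of how many clients still hold exports on each OST can help decide whether a stop is likely to hang. The sketch below uses the standard obdfilter parameters and the /proc export layout shown later in this ticket; exact paths may differ on other builds.

      # Per-OST export counts on this OSS (read-only parameter).
      lctl get_param obdfilter.*.num_exports

      # Per-client export entries for one OST; NIDs still listed here after
      # clients have unmounted are candidates for the UNLINKED messages above.
      ls /proc/fs/lustre/obdfilter/oak-OST011a/exports/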

      Attaching some Lustre debug logs from three different OSSes (oak-io[4-6]-s2).

      Attaching kernel logs from the same three servers.

      This could be similar to the issue reported in LU-12319, which does not appear to be fixed at this time.

      Thanks for the help!
      Stephane

      Attachments

        Activity

          [LU-14048] Unable to stop all OSTs due to obd_unlinked_exports

          gerrit Gerrit Updater added a comment -

          Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/43542
          Subject: LU-14048 obd: fix race between connect vs disconnect
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set: 1
          Commit: 7f54a720c64409c1d9d710cfe885b2f0a07970a0
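
          For anyone who wants to try the b2_12 backport ahead of a release, the change can be pulled with the standard Gerrit fetch/cherry-pick flow. The sketch below assumes the patch set 1 listed above (later patch sets change the trailing number) and anonymous HTTP access to the Gerrit project.

          # Fetch change 43542 (patch set 1) from the fs/lustre-release Gerrit
          # project and apply it onto a local b2_12 checkout.
          git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/42/43542/1
          git cherry-pick FETCH_HEAD
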
          pjones Peter Jones added a comment -

          Landed for 2.15


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41687/
          Subject: LU-14048 obd: fix race between connect vs disconnect
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 08f1d2961361f0f6c253b6fbd429ca7b61a3def2

          gerrit Gerrit Updater added a comment -

          Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41687
          Subject: LU-14048 obd: fix race between connect vs disconnect
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 647b9b3444244c8a7f8b1df077b92ccf770483e5

          sthiell Stephane Thiell added a comment -

          Again, we hit the same problem with Lustre 2.12.5 OSTs when stopping them:

          00000020:02000400:6.0:1613582408.575023:0:230973:0:(genops.c:1837:obd_exports_barrier()) oak-OST000d is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 4. Is it stuck?
          00000020:02000400:6.0:1613582408.587840:0:230973:0:(genops.c:1802:print_export_data()) oak-OST000d: UNLINKED ffff9bc55a67c000 aa46df42-0a4c-4 10.50.16.5@o2ib2 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
          
          # cd /proc/fs/lustre/obdfilter/oak-OST000d/exports/10.50.16.5@o2ib2
          [root@oak-io1-s1 10.50.16.5@o2ib2]# for f in *; do echo ===$f===:; cat $f; done
          ===export===:
          ===fmd_count===:
          ===hash===:
          ===ldlm_stats===:
          snapshot_time             1613582626.251044421 secs.nsecs
          ldlm_enqueue              42 samples [reqs]
          ldlm_cancel               22 samples [reqs]
          ===nodemap===:
          ===reply_data===:
          ===stats===:
          snapshot_time             1613582626.253618983 secs.nsecs
          read_bytes                41 samples [bytes] 4096 4096 167936
          statfs                    1 samples [reqs]
          set_info                  5132 samples [reqs]
          ===uuid===:
          
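          For a broader view than a single export directory, the same inspection can be run across every OST on the OSS in one pass; this is just a generalization of the loop above over the /proc layout it shows.

          # Dump all per-export entries for every obdfilter device on this OSS,
          # to spot which client NIDs still hold exports before an unmount.
          for d in /proc/fs/lustre/obdfilter/*/exports/*@*; do
              echo "### $d"
              for f in "$d"/*; do echo "=== $(basename "$f") ==="; cat "$f"; done
          done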

          eaujames Etienne Aujames added a comment -

          Hello,

          We encountered the same type of issue during the migration from Lustre 2.12.0.3_cray to 2.12.0.4_cray:

          2020-12-07 08:47:53 [5791062.978004] Lustre: Failing over fs-OST0017
          2020-12-07 08:48:07 [5791076.983534] Lustre: fs-OST0017 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 4. Is it stuck?
          2020-12-07 08:48:07 [5791076.996600] Lustre: fs-OST0017: UNLINKED ffff97b4ae236c00 1e77bdd5-516d-2b0f-d3b9-12a911ee8f53 155@ptlf23 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
          2020-12-07 08:48:23 [5791093.010738] Lustre: fs-OST0017 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 4. Is it stuck?
          2020-12-07 08:48:23 [5791093.024450] Lustre: fs-OST0017: UNLINKED ffff97b4ae236c00 1e77bdd5-516d-2b0f-d3b9-12a911ee8f53 155@ptlf23 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
          2020-12-07 08:48:55 [5791125.038060] Lustre: fs-OST0017 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 4. Is it stuck?
          2020-12-07 08:48:55 [5791125.051938] Lustre: fs-OST0017: UNLINKED ffff97b4ae236c00 1e77bdd5-516d-2b0f-d3b9-12a911ee8f53 155@ptlf23 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
          2020-12-07 08:49:32 [5791162.000809] LustreError: 9629:0:(qsd_reint.c:56:qsd_reint_completion()) fs-OST0017: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
          2020-12-07 08:49:59 [5791189.064757] Lustre: fs-OST0017 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 4. Is it stuck?
          2020-12-07 08:49:59 [5791189.078677] Lustre: fs-OST0017: UNLINKED ffff97b4ae236c00 1e77bdd5-516d-2b0f-d3b9-12a911ee8f53 155@ptlf23 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
          2020-12-07 08:51:32 [5791282.006030] LustreError: 13737:0:(qsd_reint.c:56:qsd_reint_completion()) fs-OST0017: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
          2020-12-07 08:52:07 [5791317.088242] Lustre: fs-OST0017 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 4. Is it stuck?
          2020-12-07 08:52:07 [5791317.102361] Lustre: fs-OST0017: UNLINKED ffff97b4ae236c00 1e77bdd5-516d-2b0f-d3b9-12a911ee8f53 155@ptlf23 1 (0 0 0) 1 0 1 0:   
          

          The logs above occurred during a filesystem umount with Lustre 2.12.0.3_cray on an OSS.
          11 OSSes froze with the same type of logs.


          sthiell Stephane Thiell added a comment -

          Hi Peter,

          Thanks! It did happen on 3 different servers, but at about the same time. I cannot say at this point if this was just a one-time thing, as the upgrade is very recent. I don't remember having seen this on Fir before (which has been running 2.12 for a while), but Oak also has even more clients.

          pjones Peter Jones added a comment -

          Stephane

          Does this reproduce reliably or just occur occasionally?

          Yang Sheng

          Could you please investigate?

          Thanks

          Peter


          People

            Assignee: ys Yang Sheng
            Reporter: sthiell Stephane Thiell
            Votes: 0
            Watchers: 6
