LU-14048: Unable to stop all OSTs due to obd_unlinked_exports



    Description

      Since our Oak upgrade from Lustre 2.10.8 to 2.12.5, it has not always been possible to stop OSTs (when doing a manual failover, for example).

      When stopping OSTs, we can see these messages on the servers:

      Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011a is waiting for obd_unlinked_exports more than 512 seconds. The obd refcount = 6. Is it stuck?
      Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011e: UNLINKED ffff887ca4215400 f2e15550-2e62-4 10.51.1.18@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
      Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011e: UNLINKED ffff88de7f82bc00 2be2f511-7fb2-4 10.51.1.43@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
      Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011e: UNLINKED ffff88de717a7c00 f04b9922-9d5a-4 10.51.1.63@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
      Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST011e: UNLINKED ffff88deb8acd000 79a83f3f-8d8d-4 10.51.1.42@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
      Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST0112: UNLINKED ffff8833ab100c00 2be2f511-7fb2-4 10.51.1.43@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
      Oct 20 06:30:12 oak-io6-s2 kernel: Lustre: oak-OST0112: UNLINKED ffff88de6d629400 5f8df1bb-2246-4 10.51.1.39@o2ib3 1 (0 0 0) 1 0 1 0:           (null)  0 stale:0
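
      While an OST unmount hangs like this, the export state can at least be checked from the OSS with lctl; a minimal sketch (the OST name is taken from the messages above, everything else is generic):

      # On the OSS while the umount hangs: how many exports the OST still holds
      lctl get_param obdfilter.oak-OST011e.num_exports
      # List the clients still connected to that OST, by NID and UUID
      lctl get_param obdfilter.oak-OST011e.exports.*.uuid

      Note that these presumably only show exports that are still connected; the UNLINKED ones that keep the obd refcount up are only visible in the console messages above.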
      

      This was never an issue with 2.10. We tried to manually fail back OSTs this morning and waited for about 10 minutes before rebooting the servers, because the wait was too long and started to impact production. It would be nice to have a shorter timeout on this. Restart and recovery of the OSTs went well (in the worst case, perhaps only one client was evicted).
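
      One less drastic thing that might be worth trying before rebooting the whole OSS is to manually evict the clients whose NIDs show up in the UNLINKED lines, in case that drops the remaining references; this is just a sketch of the idea (OST name and NID copied from the messages above), and it is unclear whether it actually helps once the exports are already unlinked:

      # Evict a client by NID on the stuck OST (the "nid:" prefix selects eviction by NID rather than by client UUID)
      lctl set_param obdfilter.oak-OST011e.evict_client=nid:10.51.1.43@o2ib3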

      Attaching some Lustre debug logs from 3 different OSS nodes, oak-io[4-6]-s2:

      Attaching kernel logs from the same 3 servers:

       

      This could be similar to the issue reported in LU-12319, which does not seem to be fixed at this time.

      Thanks for the help!
      Stephane

      Attachments


          People

            Assignee: Yang Sheng
            Reporter: Stephane Thiell
            Votes: 0
            Watchers: 6
