[LU-8500] lustre-2.7.2 hits MGS is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 5. Is it stuck?

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Severity: 3

    Description

      We hit what looks like LU-4772 (MGS is waiting for obd_unlinked_exports) while running 2.7.2. This may be a duplicate, but I am opening it to track getting the patch backported to the FE branch for NASA.

      Aug 15 09:19:12 nbp1-mds kernel: INFO: task umount:37486 blocked for more than 120 seconds.
      Aug 15 09:19:12 nbp1-mds kernel: Tainted: G           -- ------------  T 2.6.32-573.26.1.el6.20160517.x86_64.lustre272 #1
      Aug 15 09:19:13 nbp1-mds kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Aug 15 09:19:13 nbp1-mds kernel: umount        D 0000000000000003     0 37486      1 0x00000080
      Aug 15 09:19:13 nbp1-mds kernel: ffff881f07cabab8 0000000000000086 0000000000000000 ffff881f07caba58
      Aug 15 09:19:13 nbp1-mds kernel: ffff881f07caba18 ffff883e20b05400 001305aa0b19e806 0000000000000000
      Aug 15 09:19:13 nbp1-mds kernel: ffff881f0b3ece4c 000000023ecf1493 ffff881e28d7d068 ffff881f07cabfd8
      Aug 15 09:19:13 nbp1-mds kernel: Call Trace:
      Aug 15 09:19:13 nbp1-mds kernel: [<ffffffff81574ce2>] schedule_timeout+0x192/0x2e0
      Aug 15 09:19:13 nbp1-mds kernel: [<ffffffff810892c0>] ? process_timeout+0x0/0x10
      Aug 15 09:19:13 nbp1-mds kernel: [<ffffffffa05b2296>] obd_exports_barrier+0xb6/0x190 [obdclass]
      Aug 15 09:19:13 nbp1-mds kernel: [<ffffffffa0e7ea14>] mgs_device_fini+0x134/0x5b0 [mgs]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05d7122>] class_cleanup+0x562/0xd20 [obdclass]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05b4216>] ? class_name2dev+0x56/0xe0 [obdclass]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05d8e4a>] class_process_config+0x156a/0x1ad0 [obdclass]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05d1205>] ? lustre_cfg_new+0x435/0x630 [obdclass]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05d9525>] class_manual_cleanup+0x175/0x4c0 [obdclass]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05b4216>] ? class_name2dev+0x56/0xe0 [obdclass]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa061827f>] server_put_super+0x9df/0x1060 [obdclass]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff811ad166>] ? invalidate_inodes+0xf6/0x190
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff8119127b>] generic_shutdown_super+0x5b/0xe0
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff81191366>] kill_anon_super+0x16/0x60
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05db0b6>] lustre_kill_super+0x36/0x60 [obdclass]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff81191b07>] deactivate_super+0x57/0x80
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff811b1acf>] mntput_no_expire+0xbf/0x110
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff811b261b>] sys_umount+0x7b/0x3a0
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
      Aug 15 09:20:03 nbp1-mds kernel: Lustre: MGS is waiting for obd_unlinked_exports more than 512 seconds. The obd refcount = 5. Is it stuck?
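
When triaging console logs like the one above, it can help to pull the wait time and refcount out of each warning. The following is a minimal sketch, assuming the message format is exactly as quoted in this ticket; the pattern and the function name `parse_barrier_warnings` are hypothetical and not part of any Lustre tool.

```python
import re

# Pattern modeled on the console message quoted in this ticket. This is an
# assumption about the format, not something taken from the Lustre source.
BARRIER_RE = re.compile(
    r"Lustre: (?P<obd>\S+) is waiting for obd_unlinked_exports "
    r"more than (?P<seconds>\d+) seconds\. "
    r"The obd refcount = (?P<refcount>\d+)\."
)


def parse_barrier_warnings(lines):
    """Return (obd name, seconds waited, refcount) for each matching line."""
    hits = []
    for line in lines:
        m = BARRIER_RE.search(line)
        if m:
            hits.append(
                (m.group("obd"), int(m.group("seconds")), int(m.group("refcount")))
            )
    return hits


if __name__ == "__main__":
    sample = [
        "Aug 15 09:20:03 nbp1-mds kernel: Lustre: MGS is waiting for "
        "obd_unlinked_exports more than 512 seconds. The obd refcount = 5. "
        "Is it stuck?",
    ]
    for obd, seconds, refcount in parse_barrier_warnings(sample):
        print(f"{obd}: waited > {seconds}s, refcount={refcount}")
```

A refcount that stays constant while the reported wait keeps growing would be consistent with a leaked reference rather than a slow cleanup.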
      

          Activity


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22031/
            Subject: LU-8500 ldlm: fix export reference problem
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 108339f1543fb006f4ddd16830e7266df0b46723

            simmonsja James A Simmons added a comment - edited

            We just hit this on 2.8. Can't unmount our OSTs.


            gerrit Gerrit Updater added a comment -

            Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/22031
            Subject: LU-8500 ldlm: fix export reference problem
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a8fc617b0edbd195d22da21f30370fa4b33e74c1

            pjones Peter Jones added a comment -

            Hongchao

            I recommend fixing this on master first, then backporting to 2.7 FE for NASA to use

            Regards

            Peter


            hongchao.zhang Hongchao Zhang added a comment -

            While reading through the code, I found some cases that leak the export reference, which could be related to this issue.
            http://review.whamcloud.com/#/c/22021/
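
To illustrate why a leaked export reference would leave an unmount waiting, here is a toy model. It is a sketch only: the `Export` class and the handler functions are hypothetical stand-ins, not Lustre code, and they do not reproduce the actual patch. The assumption is simply that cleanup can finish only once every reference taken on an export has been dropped.

```python
class Export:
    """Toy stand-in for a reference-counted export (not Lustre code)."""

    def __init__(self):
        self.refcount = 1  # reference held by the owning device

    def get(self):
        self.refcount += 1
        return self

    def put(self):
        assert self.refcount > 0
        self.refcount -= 1


def handle_request_leaky(exp, fail):
    """Hypothetical handler with a leak on its error path."""
    exp.get()
    if fail:
        return False  # BUG: early return skips put(), leaking one reference
    exp.put()
    return True


def handle_request_fixed(exp, fail):
    """Same handler, but the reference is dropped on every path."""
    exp.get()
    try:
        return not fail
    finally:
        exp.put()


def can_finish_cleanup(exp):
    """Cleanup drops the owner's reference and can only finish at zero."""
    exp.put()
    return exp.refcount == 0


if __name__ == "__main__":
    leaky, fixed = Export(), Export()
    handle_request_leaky(leaky, fail=True)
    handle_request_fixed(fixed, fail=True)
    print("leaky path can finish cleanup:", can_finish_cleanup(leaky))
    print("fixed path can finish cleanup:", can_finish_cleanup(fixed))
```

In this model a single failed request on the leaky path is enough to keep the count above zero indefinitely, which would match a wait that never completes.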

            hongchao.zhang Hongchao Zhang added a comment -

            I'm still investigating the first stack trace and haven't found the cause yet.
            I'm working on a debug patch to help locate the problem. It's a little complicated and will need more time.

            pjones Peter Jones added a comment -

            Hongchao

            What is your assessment of the first stack trace?

            Peter


            ndauchy Nathan Dauchy (Inactive) added a comment -

            Peter,

            According to Jay, the lustre-2.7.2-1.1nasS_mofed32v1 build (what was running on "nbp1-mds" above) does indeed have the LU-4772 fix.

            However, the lustre-2.7.1-5.1nasS_mofed31v5 build (what was on "nbp8-mds1" above) did NOT have the fix. I presume neither did the "lustre-2.5.3-6nasS_mofed31v5" version running on "nbp6-mds" above. Therefore, the issues for the latter two (regarding MDT) can be ignored for now as we assume the LU-4772 fix will prevent those in the future.

            Only the first stack trace on nbp1-mds is of interest then.

            Thanks!

            ndauchy Nathan Dauchy (Inactive) added a comment -

            I'm afraid I don't have additional debugging info... was in a rush to get things shut down so we could complete server and Lustre version upgrades during a maintenance window, so I just power cycled the servers that hit this issue.

            hongchao.zhang Hongchao Zhang added a comment -

            Hi Nathan,

            Is the debug log from the MDT unmount available? It dumps the exports in this case, which would help find the cause of the issue.
            Thanks!

            pjones Peter Jones added a comment -

            Nathan

            The LU-4772 fix is already in the branch you are running with so it seems like this must be something new

            Hongchao

            What do you suggest here?

            Peter


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: ndauchy Nathan Dauchy (Inactive)
              Votes: 0
              Watchers: 8
