  Lustre
  LU-8500

lustre-2.7.2 hits MGS is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 5. Is it stuck?

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Severity: 3

    Description

      Hit what looks like LU-4772 (MGS is waiting for obd_unlinked_exports) while running 2.7.2. So this is a duplicate case, but opening it for tracking to get the patch backported to the FE branch for NASA.

      Aug 15 09:19:12 nbp1-mds kernel: INFO: task umount:37486 blocked for more than 120 seconds.
      Aug 15 09:19:12 nbp1-mds kernel: Tainted: G           -- ------------  T 2.6.32-573.26.1.el6.20160517.x86_64.lustre272 #1
      Aug 15 09:19:13 nbp1-mds kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Aug 15 09:19:13 nbp1-mds kernel: umount        D 0000000000000003     0 37486      1 0x00000080
      Aug 15 09:19:13 nbp1-mds kernel: ffff881f07cabab8 0000000000000086 0000000000000000 ffff881f07caba58
      Aug 15 09:19:13 nbp1-mds kernel: ffff881f07caba18 ffff883e20b05400 001305aa0b19e806 0000000000000000
      Aug 15 09:19:13 nbp1-mds kernel: ffff881f0b3ece4c 000000023ecf1493 ffff881e28d7d068 ffff881f07cabfd8
      Aug 15 09:19:13 nbp1-mds kernel: Call Trace:
      Aug 15 09:19:13 nbp1-mds kernel: [<ffffffff81574ce2>] schedule_timeout+0x192/0x2e0
      Aug 15 09:19:13 nbp1-mds kernel: [<ffffffff810892c0>] ? process_timeout+0x0/0x10
      Aug 15 09:19:13 nbp1-mds kernel: [<ffffffffa05b2296>] obd_exports_barrier+0xb6/0x190 [obdclass]
      Aug 15 09:19:13 nbp1-mds kernel: [<ffffffffa0e7ea14>] mgs_device_fini+0x134/0x5b0 [mgs]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05d7122>] class_cleanup+0x562/0xd20 [obdclass]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05b4216>] ? class_name2dev+0x56/0xe0 [obdclass]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05d8e4a>] class_process_config+0x156a/0x1ad0 [obdclass]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05d1205>] ? lustre_cfg_new+0x435/0x630 [obdclass]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05d9525>] class_manual_cleanup+0x175/0x4c0 [obdclass]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05b4216>] ? class_name2dev+0x56/0xe0 [obdclass]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa061827f>] server_put_super+0x9df/0x1060 [obdclass]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff811ad166>] ? invalidate_inodes+0xf6/0x190
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff8119127b>] generic_shutdown_super+0x5b/0xe0
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff81191366>] kill_anon_super+0x16/0x60
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffffa05db0b6>] lustre_kill_super+0x36/0x60 [obdclass]
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff81191b07>] deactivate_super+0x57/0x80
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff811b1acf>] mntput_no_expire+0xbf/0x110
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff811b261b>] sys_umount+0x7b/0x3a0
      Aug 15 09:19:14 nbp1-mds kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
      Aug 15 09:20:03 nbp1-mds kernel: Lustre: MGS is waiting for obd_unlinked_exports more than 512 seconds. The obd refcount = 5. Is it stuck?
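
      The obd_exports_barrier frame at the top of the trace, together with the doubling 256 s / 512 s console warnings, suggests a back-off loop that keeps sleeping while unlinked exports still hold references. A minimal userspace sketch of that pattern (hypothetical illustration; not the actual Lustre obd_exports_barrier source, whose names and thresholds may differ):

      ```c
      #include <stdio.h>

      /* Returns the number of "Is it stuck?" warnings that would be
       * emitted before the simulation gives up at max_wait seconds. */
      int barrier_warnings(int refcount, int max_wait)
      {
          int waited = 2;      /* assumed initial sleep interval, seconds */
          int warnings = 0;

          while (refcount > 0 && waited <= max_wait) {
              /* in the kernel this would be schedule_timeout() */
              if (waited >= 256) {   /* assumed warning threshold */
                  printf("MGS is waiting for obd_unlinked_exports more than "
                         "%d seconds. The obd refcount = %d. Is it stuck?\n",
                         waited, refcount);
                  warnings++;
              }
              waited *= 2;     /* exponential back-off: 2, 4, 8, ... */
          }
          return warnings;
      }

      int main(void)
      {
          /* With the refcount pinned at 5, this prints the 256 s and
           * 512 s warnings matching the log lines above. */
          barrier_warnings(5, 512);
          return 0;
      }
      ```

      The point of the sketch is that the warnings are harmless in themselves; they only repeat, at doubling intervals, for as long as something still holds export references.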
      

      Attachments

      Issue Links

      Activity

            hongchao.zhang Hongchao Zhang added a comment -

            By investigating the code, I found some cases that leak the export reference, which could be related to this issue.
            http://review.whamcloud.com/#/c/22021/
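
            The kind of export-reference leak described above can be illustrated with a hypothetical get/put pattern (names and structure invented for illustration; the real code uses struct obd_export with its own get/put helpers):

            ```c
            #include <assert.h>

            /* Hypothetical export with a bare reference count. */
            struct export {
                int refcount;
            };

            static void export_get(struct export *e) { e->refcount++; }
            static void export_put(struct export *e) { e->refcount--; }

            /* Buggy pattern: the early-error return leaks the reference,
             * so the export can never drain from the unlinked list and
             * obd_exports_barrier() waits forever. */
            static int handle_request_leaky(struct export *e, int fail)
            {
                export_get(e);
                if (fail)
                    return -1;      /* BUG: missing export_put(e) */
                export_put(e);
                return 0;
            }

            /* Fixed pattern: every exit path drops what it took. */
            static int handle_request_fixed(struct export *e, int fail)
            {
                int rc = 0;

                export_get(e);
                if (fail)
                    rc = -1;
                export_put(e);      /* single exit point drops the ref */
                return rc;
            }
            ```

            After the leaky path runs once on an error, the refcount stays pinned above zero, which is consistent with the stuck "obd refcount = 5" in the logs.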

            hongchao.zhang Hongchao Zhang added a comment -

            I'm still investigating the first stack trace and have not yet found the cause of the problem.
            I'm working on a debug patch to help locate it; this is a little complicated and still needs more time.
            pjones Peter Jones added a comment -

            Hongchao

            What is your assessment of the first stack trace?

            Peter


            ndauchy Nathan Dauchy (Inactive) added a comment -

            Peter,

            According to Jay, the lustre-2.7.2-1.1nasS_mofed32v1 build (what was running on "nbp1-mds" above) does indeed have the LU-4772 fix.

            However, the lustre-2.7.1-5.1nasS_mofed31v5 build (what was on "nbp8-mds1" above) did NOT have the fix. I presume neither did the "lustre-2.5.3-6nasS_mofed31v5" version running on "nbp6-mds" above. Therefore, the issues for the latter two (regarding MDT) can be ignored for now as we assume the LU-4772 fix will prevent those in the future.

            Only the first stack trace on nbp1-mds is of interest then.

            Thanks!

            ndauchy Nathan Dauchy (Inactive) added a comment -

            I'm afraid I don't have additional debugging info... I was in a rush to get things shut down so we could complete server and Lustre version upgrades during a maintenance window, so I just power cycled the servers that hit this issue.

            hongchao.zhang Hongchao Zhang added a comment -

            Hi Nathan,

            Is the debug log from unmounting the MDT available? It will dump the exports in this case and help to find the cause of the issue.
            Thanks!
            pjones Peter Jones added a comment -

            Nathan

            The LU-4772 fix is already in the branch you are running, so it seems like this must be something new.

            Hongchao

            What do you suggest here?

            Peter

            ndauchy Nathan Dauchy (Inactive) added a comment - - edited

            Note that I saw similar messages (one with a stack trace, one without) on other servers. The difference is that these are for the MDT, not the MGS:

            Aug 15 09:22:09 nbp6-mds kernel: Lustre: nbp6-MDT0000 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 9. Is it stuck?
            
            Aug 15 10:20:15 nbp8-mds1 kernel: INFO: task umount:55640 blocked for more than 120 seconds.
            Aug 15 10:20:15 nbp8-mds1 kernel: Not tainted 2.6.32-504.30.3.el6.20151008.x86_64.lustre271 #1
            Aug 15 10:20:16 nbp8-mds1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            Aug 15 10:20:16 nbp8-mds1 kernel: umount        D 0000000000000002     0 55640  55625 0x00000080
            Aug 15 10:20:16 nbp8-mds1 kernel: ffff880935d2da78 0000000000000082 0000000000000000 ffff880935d2da18
            Aug 15 10:20:16 nbp8-mds1 kernel: ffff880935d2d9d8 ffff883ddf90bc00 001a700794410389 0000000000000000
            Aug 15 10:20:16 nbp8-mds1 kernel: ffff883f224ace8c 00000002bb51a25e ffff880e76565ad0 ffff880935d2dfd8
            Aug 15 10:20:16 nbp8-mds1 kernel: Call Trace:
            Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffff81562c32>] schedule_timeout+0x192/0x2e0
            Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffff81086900>] ? process_timeout+0x0/0x10
            Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05a8296>] obd_exports_barrier+0xb6/0x190 [obdclass]
            Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa0f9f2c2>] mdt_device_fini+0x642/0x1010 [mdt]
            Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05ad406>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
            Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05cd122>] class_cleanup+0x562/0xd20 [obdclass]
            Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05aa216>] ? class_name2dev+0x56/0xe0 [obdclass]
            Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05cee4a>] class_process_config+0x156a/0x1ad0 [obdclass]
            Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05c7205>] ? lustre_cfg_new+0x435/0x630 [obdclass]
            Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05cf525>] class_manual_cleanup+0x175/0x4c0 [obdclass]
            Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa05aa216>] ? class_name2dev+0x56/0xe0 [obdclass]
            Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffffa060e597>] server_put_super+0xcf7/0x1060 [obdclass]
            Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffff811a67a6>] ? invalidate_inodes+0xf6/0x190
            Aug 15 10:20:16 nbp8-mds1 kernel: [<ffffffff8118ab8b>] generic_shutdown_super+0x5b/0xe0
            Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffff8118ac76>] kill_anon_super+0x16/0x60
            Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffffa05d10b6>] lustre_kill_super+0x36/0x60 [obdclass]
            Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffff8118b417>] deactivate_super+0x57/0x80
            Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffff811ab11f>] mntput_no_expire+0xbf/0x110
            Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffff811abc6b>] sys_umount+0x7b/0x3a0
            Aug 15 10:20:17 nbp8-mds1 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
            Aug 15 10:21:46 nbp8-mds1 kernel: Lustre: nbp8-MDT0000: Not available for connect from 10.151.55.23@o2ib (stopping)
            Aug 15 10:21:46 nbp8-mds1 kernel: Lustre: Skipped 47513 previous similar messages
            Aug 15 10:22:06 nbp8-mds1 kernel: Lustre: nbp8-MDT0000 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 7. Is it stuck?
            

            Please confirm whether the fix in LU-4772 addresses both the MGS and MDT.


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: ndauchy Nathan Dauchy (Inactive)
              Votes: 0
              Watchers: 8
