Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.1.0
    • None
    • lustre 2.1.0-21chaos (github.com/chaos/lustre)
    • 2
    • 6465

    Description

      On one of our filesystems that was upgraded to lustre 2.1 last week, the MDS has one thread that appears to be stick spinning:

      13711 root      20   0     0    0    0 R 99.9  0.0   6289:28 mdt_253
      

      Using sysreq (since crash can't trace a live process):

      2012-02-09 16:18:39 Call Trace:
      2012-02-09 16:18:39  [<ffffffffa06de440>] ? ldlm_cancel_locks_for_export_cb+0x0/0xd0 [ptlrpc]
      2012-02-09 16:18:39  [<ffffffffa06de440>] ? ldlm_cancel_locks_for_export_cb+0x0/0xd0 [ptlrpc]
      2012-02-09 16:18:39  [<ffffffffa0498f46>] cfs_hash_for_each_empty+0xc6/0x170 [libcfs]
      2012-02-09 16:18:39  [<ffffffffa06db15f>] ldlm_cancel_locks_for_export+0x1f/0x30 [ptlrpc]
      2012-02-09 16:18:39  [<ffffffffa06e7c94>] server_disconnect_export+0x64/0x170 [ptlrpc]
      2012-02-09 16:18:39  [<ffffffffa0c50dcb>] mdt_obd_disconnect+0x4b/0xd50 [mdt]
      2012-02-09 16:18:39  [<ffffffffa071681e>] ? lustre_pack_reply_flags+0xae/0x1b0 [ptlrpc]
      2012-02-09 16:18:39  [<ffffffffa0716931>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
      2012-02-09 16:18:39  [<ffffffffa06e4d4a>] target_handle_disconnect+0x16a/0x3a0 [ptlrpc]
      2012-02-09 16:18:39  [<ffffffffa0c4a805>] mdt_disconnect+0x35/0x100 [mdt]
      2012-02-09 16:18:39  [<ffffffffa0c4674d>] mdt_handle_common+0x73d/0x12b0 [mdt]
      2012-02-09 16:18:39  [<ffffffffa0717334>] ? lustre_msg_get_transno+0x54/0x90 [ptlrpc]
      2012-02-09 16:18:39  [<ffffffffa0c47395>] mdt_regular_handle+0x15/0x20 [mdt]
      2012-02-09 16:18:39  [<ffffffffa0723181>] ptlrpc_main+0xcd1/0x1690 [ptlrpc]
      2012-02-09 16:18:39  [<ffffffffa07224b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
      2012-02-09 16:18:39  [<ffffffff8100c14a>] child_rip+0xa/0x20
      2012-02-09 16:18:39  [<ffffffffa07224b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
      2012-02-09 16:18:39  [<ffffffffa07224b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
      2012-02-09 16:18:39  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      Attachments

        Issue Links

          Activity

            [LU-1087] mdt thread spinning out of control
            pjones Peter Jones added a comment -

            This is believed to be a duplicate of LU-1428

            pjones Peter Jones added a comment - This is believed to be a duplicate of LU-1428

            I've posted a patch for review: http://review.whamcloud.com/#change,3028

            liang Liang Zhen (Inactive) added a comment - I've posted a patch for review: http://review.whamcloud.com/#change,3028

            If it happens again I will do that. Unfortunately D_INFO caused LU-1088, so I was not able to get any further debugging information.

            morrone Christopher Morrone (Inactive) added a comment - If it happens again I will do that. Unfortunately D_INFO caused LU-1088 , so I was not able to get any further debugging information.
            laisiyao Lai Siyao added a comment -

            Chris, your log shows you've enabled D_INFO log, could you enable D_DLMTRACE and check which lock is in the dead loop (ldlm_cancel_lock_for_export_cb() should print lock information)?

            laisiyao Lai Siyao added a comment - Chris, your log shows you've enabled D_INFO log, could you enable D_DLMTRACE and check which lock is in the dead loop (ldlm_cancel_lock_for_export_cb() should print lock information)?

            We saw all manner of timeouts, and reconnects, slow service warnings and just generally a lot of log traffic. There are certainly lock callback timer expirations and other ldlm stuff happening in there.

            morrone Christopher Morrone (Inactive) added a comment - We saw all manner of timeouts, and reconnects, slow service warnings and just generally a lot of log traffic. There are certainly lock callback timer expirations and other ldlm stuff happening in there.
            laisiyao Lai Siyao added a comment -

            Hi Chris, did you see any error messages about ldlm lock before this happened?

            laisiyao Lai Siyao added a comment - Hi Chris, did you see any error messages about ldlm lock before this happened?
            pjones Peter Jones added a comment -

            Lai

            Could you please look into this one?

            Thanks

            Peter

            pjones Peter Jones added a comment - Lai Could you please look into this one? Thanks Peter

            Excerpt from the log for the spinning thread, showing it repeatedly trying to free the same hash:

            00010000:00000040:15.0:1328835885.126894:0:13711:0:(ldlm_resource.c:1102:ldlm_resource_getref()) getref res: ffff88082dc5fa80 count: 2
            00010000:00000040:15.0:1328835885.126895:0:13711:0:(ldlm_resource.c:1141:ldlm_resource_putref()) putref res: ffff88082dc5fa80 count: 1
            00000001:00000040:15.0:1328835885.126925:0:13711:0:(hash.c:1687:cfs_hash_for_each_empty()) Try to empty hash: 60009e44-9f98-9, loop: 178108
            00010000:00000040:15.0:1328835885.126957:0:13711:0:(ldlm_resource.c:1102:ldlm_resource_getref()) getref res: ffff88082dc5fa80 count: 2
            00010000:00000040:15.0:1328835885.126958:0:13711:0:(ldlm_resource.c:1141:ldlm_resource_putref()) putref res: ffff88082dc5fa80 count: 1
            00000001:00000040:15.0:1328835885.126988:0:13711:0:(hash.c:1687:cfs_hash_for_each_empty()) Try to empty hash: 60009e44-9f98-9, loop: 178109
            00010000:00000040:15.0:1328835885.127020:0:13711:0:(ldlm_resource.c:1102:ldlm_resource_getref()) getref res: ffff88082dc5fa80 count: 2
            00010000:00000040:15.0:1328835885.127022:0:13711:0:(ldlm_resource.c:1141:ldlm_resource_putref()) putref res: ffff88082dc5fa80 count: 1
            00000001:00000040:15.0:1328835885.127051:0:13711:0:(hash.c:1687:cfs_hash_for_each_empty()) Try to empty hash: 60009e44-9f98-9, loop: 178110
            
            morrone Christopher Morrone (Inactive) added a comment - - edited Excerpt from the log for the spinning thread, showing it repeatedly trying to free the same hash: 00010000:00000040:15.0:1328835885.126894:0:13711:0:(ldlm_resource.c:1102:ldlm_resource_getref()) getref res: ffff88082dc5fa80 count: 2 00010000:00000040:15.0:1328835885.126895:0:13711:0:(ldlm_resource.c:1141:ldlm_resource_putref()) putref res: ffff88082dc5fa80 count: 1 00000001:00000040:15.0:1328835885.126925:0:13711:0:(hash.c:1687:cfs_hash_for_each_empty()) Try to empty hash: 60009e44-9f98-9, loop: 178108 00010000:00000040:15.0:1328835885.126957:0:13711:0:(ldlm_resource.c:1102:ldlm_resource_getref()) getref res: ffff88082dc5fa80 count: 2 00010000:00000040:15.0:1328835885.126958:0:13711:0:(ldlm_resource.c:1141:ldlm_resource_putref()) putref res: ffff88082dc5fa80 count: 1 00000001:00000040:15.0:1328835885.126988:0:13711:0:(hash.c:1687:cfs_hash_for_each_empty()) Try to empty hash: 60009e44-9f98-9, loop: 178109 00010000:00000040:15.0:1328835885.127020:0:13711:0:(ldlm_resource.c:1102:ldlm_resource_getref()) getref res: ffff88082dc5fa80 count: 2 00010000:00000040:15.0:1328835885.127022:0:13711:0:(ldlm_resource.c:1141:ldlm_resource_putref()) putref res: ffff88082dc5fa80 count: 1 00000001:00000040:15.0:1328835885.127051:0:13711:0:(hash.c:1687:cfs_hash_for_each_empty()) Try to empty hash: 60009e44-9f98-9, loop: 178110

            People

              laisiyao Lai Siyao
              morrone Christopher Morrone (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: