[LU-500] MDS threads hang ldlm_expired_completion_wait+

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 1.8.6
    • None
    • CentOS 5.3
    • 3
    • 24,450
    • 6583

    Description

      At a key customer site we were, and still are, experiencing MDS thread hangs. They were first seen under 1.8.4; when the MDS dumped the threads, the only way to recover was to reboot the MDS. The site upgraded to 1.8.6, which includes an at_min patch from bug 23352 that was thought might help. However, they are still seeing the thread hangs. They can now usually recover without an MDS reboot, but it remains a serious problem.
      The trace looks like:

      Call Trace:
      [<ffffffff888e8c10>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
      [<ffffffff888ea762>] ldlm_completion_ast+0x4c2/0x880 [ptlrpc]
      [<ffffffff888cf709>] ldlm_lock_enqueue+0x9d9/0xb20 [ptlrpc]
      [<ffffffff8008d07b>] default_wake_function+0x0/0xe
      [<ffffffff888cab6a>] ldlm_lock_addref_internal_nolock+0x3a/0x90 [ptlrpc]
      [<ffffffff888e92cb>] ldlm_cli_enqueue_local+0x46b/0x520 [ptlrpc]
      [<ffffffff88becd7a>] enqueue_ordered_locks+0x26a/0x4d0 [mds]
      [<ffffffff888e6bc0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
      [<ffffffff888ea2a0>] ldlm_completion_ast+0x0/0x880 [ptlrpc]
      [<ffffffff88bed5c1>] mds_get_parent_child_locked+0x5e1/0x8a0 [mds]
      [<ffffffff88c0f394>] mds_open+0xc44/0x35f8 [mds]
      [<ffffffff8899c2b6>] kiblnd_post_tx_locked+0x566/0x730 [ko2iblnd]
      [<ffffffff888e6d5e>] ldlm_blocking_ast+0x19e/0x2a0 [ptlrpc]
      [<ffffffff887dcb38>] upcall_cache_get_entry+0x958/0xa50 [lvfs]
      [<ffffffff888eb9b8>] ldlm_handle_bl_callback+0x1c8/0x230 [ptlrpc]
      [<ffffffff88be7f49>] mds_reint_rec+0x1d9/0x2b0 [mds]
      [<ffffffff88c13c32>] mds_open_unpack+0x312/0x430 [mds]
      [<ffffffff88bdae7a>] mds_reint+0x35a/0x420 [mds]
      [<ffffffff88bd9d8a>] fixup_handle_for_resent_req+0x5a/0x2c0 [mds]
      [<ffffffff88be4bfc>] mds_intent_policy+0x4ac/0xc80 [mds]
      [<ffffffff888d18b6>] ldlm_resource_putref+0x1b6/0x3c0 [ptlrpc]
      [<ffffffff888ceeb6>] ldlm_lock_enqueue+0x186/0xb20 [ptlrpc]
      [<ffffffff888cb7fd>] ldlm_lock_create+0x9bd/0x9f0 [ptlrpc]
      [<ffffffff888f3720>] ldlm_server_blocking_ast+0x0/0x83d [ptlrpc]
      [<ffffffff888f0849>] ldlm_handle_enqueue+0xbf9/0x1210 [ptlrpc]
      [<ffffffff88be3b20>] mds_handle+0x4130/0x4d60 [mds]
      [<ffffffff887ffbe5>] lnet_match_blocked_msg+0x375/0x390 [lnet]
      [<ffffffff88914705>] lustre_msg_get_conn_cnt+0x35/0xf0 [ptlrpc]
      [<ffffffff8006e244>] do_gettimeoffset_tsc+0x19/0x3c
      [<ffffffff8891bc37>] ptlrpc_server_handle_request+0xaa7/0x1150 [ptlrpc]
      [<ffffffff8008ca80>] __activate_task+0x56/0x6d
      [<ffffffff8008c865>] dequeue_task+0x18/0x37
      [<ffffffff80062ff8>] thread_return+0x62/0xfe
      [<ffffffff8003da91>] lock_timer_base+0x1b/0x3c
      [<ffffffff8001cb46>] __mod_timer+0x100/0x10f
      [<ffffffff8891f698>] ptlrpc_main+0x1258/0x1420 [ptlrpc]
      [<ffffffff8008d07b>] default_wake_function+0x0/0xe
      [<ffffffff800b7a9c>] audit_syscall_exit+0x336/0x362
      [<ffffffff8005dfb1>] child_rip+0xa/0x11
      [<ffffffff8891e440>] ptlrpc_main+0x0/0x1420 [ptlrpc]
      [<ffffffff8005dfa7>] child_rip+0x0/0x11

          Activity

            pjones Peter Jones added a comment -

            duplicate of LU-1269

            spitzcor Cory Spitz added a comment -

            James, that sounds right to me. This should now be closed as a dup of LU-1269.


            simmonsja James A Simmons added a comment -

            LU-1269 has those patches ported to Lustre 1.8-wc branch. If those patches are the solution then this ticket can be marked as a duplicate of LU-1269. What do you say Cory?

            spitzcor Cory Spitz added a comment -

            > Which patches from bz 24450?
            The ones that Oracle has landed; namely attachments 33099, 33106, 33137, and 33144.


            simmonsja James A Simmons added a comment -

            Which patches from bz 24450?

            spitzcor Cory Spitz added a comment -

            On 2/Sep/11 I mentioned LU-146. Although that could have been a contributor, the serialization of ptlrpc sets of size PARALLEL_AST_LIMIT causes these threads to trigger the watchdog. I think we need the patches posted to bz 24450 pulled into WC's b1_8 in order to close this ticket.

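            The serialization effect described in the comment above can be illustrated with a rough back-of-the-envelope sketch in Python. This is not Lustre code. It assumes only that ASTs are sent in sets of at most PARALLEL_AST_LIMIT and that each set must finish before the next one is sent. The limit value, the per-set wait, the watchdog threshold, and the function names are all hypothetical placeholders chosen for illustration.

```python
import math

# Illustrative assumption only: not taken from the Lustre source or from this site.
PARALLEL_AST_LIMIT = 200  # assumed maximum number of ASTs per ptlrpc set


def worst_case_wait(num_asts, per_set_wait_s, limit=PARALLEL_AST_LIMIT):
    """Worst-case total wait in seconds when sets are processed one after another."""
    if num_asts <= 0:
        return 0.0
    # Sets are serialized, so the waits add up instead of overlapping.
    num_sets = math.ceil(num_asts / limit)
    return num_sets * per_set_wait_s


def exceeds_watchdog(num_asts, per_set_wait_s, watchdog_s, limit=PARALLEL_AST_LIMIT):
    """True if the serialized wait would outlast a (hypothetical) watchdog threshold."""
    return worst_case_wait(num_asts, per_set_wait_s, limit) > watchdog_s
```

            With these placeholder numbers, 1000 ASTs split into 5 sets; if each set waits 10 seconds, the thread is busy for 50 seconds, even though no single set took longer than 10.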

            simmonsja James A Simmons added a comment -

            Can this bug be closed now?

            People

              green Oleg Drokin
              woods Steven Woods