Lustre / LU-500

MDS threads hang ldlm_expired_completion_wait+

Details

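    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 1.8.6
    • Labels: None
    • Environment: CentOS 5.3
    • Severity: 3
    • Bugzilla ID: 24,450
    • Rank (Obsolete): 6583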

    Description

      At a key customer site we were, and still are, experiencing MDS thread hangs. They were first seen under 1.8.4; once the MDS dumped the threads, the only way to recover was to reboot the MDS. The site upgraded to 1.8.6, which includes an at_min patch from bug 23352 that was thought might help with the issue. They are still seeing the thread hangs, but can now usually recover without an MDS reboot. It remains a serious problem.
      The trace looks like this:

      Call Trace:
      [<ffffffff888e8c10>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
      [<ffffffff888ea762>] ldlm_completion_ast+0x4c2/0x880 [ptlrpc]
      [<ffffffff888cf709>] ldlm_lock_enqueue+0x9d9/0xb20 [ptlrpc]
      [<ffffffff8008d07b>] default_wake_function+0x0/0xe
      [<ffffffff888cab6a>] ldlm_lock_addref_internal_nolock+0x3a/0x90 [ptlrpc]
      [<ffffffff888e92cb>] ldlm_cli_enqueue_local+0x46b/0x520 [ptlrpc]
      [<ffffffff88becd7a>] enqueue_ordered_locks+0x26a/0x4d0 [mds]
      [<ffffffff888e6bc0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
      [<ffffffff888ea2a0>] ldlm_completion_ast+0x0/0x880 [ptlrpc]
      [<ffffffff88bed5c1>] mds_get_parent_child_locked+0x5e1/0x8a0 [mds]
      [<ffffffff88c0f394>] mds_open+0xc44/0x35f8 [mds]
      [<ffffffff8899c2b6>] kiblnd_post_tx_locked+0x566/0x730 [ko2iblnd]
      [<ffffffff888e6d5e>] ldlm_blocking_ast+0x19e/0x2a0 [ptlrpc]
      [<ffffffff887dcb38>] upcall_cache_get_entry+0x958/0xa50 [lvfs]
      [<ffffffff888eb9b8>] ldlm_handle_bl_callback+0x1c8/0x230 [ptlrpc]
      [<ffffffff88be7f49>] mds_reint_rec+0x1d9/0x2b0 [mds]
      [<ffffffff88c13c32>] mds_open_unpack+0x312/0x430 [mds]
      [<ffffffff88bdae7a>] mds_reint+0x35a/0x420 [mds]
      [<ffffffff88bd9d8a>] fixup_handle_for_resent_req+0x5a/0x2c0 [mds]
      [<ffffffff88be4bfc>] mds_intent_policy+0x4ac/0xc80 [mds]
      [<ffffffff888d18b6>] ldlm_resource_putref+0x1b6/0x3c0 [ptlrpc]
      [<ffffffff888ceeb6>] ldlm_lock_enqueue+0x186/0xb20 [ptlrpc]
      [<ffffffff888cb7fd>] ldlm_lock_create+0x9bd/0x9f0 [ptlrpc]
      [<ffffffff888f3720>] ldlm_server_blocking_ast+0x0/0x83d [ptlrpc]
      [<ffffffff888f0849>] ldlm_handle_enqueue+0xbf9/0x1210 [ptlrpc]
      [<ffffffff88be3b20>] mds_handle+0x4130/0x4d60 [mds]
      [<ffffffff887ffbe5>] lnet_match_blocked_msg+0x375/0x390 [lnet]
      [<ffffffff88914705>] lustre_msg_get_conn_cnt+0x35/0xf0 [ptlrpc]
      [<ffffffff8006e244>] do_gettimeoffset_tsc+0x19/0x3c
      [<ffffffff8891bc37>] ptlrpc_server_handle_request+0xaa7/0x1150 [ptlrpc]
      [<ffffffff8008ca80>] __activate_task+0x56/0x6d
      [<ffffffff8008c865>] dequeue_task+0x18/0x37
      [<ffffffff80062ff8>] thread_return+0x62/0xfe
      [<ffffffff8003da91>] lock_timer_base+0x1b/0x3c
      [<ffffffff8001cb46>] __mod_timer+0x100/0x10f
      [<ffffffff8891f698>] ptlrpc_main+0x1258/0x1420 [ptlrpc]
      [<ffffffff8008d07b>] default_wake_function+0x0/0xe
      [<ffffffff800b7a9c>] audit_syscall_exit+0x336/0x362
      [<ffffffff8005dfb1>] child_rip+0xa/0x11
      [<ffffffff8891e440>] ptlrpc_main+0x0/0x1420 [ptlrpc]
      [<ffffffff8005dfa7>] child_rip+0x0/0x11
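
      For triage, a trace like the one above can be summarized mechanically. The following is a minimal sketch, not part of the original report: the helper names parse_trace and module_counts are hypothetical, and the regular expression assumes the "[<address>] symbol+offset/size [module]" layout shown in the trace. Frames without a module tag are assumed to belong to the core kernel. Entries at offset 0x0 are flagged because, in traces of this style, they are often function pointers left on the stack rather than active calls; treat that as a heuristic only.

```python
import re
from collections import Counter

# Matches lines such as:
#   [<ffffffff888e8c10>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
# The trailing [module] tag is optional (core kernel symbols have none).
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_trace(text):
    """Return one dict per recognizable stack frame, in trace order."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if not m:
            continue  # skip headers such as "Call Trace:" and blank lines
        offset = int(m.group("offset"), 16)
        frames.append({
            "symbol": m.group("symbol"),
            "offset": offset,
            # Assumption: no [module] tag means a core kernel symbol.
            "module": m.group("module") or "kernel",
            # Heuristic only: offset 0 often marks a stale function pointer.
            "zero_offset": offset == 0,
        })
    return frames


def module_counts(frames):
    """Count how many frames each module contributes to the trace."""
    return Counter(f["module"] for f in frames)


sample = """\
Call Trace:
[<ffffffff888e8c10>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
[<ffffffff888ea762>] ldlm_completion_ast+0x4c2/0x880 [ptlrpc]
[<ffffffff8008d07b>] default_wake_function+0x0/0xe
[<ffffffff88c0f394>] mds_open+0xc44/0x35f8 [mds]
"""

frames = parse_trace(sample)
for f in frames:
    flag = " (offset 0)" if f["zero_offset"] else ""
    print(f"{f['module']:>8}  {f['symbol']}{flag}")
print(dict(module_counts(frames)))
```

      Running this over several dumped threads and comparing the per-module counts is one quick way to check whether the hung threads share the same path through ptlrpc and mds.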

            People

              Assignee: Oleg Drokin (green)
              Reporter: Steven Woods (woods)
              Votes: 0
              Watchers: 12
