  1. Lustre
  2. LU-500

MDS threads hang ldlm_expired_completion_wait+

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • None
    • Lustre 1.8.6
    • None
    • CentOS 5.3
    • 3
    • 24,450
    • 6583

      At a key customer site we have been, and still are, experiencing MDS thread hangs. They were first seen under 1.8.4; once the MDS dumped its threads, the only way to recover was to reboot the MDS. The site upgraded to 1.8.6, which includes an at_min patch from bug 23352 that was thought might help with the issue. They are still seeing the thread hangs, however. They can now usually recover without an MDS reboot, but it remains a serious problem.
      The trace looks like:

      Call Trace:
      [<ffffffff888e8c10>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
      [<ffffffff888ea762>] ldlm_completion_ast+0x4c2/0x880 [ptlrpc]
      [<ffffffff888cf709>] ldlm_lock_enqueue+0x9d9/0xb20 [ptlrpc]
      [<ffffffff8008d07b>] default_wake_function+0x0/0xe
      [<ffffffff888cab6a>] ldlm_lock_addref_internal_nolock+0x3a/0x90 [ptlrpc]
      [<ffffffff888e92cb>] ldlm_cli_enqueue_local+0x46b/0x520 [ptlrpc]
      [<ffffffff88becd7a>] enqueue_ordered_locks+0x26a/0x4d0 [mds]
      [<ffffffff888e6bc0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
      [<ffffffff888ea2a0>] ldlm_completion_ast+0x0/0x880 [ptlrpc]
      [<ffffffff88bed5c1>] mds_get_parent_child_locked+0x5e1/0x8a0 [mds]
      [<ffffffff88c0f394>] mds_open+0xc44/0x35f8 [mds]
      [<ffffffff8899c2b6>] kiblnd_post_tx_locked+0x566/0x730 [ko2iblnd]
      [<ffffffff888e6d5e>] ldlm_blocking_ast+0x19e/0x2a0 [ptlrpc]
      [<ffffffff887dcb38>] upcall_cache_get_entry+0x958/0xa50 [lvfs]
      [<ffffffff888eb9b8>] ldlm_handle_bl_callback+0x1c8/0x230 [ptlrpc]
      [<ffffffff88be7f49>] mds_reint_rec+0x1d9/0x2b0 [mds]
      [<ffffffff88c13c32>] mds_open_unpack+0x312/0x430 [mds]
      [<ffffffff88bdae7a>] mds_reint+0x35a/0x420 [mds]
      [<ffffffff88bd9d8a>] fixup_handle_for_resent_req+0x5a/0x2c0 [mds]
      [<ffffffff88be4bfc>] mds_intent_policy+0x4ac/0xc80 [mds]
      [<ffffffff888d18b6>] ldlm_resource_putref+0x1b6/0x3c0 [ptlrpc]
      [<ffffffff888ceeb6>] ldlm_lock_enqueue+0x186/0xb20 [ptlrpc]
      [<ffffffff888cb7fd>] ldlm_lock_create+0x9bd/0x9f0 [ptlrpc]
      [<ffffffff888f3720>] ldlm_server_blocking_ast+0x0/0x83d [ptlrpc]
      [<ffffffff888f0849>] ldlm_handle_enqueue+0xbf9/0x1210 [ptlrpc]
      [<ffffffff88be3b20>] mds_handle+0x4130/0x4d60 [mds]
      [<ffffffff887ffbe5>] lnet_match_blocked_msg+0x375/0x390 [lnet]
      [<ffffffff88914705>] lustre_msg_get_conn_cnt+0x35/0xf0 [ptlrpc]
      [<ffffffff8006e244>] do_gettimeoffset_tsc+0x19/0x3c
      [<ffffffff8891bc37>] ptlrpc_server_handle_request+0xaa7/0x1150 [ptlrpc]
      [<ffffffff8008ca80>] __activate_task+0x56/0x6d
      [<ffffffff8008c865>] dequeue_task+0x18/0x37
      [<ffffffff80062ff8>] thread_return+0x62/0xfe
      [<ffffffff8003da91>] lock_timer_base+0x1b/0x3c
      [<ffffffff8001cb46>] __mod_timer+0x100/0x10f
      [<ffffffff8891f698>] ptlrpc_main+0x1258/0x1420 [ptlrpc]
      [<ffffffff8008d07b>] default_wake_function+0x0/0xe
      [<ffffffff800b7a9c>] audit_syscall_exit+0x336/0x362
      [<ffffffff8005dfb1>] child_rip+0xa/0x11
      [<ffffffff8891e440>] ptlrpc_main+0x0/0x1420 [ptlrpc]
      [<ffffffff8005dfa7>] child_rip+0x0/0x11
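
When comparing dumps from several hung threads, it can help to reduce a trace like the one above to its symbols and owning kernel modules. The following is a minimal sketch and not part of the original report: the names `parse_call_trace` and `frames_per_module` and the regular expression are assumptions, based only on the frame format shown in this trace.

```python
import re
from collections import Counter

# Matches one frame as printed in the trace above, e.g.
#   [<ffffffff888e8c10>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
# The trailing [module] is absent for core-kernel symbols.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_call_trace(text):
    """Return a list of (symbol, offset, module) tuples.

    Frames without a printed module are labeled 'kernel' (an assumption
    made here for convenience). Lines that are not frames are skipped.
    """
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((
                match.group("symbol"),
                int(match.group("offset"), 16),
                match.group("module") or "kernel",
            ))
    return frames


def frames_per_module(frames):
    """Count how many frames belong to each module."""
    return Counter(module for _, _, module in frames)


if __name__ == "__main__":
    # A few frames copied from the trace in this report.
    sample = """
    Call Trace:
    [<ffffffff888e8c10>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
    [<ffffffff888ea762>] ldlm_completion_ast+0x4c2/0x880 [ptlrpc]
    [<ffffffff8008d07b>] default_wake_function+0x0/0xe
    [<ffffffff88becd7a>] enqueue_ordered_locks+0x26a/0x4d0 [mds]
    """
    frames = parse_call_trace(sample)
    for symbol, offset, module in frames:
        print(f"{module:8s} {symbol}+{offset:#x}")
    print(frames_per_module(frames))
```

Pasting each thread's dump into `parse_call_trace` and comparing the resulting symbol lists is one quick way to check whether all hung threads are waiting in the same `ldlm` enqueue path.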

            Assignee: Oleg Drokin (green)
            Reporter: Steven Woods (woods)