MDS hangs after calltrace at ldlm_expired_completion_wait()

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 1.8.6
    • None
    • 3
    • 10343

    Description

      We saw the following call trace on the MDS, which hung afterward.

      Apr 23 15:58:34 ALPL505 kernel: Call Trace:
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88953a00>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88955542>] ldlm_completion_ast+0x4c2/0x880 [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8893a709>] ldlm_lock_enqueue+0x9d9/0xb20 [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8008e421>] default_wake_function+0x0/0xe
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88935b6a>] ldlm_lock_addref_internal_nolock+0x3a/0x90 [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff889540bb>] ldlm_cli_enqueue_local+0x46b/0x520 [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88caa157>] enqueue_ordered_locks+0x387/0x4d0 [mds]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff889519a0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88955080>] ldlm_completion_ast+0x0/0x880 [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88caa8e9>] mds_get_parent_child_locked+0x649/0x960 [mds]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88c9b652>] mds_getattr_lock+0x632/0xc90 [mds]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88c96dda>] fixup_handle_for_resent_req+0x5a/0x2c0 [mds]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88ca1d83>] mds_intent_policy+0x623/0xc20 [mds]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8893c270>] ldlm_resource_putref_internal+0x230/0x460 [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88939eb6>] ldlm_lock_enqueue+0x186/0xb20 [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff889367fd>] ldlm_lock_create+0x9bd/0x9f0 [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8895e870>] ldlm_server_blocking_ast+0x0/0x83d [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8895bb39>] ldlm_handle_enqueue+0xc09/0x1210 [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88ca0b30>] mds_handle+0x40e0/0x4d10 [mds]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff800774ed>] smp_send_reschedule+0x4e/0x53
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8008ddcd>] enqueue_task+0x41/0x56
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8897fd55>] lustre_msg_get_conn_cnt+0x35/0xf0 [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff889896d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88989e35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8008c85d>] __wake_up_common+0x3e/0x68
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8898adc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88989e60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
      Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
      

      This might be related to LU-59; please review.
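
To compare this trace against the ones in related tickets, it can help to reduce it to a list of symbols and modules. The following is a minimal sketch, assuming the syslog frame format shown in the trace above; `parse_frames`, `summarize`, and `SAMPLE` are hypothetical names introduced for illustration and are not part of Lustre or any of its tools.

```python
import re
from collections import Counter

# Matches one frame as printed in the trace above, for example:
#   [<ffffffff88953a00>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
# The trailing [module] is optional; core kernel symbols have none.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(lines):
    """Return one dict per stack frame found in the given log lines."""
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match is None:
            continue  # e.g. the "Call Trace:" header line
        frames.append({
            "symbol": match.group("symbol"),
            "offset": int(match.group("offset"), 16),
            "module": match.group("module") or "kernel",
        })
    return frames


def summarize(frames):
    """Count frames per module (frames without a module count as 'kernel')."""
    return Counter(frame["module"] for frame in frames)


# A few lines copied from the trace in this ticket.
SAMPLE = [
    "Apr 23 15:58:34 ALPL505 kernel: Call Trace:",
    "Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88953a00>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]",
    "Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88caa157>] enqueue_ordered_locks+0x387/0x4d0 [mds]",
    "Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8008e421>] default_wake_function+0x0/0xe",
]

if __name__ == "__main__":
    for frame in parse_frames(SAMPLE):
        print(f'{frame["module"]:8s} {frame["symbol"]}+{frame["offset"]:#x}')
```

Feeding the full syslog excerpt through `parse_frames` gives a symbol list that can be diffed against traces from other tickets without the timestamps and addresses getting in the way.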

          Activity

            [LU-1395] MDS hangs after calltrace at ldlm_expired_completion_wait()

            adilger Andreas Dilger added a comment - Close old ticket.

            kitwestneat Kit Westneat (Inactive) added a comment - This looks like a dupe of LU-500 and LU-1269. I think that because LU-1269 is marked as an improvement rather than a bug, it hasn't been getting the attention it should. There appear to be several different ideas for fixing the issue. Can someone take a look at it? We have been hitting this bug regularly, most recently at IU.

            green Oleg Drokin added a comment - This trace is just a sign of a client not responding to a lock cancel request. It should be followed by the client being evicted. We need the client log to see what was happening there, I guess.

            ihara Shuichi Ihara (Inactive) added a comment - Hi Peter, Oleg, could you please take a quick look at this? We have seen similar problems at a couple of sites.

            pjones Peter Jones added a comment - Oleg will look into this one

            People

              green Oleg Drokin
              ihara Shuichi Ihara (Inactive)
              Votes: 0
              Watchers: 7
