Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Affects Version/s: Lustre 2.7.0, Lustre 2.5.3
    • Fix Version/s: None
    • Severity: 3

    Description

      On the MDT, an LNet service thread hung. Dumped stack trace:

      Jan 20 13:42:10 nbp8-mds1 kernel: LNet: Service thread pid 17862 was inactive for 424.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jan 20 13:42:10 nbp8-mds1 kernel: Pid: 17862, comm: mdt00_057
      Jan 20 13:42:14 nbp8-mds1 kernel: 
      Jan 20 13:42:14 nbp8-mds1 kernel: Call Trace:
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffff8155a9c2>] schedule_timeout+0x192/0x2e0
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffff81083300>] ? process_timeout+0x0/0x10
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffffa078af70>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffffa078f7a1>] ldlm_completion_ast+0x4b1/0x920 [ptlrpc]
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffff81061fe0>] ? default_wake_function+0x0/0x20
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffffa078ef00>] ldlm_cli_enqueue_local+0x1f0/0x5e0 [ptlrpc]
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e7cde4>] mdt_object_lock0+0x394/0xb30 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e7d644>] mdt_object_lock+0x14/0x20 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e7d801>] mdt_object_find_lock+0x61/0x170 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0ea982c>] mdt_reint_open+0x88c/0x21a0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa04fb5d6>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0652ef0>] ? lu_ucred+0x20/0x30 [obdclass]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e71935>] ? mdt_ucred+0x15/0x20 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e8e51c>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa07df766>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e92481>] mdt_reint_rec+0x41/0xe0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e77ed3>] mdt_reint_internal+0x4c3/0x780 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e7845e>] mdt_intent_reint+0x1ee/0x410 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e75c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa076f2c5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0798ebb>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e76106>] mdt_enqueue+0x46/0xe0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e7aada>] mdt_handle_common+0x52a/0x1470 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0eb74a5>] mds_regular_handle+0x15/0x20 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa07c80c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa04f08d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa07c0a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa07ca89d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa07c9da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      
      Jan 20 13:42:18 nbp8-mds1 kernel: Pid: 17780, comm: mdt00_044
      Jan 20 13:42:18 nbp8-mds1 kernel: 
      Jan 20 13:42:18 nbp8-mds1 kernel: Call Trace:
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffff8155a9c2>] schedule_timeout+0x192/0x2e0
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffff81083300>] ? process_timeout+0x0/0x10
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa078af70>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa078f7a1>] ldlm_completion_ast+0x4b1/0x920 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffff81061fe0>] ? default_wake_function+0x0/0x20
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa078ef00>] ldlm_cli_enqueue_local+0x1f0/0x5e0 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e7cc06>] mdt_object_lock0+0x1b6/0xb30 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e7d644>] mdt_object_lock+0x14/0x20 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e85b8e>] mdt_getattr_name_lock+0x8fe/0x19d0 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa07df766>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa07ba7b4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e86ef9>] mdt_intent_getattr+0x299/0x480 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e75c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa076f2c5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0798ebb>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e76106>] mdt_enqueue+0x46/0xe0 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e7aada>] mdt_handle_common+0x52a/0x1470 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0eb74a5>] mds_regular_handle+0x15/0x20 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa07c80c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffff81061ff2>] ? default_wake_function+0x12/0x20
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa07ca89d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa07c9da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      Jan 20 13:42:19 nbp8-mds1 kernel: 
      Jan 20 13:42:19 nbp8-mds1 kernel: Pid: 16840, comm: mdt03_017
      Jan 20 13:42:19 nbp8-mds1 kernel: 
      Jan 20 13:42:19 nbp8-mds1 kernel: Call Trace:
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffffa076a885>] ? _ldlm_lock_debug+0x2d5/0x660 [ptlrpc]
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffff8155a9c2>] schedule_timeout+0x192/0x2e0
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffff81083300>] ? process_timeout+0x0/0x10
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffffa078af70>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffffa078f7a1>] ldlm_completion_ast+0x4b1/0x920 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffff81061fe0>] ? default_wake_function+0x0/0x20
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa078ef00>] ldlm_cli_enqueue_local+0x1f0/0x5e0 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e7cc06>] mdt_object_lock0+0x1b6/0xb30 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e7d644>] mdt_object_lock+0x14/0x20 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e85b8e>] mdt_getattr_name_lock+0x8fe/0x19d0 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07df766>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07ba7b4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e86ef9>] mdt_intent_getattr+0x299/0x480 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e75c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa076f2c5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0798ebb>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e76106>] mdt_enqueue+0x46/0xe0 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e7aada>] mdt_handle_common+0x52a/0x1470 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0eb74a5>] mds_regular_handle+0x15/0x20 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07c80c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa04f08d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07c0a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07ca89d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07c9da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      This looks the same as LU-7232.

      I can upload debug logs if needed.

      Attachments

        1. lustre-log.1453326130.17862.gz
          0.3 kB
        2. mds.dmesg
          237 kB
        3. mds.processtrace
          2.01 MB

        Issue Links

          Activity

            [LU-7692] LNet: Service thread Hung

            jaylan Jay Lan (Inactive) added a comment -

            I checked the difference between patch set 3 and patch set 4 of
            http://review.whamcloud.com/#/c/17853/
            There is no difference; they are identical.

            Assuming http://review.whamcloud.com/17976 is a b2_5_fe port of 17853 patch set 3, and given that we already have that patch on our MDS/MGS, the patch did not address our problem.

            jaylan Jay Lan (Inactive) added a comment (edited) -

            I just posted a comment to LU-7372:

            "
            Zhenyu Xu ported an earlier version (patch set 3, IIRC) of #17853 for b2_5_fe at
            http://review.whamcloud.com/17976 (see comment at 12/Jan/16 9:26 PM)

            Could someone from Intel update 17976 to the latest patch set #4 for us? Thanks!
            "


            jaylan Jay Lan (Inactive) added a comment -

            I always wonder how I can tell which patch set of a review was cherry-picked into our tree. Note that we cherry-picked some patches that are not in the b2_5_fe repo, so the commit id differs from the one I cherry-picked from. I have been judging by the date of the cherry-pick and by examining the code. Is there a good way to tell?

            As to http://review.whamcloud.com/#/c/17853/, I picked patch set #2 of 17853 (see LU-7640), but the MDS still hung. I noticed that the latest patch set is #4. I will cherry-pick #4 when it passes your internal testing.
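
            One hedged way to check (assuming the tree is a git clone and the change was applied with git cherry-pick; the remote URL and paths below are illustrative): fetch each candidate patch set via its Gerrit change ref and compare the patch content, rather than commit ids, against the local commit.

            # Gerrit exposes patch sets as refs/changes/<last-2-digits>/<change>/<patchset>,
            # so patch set 4 of change 17853 is refs/changes/53/17853/4
            git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/53/17853/4
            git show FETCH_HEAD > /tmp/ps4.patch
            git show <local-cherry-picked-sha> > /tmp/local.patch
            # interdiff (from patchutils) prints nothing when the two patches
            # carry the same change, even though the commit ids differ
            interdiff /tmp/ps4.patch /tmp/local.patch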

            bobijam Zhenyu Xu added a comment -

            I think the patch http://review.whamcloud.com/#/c/17853/ addresses the MDT thread lockup issue, where the thread won't come out of ldlm_completion_ast() waiting.

            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffffa076a885>] ? _ldlm_lock_debug+0x2d5/0x660 [ptlrpc]
            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffff8155a9c2>] schedule_timeout+0x192/0x2e0
            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffff81083300>] ? process_timeout+0x0/0x10
            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffffa078af70>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffffa078f7a1>] ldlm_completion_ast+0x4b1/0x920 [ptlrpc]
            ...
            

            ldlm_expired_completion_wait() should return -ETIMEDOUT instead of 0; otherwise ldlm_completion_ast() never finishes and keeps waiting.
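
            To make that failure mode concrete, here is a minimal user-space sketch of the wait pattern (illustrative names only, not the actual Lustre source): the return value of the expiry callback is the only thing that lets the waiter escape the loop.

            #include <errno.h>

            struct waited_lock { int granted; int expired; };

            /* Expiry callback: the fix is to report -ETIMEDOUT on expiry;
             * returning 0 means "not expired, keep waiting". */
            static int expired_completion_wait(struct waited_lock *lk)
            {
                    return lk->expired ? -ETIMEDOUT : 0;
            }

            static int completion_ast_wait(struct waited_lock *lk)
            {
                    for (;;) {
                            if (lk->granted)
                                    return 0;       /* lock granted, done */
                            int rc = expired_completion_wait(lk);
                            if (rc != 0)
                                    return rc;      /* -ETIMEDOUT breaks the wait */
                            /* rc == 0: sleep and retry. If the callback can only
                             * return 0, this loop never exits -- the hang seen in
                             * the stack traces under schedule_timeout(). */
                    }
            }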


            mhanafi Mahmoud Hanafi added a comment -

            This issue is triggered when our remote cluster loses connectivity to the servers. We are trying to find the root cause of the network errors, but the MDT threads shouldn't lock up.

            green Oleg Drokin added a comment -

            The relationship between the message and the eviction is more or less direct.

            What happens here is:
            1. A client gets a lock and starts doing IO under it.
            2. Somebody else wants the lock, so the server asks the above client to finish the IO and drop the lock.
            3. The IO encounters a timeout and the client is evicted.
            4. The lock is cancelled.
            5. If some thread spent a long time waiting for the lock, you'll get the message.

            Now in your case the eviction is on one server (an OST), but the message is on another (the MDT). As such, they should not really influence each other.
            But if some message dropping is going on and a blocking AST is lost in transit, you'll see exactly this:
            "lock callback timer expired after" on the server, which then evicts the offending client. The thread that was waiting for the lock is then able to get it, do whatever processing it needed, and finally complain that the processing took too long, all thanks to the long wait to get the lock.


            mhanafi Mahmoud Hanafi added a comment -

            Some info from the user: each task reads a restart file, does some processing, and writes the results. All 5832 restart files are in the same directory.


            mhanafi Mahmoud Hanafi added a comment -

            I am checking with a suspected user now. His job is 484 nodes and 5832 tasks. The nodes are part of a remote cluster connected via routers.

            Can a client eviction cause this? Here we see this client getting evicted.

            Jan 21 13:16:19 nbp8-mds1 kernel: LustreError: 0:0:(ldlm_lockd.c:346:waiting_locks_callback()) ### lock callback timer expired after 251s: evicting client at 10.153.10.226@o2ib233  ns: mdt-nbp8-MDT0000_UUID lock: ffff883cfce50dc0/0x8ea5cf5f62ad392a lrc: 3/0,0 mode: PR/PR res: [0x3603b8f6b:0xb4b:0x0].0 bits 0x13 rrc: 584 type: IBT flags: 0x60200400000020 nid: 10.153.10.226@o2ib233 remote: 0x4103cc31cf8b204c expref: 131 pid: 16613 timeout: 4333267923 lvb_type: 0
            Jan 21 13:17:34 nbp8-mds1 kernel: LNet: Service thread pid 16447 was inactive for 300.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            Jan 21 13:17:34 nbp8-mds1 kernel: Pid: 16447, comm: mdt01_085
            Jan 21 13:17:38 nbp8-mds1 kernel: 
            Jan 21 13:17:38 nbp8-mds1 kernel: Call Trace:
            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffffa076a885>] ? _ldlm_lock_debug+0x2d5/0x660 [ptlrpc]
            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffff8155a9c2>] schedule_timeout+0x192/0x2e0
            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffff81083300>] ? process_timeout+0x0/0x10
            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffffa078af70>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffffa078f7a1>] ldlm_completion_ast+0x4b1/0x920 [ptlrpc]
            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffff81061fe0>] ? default_wake_function+0x0/0x20
            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffffa078ef00>] ldlm_cli_enqueue_local+0x1f0/0x5e0 [ptlrpc]
            Jan 21 13:17:38 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0e73de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0e7dde4>] mdt_object_lock0+0x394/0xb30 [mdt]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0e73de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0e7e644>] mdt_object_lock+0x14/0x20 [mdt]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0e7e801>] mdt_object_find_lock+0x61/0x170 [mdt]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0eaa88c>] mdt_reint_open+0x88c/0x21a0 [mdt]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa04fb5d6>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0652ef0>] ? lu_ucred+0x20/0x30 [obdclass]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0e72935>] ? mdt_ucred+0x15/0x20 [mdt]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0e8f51c>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa07df766>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0e93481>] mdt_reint_rec+0x41/0xe0 [mdt]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0e78ed3>] mdt_reint_internal+0x4c3/0x780 [mdt]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0e7945e>] mdt_intent_reint+0x1ee/0x410 [mdt]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0e76c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa076f2c5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0798ebb>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0e77106>] mdt_enqueue+0x46/0xe0 [mdt]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0e7bada>] mdt_handle_common+0x52a/0x1470 [mdt]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa0eb8505>] mds_regular_handle+0x15/0x20 [mdt]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa07c80c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa04f08d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa07c0a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa07ca89d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
            Jan 21 13:17:41 nbp8-mds1 kernel: [<ffffffffa07c9da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
            

            Here is the client console log:

            [1453410882.304755] LNetError: 7223:0:(o2iblnd_cb.c:3018:kiblnd_check_txs_locked()) Timed out tx: active_txs, 3 seconds
            [1453410882.316755] LNetError: 7223:0:(o2iblnd_cb.c:3081:kiblnd_check_conns()) Timed out RDMA with 10.153.26.93@o2ib233 (13): c: 61, oc: 0, rc: 63
            [1453410882.328756] LustreError: 7224:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff8807b8798000
            [1453410882.340756] Lustre: 7260:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1453410728/real 1453410728]  req@ffff880c045b5000 x1523566274464988/t0(0) o4->nbp8-OST00eb-osc-ffff88060b4b8800@10.151.27.63@o2ib:6/4 lens 488/448 e 0 to 1 dl 1453411070 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [1453410882.340756] Lustre: 7260:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 82 previous similar messages
            [1453410882.340756] Lustre: nbp8-OST00eb-osc-ffff88060b4b8800: Connection to nbp8-OST00eb (at 10.151.27.63@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            [1453410882.340756] Lustre: Skipped 133 previous similar messages
            [1453410882.404758] Lustre: nbp8-OST00eb-osc-ffff88060b4b8800: Connection restored to nbp8-OST00eb (at 10.151.27.63@o2ib)
            [1453410882.404758] Lustre: Skipped 56 previous similar messages
            [1453410919.469932] LustreError: 7232:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc ffff880bce378000
            [1453411070.294708] Lustre: 24291:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1453410728/real 1453410728]  req@ffff88061692bc00 x1523566274464784/t0(0) o101->nbp8-MDT0000-mdc-ffff88060b4b8800@10.151.27.60@o2ib:12/10 lens 624/4944 e 0 to 1 dl 1453411070 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [1453411070.322709] Lustre: 24291:0:(client.c:1940:ptlrpc_expire_one_request()) Skipped 1 previous similar message
            [1453411070.334710] Lustre: nbp8-MDT0000-mdc-ffff88060b4b8800: Connection to nbp8-MDT0000 (at 10.151.27.60@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            [1453411071.346742] Lustre: nbp9-OST00e3-osc-ffff880607dea400: Connection restored to nbp9-OST00e3 (at 10.151.26.9@o2ib)
            [1453411096.347534] Lustre: nbp8-OST00be-osc-ffff88060b4b8800: Connection to nbp8-OST00be (at 10.151.27.70@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            [1453411096.363534] Lustre: Skipped 169 previous similar messages
            [1453411121.368326] Lustre: nbp2-OST001f-osc-ffff880607d01c00: Connection restored to nbp2-OST001f (at 10.151.26.108@o2ib)
            [1453411121.368326] Lustre: Skipped 178 previous similar messages
            [1453411146.369118] Lustre: 7247:0:(client.c:1940:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1453410804/real 1453410804]  req@ffff880a414bb400 x1523566274482484/t0(0) o400->nbp7-OST0014-osc-ffff880607e0e000@10.151.27.45@o2ib:28/4 lens 224/224 e 0 to 1 dl 1453411146 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [1453411146.369118] Lustre: nbp7-OST0023-osc-ffff880607e0e000: Connection to nbp7-OST0023 (at 10.151.27.51@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            [1453411146.369118] Lustre: Skipped 34 previous similar messages
            
            pjones Peter Jones added a comment -

            Oleg

            NASA shared that there is one specific user who can trigger this issue with their job runs; they will supply details when they are able to.

            Bobijam

            Could you please review the uploaded debug logs and see if it is possible to deduce anything about the situation?

            Thanks

            Peter

            green Oleg Drokin added a comment -

            Is this a particular job of this user that triggers it? How large is the scale of that job?


            mhanafi Mahmoud Hanafi added a comment -

            Uploading a debug dump to see if it can help.


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: