Lustre / LU-6015

mds evicting clients - lock timed out?

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.1
    • Component/s: None
    • Environment: EL6
    • Severity: 3
    • 16763

    Description

      We have seen this issue twice in the last week as the load on one of our Lustre filesystems has increased. Long story short: we have 4 Lustre clients exporting the filesystem over NFS (round-robin). The MDS seems to get snagged on a lock, and then the clients take it in turn to drop off, which sends IO errors to the NFS clients.

      Dec 10 16:00:02 cmds1 kernel: LNet: Service thread pid 22181 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Dec 10 16:00:02 cmds1 kernel: Pid: 22181, comm: mdt00_026
      Dec 10 16:00:02 cmds1 kernel: 
      Dec 10 16:00:02 cmds1 kernel: Call Trace:
      Dec 10 16:00:02 cmds1 kernel: [<ffffffff8150f362>] schedule_timeout+0x192/0x2e0
      Dec 10 16:00:02 cmds1 kernel: [<ffffffff810811e0>] ? process_timeout+0x0/0x10
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa03fe6d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
      Dec 10 16:00:02 cmds1 kernel: Lustre: lock timed out (enqueued at 1418227002, 200s ago)
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa06a601d>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa06a1790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa06a5758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
      Dec 10 16:00:02 cmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1418227202.22181
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa06a5b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0d93a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0d99c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0d93a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa06a5b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0d9a4f4>] mdt_object_lock+0x14/0x20 [mdt]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0dc24f8>] mdt_object_open_lock+0x1c8/0x510 [mdt]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0da6bfa>] ? mdt_attr_get_complex+0x38a/0x770 [mdt]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0dc76b3>] mdt_open_by_fid_lock+0x443/0x7d0 [mdt]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0dc829b>] mdt_reint_open+0x56b/0x20c0 [mdt]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa041a82e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa06cedcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0565f50>] ? lu_ucred+0x20/0x30 [obdclass]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0d93015>] ? mdt_ucred+0x15/0x20 [mdt]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa0daf15c>] ? mdt_root_squash+0x2c/0x410 [mdt]
      Dec 10 16:00:02 cmds1 kernel: [<ffffffffa06f6646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
      ..
      ..
      Dec 10 16:07:58 cmds1 kernel: 
      Dec 10 16:07:58 cmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1418227678.7488
      Dec 10 16:11:13 cmds1 kernel: Lustre: 22178:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-271), not sending early reply
      Dec 10 16:11:13 cmds1 kernel:  req@ffff882fd6dfd000 x1486501717804232/t0(0) o101->4540220b-aff6-ec4a-3862-59a2488b6775@10.21.22.27@tcp:0/0 lens 576/3448 e 5 to 0 dl 1418227878 ref 2 fl Interpret:/0/0 rc 0/0
      Dec 10 16:11:44 cmds1 kernel: Lustre: 9532:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-271), not sending early reply
      Dec 10 16:11:44 cmds1 kernel:  req@ffff88427dd19400 x1486495779658676/t0(0) o101->dce9c151-cc13-2ac2-7911-8ded672c09d6@10.21.22.29@tcp:0/0 lens 608/3448 e 5 to 0 dl 1418227909 ref 2 fl Interpret:/0/0 rc 0/0
      Dec 10 16:12:02 cmds1 kernel: Lustre: 7487:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-260), not sending early reply
      Dec 10 16:12:02 cmds1 kernel:  req@ffff88026471d800 x1486501720068016/t0(0) o101->4540220b-aff6-ec4a-3862-59a2488b6775@10.21.22.27@tcp:0/0 lens 576/3448 e 3 to 0 dl 1418227927 ref 2 fl Interpret:/0/0 rc 0/0
      Dec 10 16:12:34 cmds1 kernel: Lustre: 6826:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-134), not sending early reply
      Dec 10 16:12:34 cmds1 kernel:  req@ffff8822e2c0d000 x1486501725457272/t0(0) o101->4540220b-aff6-ec4a-3862-59a2488b6775@10.21.22.27@tcp:0/0 lens 608/3448 e 1 to 0 dl 1418227959 ref 2 fl Interpret:/0/0 rc 0/0
      Dec 10 16:13:17 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) reconnecting
      Dec 10 16:13:17 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) refused reconnection, still busy with 6 active RPCs
      Dec 10 16:13:42 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) reconnecting
      Dec 10 16:13:42 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) refused reconnection, still busy with 3 active RPCs
      Dec 10 16:13:48 cmds1 kernel: Lustre: charlie-MDT0000: Client dce9c151-cc13-2ac2-7911-8ded672c09d6 (at 10.21.22.29@tcp) reconnecting
      Dec 10 16:13:48 cmds1 kernel: Lustre: charlie-MDT0000: Client dce9c151-cc13-2ac2-7911-8ded672c09d6 (at 10.21.22.29@tcp) refused reconnection, still busy with 2 active RPCs
      Dec 10 16:13:53 cmds1 kernel: Lustre: 22166:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-431), not sending early reply
      Dec 10 16:13:53 cmds1 kernel:  req@ffff882fe3e6a000 x1486492815968152/t0(0) o101->b9fd8649-3e4d-fb11-ba42-c2526b8f651f@10.21.22.28@tcp:0/0 lens 576/1152 e 4 to 0 dl 1418228038 ref 2 fl Interpret:/0/0 rc 0/0
      Dec 10 16:14:07 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) reconnecting
      Dec 10 16:14:07 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) refused reconnection, still busy with 3 active RPCs
      Dec 10 16:14:13 cmds1 kernel: Lustre: charlie-MDT0000: Client dce9c151-cc13-2ac2-7911-8ded672c09d6 (at 10.21.22.29@tcp) reconnecting
      Dec 10 16:14:13 cmds1 kernel: Lustre: charlie-MDT0000: Client dce9c151-cc13-2ac2-7911-8ded672c09d6 (at 10.21.22.29@tcp) refused reconnection, still busy with 2 active RPCs
      Dec 10 16:14:32 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) reconnecting
      Dec 10 16:14:32 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) refused reconnection, still busy with 3 active RPCs
      Dec 10 16:14:57 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) reconnecting
      Dec 10 16:14:57 cmds1 kernel: Lustre: Skipped 1 previous similar message
      Dec 10 16:14:57 cmds1 kernel: Lustre: charlie-MDT0000: Client 4540220b-aff6-ec4a-3862-59a2488b6775 (at 10.21.22.27@tcp) refused reconnection, still busy with 3 active RPCs
      Dec 10 16:14:57 cmds1 kernel: Lustre: Skipped 1 previous similar message
      

      Both times, the only way to get back to a stable state was to reboot the MDS and abort recovery. Logs attached.
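
      For reference, a minimal sketch of the "abort recovery" step (the device path and mount point below are placeholders, not taken from our config):

      # Bring the MDT back with abort_recov so the MDS skips the recovery
      # window instead of waiting for every client to reconnect and replay:
      mount -t lustre -o abort_recov /dev/<mdt_device> /mnt/mdt

      # Or abort recovery on an already-mounted MDT; look up the device
      # index with "lctl dl" first:
      lctl --device <mdt_index> abort_recovery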

      Attachments

      Issue Links

      Activity

            [LU-6015] mds evicting clients - lock timed out?
            pjones Peter Jones added a comment -

            Good news - thanks Daire!

            daire Daire Byrne (Inactive) added a comment -

            Just to update - we have applied the patch and have not seen any further instances of this issue in 3 weeks now. You can close this if you wish and we'll re-open should we hit it again.

            Cheers.
            laisiyao Lai Siyao added a comment -

            Yes, it's needed by the MDS only.

            daire Daire Byrne (Inactive) added a comment -

            Okay, great - we'll organise some downtime to upgrade the code. Just to be clear - this only needs to be applied to the MDS?
            laisiyao Lai Siyao added a comment -

            This looks to be a duplicate of LU-4152, and the patch for LU-4152 (http://review.whamcloud.com/#/c/8083/) can be applied to the 2.4.1 code directly. Will you apply it and test again?
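
            A minimal sketch of pulling that change onto a 2.4.1 tree, assuming a git checkout and Gerrit's refs/changes convention (the project path and the patchset number <N> are assumptions to verify against the review page):

            # Fetch Gerrit change 8083 (the LU-4152 fix) and cherry-pick it
            # onto the 2.4.1 source; <N> is the patchset number shown on the
            # review page.
            cd lustre-release
            git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/83/8083/<N>
            git cherry-pick FETCH_HEAD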

            niu Niu Yawei (Inactive) added a comment -

            From the full trace it looks like a deadlock; this should be a dup of LU-4152:

            Dec 12 18:55:07 cmds1 kernel: mdt05_032     S 000000000000000b     0 15079      2 0x00000000
            Dec 12 18:55:07 cmds1 kernel: ffff88500a907770 0000000000000046 0000000000000000 ffff88500a9077c0
            Dec 12 18:55:07 cmds1 kernel: 0000000000000001 ffff885d74c78040 ffff88500a907740 ffffffffa03b92d1
            Dec 12 18:55:07 cmds1 kernel: ffff885d74c785f8 ffff88500a907fd8 000000000000fb88 ffff885d74c785f8
            Dec 12 18:55:07 cmds1 kernel: Call Trace:
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa03b92d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa03a96fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06810aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa067c790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0680758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0680b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d84a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d8ac7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d84a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0680b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d8b4f4>] mdt_object_lock+0x14/0x20 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0db34f8>] mdt_object_open_lock+0x1c8/0x510 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d97bfa>] ? mdt_attr_get_complex+0x38a/0x770 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0db86b3>] mdt_open_by_fid_lock+0x443/0x7d0 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0db929b>] mdt_reint_open+0x56b/0x20c0 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa03c582e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06a9dcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0540f50>] ? lu_ucred+0x20/0x30 [obdclass]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d84015>] ? mdt_ucred+0x15/0x20 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0da015c>] ? mdt_root_squash+0x2c/0x410 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06d1646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0540f50>] ? lu_ucred+0x20/0x30 [obdclass]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0da4911>] mdt_reint_rec+0x41/0xe0 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d89ae3>] mdt_reint_internal+0x4c3/0x780 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d8a06d>] mdt_intent_reint+0x1ed/0x520 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d87f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0661831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06881ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d883a6>] mdt_enqueue+0x46/0xe0 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d8ea97>] mdt_handle_common+0x647/0x16d0 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06aabac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0dc83f5>] mds_regular_handle+0x15/0x20 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06ba3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
            
            Dec 12 18:55:07 cmds1 kernel: mdt03_012     S 000000000000002c     0  7511      2 0x00000000
            Dec 12 18:55:07 cmds1 kernel: ffff882f6da2f8f0 0000000000000046 00000000548b3476 ffff882f6da2f940
            Dec 12 18:55:07 cmds1 kernel: 0000000000000001 ffff882f8e4f8aa0 ffff882f6da2f8c0 ffffffffa03b92d1
            Dec 12 18:55:07 cmds1 kernel: ffff882f8e4f9058 ffff882f6da2ffd8 000000000000fb88 ffff882f8e4f9058
            Dec 12 18:55:07 cmds1 kernel: Call Trace:
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa03b92d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa03a96fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06810aa>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa067c790>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0680758>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0680b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d84a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d84a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d8ac7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d84a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0680b30>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d8b4f4>] mdt_object_lock+0x14/0x20 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d9a5a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06a9135>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06d1646>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06ab3c4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d9b3ad>] mdt_intent_getattr+0x29d/0x490 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d87f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0661831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06881ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d883a6>] mdt_enqueue+0x46/0xe0 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0d8ea97>] mdt_handle_common+0x647/0x16d0 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06aabac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa0dc83f5>] mds_regular_handle+0x15/0x20 [mdt]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06ba3c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa03a95de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa03bad9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06b1729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06bb75e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06bac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06bac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffffa06bac90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
            Dec 12 18:55:07 cmds1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
            

            daire Daire Byrne (Inactive) added a comment -

            Are there any updates on this? Was the sysrq info provided useful? Can we provide any more information?

            It may or may not be useful, but when we saw this issue for the first time 2 weeks ago, it was right after changing the network card hardware in 2 of the 4 clients/NFS exporters. The process of bringing the clients back online caused the MDS to fall over. All 4 of the clients and the MDS have been rebooted a few times since then, yet we now seem to hit the issue frequently under high load. And because the clients are not recovering, long-running simulations are getting IO errors, resulting in lots of lost render time.

            daire Daire Byrne (Inactive) added a comment -

            This just happened again (so only two days between occurrences now). I'm attaching the messages file after doing a sysrq. I then had to reboot the server to get things stable.

            daire Daire Byrne (Inactive) added a comment -

            Okay, we will do that. I was hoping that the Lustre dumps to /tmp might have contained enough debug info to figure out what was going on, but apparently not.
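
            The binary dumps (e.g. /tmp/lustre-log.1418227202.22181 above) can be decoded to readable text with lctl; a minimal sketch, with a placeholder output path:

            # "df" is the lctl shorthand for debug_file; it converts a binary
            # Lustre debug dump into plain text for inspection.
            lctl df /tmp/lustre-log.1418227202.22181 /tmp/lustre-log.1418227202.txt
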
            laisiyao Lai Siyao added a comment -

            It looks like the MDS enqueued a local lock but timed out; in that case it will sleep, waiting for the lock to become available. Normally the conflicting lock should eventually be released (if other clients don't cooperate, the MDS will evict them to make that happen), so something must have gone wrong and caused a deadlock. To work out the cause of the deadlock, I need to know who holds the lock. Could you run `echo t > /proc/sysrq-trigger` on the MDS to collect all process backtraces when you see this error next time?
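
            For reference, a minimal sketch of collecting those backtraces on an EL6 MDS (the output path is a placeholder):

            # Enable sysrq, dump all task backtraces to the kernel ring buffer,
            # then save a copy; on EL6 the traces also reach /var/log/messages
            # via syslog.
            echo 1 > /proc/sys/kernel/sysrq
            echo t > /proc/sysrq-trigger
            dmesg > /tmp/sysrq-backtraces.txt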

            People

              Assignee: laisiyao Lai Siyao
              Reporter: daire Daire Byrne (Inactive)
              Votes: 0
              Watchers: 4
