
Many MDS service threads blocked in ldlm_completion_ast()

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.2
    • Severity: 3

    Description

      Our production MDS systems occasionally end up with many service threads blocked in ldlm_completion_ast(). The details were described in LU-4579, but that issue was closed when the patch that fixed how timeouts are reported landed.

      When this happens, client access hangs and the MDS appears completely idle.
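
      For context on why the MDS appears idle while this happens: the enqueueing service thread simply sleeps in a completion wait until the conflicting lock goes away, and when the wait times out it logs a warning and goes back to sleep rather than giving up. The following is a minimal userspace sketch of that wait pattern; it is an illustration only (pthread-based, with made-up names), not the actual ldlm_completion_ast()/ldlm_expired_completion_wait() code.

      /* Conceptual model only -- NOT Lustre code.  A "service thread" waits
       * for a lock grant; each time the timed wait expires it complains and
       * keeps waiting, which is why a stuck MDS looks idle: the threads are
       * all asleep, not spinning. */
      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>
      #include <time.h>
      #include <unistd.h>

      struct lock_req {
          pthread_mutex_t mutex;
          pthread_cond_t  granted_cv;
          bool            granted;    /* set when the conflicting lock is dropped */
      };

      /* Rough analogue of ldlm_completion_ast(): sleep until granted. */
      static void completion_wait(struct lock_req *req, int timeout_sec)
      {
          struct timespec deadline;

          pthread_mutex_lock(&req->mutex);
          while (!req->granted) {
              clock_gettime(CLOCK_REALTIME, &deadline);
              deadline.tv_sec += timeout_sec;
              if (pthread_cond_timedwait(&req->granted_cv, &req->mutex,
                                         &deadline) != 0 && !req->granted)
                  /* analogue of ldlm_expired_completion_wait(): warn, retry */
                  fprintf(stderr, "lock wait expired, still waiting\n");
          }
          pthread_mutex_unlock(&req->mutex);
      }

      /* The other side: whoever held the conflicting lock finally drops it. */
      static void *grant_after_delay(void *arg)
      {
          struct lock_req *req = arg;

          sleep(3);
          pthread_mutex_lock(&req->mutex);
          req->granted = true;
          pthread_cond_signal(&req->granted_cv);
          pthread_mutex_unlock(&req->mutex);
          return NULL;
      }

      int main(void)
      {
          struct lock_req req = {
              .mutex      = PTHREAD_MUTEX_INITIALIZER,
              .granted_cv = PTHREAD_COND_INITIALIZER,
              .granted    = false,
          };
          pthread_t granter;

          pthread_create(&granter, NULL, grant_after_delay, &req);
          completion_wait(&req, 1);   /* logs "expired" a couple of times */
          pthread_join(granter, NULL);
          printf("lock granted, thread resumes\n");
          return 0;
      }

      In the hangs reported here the grant never arrives, so every such thread stays asleep indefinitely, which matches both the idle-looking MDS and the hung client access.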

    Attachments

    Issue Links

    Activity

            [LU-5497] Many MDS service threads blocked in ldlm_completion_ast()
            simmonsja James A Simmons made changes -
            Resolution: Won't Fix
            Status: Open → Closed

            simmonsja James A Simmons added a comment -

            Old blocker for unsupported version
            pjones Peter Jones made changes -
            Start date: 15/Aug/14
            End date: 13/Jan/16
            vinayakh Vinayak (Inactive) added a comment (edited) -

            We have faced this issue in the 2.1.0 build. There are many hung threads in ldlm_completion_ast. Is there any workaround for this?

            Oleg, can you please point to the patch that fixes this problem?

            PID: 240529  TASK: ffff8810288aab40  CPU: 8   COMMAND: "mdt_46"
            #0 [ffff880fc6d1b860] schedule at ffffffff814ea122
            #1 [ffff880fc6d1b928] cfs_waitq_wait at ffffffffa041c69e [libcfs]
            #2 [ffff880fc6d1b938] ldlm_completion_ast at ffffffffa0709722 [ptlrpc]
            #3 [ffff880fc6d1b9e8] ldlm_cli_enqueue_local at ffffffffa0708e11 [ptlrpc]
            #4 [ffff880fc6d1ba78] mdt_object_lock at ffffffffa0bf2a8d [mdt]
            #5 [ffff880fc6d1bb18] mdt_getattr_name_lock at ffffffffa0bff710 [mdt]
            #6 [ffff880fc6d1bbb8] mdt_intent_getattr at ffffffffa0c005cd [mdt]
            #7 [ffff880fc6d1bc08] mdt_intent_policy at ffffffffa0c01679 [mdt]
            #8 [ffff880fc6d1bc48] ldlm_lock_enqueue at ffffffffa06ea1f9 [ptlrpc]
            #9 [ffff880fc6d1bca8] ldlm_handle_enqueue0 at ffffffffa07120ef [ptlrpc]
            #10 [ffff880fc6d1bd18] mdt_enqueue at ffffffffa0c01a16 [mdt]
            #11 [ffff880fc6d1bd38] mdt_handle_common at ffffffffa0bf4ffa [mdt]
            #12 [ffff880fc6d1bd88] mdt_regular_handle at ffffffffa0bf5eb5 [mdt]
            #13 [ffff880fc6d1bd98] ptlrpc_main at ffffffffa0742e0e [ptlrpc]
            #14 [ffff880fc6d1bee8] kthread at ffffffff81090806
            #15 [ffff880fc6d1bf48] kernel_thread at ffffffff8100c14a
            

            mhanafi Mahmoud Hanafi added a comment -

            Looks like we have hit this issue 3 times in the past 48 hours. Lots of threads are stuck in ldlm_completion_ast. We are running 2.4.3. It is not clear from the conversation above which patch is needed.

            LNet: Service thread pid 4390 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            Pid: 4390, comm: mdt00_094
            
            Call Trace:
             [<ffffffffa07a58b5>] ? _ldlm_lock_debug+0x2d5/0x660 [ptlrpc]
             [<ffffffff815404c2>] schedule_timeout+0x192/0x2e0
             [<ffffffff81080610>] ? process_timeout+0x0/0x10
             [<ffffffffa050d6d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
             [<ffffffffa07c9fed>] ldlm_completion_ast+0x4ed/0x960 [ptlrpc]
             [<ffffffffa07c5780>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
             [<ffffffff81063be0>] ? default_wake_function+0x0/0x20
             [<ffffffffa07c9728>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
             [<ffffffffa07c9b00>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
             [<ffffffffa0da1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
             [<ffffffffa0da7cbb>] mdt_object_lock0+0x33b/0xaf0 [mdt]
             [<ffffffffa0da1a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
             [<ffffffffa07c9b00>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
             [<ffffffffa0da8534>] mdt_object_lock+0x14/0x20 [mdt]
             [<ffffffffa0db77a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
             [<ffffffffa07f2125>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
             [<ffffffffa081a636>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
             [<ffffffffa07f43b4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
             [<ffffffffa0db85b2>] mdt_intent_getattr+0x2a2/0x4b0 [mdt]
             [<ffffffffa0da4f3e>] mdt_intent_policy+0x39e/0x720 [mdt]
             [<ffffffffa07aa831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
             [<ffffffffa07d11bf>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
             [<ffffffffa0da53c6>] mdt_enqueue+0x46/0xe0 [mdt]
             [<ffffffffa0dabad7>] mdt_handle_common+0x647/0x16d0 [mdt]
             [<ffffffffa0de58f5>] mds_regular_handle+0x15/0x20 [mdt]
             [<ffffffffa08033b8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
             [<ffffffffa050d5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
             [<ffffffffa051ed5f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
             [<ffffffffa07fa719>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
             [<ffffffff81055813>] ? __wake_up+0x53/0x70
             [<ffffffffa080474e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
             [<ffffffffa0803c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
             [<ffffffff8100c0ca>] child_rip+0xa/0x20
             [<ffffffffa0803c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
             [<ffffffffa0803c80>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
             [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
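
            For reference, the "Service thread pid ... was inactive for 200.00s" line above comes from a per-thread watchdog: each service thread touches its watchdog before handling a request, and a checker reports (and dumps the stack of) any thread whose last touch is older than the timeout. Below is a minimal userspace model of that mechanism; it is illustrative only and not the libcfs lc_watchdog implementation.

            /* Conceptual model only -- NOT the libcfs watchdog code. */
            #include <pthread.h>
            #include <stdio.h>
            #include <time.h>
            #include <unistd.h>

            struct watchdog {
                pthread_mutex_t lock;
                time_t          last_touch;   /* updated by the service thread */
                int             pid;
            };

            /* Analogue of lc_watchdog_touch(): called before each request. */
            static void watchdog_touch(struct watchdog *wd)
            {
                pthread_mutex_lock(&wd->lock);
                wd->last_touch = time(NULL);
                pthread_mutex_unlock(&wd->lock);
            }

            /* Checker: warn about a thread that has been inactive too long.
             * The real code would also dump that thread's stack trace. */
            static void watchdog_check(struct watchdog *wd, double timeout_sec)
            {
                double idle;

                pthread_mutex_lock(&wd->lock);
                idle = difftime(time(NULL), wd->last_touch);
                pthread_mutex_unlock(&wd->lock);

                if (idle >= timeout_sec)
                    printf("Service thread pid %d was inactive for %.2fs\n",
                           wd->pid, idle);
            }

            int main(void)
            {
                struct watchdog wd = {
                    .lock = PTHREAD_MUTEX_INITIALIZER,
                    .pid  = 4390,          /* pid from the log line above */
                };

                watchdog_touch(&wd);       /* thread picks up a request */
                sleep(3);                  /* ...then blocks in the lock wait */
                watchdog_check(&wd, 2.0);  /* demo timeout; Lustre's here is 200s */
                return 0;
            }

            On a healthy server the touch happens on every request, so the checker stays quiet; in the hangs reported in this ticket the threads never return from ldlm_completion_ast(), so the watchdog keeps firing and dumping traces like the one above.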
            

            morrone Christopher Morrone (Inactive) added a comment -

            We ran Patch Set 9 of change 9488 as part of 2.4.2, but we dropped it in favor of whatever landed on b2_5, now that we are based on 2.5.3.


            simmonsja James A Simmons added a comment -

            We have avoided this problem by going to 2.5 for most of our systems. The patch from LU-4584 should help with this, but if I remember right, Chris reported issues in LU-5525 due to that patch.


            paf Patrick Farrell (Inactive) added a comment -

            Oleg - LU-4584 is... long. Is the patch you're referring to this one: http://review.whamcloud.com/#/c/9488/ ?

            Also, Chris, James, et al. - Any further updates on this? Have the patches been tried? Are you still seeing the problems? Etc.

            green Oleg Drokin added a comment -

            Was this with LU-4584 in the original form?
            There's not too much data here, but I observed similar lockups in my testing using your chaos tree initially.

            Using http://review.whamcloud.com/#/c/6511/ + the latest form of the LU-4584 patch has a high chance of eliminating this problem too, I think.


            morrone Christopher Morrone (Inactive) added a comment -

            We are using LU-4584 with 2.4.


            People

              Assignee: green Oleg Drokin
              Reporter: nedbass Ned Bass (Inactive)
              Votes: 0
              Watchers: 13
