[LU-5497] Many MDS service threads blocked in ldlm_completion_ast()

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Blocker
    • None
    • Affects Version: Lustre 2.4.2
    • 3
    • 15338

    Description

      Our production MDS systems occasionally hang with many service threads blocked in ldlm_completion_ast(). The details were described in LU-4579, but that issue was closed when the patch that fixed how timeouts are reported landed.

      When this happens, client access hangs and the MDS appears completely idle.

          Activity

            simmonsja James A Simmons added a comment -

            We have avoided this problem by going to 2.5 for most of our systems. The patch LU-4584 should help with this but if I remember right Chris reported issues at LU-5525 due to that patch.

            paf Patrick Farrell (Inactive) added a comment -

            Oleg - LU-4584 is... long. Is the patch you're referring to this one: http://review.whamcloud.com/#/c/9488/ ?

            Also, Chris, James, et al. - Any further updates on this? Have the patches been tried? Are you still seeing the problems? Etc.
            green Oleg Drokin added a comment -

            Was this with the LU-4584 patch in its original form?
            There's not too much data here, but I observed similar lockups in my testing using your chaos tree initially.

            Using http://review.whamcloud.com/#/c/6511/ + the latest form of the LU-4584 patch has a high chance of eliminating this problem too, I think.

            morrone Christopher Morrone (Inactive) added a comment -

            We are using LU-4584 with 2.4.

            simmonsja James A Simmons added a comment -

            Is Livermore running 2.4 with LU-2827 or LU-4584?
            green Oleg Drokin added a comment -

            This looks like one of LU-2827 fallouts.
            pjones Peter Jones added a comment -

            Oleg's initial reaction to this ticket was that this was LU-4584 (which you are evaluating a fix for).

            Oleg would you care to elaborate?

            nedbass Ned Bass (Inactive) added a comment -

            Example stack trace.

            2014-02-03 11:49:01 LustreError: dumping log to /tmp/lustre-log.1391456941.15242
            2014-02-03 11:49:02 Pid: 14993, comm: mdt00_011
            2014-02-03 11:49:02
            2014-02-03 11:49:02 Call Trace:
            2014-02-03 11:49:02  [<ffffffffa05cc341>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            2014-02-03 11:49:02  [<ffffffffa05bc77e>] cfs_waitq_wait+0xe/0x10 [libcfs]
            2014-02-03 11:49:02  [<ffffffffa08914ca>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
            2014-02-03 11:49:02  [<ffffffffa088cb60>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
            2014-02-03 11:49:02  [<ffffffff81063ba0>] ? default_wake_function+0x0/0x20
            2014-02-03 11:49:02  [<ffffffffa0890b78>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
            2014-02-03 11:49:02  [<ffffffffa0890f50>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
            2014-02-03 11:49:02  [<ffffffffa0e42a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            2014-02-03 11:49:02  [<ffffffffa0e48c7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
            2014-02-03 11:49:02  [<ffffffffa0e42a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
            2014-02-03 11:49:02  [<ffffffffa0890f50>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
            2014-02-03 11:49:02  [<ffffffffa0e494f4>] mdt_object_lock+0x14/0x20 [mdt]
            2014-02-03 11:49:02  [<ffffffffa0e586d9>] mdt_getattr_name_lock+0xe09/0x1960 [mdt]
            2014-02-03 11:49:02  [<ffffffffa08b9cf5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
            2014-02-03 11:49:02  [<ffffffffa08e2846>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
            2014-02-03 11:49:02  [<ffffffffa08bbf84>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
            2014-02-03 11:49:02  [<ffffffffa0e594cd>] mdt_intent_getattr+0x29d/0x490 [mdt]
            2014-02-03 11:49:02  [<ffffffffa0e45f1e>] mdt_intent_policy+0x39e/0x720 [mdt]
            2014-02-03 11:49:02  [<ffffffffa08718b1>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
            2014-02-03 11:49:02  [<ffffffffa089a9df>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
            2014-02-03 11:49:02  [<ffffffffa0e463a6>] mdt_enqueue+0x46/0xe0 [mdt]
            2014-02-03 11:49:02  [<ffffffffa0e4d758>] mdt_handle_common+0x648/0x1660 [mdt]
            2014-02-03 11:49:02  [<ffffffffa0e86405>] mds_regular_handle+0x15/0x20 [mdt]
            2014-02-03 11:49:02  [<ffffffffa08caf88>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
            2014-02-03 11:49:02  [<ffffffffa05bc63e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
            2014-02-03 11:49:02  [<ffffffffa05cde0f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
            2014-02-03 11:49:02  [<ffffffffa08c22e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
            2014-02-03 11:49:02  [<ffffffffa08cc31e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
            2014-02-03 11:49:02  [<ffffffffa08cb850>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
            2014-02-03 11:49:02  [<ffffffff8100c10a>] child_rip+0xa/0x20
            2014-02-03 11:49:02  [<ffffffffa08cb850>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
            2014-02-03 11:49:02  [<ffffffffa08cb850>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
            2014-02-03 11:49:02  [<ffffffff8100c100>] ? child_rip+0x0/0x20
            2014-02-03 11:49:02
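
            For anyone reading traces like the one in this comment: in Linux kernel stack dumps, frames prefixed with "?" are addresses found on the stack that the unwinder could not confirm as part of the active call chain. Filtering them out makes the blocking path (mdt_enqueue down to ldlm_completion_ast) easier to see. The following is only a rough Python sketch of that filtering; the regular expression, the function name reliable_frames, and the "kernel" placeholder for frames without a module tag are assumptions made for illustration and are not part of any Lustre tooling.

```python
import re

# Assumed frame layout, based on the trace in this ticket:
#   "<date> <time>  [<address>] [? ]symbol+0xoff/0xsize [module]"
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def reliable_frames(trace_text):
    """Return (symbol, module) pairs for frames not marked with '?'."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match and not match.group("unreliable"):
            # Frames without a [module] tag belong to the core kernel image;
            # "kernel" is a placeholder label chosen for this sketch.
            frames.append((match.group("symbol"), match.group("module") or "kernel"))
    return frames


# A few lines copied from the trace in this ticket.
sample = """\
2014-02-03 11:49:02  [<ffffffffa05cc341>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
2014-02-03 11:49:02  [<ffffffffa05bc77e>] cfs_waitq_wait+0xe/0x10 [libcfs]
2014-02-03 11:49:02  [<ffffffffa08914ca>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
2014-02-03 11:49:02  [<ffffffffa088cb60>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
2014-02-03 11:49:02  [<ffffffff81063ba0>] ? default_wake_function+0x0/0x20
2014-02-03 11:49:02  [<ffffffffa0890b78>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
2014-02-03 11:49:02  [<ffffffff8100c10a>] child_rip+0xa/0x20
"""

for symbol, module in reliable_frames(sample):
    print(f"{symbol} [{module}]")
```

            Run against a full console log, the same function could be applied per "Pid:" block to count how many service threads have ldlm_completion_ast among their confirmed frames.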

            People

              green Oleg Drokin
              nedbass Ned Bass (Inactive)