Details

    • Type: Improvement
    • Resolution: Won't Fix
    • Priority: Minor
    • None
    • Affects Version/s: Lustre 1.8.7
    • None
    • Bugzilla ID: 24,450
    • 9740

    Description

      The goal of this ticket is to land the following patches from Vladimir Saveliev (Oracle) into the WC 1.8 branch:

      https://bugzilla.lustre.org/attachment.cgi?id=33145
      https://bugzilla.lustre.org/attachment.cgi?id=33137
      https://bugzilla.lustre.org/attachment.cgi?id=33106
      https://bugzilla.lustre.org/attachment.cgi?id=33099

      (Details of these patches are in https://bugzilla.lustre.org/show_bug.cgi?id=24450.)

      Summary of the patches:

      ldlm_run_bl_ast_work() sends ASTs in sets of PARALLEL_AST_LIMIT
      requests, waits for the whole set to complete, and only then sends
      another set of requests and waits again. If at least one request in
      each set times out, the timeouts are serialized.
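
      As a rough illustration of the cost (a back-of-the-envelope sketch
      only, not Lustre code; the request count, the 100 s timeout, and the
      PARALLEL_AST_LIMIT value of 1024 are assumptions for the example):

          /* Illustrative only: with fixed batches, one request running to
           * its timeout in every batch stalls the sender for one full
           * timeout per batch, so the delays add up serially. */
          #include <stdio.h>

          #define PARALLEL_AST_LIMIT 1024        /* assumed batch size   */

          int main(void)
          {
                  int nrequests = 8 * PARALLEL_AST_LIMIT; /* hypothetical load */
                  int timeout   = 100;                    /* seconds, assumed  */
                  int nbatches  = nrequests / PARALLEL_AST_LIMIT;

                  /* One slow request per batch => nbatches full timeouts. */
                  printf("worst case: %d batches x %d s = %d s of waiting\n",
                         nbatches, timeout, nbatches * timeout);
                  return 0;
          }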

      This patch changes ldlm_run_bl_ast_work() so that, having sent one
      set, it waits for any of its requests to complete and refills the
      running set with requests that are yet to be sent. When the number
      of timing-out requests is smaller than PARALLEL_AST_LIMIT, this
      should eliminate the possibility of timeout serialization.

      The patch relies on the ability to specify a wait condition for
      ptlrpc_set_wait().
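
      A minimal sketch of the sliding-window idea described above
      (illustrative only, not the actual ldlm_run_bl_ast_work()/ptlrpc
      code; struct ast_req, send_one(), wait_any(), and send_all_asts()
      are hypothetical stand-ins for the ptlrpc request-set machinery):

          #include <stddef.h>

          #define PARALLEL_AST_LIMIT 1024        /* assumed window size  */

          struct ast_req;                        /* opaque request handle */
          void send_one(struct ast_req *req);    /* hypothetical: fire one AST */
          /* hypothetical: block until any request in the set completes */
          struct ast_req *wait_any(struct ast_req **set, size_t n);

          void send_all_asts(struct ast_req **pending, size_t npending)
          {
                  struct ast_req *inflight[PARALLEL_AST_LIMIT];
                  size_t next = 0, n = 0;

                  /* Fill the initial window. */
                  while (n < PARALLEL_AST_LIMIT && next < npending)
                          send_one(inflight[n++] = pending[next++]);

                  /* Each time any request completes, refill its slot with
                   * the next unsent request instead of draining the whole
                   * set, so one slow request no longer idles the other
                   * PARALLEL_AST_LIMIT - 1 slots. */
                  while (n > 0) {
                          struct ast_req *done = wait_any(inflight, n);
                          size_t i;

                          for (i = 0; i < n; i++)
                                  if (inflight[i] == done)
                                          break;
                          inflight[i] = inflight[--n];   /* drop 'done' */

                          if (next < npending)
                                  send_one(inflight[n++] = pending[next++]);
                  }
          }

      In the patched code, this "wait for any request" step is what the
      new wait condition passed to ptlrpc_set_wait() provides.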

    Attachments

    Issue Links

    Activity

    [LU-1269] speed up ASTs sending

            jay Jinshan Xiong (Inactive) added a comment -

            close old tickets

            nrutman Nathan Rutman added a comment -

            Xyratex-bug-id: MRP-478

            spitzcor Cory Spitz added a comment -

            Thanks, Jinshan. Change #2650/LU-1373 does look interesting.

            BTW, http://jira.whamcloud.com/browse/LU-571, http://review.whamcloud.com/#change,1190, and http://review.whamcloud.com/#change,1608 are a few handy links for master commit 0bd27be7f20a671e7128f341a070838a2bd318dc.


            jay Jinshan Xiong (Inactive) added a comment -

            the hash # in master is: 0bd27be7f20a671e7128f341a070838a2bd318dc

            and johann is working on an improvement at http://review.whamcloud.com/2650, which you might be interested in.

            spitzcor Cory Spitz added a comment -

            Also, it might be worthwhile to hear from Johann. I had a conversation with him and he suggested that b1_8 might be better off simply by removing the PARALLEL_AST_LIMIT. Cray has been using the patches listed in the description from bz 24450. I'm not sure what the correct approach should be for b1_8 though.


            adilger Andreas Dilger added a comment -

            Reopening issue due to problem reports hit on 1.8.

            Jinshan, can you please find the patch set for master that resolved this problem? I believe it was one of the early patches in the Imperative Recovery feature.

            ihara Shuichi Ihara (Inactive) added a comment -

            Hi, we also hit a very similar problem on lustre-1.8.7-wc1, and the MDS hung.

            Apr 23 15:58:34 ALPL505 kernel: Call Trace:
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88953a00>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88955542>] ldlm_completion_ast+0x4c2/0x880 [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8893a709>] ldlm_lock_enqueue+0x9d9/0xb20 [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8008e421>] default_wake_function+0x0/0xe
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88935b6a>] ldlm_lock_addref_internal_nolock+0x3a/0x90 [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff889540bb>] ldlm_cli_enqueue_local+0x46b/0x520 [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88caa157>] enqueue_ordered_locks+0x387/0x4d0 [mds]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff889519a0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88955080>] ldlm_completion_ast+0x0/0x880 [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88caa8e9>] mds_get_parent_child_locked+0x649/0x960 [mds]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88c9b652>] mds_getattr_lock+0x632/0xc90 [mds]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88c96dda>] fixup_handle_for_resent_req+0x5a/0x2c0 [mds]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88ca1d83>] mds_intent_policy+0x623/0xc20 [mds]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8893c270>] ldlm_resource_putref_internal+0x230/0x460 [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88939eb6>] ldlm_lock_enqueue+0x186/0xb20 [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff889367fd>] ldlm_lock_create+0x9bd/0x9f0 [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8895e870>] ldlm_server_blocking_ast+0x0/0x83d [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8895bb39>] ldlm_handle_enqueue+0xc09/0x1210 [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88ca0b30>] mds_handle+0x40e0/0x4d10 [mds]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff800774ed>] smp_send_reschedule+0x4e/0x53
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8008ddcd>] enqueue_task+0x41/0x56
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8897fd55>] lustre_msg_get_conn_cnt+0x35/0xf0 [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff889896d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88989e35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8008c85d>] __wake_up_common+0x3e/0x68
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8898adc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff88989e60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
            Apr 23 15:58:34 ALPL505 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
            

            iurii Iurii Golovach (Inactive) added a comment -

            Peter, this ticket is about landing the bugfixes that are under review at http://review.whamcloud.com/#change,2406

            It's NOT about back-porting.

            Please don't close this ticket as "Won't Fix", since these fixes need to land in the 1.8 branch.

            Thank you,
            Iurii

            pjones Peter Jones added a comment -

            No, there are no plans to backport new features to b1_8. We are landing only bugfixes into b1_8; new feature development is limited to master.


            igolovach Iurii Golovach (Inactive) added a comment -

            Andreas, Jinshan, do you mean that there is a plan to port your changes with this functionality from 2.2 into 1.8? If so, let me know the ticket where this is tracked and we can close this one.

            People

              Assignee:
              jay Jinshan Xiong (Inactive)
              Reporter:
              igolovach Iurii Golovach (Inactive)
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: