[LU-12704] racer test_1: Invalid layout: The component end must be aligned by the stripe size

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.14.0
    • Affects Version: Lustre 2.13.0

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/4c310714-c728-11e9-9fc9-52540065bddc

      test_1 failed with the following error:

      layout: raid0 raid0 pfl pfl pfl dom dom dom flr flr flr
      Invalid layout: The component end must be aligned by the stripe size
      


      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      racer test_1 - Timeout occurred after 833 mins, last suite running was racer, restarting cluster to continue tests

    Attachments

    Issue Links

    Activity
            adilger Andreas Dilger made changes -
            Link New: This issue is duplicated by LU-13928 [ LU-13928 ]
            jamesanunez James Nunez (Inactive) made changes -
            Remote Link New: This issue links to "Page (Whamcloud Community Wiki)" [ 24710 ]
            jamesanunez James Nunez (Inactive) made changes -
            Remote Link New: This issue links to "Page (Whamcloud Community Wiki)" [ 24277 ]
            pjones Peter Jones made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Landed for 2.14


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36368/
            Subject: LU-12704 lov: check all entries in lov_flush_composite
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 44460570fd21a91002190c8a0620923125135b52

            jamesanunez James Nunez (Inactive) made changes -
            Remote Link New: This issue links to "Page (Whamcloud Community Wiki)" [ 24243 ]
            jamesanunez James Nunez (Inactive) made changes -
            Remote Link New: This issue links to "Page (Whamcloud Community Wiki)" [ 24221 ]
            vsaveliev Vladimir Saveliev added a comment - - edited

            (defect) I think that ignore_layout shouldn't be used here, as it bypasses layout locking in LOV. As mentioned in lov_io_init(), it is usually used along with CIT_MISC from OSC, because the OSC object already pins the layout. In our case we want cl_object_flush() to be protected from layout changes, so ci_ignore_layout should not be set.

            Mike, it looks like we are in trouble with this patch: both io->ci_ignore_layout = 0; and io->ci_ignore_layout = 1; lead to a problem. Any idea?

            vsaveliev Vladimir Saveliev added a comment -

            +»       io->ci_ignore_layout = 1;
            

            (defect) I think that ignore_layout shouldn't be used here, as it bypasses layout locking in LOV. As mentioned in lov_io_init(), it is usually used along with CIT_MISC from OSC, because the OSC object already pins the layout. In our case we want cl_object_flush() to be protected from layout changes, so ci_ignore_layout should not be set.

            Mike, yes, you are right: io->ci_ignore_layout set to 1 leads to a race between layout change and cl_io_init(). Something like

            00020000:00040000:0.0:1571007482.482526:0:28488:0:(lov_io.c:318:lov_io_mirror_init()) ASSERTION( comp->lo_preferred_mirror == 0 ) failed:
            

            has been seen a few times.

            However, there is also a problem with io->ci_ignore_layout set to 0.

            Namely, the lockup below has been observed:

            [<ffffffffc096f065>] ldlm_completion_ast+0x4e5/0x860 [ptlrpc]
            [<ffffffffc0970e2a>] ldlm_cli_enqueue_fini+0x63a/0xef0 [ptlrpc]
            [<ffffffffc0973b71>] ldlm_cli_enqueue+0x451/0xa60 [ptlrpc]
            [<ffffffffc0ba7730>] mdc_enqueue_base+0x330/0x1c40 [mdc]
            [<ffffffffc0ba9a85>] mdc_intent_lock+0x135/0x560 [mdc]
            [<ffffffffc0be6742>] lmv_intent_lock+0x402/0xa20 [lmv]
            [<ffffffffc0c0eb1d>] ll_layout_intent+0x1dd/0x720 [lustre]
            [<ffffffffc0c1fa6c>] ll_layout_refresh+0x30c/0x900 [lustre]
            [<ffffffffc0c62ea7>] vvp_io_init+0x347/0x460 [lustre]
            [<ffffffffc076bf4b>] cl_io_init0.isra.15+0x8b/0x160 [obdclass]
            [<ffffffffc076c0e3>] cl_io_init+0x43/0x80 [obdclass]
            [<ffffffffc0c41fe5>] ll_lock_cancel_bits+0x625/0xca0 [lustre]
            [<ffffffffc0c42a5c>] ll_md_blocking_ast+0x24c/0x2b0 [lustre]
            [<ffffffffc09626ba>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
            [<ffffffffc096e311>] ldlm_cli_cancel_local+0xd1/0x420 [ptlrpc]
            [<ffffffffc097294a>] ldlm_cli_cancel_list_local+0xea/0x280 [ptlrpc]
            [<ffffffffc0972c6b>] ldlm_cancel_resource_local+0x18b/0x2a0 [ptlrpc]
            [<ffffffffc0ba02ac>] mdc_resource_get_unused_res+0x10c/0x250 [mdc]
            [<ffffffffc0bb1057>] mdc_enqueue_send+0x557/0x710 [mdc]
            [<ffffffffc0bb14b2>] mdc_lock_enqueue+0x2a2/0x6f2 [mdc]
            [<ffffffffc07696d5>] cl_lock_enqueue+0x65/0x120 [obdclass]
            [<ffffffffc0ab51e5>] lov_lock_enqueue+0x95/0x150 [lov]
            [<ffffffffc07696d5>] cl_lock_enqueue+0x65/0x120 [obdclass]
            [<ffffffffc0769c67>] cl_lock_request+0x67/0x1f0 [obdclass]
            [<ffffffffc076d9cb>] cl_io_lock+0x2bb/0x3d0 [obdclass]
            [<ffffffffc076dcfa>] cl_io_loop+0xba/0x1c0 [obdclass]
            [<ffffffffc0c5981f>] cl_setattr_ost+0x25f/0x3d0 [lustre]
            [<ffffffffc0c34b28>] ll_setattr_raw+0xcc8/0x1060 [lustre]
            [<ffffffffc0c34f23>] ll_setattr+0x63/0xc0 [lustre]
            [<ffffffffbc260524>] notify_change+0x2c4/0x420
            [<ffffffffbc23f335>] do_truncate+0x75/0xc0
            

            mdc_enqueue_send() tries to do an early cancel; ll_lock_cancel_bits()->ll_dom_lock_cancel() initializes cl_io->ci_ignore_layout to 0, so vvp_io_init() calls ll_layout_refresh(), which takes lli->lli_layout_mutex and sends an enqueue RPC. The server then sends a blocking AST back to the client, so another ll_lock_cancel_bits() gets to run and gets stuck trying to lock lli->lli_layout_mutex:

            int ll_layout_refresh(struct inode *inode, __u32 *gen)
            {
            ...
                mutex_lock(&lli->lli_layout_mutex);
            ...
            
            [<ffffffffc0c1f94e>] ll_layout_refresh+0x1ee/0x900 [lustre]
            [<ffffffffc0c62ea7>] vvp_io_init+0x347/0x460 [lustre]
            [<ffffffffc076bf4b>] cl_io_init0.isra.15+0x8b/0x160 [obdclass]
            [<ffffffffc076c0e3>] cl_io_init+0x43/0x80 [obdclass]
            [<ffffffffc0c41fe5>] ll_lock_cancel_bits+0x625/0xca0 [lustre]
            [<ffffffffc0c42a99>] ll_md_blocking_ast+0x289/0x2b0 [lustre]
            [<ffffffffc0978bdd>] ldlm_handle_bl_callback+0xed/0x4e0 [ptlrpc]
            [<ffffffffc09797d0>] ldlm_bl_thread_main+0x800/0xa40 [ptlrpc]
            [<ffffffffbc0c1c71>] kthread+0xd1/0xe0
            

            People

              Assignee: Vladimir Saveliev
              Reporter: Maloo
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: