[LU-12704] racer test_1: Invalid layout: The component end must be aligned by the stripe size

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.14.0
    • Affects Version: Lustre 2.13.0

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/4c310714-c728-11e9-9fc9-52540065bddc

      test_1 failed with the following error:

      layout: raid0 raid0 pfl pfl pfl dom dom dom flr flr flr
      Invalid layout: The component end must be aligned by the stripe size
      


      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      racer test_1 - Timeout occurred after 833 mins, last suite running was racer, restarting cluster to continue tests

    Attachments

    Issue Links

    Activity
            adilger Andreas Dilger made changes -
            Link New: This issue is duplicated by LU-13928 [ LU-13928 ]
            jamesanunez James Nunez (Inactive) made changes -
            Remote Link New: This issue links to "Page (Whamcloud Community Wiki)" [ 24710 ]
            jamesanunez James Nunez (Inactive) made changes -
            Remote Link New: This issue links to "Page (Whamcloud Community Wiki)" [ 24277 ]
            pjones Peter Jones made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Landed for 2.14


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36368/
            Subject: LU-12704 lov: check all entries in lov_flush_composite
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 44460570fd21a91002190c8a0620923125135b52

            jamesanunez James Nunez (Inactive) made changes -
            Remote Link New: This issue links to "Page (Whamcloud Community Wiki)" [ 24243 ]
            jamesanunez James Nunez (Inactive) made changes -
            Remote Link New: This issue links to "Page (Whamcloud Community Wiki)" [ 24221 ]
            vsaveliev Vladimir Saveliev added a comment - - edited

            (defect) I think that ignore_layout shouldn't be used here, as it bypasses layout locking in LOV. As mentioned in lov_io_init(), it is usually used along with CIT_MISC from OSC, because the OSC object already pins the layout. In our case we want cl_object_flush() to be protected from layout changes, so ci_ignore_layout should not be set.

            Mike, it looks like we are in trouble with this patch: both io->ci_ignore_layout = 0; and io->ci_ignore_layout = 1; lead to a problem. Any idea?

            vsaveliev Vladimir Saveliev added a comment -

            +»       io->ci_ignore_layout = 1;
            

            (defect) I think that ignore_layout shouldn't be used here, as it bypasses layout locking in LOV. As mentioned in lov_io_init(), it is usually used along with CIT_MISC from OSC, because the OSC object already pins the layout. In our case we want cl_object_flush() to be protected from layout changes, so ci_ignore_layout should not be set.

            Mike, yes, you are right: io->ci_ignore_layout set to 1 leads to a race between layout change and cl_io_init(). Something like

            00020000:00040000:0.0:1571007482.482526:0:28488:0:(lov_io.c:318:lov_io_mirror_init()) ASSERTION( comp->lo_preferred_mirror == 0 ) failed:
            

            has been seen a few times.

            However, there is also a problem with io->ci_ignore_layout set to 0.

            Namely, the lockup below has been observed:

            [<ffffffffc096f065>] ldlm_completion_ast+0x4e5/0x860 [ptlrpc]
            [<ffffffffc0970e2a>] ldlm_cli_enqueue_fini+0x63a/0xef0 [ptlrpc]
            [<ffffffffc0973b71>] ldlm_cli_enqueue+0x451/0xa60 [ptlrpc]
            [<ffffffffc0ba7730>] mdc_enqueue_base+0x330/0x1c40 [mdc]
            [<ffffffffc0ba9a85>] mdc_intent_lock+0x135/0x560 [mdc]
            [<ffffffffc0be6742>] lmv_intent_lock+0x402/0xa20 [lmv]
            [<ffffffffc0c0eb1d>] ll_layout_intent+0x1dd/0x720 [lustre]
            [<ffffffffc0c1fa6c>] ll_layout_refresh+0x30c/0x900 [lustre]
            [<ffffffffc0c62ea7>] vvp_io_init+0x347/0x460 [lustre]
            [<ffffffffc076bf4b>] cl_io_init0.isra.15+0x8b/0x160 [obdclass]
            [<ffffffffc076c0e3>] cl_io_init+0x43/0x80 [obdclass]
            [<ffffffffc0c41fe5>] ll_lock_cancel_bits+0x625/0xca0 [lustre]
            [<ffffffffc0c42a5c>] ll_md_blocking_ast+0x24c/0x2b0 [lustre]
            [<ffffffffc09626ba>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
            [<ffffffffc096e311>] ldlm_cli_cancel_local+0xd1/0x420 [ptlrpc]
            [<ffffffffc097294a>] ldlm_cli_cancel_list_local+0xea/0x280 [ptlrpc]
            [<ffffffffc0972c6b>] ldlm_cancel_resource_local+0x18b/0x2a0 [ptlrpc]
            [<ffffffffc0ba02ac>] mdc_resource_get_unused_res+0x10c/0x250 [mdc]
            [<ffffffffc0bb1057>] mdc_enqueue_send+0x557/0x710 [mdc]
            [<ffffffffc0bb14b2>] mdc_lock_enqueue+0x2a2/0x6f2 [mdc]
            [<ffffffffc07696d5>] cl_lock_enqueue+0x65/0x120 [obdclass]
            [<ffffffffc0ab51e5>] lov_lock_enqueue+0x95/0x150 [lov]
            [<ffffffffc07696d5>] cl_lock_enqueue+0x65/0x120 [obdclass]
            [<ffffffffc0769c67>] cl_lock_request+0x67/0x1f0 [obdclass]
            [<ffffffffc076d9cb>] cl_io_lock+0x2bb/0x3d0 [obdclass]
            [<ffffffffc076dcfa>] cl_io_loop+0xba/0x1c0 [obdclass]
            [<ffffffffc0c5981f>] cl_setattr_ost+0x25f/0x3d0 [lustre]
            [<ffffffffc0c34b28>] ll_setattr_raw+0xcc8/0x1060 [lustre]
            [<ffffffffc0c34f23>] ll_setattr+0x63/0xc0 [lustre]
            [<ffffffffbc260524>] notify_change+0x2c4/0x420
            [<ffffffffbc23f335>] do_truncate+0x75/0xc0
            

            mdc_enqueue_send() tries to do an early cancel; ll_lock_cancel_bits()->ll_dom_lock_cancel() initializes cl_io->ci_ignore_layout to 0, so vvp_io_init() calls ll_layout_refresh(), which takes lli->lli_layout_mutex and sends an enqueue RPC. The server then sends a blocking AST back to the client, so another ll_lock_cancel_bits() gets to run and gets stuck trying to lock lli->lli_layout_mutex:

            int ll_layout_refresh(struct inode *inode, __u32 *gen)
            {
            ...
                mutex_lock(&lli->lli_layout_mutex);
            ...
            
            [<ffffffffc0c1f94e>] ll_layout_refresh+0x1ee/0x900 [lustre]
            [<ffffffffc0c62ea7>] vvp_io_init+0x347/0x460 [lustre]
            [<ffffffffc076bf4b>] cl_io_init0.isra.15+0x8b/0x160 [obdclass]
            [<ffffffffc076c0e3>] cl_io_init+0x43/0x80 [obdclass]
            [<ffffffffc0c41fe5>] ll_lock_cancel_bits+0x625/0xca0 [lustre]
            [<ffffffffc0c42a99>] ll_md_blocking_ast+0x289/0x2b0 [lustre]
            [<ffffffffc0978bdd>] ldlm_handle_bl_callback+0xed/0x4e0 [ptlrpc]
            [<ffffffffc09797d0>] ldlm_bl_thread_main+0x800/0xa40 [ptlrpc]
            [<ffffffffbc0c1c71>] kthread+0xd1/0xe0
            

            People

              Assignee: Vladimir Saveliev
              Reporter: Maloo
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: