[LU-12704] racer test_1: Invalid layout: The component end must be aligned by the stripe size Created: 27/Aug/19  Updated: 23/Dec/20  Resolved: 12/Nov/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Vladimir Saveliev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-13928 racer/file_create.sh passing bad PFL ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/4c310714-c728-11e9-9fc9-52540065bddc

test_1 failed with the following error:

layout: raid0 raid0 pfl pfl pfl dom dom dom flr flr flr
Invalid layout: The component end must be aligned by the stripe size

<<Please provide additional information about the failure here>>

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
racer test_1 - Timeout occurred after 833 mins, last suite running was racer, restarting cluster to continue tests



 Comments   
Comment by Andreas Dilger [ 12/Sep/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36174
Subject: LU-12704 tests: component end must be multiple of stripesize
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 75b596e686209cd3dc7f4add49a7c57596c6f1f1

Comment by Andreas Dilger [ 12/Sep/19 ]

Note that on 2.10/2.12 (between commits v2_9_55_0-14-g89693927f0..v2_12_55-83-gff5eb304fa) it isn't allowed to create files with stripe_size that isn't a power-of-two value due to:

                       if (stripe_size == 0 ||
                           (prev_end != LUSTRE_EOF &&
                            (prev_end & (stripe_size - 1)))) {
                               CDEBUG(D_LAYOUT, "stripe size isn't aligned. "
                                      " stripe_sz: %u, [%llu, %llu)\n",
                                      stripe_size, ext->e_start, prev_end);
                               RETURN(-EINVAL);
                       }

This was fixed as part of SEL to check only that they are a multiple, not assuming a power-of-two value:

                if (prev_end % stripe_size) {
                        CDEBUG(D_LAYOUT, "stripe size isn't aligned, "
                               "stripe_sz: %u, [%llu, %llu)\n",
                               stripe_size, ext->e_start, prev_end);
                        RETURN(-EINVAL);
                }

so if this patch is backported to b2_12 we should include that part of the change, otherwise setstripe will still be failing with random 64KiB-multiple stripe_size.

Comment by James Nunez (Inactive) [ 25/Sep/19 ]

We're seeing racer test 1 crash with the 'Invalid layout' messages in the client test_log and the following call trace:

[50194.820231] 2[26110]: segfault at 0 ip           (null) sp 00007ffd1d570c28 error 14 in 2[400000+6000]
[50260.478242] 13[17503]: segfault at 8 ip 00007f5cedcb5718 sp 00007fff9f4f5d70 error 4 in ld-2.17.so[7f5cedcaa000+22000]
[50261.303734] 13[17172]: segfault at 8 ip 00007fc73cbf4718 sp 00007ffffb9ec790 error 4 in ld-2.17.so[7fc73cbe9000+22000]
[50280.191268] 16[9651]: segfault at 8 ip 00007f7cbdbdd718 sp 00007ffe4ab86420 error 4 in ld-2.17.so[7f7cbdbd2000+22000]
[50283.020027] 19[13428]: segfault at 8 ip 00007f9ce34c0718 sp 00007ffc51e417c0 error 4 in ld-2.17.so[7f9ce34b5000+22000]
[50297.598388] general protection fault: 0000 [#1] SMP 
[50297.599253] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) nfsd nfs_acl brd loop rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr virtio_balloon i2c_piix4 parport_pc parport ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk ata_piix libata 8139too crct10dif_pclmul crct10dif_common crc32c_intel
[50297.607808]  serio_raw 8139cp virtio_pci virtio_ring mii virtio floppy [last unloaded: libcfs]
[50297.608709] CPU: 1 PID: 16082 Comm: ldlm_bl_03 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7.x86_64 #1
[50297.609823] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[50297.610384] task: ffff95157aa35140 ti: ffff95155a928000 task.ti: ffff95155a928000
[50297.611111] RIP: 0010:[<ffffffffc082c368>]  [<ffffffffc082c368>] cl_object_flush+0x48/0x120 [obdclass]
[50297.612168] RSP: 0018:ffff95155a92bb20  EFLAGS: 00010292
[50297.612680] RAX: 5a5a5a5a5a5a5a5a RBX: 5a5a5a5a5a5a5a42 RCX: 0000000000000000
[50297.613366] RDX: ffff95154c051bd8 RSI: 0000000000000246 RDI: 0000000000000246
[50297.614050] RBP: ffff95155a92bb40 R08: 0000000000000000 R09: 0000000180180017
[50297.614735] R10: 0000000000000001 R11: ffffe84bc1bdc2c0 R12: ffff95156f70b430
[50297.615430] R13: ffff951549f8ac68 R14: ffff95155f7ae6c0 R15: 0000000000000000
[50297.616120] FS:  0000000000000000(0000) GS:ffff95157fd00000(0000) knlGS:0000000000000000
[50297.616899] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50297.617451] CR2: 00007fde9a099f34 CR3: 0000000078754000 CR4: 00000000000606e0
[50297.618146] Call Trace:
[50297.618436]  [<ffffffffc0cde4ec>] lov_flush_composite+0x17c/0x1d0 [lov]
[50297.619091]  [<ffffffffc0cdd2e2>] lov_object_flush+0x22/0x60 [lov]
[50297.619713]  [<ffffffffc082c383>] cl_object_flush+0x63/0x120 [obdclass]
[50297.620440]  [<ffffffffc0d5affe>] ll_lock_cancel_bits+0x3ce/0x9f0 [lustre]
[50297.621121]  [<ffffffffc0d5bd68>] ll_md_blocking_ast+0x248/0x2b0 [lustre]
[50297.621895]  [<ffffffffc0a704aa>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
[50297.622789]  [<ffffffffc067e425>] ? cfs_trace_unlock_tcd+0x35/0x90 [libcfs]
[50297.623828]  [<ffffffffc0a7b3c0>] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
[50297.625013]  [<ffffffffc0a81121>] ldlm_cli_cancel+0x161/0x650 [ptlrpc]
[50297.626050]  [<ffffffffc0d5bc0a>] ll_md_blocking_ast+0xea/0x2b0 [lustre]
[50297.627204]  [<ffffffffc0a8593d>] ldlm_handle_bl_callback+0xed/0x4e0 [ptlrpc]
[50297.627945]  [<ffffffffc0a86530>] ldlm_bl_thread_main+0x800/0xa40 [ptlrpc]
[50297.628653]  [<ffffffff854d7c40>] ? wake_up_state+0x20/0x20
[50297.629234]  [<ffffffffc0a85d30>] ? ldlm_handle_bl_callback+0x4e0/0x4e0 [ptlrpc]
[50297.629962]  [<ffffffff854c2e81>] kthread+0xd1/0xe0
[50297.630458]  [<ffffffff854c2db0>] ? insert_kthread_work+0x40/0x40
[50297.631082]  [<ffffffff85b76c37>] ret_from_fork_nospec_begin+0x21/0x21
[50297.631728]  [<ffffffff854c2db0>] ? insert_kthread_work+0x40/0x40
[50297.632319] Code: 4c 8b 26 74 0d f6 05 d3 7c e7 ff 20 0f 85 91 00 00 00 49 8b 44 24 40 49 83 c4 40 49 39 c4 48 8d 58 e8 74 33 0f 1f 80 00 00 00 00 <48> 8b 43 28 48 8b 40 68 48 85 c0 74 12 4c 89 f2 48 89 de 4c 89 
[50297.635385] RIP  [<ffffffffc082c368>] cl_object_flush+0x48/0x120 [obdclass]
[50297.636104]  RSP <ffff95155a92bb20>

Logs are at
https://testing.whamcloud.com/test_sets/a38f656e-dc59-11e9-b62b-52540065bddc
https://testing.whamcloud.com/test_sets/f9fb09b6-dc46-11e9-a197-52540065bddc

Comment by Vladimir Saveliev [ 26/Sep/19 ]
[50297.612680] RAX: 5a5a5a5a5a5a5a5a RBX: 5a5a5a5a5a5a5a42 RCX: 0000000000000000
[50297.613366] RDX: ffff95154c051bd8 RSI: 0000000000000246 RDI: 0000000000000246
[50297.614050] RBP: ffff95155a92bb40 R08: 0000000000000000 R09: 0000000180180017
[50297.614735] R10: 0000000000000001 R11: ffffe84bc1bdc2c0 R12: ffff95156f70b430
[50297.615430] R13: ffff951549f8ac68 R14: ffff95155f7ae6c0 R15: 0000000000000000
[50297.616120] FS:  0000000000000000(0000) GS:ffff95157fd00000(0000) knlGS:0000000000000000
[50297.616899] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50297.617451] CR2: 00007fde9a099f34 CR3: 0000000078754000 CR4: 00000000000606e0
[50297.618146] Call Trace:
[50297.618436]  [<ffffffffc0cde4ec>] lov_flush_composite+0x17c/0x1d0 [lov]
[50297.619091]  [<ffffffffc0cdd2e2>] lov_object_flush+0x22/0x60 [lov]

This is introduced by 707bab62f5 ("LU-12296 llite: improve ll_dom_lock_cancel").

cl_object_flush needs i/o initialized. Otherwise, it races with layout change.

Comment by Gerrit Updater [ 26/Sep/19 ]

Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/36300
Subject: LU-12704 llite: init i/o for cl_object_flush
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: aa6e3dbaf3249ca26537b21565882aec42b2aa57

Comment by Gerrit Updater [ 04/Oct/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36174/
Subject: LU-12704 tests: component end must be multiple of stripesize
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1cb7bdb883b0b19d944a4bee8403d0b5898a3998

Comment by Gerrit Updater [ 04/Oct/19 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36368
Subject: LU-12704 lov: take lsm reference in lov_flush_composite
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9c5d50763ac4ce36f48a05e968ad3c84ffcdbe96

Comment by Mikhail Pershin [ 04/Oct/19 ]

I was fixing DOM entry checking in lov_flush_composite() and thought that taking LSM reference should protect us from layout change. Could you check if patch above fixes that problem?

Comment by Vladimir Saveliev [ 04/Oct/19 ]

Could you check if patch above fixes that problem?

[ 2091.709375] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 2091.712428] IP: [<          (null)>]           (null)
...
[ 2091.736582] CPU: 0 PID: 20963 Comm: ldlm_bl_08 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.5.1.el7.x86_64 #1
...
[ 2091.773122] Call Trace:
[ 2091.775002]  [<ffffffffc0c9a1a2>] ? lov_object_flush+0x22/0x60 [lov]
[ 2091.777577]  [<ffffffffc093e193>] cl_object_flush+0x63/0x120 [obdclass]
[ 2091.780136]  [<ffffffffc0e1c408>] ll_lock_cancel_bits+0x9b8/0xc00 [lustre]

According to crash dump (from Cray's test system), the BUG happened when lov_object_flush() tried to dispatch to lov_flush_composite(), so probably the change in lov_flush_composite() will not help.
However, lov_object has LLT_EMPTY layout already:

crash> lov_object.lo_type,lo_lsm 0xffff8f46e1af42e0
  lo_type = LLT_EMPTY
  lo_lsm = 0x0

That probably happened earlier when layout lock was canceled:

00000080:00010000:0.0:1569576242.662719:0:21009:0:(namei.c:248:ll_lock_cancel_bits()) ### to cancel bits 0x19 ns: lustre-MDT0001-mdc-ffff8f4777b0e000 lock: ffff8f4765861d40/0x7909e24de166dec lrc: 3/0,0 mode: PR/PR res: [0x240000406:0x5278:0x0].0x0 bits 0x19/0x19 rrc: 3 type: IBT flags: 0x429400000000 nid: local remote: 0x96dffd219df32662 expref: -99 pid: 3158 timeout: 0 lvb_type: 0
00000080:00200000:0.0:1569576242.662727:0:21009:0:(vvp_object.c:140:vvp_conf_set()) [0x240000406:0x5278:0x0]: losing layout lock

Then on canceling DoM lock ldlm_bl_08 thread faced with empty layout:

00000080:00010000:0.0:1569576242.728877:0:20963:0:(namei.c:248:ll_lock_cancel_bits()) ### to cancel bits 0x40 ns: lustre-MDT0001-mdc-ffff8f4777b0e000 lock: ffff8f47663ee000/0x7909e24de166f19 lrc: 2/0,0 mode: PR/PR res: [0x240000406:0x5278:0x0].0x0 bits 0x48/0x40 rrc: 3 type: IBT flags: 0x460400000000 nid: local remote: 0x96dffd219df32c12 expref: -99 pid: 3288 timeout: 0 lvb_type: 3
Comment by Vladimir Saveliev [ 04/Oct/19 ]

It seems that all elements of lov_dispatch[] need all members of struct lov_layout_operations initialized.

So, how about defining lov_dispatch[LLT_EMPTY].llo_flush and lov_dispatch[LLT_RELEASED].llo_flush to a function like

static int lov_flush_empty()
{
   return 0;
}

Mike, would that work?

Comment by Mikhail Pershin [ 04/Oct/19 ]

Vladimir, yes, I think it is what we need to add

P.S. I've updated patch with that code

Comment by Vladimir Saveliev [ 04/Oct/19 ]

Mike, wouldn't it be better to have  https://review.whamcloud.com/36368 on top of https://review.whamcloud.com/#/c/36300?

Comment by Mikhail Pershin [ 04/Oct/19 ]

Vladimir, do you think https://review.whamcloud.com/#/c/36300 is still needed? I was thinking that we are safe from layout change while LSM is referenced and no need to initialize IO in llite.

Comment by Andreas Dilger [ 18/Oct/19 ]

Mike, Oleg is still hitting crashes with https://review.whamcloud.com/36300 so it can't land as-is. Does it make sense to rebase your patch to be directly on master so that it can land independently, or does it depend on 36300 in order to work properly?

Comment by Peter Jones [ 18/Oct/19 ]

Moving to 2.14 until we can understand the crashes in Olegtest

Comment by Vladimir Saveliev [ 18/Oct/19 ]
+»       io->ci_ignore_layout = 1;

(defect) I think that ignore_layout shouldn't be used here as its ignoring layout locking in LOV, as mentioned in lov_io_init() it is used along with CIT_MISC from OSC usually, because OSC object pins layout already. In our case we want cl_object_flush() be protected from layout change so shouldn't set ci_ignore_layout

Mike, yes, you are right, io->ci_ignore_layout set to 1 leads to race between layout change and cl_io_init. Something like

00020000:00040000:0.0:1571007482.482526:0:28488:0:(lov_io.c:318:lov_io_mirror_init()) ASSERTION( comp->lo_preferred_mirror == 0 ) failed:

has been seen few times.

However, there is also a problem with io->ci_ignore_layout set to 0.

Namely, the below lockup has beed observed:

[<ffffffffc096f065>] ldlm_completion_ast+0x4e5/0x860 [ptlrpc]
[<ffffffffc0970e2a>] ldlm_cli_enqueue_fini+0x63a/0xef0 [ptlrpc]
[<ffffffffc0973b71>] ldlm_cli_enqueue+0x451/0xa60 [ptlrpc]
[<ffffffffc0ba7730>] mdc_enqueue_base+0x330/0x1c40 [mdc]
[<ffffffffc0ba9a85>] mdc_intent_lock+0x135/0x560 [mdc]
[<ffffffffc0be6742>] lmv_intent_lock+0x402/0xa20 [lmv]
[<ffffffffc0c0eb1d>] ll_layout_intent+0x1dd/0x720 [lustre]
[<ffffffffc0c1fa6c>] ll_layout_refresh+0x30c/0x900 [lustre]
[<ffffffffc0c62ea7>] vvp_io_init+0x347/0x460 [lustre]
[<ffffffffc076bf4b>] cl_io_init0.isra.15+0x8b/0x160 [obdclass]
[<ffffffffc076c0e3>] cl_io_init+0x43/0x80 [obdclass]
[<ffffffffc0c41fe5>] ll_lock_cancel_bits+0x625/0xca0 [lustre]
[<ffffffffc0c42a5c>] ll_md_blocking_ast+0x24c/0x2b0 [lustre]
[<ffffffffc09626ba>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
[<ffffffffc096e311>] ldlm_cli_cancel_local+0xd1/0x420 [ptlrpc]
[<ffffffffc097294a>] ldlm_cli_cancel_list_local+0xea/0x280 [ptlrpc]
[<ffffffffc0972c6b>] ldlm_cancel_resource_local+0x18b/0x2a0 [ptlrpc]
[<ffffffffc0ba02ac>] mdc_resource_get_unused_res+0x10c/0x250 [mdc]
[<ffffffffc0bb1057>] mdc_enqueue_send+0x557/0x710 [mdc]
[<ffffffffc0bb14b2>] mdc_lock_enqueue+0x2a2/0x6f2 [mdc]
[<ffffffffc07696d5>] cl_lock_enqueue+0x65/0x120 [obdclass]
[<ffffffffc0ab51e5>] lov_lock_enqueue+0x95/0x150 [lov]
[<ffffffffc07696d5>] cl_lock_enqueue+0x65/0x120 [obdclass]
[<ffffffffc0769c67>] cl_lock_request+0x67/0x1f0 [obdclass]
[<ffffffffc076d9cb>] cl_io_lock+0x2bb/0x3d0 [obdclass]
[<ffffffffc076dcfa>] cl_io_loop+0xba/0x1c0 [obdclass]
[<ffffffffc0c5981f>] cl_setattr_ost+0x25f/0x3d0 [lustre]
[<ffffffffc0c34b28>] ll_setattr_raw+0xcc8/0x1060 [lustre]
[<ffffffffc0c34f23>] ll_setattr+0x63/0xc0 [lustre]
[<ffffffffbc260524>] notify_change+0x2c4/0x420
[<ffffffffbc23f335>] do_truncate+0x75/0xc0

mdc_enqueue_send() tries to do early cancel, ll_lock_cancel_bits()->ll_dom_lock_cancel() initializes cl_io->ci_ignore_layout to 0, so vvp_io_init() does ll_layout_refresh() and takes lli->lli_layout_mutex and sends enqueue rpc. Server sends blocking ast back to the client, so another ll_lock_cancel_bits() gets to run and stuck on trying to lock the mutex lli->lli_layout_mutex:

int ll_layout_refresh(struct inode *inode, __u32 *gen)
{
...
    mutex_lock(&lli->lli_layout_mutex);
...
[<ffffffffc0c1f94e>] ll_layout_refresh+0x1ee/0x900 [lustre]
[<ffffffffc0c62ea7>] vvp_io_init+0x347/0x460 [lustre]
[<ffffffffc076bf4b>] cl_io_init0.isra.15+0x8b/0x160 [obdclass]
[<ffffffffc076c0e3>] cl_io_init+0x43/0x80 [obdclass]
[<ffffffffc0c41fe5>] ll_lock_cancel_bits+0x625/0xca0 [lustre]
[<ffffffffc0c42a99>] ll_md_blocking_ast+0x289/0x2b0 [lustre]
[<ffffffffc0978bdd>] ldlm_handle_bl_callback+0xed/0x4e0 [ptlrpc]
[<ffffffffc09797d0>] ldlm_bl_thread_main+0x800/0xa40 [ptlrpc]
[<ffffffffbc0c1c71>] kthread+0xd1/0xe0
Comment by Vladimir Saveliev [ 22/Oct/19 ]

(defect) I think that ignore_layout shouldn't be used here as its ignoring layout locking in LOV, as mentioned in lov_io_init() it is used along with CIT_MISC from OSC usually, because OSC object pins layout already. In our case we want cl_object_flush() be protected from layout change so shouldn't set ci_ignore_layout

Mike, it looks like we are in trouble with this patch as io->ci_ignore_layout = 0; as well as io->ci_ignore_layout = 1; leads to a problem. Any idea?

Comment by Gerrit Updater [ 12/Nov/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36368/
Subject: LU-12704 lov: check all entries in lov_flush_composite
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 44460570fd21a91002190c8a0620923125135b52

Comment by Peter Jones [ 12/Nov/19 ]

Landed for 2.14

Generated at Sat Feb 10 02:54:55 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.