[LU-11280] sanity: test_56w: unable to handle kernel paging request in lod_qos_prep_create Created: 24/Aug/18  Updated: 19/Dec/18  Resolved: 04/Sep/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-11279 sanity test_65c: lverify failed Resolved
Related
is related to LU-10279 sanityn test_101c: FAIL: Found WRITE ... Resolved
is related to LU-11146 setstripe for specific osts are broken Resolved
is related to LU-11279 sanity test_65c: lverify failed Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for S Buisson <sbuisson@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/299d644c-a6de-11e8-a5f2-52540065bddc

test_56w failed because the MDS crashed with the following stack trace:

[ 2736.998005] BUG: unable to handle kernel paging request at ffff9ed8c35f3714
[ 2736.998890] IP: [<ffffffffc10884e3>] lod_qos_prep_create+0x5d3/0x17a0 [lod]
[ 2736.999731] PGD 5442e067 PUD 0 
[ 2737.000131] Oops: 0000 [#1] SMP 
[ 2737.000531] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw ppdev gf128mul glue_helper ablk_helper cryptd i2c_piix4 parport_pc joydev i2c_core pcspkr virtio_balloon parport ip_tables ext4 ata_generic mbcache pata_acpi jbd2 ata_piix virtio_blk libata 8139too crct10dif_pclmul crct10dif_common
[ 2737.009634]  crc32c_intel floppy serio_raw virtio_pci 8139cp virtio_ring virtio mii
[ 2737.010477] CPU: 0 PID: 4463 Comm: mdt00_000 Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
[ 2737.011730] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 2737.012337] task: ffff9ed89ec00000 ti: ffff9ed8a5aec000 task.ti: ffff9ed8a5aec000
[ 2737.013118] RIP: 0010:[<ffffffffc10884e3>]  [<ffffffffc10884e3>] lod_qos_prep_create+0x5d3/0x17a0 [lod]
[ 2737.014122] RSP: 0018:ffff9ed8a5aef5f0  EFLAGS: 00010293
[ 2737.014671] RAX: ffff9ed8b84bfdb0 RBX: 000000005899cae8 RCX: ffff9ed85899cad8
[ 2737.015404] RDX: 0000000000000002 RSI: 0000000000000004 RDI: ffff9ed89c1fd200
[ 2737.016147] RBP: ffff9ed8a5aef6e0 R08: ffff9ed89e2bc550 R09: 0000000000000000
[ 2737.016876] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000000
[ 2737.017620] R13: 0000000000000002 R14: ffff9ed8b7e00000 R15: ffff9ed8b97a1fc0
[ 2737.018378] FS:  0000000000000000(0000) GS:ffff9ed8bfc00000(0000) knlGS:0000000000000000
[ 2737.019224] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2737.019832] CR2: ffff9ed8c35f3714 CR3: 0000000036372000 CR4: 00000000000606f0
[ 2737.020590] Call Trace:
[ 2737.020908]  [<ffffffffc0d74d70>] ? qsd_op_begin+0xb0/0x4d0 [lquota]
[ 2737.021581]  [<ffffffffc10898c5>] lod_prepare_create+0x215/0x2e0 [lod]
[ 2737.022273]  [<ffffffffc107b7ee>] lod_declare_striped_create+0x1ee/0x980 [lod]
[ 2737.023039]  [<ffffffffc1089e8f>] ? lod_sub_declare_create+0xdf/0x210 [lod]
[ 2737.023764]  [<ffffffffc107fec4>] lod_declare_create+0x204/0x590 [lod]
[ 2737.024456]  [<ffffffffc106f412>] ? lod_striping_from_default+0x492/0x5b0 [lod]
[ 2737.025335]  [<ffffffffc094e9d9>] ? lu_context_refill+0x19/0x50 [obdclass]
[ 2737.026101]  [<ffffffffc10f2892>] mdd_declare_create_object_internal+0xe2/0x2f0 [mdd]
[ 2737.026909]  [<ffffffffc10e21c8>] mdd_declare_create+0x48/0xc10 [mdd]
[ 2737.027590]  [<ffffffffc10e65e9>] mdd_create+0x929/0x13f0 [mdd]
[ 2737.028284]  [<ffffffffc0f91e37>] mdt_reint_open+0x2117/0x3160 [mdt]
[ 2737.028973]  [<ffffffffc09634af>] ? upcall_cache_get_entry+0x3df/0x8b0 [obdclass]
[ 2737.029767]  [<ffffffffc0f85ce3>] mdt_reint_rec+0x83/0x210 [mdt]
[ 2737.030409]  [<ffffffffc0f651d2>] mdt_reint_internal+0x6b2/0xa80 [mdt]
[ 2737.031100]  [<ffffffffc0f716c2>] mdt_intent_open+0x82/0x350 [mdt]
[ 2737.031759]  [<ffffffffc092d6f9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 2737.032527]  [<ffffffffc0f6f768>] mdt_intent_policy+0x2f8/0xd10 [mdt]
[ 2737.033224]  [<ffffffffc0f71640>] ? mdt_intent_fixup_resent+0x220/0x220 [mdt]
[ 2737.034120]  [<ffffffffc0b39e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
[ 2737.034863]  [<ffffffffc07516ee>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
[ 2737.035592]  [<ffffffffc0b62483>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
[ 2737.036373]  [<ffffffffc0b8a2d0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[ 2737.037224]  [<ffffffffc0be8932>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 2737.037901]  [<ffffffffc0bf127a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ 2737.038639]  [<ffffffffc0748ee7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 2737.039364]  [<ffffffffc0b9440b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 2737.040188]  [<ffffffff972c52ab>] ? __wake_up_common+0x5b/0x90
[ 2737.040823]  [<ffffffffc0b97c44>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
[ 2737.041488]  [<ffffffff972c9e50>] ? finish_task_switch+0x50/0x170
[ 2737.042152]  [<ffffffffc0b97130>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
[ 2737.042933]  [<ffffffff972bb621>] kthread+0xd1/0xe0
[ 2737.043459]  [<ffffffff972bb550>] ? insert_kthread_work+0x40/0x40
[ 2737.044119]  [<ffffffff979205f7>] ret_from_fork_nospec_begin+0x21/0x21
[ 2737.044804]  [<ffffffff972bb550>] ? insert_kthread_work+0x40/0x40
[ 2737.045452] Code: ff 89 1c 90 8b 45 98 31 d2 0f b7 4f 24 83 c0 01 f7 f1 41 39 cd 89 55 98 7d 25 48 8b 4f 30 8b 45 98 8b 1c 81 49 8b 86 88 09 00 00 <0f> a3 58 08 19 c0 85 c0 0f 85 17 ff ff ff 41 bc ed ff ff ff f6 
[ 2737.048776] RIP  [<ffffffffc10884e3>] lod_qos_prep_create+0x5d3/0x17a0 [lod]
[ 2737.049546]  RSP <ffff9ed8a5aef5f0>
[ 2737.049919] CR2: ffff9ed8c35f3714
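
A cross-check on the register dump above, offered as a sketch rather than a confirmed diagnosis. The bytes at the marked position of the Code: line, `0f a3 58 08`, decode to `bt %ebx, 0x8(%rax)`. With a register bit offset, the x86 BT instruction reads the 32-bit word at base + disp + 4 * floor(offset / 32). The helper name `bt_mem_addr` and the macro names below are hypothetical; the register values are copied from the dump.

```c
#include <stdint.h>

/* Register values copied from the oops above. */
#define OOPS_RAX 0xffff9ed8b84bfdb0ULL  /* base register of the BT operand */
#define OOPS_RBX 0x5899cae8             /* EBX: bit offset handed to BT */
#define OOPS_CR2 0xffff9ed8c35f3714ULL  /* faulting address */

/*
 * Address of the 32-bit word touched by `bt %reg32, disp(%base)`.
 * The bit offset is signed, so the word index uses floor division by 32.
 */
static uint64_t bt_mem_addr(uint64_t base, int64_t disp, int32_t bit_offset)
{
	int64_t word = bit_offset >= 0 ? bit_offset / 32
				       : -((31 - (int64_t)bit_offset) / 32);

	return base + (uint64_t)disp + (uint64_t)(word * 4);
}

/* Returns 1 when the computed BT address equals the reported CR2. */
static int bt_addr_matches_cr2(void)
{
	return bt_mem_addr(OOPS_RAX, 8, OOPS_RBX) == OOPS_CR2;
}
```

With these inputs the computed address equals CR2 (ffff9ed8c35f3714), which would be consistent with a bitmap lookup being handed an out-of-range index (0x5899cae8) instead of a small OST index.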


 Comments   
Comment by James Nunez (Inactive) [ 28/Aug/18 ]

Although this test timed out several times in July 2018, we have been seeing this crash in review-ldiskfs since August 23, 2018.

A few more failures since this ticket was opened:
https://testing.whamcloud.com/test_sets/e7abc4e2-a918-11e8-80f7-52540065bddc
https://testing.whamcloud.com/test_sets/e068ec00-aab7-11e8-bd05-52540065bddc
https://testing.whamcloud.com/test_sets/47659a3c-aad3-11e8-80f7-52540065bddc

There are a few crashes that have the same stack trace but a different instruction pointer:

[ 2825.966155] BUG: unable to handle kernel NULL pointer dereference at 0000000000000380
[ 2825.966986] IP: [<ffffffffc1025976>] lod_statfs_and_check+0x66/0x590 [lod]

https://testing.whamcloud.com/test_sets/18b3c6b0-a956-11e8-bd05-52540065bddc

Comment by Peter Jones [ 28/Aug/18 ]

Bobijam

Can you please investigate?

Thanks

Peter

Comment by Andreas Dilger [ 30/Aug/18 ]

Hitting this fairly often in testing.

Comment by Andreas Dilger [ 31/Aug/18 ]

About 11% of review-ldiskfs sessions are currently failing because of this.

Comment by Zhenyu Xu [ 01/Sep/18 ]

I tried but failed to reproduce it. The memory access fault seems to be located in lod_alloc_ost_list():

...
                /*
                 * We've successfully declared (reserved) an object
                 */
                lod_qos_ost_in_use(env, stripe_count, ost_idx);
                stripe[stripe_count] = o;
                ost_indices[stripe_count] = ost_idx;    /* the memory access fault seems to be here */
                stripe_count++;
        }

        RETURN(rc);
}

I haven't found the code defect here yet: the ost_indices and stripe arrays have been allocated enough space, and stripe_count should not go beyond the array boundaries.
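
One way to cross-check which kind of access faulted is the `Oops: 0000` value in the description, which is the x86 page-fault error code. The sketch below decodes its architectural bits; the names `pf_info` and `decode_pf_error` are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 page-fault error code bits. */
#define PF_PROT  (1u << 0)  /* 0: page not present, 1: protection violation */
#define PF_WRITE (1u << 1)  /* 0: read access, 1: write access */
#define PF_USER  (1u << 2)  /* 0: kernel mode, 1: user mode */
#define PF_RSVD  (1u << 3)  /* reserved bit set in a page-table entry */
#define PF_INSTR (1u << 4)  /* fault raised by an instruction fetch */

struct pf_info {
	bool protection;  /* false means the page was not present */
	bool write;
	bool user;
	bool rsvd;
	bool instr;
};

/* Split the error code printed after "Oops:" into its flag bits. */
static struct pf_info decode_pf_error(uint32_t code)
{
	struct pf_info info = {
		.protection = (code & PF_PROT) != 0,
		.write      = (code & PF_WRITE) != 0,
		.user       = (code & PF_USER) != 0,
		.rsvd       = (code & PF_RSVD) != 0,
		.instr      = (code & PF_INSTR) != 0,
	};

	return info;
}
```

For the value 0x0000 every flag is clear, meaning a kernel-mode read of a not-present page. That would fit a load near the suspected line better than the store to ost_indices[] itself, but it is only a hint and not a confirmed diagnosis.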

Comment by Andreas Dilger [ 03/Sep/18 ]

It looks like patch https://review.whamcloud.com/32814 "LU-11146 lustre: fix setstripe for specific osts upon dir" is responsible for this failure. That patch failed once with this stack on 2018-08-07 before it was landed, and no other tests failed with the same stack until 2018-08-23 after it was landed.

Comment by Gerrit Updater [ 03/Sep/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33105
Subject: LU-11280 revert: fix setstripe for specific osts upon dir"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 799514054df0f2c71d98e1480e31d2ccfc3b6713

Comment by Wang Shilong (Inactive) [ 03/Sep/18 ]

I am wondering whether the issue addressed by https://review.whamcloud.com/#/c/33069/ might be the cause, and whether that patch fixes the problem.

It is worth running sanity test_27H together with test_56w to reproduce this bug.

Comment by Wang Shilong (Inactive) [ 04/Sep/18 ]

I just retriggered https://review.whamcloud.com/#/c/33069/6 to check
whether that patch fixed the problem.

It looks strange that 56w takes the path into lod_alloc_ost_list(), and
what the above patch tries to address might make that happen.

Comment by Zhenyu Xu [ 04/Sep/18 ]

Yes, I can reproduce it by running only test 27H and then 56w after removing the LU-11279 patch, and I do not hit it with the LU-11279 patch in place. So I think, and have verified on my VM, that LU-11279 fixes this issue.

Comment by Peter Jones [ 04/Sep/18 ]

Great news! Then let's close this issue out as a duplicate of LU-11279.
