[LU-9485] conf-sanity test_46a: cl_file_inode_init()) Failure to initialize cl object Created: 10/May/17 Updated: 22/Jun/17 Resolved: 22/Jun/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Environment: |
FSTYPE=zfs |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for John Hammond <john.hammond@intel.com> This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/2d9acc98-3546-11e7-b0a8-5254006e85c2. The sub-test test_46a failed with the following error: test failed to respond and timed out From the client: 01:20:15:[ 8727.094028] Lustre: Mounted lustre-client 01:20:15:[ 8727.326631] LustreError: 14973:0:(lov_ea.c:227:lsme_unpack()) lustre-clilov_UUID: OST index 1 more than OST count 1 01:20:15:[ 8727.329204] Lustre: 14973:0:(lov_pack.c:61:lov_dump_lmm_common()) objid 0x1:3024, magic 0x0bd10bd0, pattern 0x1 01:20:15:[ 8727.331709] Lustre: 14973:0:(lov_pack.c:65:lov_dump_lmm_common()) stripe_size 1048576, stripe_count 2, layout_gen 0 01:20:15:[ 8727.334195] Lustre: 14973:0:(lov_pack.c:84:lov_dump_lmm_objects()) stripe 0 idx 0 subobj 0x0:34 01:20:15:[ 8727.336551] Lustre: 14973:0:(lov_pack.c:84:lov_dump_lmm_objects()) stripe 1 idx 1 subobj 0x0:2 01:20:15:[ 8727.338877] LustreError: 14973:0:(lcommon_cl.c:181:cl_file_inode_init()) Failure to initialize cl object [0x200000bd0:0x1:0x0]: -22 01:20:15:[ 8727.343166] Lustre: 14973:0:(lov_pack.c:61:lov_dump_lmm_common()) objid 0x1:3024, magic 0x0bd10bd0, pattern 0x1 01:20:15:[ 8727.345641] Lustre: 14973:0:(lov_pack.c:65:lov_dump_lmm_common()) stripe_size 1048576, stripe_count 2, layout_gen 0 01:20:15:[ 8727.348135] Lustre: 14973:0:(lov_pack.c:84:lov_dump_lmm_objects()) stripe 0 idx 0 subobj 0x0:34 01:20:15:[ 8727.350427] Lustre: 14973:0:(lov_pack.c:84:lov_dump_lmm_objects()) stripe 1 idx 1 subobj 0x0:2 01:20:15:[ 8727.352806] LustreError: 14973:0:(llite_lib.c:2256:ll_prep_inode()) new_inode -fatal: rc -22 01:20:15:[ 8727.362035] BUG: unable to handle kernel paging request at ffffffa0c9b0b0ff 01:20:15:[ 8727.363006] IP: [<ffffffa0c9b0b0ff>] 0xffffffa0c9b0b0fe 01:20:15:[ 8727.363006] PGD 19bd067 PUD 0 01:20:15:[ 8727.363006] Oops: 0010 [#1] SMP 01:20:15:[ 8727.363006] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi ppdev crc32_pclmul ghash_clmulni_intel aesni_intel parport_pc parport lrw gf128mul glue_helper ablk_helper cryptd virtio_balloon i2c_piix4 pcspkr nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk cirrus crct10dif_pclmul crct10dif_common drm_kms_helper 8139too ata_piix syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel serio_raw ttm virtio_pci virtio_ring virtio drm 8139cp i2c_core mii libata floppy [last unloaded: libcfs] 01:20:15:[ 8727.363006] CPU: 0 PID: 14979 Comm: bash Tainted: G W OE ------------ 3.10.0-514.16.1.el7.x86_64 #1 01:20:15:[ 8727.363006] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 01:20:15:[ 8727.363006] task: ffff88007ae76dd0 ti: ffff88007a240000 task.ti: ffff88007a240000 01:20:15:[ 8727.363006] RIP: 0010:[<ffffffa0c9b0b0ff>] [<ffffffa0c9b0b0ff>] 0xffffffa0c9b0b0fe 01:20:15:[ 8727.363006] RSP: 0018:ffff88007fc03e78 EFLAGS: 00010286 01:20:15:[ 8727.363006] RAX: ffffffa0c9b0b0ff RBX: ffffffff81a1a5c0 RCX: 0000000000019b00 01:20:15:[ 8727.363006] RDX: ffff880067b8caa7 RSI: 0000000000000003 RDI: ffff880067b8caa7 01:20:15:[ 8727.363006] RBP: ffff88007fc03ed0 R08: ffff88007b7b7e00 R09: 000000018010000f 01:20:15:[ 8727.363006] R10: 0000000000000001 R11: ffffea0001ededc0 R12: 000000000000000a 01:20:15:[ 8727.363006] R13: 0000000000000001 R14: ff8800658b5fb0ff R15: ffff88007fc101c0 01:20:15:[ 8727.363006] FS: 00007f4f234d3740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 01:20:15:[ 8727.363006] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 01:20:15:[ 8727.363006] CR2: ffffffa0c9b0b0ff CR3: 00000000782bb000 CR4: 00000000000406f0 01:20:15:[ 8727.363006] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 01:20:15:[ 8727.363006] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 01:20:15:[ 8727.363006] Stack: 01:20:15:[ 8727.363006] ffffffff81139bbd ffff880066441800 0000000000000001 ffff88007fc101e8 01:20:15:[ 8727.363006] ffff88007ae76dd0 ffff88007a243fd8 0000000000000009 0000000000000009 01:20:15:[ 8727.363006] ffff880067d22f58 ffffffff819b3108 0000000000000001 ffff88007fc03f40 01:20:15:[ 8727.363006] Call Trace: 01:20:15:[ 8727.363006] <IRQ> 01:20:15:[ 8727.363006] [<ffffffff81139bbd>] ? rcu_process_callbacks+0x1dd/0x550 01:20:15:[ 8727.363006] [<ffffffff8108f63f>] __do_softirq+0xef/0x280 01:20:15:[ 8727.363006] [<ffffffff81698c1c>] call_softirq+0x1c/0x30 01:20:15:[ 8727.363006] [<ffffffff8102d365>] do_softirq+0x65/0xa0 01:20:15:[ 8727.363006] [<ffffffff8108f9d5>] irq_exit+0x115/0x120 01:20:15:[ 8727.363006] [<ffffffff81699895>] smp_apic_timer_interrupt+0x45/0x60 01:20:15:[ 8727.363006] [<ffffffff81697ddd>] apic_timer_interrupt+0x6d/0x80 01:20:15:[ 8727.363006] <EOI> 01:20:15:[ 8727.363006] [<ffffffff8121ccf3>] ? dup_fd+0x153/0x2d0 01:20:15:[ 8727.363006] [<ffffffff8121ccbf>] ? dup_fd+0x11f/0x2d0 01:20:15:[ 8727.363006] [<ffffffff81083f70>] copy_process+0xc10/0x1960 01:20:15:[ 8727.363006] [<ffffffff81084e71>] do_fork+0x91/0x2c0 01:20:16:[ 8727.363006] [<ffffffff81085126>] SyS_clone+0x16/0x20 01:20:16:[ 8727.363006] [<ffffffff816974d9>] stub_clone+0x69/0x90 01:20:16:[ 8727.363006] [<ffffffff81697189>] ? system_call_fastpath+0x16/0x1b 01:20:16:[ 8727.363006] Code: Bad RIP value. 01:20:16:[ 8727.363006] RIP [<ffffffa0c9b0b0ff>] 0xffffffa0c9b0b0fe 01:20:16:[ 8727.363006] RSP <ffff88007fc03e78> 01:20:16:[ 8727.363006] CR2: ffffffa0c9b0b0ff https://testing.hpdd.intel.com/test_logs/2f0f1c78-3546-11e7-b0a8-5254006e85c2/show_text Info required for matching: conf-sanity 46a |
| Comments |
| Comment by Jian Yu [ 11/May/17 ] |
|
This has been blocking patch review testing on review-zfs-part-2 session against master branch since 2017-05-09: |
| Comment by Peter Jones [ 16/May/17 ] |
|
Jinshan Could you please look into this one? Thanks Peter |
| Comment by Jinshan Xiong (Inactive) [ 16/May/17 ] |
|
This bug may imply several latent issues. 1. wait_osc_import_state client ost FULL is broken by the commit: 5c315184152, and the update in question is: diff --git a/lustre/tests/test-framework.sh b/lustre/tests/test-framework.sh
index d27cb30656..37e8475891 100755
--- a/lustre/tests/test-framework.sh
+++ b/lustre/tests/test-framework.sh
@@ -6020,9 +6020,8 @@ _wait_osc_import_state() {
if [[ $facet == client* ]]; then
# During setup time, the osc might not be setup, it need wait
- # until list_param can return valid value. And also if there
- # are mulitple osc entries we should list all of them before
- # go to wait.
+ # until list_param can return valid value.
+ param="osc.${ost%?}[^mM]*.ost_server_uuid"
params=$($LCTL list_param $param 2>/dev/null || true)
while [ -z "$params" ]; do
if [ $i -ge $maxtime ]; then
As you can see, the variable ost is not even defined, so it just lists the existing OSCs. 2. Somehow the above error triggered memory corruption somewhere. I checked the corresponding code of PFL in the LOV layer and they seemed to be ok. |
| Comment by Gerrit Updater [ 16/May/17 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/27148 |
| Comment by Jinshan Xiong (Inactive) [ 16/May/17 ] |
|
The above patch should be able to hide the issue However, I think we should assign someone to reproduce this issue manually(I couldn't reproduce it with my VM), and once we have a reliable way of reproducing it, we should load a debug kernel to catch the memory corruption issue. |
| Comment by Jinshan Xiong (Inactive) [ 17/May/17 ] |
|
It would be easier if we had a crash dump for this. |
| Comment by Jian Yu [ 17/May/17 ] |
|
Hi Jinshan,
FYI, from the client console log https://testing.hpdd.intel.com/test_logs/2f0f1c78-3546-11e7-b0a8-5254006e85c2/show_text , you can find the following messages: 01:20:17: Starting Kdump Vmcore Save Service... 01:20:17:kdump: dump target is trevis-2.trevis.hpdd.intel.com:/export/scratch 01:20:17:kdump: saving to /kdumproot/scratch//dumps/trevis-41vm1.trevis.hpdd.intel.com/10.9.5.237-2017-05-10-01:19:54/ 01:20:17:kdump: saving vmcore-dmesg.txt 01:20:17:kdump: saving vmcore-dmesg.txt complete 01:20:17:kdump: saving vmcore 01:20:17: Checking for memory holes : [ 0.0 %] / Checking for memory holes : [100.0 %] | Excluding unnecessary pages : [100.0 %] \ Checking for memory holes : [100.0 %] - Checking for memory holes : [100.0 %] / Excluding unnecessary pages : [100.0 %] | Copying data : [ 69.0 %] \ Copying data : [100.0 %] - Copying data : [100.0 %] / 01:20:17:kdump: saving vmcore complete |
| Comment by Jinshan Xiong (Inactive) [ 17/May/17 ] |
|
I looked at the crash dump at https://testing.hpdd.intel.com/test_sets/e592b0c0-39b7-11e7-8847-5254006e85c2 It turned out the address of ll_inode_destroy_callback is 0xffffffffa0ca60b0, and the corresponding rcu_head should be 0xffff880066067328, but somehow it was referenced as 0xffff880066067327, i.e. one byte difference. Since the rcu list is maintained by kernel, the pointer to rcu_head of inode is dereferenced from the next of the previous rcu_head, so it's possible that that piece of memory is somehow reused and decreased by 1. |
| Comment by John Hammond [ 17/May/17 ] |
|
Jinshan, please note that this has only been seen on zfs test sessions and only for changes that included the last commit of the May 9th landing batch (e0deeb237ae111e959e916ae910d15272ed4c120). So it may be from one of: e0deeb2 LU-9140 nrs: add some debug log for NRS TBF 57cc12f LU-9312 hsm: release restore lock without object 8946ce9 LU-9447 o2iblnd: Check for 2 arg ib_alloc_pd af73b39 LU-9443 mdd: omit changelog records for volatile files b0404c1 LU-9369 lfs: make lfs find work correctly 92bbd06 LU-8746 update: restore tdtd_refcount during failure e5eec25 LU-9385 mdt: remove XATTR locking from mdt_add_dirty_flag() 77a60b0 LU-9404 mdt: set HSM xattr only when needed cd58a3b LU-9403 mdt: prevent HSM leak on re-archive 945fd61 LU-8364 ldiskfs: fixes for failover mode 87a233e LU-9411 tests: skip llapi_layout_test 30, 31 for interop dded1e7 LU-9312 hsm: add a cookie indexed request hash d29e759 LU-9312 hsm: fix error handling around mdt_hsm_get_md_hsm() 2585154 LU-9119 lnet: selftest MR fix f2cba08 LU-9394 osd: __osd_obj2dnode() to return negative errors ad75291 LU-9040 scrub: detect dead loop scanning 6746484 LU-8998 llapi: add LLAPI_LAYOUT_COMP_USE_PREV 293da20 LU-9201 test: avoid long sleeps in mount_facet() 191608c LU-9205 tests: fix failures in CLIENTONLY mode 1b3028a LU-8367 osp: orphan cleanup do not wait for reserved 9a6b262 LU-9285 osp: revert patches LU-8367 and LU-8973 e9a1d3d LU-9183 llite: handle removal the pos argument of generic_write_sync 4d1bb7a LU-9183 llite: handle make the string hashes salt the hash cf4481e LU-9125 test: Update setstripe options de7a14b LU-4423 ptlrpc: use 64-bit times for request times 4ce3219 LU-4017 quota: cleanup to improve quota codes 672986c LU-8376 ost: enhance end to end bulk cksum error report 6dc05f0 LU-8650 mdt: enable REP-ACK for DNE c156613 LU-8307 ldlm: cond_resched in ldlm_bl_thread_main 2ffbcc9 LU-9430 utils: fix logic errors and putchar in sk_name2hmac() 7d30275 LU-7062 ldlm: GPF in _ldlm_lock_debug() |
| Comment by Jinshan Xiong (Inactive) [ 17/May/17 ] |
|
Those changes seem pretty simple on the client side. Also the test only handles a plain file w/o PFL layout. I wonder if there is a test env related or server side change around May 9th so that the clients become more likely to see that OST2 is not ready in conf-sanity:46a, and then the memory corruption is just a fallout in the failure path. The issue may have been existing but wasn't triggered somehow |
| Comment by John Hammond [ 18/May/17 ] |
|
> I wonder if there is a test env related or server side change around May 9th so that the clients become more likely to see that OST2 is not ready in conf-sanity:46a, and then the memory corruption is just a fallout in the failure path. The issue may have been existing but wasn't triggered somehow That's what I was thinking. Maybe one of: 293da20 |
| Comment by Jinshan Xiong (Inactive) [ 18/May/17 ] |
|
I'm more concerned about the memory corruption caused by the failure path. If you have time, please double check the code path and see if you can find something new. |
| Comment by John Hammond [ 22/May/17 ] |
|
Hi Jinshan, I double checked but didn't find anything. Can you try running with a debug enable kernel and injecting the error see in lsme_unpack()? |
| Comment by Jinshan Xiong (Inactive) [ 22/May/17 ] |
|
I injected the error before but it didn't cause the kernel crash. Debug kernel probably won't help in this case. We need to load a kernel with kasan enabled, but that will need a Ubuntu test node. Too much work for now. |
| Comment by Sarah Liu [ 03/Jun/17 ] |
|
with debug kernel, PTLDEBUG=-1 running: conf-sanity ONLY=46a run_suite conf-sanity /usr/lib64/lustre/tests/conf-sanity.sh [ 1539.918416] Lustre: DEBUG MARKER: -----============= acceptance-small: conf-sanity ============----- Sat Jun 3 00:21:33 UTC 2017 [ 1540.345014] Lustre: DEBUG MARKER: excepting tests: 32newtarball 101 [ 1540.509025] Lustre: Unmounted lustre-client [ 1594.485769] LNet: Removed LNI 10.9.6.136@tcp [ 1595.152511] LNet: 22761:0:(workitem.c:319:cfs_wi_sched_destroy()) waiting for 2 threads of WI sched[cfs_rh] to terminate [ 1596.205687] LNet: HW nodes: 1, HW CPU cores: 2, npartitions: 1 [ 1596.208649] alg: No test for adler32 (adler32-zlib) [ 1596.209259] alg: No test for crc32 (crc32-table) [ 1597.151613] Lustre: DEBUG MARKER: == conf-sanity test 46a: handle ost additional - wide striped file =================================== 00:22:29 (1496449349) [ 1598.952811] Lustre: Lustre: Build Version: 2.9.58_22_gdb59ecb [ 1598.987313] LNet: Added LNI 10.9.6.136@tcp [8/256/0/180] [ 1598.988049] LNet: Accept secure, port 988 [ 1599.099714] Lustre: Echo OBD driver; http://www.lustre.org/ [ 1634.101425] Lustre: Mounted lustre-client [ 1640.492000] Lustre: DEBUG MARKER: trevis-58vm1.trevis.hpdd.intel.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff8800362bb000.ost_server_uuid 40 [ 1640.724598] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff8800362bb000.ost_server_uuid in FULL state after 0 sec [ 1640.792619] Lustre: Mounted lustre-client [ 1640.978831] LustreError: 25798:0:(lov_ea.c:227:lsme_unpack()) lustre-clilov_UUID: OST index 1 more than OST count 1 [ 1640.979873] Lustre: 25798:0:(lov_pack.c:61:lov_dump_lmm_common()) objid 0x1:3024, magic 0x0bd10bd0, pattern 0x1 [ 1640.980876] Lustre: 25798:0:(lov_pack.c:65:lov_dump_lmm_common()) stripe_size 1048576, stripe_count 2, layout_gen 0 [ 1640.981880] Lustre: 25798:0:(lov_pack.c:84:lov_dump_lmm_objects()) stripe 0 idx 1 subobj 0x0:2 [ 1640.982705] Lustre: 25798:0:(lov_pack.c:84:lov_dump_lmm_objects()) stripe 1 idx 0 subobj 0x0:34 [ 1640.983574] LustreError: 25798:0:(lcommon_cl.c:181:cl_file_inode_init()) Failure to initialize cl object [0x200000bd0:0x1:0x0]: -22 [ 1640.985037] Lustre: 25798:0:(lov_pack.c:61:lov_dump_lmm_common()) objid 0x1:3024, magic 0x0bd10bd0, pattern 0x1 [ 1640.986008] Lustre: 25798:0:(lov_pack.c:65:lov_dump_lmm_common()) stripe_size 1048576, stripe_count 2, layout_gen 0 [ 1640.987005] Lustre: 25798:0:(lov_pack.c:84:lov_dump_lmm_objects()) stripe 0 idx 1 subobj 0x0:2 [ 1640.987830] Lustre: 25798:0:(lov_pack.c:84:lov_dump_lmm_objects()) stripe 1 idx 0 subobj 0x0:34 [ 1640.988788] LustreError: 25798:0:(llite_lib.c:2257:ll_prep_inode()) new_inode -fatal: rc -22 [ 1640.992995] ------------[ cut here ]------------ [ 1640.993476] WARNING: at lib/debugobjects.c:260 debug_print_object+0x83/0xa0() [ 1640.994159] ODEBUG: active_state not available (active state 0) object type: rcu_head hint: (null) [ 1640.995086] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper virtio_balloon i2c_piix4 i2c_core ppdev pcspkr ablk_helper cryptd parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk crct10dif_pclmul crct10dif_common crc32c_intel 8139too serio_raw virtio_pci virtio_ring virtio ata_piix 8139cp mii libata floppy [last unloaded: libcfs] [ 1641.003768] CPU: 0 PID: 25801 Comm: grep Tainted: G OE ------------ 3.10.0-514.16.1.el7.x86_64.debug #1 [ 1641.004767] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 1641.005314] ffff88007d203d48 000000000a3b44cf ffff88007d203d00 ffffffff81759e39 [ 1641.006097] ffff88007d203d38 ffffffff81095cd0 ffff88007d203df0 ffffffff81afe840 [ 1641.006875] ffffffff819f6712 0000000000000296 ffffffff81afe840 ffff88007d203da0 [ 1641.007654] Call Trace: [ 1641.007904] <IRQ> [<ffffffff81759e39>] dump_stack+0x19/0x1b [ 1641.008504] [<ffffffff81095cd0>] warn_slowpath_common+0x70/0xb0 [ 1641.009079] [<ffffffff81095d6c>] warn_slowpath_fmt+0x5c/0x80 [ 1641.009632] [<ffffffff813aa023>] debug_print_object+0x83/0xa0 [ 1641.010191] [<ffffffff813aafe0>] debug_object_active_state+0x150/0x190 [ 1641.010837] [<ffffffff81228f6c>] ? kfree+0xdc/0x2d0 [ 1641.011314] [<ffffffff811723d6>] rcu_process_callbacks+0x286/0x830 [ 1641.011919] [<ffffffff810a11df>] __do_softirq+0x10f/0x470 [ 1641.012443] [<ffffffff8176fb7c>] call_softirq+0x1c/0x30 [ 1641.012962] [<ffffffff81035bed>] do_softirq+0x8d/0xc0 [ 1641.013454] [<ffffffff810a1785>] irq_exit+0x125/0x140 [ 1641.013953] [<ffffffff81770815>] smp_apic_timer_interrupt+0x45/0x60 [ 1641.014555] [<ffffffff8176ed32>] apic_timer_interrupt+0x72/0x80 [ 1641.015131] <EOI> [<ffffffff811ce00e>] ? get_page_from_freelist+0x35e/0xb50 [ 1641.015850] [<ffffffff811cdd8f>] ? get_page_from_freelist+0xdf/0xb50 [ 1641.016463] [<ffffffff811cea03>] __alloc_pages_nodemask+0x203/0x580 [ 1641.017072] [<ffffffff8121c63a>] alloc_pages_vma+0x9a/0x150 [ 1641.017621] [<ffffffff811f9712>] handle_mm_fault+0xc92/0x1000 [ 1641.018184] [<ffffffff81768959>] ? __do_page_fault+0x109/0x4e0 [ 1641.018754] [<ffffffff817689c7>] __do_page_fault+0x177/0x4e0 [ 1641.019302] [<ffffffff81768e16>] trace_do_page_fault+0x56/0x2c0 [ 1641.019881] [<ffffffff8176842b>] do_async_page_fault+0x1b/0xd0 [ 1641.020447] [<ffffffff81764a18>] async_page_fault+0x28/0x30 [ 1641.020994] ---[ end trace 9779c2b1cea644fb ]--- [ 1641.021458] ------------[ cut here ]------------ [ 1641.021920] WARNING: at lib/debugobjects.c:260 debug_print_object+0x83/0xa0() [ 1641.022635] ODEBUG: deactivate not available (active state 0) object type: rcu_head hint: (null) [ 1641.023558] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper virtio_balloon i2c_piix4 i2c_core ppdev pcspkr ablk_helper cryptd parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk crct10dif_pclmul crct10dif_common crc32c_intel 8139too serio_raw virtio_pci virtio_ring virtio ata_piix 8139cp mii libata floppy [last unloaded: libcfs] [ 1641.033087] CPU: 0 PID: 25801 Comm: grep Tainted: G W OE ------------ 3.10.0-514.16.1.el7.x86_64.debug #1 [ 1641.034082] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 1641.034629] ffff88007d203d50 000000000a3b44cf ffff88007d203d08 ffffffff81759e39 [ 1641.035413] ffff88007d203d40 ffffffff81095cd0 ffff88007d203df0 ffffffff81afe840 [ 1641.036201] ffffffff819cee2f 0000000000000286 ffffffff81afe840 ffff88007d203da8 [ 1641.036991] Call Trace: [ 1641.037235] <IRQ> [<ffffffff81759e39>] dump_stack+0x19/0x1b [ 1641.037829] [<ffffffff81095cd0>] warn_slowpath_common+0x70/0xb0 [ 1641.038409] [<ffffffff81095d6c>] warn_slowpath_fmt+0x5c/0x80 [ 1641.038962] [<ffffffff813aa023>] debug_print_object+0x83/0xa0 [ 1641.039520] [<ffffffff813aa9c2>] debug_object_deactivate+0x142/0x1a0 [ 1641.040147] [<ffffffff81228f6c>] ? kfree+0xdc/0x2d0 [ 1641.040626] [<ffffffff811723e5>] rcu_process_callbacks+0x295/0x830 [ 1641.041236] [<ffffffff810a11df>] __do_softirq+0x10f/0x470 [ 1641.041770] [<ffffffff8176fb7c>] call_softirq+0x1c/0x30 [ 1641.042353] [<ffffffff81035bed>] do_softirq+0x8d/0xc0 [ 1641.042855] [<ffffffff810a1785>] irq_exit+0x125/0x140 [ 1641.043354] [<ffffffff81770815>] smp_apic_timer_interrupt+0x45/0x60 [ 1641.043967] [<ffffffff8176ed32>] apic_timer_interrupt+0x72/0x80 [ 1641.044543] <EOI> [<ffffffff811ce00e>] ? get_page_from_freelist+0x35e/0xb50 [ 1641.045270] [<ffffffff811cdd8f>] ? get_page_from_freelist+0xdf/0xb50 [ 1641.045887] [<ffffffff811cea03>] __alloc_pages_nodemask+0x203/0x580 [ 1641.046500] [<ffffffff8121c63a>] alloc_pages_vma+0x9a/0x150 [ 1641.047048] [<ffffffff811f9712>] handle_mm_fault+0xc92/0x1000 [ 1641.047610] [<ffffffff81768959>] ? __do_page_fault+0x109/0x4e0 [ 1641.048181] [<ffffffff817689c7>] __do_page_fault+0x177/0x4e0 [ 1641.048735] [<ffffffff81768e16>] trace_do_page_fault+0x56/0x2c0 [ 1641.049307] [<ffffffff8176842b>] do_async_page_fault+0x1b/0xd0 [ 1641.049878] [<ffffffff81764a18>] async_page_fault+0x28/0x30 [ 1641.050421] ---[ end trace 9779c2b1cea644fc ]--- [ 1641.050894] BUG: unable to handle kernel paging request at ffffffa0bbe200ff [ 1641.051603] IP: [<ffffffa0bbe200ff>] 0xffffffa0bbe200fe [ 1641.052131] PGD 1abd067 PUD 0 [ 1641.052560] Oops: 0010 [#1] SMP [ 1641.053100] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper virtio_balloon i2c_piix4 i2c_core ppdev pcspkr ablk_helper cryptd parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk crct10dif_pclmul crct10dif_common crc32c_intel 8139too serio_raw virtio_pci virtio_ring virtio ata_piix 8139cp mii libata floppy [last unloaded: libcfs] [ 1641.062225] CPU: 0 PID: 25801 Comm: grep Tainted: G W OE ------------ 3.10.0-514.16.1.el7.x86_64.debug #1 [ 1641.063219] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 1641.063772] task: ffff880079024000 ti: ffff8800493b0000 task.ti: ffff8800493b0000 [ 1641.064483] RIP: 0010:[<ffffffa0bbe200ff>] [<ffffffa0bbe200ff>] 0xffffffa0bbe200fe [ 1641.065235] RSP: 0000:ffff88007d203e58 EFLAGS: 00010286 [ 1641.065752] RAX: 0000000000000000 RBX: 000000000000000a RCX: 0000000000000006 [ 1641.066426] RDX: ffffffa0bbe200ff RSI: 0000000000000001 RDI: ffff8800425918af [ 1641.067111] RBP: ffff88007d203ec0 R08: 0000000000000000 R09: 0000000000000001 [ 1641.067792] R10: 0000000000000000 R11: ffff88007d2039fe R12: 0000000000000001 [ 1641.068463] R13: ffff8800425918af R14: 0000000000000001 R15: ff880060db0d58ff [ 1641.069147] FS: 00007f37f7791740(0000) GS:ffff88007d200000(0000) knlGS:0000000000000000 [ 1641.069919] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1641.070466] CR2: ffffffa0bbe200ff CR3: 000000004938a000 CR4: 00000000000406f0 [ 1641.071154] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1641.071835] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 1641.072506] Stack: [ 1641.072718] ffffffff8117238a ffff88007b2e56a8 ffff88007d3d0760 ffffffff81b38400 [ 1641.073497] 0000000000000000 ffff88007d3d0788 ffff880079024000 ffff8800493b3fd8 [ 1641.074279] 0000000000000009 0000000000000009 0000000000000010 ffffffff81ab3188 [ 1641.075063] Call Trace: [ 1641.075316] <IRQ> [ 1641.075512] [<ffffffff8117238a>] ? rcu_process_callbacks+0x23a/0x830 [ 1641.076203] [<ffffffff810a11df>] __do_softirq+0x10f/0x470 [ 1641.076738] [<ffffffff8176fb7c>] call_softirq+0x1c/0x30 [ 1641.077255] [<ffffffff81035bed>] do_softirq+0x8d/0xc0 [ 1641.077755] [<ffffffff810a1785>] irq_exit+0x125/0x140 [ 1641.078247] [<ffffffff81770815>] smp_apic_timer_interrupt+0x45/0x60 [ 1641.078855] [<ffffffff8176ed32>] apic_timer_interrupt+0x72/0x80 [ 1641.079432] <EOI> [ 1641.079627] [<ffffffff811ce00e>] ? get_page_from_freelist+0x35e/0xb50 [ 1641.080313] [<ffffffff811cdd8f>] ? get_page_from_freelist+0xdf/0xb50 [ 1641.080935] [<ffffffff811cea03>] __alloc_pages_nodemask+0x203/0x580 [ 1641.081537] [<ffffffff8121c63a>] alloc_pages_vma+0x9a/0x150 [ 1641.082087] [<ffffffff811f9712>] handle_mm_fault+0xc92/0x1000 [ 1641.082641] [<ffffffff81768959>] ? __do_page_fault+0x109/0x4e0 [ 1641.083216] [<ffffffff817689c7>] __do_page_fault+0x177/0x4e0 [ 1641.083777] [<ffffffff81768e16>] trace_do_page_fault+0x56/0x2c0 [ 1641.084349] [<ffffffff8176842b>] do_async_page_fault+0x1b/0xd0 [ 1641.084919] [<ffffffff81764a18>] async_page_fault+0x28/0x30 [ 1641.085455] Code: Bad RIP value. [ 1641.085830] RIP [<ffffffa0bbe200ff>] 0xffffffa0bbe200fe [ 1641.086363] RSP <ffff88007d203e58> [ 1641.086714] CR2: ffffffa0bbe200ff [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct [ 0.000000] Linux version 3.10.0-514.16.1.el7.x86_64.debug (builder@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Wed Apr 12 15:21:14 UTC 2017 [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-514.16.1.el7.x86_64.debug root=UUID=94c86014-757d-40d3-8e56-e1aec6f393a5 ro console=tty0 LANG=en_US.UTF-8 console=ttyS0,115200 net.ifnames=0 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never disable_cpu_apicid=0 elfcorehdr=867700K [ 0.000000] e820: BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009f7ff] usable [ 0.000000] BIOS-e820: [mem 0x000000000009f800-0x000000000009ffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000021000000-0x0000000034f5cfff] usable [ 0.000000] BIOS-e820: [mem 0x0000000034fff800-0x0000000034ffffff] usable [ 0.000000] BIOS-e820: [mem 0x000000007fffe000-0x000000007fffffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved [ 0.000000] NX (Execute Disable) protection: active [ 0.000000] SMBIOS 2.4 present. [ 0.000000] Hypervisor detected: KVM [ 0.000000] e820: last_pfn = 0x35000 max_arch_pfn = 0x400000000 [ 0.000000] x86 PAT enabled: cpu 0, old 0x7010600070106, new 0x7010600070106 [ 0.000000] x2apic enabled by BIOS, switching to x2apic ops [ 0.000000] found SMP MP-table at [mem 0x000f1fe0-0x000f1fef] mapped at [ffff8800000f1fe0] [ 0.000000] Scanning 1 areas for low memory corruption [ 0.000000] Using GB pages for direct mapping [ 0.000000] RAMDISK: [mem 0x30f4e000-0x31ffffff] [ 0.000000] ACPI: RSDP 00000000000f1e10 00014 (v00 BOCHS ) [ 0.000000] ACPI: RSDT 000000007ffffb04 00030 (v01 BOCHS BXPCRSDT 00000001 BXPC 00000001) [ 0.000000] ACPI: FACP 000000007ffff177 00074 (v01 BOCHS BXPCFACP 00000001 BXPC 00000001) [ 0.000000] ACPI: DSDT 000000007fffe040 01137 (v01 BXPC BXDSDT 00000001 INTL 20150619) [ 0.000000] ACPI: FACS 000000007fffe000 00040 [ 0.000000] ACPI: SSDT 000000007ffff1eb 00899 (v01 BOCHS BXPCSSDT 00000001 BXPC 00000001) [ 0.000000] ACPI: APIC 000000007ffffa84 00080 (v01 BOCHS BXPCAPIC 00000001 BXPC 00000001) [ 0.000000] Setting APIC routing to cluster x2apic. [ 0.000000] NUMA turned off [ 0.000000] Faking a node at [mem 0x0000000000000000-0x0000000034ffffff] [ 0.000000] Initmem setup node 0 [mem 0x00000000-0x34ffffff] [ 0.000000] NODE_DATA [mem 0x34f36000-0x34f5cfff] [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00 [ 0.000000] kvm-clock: cpu 0, msr 0:34ee6001, primary cpu clock [ 0.000000] kvm-clock: using sched offset of 94846834025781 cycles [ 0.000000] Zone ranges: [ 0.000000] DMA [mem 0x00001000-0x00ffffff] [ 0.000000] DMA32 [mem 0x01000000-0xffffffff] [ 0.000000] Normal empty [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x00001000-0x0009efff] [ 0.000000] node 0: [mem 0x21000000-0x34f5cfff] [ 0.000000] ACPI: PM-Timer IO Port: 0x608 [ 0.000000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled) [ 0.000000] ACPI: NR_CPUS/possible_cpus limit of 1 reached. Processor 1/0x1 ignored. [ 0.000000] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1]) [ 0.000000] ACPI: IOAPIC (id[0x00] address[0xfec00000] gsi_base[0]) [ 0.000000] IOAPIC[0]: apic_id 0, version 17, address 0xfec00000, GSI 0-23 [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level) [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level) [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level) [ 0.000000] Using ACPI (MADT) for SMP configuration information [ 0.000000] smpboot: 2 Processors exceeds NR_CPUS limit of 1 [ 0.000000] smpboot: Allowing 1 CPUs, 0 hotplug CPUs [ 0.000000] ------------[ cut here ]------------ [ 0.000000] WARNING: at include/linux/cpumask.h:116 cpumask_check.part.1+0x1e/0x27() [ 0.000000] Modules linked in: [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.10.0-514.16.1.el7.x86_64.debug #1 [ 0.000000] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 0.000000] 0000000000000000 b437bd1d0e6f8cfa ffffffff81aafe30 ffffffff81759e39 [ 0.000000] ffffffff81aafe68 ffffffff81095cd0 0000000000000002 0000000000000001 [ 0.000000] 0000000032000000 0000000000000000 0000000000000000 ffffffff81aafe78 [ 0.000000] Call Trace: [ 0.000000] [<ffffffff81759e39>] dump_stack+0x19/0x1b [ 0.000000] [<ffffffff81095cd0>] warn_slowpath_common+0x70/0xb0 [ 0.000000] [<ffffffff81095e1a>] warn_slowpath_null+0x1a/0x20 [ 0.000000] [<ffffffff817522cc>] cpumask_check.part.1+0x1e/0x27 [ 0.000000] [<ffffffff8109a7cf>] set_cpu_possible+0x4f/0x60 [ 0.000000] [<ffffffff81e29fe7>] prefill_possible_map+0xf7/0x105 [ 0.000000] [<ffffffff81e209b8>] setup_arch+0xd44/0xf0d [ 0.000000] [<ffffffff8175214c>] ? printk+0x72/0x91 [ 0.000000] [<ffffffff81e15d1b>] start_kernel+0xed/0x489 [ 0.000000] [<ffffffff81e15120>] ? early_idt_handlers+0x120/0x120 [ 0.000000] [<ffffffff81e155ee>] x86_64_start_reservations+0x2a/0x2c [ 0.000000] [<ffffffff81e15742>] x86_64_start_kernel+0x152/0x175 [ 0.000000] ---[ end trace 3b72577c75e4d8f3 ]--- [ 0.000000] PM: Registered nosave memory: [mem 0x0009f000-0x0009ffff] [ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000effff] [ 0.000000] PM: Registered nosave memory: [mem 0x000f0000-0x000fffff] [ 0.000000] PM: Registered nosave memory: [mem 0x00100000-0x20ffffff] [ 0.000000] PM: Registered nosave memory: [mem 0x34f5d000-0x34ffffff] [ 0.000000] e820: [mem 0x80000000-0xfeffbfff] available for PCI devices [ 0.000000] Booting paravirtualized kernel on KVM [ 0.000000] setup_percpu: NR_CPUS:5120 nr_cpumask_bits:1 nr_cpu_ids:1 nr_node_ids:1 [ 0.000000] PERCPU: Embedded 482 pages/cpu @ffff880034c00000 s1936216 r8192 d29864 u2097152 [ 0.000000] KVM setup async PF for cpu 0 [ 0.000000] kvm-stealtime: cpu 0, msr 34c0f600 [ 0.000000] Built 1 zonelists in Node order, mobility grouping on. Total pages: 80613 [ 0.000000] Policy zone: DMA32 [ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-514.16.1.el7.x86_64.debug root=UUID=94c86014-757d-40d3-8e56-e1aec6f393a5 ro console=tty0 LANG=en_US.UTF-8 console=ttyS0,115200 net.ifnames=0 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never disable_cpu_apicid=0 elfcorehdr=867700K [ 0.000000] Misrouted IRQ fixup and polling support enabled [ 0.000000] This may significantly impact system performance [ 0.000000] Disabling memory control group subsystem [ 0.000000] PID hash table entries: 2048 (order: 2, 16384 bytes) [ 0.000000] x86/fpu: xstate_offset[2]: 0240, xstate_sizes[2]: 0100 [ 0.000000] xsave: enabled xstate_bv 0x7, cntxt size 0x340 using standard form [ 0.000000] Memory: 261628k/868352k available (7627k kernel code, 540692k absent, 66032k reserved, 4894k data, 6804k init) [ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1 [ 0.000000] Hierarchical RCU implementation. [ 0.000000] RCU lockdep checking is enabled. [ 0.000000] RCU restricting CPUs from NR_CPUS=5120 to nr_cpu_ids=1. [ 0.000000] NR_IRQS:327936 nr_irqs:256 0 [ 0.000000] Spurious LAPIC timer interrupt on cpu 0 [ 0.000000] Console: colour VGA+ 80x25 [ 0.000000] console [tty0] enabled [ 0.000000] console [ttyS0] enabled [ 0.000000] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar [ 0.000000] ... MAX_LOCKDEP_SUBCLASSES: 8 [ 0.000000] ... MAX_LOCK_DEPTH: 48 [ 0.000000] ... MAX_LOCKDEP_KEYS: 8191 [ 0.000000] ... CLASSHASH_SIZE: 4096 [ 0.000000] ... MAX_LOCKDEP_ENTRIES: 32768 [ 0.000000] ... MAX_LOCKDEP_CHAINS: 65536 [ 0.000000] ... CHAINHASH_SIZE: 32768 [ 0.000000] memory used by lock dependency info: 8671 kB [ 0.000000] per task-struct memory footprint: 2688 bytes [ 0.000000] kmemleak: Kernel memory leak detector disabled [ 0.000000] tsc: Detected 2593.746 MHz processor [ 0.123552] Calibrating delay loop (skipped) preset value.. 5187.49 BogoMIPS (lpj=2593746) [ 0.124514] pid_max: default: 32768 minimum: 301 [ 0.125115] Security Framework initialized [ 0.125578] SELinux: Initializing. [ 0.126156] Dentry cache hash table entries: 65536 (order: 7, 524288 bytes) [ 0.127064] Inode-cache hash table entries: 32768 (order: 6, 262144 bytes) [ 0.127845] Mount-cache hash table entries: 4096 [ 0.129197] Initializing cgroup subsys memory [ 0.129714] Initializing cgroup subsys devices [ 0.130247] Initializing cgroup subsys freezer [ 0.130747] Initializing cgroup subsys net_cls [ 0.131261] Initializing cgroup subsys blkio [ 0.131735] Initializing cgroup subsys perf_event [ 0.132269] Initializing cgroup subsys hugetlb [ 0.132755] Initializing cgroup subsys pids [ 0.133238] Initializing cgroup subsys net_prio |
| Comment by Gerrit Updater [ 03/Jun/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27148/ |
| Comment by Jinshan Xiong (Inactive) [ 05/Jun/17 ] |
|
Sarah - as we discussed last Friday, please make crash dump work and enable MALLOC debug only. |
| Comment by Sarah Liu [ 05/Jun/17 ] |
|
please check crash dump on |
| Comment by Jinshan Xiong (Inactive) [ 06/Jun/17 ] |
|
I found something about this issue. It turned out the previous data structure on the RCU list was as follows: [ 3782.106995] next = ffff8800afe2d2a7, list = ffff88003553dd00, list func: (null) [ 3782.107782] next list: next = ff88007085b2f0ff, list = ffff8800afe2d2a7, list_func = ffffffa0b53fe0ff Unfortunately it doesn't come with a free function that means it should be freed by free_rcu. I also turned on malloc debug log but it seems this piece of memory has never been used by Lustre. It's not the corresponding dentry that failed inode initialization either. |
| Comment by Oleg Drokin [ 08/Jun/17 ] |
|
There's interesting message I see in the above dmesg dump: [ 1640.994159] ODEBUG: active_state not available (active state 0) object type: rcu_head hint: (null)
I guess that means a double free or something similar? |
| Comment by Jinshan Xiong (Inactive) [ 13/Jun/17 ] |
|
Oleg - if this were double free, wouldn't it be caught at the 2nd time of call_rcu()? |
| Comment by Zhenyu Xu [ 14/Jun/17 ] |
|
FYI, tried several error injections on my local VM (ldiskfs backend), cannot trigger the memory corruption yet. |
| Comment by Sarah Liu [ 14/Jun/17 ] |
|
this issue is easy to reproduce on zfs. |
| Comment by Gerrit Updater [ 15/Jun/17 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/27658 |
| Comment by Zhenyu Xu [ 15/Jun/17 ] |
|
Hi Sarah, Can you try to reproduce this issue again with this debug patch? I want to know whether it has something to do with the memory corruption, if the reproduction fails the assertion which I adds, then I'll make another fix patch. |
| Comment by Sarah Liu [ 15/Jun/17 ] |
|
Sure, I will update the ticket if I get sth. |
| Comment by Sarah Liu [ 15/Jun/17 ] |
|
Bobi, Your debug patch is based on Jinshan's https://review.whamcloud.com/#/c/27148/, with his patch, it is not easy to reproduce. So I added the line back to test-framework.sh and reproduce it. running: conf-sanity ONLY=46a run_suite conf-sanity /usr/lib64/lustre/tests/conf-sanity.sh [15069.537681] Lustre: DEBUG MARKER: -----============= acceptance-small: conf-sanity ============----- Thu Jun 15 21:38:44 UTC 2017 [15069.902165] Lustre: DEBUG MARKER: excepting tests: 32newtarball 101 [15070.034791] Lustre: Unmounted lustre-client [15119.059428] Key type lgssc unregistered [15119.059933] Lustre: 18766:0:(gss_mech_switch.c:81:lgss_mech_unregister()) Unregister krb5 mechanism [15124.614091] LNet: Removed LNI 10.9.6.136@tcp [15125.631080] LNet: HW nodes: 1, HW CPU cores: 4, npartitions: 1 [15125.634023] alg: No test for adler32 (adler32-zlib) [15125.634592] alg: No test for crc32 (crc32-table) [15126.534615] Lustre: DEBUG MARKER: == conf-sanity test 46a: handle ost additional - wide striped file =================================== 21:39:40 (1497562780) [15128.207906] Lustre: Lustre: Build Version: 2.9.59_16_g233342f [15128.232085] LNet: Added LNI 10.9.6.136@tcp [8/256/0/180] [15128.232728] LNet: Accept secure, port 988 [15129.739865] Lustre: 19557:0:(gss_svc_upcall.c:1178:gss_init_svc_upcall()) Init channel is not opened by lsvcgssd, following request might be dropped until lsvcgssd is active [15129.741378] Lustre: 19557:0:(gss_mech_switch.c:72:lgss_mech_register()) Register gssnull mechanism [15129.742225] Key type lgssc registered [15129.794743] Lustre: Echo OBD driver; http://www.lustre.org/ [15167.777159] Lustre: Mounted lustre-client [15173.996457] Lustre: DEBUG MARKER: trevis-58vm1.trevis.hpdd.intel.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff8800b4796800.ost_server_uuid 40 [15174.200433] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff8800b4796800.ost_server_uuid in FULL state after 0 sec [15174.277151] Lustre: Mounted lustre-client [15174.378436] LustreError: 21918:0:(lov_ea.c:227:lsme_unpack()) lustre-clilov_UUID: OST index 1 more than OST count 1 [15174.379473] Lustre: 21918:0:(lov_pack.c:61:lov_dump_lmm_common()) objid 0x1:3024, magic 0x0bd10bd0, pattern 0x1 [15174.380433] Lustre: 21918:0:(lov_pack.c:65:lov_dump_lmm_common()) stripe_size 1048576, stripe_count 2, layout_gen 0 [15174.381483] Lustre: 21918:0:(lov_pack.c:84:lov_dump_lmm_objects()) stripe 0 idx 0 subobj 0x0:34 [15174.382314] Lustre: 21918:0:(lov_pack.c:84:lov_dump_lmm_objects()) stripe 1 idx 1 subobj 0x0:2 [15174.383184] LustreError: 21918:0:(lcommon_cl.c:181:cl_file_inode_init()) Failure to initialize cl object [0x200000bd0:0x1:0x0]: -22 [15174.384297] LustreError: 21918:0:(llite_lib.c:1811:ll_update_inode()) ASSERTION( rc == 0 ) failed: [15174.385191] LustreError: 21918:0:(llite_lib.c:1811:ll_update_inode()) LBUG Message from syslogd@trevis-58vm1 at Jun 15 21:40:29 ... kernel:LustreError: 21918:0:(llite_lib.c:1811:ll_update_inode()) ASSERTION( rc == 0 ) failed: Message from syslogd@trevis-58vm1 at Jun 15 21:40:29 ... kernel:LustreError: 21918:0:(llite_lib.c:1811:ll_update_inode()) LBUG [15174.387462] Pid: 21918, comm: stat [15174.387797] [15174.387797] Call Trace: [15174.388182] [<ffffffffa05467ee>] libcfs_call_trace+0x4e/0x60 [libcfs] [15174.388817] [<ffffffffa054687c>] lbug_with_loc+0x4c/0xb0 [libcfs] [15174.389508] [<ffffffffa0b6577f>] ll_update_inode+0x36f/0x620 [lustre] [15174.390133] [<ffffffff81219483>] ? inode_sb_list_add+0x43/0x50 [15174.390717] [<ffffffffa0b65a97>] ll_read_inode2+0x67/0x420 [lustre] [15174.391397] [<ffffffffa0b741fb>] ll_iget+0xab/0x350 [lustre] [15174.391956] [<ffffffffa0b67ad2>] ll_prep_inode+0x232/0xc80 [lustre] [15174.392599] [<ffffffffa08cb580>] ? lustre_msg_buf+0x0/0x60 [ptlrpc] [15174.393298] [<ffffffffa0b73a20>] ? ll_md_blocking_ast+0x0/0x730 [lustre] [15174.393957] [<ffffffffa08cf4e0>] ? lustre_swab_mdt_body+0x0/0x140 [ptlrpc] [15174.394632] [<ffffffffa0b749a9>] ll_lookup_it_finish+0x109/0x1090 [lustre] [15174.395373] [<ffffffffa060cc10>] ? lmv_intent_lock+0x12d0/0x1b50 [lmv] [15174.395998] [<ffffffff81114b82>] ? from_kgid+0x12/0x20 [15174.396518] [<ffffffffa0b74504>] ? ll_i2gids+0x24/0xb0 [lustre] [15174.397147] [<ffffffff81114b82>] ? from_kgid+0x12/0x20 [15174.397660] [<ffffffffa0b73a20>] ? ll_md_blocking_ast+0x0/0x730 [lustre] [15174.398373] [<ffffffffa0b761ce>] ll_lookup_it+0x89e/0xee0 [lustre] [15174.398983] [<ffffffffa0b77b9b>] ll_lookup_nd+0xbb/0x190 [lustre] [15174.399579] [<ffffffff81208add>] lookup_real+0x1d/0x50 [15174.400159] [<ffffffff81209452>] __lookup_hash+0x42/0x60 [15174.400684] [<ffffffff816841f2>] lookup_slow+0x42/0xa7 [15174.401252] [<ffffffff8120c973>] path_lookupat+0x773/0x7a0 [15174.401783] [<ffffffff811de665>] ? kmem_cache_alloc+0x35/0x1e0 [15174.402346] [<ffffffff8120f2bf>] ? getname_flags+0x4f/0x1a0 [15174.402945] [<ffffffff8120c9cb>] filename_lookup+0x2b/0xc0 [15174.403480] [<ffffffff812103e7>] user_path_at_empty+0x67/0xc0 [15174.404092] [<ffffffff812104eb>] ? do_filp_open+0x4b/0xb0 [15174.404623] [<ffffffff81210451>] user_path_at+0x11/0x20 [15174.405174] [<ffffffff812038c3>] vfs_fstatat+0x63/0xc0 [15174.405679] [<ffffffff81203e91>] SYSC_newlstat+0x31/0x60 [15174.406259] [<ffffffff81692e66>] ? trace_do_page_fault+0x56/0x150 [15174.406853] [<ffffffff8169250b>] ? do_async_page_fault+0x1b/0xd0 [15174.407435] [<ffffffff8168eff8>] ? async_page_fault+0x28/0x30 [15174.408068] [<ffffffff8120411e>] SyS_newlstat+0xe/0x10 [15174.408581] [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b [15174.409197] [15174.409364] Kernel panic - not syncing: LBUG [15174.409818] CPU: 3 PID: 21918 Comm: stat Tainted: G OE ------------ 3.10.0-514.21.1.el7.x86_64 #1 [15174.410752] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [15174.411297] ffffffffa0564e8b 000000000694263c ffff880081abf820 ffffffff81686f13 [15174.412073] ffff880081abf8a0 ffffffff8168031a ffffffff00000008 ffff880081abf8b0 [15174.412848] ffff880081abf850 000000000694263c 000000000694263c ffff88013fd8f838 [15174.413621] Call Trace: [15174.413871] [<ffffffff81686f13>] dump_stack+0x19/0x1b [15174.414358] [<ffffffff8168031a>] panic+0xe3/0x1f2 [15174.414825] [<ffffffffa0546894>] lbug_with_loc+0x64/0xb0 [libcfs] [15174.415415] [<ffffffffa0b6577f>] ll_update_inode+0x36f/0x620 [lustre] [15174.416038] [<ffffffff81219483>] ? inode_sb_list_add+0x43/0x50 [15174.416601] [<ffffffffa0b65a97>] ll_read_inode2+0x67/0x420 [lustre] [15174.417216] [<ffffffffa0b741fb>] ll_iget+0xab/0x350 [lustre] [15174.417773] [<ffffffffa0b67ad2>] ll_prep_inode+0x232/0xc80 [lustre] [15174.418414] [<ffffffffa08cb580>] ? lustre_msg_buf_v2+0x1b0/0x1b0 [ptlrpc] [15174.419074] [<ffffffffa0b73a20>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre] [15174.419852] [<ffffffffa08cf4e0>] ? lustre_swab_generic_32s+0x20/0x20 [ptlrpc] [15174.420543] [<ffffffffa0b749a9>] ll_lookup_it_finish+0x109/0x1090 [lustre] [15174.421200] [<ffffffffa060cc10>] ? lmv_intent_lock+0x12d0/0x1b50 [lmv] [15174.421830] [<ffffffff81114b82>] ? from_kgid+0x12/0x20 [15174.422328] [<ffffffffa0b74504>] ? ll_i2gids+0x24/0xb0 [lustre] [15174.422904] [<ffffffff81114b82>] ? from_kgid+0x12/0x20 [15174.423403] [<ffffffffa0b73a20>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre] [15174.424179] [<ffffffffa0b761ce>] ll_lookup_it+0x89e/0xee0 [lustre] [15174.424789] [<ffffffffa0b77b9b>] ll_lookup_nd+0xbb/0x190 [lustre] [15174.425368] [<ffffffff81208add>] lookup_real+0x1d/0x50 [15174.425864] [<ffffffff81209452>] __lookup_hash+0x42/0x60 [15174.426378] [<ffffffff816841f2>] lookup_slow+0x42/0xa7 [15174.426878] [<ffffffff8120c973>] path_lookupat+0x773/0x7a0 [15174.427409] [<ffffffff811de665>] ? kmem_cache_alloc+0x35/0x1e0 [15174.427971] [<ffffffff8120f2bf>] ? getname_flags+0x4f/0x1a0 [15174.428504] [<ffffffff8120c9cb>] filename_lookup+0x2b/0xc0 [15174.429036] [<ffffffff812103e7>] user_path_at_empty+0x67/0xc0 [15174.429587] [<ffffffff812104eb>] ? do_filp_open+0x4b/0xb0 [15174.430112] [<ffffffff81210451>] user_path_at+0x11/0x20 [15174.430631] [<ffffffff812038c3>] vfs_fstatat+0x63/0xc0 [15174.431135] [<ffffffff81203e91>] SYSC_newlstat+0x31/0x60 [15174.431648] [<ffffffff81692e66>] ? trace_do_page_fault+0x56/0x150 [15174.432238] [<ffffffff8169250b>] ? do_async_page_fault+0x1b/0xd0 [15174.432815] [<ffffffff8168eff8>] ? async_page_fault+0x28/0x30 [15174.433365] [<ffffffff8120411e>] SyS_newlstat+0xe/0x10 [15174.433864] [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct [ 0.000000] Linux version 3.10.0-514.21.1.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Thu May 25 17:04:51 UTC 2017 [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-514.21.1.el7.x86_64 root=UUID=f513f02d-54c7-42d0-bc3f-1a4b0d47ac07 ro console=tty0 LANG=en_US.UTF-8 console=ttyS0,115200 net.ifnames=0 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never disable_cpu_apicid=0 elfcorehdr=867700K |
| Comment by Zhenyu Xu [ 16/Jun/17 ] |
|
Would you please add the line back to test-framework.sh and apply the updated patch (patchset 2) and try to reproduce it? Thank you. |
| Comment by Sarah Liu [ 16/Jun/17 ] |
|
I update the patch base on yours with the problematic line reinstated and adding test-parameter. |
| Comment by Sarah Liu [ 16/Jun/17 ] |
|
I have run patchset #2 locally with the line added several times and cannot reproduce it this time, but the test failed as [19954.374299] Lustre: DEBUG MARKER: == conf-sanity test 46a: handle ost additional - wide striped file =================================== 21:51:35 (1497649895) [19956.043116] Lustre: Lustre: Build Version: 2.9.59_16_gbe2e8e5 [19956.068256] LNet: Added LNI 10.9.6.136@tcp [8/256/0/180] [19956.068913] LNet: Accept secure, port 988 [19957.576952] Lustre: 2814:0:(gss_svc_upcall.c:1178:gss_init_svc_upcall()) Init channel is not opened by lsvcgssd, following request might be dropped until lsvcgssd is active [19957.578575] Lustre: 2814:0:(gss_mech_switch.c:72:lgss_mech_register()) Register gssnull mechanism [19957.579450] Key type lgssc registered [19957.631952] Lustre: Echo OBD driver; http://www.lustre.org/ [19994.598809] Lustre: Mounted lustre-client [19999.823173] Lustre: DEBUG MARKER: trevis-58vm1.trevis.hpdd.intel.com: executing wait_import_state FULL osc.lustre-OST0000-osc-ffff8800ab927800.ost_server_uuid 40 [20000.027082] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff8800ab927800.ost_server_uuid in FULL state after 0 sec [20000.102424] Lustre: Mounted lustre-client [20000.202406] LustreError: 5182:0:(lov_ea.c:227:lsme_unpack()) lustre-clilov_UUID: OST index 1 more than OST count 1 [20000.203449] Lustre: 5182:0:(lov_pack.c:61:lov_dump_lmm_common()) objid 0x1:3024, magic 0x0bd10bd0, pattern 0x1 [20000.204419] Lustre: 5182:0:(lov_pack.c:65:lov_dump_lmm_common()) stripe_size 1048576, stripe_count 2, layout_gen 0 [20000.205502] Lustre: 5182:0:(lov_pack.c:84:lov_dump_lmm_objects()) stripe 0 idx 0 subobj 0x0:34 [20000.206337] Lustre: 5182:0:(lov_pack.c:84:lov_dump_lmm_objects()) stripe 1 idx 1 subobj 0x0:2 [20000.207177] LustreError: 5182:0:(lcommon_cl.c:181:cl_file_inode_init()) Failure to initialize cl object [0x200000bd0:0x1:0x0]: -22 [20000.208389] LustreError: 5182:0:(llite_lib.c:2273:ll_prep_inode()) new_inode -fatal: rc -22 [20000.368646] Lustre: DEBUG MARKER: conf-sanity test_46a: @@@@@@ FAIL: stat /mnt/lustre/widestripe failed [20002.819582] LustreError: 5512:0:(obd_config.c:1429:class_process_proc_param()) lov.: error writing proc entry 'stripecount': rc = -34 [20008.013856] Lustre: Unmounted lustre-client conf-sanity returned 0 Stopping clients: trevis-58vm1.trevis.hpdd.intel.com /mnt/lustre (opts:) Stopping clients: trevis-58vm1.trevis.hpdd.intel.com /mnt/lustre2 (opts:) Stopping /mnt/lustre-mds1 (opts:-f) on trevis-73vm11 [20031.824103] Key type lgssc unregistered [20031.824613] Lustre: 6228:0:(gss_mech_switch.c:81:lgss_mech_unregister()) Unregister krb5 mechanism [20037.824058] LNet: Removed LNI 10.9.6.136@tcp modules unloaded. Finished at Fri Jun 16 21:53:00 UTC 2017 in 176s auster: completed with rc 0 |
| Comment by Zhenyu Xu [ 17/Jun/17 ] |
|
that's a good news, meaning that the patch fixes the memory issue. |
| Comment by Peter Jones [ 17/Jun/17 ] |
|
Bobijam So patchset 2 was more than just a diagnostic patch - it was a possible (and now proven) fix? Does that mean that we should rollback to v2 of the patch and get reviews with a view to landing this patch? Peter |
| Comment by Zhenyu Xu [ 17/Jun/17 ] |
|
yes, patchset 2 should be the fix patch, I'll update it and request reviews for future landing. |
| Comment by Gerrit Updater [ 22/Jun/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27658/ |
| Comment by Peter Jones [ 22/Jun/17 ] |
|
Landed for 2.10 |