[LU-12246] sanity test 28 crashes with ‘Unable to handle kernel paging request’ Created: 29/Apr/19 Updated: 04/Jul/20 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Yang Sheng |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | ppc | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We’ve seen sanity test_28 crash three times this year, all on PPC and all with 2.12.1. The first time we saw this crash was with 2.12.0.78. Looking at a recent crash, with logs at https://testing.whamcloud.com/test_sets/661044aa-668f-11e9-8bb1-52540065bddc , we see

============================================ 00:57:30 \(1555981050\)
[ 3427.381628] Lustre: DEBUG MARKER: == sanity test 28: create/mknod/mkdir with bad file types ============================================ 00:57:30 (1555981050)
[ 3427.761602] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[ 3428.823676] Unable to handle kernel paging request for data at address 0x406ef778000000c0
[ 3428.823785] Faulting instruction address: 0xc000000000337754
[ 3428.823844] Oops: Kernel access of bad area, sig: 11 [#1]
[ 3428.823880] SMP NR_CPUS=2048 NUMA pSeries
[ 3428.823925] Modules linked in: lnet_selftest(OE) lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core virtio_balloon auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 virtio_net virtio_blk virtio_pci virtio_ring virtio
[ 3428.824556] CPU: 0 PID: 29737 Comm: bash Kdump: loaded Tainted: G OE ------------ 3.10.0-957.10.1.el7.ppc64 #1
[ 3428.824630] task: c0000000758571c0 ti: c00000007536c000 task.ti: c00000007536c000
[ 3428.824683] NIP: c000000000337754 LR: c0000000003378d4 CTR: c0000000003376a0
[ 3428.824737] REGS: c00000007536f650 TRAP: 0300 Tainted: G OE ------------ (3.10.0-957.10.1.el7.ppc64)
[ 3428.824807] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 28242488 XER: 00000000
[ 3428.824931] CFAR: 0000000000002494 DAR: 406ef778000000c0 DSISR: 40000000 SOFTE: 1
GPR00: c0000000003378d4 c00000007536f8d0 c0000000016cd800 0000000000000000
GPR04: 00000000000080d0 0000000067533256 c0000000495df140 c000000003421ae0
GPR08: 00000000004073b0 0000000000000000 00000000024f0000 d000000000a6cf58
GPR12: c0000000003376a0 c000000007b80000 0000000000000008 0000000022000000
GPR16: 0000000000000000 c00000007945a4d0 0000000010133584 c000000075d34000
GPR20: 000000000000001f c00000003e7e2600 0000000000000000 000000000000000e
GPR24: 0000000000000000 d000000000a73b88 c00000007e01fa00 d0000000009f4410
GPR28: 000000000000004f 00000000000080d0 406ef778000000c0 c00000007e01fa00
[ 3428.825686] NIP [c000000000337754] .__kmalloc+0xb4/0x350
[ 3428.825722] LR [c0000000003378d4] .__kmalloc+0x234/0x350
[ 3428.825758] Call Trace:
[ 3428.825778] [c00000007536f8d0] [c0000000003378d4] .__kmalloc+0x234/0x350 (unreliable)
[ 3428.825862] [c00000007536f980] [d0000000009f4410] .ext4_htree_store_dirent+0x50/0x1b0 [ext4]
[ 3428.825934] [c00000007536fa20] [d000000000a0ce10] .htree_dirblock_to_tree+0x1a0/0x230 [ext4]
[ 3428.826005] [c00000007536fb00] [d000000000a0e5b8] .ext4_htree_fill_tree+0x1c8/0x3e0 [ext4]
[ 3428.826076] [c00000007536fc20] [d0000000009f3e1c] .ext4_readdir+0x95c/0xbc0 [ext4]
[ 3428.826142] [c00000007536fd60] [c000000000396d3c] .SyS_getdents+0x1fc/0x2b0
[ 3428.826196] [c00000007536fe30] [c00000000000a284] system_call+0x38/0xfc
[ 3428.826249] Instruction dump:
[ 3428.826277] 7f5fd378 e94d0040 e93f0000 7ce95214 e9070008 7fc9502a e9270010 2fbe0000
[ 3428.826368] 41de006c 2fa90000 419e0064 e93f0022 <7f3e482a> 39200000 88cd02a2 992d02a2
[ 3428.826469] ---[ end trace 8c2d758ee3e2fa9b ]---
[ 3428.828981]
[ 3428.829009] Sending IPI to other CPUs
[ 3428.830044] IPI complete

Logs for all other failures are at |
| Comments |
| Comment by Peter Jones [ 15/Jun/19 ] |
|
Yang Sheng, can you please investigate here? Thanks, Peter |
| Comment by Yang Sheng [ 20/Jun/19 ] |
|
Hi James, it looks like kdump doesn't work well on the ppc node? I think we need to set it up first. Thanks, |
| Comment by James Nunez (Inactive) [ 11/Sep/19 ] |
|
The kdump issue should be fixed for ppc clients. The latest crash I can find for this test is from 01-August-2019, with logs at https://testing.whamcloud.com/test_sets/63f677ca-b581-11e9-b88c-52540065bddc , but the output in the kernel-crash file looks the same as above. Would you please check this failure to see if you think kdump is working properly?

============================================ 19:10:53 \(1564686653\)
[ 3461.270857] Lustre: DEBUG MARKER: == sanity test 28: create/mknod/mkdir with bad file types ============================================ 19:10:53 (1564686653)
[ 3461.616469] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
[ 3462.590648] Lustre: DEBUG MARKER: rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ $? -eq 0 && $val -ne 0 ]]; then echo $(hostname -s): $val; rc=$val; fi; exit $rc
[ 3462.591721] Unable to handle kernel paging request for data at address 0x40d1483d000000c0
[ 3462.591785] Faulting instruction address: 0xc000000000337894
[ 3462.591869] Oops: Kernel access of bad area, sig: 11 [#1]
[ 3462.591916] SMP NR_CPUS=2048 NUMA pSeries
[ 3462.591961] Modules linked in: lnet_selftest(OE) lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core virtio_balloon auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 virtio_net virtio_blk virtio_pci virtio_ring virtio
[ 3462.592659] CPU: 0 PID: 7504 Comm: bash Kdump: loaded Tainted: G OE ------------ 3.10.0-957.21.3.el7.ppc64 #1
[ 3462.592733] task: c000000042635c00 ti: c000000043f8c000 task.ti: c000000043f8c000
[ 3462.592794] NIP: c000000000337894 LR: c000000000337a14 CTR: c0000000003377e0
[ 3462.592847] REGS: c000000043f8f650 TRAP: 0300 Tainted: G OE ------------ (3.10.0-957.21.3.el7.ppc64)
[ 3462.592925] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 28242488 XER: 00000000
[ 3462.593058] CFAR: 0000000000002494 DAR: 40d1483d000000c0 DSISR: 40000000 SOFTE: 1
GPR00: c000000000337a14 c000000043f8f8d0 c0000000016cda00 0000000000000000
GPR04: 00000000000080d0 000000003ffd4dc5 c00000004ad431dc c000000003021ae0
GPR08: 0000000000092535 0000000000000000 00000000020f0000 d000000000b2cf68
GPR12: c0000000003377e0 c000000007b80000 0000000000000008 0000000022000000
GPR16: 0000000000000000 c00000007b6be4d0 0000000010133584 c00000007651a800
GPR20: 000000000000001f c000000074946b00 0000000000000000 0000000000000014
GPR24: 0000000000000000 d000000000b33b98 c00000007e01fa00 d000000000ab4410
GPR28: 0000000000000049 00000000000080d0 40d1483d000000c0 c00000007e01fa00
[ 3462.593811] NIP [c000000000337894] .__kmalloc+0xb4/0x350
[ 3462.593851] LR [c000000000337a14] .__kmalloc+0x234/0x350
[ 3462.593887] Call Trace:
[ 3462.593906] [c000000043f8f8d0] [c000000000337a14] .__kmalloc+0x234/0x350 (unreliable)
[ 3462.593992] [c000000043f8f980] [d000000000ab4410] .ext4_htree_store_dirent+0x50/0x1b0 [ext4]
[ 3462.594062] [c000000043f8fa20] [d000000000acce10] .htree_dirblock_to_tree+0x1a0/0x230 [ext4]
[ 3462.594133] [c000000043f8fb00] [d000000000ace5b8] .ext4_htree_fill_tree+0x1c8/0x3e0 [ext4]
[ 3462.594264] [c000000043f8fc20] [d000000000ab3e1c] .ext4_readdir+0x95c/0xbc0 [ext4]
[ 3462.594330] [c000000043f8fd60] [c000000000396e7c] .SyS_getdents+0x1fc/0x2b0
[ 3462.594384] [c000000043f8fe30] [c00000000000a284] system_call+0x38/0xfc
[ 3462.594445] Instruction dump:
[ 3462.594473] 7f5fd378 e94d0040 e93f0000 7ce95214 e9070008 7fc9502a e9270010 2fbe0000
[ 3462.594564] 41de006c 2fa90000 419e0064 e93f0022 <7f3e482a> 39200000 88cd02a2 992d02a2
[ 3462.594665] ---[ end trace 3f100a350ddaa703 ]---
[ 3462.597230]
[ 3462.597267] Sending IPI to other CPUs
[ 3462.598302] IPI complete |
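One way to check that two oops dumps "look the same" is to compare a coarse signature rather than the raw text, since addresses and timestamps differ between runs. The following is a minimal Python sketch under stated assumptions: the function name `oops_signature`, the regular expressions, and the choice of fields (faulting symbol and offset, low 32 bits of the DAR, call-trace symbol names) are illustrative and not part of any Lustre or kernel tooling. The two excerpts are abridged from the logs in this ticket.

```python
import re

# Abridged excerpts from the two crashes quoted in this ticket.
FIRST = """
[ 3428.824931] CFAR: 0000000000002494 DAR: 406ef778000000c0 DSISR: 40000000 SOFTE: 1
[ 3428.825686] NIP [c000000000337754] .__kmalloc+0xb4/0x350
[ 3428.825862] [c00000007536f980] [d0000000009f4410] .ext4_htree_store_dirent+0x50/0x1b0 [ext4]
[ 3428.826142] [c00000007536fd60] [c000000000396d3c] .SyS_getdents+0x1fc/0x2b0
"""

SECOND = """
[ 3462.593058] CFAR: 0000000000002494 DAR: 40d1483d000000c0 DSISR: 40000000 SOFTE: 1
[ 3462.593811] NIP [c000000000337894] .__kmalloc+0xb4/0x350
[ 3462.593992] [c000000043f8f980] [d000000000ab4410] .ext4_htree_store_dirent+0x50/0x1b0 [ext4]
[ 3462.594330] [c000000043f8fd60] [c000000000396e7c] .SyS_getdents+0x1fc/0x2b0
"""


def oops_signature(log: str) -> dict:
    """Extract a coarse, address-independent signature from a ppc64 oops dump.

    Hypothetical helper: the field selection is an assumption made for
    illustration, not an established crash-triage format.
    """
    # Data address register: the address whose access faulted.
    dar = re.search(r"DAR:\s*([0-9a-f]{16})", log)
    # Faulting instruction as symbol+offset, e.g. "__kmalloc+0xb4".
    nip = re.search(r"NIP \[[0-9a-f]+\] \.?([\w.]+\+0x[0-9a-f]+)/0x[0-9a-f]+", log)
    # Call-trace frames: "[stack addr] [return addr] .symbol+0x..."
    frames = re.findall(r"\[[0-9a-f]{16}\] \[[0-9a-f]{16}\] \.?(\w+)\+0x", log)
    return {
        "nip": nip.group(1) if nip else None,
        "dar_low32": dar.group(1)[-8:] if dar else None,
        "frames": frames,
    }


if __name__ == "__main__":
    a, b = oops_signature(FIRST), oops_signature(SECOND)
    print(a)
    print(b)
    print("same signature:", a == b)
```

Matching signatures only indicate that the console output has the same shape; they say nothing about whether kdump captured a usable vmcore, which still has to be checked on the node.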
| Comment by Yang Sheng [ 12/Sep/19 ] |
|
Yes, it looks like kdump on ppc should work well now. But the latest crash dump was created on 12-August. |