[LU-12246] sanity test 28 crashes with ‘Unable to handle kernel paging request’ Created: 29/Apr/19  Updated: 04/Jul/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Yang Sheng
Resolution: Unresolved Votes: 0
Labels: ppc

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We’ve seen sanity test_28 crash three times this year, all on PPC and all on 2.12.1. The first time we saw this crash was with 2.12.0.78.

Looking at a recent crash, with logs at https://testing.whamcloud.com/test_sets/661044aa-668f-11e9-8bb1-52540065bddc , we see

============================================ 00:57:30 (1555981050)
[ 3427.381628] Lustre: DEBUG MARKER: == sanity test 28: create/mknod/mkdir with bad file types ============================================ 00:57:30 (1555981050)
[ 3427.761602] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
[ 3428.823676] Unable to handle kernel paging request for data at address 0x406ef778000000c0
[ 3428.823785] Faulting instruction address: 0xc000000000337754
[ 3428.823844] Oops: Kernel access of bad area, sig: 11 [#1]
[ 3428.823880] SMP NR_CPUS=2048 NUMA pSeries
[ 3428.823925] Modules linked in: lnet_selftest(OE) lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core virtio_balloon auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 virtio_net virtio_blk virtio_pci virtio_ring virtio
[ 3428.824556] CPU: 0 PID: 29737 Comm: bash Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.10.1.el7.ppc64 #1
[ 3428.824630] task: c0000000758571c0 ti: c00000007536c000 task.ti: c00000007536c000
[ 3428.824683] NIP: c000000000337754 LR: c0000000003378d4 CTR: c0000000003376a0
[ 3428.824737] REGS: c00000007536f650 TRAP: 0300   Tainted: G           OE  ------------    (3.10.0-957.10.1.el7.ppc64)
[ 3428.824807] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 28242488  XER: 00000000
[ 3428.824931] CFAR: 0000000000002494 DAR: 406ef778000000c0 DSISR: 40000000 SOFTE: 1 
GPR00: c0000000003378d4 c00000007536f8d0 c0000000016cd800 0000000000000000 
GPR04: 00000000000080d0 0000000067533256 c0000000495df140 c000000003421ae0 
GPR08: 00000000004073b0 0000000000000000 00000000024f0000 d000000000a6cf58 
GPR12: c0000000003376a0 c000000007b80000 0000000000000008 0000000022000000 
GPR16: 0000000000000000 c00000007945a4d0 0000000010133584 c000000075d34000 
GPR20: 000000000000001f c00000003e7e2600 0000000000000000 000000000000000e 
GPR24: 0000000000000000 d000000000a73b88 c00000007e01fa00 d0000000009f4410 
GPR28: 000000000000004f 00000000000080d0 406ef778000000c0 c00000007e01fa00 
[ 3428.825686] NIP [c000000000337754] .__kmalloc+0xb4/0x350
[ 3428.825722] LR [c0000000003378d4] .__kmalloc+0x234/0x350
[ 3428.825758] Call Trace:
[ 3428.825778] [c00000007536f8d0] [c0000000003378d4] .__kmalloc+0x234/0x350 (unreliable)
[ 3428.825862] [c00000007536f980] [d0000000009f4410] .ext4_htree_store_dirent+0x50/0x1b0 [ext4]
[ 3428.825934] [c00000007536fa20] [d000000000a0ce10] .htree_dirblock_to_tree+0x1a0/0x230 [ext4]
[ 3428.826005] [c00000007536fb00] [d000000000a0e5b8] .ext4_htree_fill_tree+0x1c8/0x3e0 [ext4]
[ 3428.826076] [c00000007536fc20] [d0000000009f3e1c] .ext4_readdir+0x95c/0xbc0 [ext4]
[ 3428.826142] [c00000007536fd60] [c000000000396d3c] .SyS_getdents+0x1fc/0x2b0
[ 3428.826196] [c00000007536fe30] [c00000000000a284] system_call+0x38/0xfc
[ 3428.826249] Instruction dump:
[ 3428.826277] 7f5fd378 e94d0040 e93f0000 7ce95214 e9070008 7fc9502a e9270010 2fbe0000 
[ 3428.826368] 41de006c 2fa90000 419e0064 e93f0022 <7f3e482a> 39200000 88cd02a2 992d02a2 
[ 3428.826469] ---[ end trace 8c2d758ee3e2fa9b ]---
[ 3428.828981] 
[ 3428.829009] Sending IPI to other CPUs
[ 3428.830044] IPI complete
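
To help read the oops above, here is a minimal sketch that decodes the bracketed word in the instruction dump (<7f3e482a>) and checks it against the register dump. It assumes the word is a PowerPC X-form instruction; the helper name decode_x_form is hypothetical and not part of any Lustre or kernel tooling. Primary opcode 31 with extended opcode 21 is ldx, which loads from the address GPR[RA] + GPR[RB] when RA is non-zero.

```python
def decode_x_form(word):
    """Split a 32-bit PowerPC X-form instruction word into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # primary opcode (top 6 bits)
        "rt": (word >> 21) & 0x1F,      # target register
        "ra": (word >> 16) & 0x1F,      # base register
        "rb": (word >> 11) & 0x1F,      # index register
        "xo": (word >> 1) & 0x3FF,      # extended opcode
    }


# Faulting instruction word, taken from the "Instruction dump" lines.
fields = decode_x_form(0x7F3E482A)

# Register values copied from the GPR dump of the first oops.
gpr = {9: 0x0000000000000000, 30: 0x406EF778000000C0}
dar = 0x406EF778000000C0  # DAR value from the same oops

# opcode 31 / xo 21 is ldx (load doubleword indexed).
is_ldx = fields["opcode"] == 31 and fields["xo"] == 21

# Effective address of the load: GPR[RA] + GPR[RB], wrapped to 64 bits.
effective_address = (gpr[fields["ra"]] + gpr[fields["rb"]]) & (2**64 - 1)

print(fields)
print("ldx:", is_ldx, "EA matches DAR:", effective_address == dar)
```

Under these assumptions the faulting load is ldx r25,r30,r9, and the bad address in DAR comes straight from GPR30. That would be consistent with __kmalloc following a bad pointer, but the logs alone do not establish a root cause.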

Logs for all other failures are at
https://testing.whamcloud.com/test_sets/02902ee0-6266-11e9-8bb1-52540065bddc
https://testing.whamcloud.com/test_sets/d0ff8300-5acf-11e9-a256-52540065bddc



 Comments   
Comment by Peter Jones [ 15/Jun/19 ]

Yang Sheng

Can you please investigate here?

Thanks

Peter

Comment by Yang Sheng [ 20/Jun/19 ]

Hi, James,

It looks like kdump doesn't work well on the ppc node. I think we need to set it up first.

Thanks,
YangSheng

Comment by James Nunez (Inactive) [ 11/Sep/19 ]

The kdump issue should be fixed for ppc clients. The latest crash I can find for this test is from 01-August-2019, with logs at https://testing.whamcloud.com/test_sets/63f677ca-b581-11e9-b88c-52540065bddc , but the output in the kernel-crash file looks the same as above.

Would you please check this failure to see if you think kdump is working properly?

============================================ 19:10:53 (1564686653)
[ 3461.270857] Lustre: DEBUG MARKER: == sanity test 28: create/mknod/mkdir with bad file types ============================================ 19:10:53 (1564686653)
[ 3461.616469] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
[ 3462.590648] Lustre: DEBUG MARKER: rc=0; val=$(/usr/sbin/lctl get_param -n catastrophe 2>&1); if [[ $? -eq 0 && $val -ne 0 ]]; then echo $(hostname -s): $val; rc=$val; fi; exit $rc
[ 3462.591721] Unable to handle kernel paging request for data at address 0x40d1483d000000c0
[ 3462.591785] Faulting instruction address: 0xc000000000337894
[ 3462.591869] Oops: Kernel access of bad area, sig: 11 [#1]
[ 3462.591916] SMP NR_CPUS=2048 NUMA pSeries
[ 3462.591961] Modules linked in: lnet_selftest(OE) lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core virtio_balloon auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 virtio_net virtio_blk virtio_pci virtio_ring virtio
[ 3462.592659] CPU: 0 PID: 7504 Comm: bash Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.21.3.el7.ppc64 #1
[ 3462.592733] task: c000000042635c00 ti: c000000043f8c000 task.ti: c000000043f8c000
[ 3462.592794] NIP: c000000000337894 LR: c000000000337a14 CTR: c0000000003377e0
[ 3462.592847] REGS: c000000043f8f650 TRAP: 0300   Tainted: G           OE  ------------    (3.10.0-957.21.3.el7.ppc64)
[ 3462.592925] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 28242488  XER: 00000000
[ 3462.593058] CFAR: 0000000000002494 DAR: 40d1483d000000c0 DSISR: 40000000 SOFTE: 1 
GPR00: c000000000337a14 c000000043f8f8d0 c0000000016cda00 0000000000000000 
GPR04: 00000000000080d0 000000003ffd4dc5 c00000004ad431dc c000000003021ae0 
GPR08: 0000000000092535 0000000000000000 00000000020f0000 d000000000b2cf68 
GPR12: c0000000003377e0 c000000007b80000 0000000000000008 0000000022000000 
GPR16: 0000000000000000 c00000007b6be4d0 0000000010133584 c00000007651a800 
GPR20: 000000000000001f c000000074946b00 0000000000000000 0000000000000014 
GPR24: 0000000000000000 d000000000b33b98 c00000007e01fa00 d000000000ab4410 
GPR28: 0000000000000049 00000000000080d0 40d1483d000000c0 c00000007e01fa00 
[ 3462.593811] NIP [c000000000337894] .__kmalloc+0xb4/0x350
[ 3462.593851] LR [c000000000337a14] .__kmalloc+0x234/0x350
[ 3462.593887] Call Trace:
[ 3462.593906] [c000000043f8f8d0] [c000000000337a14] .__kmalloc+0x234/0x350 (unreliable)
[ 3462.593992] [c000000043f8f980] [d000000000ab4410] .ext4_htree_store_dirent+0x50/0x1b0 [ext4]
[ 3462.594062] [c000000043f8fa20] [d000000000acce10] .htree_dirblock_to_tree+0x1a0/0x230 [ext4]
[ 3462.594133] [c000000043f8fb00] [d000000000ace5b8] .ext4_htree_fill_tree+0x1c8/0x3e0 [ext4]
[ 3462.594264] [c000000043f8fc20] [d000000000ab3e1c] .ext4_readdir+0x95c/0xbc0 [ext4]
[ 3462.594330] [c000000043f8fd60] [c000000000396e7c] .SyS_getdents+0x1fc/0x2b0
[ 3462.594384] [c000000043f8fe30] [c00000000000a284] system_call+0x38/0xfc
[ 3462.594445] Instruction dump:
[ 3462.594473] 7f5fd378 e94d0040 e93f0000 7ce95214 e9070008 7fc9502a e9270010 2fbe0000 
[ 3462.594564] 41de006c 2fa90000 419e0064 e93f0022 <7f3e482a> 39200000 88cd02a2 992d02a2 
[ 3462.594665] ---[ end trace 3f100a350ddaa703 ]---
[ 3462.597230] 
[ 3462.597267] Sending IPI to other CPUs
[ 3462.598302] IPI complete
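
As a rough way to check that the two crashes "look the same", the sketch below pulls a small signature out of each oops and compares them. The choice of fields (the NIP symbol with offset, plus the low 32 bits of DAR) is an assumption made for illustration, and oops_signature is a hypothetical helper. The sample lines are copied from the two logs in this ticket.

```python
import re


def oops_signature(text):
    """Return (NIP symbol+offset, low 32 bits of DAR) from ppc64 oops text."""
    nip = re.search(r"NIP \[[0-9a-f]+\] (\S+)", text)
    dar = re.search(r"DAR: ([0-9a-f]+)", text)
    return nip.group(1), int(dar.group(1), 16) & 0xFFFFFFFF


# Lines from the April crash (3.10.0-957.10.1.el7.ppc64).
first = """
[ 3428.824931] CFAR: 0000000000002494 DAR: 406ef778000000c0 DSISR: 40000000 SOFTE: 1
[ 3428.825686] NIP [c000000000337754] .__kmalloc+0xb4/0x350
"""

# Lines from the August crash (3.10.0-957.21.3.el7.ppc64).
second = """
[ 3462.593058] CFAR: 0000000000002494 DAR: 40d1483d000000c0 DSISR: 40000000 SOFTE: 1
[ 3462.593811] NIP [c000000000337894] .__kmalloc+0xb4/0x350
"""

same = oops_signature(first) == oops_signature(second)
print(oops_signature(first), oops_signature(second), same)
```

Both oopses fault at .__kmalloc+0xb4/0x350 with a DAR whose low 32 bits are 0xc0, even though the absolute NIP differs between the two kernel builds.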

Comment by Yang Sheng [ 12/Sep/19 ]

Yes, it looks like kdump on ppc should work well now. But the latest crash dump was created on 12-August.
