[LU-13207] sanity-pfl test 16b crashes in “Oops: Kernel access of bad area” Created: 05/Feb/20  Updated: 21/Feb/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: always_except, ppc
Environment:

PPC Clients


Issue Links:
Duplicate
duplicates LU-13205 sanity-pfl test 16a fails with “setst... Open
Related
is related to LU-13215 sanity-pfl test 17 hangs with “incorr... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

In sanity-pfl test_16b, the client crashes. In a recent failure that skips test 16a, https://testing.whamcloud.com/test_sets/9833b176-47d8-11ea-b58e-52540065bddc, the kernel-crash log shows:

[ 1250.515939] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-pfl test 16b: Verify setstripe\/getstripe with YAML config file + overstriping ============== 04:44:17 \(1580877857\)
[ 1250.733732] Lustre: DEBUG MARKER: == sanity-pfl test 16b: Verify setstripe/getstripe with YAML config file + overstriping ============== 04:44:17 (1580877857)
[ 1251.230177] LustreError: 1992:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1
[ 1251.232551] Unable to handle kernel paging request for data at address 0xe8f506000000c0
[ 1251.232620] Faulting instruction address: 0xc0000000003675e4
[ 1251.232676] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1251.232711] SMP NR_CPUS=2048 NUMA pSeries
[ 1251.232757] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc virtio_balloon ip_tables ext4 mbcache jbd2 virtio_net virtio_blk virtio_pci virtio_ring virtio
[ 1251.233398] CPU: 0 PID: 10113 Comm: socknal_sd00_00 Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.9.1.el7.ppc64 #1
[ 1251.233479] task: c0000000b5b93320 ti: c0000000b5be8000 task.ti: c0000000b5be8000
[ 1251.233532] NIP: c0000000003675e4 LR: c000000000367564 CTR: c0000000009ee780
[ 1251.233586] REGS: c0000000b5beb160 TRAP: 0300   Tainted: G           OE  ------------    (3.10.0-1062.9.1.el7.ppc64)
[ 1251.233658] MSR: 8000000100009032 <SF,EE,ME,IR,DR,RI>  CR: 24424122  XER: 20000000
[ 1251.233778] CFAR: 0000000000002494 DAR: 00e8f506000000c0 DSISR: 40000000 SOFTE: 1 
GPR00: c000000000367564 c0000000b5beb3e0 c000000001776200 0000000000000000 
GPR04: 0000000000010250 ffffffffffffffff c0000000009d21d8 c000000003121da0 
GPR08: 000000000003a1ef 0000000000000000 0000000002170000 d000000002d1e0f0 
GPR12: 0000000024424122 c000000007b80000 c0000000b6808210 0000000000000001 
GPR16: c0000000b679a200 c0000000bbfc2f00 0000000000000000 0000000000000001 
GPR20: 000000000000fe88 0000000000000000 c0000000b679a260 00000000000005a8 
GPR24: 0000000000000800 c0000000be01f400 c0000000009d21d8 ffffffffffffffff 
GPR28: 0000000000000800 0000000000010250 00e8f506000000c0 c0000000be01f400 
[ 1251.234543] NIP [c0000000003675e4] .__kmalloc_node_track_caller+0x234/0x470
[ 1251.234589] LR [c000000000367564] .__kmalloc_node_track_caller+0x1b4/0x470
[ 1251.234634] Call Trace:
[ 1251.234653] [c0000000b5beb3e0] [c000000000367564] .__kmalloc_node_track_caller+0x1b4/0x470 (unreliable)
[ 1251.234736] [c0000000b5beb490] [c000000000920b44] .__alloc_skb+0xb4/0x260
[ 1251.234792] [c0000000b5beb540] [c0000000009d21d8] .sk_stream_alloc_skb+0x78/0x230
[ 1251.234856] [c0000000b5beb5d0] [c0000000009d321c] .tcp_sendmsg+0x6cc/0xe50
[ 1251.234911] [c0000000b5beb720] [c000000000a18abc] .inet_sendmsg+0x9c/0x170
[ 1251.234966] [c0000000b5beb7b0] [c00000000090c700] .sock_sendmsg+0xf0/0x140
[ 1251.235021] [c0000000b5beb970] [c00000000090c7b4] .kernel_sendmsg+0x64/0x90
[ 1251.235088] [c0000000b5beba10] [d000000002d189b4] .ksocknal_lib_send_iov+0x114/0x180 [ksocklnd]
[ 1251.235163] [c0000000b5bebae0] [d000000002d0d134] .ksocknal_process_transmit+0x3c4/0x1260 [ksocklnd]
[ 1251.235238] [c0000000b5bebbc0] [d000000002d14378] .ksocknal_scheduler+0x408/0x14f0 [ksocklnd]
[ 1251.235302] [c0000000b5bebd30] [c00000000013edb0] .kthread+0xf0/0x100
[ 1251.235358] [c0000000b5bebe30] [c00000000000a628] .ret_from_kernel_thread+0x58/0x70
[ 1251.235420] Instruction dump:
[ 1251.235448] e9070008 7fc9502a e9270010 2fbe0000 2f290000 41defeb4 409afea4 4bfffeac 
[ 1251.235541] 60000000 60000000 60420000 e93f0022 <7f1e482a> 39200000 88cd02a2 992d02a2 
[ 1251.235649] ---[ end trace 42e7021bc48adc89 ]---
[ 1251.237326] 
[ 1251.237363] Sending IPI to other CPUs
[ 1251.238398] IPI complete
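
For triage, the key fields of the trace above can be pulled out with a short script. This is a minimal sketch, not part of the test framework: the function names `summarize_oops` and `looks_like_kernel_pointer` are hypothetical, and the pointer check assumes only what this trace itself shows, namely that every kernel or module address in it begins with 0xc or 0xd. The faulting address 0x00e8f506000000c0 fails that check and also appears in GPR30, which would be consistent with the allocator following a corrupted pointer, though that is an inference rather than a confirmed diagnosis.

```python
import re

def summarize_oops(log_text):
    """Extract a few triage fields from a kernel oops log (hypothetical helper)."""
    patterns = {
        # Value reported by lustre_swab_lov_comp_md_v1()
        "invalid_magic": r"Invalid magic (0x[0-9a-fA-F]+)",
        # Data address that triggered the fault
        "fault_address": r"paging request for data at address (0x[0-9a-fA-F]+)",
        # Symbol containing the faulting instruction (leading '.' is dropped)
        "nip_symbol": r"NIP \[[0-9a-f]+\] \.?([\w.]+)\+0x",
        # Name of the thread that was running
        "comm": r"Comm: (\S+)",
    }
    summary = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, log_text)
        summary[name] = match.group(1) if match else None
    return summary

def looks_like_kernel_pointer(addr):
    """Assumption drawn from this trace only: valid addresses start with 0xc or 0xd."""
    return (addr >> 60) in (0xC, 0xD)
```

Applied to the log above, `summarize_oops` reports the thread `socknal_sd00_00` faulting inside `__kmalloc_node_track_caller`, well after the `Invalid magic 0x1` message from PID 1992, so the oops is likely a later symptom rather than the point of corruption.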

In PPC client testing, sanity-pfl test 16b crashes, hangs, or fails 100% of the time. The crashes started on 30 July 2019 with Lustre 2.12.56.72, at https://testing.whamcloud.com/test_sets/11fc75f8-b37b-11e9-9f36-52540065bddc.
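
Because the failure is specific to PPC clients (the `ppc64` kernel here is big-endian) and the first error comes from `lustre_swab_lov_comp_md_v1()`, it may help to see how a magic value behaves under a 32-bit byte swap. The sketch below is illustrative Python, not Lustre code: the helper names are hypothetical, and the constant 0x0BD60BD0 is an assumed value for the composite-layout magic that should be verified against the Lustre headers.

```python
import struct

# ASSUMPTION: composite-layout (PFL) magic; verify against the Lustre headers.
ASSUMED_COMP_V1_MAGIC = 0x0BD60BD0

def swab32(value):
    """Reverse the byte order of a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack(">I", value))[0]

def classify_magic(raw, expected=ASSUMED_COMP_V1_MAGIC):
    """Report whether a magic is in native order, byte-swapped, or neither."""
    if raw == expected:
        return "native"
    if raw == swab32(expected):
        return "swabbed"
    return "invalid"
```

Under that assumption, the logged value 0x1 classifies as `invalid` in either byte order, which would suggest the buffer being swabbed did not hold a layout header at all, rather than a header that merely missed a byte swap. This is a hypothesis for whoever investigates, not a conclusion drawn in the ticket.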

Logs for other crashes are at:
https://testing.whamcloud.com/test_sets/07588ab8-2592-11ea-80b4-52540065bddc
https://testing.whamcloud.com/test_sets/d840dd04-1dba-11ea-80b4-52540065bddc
https://testing.whamcloud.com/test_sets/4c052244-fa17-11e9-a0ba-52540065bddc



 Comments   
Comment by Andreas Dilger [ 10/Feb/20 ]

Very likely a duplicate of LU-13205. See comments there.

Generated at Sat Feb 10 02:59:18 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.