Details
Bug
Resolution: Unresolved
Minor
None
Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4
PPC Clients
3
9223372036854775807
Description
The client crashes in sanity-pfl test_16b. A recent failure that skips test 16a, https://testing.whamcloud.com/test_sets/9833b176-47d8-11ea-b58e-52540065bddc, shows the following in the kernel-crash log:
[ 1250.515939] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-pfl test 16b: Verify setstripe\/getstripe with YAML config file + overstriping ============== 04:44:17 \(1580877857\)
[ 1250.733732] Lustre: DEBUG MARKER: == sanity-pfl test 16b: Verify setstripe/getstripe with YAML config file + overstriping ============== 04:44:17 (1580877857)
[ 1251.230177] LustreError: 1992:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1
[ 1251.232551] Unable to handle kernel paging request for data at address 0xe8f506000000c0
[ 1251.232620] Faulting instruction address: 0xc0000000003675e4
[ 1251.232676] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1251.232711] SMP NR_CPUS=2048 NUMA pSeries
[ 1251.232757] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc virtio_balloon ip_tables ext4 mbcache jbd2 virtio_net virtio_blk virtio_pci virtio_ring virtio
[ 1251.233398] CPU: 0 PID: 10113 Comm: socknal_sd00_00 Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.9.1.el7.ppc64 #1
[ 1251.233479] task: c0000000b5b93320 ti: c0000000b5be8000 task.ti: c0000000b5be8000
[ 1251.233532] NIP: c0000000003675e4 LR: c000000000367564 CTR: c0000000009ee780
[ 1251.233586] REGS: c0000000b5beb160 TRAP: 0300 Tainted: G OE ------------ (3.10.0-1062.9.1.el7.ppc64)
[ 1251.233658] MSR: 8000000100009032 <SF,EE,ME,IR,DR,RI> CR: 24424122 XER: 20000000
[ 1251.233778] CFAR: 0000000000002494 DAR: 00e8f506000000c0 DSISR: 40000000 SOFTE: 1
GPR00: c000000000367564 c0000000b5beb3e0 c000000001776200 0000000000000000
GPR04: 0000000000010250 ffffffffffffffff c0000000009d21d8 c000000003121da0
GPR08: 000000000003a1ef 0000000000000000 0000000002170000 d000000002d1e0f0
GPR12: 0000000024424122 c000000007b80000 c0000000b6808210 0000000000000001
GPR16: c0000000b679a200 c0000000bbfc2f00 0000000000000000 0000000000000001
GPR20: 000000000000fe88 0000000000000000 c0000000b679a260 00000000000005a8
GPR24: 0000000000000800 c0000000be01f400 c0000000009d21d8 ffffffffffffffff
GPR28: 0000000000000800 0000000000010250 00e8f506000000c0 c0000000be01f400
[ 1251.234543] NIP [c0000000003675e4] .__kmalloc_node_track_caller+0x234/0x470
[ 1251.234589] LR [c000000000367564] .__kmalloc_node_track_caller+0x1b4/0x470
[ 1251.234634] Call Trace:
[ 1251.234653] [c0000000b5beb3e0] [c000000000367564] .__kmalloc_node_track_caller+0x1b4/0x470 (unreliable)
[ 1251.234736] [c0000000b5beb490] [c000000000920b44] .__alloc_skb+0xb4/0x260
[ 1251.234792] [c0000000b5beb540] [c0000000009d21d8] .sk_stream_alloc_skb+0x78/0x230
[ 1251.234856] [c0000000b5beb5d0] [c0000000009d321c] .tcp_sendmsg+0x6cc/0xe50
[ 1251.234911] [c0000000b5beb720] [c000000000a18abc] .inet_sendmsg+0x9c/0x170
[ 1251.234966] [c0000000b5beb7b0] [c00000000090c700] .sock_sendmsg+0xf0/0x140
[ 1251.235021] [c0000000b5beb970] [c00000000090c7b4] .kernel_sendmsg+0x64/0x90
[ 1251.235088] [c0000000b5beba10] [d000000002d189b4] .ksocknal_lib_send_iov+0x114/0x180 [ksocklnd]
[ 1251.235163] [c0000000b5bebae0] [d000000002d0d134] .ksocknal_process_transmit+0x3c4/0x1260 [ksocklnd]
[ 1251.235238] [c0000000b5bebbc0] [d000000002d14378] .ksocknal_scheduler+0x408/0x14f0 [ksocklnd]
[ 1251.235302] [c0000000b5bebd30] [c00000000013edb0] .kthread+0xf0/0x100
[ 1251.235358] [c0000000b5bebe30] [c00000000000a628] .ret_from_kernel_thread+0x58/0x70
[ 1251.235420] Instruction dump:
[ 1251.235448] e9070008 7fc9502a e9270010 2fbe0000 2f290000 41defeb4 409afea4 4bfffeac
[ 1251.235541] 60000000 60000000 60420000 e93f0022 <7f1e482a> 39200000 88cd02a2 992d02a2
[ 1251.235649] ---[ end trace 42e7021bc48adc89 ]---
[ 1251.237326]
[ 1251.237363] Sending IPI to other CPUs
[ 1251.238398] IPI complete
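For comparing this crash against the other test sets, it can help to reduce a console log to its call-trace symbols. The following is a minimal sketch, not part of the Lustre test tooling: the helper names `split_entries` and `call_trace_symbols` are hypothetical, and it assumes entries begin with a `[ seconds.microseconds]` timestamp and that frames follow the `[stack] [address] .symbol+0xoff/0xsize` layout seen in the log above. It also tolerates log text that has been pasted onto a single line.

```python
import re

# Kernel timestamp such as "[ 1251.230177]" (assumed entry delimiter).
TIMESTAMP = re.compile(r"\[\s*\d+\.\d+\]")

# Call-trace frame as printed on ppc64: two 16-digit hex addresses, then
# an optionally dot-prefixed symbol with offset/size (assumed layout).
FRAME = re.compile(
    r"\[[0-9a-f]{16}\] \[[0-9a-f]{16}\] \.?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def split_entries(text):
    """Split kernel log text into one string per timestamped entry."""
    starts = [m.start() for m in TIMESTAMP.finditer(text)]
    if not starts:
        stripped = text.strip()
        return [stripped] if stripped else []
    entries = []
    # Keep any text that precedes the first timestamp (e.g. register lines).
    head = text[:starts[0]].strip()
    if head:
        entries.append(head)
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        entries.append(text[begin:end].strip())
    return entries


def call_trace_symbols(entries):
    """Return the symbol name of every entry that looks like a stack frame."""
    symbols = []
    for entry in entries:
        match = FRAME.search(entry)
        if match:
            symbols.append(match.group(1))
    return symbols


# Short excerpt of the log above, flattened onto one line.
sample = (
    "[ 1251.234634] Call Trace: "
    "[ 1251.234736] [c0000000b5beb490] [c000000000920b44] .__alloc_skb+0xb4/0x260 "
    "[ 1251.234792] [c0000000b5beb540] [c0000000009d21d8] .sk_stream_alloc_skb+0x78/0x230"
)

if __name__ == "__main__":
    for symbol in call_trace_symbols(split_entries(sample)):
        print(symbol)
```

Running the same two helpers over the kernel-crash logs of the other test sets would show whether they fault in the same allocation path.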
sanity-pfl test 16b crashes, hangs, or fails in 100% of PPC client test runs. The crashes started on 30 July 2019 with Lustre 2.12.56.72; see https://testing.whamcloud.com/test_sets/11fc75f8-b37b-11e9-9f36-52540065bddc.
Logs for other crashes are at:
https://testing.whamcloud.com/test_sets/07588ab8-2592-11ea-80b4-52540065bddc
https://testing.whamcloud.com/test_sets/d840dd04-1dba-11ea-80b4-52540065bddc
https://testing.whamcloud.com/test_sets/4c052244-fa17-11e9-a0ba-52540065bddc