Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.14.0
-
PPC clients
-
3
-
9223372036854775807
Description
With sanity-pfl tests 16a (LU-13205) and 16b (LU-13207) being skipped (on the ALWAYS_EXCEPT list), now test 17 hangs repeatedly for PPC client testing. See
https://testing.whamcloud.com/test_sets/bac5babe-4908-11ea-b58e-52540065bddc
https://testing.whamcloud.com/test_sets/efe243fc-4908-11ea-b69a-52540065bddc
https://testing.whamcloud.com/test_sets/a9f0e7d2-4867-11ea-a1c8-52540065bddc
https://testing.whamcloud.com/test_sets/06861d14-4868-11ea-b69a-52540065bddc
The output for each of these hangs look different although root cause for the crash may be the same. For example looking at the client1 (77vm7) console log for the hang at https://testing.whamcloud.com/test_sets/a9f0e7d2-4867-11ea-a1c8-52540065bddc, we see
[ 1175.191331] Lustre: DEBUG MARKER: == sanity-pfl test 17: Verify LOVEA grows with more component inited ================================= 20:56:28 (1580936188) [ 1175.237552] LustreError: 2178:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1 [ 1175.269569] LustreError: 2188:0:(pack_generic.c:1161:lustre_msg_get_transno()) incorrect message magic: d30bd00b [ 1175.269697] LustreError: 2188:0:(pack_generic.c:1071:lustre_msg_get_opc()) incorrect message magic: d30bd00b (msg:c0000000b55af000) [ 1175.269781] LustreError: 2188:0:(pack_generic.c:1341:lustre_msg_get_jobid()) incorrect message magic: d30bd00b <ConMan> Console [trevis-77vm7] disconnected from <trevis-77:6006> at 02-05 22:26.
Looking at the client1 console log for the hang at https://testing.whamcloud.com/test_sets/bac5babe-4908-11ea-b58e-52540065bddc, we see
[ 1170.068238] Lustre: DEBUG MARKER: == sanity-pfl test 17: Verify LOVEA grows with more component inited ================================= 16:07:54 (1581005274) [ 1170.164526] LustreError: 2131:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1 [ 1170.185519] LustreError: 2141:0:(pack_generic.c:1161:lustre_msg_get_transno()) incorrect message magic: 00000200 [ 1170.185636] LustreError: 2141:0:(pack_generic.c:1071:lustre_msg_get_opc()) incorrect message magic: 00000200 (msg:c0000000b9a42000) [ 1170.185719] LustreError: 2141:0:(pack_generic.c:1341:lustre_msg_get_jobid()) incorrect message magic: 00000200 [ 1170.192095] swap_info_get: Bad swap file entry 8000002ac500000 [ 1170.192172] BUG: Bad page map in process lfs pte:ab14000007010088 pmd:c0000000b9a52000 [ 1170.192226] addr:0000000010000000 vm_flags:00000875 anon_vma: (null) mapping:c0000000b84c8a20 index:0 [ 1170.192337] vma->vm_ops->fault: ext4_filemap_fault+0x0/0xfffffffffffe74d8 [ext4] [ 1170.192401] vma->vm_file->f_op->mmap: ext4_file_mmap+0x0/0xfffffffffffe7a78 [ext4] [ 1170.192457] CPU: 1 PID: 2144 Comm: lfs Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.9.1.el7.ppc64 #1 [ 1170.192530] Call Trace: [ 1170.192583] [c0000000b42a76a0] [c00000000001f078] .show_stack+0x88/0x330 (unreliable) [ 1170.192667] [c0000000b42a7760] [c000000000b1c56c] .dump_stack+0x28/0x3c [ 1170.192730] [c0000000b42a77d0] [c00000000030b018] .print_bad_pte+0x198/0x260 [ 1170.192788] [c0000000b42a7880] [c00000000030c8e4] .unmap_page_range+0x414/0xc20 [ 1170.192863] [c0000000b42a7a40] [c00000000031087c] .unmap_vmas+0xec/0x1c0 [ 1170.192917] [c0000000b42a7af0] [c00000000031e318] .exit_mmap+0xf8/0x250 [ 1170.192972] [c0000000b42a7c20] [c0000000000ecd98] .mmput+0xa8/0x1a0 [ 1170.193026] [c0000000b42a7ca0] [c0000000000ffd10] .do_exit+0x310/0xbb0 [ 1170.193080] [c0000000b42a7da0] [c000000000100774] .SyS_exit_group+0x54/0x100 [ 1170.193134] [c0000000b42a7e30] [c00000000000a288] system_call+0x3c/0x100 [ 1170.193189] Disabling lock debugging due to kernel taint
On the other hand, there is obviously something wrong with the output for other hangs. Looking at the hang at https://testing.whamcloud.com/test_sets/06861d14-4868-11ea-b69a-52540065bddc, the most interesting output is for the client1 (77vm1) with
[ 1190.287068] LustreError: 2154:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1 [ 1190.317013] Oops: Exception in kernel mode, sig: 4 [#1] [ 1190.317109] SMP NR_CPUS=2048 NUMA pSeries [ 1190.317155] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt gratc_te [ 1190.332221] Unrecoverable exception 0 at d000000004362060 [ 1190.332347] Oops: Unrecoverable exception, sig: 6 [#2] [ 1190.332385] SMP NR_CPUS=2048 NUMA pSeries [ 1190.332428] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) �(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) () () () () () () () () () () () () () () () () …< several hundred () cut out to keep sane> () () () () () () () () () () () () () [ 1191.364389] �() �() �() �() �() �() �() �) �() �() �() () �() �() �() () �() �() �() () … < several hundred “ �(“ removed> �() �() () �() �() �() ��li� )�z( u�'[/���6Y@ C�x�LB��@m �*� ��^��c ����wkgN1dx3 d�G �^ �Bj �() �() �() �() �ҖAMT� �x�ġ�� ĥn d߭�>뙛��A�() �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() bute it and/or modify it …
and it only gets worse from there.