[LU-13215] sanity-pfl test 17 hangs with “incorrect message magic” Created: 07/Feb/20 Updated: 21/Feb/20 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | always_except, ppc | ||
| Environment: |
PPC clients |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
With sanity-pfl tests 16a (LU-13205) and 16b (LU-13207) being skipped (on the ALWAYS_EXCEPT list), now test 17 hangs repeatedly for PPC client testing. See The output for each of these hangs look different although root cause for the crash may be the same. For example looking at the client1 (77vm7) console log for the hang at https://testing.whamcloud.com/test_sets/a9f0e7d2-4867-11ea-a1c8-52540065bddc, we see [ 1175.191331] Lustre: DEBUG MARKER: == sanity-pfl test 17: Verify LOVEA grows with more component inited ================================= 20:56:28 (1580936188) [ 1175.237552] LustreError: 2178:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1 [ 1175.269569] LustreError: 2188:0:(pack_generic.c:1161:lustre_msg_get_transno()) incorrect message magic: d30bd00b [ 1175.269697] LustreError: 2188:0:(pack_generic.c:1071:lustre_msg_get_opc()) incorrect message magic: d30bd00b (msg:c0000000b55af000) [ 1175.269781] LustreError: 2188:0:(pack_generic.c:1341:lustre_msg_get_jobid()) incorrect message magic: d30bd00b <ConMan> Console [trevis-77vm7] disconnected from <trevis-77:6006> at 02-05 22:26. Looking at the client1 console log for the hang at https://testing.whamcloud.com/test_sets/bac5babe-4908-11ea-b58e-52540065bddc, we see [ 1170.068238] Lustre: DEBUG MARKER: == sanity-pfl test 17: Verify LOVEA grows with more component inited ================================= 16:07:54 (1581005274) [ 1170.164526] LustreError: 2131:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1 [ 1170.185519] LustreError: 2141:0:(pack_generic.c:1161:lustre_msg_get_transno()) incorrect message magic: 00000200 [ 1170.185636] LustreError: 2141:0:(pack_generic.c:1071:lustre_msg_get_opc()) incorrect message magic: 00000200 (msg:c0000000b9a42000) [ 1170.185719] LustreError: 2141:0:(pack_generic.c:1341:lustre_msg_get_jobid()) incorrect message magic: 00000200 [ 1170.192095] swap_info_get: Bad swap file entry 8000002ac500000 [ 1170.192172] BUG: Bad page map in process lfs pte:ab14000007010088 pmd:c0000000b9a52000 [ 1170.192226] addr:0000000010000000 vm_flags:00000875 anon_vma: (null) mapping:c0000000b84c8a20 index:0 [ 1170.192337] vma->vm_ops->fault: ext4_filemap_fault+0x0/0xfffffffffffe74d8 [ext4] [ 1170.192401] vma->vm_file->f_op->mmap: ext4_file_mmap+0x0/0xfffffffffffe7a78 [ext4] [ 1170.192457] CPU: 1 PID: 2144 Comm: lfs Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.9.1.el7.ppc64 #1 [ 1170.192530] Call Trace: [ 1170.192583] [c0000000b42a76a0] [c00000000001f078] .show_stack+0x88/0x330 (unreliable) [ 1170.192667] [c0000000b42a7760] [c000000000b1c56c] .dump_stack+0x28/0x3c [ 1170.192730] [c0000000b42a77d0] [c00000000030b018] .print_bad_pte+0x198/0x260 [ 1170.192788] [c0000000b42a7880] [c00000000030c8e4] .unmap_page_range+0x414/0xc20 [ 1170.192863] [c0000000b42a7a40] [c00000000031087c] .unmap_vmas+0xec/0x1c0 [ 1170.192917] [c0000000b42a7af0] [c00000000031e318] .exit_mmap+0xf8/0x250 [ 1170.192972] [c0000000b42a7c20] [c0000000000ecd98] .mmput+0xa8/0x1a0 [ 1170.193026] [c0000000b42a7ca0] [c0000000000ffd10] .do_exit+0x310/0xbb0 [ 1170.193080] [c0000000b42a7da0] [c000000000100774] .SyS_exit_group+0x54/0x100 [ 1170.193134] [c0000000b42a7e30] [c00000000000a288] system_call+0x3c/0x100 [ 1170.193189] Disabling lock debugging due to kernel taint On the other hand, there is obviously something wrong with the output for other hangs. Looking at the hang at https://testing.whamcloud.com/test_sets/06861d14-4868-11ea-b69a-52540065bddc, the most interesting output is for the client1 (77vm1) with [ 1190.287068] LustreError: 2154:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1 [ 1190.317013] Oops: Exception in kernel mode, sig: 4 [#1] [ 1190.317109] SMP NR_CPUS=2048 NUMA pSeries [ 1190.317155] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt gratc_te [ 1190.332221] Unrecoverable exception 0 at d000000004362060 [ 1190.332347] Oops: Unrecoverable exception, sig: 6 [#2] [ 1190.332385] SMP NR_CPUS=2048 NUMA pSeries [ 1190.332428] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) �(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) () () () () () () () () () () () () () () () () …< several hundred () cut out to keep sane> () () () () () () () () () () () () () [ 1191.364389] �() �() �() �() �() �() �() �) �() �() �() () �() �() �() () �() �() �() () … < several hundred “ �(“ removed> �() �() () �() �() �() ��li� )�z( u�'[/���6Y@ C�x�LB��@m �*� ��^��c ����wkgN1dx3 d�G �^ �Bj �() �() �() �() �ҖAMT� �x�ġ�� ĥn d߭�>뙛��A�() �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() () �() �() �() bute it and/or modify it … and it only gets worse from there. |
| Comments |
| Comment by Andreas Dilger [ 07/Feb/20 ] |
|
It looks like there is either missing swabbing, or incorrect swabbing of RPC message buffers. Did the (or the previously skipped test) failures start after specific patch landings? That would make it relatively easy to track down the source, otherwise it needs to be done by looking through the code. |
| Comment by James Nunez (Inactive) [ 10/Feb/20 ] |
|
Here's what I can see and this information is only for PPC testing. It looks like sanity-pfl test 17 has failed, with "Create /mnt/lustre/d16b.sanity-pfl/f16b.sanity-pfl failed", since before January 2019 through 26 JULY 2019 and didn’t run (and hung) again until 17 OCT 19 and again on 06 FEB 2020 until added to the ALWAYS_EXCEPT list patch. sanity-pfl test 16b as added on 03 JUNE 2019 by patch https://review.whamcloud.com/28425/ and failed with "Create /mnt/lustre/d16b.sanity-pfl/f16b.sanity-pfl failed" until it started crashing continuously from 30 JULY 2019 with one failure on 17 OCT 2019 until it was added to the ALWAYS_EXCEPT list patch on 05 FEB 2020. |