[LU-13215] sanity-pfl test 17 hangs with “incorrect message magic” Created: 07/Feb/20  Updated: 21/Feb/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: always_except, ppc
Environment:

PPC clients


Issue Links:
Related
is related to LU-13205 sanity-pfl test 16a fails with “setst... Open
is related to LU-13207 sanity-pfl test 16b crashes in “Oops:... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

With sanity-pfl tests 16a (LU-13205) and 16b (LU-13207) being skipped (on the ALWAYS_EXCEPT list), test 17 now hangs repeatedly in PPC client testing. See
https://testing.whamcloud.com/test_sets/bac5babe-4908-11ea-b58e-52540065bddc
https://testing.whamcloud.com/test_sets/efe243fc-4908-11ea-b69a-52540065bddc
https://testing.whamcloud.com/test_sets/a9f0e7d2-4867-11ea-a1c8-52540065bddc
https://testing.whamcloud.com/test_sets/06861d14-4868-11ea-b69a-52540065bddc

The output for each of these hangs looks different, although the root cause of the crash may be the same. For example, looking at the client1 (77vm7) console log for the hang at https://testing.whamcloud.com/test_sets/a9f0e7d2-4867-11ea-a1c8-52540065bddc, we see

[ 1175.191331] Lustre: DEBUG MARKER: == sanity-pfl test 17: Verify LOVEA grows with more component inited ================================= 20:56:28 (1580936188)
[ 1175.237552] LustreError: 2178:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1
[ 1175.269569] LustreError: 2188:0:(pack_generic.c:1161:lustre_msg_get_transno()) incorrect message magic: d30bd00b
[ 1175.269697] LustreError: 2188:0:(pack_generic.c:1071:lustre_msg_get_opc()) incorrect message magic: d30bd00b (msg:c0000000b55af000)
[ 1175.269781] LustreError: 2188:0:(pack_generic.c:1341:lustre_msg_get_jobid()) incorrect message magic: d30bd00b

<ConMan> Console [trevis-77vm7] disconnected from <trevis-77:6006> at 02-05 22:26.
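For reference, the logged value d30bd00b is exactly LUSTRE_MSG_MAGIC_V2 (0x0BD00BD3) with its bytes reversed, which is consistent with a message buffer that was received from an opposite-endian peer but never byte-swapped on the big-endian PPC client. A standalone check (illustrative constants copied from the Lustre headers, not actual Lustre code):

```c
/* Standalone decode (not Lustre code): shows that the magic value in the
 * console log, d30bd00b, is LUSTRE_MSG_MAGIC_V2 with its bytes reversed,
 * i.e. a message left in the peer's byte order. */
#include <stdint.h>

#define LUSTRE_MSG_MAGIC_V2          0x0BD00BD3
#define LUSTRE_MSG_MAGIC_V2_SWABBED  0xD30BD00B  /* the value in the log */

/* Reverse the byte order of a 32-bit value. */
static uint32_t swab32(uint32_t v) { return __builtin_bswap32(v); }
```

If the accessors in pack_generic.c see the swabbed value, the buffer missed the conversion-to-host-order step somewhere earlier in the receive path.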

Looking at the client1 console log for the hang at https://testing.whamcloud.com/test_sets/bac5babe-4908-11ea-b58e-52540065bddc, we see

[ 1170.068238] Lustre: DEBUG MARKER: == sanity-pfl test 17: Verify LOVEA grows with more component inited ================================= 16:07:54 (1581005274)
[ 1170.164526] LustreError: 2131:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1
[ 1170.185519] LustreError: 2141:0:(pack_generic.c:1161:lustre_msg_get_transno()) incorrect message magic: 00000200
[ 1170.185636] LustreError: 2141:0:(pack_generic.c:1071:lustre_msg_get_opc()) incorrect message magic: 00000200 (msg:c0000000b9a42000)
[ 1170.185719] LustreError: 2141:0:(pack_generic.c:1341:lustre_msg_get_jobid()) incorrect message magic: 00000200
[ 1170.192095] swap_info_get: Bad swap file entry 8000002ac500000
[ 1170.192172] BUG: Bad page map in process lfs  pte:ab14000007010088 pmd:c0000000b9a52000
[ 1170.192226] addr:0000000010000000 vm_flags:00000875 anon_vma:          (null) mapping:c0000000b84c8a20 index:0
[ 1170.192337] vma->vm_ops->fault: ext4_filemap_fault+0x0/0xfffffffffffe74d8 [ext4]
[ 1170.192401] vma->vm_file->f_op->mmap: ext4_file_mmap+0x0/0xfffffffffffe7a78 [ext4]
[ 1170.192457] CPU: 1 PID: 2144 Comm: lfs Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.9.1.el7.ppc64 #1
[ 1170.192530] Call Trace:
[ 1170.192583] [c0000000b42a76a0] [c00000000001f078] .show_stack+0x88/0x330 (unreliable)
[ 1170.192667] [c0000000b42a7760] [c000000000b1c56c] .dump_stack+0x28/0x3c
[ 1170.192730] [c0000000b42a77d0] [c00000000030b018] .print_bad_pte+0x198/0x260
[ 1170.192788] [c0000000b42a7880] [c00000000030c8e4] .unmap_page_range+0x414/0xc20
[ 1170.192863] [c0000000b42a7a40] [c00000000031087c] .unmap_vmas+0xec/0x1c0
[ 1170.192917] [c0000000b42a7af0] [c00000000031e318] .exit_mmap+0xf8/0x250
[ 1170.192972] [c0000000b42a7c20] [c0000000000ecd98] .mmput+0xa8/0x1a0
[ 1170.193026] [c0000000b42a7ca0] [c0000000000ffd10] .do_exit+0x310/0xbb0
[ 1170.193080] [c0000000b42a7da0] [c000000000100774] .SyS_exit_group+0x54/0x100
[ 1170.193134] [c0000000b42a7e30] [c00000000000a288] system_call+0x3c/0x100
[ 1170.193189] Disabling lock debugging due to kernel taint

For other hangs, the console output itself is visibly corrupted. Looking at the hang at https://testing.whamcloud.com/test_sets/06861d14-4868-11ea-b69a-52540065bddc, the most interesting output is from client1 (77vm1):

[ 1190.287068] LustreError: 2154:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1
[ 1190.317013] Oops: Exception in kernel mode, sig: 4 [#1]
[ 1190.317109] SMP NR_CPUS=2048 NUMA pSeries
[ 1190.317155] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt gratc_te
[ 1190.332221] Unrecoverable exception 0 at d000000004362060
[ 1190.332347] Oops: Unrecoverable exception, sig: 6 [#2]
[ 1190.332385] SMP NR_CPUS=2048 NUMA pSeries
[ 1190.332428] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) �(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) () () () () () () () () () () () () () () () () …< several hundred () cut out to keep sane>
 () () () () () () () () () () () () ()
[ 1191.364389] < several hundred corrupted "�()" entries removed >
< the remainder of the console log is unprintable binary garbage, ending in a fragment of a module license string ("...bute it and/or modify it") >
…

and it only gets worse from there.



 Comments   
Comment by Andreas Dilger [ 07/Feb/20 ]

It looks like there is either missing swabbing, or incorrect swabbing, of RPC message buffers. Did these failures (or those of the previously skipped tests) start after a specific patch landed? That would make it relatively easy to track down the source; otherwise it needs to be done by reading through the code.
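The "Invalid magic 0x1" from lustre_swab_lov_comp_md_v1() fits this theory: a correct swab path first checks whether the magic matches the known value in either byte order, and magic 0x1 matches neither, suggesting the buffer was already mangled upstream. A hedged sketch of the usual check-then-swab pattern (the struct layout and the LOV_MAGIC_COMP_V1 value are assumptions taken from lustre_user.h, and swab16/swab32 are stand-ins, not the kernel helpers):

```c
/* Sketch of the generic "swab if peer-endian" pattern; the struct below
 * is a hypothetical reduced layout, not the real lov_comp_md_v1. */
#include <stdint.h>

#define LOV_MAGIC_COMP_V1 0x0BD60BD0  /* assumption: value from lustre_user.h */

struct comp_md {                       /* hypothetical reduced layout */
    uint32_t lcm_magic;
    uint16_t lcm_entry_count;
};

static uint32_t swab32(uint32_t v) { return __builtin_bswap32(v); }
static uint16_t swab16(uint16_t v) { return (uint16_t)((v >> 8) | (v << 8)); }

/* Returns 0 on success (buffer left in host order), -1 on an
 * unrecognized magic -- the "Invalid magic 0x1" case from the log. */
static int check_and_swab(struct comp_md *md)
{
    if (md->lcm_magic == LOV_MAGIC_COMP_V1)
        return 0;                                  /* already host order */
    if (md->lcm_magic == swab32(LOV_MAGIC_COMP_V1)) {
        md->lcm_magic = swab32(md->lcm_magic);     /* convert every field */
        md->lcm_entry_count = swab16(md->lcm_entry_count);
        return 0;
    }
    return -1;  /* e.g. magic 0x1: corrupted before it reached us */
}
```

A buffer that is swabbed twice, or partially, would fail this check just as a never-swabbed one does, which could explain the differing magics (d30bd00b vs. 00000200) across runs.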

Comment by James Nunez (Inactive) [ 10/Feb/20 ]

Here's what I can see; note that this information is only for PPC testing.

It looks like sanity-pfl test 17 failed with "Create /mnt/lustre/d16b.sanity-pfl/f16b.sanity-pfl failed" from before January 2019 through 26 JUL 2019, did not run again until 17 OCT 2019 (when it hung), and hung again on 06 FEB 2020 until it was added to the ALWAYS_EXCEPT list patch.

sanity-pfl test 16b was added on 03 JUN 2019 by patch https://review.whamcloud.com/28425/ and failed with "Create /mnt/lustre/d16b.sanity-pfl/f16b.sanity-pfl failed" until 30 JUL 2019, when it started crashing continuously (with one plain failure on 17 OCT 2019), until it was added to the ALWAYS_EXCEPT list patch on 05 FEB 2020.

Generated at Sat Feb 10 02:59:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.