[LU-13205] sanity-pfl test 16a fails with “setstripe /mnt/lustre/d16.sanity-pfl/f16.sanity-pfl.copy failed“ Created: 05/Feb/20 Updated: 24/Feb/20 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | always_except, ppc | ||
| Environment: |
PPC clients |
||
| Issue Links: |
|
||||||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||||||
| Description |
|
Looking at a recent failure, https://testing.whamcloud.com/test_sets/52073ba0-4715-11ea-b69a-52540065bddc, sanity-pfl test 16a fails with the following in the client test_log == sanity-pfl test 16a: Verify setstripe/getstripe with YAML config file ============================= 05:25:02 (1580793902) CMD: trevis-4vm12 dumpe2fs -h /dev/mapper/mds1_flakey 2>&1 | grep -E -q '(ea_inode|large_xattr)' 1. PFL file getstripe --yaml /mnt/lustre/d16a.sanity-pfl/f16a.sanity-pfl setstripe --yaml=/mnt/lustre/d16a.sanity-pfl/template /mnt/lustre/d16a.sanity-pfl/f16a.sanity-pfl.copy Set stripe size 4096 failed: Invalid argument lfs setstripe: cannot build layout from YAML file /mnt/lustre/d16a.sanity-pfl/template. error: setstripe: can't create composite layout from template file /mnt/lustre/d16a.sanity-pfl/template sanity-pfl test_16a: @@@@@@ FAIL: setstripe /mnt/lustre/d16a.sanity-pfl/f16a.sanity-pfl.copy failed Trace dump: = /usr/lib64/lustre/tests/test-framework.sh:6121:error() = /usr/lib64/lustre/tests/test-framework.sh:9710:verify_yaml_layout() = /usr/lib64/lustre/tests/sanity-pfl.sh:761:test_16a() The only hint of a problem in the console logs is for client 2 (vm9) where we see [ 1212.789487] LustreError: 1758:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1 We’ve seen this same error message in sanity-pfl test 14 failures in LU-13186. Patch https://review.whamcloud.com/#/c/28425/ for sanity-pfl test 16a started failing with this error message on 30 JULY 2019 with Lustre 2.12.56.72 and test 16 started failing for b2_12 on 13 AUG 2019 for Lustre 2.12.2.115. sanity-pfl test 16a fails only for PPC clients and fails 100% of the time for PPC. Logs for recent failures are at |
| Comments |
| Comment by Andreas Dilger [ 10/Feb/20 ] |
|
Based on the date of the start of failures and the area affected, it looks like this was very likely caused by the landing of: commit 9d17996766e0fa93b1029d2422d45d087edde389
CommitDate: Sat Jul 27 00:21:20 2019 +0000
LU-10100 llite: swab LOV EA user data
Many sub-tests failed with "Invalid argument" failures
on PPC client because of the endianness issue.
This patch fixes the issue by adding a common function
lustre_swab_lov_user_md() to swab the LOV EA user data.
Test-Parameters: clientarch=ppc64 envdefinitions=ONLY=27 testlist=sanity
Reviewed-on: https://review.whamcloud.com/35291
|
| Comment by Andreas Dilger [ 10/Feb/20 ] |
|
It seems that "caused" might be a strong word. Prior to the landing of the I'm going to push a debug patch to see if it can identify where the missing swabbing is, which will hopefully allow us to solve the problem. The current issue is that the machine state is so fatally corrupted that the machine reboots without actually printing any stack traces of what is being run. |
| Comment by Gerrit Updater [ 10/Feb/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37494 |
| Comment by Andreas Dilger [ 24/Feb/20 ] |
|
I've captured a stack trace from the crash. It looks like it is calling DEBUG_REQ() in mdc_close() with a bad message buffer: [ 1188.641162] Lustre: DEBUG MARKER: == sanity-pfl test 16b: Verify setstripe/getstripe with YAML config file + overstriping == 23:01:06 (1581548466) [ 1189.084215] LustreError: 3150:0:(pack_generic.c:2488:lustre_swab_lov_comp_md_v1()) lov: invalid magic 0x1 for component 3 [ 1189.087595] LustreError: 3150:0:(pack_generic.c:1199:lustre_msg_get_transno()) incorrect message magic: 00000200 [ 1189.087680] LustreError: 3150:0:(pack_generic.c:1089:lustre_msg_get_opc()) incorrect message magic: 00000200 (msg:c0000000049cb800) [ 1189.087807] ------------[ cut here ]------------ [ 1189.087904] WARNING: CPU: 1 PID: 3150 at lustre/ptlrpc/pack_generic.c:1090 .lustre_msg_get_opc+0x108/0x330 [ptlrpc] CPU: 1 PID: 3150 Comm: lfs Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.9.1.el7.ppc64 #1 Call Trace: .lustre_msg_get_opc+0xf8/0x330 [ptlrpc] (unreliable) ._debug_req+0x674/0x830 [ptlrpc] .mdc_close+0x568/0x1150 [mdc] .lmv_close+0x238/0x580 [lmv] .ll_close_inode_openhandle+0xb7c/0x1380 [lustre] .ll_md_real_close+0x190/0x330 [lustre] .ll_file_release+0x728/0xcf0 [lustre] .ll_dir_release+0x15c/0x210 [lustre] .__fput+0xe8/0x330 |