[LU-13205] sanity-pfl test 16a fails with “setstripe /mnt/lustre/d16.sanity-pfl/f16.sanity-pfl.copy failed” Created: 05/Feb/20  Updated: 24/Feb/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: always_except, ppc
Environment:

PPC clients


Issue Links:
Duplicate
is duplicated by LU-13207 sanity-pfl test 16b crashes in “Oops:... Open
Related
is related to LU-10100 sanity test_27a: setstripe failed wit... Resolved
is related to LU-13215 sanity-pfl test 17 hangs with “incorr... Open
is related to LU-13186 sanity-pfl test 14 fails with '/mnt/l... Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Looking at a recent failure, https://testing.whamcloud.com/test_sets/52073ba0-4715-11ea-b69a-52540065bddc, sanity-pfl test 16a fails with the following in the client test_log:

== sanity-pfl test 16a: Verify setstripe/getstripe with YAML config file ============================= 05:25:02 (1580793902)
CMD: trevis-4vm12 dumpe2fs -h /dev/mapper/mds1_flakey 2>&1 |
		grep -E -q '(ea_inode|large_xattr)'
1. PFL file
getstripe --yaml /mnt/lustre/d16a.sanity-pfl/f16a.sanity-pfl
setstripe --yaml=/mnt/lustre/d16a.sanity-pfl/template /mnt/lustre/d16a.sanity-pfl/f16a.sanity-pfl.copy
Set stripe size 4096 failed: Invalid argument
lfs setstripe: cannot build layout from YAML file /mnt/lustre/d16a.sanity-pfl/template.
error: setstripe: can't create composite layout from template file /mnt/lustre/d16a.sanity-pfl/template
 sanity-pfl test_16a: @@@@@@ FAIL: setstripe /mnt/lustre/d16a.sanity-pfl/f16a.sanity-pfl.copy failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6121:error()
  = /usr/lib64/lustre/tests/test-framework.sh:9710:verify_yaml_layout()
  = /usr/lib64/lustre/tests/sanity-pfl.sh:761:test_16a()

The only hint of a problem in the console logs is for client 2 (vm9) where we see

[ 1212.789487] LustreError: 1758:0:(pack_generic.c:2447:lustre_swab_lov_comp_md_v1()) Invalid magic 0x1

We’ve seen this same error message in sanity-pfl test 14 failures in LU-13186.
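For context, below is a minimal C sketch of the usual "detect byte order from the magic, then swab" pattern. This is not the Lustre implementation; the struct, constant, and function names are hypothetical. The point is that the layout data arrives little-endian, so on a big-endian PPC client the magic reads back byte-swapped; if a code path skips the swab (or swabs twice), the later validity check sees a nonsense magic like the 0x1 reported above.

/*
 * Hypothetical sketch, not Lustre code: detect-and-swab on the magic.
 */
#include <stdint.h>
#include <stdio.h>

#define DEMO_COMP_MAGIC 0x0BD60BD0u            /* example expected magic */

struct demo_comp_hdr {
	uint32_t dch_magic;
	uint16_t dch_count;
	uint16_t dch_flags;
};

static inline uint32_t swab32(uint32_t v) { return __builtin_bswap32(v); }
static inline uint16_t swab16(uint16_t v) { return (uint16_t)((v >> 8) | (v << 8)); }

static int demo_unpack_hdr(struct demo_comp_hdr *hdr)
{
	if (hdr->dch_magic == swab32(DEMO_COMP_MAGIC)) {
		/* buffer arrived in the other byte order: swab every field */
		hdr->dch_magic = swab32(hdr->dch_magic);
		hdr->dch_count = swab16(hdr->dch_count);
		hdr->dch_flags = swab16(hdr->dch_flags);
	}

	if (hdr->dch_magic != DEMO_COMP_MAGIC) {
		/* a skipped or doubled swab lands here with a garbage value */
		fprintf(stderr, "invalid magic 0x%x\n", hdr->dch_magic);
		return -1;
	}
	return 0;
}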

Patch https://review.whamcloud.com/#/c/28425/ for LU-9846 landed to Lustre 2.12.54 on 01 JUNE 2019 and was not backported to b2_12. This patch moved test_16 to test_16a and created a new test 16b.

sanity-pfl test 16a started failing with this error message on 30 JULY 2019 with Lustre 2.12.56.72, and test 16 started failing on b2_12 on 13 AUG 2019 with Lustre 2.12.2.115.

sanity-pfl test 16a fails only on PPC clients, and on PPC it fails 100% of the time.

Logs for recent failures are at
https://testing.whamcloud.com/test_sets/fd407770-4706-11ea-a1c8-52540065bddc
https://testing.whamcloud.com/test_sets/07588ab8-2592-11ea-80b4-52540065bddc
https://testing.whamcloud.com/test_sets/a2a89ace-1fdb-11ea-adca-52540065bddc



 Comments   
Comment by Andreas Dilger [ 10/Feb/20 ]

Based on the date of the start of failures and the area affected, it looks like this was very likely caused by the landing of:

commit 9d17996766e0fa93b1029d2422d45d087edde389
CommitDate: Sat Jul 27 00:21:20 2019 +0000

    LU-10100 llite: swab LOV EA user data
    
    Many sub-tests failed with "Invalid argument" failures
    on PPC client because of the endianness issue.
    
    This patch fixes the issue by adding a common function
    lustre_swab_lov_user_md() to swab the LOV EA user data.
    
    Test-Parameters: clientarch=ppc64 envdefinitions=ONLY=27 testlist=sanity

    Reviewed-on: https://review.whamcloud.com/35291
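
As a rough illustration of the approach the commit message describes, here is a hedged sketch of a single shared swab entry point that dispatches on the user-md magic. It is not the code from change 35291; all structs, names, and constants are simplified placeholders.

#include <stdint.h>

#define DEMO_MAGIC_V1   0x0BD10BD0u   /* plain striping, example value */
#define DEMO_MAGIC_COMP 0x0BD60BD0u   /* composite (PFL) layout, example value */

struct demo_user_md_v1 {
	uint32_t um_magic;
	uint32_t um_pattern;
	uint32_t um_stripe_size;
	uint16_t um_stripe_count;
	uint16_t um_stripe_offset;
};

static inline void swab32s(uint32_t *v) { *v = __builtin_bswap32(*v); }
static inline void swab16s(uint16_t *v) { *v = (uint16_t)((*v >> 8) | (*v << 8)); }

static void demo_swab_user_md_v1(struct demo_user_md_v1 *um)
{
	swab32s(&um->um_magic);
	swab32s(&um->um_pattern);
	swab32s(&um->um_stripe_size);
	swab16s(&um->um_stripe_count);
	swab16s(&um->um_stripe_offset);
}

/* single entry point shared by every caller that accepts user layouts,
 * so the user data is swabbed exactly once */
static int demo_swab_user_md(void *buf)
{
	uint32_t magic = *(uint32_t *)buf;

	if (magic == __builtin_bswap32(DEMO_MAGIC_V1)) {
		demo_swab_user_md_v1(buf);
		return 0;
	}
	if (magic == __builtin_bswap32(DEMO_MAGIC_COMP)) {
		/* composite layouts need each component entry swabbed too;
		 * omitted in this sketch */
		return 0;
	}
	/* already in CPU order, or genuinely invalid: leave it alone */
	return 0;
}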
Comment by Andreas Dilger [ 10/Feb/20 ]

It seems that "caused" might be a strong word. Prior to the landing of the LU-10100 patch, sanity-pfl just failed every test. After the landing of LU-10100, many of the tests started passing, and test_16a, test_16b, test_17 began crashing, but only because they actually started doing something rather than exiting as soon as the "lfs setstripe" command failed. That means reverting LU-10100 would not improve the situation, as much as revert to a state where 100% of the tests are broken on ppc64 compared to only the PFL tests.

I'm going to push a debug patch to see if it can identify where the missing swabbing is, which will hopefully allow us to solve the problem. The current issue is that the machine state is so fatally corrupted that the machine reboots without actually printing any stack traces of what is being run.

Comment by Gerrit Updater [ 10/Feb/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37494
Subject: LU-13205 lov: debug lov buffer swabbing crash
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e17fe2322008ce7520f4496932ea8638018a4261

Comment by Andreas Dilger [ 24/Feb/20 ]

I've captured a stack trace from the crash. It looks like it is calling DEBUG_REQ() in mdc_close() with a bad message buffer:

[ 1188.641162] Lustre: DEBUG MARKER: == sanity-pfl test 16b: Verify setstripe/getstripe with YAML config file + overstriping == 23:01:06 (1581548466)
[ 1189.084215] LustreError: 3150:0:(pack_generic.c:2488:lustre_swab_lov_comp_md_v1()) lov: invalid magic 0x1 for component 3
[ 1189.087595] LustreError: 3150:0:(pack_generic.c:1199:lustre_msg_get_transno()) incorrect message magic: 00000200
[ 1189.087680] LustreError: 3150:0:(pack_generic.c:1089:lustre_msg_get_opc()) incorrect message magic: 00000200 (msg:c0000000049cb800)
[ 1189.087807] ------------[ cut here ]------------
[ 1189.087904] WARNING: CPU: 1 PID: 3150 at lustre/ptlrpc/pack_generic.c:1090 .lustre_msg_get_opc+0x108/0x330 [ptlrpc]
 CPU: 1 PID: 3150 Comm: lfs Kdump: loaded Tainted: G           OE  ------------   3.10.0-1062.9.1.el7.ppc64 #1
 Call Trace:
 .lustre_msg_get_opc+0xf8/0x330 [ptlrpc] (unreliable)
 ._debug_req+0x674/0x830 [ptlrpc]
 .mdc_close+0x568/0x1150 [mdc]
 .lmv_close+0x238/0x580 [lmv]
 .ll_close_inode_openhandle+0xb7c/0x1380 [lustre]
 .ll_md_real_close+0x190/0x330 [lustre]
 .ll_file_release+0x728/0xcf0 [lustre]
 .ll_dir_release+0x15c/0x210 [lustre]
 .__fput+0xe8/0x330
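
The WARNING in the trace comes from an accessor that dispatches on the message magic. Below is a simplified, hypothetical sketch of that kind of accessor, not the actual ptlrpc code; the point is that once the reply buffer is mis-swabbed or overwritten, any helper that checks the magic (as lustre_msg_get_opc() does when called from DEBUG_REQ() in mdc_close()) falls into the "incorrect message magic" branch. Names and constants below are illustrative only.

#include <stdint.h>
#include <stdio.h>

#define DEMO_MSG_MAGIC_V2 0x0BD00BD3u   /* example wire-format magic */

struct demo_msg {
	uint32_t dm_magic;
	uint32_t dm_opc;
};

static uint32_t demo_msg_get_opc(const struct demo_msg *msg)
{
	switch (msg->dm_magic) {
	case DEMO_MSG_MAGIC_V2:
		return msg->dm_opc;
	default:
		/* corresponds to the "incorrect message magic: ..." console
		 * error plus the WARNING seen in the stack trace above */
		fprintf(stderr, "incorrect message magic: %08x\n",
			msg->dm_magic);
		return 0;
	}
}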