[LU-10100] sanity test_27a: setstripe failed with "error on ioctl 0x8008669a for '*' (3): Invalid argument" Created: 06/Oct/17  Updated: 10/Feb/20  Resolved: 20/Aug/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.1
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug
Priority: Critical
Reporter: James Casper
Assignee: Jian Yu
Resolution: Fixed
Votes: 0
Labels: ppc
Environment:

trevis, full, x86_64 servers, ppc clients
servers: el7.4, ldiskfs, branch master, v2.10.53.1, b3642
clients: el7.4, branch master, v2.10.53.1, b3642


Issue Links:
Related
is related to LU-2590 lfs setstripe broken on ppc Resolved
is related to LU-10094 sanity test_17f: 'ls' fails with "ls:... Resolved
is related to LU-12589 sanity test_102a: setfattr: /mnt/lust... Resolved
is related to LU-10097 sanity test_24u: error() without usef... Closed
is related to LU-10984 ‘lfs setstripe’ fails with “lfs setst... Closed
is related to LU-12673 swab big-endian format layout receive... Open
is related to LU-13205 sanity-pfl test 16a fails with “setst... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.whamcloud.com/test_sessions/ba995751-659c-4e63-9b5b-fbf101137b78

From test_log:

stripe_count:  1 stripe_size:   1048576 stripe_offset: -1
error on ioctl 0x8008669a for '/mnt/lustre/d27/f0' (3): Invalid argument
error: setstripe: create striped file '/mnt/lustre/d27/f0' failed: Invalid argument
 sanity test_27a: @@@@@@ FAIL: setstripe failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5289:error()
  = /usr/lib64/lustre/tests/sanity.sh:1357:test_27a()
  = /usr/lib64/lustre/tests/test-framework.sh:5565:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5604:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5451:run_test()
  = /usr/lib64/lustre/tests/sanity.sh:1361:main()


 Comments   
Comment by James Nunez (Inactive) [ 02/May/18 ]

Several sanity tests fail at 'lfs setstripe' on PPC clients, including test_27a, 27b, 27d, 27e, 27k, 27r, 27w, 27wa, 27z, 27C, 27E, 56t, 56u, 56x, 56xa, 56xb, 65k, 78, 101b, …

In the full test group results, these tests first fail for PPC on 2017-09-17 20:43:36 UTC, for master build #3642, version 2.10.53.1.

Comment by James Nunez (Inactive) [ 29/Apr/19 ]

A similar setstripe error is seen on ppc in sanityn test 16a: https://testing.whamcloud.com/test_sets/7d68fa34-668f-11e9-8bb1-52540065bddc

== sanityn test 16a: 12500 iterations of dual-mount fsx ============================================== 03:45:53 (1555991153)
CMD: trevis-55vm12 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize
lfs setstripe: setstripe error for '/mnt/lustre/f16a.sanityn': Invalid argument
7+0 records in
7+0 records out
7340032 bytes (7.3 MB) copied, 1.41347 s, 5.2 MB/s
7+0 records in
7+0 records out
7340032 bytes (7.3 MB) copied, 0.707808 s, 10.4 MB/s
lfs setstripe: setstripe error for '/mnt/lustre/f16a.sanityn': Invalid argument
Chance of close/open is 1 in 50
...
dowrite: write: Invalid argument
LOG DUMP (1 total operations):
1[0]: 1555991453.674708 WRITE    0x237000 thru 0x238fff (0x2000 bytes) HOLE
Correct content saved for comparison
(maybe hexdump "/mnt/lustre/f16a.sanityn" vs "/mnt/lustre/f16a.sanityn.fsxgood")
 sanityn test_16a: @@@@@@ FAIL: fsx with O_DIRECT failed. 
Comment by James Nunez (Inactive) [ 30/Apr/19 ]

A similar issue is seen in sanity-hsm: tests 11a, 11b, 12a, 12b, 12c, 12n, 13, 25a, 27a, 30a, 31a, 72, 77, 110a, 111a, 201, 222a, 222c, and 223a fail with

trevis-77vm2: lhsmtool_posix: 1555997033.356277 lhsmtool_posix[26740]: importing '/mnt/lustre/d11a.sanity-hsm/f11a.sanity-hsm' from '/tmp/arc1/sanity-hsm.test_11a//d11a.sanity-hsm/f11a.sanity-hsm'
trevis-77vm2: lhsmtool_posix: setstripe error for '/mnt/lustre/d11a.sanity-hsm/f11a.sanity-hsm': Invalid argument
trevis-77vm2: lhsmtool_posix: cannot create '/mnt/lustre/d11a.sanity-hsm/f11a.sanity-hsm' for import: Invalid argument (22)
trevis-77vm2: lhsmtool_posix: 1555997033.361745 lhsmtool_posix[26740]: cannot import '/mnt/lustre/d11a.sanity-hsm/f11a.sanity-hsm' from '/tmp/arc1/sanity-hsm.test_11a//d11a.sanity-hsm/f11a.sanity-hsm': Invalid argument (22)
trevis-77vm2: lhsmtool_posix: 1555997033.361757 lhsmtool_posix[26740]: process finished, errs: 0 major, 0 minor, rc=-22 (Invalid argument)
 sanity-hsm test_11a: @@@@@@ FAIL: Failed to import 'd11a.sanity-hsm/f11a.sanity-hsm' to '/mnt/lustre/d11a.sanity-hsm/f11a.sanity-hsm' 
Comment by James Nunez (Inactive) [ 30/Apr/19 ]

sanity-flr test_0a, 0b, 0c, 0e, 0f, 0g, 1, and 42 all fail with 'create mirrored file /mnt/lustre/d0a.sanity-flr/f*.sanity-flr failed'. We see this for PPC client testing only.

Looking at the suite_log for a recent failure, https://testing.whamcloud.com/test_sets/a1786810-668f-11e9-8bb1-52540065bddc, we see all these tests fail with an ‘Invalid argument’ message

== sanity-flr test 0a: lfs mirror create with -N option ============================================== 06:36:59 (1556001419)
lfs mirror create: cannot create composite file '/mnt/lustre/d0a.sanity-flr/f0a.sanity-flr': Invalid argument
 sanity-flr test_0a: @@@@@@ FAIL: create mirrored file /mnt/lustre/d0a.sanity-flr/f0a.sanity-flr failed 

sanity-flr test_0d has a similar failure

== sanity-flr test 0d: lfs mirror extend with -N option ============================================== 06:39:48 (1556001588)
lfs mirror extend: cannot create composite file '/mnt/lustre/d0d.sanity-flr/.:VOLATILE:0000:70DEDB39': Invalid argument
error: lfs mirror extend: /mnt/lustre/d0d.sanity-flr/f0d.sanity-flr: cannot create volatile file: Operation not permitted
 sanity-flr test_0d: @@@@@@ FAIL: convert and extend /mnt/lustre/d0d.sanity-flr/f0d.sanity-flr failed 
Comment by Andreas Dilger [ 30/May/19 ]

I did some investigation into the debug logs of one of the many failed tests to see where the problem is coming from. It appears that the client is able to create the file with open(O_LOV_DELAY_CREATE) and then calls ioctl(LL_IOC_LOV_SETSTRIPE), but the MDS returns -EINVAL without much information in the logs (I don't think the debug=-1 mask is being set on the MDS for sanity.sh):

mdc_finish_enqueue() @@@ op: 1 disposition: 3, status: -22  req@c000000074ed5100 x1634660749735520/t0(0) o101->lustre-MDT0000-mdc-c00000007457a800@10.9.5.36@tcp:12/10 lens 648/568 e 0 to 0 dl 1558936451 ref 1 fl Complete:R/0/0 rc 301/301
mdc_finish_intent_lock() D_IT dentry  intent: open status -22 disp 3 rc -22
mdc_intent_lock() Process leaving (rc=-22)
ll_intent_file_open() lock enqueue: err: -22
ll_intent_file_open() Process leaving via out (rc=-22)
ll_lov_setstripe_ea_info() Process leaving via out_unlock (rc=-22)
ll_lov_setstripe() Process leaving (rc=-22)
ll_file_ioctl() Process leaving (rc=-22)

so it definitely seems that the MDS is not swabbing part or all of the incoming request and/or the client is not doing the same.

My preference would be to fix this on both ends, if possible and depending on what the problem is, so that we have maximum coverage for new/old clients talking to old/new servers.

Comment by Andreas Dilger [ 31/May/19 ]

Looking on the MDS I see that it is already failing when checking the lmm_magic:

lod_verify_striping()) Process entered
lod_verify_striping()) bad userland LOV MAGIC: 0xd00bd10b
lod_verify_striping()) Process leaving (rc= -22)
lod_qos_parse_config()) Process leaving (rc= -22)
lod_prepare_create()) Process leaving (rc= -22)
lod_declare_striped_create()) Process leaving via out (rc= -22)
lod_declare_xattr_set()) Process leaving (rc= -22)
mdd_create_data()) Process leaving via stop (rc= -22)
mdt_mfd_open()) Process leaving (rc= -22)
mdt_finish_open()) Process leaving (rc= -22)
mdt_open_by_fid_lock()) Process leaving via out_unlock (rc= -22)
mdt_reint_open()) no object for [0x200000405:0x15:0x0]: -22
Comment by Patrick Farrell (Inactive) [ 31/May/19 ]

Andreas didn't call this out specifically (if your brain works the right way, I guess it's obvious), but this is an endianness/lack-of-swabbing issue:

lod_verify_striping()) bad userland LOV MAGIC: 0xd00bd10b 

Which we can see because the magic is:

#define LOV_MAGIC_MAGIC 0x0BD0
#define LOV_MAGIC_V1 (0x0BD10000 | LOV_MAGIC_MAGIC) 
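
As a quick check of the arithmetic (a sketch added for illustration, not code from the Lustre tree): LOV_MAGIC_V1 expands to 0x0BD10BD0, and reversing its four bytes gives exactly the 0xd00bd10b reported by lod_verify_striping(). The helper name demo_swab32() is hypothetical.

```c
#include <stdint.h>

#define LOV_MAGIC_MAGIC 0x0BD0
#define LOV_MAGIC_V1    (0x0BD10000 | LOV_MAGIC_MAGIC)

/* Reverse the byte order of a 32-bit value.
 * Illustrative helper only, not a Lustre API. */
static uint32_t demo_swab32(uint32_t v)
{
	return ((v & 0x000000ffU) << 24) |
	       ((v & 0x0000ff00U) <<  8) |
	       ((v & 0x00ff0000U) >>  8) |
	       ((v & 0xff000000U) >> 24);
}
```

Calling demo_swab32(LOV_MAGIC_V1) yields 0xd00bd10b, which is what a little-endian MDS reads when a big-endian client sends the magic without conversion.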
Comment by Peter Jones [ 31/May/19 ]

Jian

Can you please follow up on this?

Thanks

Peter

Comment by Jian Yu [ 21/Jun/19 ]

In llapi_file_open_param(), when setting lmm_magic, we need to use cpu_to_le32() to convert the value into little-endian form. I'm creating the patch.
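
For illustration only, a userspace sketch of what such a conversion does to the magic. It assumes glibc's htole32() from <endian.h> as a stand-in for cpu_to_le32(), and demo_put_magic_le() is a hypothetical helper, not the actual llapi_file_open_param() change.

```c
#define _DEFAULT_SOURCE   /* expose htole32() in glibc's <endian.h> */
#include <endian.h>
#include <stdint.h>
#include <string.h>

#define LOV_MAGIC_V1 0x0BD10BD0U   /* value quoted in this ticket */

/* Store the magic into out[0..3] in little-endian byte order,
 * whatever the byte order of the host CPU. */
static void demo_put_magic_le(unsigned char out[4])
{
	uint32_t wire = htole32(LOV_MAGIC_V1);

	memcpy(out, &wire, sizeof(wire));
}
```

On both x86_64 and ppc hosts the buffer ends up holding the bytes d0 0b d1 0b, so a little-endian reader sees 0x0BD10BD0 rather than the swapped value.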

Comment by Gerrit Updater [ 22/Jun/19 ]

Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35291
Subject: LU-10100 utils: convert magic number into little-endian form
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dd838b085d88d25f1091b12b88eb160dc8bf7edc

Comment by Jian Yu [ 23/Jun/19 ]

Besides lmm_magic, other fields in struct lov_user_md also need to be converted.
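
A sketch of what converting the remaining fields could look like. struct demo_lum is a simplified, hypothetical stand-in for struct lov_user_md (the real structure has more fields and is not reproduced here), and htole32()/htole16() from glibc's <endian.h> stand in for the cpu_to_le*() helpers.

```c
#define _DEFAULT_SOURCE   /* expose htole32()/htole16() in glibc */
#include <endian.h>
#include <stdint.h>

/* Hypothetical, simplified layout descriptor for illustration. */
struct demo_lum {
	uint32_t magic;
	uint32_t pattern;
	uint32_t stripe_size;
	uint16_t stripe_count;
	uint16_t stripe_offset;
};

/* Convert every multi-byte field from CPU order to little-endian.
 * This is a no-op on little-endian hosts and a byte swap on
 * big-endian ones. */
static void demo_lum_cpu_to_le(struct demo_lum *lum)
{
	lum->magic         = htole32(lum->magic);
	lum->pattern       = htole32(lum->pattern);
	lum->stripe_size   = htole32(lum->stripe_size);
	lum->stripe_count  = htole16(lum->stripe_count);
	lum->stripe_offset = htole16(lum->stripe_offset);
}
```

With the values from the test_27a log (stripe_size 1048576, stripe_count 1, stripe_offset -1), the converted stripe_size is stored as the bytes 00 00 10 00 on either architecture. Converting only the magic would leave the other fields in host order on ppc.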

Comment by Andreas Dilger [ 24/Jul/19 ]

Jian, can you please also make a separate patch for the MDS to swab the layout received from the client if it is in big-endian format. This will simplify interop for deployment on systems where there are lots of PPC clients that have not been upgraded.
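
One way such a server-side check could be structured, sketched under assumptions: if the magic read by the server equals the byte-swapped form of a known magic, the layout came from an unconverted big-endian client and can be swabbed in place. struct demo_lum and all demo_* names are hypothetical, and this is not the actual MDS patch.

```c
#include <errno.h>
#include <stdint.h>

#define LOV_MAGIC_V1 0x0BD10BD0U   /* value quoted in this ticket */

/* Hypothetical, simplified layout descriptor for illustration. */
struct demo_lum {
	uint32_t magic;
	uint32_t pattern;
	uint32_t stripe_size;
	uint16_t stripe_count;
	uint16_t stripe_offset;
};

static uint16_t demo_swab16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

static uint32_t demo_swab32(uint32_t v)
{
	return ((v & 0x000000ffU) << 24) | ((v & 0x0000ff00U) << 8) |
	       ((v & 0x00ff0000U) >> 8)  | ((v & 0xff000000U) >> 24);
}

/* Returns 0 if the layout is already in the expected byte order,
 * 1 if it was byte-swapped in place, -EINVAL for an unknown magic. */
static int demo_lum_fix_byte_order(struct demo_lum *lum)
{
	if (lum->magic == LOV_MAGIC_V1)
		return 0;
	if (lum->magic != demo_swab32(LOV_MAGIC_V1))
		return -EINVAL;

	lum->magic         = demo_swab32(lum->magic);
	lum->pattern       = demo_swab32(lum->pattern);
	lum->stripe_size   = demo_swab32(lum->stripe_size);
	lum->stripe_count  = demo_swab16(lum->stripe_count);
	lum->stripe_offset = demo_swab16(lum->stripe_offset);
	return 1;
}
```

Accepting the swapped magic instead of rejecting it with -EINVAL is what would let clients that have not been upgraded keep working against a patched server.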

Comment by Jian Yu [ 24/Jul/19 ]

Sure, Andreas. Let me work on this.

Comment by Gerrit Updater [ 27/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35291/
Subject: LU-10100 llite: swab LOV EA user data
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9d17996766e0fa93b1029d2422d45d087edde389

Comment by Gerrit Updater [ 29/Jul/19 ]

Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35633
Subject: LU-10100 llite: swab LOV EA user data
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 4df8c490ac6e4c40043e1e4dc991617e6cb61599

Comment by Gerrit Updater [ 11/Aug/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35633/
Subject: LU-10100 llite: swab LOV EA user data
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 7995349a2032f39ed83005b977462523864d72fa

Comment by Jian Yu [ 20/Aug/19 ]

The client patch has landed for Lustre 2.13.0. The MDS patch will be worked on in LU-12673.
