[LU-15551] Interop: Various tests fail with fallocate “Protocol error” errors Created: 11/Feb/22  Updated: 22/Apr/22  Resolved: 18/Mar/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Blocker
Reporter: James Nunez (Inactive) Assignee: Arshad Hussain
Resolution: Fixed Votes: 0
Labels: interop

Issue Links:
Duplicate
is duplicated by LU-15698 Interop sanity-flr test_50c: fallocat... Resolved
Related
is related to LU-14160 Implement fallocate FALLOCATE_FL_PUNC... Resolved
is related to LU-15748 interop: sanity test_150b: fallocate ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Several tests are failing with the a variety error messages, but they all have the message “do_preallocate: fallocate: Protocol error” . These failures started September 11, 2021.

sanity-benchmark test_fsx and sanityn test_16a, test_16b, test_16c fail with the following in the suite_log:

do_preallocate: fallocate: Protocol error

sanity test_150b fail with the following in the suite_log:

fallocate failed, error Protocol error, mode 0, offset 62914560, len 4194304

sanity-benchmark test_fsx fails with 'fsx failed' starting with 2.14.54.45 servers and 2.14.0 clients; https://testing.whamcloud.com/test_sets/daab9d17-e641-427e-9572-168989a4b23b. The following in the suite_log:

== sanity-benchmark test fsx: fsx ==================================================================== 12:02:50 (1631361770)
keep default fallocate mode: 0
debug=0
Using: /usr/lib64/lustre/tests/fsx -c 50 -p 1000 -S 5151 -P /tmp -l 5461544 	     -N 100000  /mnt/lustre/f0.fsxfile
Chance of close/open is 1 in 50
Seed set to 5151
fsx: test_fallocate: filesystem does not support fallocate mode 0x3, disabling!: Operation not supported
fsx: test_fallocate: filesystem does not support fallocate mode 0x10, disabling!: Operation not supported
truncating to largest ever: 0x3531dc
fallocate: 118b1 to 469048
do_preallocate: fallocate: Protocol error
LOG DUMP (10 total operations):
1: 1631361772.173266 TRUNCATE UP	from 0x00000 to 0x3531dc
2: 1631361772.176123 WRITE    0x475dee thru 0x476c9e (0x00eb1 bytes) HOLE
3: 1631361772.176366 TRUNCATE DOWN	from 0x476c9f to 0x18202b
4: 1631361772.178621 MAPWRITE 0xd455f thru 0xd6920 (0x023c2 bytes)
5: 1631361772.246526 SKIPPED (no operation)
6: 1631361772.246564 READ     0x1b680 thru 0x26b75 (0x0b4f6 bytes)
7: 1631361772.247881 MAPWRITE 0x22e11f thru 0x2313a0 (0x03282 bytes)
8: 1631361772.280197 MAPWRITE 0x374883 thru 0x3810ae (0x0c82c bytes)
9: 1631361772.315333 MAPWRITE 0x4796a2 thru 0x47a8f8 (0x01257 bytes)
10: 1631361772.346572 FALLOC  	from 0x118b1 to 0x1efe9	(0x0d738 bytes)INTERIOR
Correct content saved for comparison
(maybe hexdump "/mnt/lustre/f0.fsxfile" vs "/tmp//f0.fsxfile.fsxgood")
 sanity-benchmark test_fsx: @@@@@@ FAIL: fsx failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
  = /usr/lib64/lustre/tests/sanity-benchmark.sh:196:test_fsx()

sanityn test_16a, test_16b, test_16c fails with 'fsx failed' starting on same data and same client/server versions;

== sanityn test 16a: 12500 iterations of dual-mount fsx ============================================== 18:58:04 (1631386684)
CMD: onyx-23vm7 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize
keep default fallocate mode: 0
7+0 records in
7+0 records out
7340032 bytes (7.3 MB, 7.0 MiB) copied, 0.577856 s, 12.7 MB/s
7+0 records in
7+0 records out
7340032 bytes (7.3 MB, 7.0 MiB) copied, 0.46945 s, 15.6 MB/s
Chance of close/open is 1 in 50
Seed set to 6688
fd 0: /mnt/lustre/f16a.sanityn
fd 1: /mnt/lustre2/f16a.sanityn
fsx: test_fallocate: filesystem does not support fallocate mode 0x3, disabling!: Operation not supported
fsx: test_fallocate: filesystem does not support fallocate mode 0x10, disabling!: Operation not supported
fallocate: 5233c8 to 4b288
do_preallocate: fallocate: Protocol error
LOG DUMP (2 total operations):
1[1]: 1631386688.420122 WRITE    0x561c9c thru 0x56e64f (0x0c9b4 bytes) HOLE
2[0]: 1631386688.427553 FALLOC  	from 0x5233c8 to 0x52350f	(0x00147 bytes)INTERIOR
Correct content saved for comparison
(maybe hexdump "/mnt/lustre/f16a.sanityn" vs "/mnt/lustre/f16a.sanityn.fsxgood")
 sanityn test_16a: @@@@@@ FAIL: fsx failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
  = /usr/lib64/lustre/tests/sanityn.sh:402:test_16a()
…
== sanityn test 16b: 12500 iterations of dual-mount fsx at small size ================================ 18:58:12 (1631386692)
keep default fallocate mode: 0
7+0 records in
7+0 records out
7340032 bytes (7.3 MB, 7.0 MiB) copied, 0.515146 s, 14.2 MB/s
7+0 records in
7+0 records out
7340032 bytes (7.3 MB, 7.0 MiB) copied, 0.453771 s, 16.2 MB/s
Chance of close/open is 1 in 50
Seed set to 6694
fd 0: /mnt/lustre/f16b.sanityn
fd 1: /mnt/lustre2/f16b.sanityn
fsx: test_fallocate: filesystem does not support fallocate mode 0x3, disabling!: Operation not supported
fsx: test_fallocate: filesystem does not support fallocate mode 0x10, disabling!: Operation not supported
skipping zero size read
fallocate: 3a858e to 5e8cc6
do_preallocate: fallocate: Protocol error
LOG DUMP (4 total operations):
1[18446744073708685781]: 1631386695.071343 SKIPPED (no operation)
2[1]: 1631386695.073361 SKIPPED (no operation)
3[1]: 1631386695.073642 WRITE    0x98f7df thru 0x991253 (0x01a75 bytes) HOLE
4[0]: 1631386695.081362 FALLOC  	from 0x3a858e to 0x3aa51d	(0x01f8f bytes)INTERIOR
Correct content saved for comparison
(maybe hexdump "/mnt/lustre/f16b.sanityn" vs "/mnt/lustre/f16b.sanityn.fsxgood")
 sanityn test_16b: @@@@@@ FAIL: fsx with tiny write failed. 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
  = /usr/lib64/lustre/tests/sanityn.sh:431:test_16b()
…
== sanityn test 16c: verify data consistency on ldiskfs with cache disabled (b=17397) ================ 18:58:18 (1631386698)
CMD: onyx-23vm7 /usr/sbin/lctl get_param -n lod.lustre-MDT0000*.stripesize
keep default fallocate mode: 0
7+0 records in
7+0 records out
7340032 bytes (7.3 MB, 7.0 MiB) copied, 0.632607 s, 11.6 MB/s
7+0 records in
7+0 records out
7340032 bytes (7.3 MB, 7.0 MiB) copied, 0.479825 s, 15.3 MB/s
CMD: onyx-23vm7 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
CMD: onyx-23vm7 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: onyx-23vm7 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: onyx-23vm7 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: onyx-23vm7 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: onyx-23vm7 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: onyx-23vm7 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
Waiting for local destroys to complete
CMD: onyx-23vm6 /usr/sbin/lctl get_param -n obdfilter.lustre-OST*.read_cache_enable 		osd-*.lustre-OST*.read_cache_enable 2>&1
pdsh@onyx-23vm4: onyx-23vm6: ssh exited with exit code 2
CMD: onyx-23vm6 /usr/sbin/lctl set_param -n obdfilter.lustre-OST*.read_cache_enable=0 		osd-*.lustre-OST*.read_cache_enable=0 2>&1
pdsh@onyx-23vm4: onyx-23vm6: ssh exited with exit code 2
CMD: onyx-23vm6 /usr/sbin/lctl set_param -n obdfilter.lustre-OST*.writethrough_cache_enable=0 		osd-*.lustre-OST*.writethrough_cache_enable=0 2>&1
pdsh@onyx-23vm4: onyx-23vm6: ssh exited with exit code 2
Chance of close/open is 1 in 50
Seed set to 6716
fd 0: /mnt/lustre/f16c.sanityn
fd 1: /mnt/lustre2/f16c.sanityn
fsx: test_fallocate: filesystem does not support fallocate mode 0x3, disabling!: Operation not supported
fsx: test_fallocate: filesystem does not support fallocate mode 0x10, disabling!: Operation not supported
fallocate: 4154fd to ffdea255
do_preallocate: fallocate: Protocol error
LOG DUMP (5 total operations):
1[18446744073708595669]: 1631386716.729518 SKIPPED (no operation)
2[18446744073708595669]: 1631386716.730419 SKIPPED (no operation)
3[1]: 1631386716.736190 MAPWRITE 0x1f481f thru 0x1ff751 (0x0af33 bytes)
4[1]: 1631386716.938927 SKIPPED (no operation)
5[1]: 1631386716.939194 FALLOC  	from 0x4154fd to 0x4246d1	(0x0f1d4 bytes)INTERIOR
Correct content saved for comparison
(maybe hexdump "/mnt/lustre/f16c.sanityn" vs "/mnt/lustre/f16c.sanityn.fsxgood")
 sanityn test_16c: @@@@@@ FAIL: fsx failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
  = /usr/lib64/lustre/tests/sanityn.sh:462:test_16c()

sanity test_150b fails with a similar message ever since the test was introduced; https://testing.whamcloud.com/test_sets/9ea27d14-cf74-483f-98b4-46fc0efa0146.

== sanity test 150b: Verify fallocate (prealloc) functionality ======================================= 07:55:05 (1623225305)
keep default fallocate mode: 0
fallocate failed, error Protocol error, mode 0, offset 62914560, len 4194304
 sanity test_150b: @@@@@@ FAIL: fallocate failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
  = /usr/lib64/lustre/tests/sanity.sh:13441:test_150b()
 


 Comments   
Comment by Peter Jones [ 11/Feb/22 ]

Arshad

Can you please advise? Does this seem related to LU-14160?

Peter

Comment by Patrick Farrell [ 11/Feb/22 ]

(Specifically https://review.whamcloud.com/40877 )

Comment by Arshad Hussain [ 12/Feb/22 ]

>Can you please advise? Does this seem related to LU-14160?

Hi Peter, I am checking on this.

Comment by Arshad Hussain [ 12/Feb/22 ]

>(Specifically https://review.whamcloud.com/#/c/46221/ )

Hi Patrick, The URL fails to open with "Not found " and/or permisison. Sorry I cannot check this out. I am however, looking into other info in bug report.

Comment by Andreas Dilger [ 12/Feb/22 ]

I've fixed the URL to https://review.whamcloud.com/40877 "LU-14160 fallocate: Add punch mode to fallocate".

It probably needs the server to return -EOPNOTSUPP instead of -EPROTO for unsupported fallocate modes. On the client (for interop) it should convert -EPROTO to -EOPNOTSUPP in fallocate. I thought there was already a patch to do this, but I guess it is not in the right place - https://review.whamcloud.com/41148 "LU-14301 llite: return EOPNOTSUPP if fallocate is not supported", but it looks like that patch is converting -ENOTSUPP to -EOPNOTSUPP and should be enhanced to do the same for -EPROTO?

Comment by Gerrit Updater [ 14/Feb/22 ]

"Arshad Hussain <arshad.hussain@aeoncomputing.com>" uploaded a new patch: https://review.whamcloud.com/46516
Subject: LU-15551 ofd: Return EOPNOTSUPP instead of EPROTO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 454de1abd0de78d628147259f44c3eaabe6b1414

Comment by Arshad Hussain [ 14/Feb/22 ]

Hi Andreas,

Thanks for your comments and analysis. The client is getting -EPROTO from ofd_fallocate_hdl functin (server side) Changed it to return EOPNOTSUPP. The client part looks ok to me. LU-14301 does not need further changes IMO.

Thanks!

Comment by Gerrit Updater [ 18/Mar/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46516/
Subject: LU-15551 ofd: Return EOPNOTSUPP instead of EPROTO
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2f496148c31dd435681e595129834cecdaec82dc

Comment by Peter Jones [ 18/Mar/22 ]

Landed for 2.15

Generated at Sat Feb 10 03:19:17 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.