[LU-15816] sanity test_398m: FAIL: parallel dio write with failure on first stripe succeeded Created: 03/May/22  Updated: 20/Dec/23  Resolved: 13/Dec/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.16.0, Lustre 2.15.4

Type: Bug Priority: Minor
Reporter: Alena Nikitenko Assignee: Neil Brown
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File sanity.test_398m.debug_log.trevis-87vm2.1651532382.log     Text File sanity.test_398m.debug_log.trevis-87vm3.1651532382.log     Text File sanity.test_398m.debug_log.trevis-87vm4.1651532382.log     Text File sanity.test_398m.debug_log.trevis-87vm5.1651532382.log     Text File sanity.test_398m.dmesg.trevis-87vm2.1651532382.log     Text File sanity.test_398m.dmesg.trevis-87vm3.1651532382.log     Text File sanity.test_398m.dmesg.trevis-87vm4.1651532382.log     Text File sanity.test_398m.dmesg.trevis-87vm5.1651532382.log    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After a clean upgrade from Lustre 2.12.8 on el7.9 to Lustre 2.15 on el8.5, sanity test 398m failed with the error:

parallel dio write with failure on first stripe succeeded

 

== sanity test 398m: test RPC failures with parallel dio ========================================================== 22:57:43 (1651532263)
fail_loc=0x20e
fail_val=1
dd: error writing '/mnt/lustre/f398m.sanity': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 56.4822 s, 0.0 kB/s
fail_loc=0
fail_val=0
8+0 records in
8+0 records out
67108864 bytes (67 MB, 64 MiB) copied, 2.49378 s, 26.9 MB/s
fail_loc=0x20f
fail_val=1
dd: error reading '/mnt/lustre/f398m.sanity': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 56.0012 s, 0.0 kB/s
fail_loc=0
fail_val=0
fail_loc=0x20e
fail_val=2
8+0 records in
8+0 records out
67108864 bytes (67 MB, 64 MiB) copied, 2.83038 s, 23.7 MB/s
 sanity test_398m: @@@@@@ FAIL: parallel dio write with failure on first stripe succeeded
  Trace dump:
  = /lib64/lustre/tests/test-framework.sh:6406:error()
  = /lib64/lustre/tests/sanity.sh:24681:test_398m()
  = /lib64/lustre/tests/test-framework.sh:6723:run_one()
  = /lib64/lustre/tests/test-framework.sh:6770:run_one_logged()
  = /lib64/lustre/tests/test-framework.sh:6611:run_test()
  = /lib64/lustre/tests/sanity.sh:24697:main()
Dumping lctl log to /tmp/test_logs/2022-05-02/211139/sanity.test_398m.*.1651532382.log
fail_loc=0
fail_loc=0

 



 Comments   
Comment by Neil Brown [ 25/Nov/22 ]

The error message is misleading - it is the second stripe that fails the test.

It will always fail if the second OST (called OST1 in the script) is on a different host than the first (OST0).

This is because, while "do_facet ost1" is correctly used to access OST0, it is ALSO used to access OST1, which is wrong whenever OST1 is served by a different host.

We need "do_facet ost2" to manage OST1.
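
The off-by-one between facet names and OST indices can be illustrated with a small sketch. This is not the test code: do_facet below is a stand-in stub that only prints what would run where, the host names oss-a and oss-b and the helpers host_of_facet and facet_for_ost_index are hypothetical, and the fail_loc/fail_val values are simply the ones shown in the test log above.

```shell
# Hypothetical layout: each OST is served by a different host.
host_of_facet() {
	case "$1" in
	ost1) echo oss-a ;;	# assumed host of OST0
	ost2) echo oss-b ;;	# assumed host of OST1
	*) echo unknown ;;
	esac
}

# Stand-in stub: the real do_facet in test-framework.sh runs the command
# on the node hosting the facet; this one only prints "host: command".
do_facet() {
	facet=$1
	shift
	echo "$(host_of_facet "$facet"): $*"
}

# Facet names are 1-based while OST indices are 0-based, so OST index N
# is managed through facet ost(N+1).
facet_for_ost_index() {
	echo "ost$(($1 + 1))"
}

# Set the failure on the host that actually serves each stripe.
do_facet "$(facet_for_ost_index 0)" lctl set_param fail_loc=0x20e fail_val=1
do_facet "$(facet_for_ost_index 1)" lctl set_param fail_loc=0x20e fail_val=2
```

With both OSTs on one host the two facets resolve to the same node, which would explain why the wrong facet name goes unnoticed in single-OSS test setups.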

 

Comment by Gerrit Updater [ 25/Nov/22 ]

"Neil Brown <neilb@suse.de>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49248
Subject: LU-15816 tests: use correct ost host to manage failure
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5ea236e3b450fea6aea7eab4c4a7d5333de6436f

Comment by Gerrit Updater [ 13/Dec/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49248/
Subject: LU-15816 tests: use correct ost host to manage failure
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6e66cbdb5c8c08193c36262649667747127b6d90

Comment by Peter Jones [ 13/Dec/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 24/Aug/23 ]

"xinliang <xinliang.liu@linaro.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52066
Subject: LU-15816 tests: use correct ost host to manage failure
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: cd866dcf14c18dd650d407fedbdaa18c4bb1d2ac

Comment by Gerrit Updater [ 20/Dec/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52066/
Subject: LU-15816 tests: use correct ost host to manage failure
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 5db1bc57996b674b3df19a1ae0ee6b20f4668586
