[LU-15816] sanity test_398m: FAIL: parallel dio write with failure on first stripe succeeded Created: 03/May/22 Updated: 20/Dec/23 Resolved: 13/Dec/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Alena Nikitenko | Assignee: | Neil Brown |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Attachments: |
|
| Severity: | 3 |
| Description |
|
After a clean upgrade from Lustre 2.12.8 on EL7.9 to Lustre 2.15 on EL8.5, sanity test 398m failed with the error "parallel dio write with failure on first stripe succeeded":
== sanity test 398m: test RPC failures with parallel dio ========================================================== 22:57:43 (1651532263)
fail_loc=0x20e
fail_val=1
dd: error writing '/mnt/lustre/f398m.sanity': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 56.4822 s, 0.0 kB/s
fail_loc=0
fail_val=0
8+0 records in
8+0 records out
67108864 bytes (67 MB, 64 MiB) copied, 2.49378 s, 26.9 MB/s
fail_loc=0x20f
fail_val=1
dd: error reading '/mnt/lustre/f398m.sanity': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 56.0012 s, 0.0 kB/s
fail_loc=0
fail_val=0
fail_loc=0x20e
fail_val=2
8+0 records in
8+0 records out
67108864 bytes (67 MB, 64 MiB) copied, 2.83038 s, 23.7 MB/s
sanity test_398m: @@@@@@ FAIL: parallel dio write with failure on first stripe succeeded
Trace dump:
= /lib64/lustre/tests/test-framework.sh:6406:error()
= /lib64/lustre/tests/sanity.sh:24681:test_398m()
= /lib64/lustre/tests/test-framework.sh:6723:run_one()
= /lib64/lustre/tests/test-framework.sh:6770:run_one_logged()
= /lib64/lustre/tests/test-framework.sh:6611:run_test()
= /lib64/lustre/tests/sanity.sh:24697:main()
Dumping lctl log to /tmp/test_logs/2022-05-02/211139/sanity.test_398m.*.1651532382.log
fail_loc=0
fail_loc=0
|
| Comments |
| Comment by Neil Brown [ 25/Nov/22 ] |
|
The error message is misleading: the check that fails is the one that injects a failure on the second stripe, not the first. The test will always fail if the second OST (called OST1 in the script) is on a different host than the first (OST0). This is because "do_facet ost1" is correctly used to access OST0, but it is ALSO used to access OST1, which is wrong whenever OST1 is served by a different host. We need "do_facet ost2" to manage OST1.
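The facet/OST mismatch can be illustrated with a small bash sketch. In test-framework.sh, facet names are 1-based (ost1 is OST0000, ost2 is OST0001), while OST indices are 0-based. The helper ost_facet_for_index is hypothetical, not part of test-framework.sh, and do_facet is replaced by a stub that only prints the command instead of running it on the facet's node. The sketch also assumes the test file's second stripe lives on OST index 1.

```bash
#!/usr/bin/env bash
# Sketch of the facet/OST mapping behind this bug.
# test-framework.sh facet names are 1-based (ost1 = OST0000,
# ost2 = OST0001), while OST indices are 0-based.

# Hypothetical helper (not in test-framework.sh): facet for an OST index.
ost_facet_for_index() {
	local idx=$1
	echo "ost$((idx + 1))"
}

# Stub for test-framework.sh's do_facet. The real helper runs the
# command on the node hosting the facet; this one only prints it.
do_facet() {
	local facet=$1
	shift
	echo "[$facet] $*"
}

# Assumption: stripe N of the test file lives on OST index N.
second_stripe_ost=1

# Buggy: the failure is injected on ost1's host (the node serving OST0000).
do_facet ost1 lctl set_param fail_loc=0x20e
# Fixed: the failure is injected on the node serving OST0001.
do_facet "$(ost_facet_for_index "$second_stripe_ost")" lctl set_param fail_loc=0x20e
# Prints:
# [ost1] lctl set_param fail_loc=0x20e
# [ost2] lctl set_param fail_loc=0x20e
```

When OST0000 and OST0001 share a host, both calls reach the same node, so the buggy form still injects the failure; that is why the test only fails when the two OSTs are on different hosts.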
|
| Comment by Gerrit Updater [ 25/Nov/22 ] |
|
"Neil Brown <neilb@suse.de>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49248 |
| Comment by Gerrit Updater [ 13/Dec/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49248/ |
| Comment by Peter Jones [ 13/Dec/22 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 24/Aug/23 ] |
|
"xinliang <xinliang.liu@linaro.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52066 |
| Comment by Gerrit Updater [ 20/Dec/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52066/ |