[LU-14205] RHEL8.3: sanity test 398c fails with 'fio write error' Created: 09/Dec/20 Updated: 22/Dec/20 Resolved: 22/Dec/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James Nunez (Inactive) | Assignee: | Wang Shilong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | rhel8.3 |
| Environment: | RHEL8.3 servers/clients |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity test_398c fails with 'fio write error'. This failure is different from LU-13897. (A minimal libaio reproducer sketch follows the console log below.)

== sanity test 398c: run fio to test AIO ============================================================= 21:25:37 (1607462737)
/usr/bin/fio
debug=0
40+0 records in
40+0 records out
41943040 bytes (42 MB, 40 MiB) copied, 0.0712628 s, 589 MB/s
osc.lustre-OST0000-osc-ffff8a2144fe3000.rpc_stats=clear
writing 40M to OST0 by fio with 4 jobs...
rand-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 4 processes
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=2654208, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=2682880, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=2686976, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=2695168, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=3534848, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=3543040, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=3547136, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=3567616, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=3575808, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=3588096, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=3596288, buflen=4096
fio: pid=892877, err=71/file:io_u.c:1803, func=io_u error, error=Protocol error
rand-write: (groupid=0, jobs=1): err= 0: pid=892876: Tue Dec 8 21:28:37 2020
write: IOPS=14, BW=58.4KiB/s (59.8kB/s)(10.0MiB/175370msec); 0 zone resets
slat (usec): min=14, max=321, avg=37.10, stdev=35.86
clat (msec): min=116, max=1944, avg=1095.64, stdev=174.44
lat (msec): min=116, max=1944, avg=1095.68, stdev=174.44
clat percentiles (msec):
| 1.00th=[ 617], 5.00th=[ 877], 10.00th=[ 927], 20.00th=[ 978],
| 30.00th=[ 1011], 40.00th=[ 1053], 50.00th=[ 1083], 60.00th=[ 1116],
| 70.00th=[ 1150], 80.00th=[ 1217], 90.00th=[ 1284], 95.00th=[ 1385],
| 99.00th=[ 1737], 99.50th=[ 1854], 99.90th=[ 1905], 99.95th=[ 1921],
| 99.99th=[ 1938]
bw ( KiB/s): min= 8, max= 128, per=25.59%, avg=59.36, stdev=25.99, samples=343
iops : min= 2, max= 32, avg=14.84, stdev= 6.50, samples=343
lat (msec) : 250=0.04%, 500=0.23%, 750=2.30%, 1000=24.30%, 2000=73.12%
cpu : usr=0.01%, sys=0.05%, ctx=756, majf=0, minf=10
IO depths : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.3%, 16=99.4%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,2560,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
rand-write: (groupid=0, jobs=1): err=71 (file:io_u.c:1803, func=io_u error, error=Protocol error): pid=892877: Tue Dec 8 21:28:37 2020
…
Run status group 0 (all jobs):
WRITE: bw=233KiB/s (239kB/s), 58.2KiB/s-58.4KiB/s (59.6kB/s-59.8kB/s), io=39.0MiB (41.9MB), run=175199-175612msec
sanity test_398c: @@@@@@ FAIL: fio write error
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:6257:error()
= /usr/lib64/lustre/tests/sanity.sh:22338:test_398c()
Looking at the console logs, client1 (vm1) has the error:

[19342.934999] Lustre: DEBUG MARKER: == sanity test 398c: run fio to test AIO ============================================================= 21:25:37 (1607462737)
[19520.867822] LustreError: 558753:0:(osc_request.c:1947:osc_brw_fini_request()) lustre-OST0005-osc-ffff8a2144fe3000: unexpected positive size 1
[19523.015054] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity test_398c: @@@@@@ FAIL: fio write error
[19523.582640] Lustre: DEBUG MARKER: sanity test_398c: @@@@@@ FAIL: fio write error |
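For reference, the failing path can be driven without fio. Below is a minimal libaio sketch of the same kind of 4 KiB AIO write the job issues (rw=randwrite, bs=4096, ioengine=libaio). The mount point and the first failing offset are copied from the log above; the file name and single-I/O shape are illustrative only (fio runs 4 jobs at iodepth=16), and error handling is trimmed. Build with: gcc -o aio_repro aio_repro.c -laio

/*
 * Minimal libaio sketch; hypothetical file name, offset taken from
 * the failure log above.  Not the actual test script.
 */
#include <errno.h>
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/mnt/lustre/f398c.repro", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* 4 KiB-aligned buffer, so the same code works with O_DIRECT. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096))
                return 1;
        memset(buf, 0xab, 4096);

        io_context_t ctx = 0;
        int rc = io_setup(16, &ctx);
        if (rc < 0) {
                fprintf(stderr, "io_setup: %s\n", strerror(-rc));
                return 1;
        }

        /* Submit one 4 KiB write at the first failing offset from the log. */
        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pwrite(&cb, fd, buf, 4096, 2654208);
        if (io_submit(ctx, 1, cbs) != 1) {
                fprintf(stderr, "io_submit failed\n");
                return 1;
        }

        /* Reap the completion; on the failing kernels ev.res comes back
         * as -EPROTO (errno 71), which fio prints as "Protocol error". */
        struct io_event ev;
        if (io_getevents(ctx, 1, 1, &ev, NULL) == 1 && (long)ev.res < 0)
                fprintf(stderr, "aio write: %s\n", strerror(-(long)ev.res));

        io_destroy(ctx);
        close(fd);
        return 0;
}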
| Comments |
| Comment by Peter Jones [ 09/Dec/20 ] |
|
Shilong, could you please advise on this one? Thanks, Peter |
| Comment by James Nunez (Inactive) [ 10/Dec/20 ] |
|
I think we are seeing the same issue for sanity-pfl test_6 at https://testing.whamcloud.com/test_sets/cee68701-50f3-4729-b843-34935b3129d2 :

== sanity-pfl test 6: Migrate composite file ========================================================= 20:02:44 (1607371364)
striped dir -i2 -c2 -H crush /mnt/lustre/d6.sanity-pfl
5+0 records in
5+0 records out
5242880 bytes (5.2 MB, 5.0 MiB) copied, 3.02798 s, 1.7 MB/s
error: lfs migrate: /mnt/lustre/d6.sanity-pfl/f6.sanity-pfl: data copy failed: Protocol error
sanity-pfl test_6: @@@@@@ FAIL: Migrate(compsoite -> composite) /mnt/lustre/d6.sanity-pfl/f6.sanity-pfl failed
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:6257:error()
= /usr/lib64/lustre/tests/sanity-pfl.sh:364:test_6()

and test 18:

== sanity-pfl test 18: check component distribution ================================================== 20:18:46 (1607372326)
dd: error writing '/mnt/lustre/f18.sanity-pfl-1': Protocol error
3+0 records in
2+0 records out
2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.0616448 s, 34.0 MB/s
sanity-pfl test_18: @@@@@@ FAIL: dd failed for /mnt/lustre/f18.sanity-pfl-1
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:6257:error()
= /usr/lib64/lustre/tests/sanity-pfl.sh:982:test_18() |
| Comment by Wang Shilong (Inactive) [ 11/Dec/20 ] |
|
This doesn't look specific to AIO/DIO: test 18 of sanity-pfl is a generic buffered write, so perhaps a write on the server side somehow returned an error. |
| Comment by Gerrit Updater [ 11/Dec/20 ] |
|
Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40944 |
| Comment by John Hammond [ 11/Dec/20 ] |
|
I can see this in a very vanilla setup on RHEL 8.3. Using /bin/cp to copy a 1G file:

[root@trevis-60vm1 lustre]# ls -lh f0
-rw-r--r-- 1 root root 1.0G Dec 11 19:38 f0
[root@trevis-60vm1 lustre]# rm -f f1; echo 3 > /proc/sys/vm/drop_caches
[root@trevis-60vm1 lustre]# /bin/cp f0 f1
/bin/cp: failed to close 'f1': Input/output error
[root@trevis-60vm1 lustre]# dmesg | tail -1
[ 6672.574315] LustreError: 12513:0:(osc_request.c:1947:osc_brw_fini_request()) lustre-OST0003-osc-ffff9d2eaeb92800: unexpected positive size 1 |
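As an aside on why cp reports the error only at close: with buffered I/O, write(2) typically just dirties the page cache, and the write-back failure is surfaced later at fsync(2) or close(2). A minimal sketch of that behavior (hypothetical file name, assuming the same Lustre mount; error handling trimmed):

/* Why /bin/cp fails only at close(2): the buffered write succeeds,
 * and the deferred write-back error shows up at fsync() or close(). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char buf[4096];
        int fd = open("/mnt/lustre/f1", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        memset(buf, 0xab, sizeof(buf));

        /* The buffered write itself usually succeeds: it only copies
         * the data into the page cache. */
        if (write(fd, buf, sizeof(buf)) < 0)
                perror("write");

        /* fsync() forces write-back and reports any deferred error... */
        if (fsync(fd) < 0)
                perror("fsync");

        /* ...and close() is the last chance to see it; this is the
         * "failed to close 'f1'" that cp prints. */
        if (close(fd) < 0)
                perror("close");

        return 0;
}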
| Comment by Wang Shilong (Inactive) [ 16/Dec/20 ] |
|
jhammond I tried your steps without any luck; would you try the above patch to see if it fixes your problem? |
| Comment by Gerrit Updater [ 16/Dec/20 ] |
|
Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40989 |
| Comment by Wang Shilong (Inactive) [ 16/Dec/20 ] |
|
jhammond Never mind, I reproduced the problem and finally figured it out. |
| Comment by John Hammond [ 16/Dec/20 ] |
|
wshilong are both changes needed? |
| Comment by Wang Shilong (Inactive) [ 17/Dec/20 ] |
|
jhammond Actually, the second patch fixes the root cause of your problem. |
| Comment by John Hammond [ 17/Dec/20 ] |
|
wshilong Right. So should the first patch be landed or abandoned? |
| Comment by Wang Shilong (Inactive) [ 17/Dec/20 ] |
|
The first patch does fix a bug as well, so it should also be landed. |
| Comment by Gerrit Updater [ 17/Dec/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40944/ |
| Comment by Gerrit Updater [ 22/Dec/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40989/ |