[LU-14205] RHEL8.3: sanity test 398c fails with 'fio write error' Created: 09/Dec/20  Updated: 22/Dec/20  Resolved: 22/Dec/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Blocker
Reporter: James Nunez (Inactive) Assignee: Wang Shilong (Inactive)
Resolution: Fixed Votes: 0
Labels: rhel8.3
Environment:

RHEL8.3 servers/clients


Issue Links:
Duplicate
is duplicated by LU-14202 sanity-sec test 16 fails with 'unable... Resolved
is duplicated by LU-14200 ost-pools test 23b fails with 'dd did... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test_398c fails with 'fio write error'. This failure is different from LU-13897.
We've seen this failure only once, in RHEL8.3 server/client testing at https://testing.whamcloud.com/test_sets/fe2f2ee3-067d-431b-b69c-01858086821c. Looking at the suite_log output, we see a couple of fio calls fail with 'err=71':

== sanity test 398c: run fio to test AIO ============================================================= 21:25:37 (1607462737)
/usr/bin/fio
debug=0
40+0 records in
40+0 records out
41943040 bytes (42 MB, 40 MiB) copied, 0.0712628 s, 589 MB/s
osc.lustre-OST0000-osc-ffff8a2144fe3000.rpc_stats=clear
writing 40M to OST0 by fio with 4 jobs...
rand-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 4 processes
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=2654208, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=2682880, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=2686976, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=2695168, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=3534848, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=3543040, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=3547136, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=3567616, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=3575808, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=3588096, buflen=4096
fio: io_u error on file /mnt/lustre/f398c.sanity: Protocol error: write offset=3596288, buflen=4096
fio: pid=892877, err=71/file:io_u.c:1803, func=io_u error, error=Protocol error

rand-write: (groupid=0, jobs=1): err= 0: pid=892876: Tue Dec  8 21:28:37 2020
  write: IOPS=14, BW=58.4KiB/s (59.8kB/s)(10.0MiB/175370msec); 0 zone resets
    slat (usec): min=14, max=321, avg=37.10, stdev=35.86
    clat (msec): min=116, max=1944, avg=1095.64, stdev=174.44
     lat (msec): min=116, max=1944, avg=1095.68, stdev=174.44
    clat percentiles (msec):
     |  1.00th=[  617],  5.00th=[  877], 10.00th=[  927], 20.00th=[  978],
     | 30.00th=[ 1011], 40.00th=[ 1053], 50.00th=[ 1083], 60.00th=[ 1116],
     | 70.00th=[ 1150], 80.00th=[ 1217], 90.00th=[ 1284], 95.00th=[ 1385],
     | 99.00th=[ 1737], 99.50th=[ 1854], 99.90th=[ 1905], 99.95th=[ 1921],
     | 99.99th=[ 1938]
   bw (  KiB/s): min=    8, max=  128, per=25.59%, avg=59.36, stdev=25.99, samples=343
   iops        : min=    2, max=   32, avg=14.84, stdev= 6.50, samples=343
  lat (msec)   : 250=0.04%, 500=0.23%, 750=2.30%, 1000=24.30%, 2000=73.12%
  cpu          : usr=0.01%, sys=0.05%, ctx=756, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.3%, 16=99.4%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2560,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
rand-write: (groupid=0, jobs=1): err=71 (file:io_u.c:1803, func=io_u error, error=Protocol error): pid=892877: Tue Dec  8 21:28:37 2020
…
Run status group 0 (all jobs):
  WRITE: bw=233KiB/s (239kB/s), 58.2KiB/s-58.4KiB/s (59.6kB/s-59.8kB/s), io=39.0MiB (41.9MB), run=175199-175612msec
 sanity test_398c: @@@@@@ FAIL: fio write error 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6257:error()
  = /usr/lib64/lustre/tests/sanity.sh:22338:test_398c()

Looking at the console logs, client1 (vm1) has an error:

[19342.934999] Lustre: DEBUG MARKER: == sanity test 398c: run fio to test AIO ============================================================= 21:25:37 (1607462737)
[19520.867822] LustreError: 558753:0:(osc_request.c:1947:osc_brw_fini_request()) lustre-OST0005-osc-ffff8a2144fe3000: unexpected positive size 1
[19523.015054] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_398c: @@@@@@ FAIL: fio write error 
[19523.582640] Lustre: DEBUG MARKER: sanity test_398c: @@@@@@ FAIL: fio write error


 Comments   
Comment by Peter Jones [ 09/Dec/20 ]

Shilong

Could you please advise on this one?

Thanks

Peter

Comment by James Nunez (Inactive) [ 10/Dec/20 ]

I think we are seeing the same issue for sanity-pfl test_6 at https://testing.whamcloud.com/test_sets/cee68701-50f3-4729-b843-34935b3129d2 :

== sanity-pfl test 6: Migrate composite file ========================================================= 20:02:44 (1607371364)
striped dir -i2 -c2 -H crush /mnt/lustre/d6.sanity-pfl
5+0 records in
5+0 records out
5242880 bytes (5.2 MB, 5.0 MiB) copied, 3.02798 s, 1.7 MB/s
error: lfs migrate: /mnt/lustre/d6.sanity-pfl/f6.sanity-pfl: data copy failed: Protocol error
 sanity-pfl test_6: @@@@@@ FAIL: Migrate(compsoite -> composite) /mnt/lustre/d6.sanity-pfl/f6.sanity-pfl failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6257:error()
  = /usr/lib64/lustre/tests/sanity-pfl.sh:364:test_6()

and test 18

== sanity-pfl test 18: check component distribution ================================================== 20:18:46 (1607372326)
dd: error writing '/mnt/lustre/f18.sanity-pfl-1': Protocol error
3+0 records in
2+0 records out
2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.0616448 s, 34.0 MB/s
 sanity-pfl test_18: @@@@@@ FAIL: dd failed for /mnt/lustre/f18.sanity-pfl-1 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6257:error()
  = /usr/lib64/lustre/tests/sanity-pfl.sh:982:test_18()
Comment by Wang Shilong (Inactive) [ 11/Dec/20 ]

This does not look specific to AIO/DIO; test 18 of sanity-pfl is a generic buffered write, so perhaps a write on the server side somehow returns an error.

Comment by Gerrit Updater [ 11/Dec/20 ]

Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40944
Subject: LU-14205 osd-ldiskfs: return correct error after end io
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 296545189f9ebeccf0bcff8b068f2d87ce78b1ed

Comment by John Hammond [ 11/Dec/20 ]

I can see this in a very vanilla setup on RHEL 8.3. Using /bin/cp to copy a 1G file:

[root@trevis-60vm1 lustre]# ls -lh f0
-rw-r--r-- 1 root root 1.0G Dec 11 19:38 f0
[root@trevis-60vm1 lustre]# rm -f f1; echo 3 > /proc/sys/vm/drop_caches
[root@trevis-60vm1 lustre]# /bin/cp f0 f1
/bin/cp: failed to close 'f1': Input/output error
[root@trevis-60vm1 lustre]# dmesg | tail -1
[ 6672.574315] LustreError: 12513:0:(osc_request.c:1947:osc_brw_fini_request()) lustre-OST0003-osc-ffff9d2eaeb92800: unexpected positive size 1
Comment by Wang Shilong (Inactive) [ 16/Dec/20 ]

jhammond I tried your steps without any luck; would you try the above patch and see if it fixes your problem?

Comment by Gerrit Updater [ 16/Dec/20 ]

Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40989
Subject: LU-14205 osd-ldiskfs: fix return of osd_extend_restart_trans()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: df32229ddafdd4696c33e8790b93a3db57997de7

Comment by Wang Shilong (Inactive) [ 16/Dec/20 ]

jhammond Never mind, I reproduced the problem and finally figured it out.

Comment by John Hammond [ 16/Dec/20 ]

wshilong are both changes needed?

Comment by Wang Shilong (Inactive) [ 17/Dec/20 ]

jhammond Actually, the second patch addresses the root cause of your problem.

Comment by John Hammond [ 17/Dec/20 ]

wshilong Right. So should the first patch be landed or abandoned?

Comment by Wang Shilong (Inactive) [ 17/Dec/20 ]

The first patch does fix a real bug, so it should be landed as well.

Comment by Gerrit Updater [ 17/Dec/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40944/
Subject: LU-14205 osd-ldiskfs: return correct error after end io
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 31a1afade08f5bc36b3ebb1d1bace103f902c155

Comment by Gerrit Updater [ 22/Dec/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40989/
Subject: LU-14205 osd-ldiskfs: fix return of osd_extend_restart_trans()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4aa17923f08bea425a20961cb6eaa72ad9af38c1
