[LU-4037] Failure on test suite sanity test_78: rdwr failed Created: 01/Oct/13  Updated: 10/Oct/21  Resolved: 10/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: WC Triage
Resolution: Cannot Reproduce
Votes: 0
Labels: zfs
Environment:

server and client: lustre-master build # 1687


Issue Links:
Related
is related to LU-7139 Sanity test_78 defect: wrong file siz... Resolved
Severity: 3
Rank (Obsolete): 10846

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/32766b8c-26c7-11e3-83d1-52540035b04c.

The sub-test test_78 failed with the following error:

rdwr failed

test log

== sanity test 78: handle large O_DIRECT writes correctly ============== 20:40:26 (1380166826)
MemFree: 1247, Max file size: 1400000
MemTotal: 1877
Mem to use for directio: 810
Smallest OST: 169728
File size: 512
directIO rdwr round 1 of 5
directio on /mnt/lustre/f.sanity.78 for 102x1048576 bytes 
PASS
directIO rdwr round 2 of 5
directio on /mnt/lustre/f.sanity.78 for 128x1048576 bytes 
Write error Success (rc = 114294784, len = 134217728)
 sanity test_78: @@@@@@ FAIL: rdwr failed 


 Comments   
Comment by Andreas Dilger [ 02/Oct/13 ]

So the system call returned 109 * 1048576 = 114294784 instead of the expected 128 * 1048576 = 134217728. This test has failed a few times in the past month, but is typically skipped because it is marked SLOW.
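The arithmetic above can be checked directly from the failure line in the test log. The following is a minimal sketch, not part of the test suite; the function name `short_write_blocks` and the regular expression are illustrative assumptions, and the 1048576-byte block size is taken from the "128x1048576 bytes" line in the log.

```python
import re

MIB = 1048576  # block size from the "directio ... for 128x1048576 bytes" log line

# Matches lines such as: "Write error Success (rc = 114294784, len = 134217728)"
PATTERN = re.compile(r"\(rc = (\d+), len = (\d+)\)")


def short_write_blocks(line):
    """Return (blocks_written, blocks_requested, blocks_missing) for a log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("no rc/len pair found in line")
    rc, length = (int(group) for group in match.groups())
    return rc // MIB, length // MIB, (length - rc) // MIB


print(short_write_blocks("Write error Success (rc = 114294784, len = 134217728)"))
# prints (109, 128, 19)
```

Applied to the later Cray failure line in this ticket (rc = 387973120, len = 407896064), the same function returns (370, 389, 19), so both reports fall short by 19 blocks of 1048576 bytes. Whether that match is significant is not established here.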

Comment by Andreas Dilger [ 02/Oct/13 ]

My first guess would be that the size of the O_DIRECT call is being limited for some reason, and it is returning a short write to the caller. The returned value is the same in the four test failures that I can check, but there are more failures dating back to 2013-04-10 (https://maloo.whamcloud.com/sub_tests/c637000e-a204-11e2-bdac-52540035b04c) that do not have logs.
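If the short-write guess is right, the "Write error Success" text would be consistent with a write that returned fewer bytes than requested without setting an error code. The sketch below shows, under that assumption, how a caller can detect and continue after a short write. It is not the code used by the test's directio tool: `write_all` and its `write` parameter are hypothetical names introduced for illustration, and the sketch uses an ordinary descriptor, ignoring the buffer and offset alignment that O_DIRECT requires.

```python
import os


def write_all(fd, data, write=os.write):
    """Write all of data to fd, retrying with the remainder after short writes.

    The `write` parameter is a hypothetical hook so that a short write can be
    simulated; by default the real os.write is used.
    """
    view = memoryview(data)
    total = 0
    while total < len(view):
        written = write(fd, view[total:])
        if written == 0:
            # No progress was made, so stop instead of looping forever.
            raise OSError("write returned 0 bytes; giving up")
        total += written
    return total
```

A caller that compares a single return value against the requested length, as the failure message suggests the test does, reports any short write as a failure. A loop like this one would instead issue a second write for the remainder.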

Comment by Mark Mansk [ 04/Jun/14 ]

We're seeing this start to fail at Cray with 2.5.1.

MemFree: 30795, Max file size: 400000
MemTotal: 32217
Mem to use for directio: 15980
Smallest OST: 240436
File size: 389
...
directIO rdwr round 5 of 5
directio on /tmp/dal/f78.sanity for 389x1048576 bytes
Write error Success (rc = 387973120, len = 407896064)
sanity test_78: @@@@@@ FAIL: rdwr failed

From the console logs:

2014-06-04T02:48:50.946766-05:00 c0-0c0s3n2 LNet: 6646:0:(gnilnd_cb.c:867:kgnilnd_verify_rdma_cksum()) $$ no RDMA payload checksum when enabled from 14@gni4  msg@0xffff8807aa5cd118 m/v/ty/ck/pck/pl b00fbabe/8/16/fae0/0/0 x948646:GNILND_MSG_GET_DONE_REV
2014-06-04T02:48:50.946814-05:00 c0-0c0s3n2 LNet: 6646:0:(gnilnd_cb.c:867:kgnilnd_verify_rdma_cksum()) Skipped 1645 previous similar messages
2014-06-04T02:49:21.361769-05:00 c0-0c0s3n2 LustreError: 11251:0:(ofd_grant.c:255:ofd_grant_space_left()) dal-OST0000: cli 86a844b0-7844-5210-510c-2101e0354cd0/ffff880871927c00 left 51163136 < tot_grant 52605696 unstable 0 pending 0
2014-06-04T02:49:21.361821-05:00 c0-0c0s3n2 LustreError: 11251:0:(ofd_grant.c:255:ofd_grant_space_left()) Skipped 5 previous similar messages
2014-06-04T02:49:21.884851-05:00 c0-0c0s2n2 Lustre: DEBUG MARKER: sanity test_78: @@@@@@ FAIL: rdwr failed

This test hasn't failed before.
