[LU-3438] replay-ost-single test_5 failed with error int check_write_rcs() "Unexpected # bytes transferred" Created: 05/Jun/13  Updated: 10/Jun/13  Resolved: 10/Jun/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Artem Blagodarenko (Inactive) Assignee: Keith Mannthey (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Lustre master branch


Issue Links:
Related
is related to LU-2817 Failure on test suite replay-ost-sing... Resolved
is related to LU-1431 Support for larger than 1MB sequentia... Resolved
Severity: 3
Rank (Obsolete): 8566

 Description   

Our testing system shows that test replay-ost-single test_5 fails:

Lustre: DEBUG MARKER: == replay-ost-single test 5: Fail OST during iozone == 21:21:13 (1369851673)
Lustre: Failing over lustre-OST0000
LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_write operation failed with -19
LustreError: Skipped 1 previous similar message
Lustre: lustre-OST0000-osc-ffff8800514d3400: Connection to lustre-OST0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0000: shutting down for failover; client state will be preserved.
Lustre: OST lustre-OST0000 has stopped.
Lustre: server umount lustre-OST0000 complete
LustreError: 137-5: UUID 'lustre-OST0000_UUID' is not available for connect (no target)
LustreError: Skipped 1 previous similar message
LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts: 
LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts: 
Lustre: 16962:0:(ldlm_lib.c:2195:target_recovery_init()) RECOVERY: service lustre-OST0000, 2 recoverable clients, last_transno 1322
Lustre: lustre-OST0000: Now serving lustre-OST0000 on /dev/loop1 with recovery enabled
Lustre: 2398:0:(ldlm_lib.c:1021:target_handle_connect()) lustre-OST0000: connection from lustre-MDT0000-mdtlov_UUID@0@lo recovering/t0 exp ffff88005ca19c00 cur 1369851700 last 1369851697
Lustre: 2398:0:(ldlm_lib.c:1021:target_handle_connect()) Skipped 3 previous similar messages
Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
Lustre: lustre-OST0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted.
Lustre: lustre-OST0000-osc-MDT0000: Connection restored to lustre-OST0000 (at 0@lo)
Lustre: Skipped 1 previous similar message
LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 65536 (requested 32768)
LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 2097152 (requested 1048576)
Lustre: lustre-OST0000: received MDS connection from 0@lo
Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
Lustre: DEBUG MARKER: iozone rc=1
Lustre: DEBUG MARKER: replay-ost-single test_5: @@@@@@ FAIL: iozone failed

These messages look related to the 4MB IO patch:

LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 65536 (requested 32768)
LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 2097152 (requested 1048576)

I believe this test also fails on the master branch, but it is skipped as SLOW during testing:
https://maloo.whamcloud.com/test_sets/dd033a98-7264-11e2-aad1-52540035b04c

test_5	SKIP	0	0	skipping SLOW test 5


 Comments   
Comment by Artem Blagodarenko (Inactive) [ 05/Jun/13 ]

Could you please run this test (it is marked as SLOW) and check whether it fails?

Comment by Artem Blagodarenko (Inactive) [ 05/Jun/13 ]

This issue is related to LU-1431.

Comment by Keith Mannthey (Inactive) [ 05/Jun/13 ]

This test does run, just not with every patch review. You can do a subtest search in Maloo (it is a little slow but works): https://maloo.whamcloud.com/sub_tests/query

This is a 2.4-RC1 run that passed:
https://maloo.whamcloud.com/test_sets/92b8f0d2-cdf3-11e2-ba28-52540035b04c

There was some trouble with this test a while ago; please see LU-2817. Our testing has not failed since http://review.whamcloud.com/#change,5532 landed.

Comment by Keith Mannthey (Inactive) [ 05/Jun/13 ]

An earlier encounter of this issue.

Comment by Andreas Dilger [ 10/Jun/13 ]

Artem, are you actually hitting this with the 2.4.0 release code, or is this a 2.1 branch with the 4MB patch applied?

Comment by Artem Blagodarenko (Inactive) [ 10/Jun/13 ]

Andreas, after we applied http://review.whamcloud.com/#change,5532 the problem went away. Thanks!
I think we can close this issue.

Comment by Andreas Dilger [ 10/Jun/13 ]

Artem, in the future, if the problem being reported does not actually match the release version as tagged in git, please at a minimum include a description of the version (e.g. "git describe" in your tree, including the change numbers of relevant patches already applied) in the Environment section of the bug, and ideally a pointer to a git repo with the actual tree being tested.

Filing a bug marked "2.4.0" and described as "master branch" in June, after the 2.4.0 release, from a tree missing a patch that actually landed to master in March before the 2.3.62 tag, makes it difficult for us to determine what the actual problem is. I'm glad that Keith could isolate this problem so quickly with the Maloo test logs, and that the patch fixed the problem for you, but other bugs may not be resolved so easily.

Generated at Sat Feb 10 01:33:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.