[LU-3438] replay-ost-single test_5 failed with error in check_write_rcs() "Unexpected # bytes transferred" Created: 05/Jun/13 Updated: 10/Jun/13 Resolved: 10/Jun/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Artem Blagodarenko (Inactive) | Assignee: | Keith Mannthey (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Environment: | Lustre master branch |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 8566 |
| Description |
|
Our testing system shows that replay-ost-single test_5 fails:

Lustre: DEBUG MARKER: == replay-ost-single test 5: Fail OST during iozone == 21:21:13 (1369851673)
Lustre: Failing over lustre-OST0000
LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_write operation failed with -19
LustreError: Skipped 1 previous similar message
Lustre: lustre-OST0000-osc-ffff8800514d3400: Connection to lustre-OST0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0000: shutting down for failover; client state will be preserved.
Lustre: OST lustre-OST0000 has stopped.
Lustre: server umount lustre-OST0000 complete
LustreError: 137-5: UUID 'lustre-OST0000_UUID' is not available for connect (no target)
LustreError: Skipped 1 previous similar message
LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts:
LDISKFS-fs (loop1): mounted filesystem with ordered data mode. Opts:
Lustre: 16962:0:(ldlm_lib.c:2195:target_recovery_init()) RECOVERY: service lustre-OST0000, 2 recoverable clients, last_transno 1322
Lustre: lustre-OST0000: Now serving lustre-OST0000 on /dev/loop1 with recovery enabled
Lustre: 2398:0:(ldlm_lib.c:1021:target_handle_connect()) lustre-OST0000: connection from lustre-MDT0000-mdtlov_UUID@0@lo recovering/t0 exp ffff88005ca19c00 cur 1369851700 last 1369851697
Lustre: 2398:0:(ldlm_lib.c:1021:target_handle_connect()) Skipped 3 previous similar messages
Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
Lustre: lustre-OST0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted.
Lustre: lustre-OST0000-osc-MDT0000: Connection restored to lustre-OST0000 (at 0@lo)
Lustre: Skipped 1 previous similar message
LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 65536 (requested 32768)
LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 2097152 (requested 1048576)
Lustre: lustre-OST0000: received MDS connection from 0@lo
Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
Lustre: DEBUG MARKER: iozone rc=1
Lustre: DEBUG MARKER: replay-ost-single test_5: @@@@@@ FAIL: iozone failed

These messages look related to the 4MB IO patch:

LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 65536 (requested 32768)
LustreError: 1716:0:(osc_request.c:1232:check_write_rcs()) Unexpected # bytes transferred: 2097152 (requested 1048576)

I believe this test also fails on the master branch, but it is skipped there as a SLOW test:

test_5 SKIP 0 0 skipping SLOW test 5 |
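For reference, a rough sketch of how one might run just this subtest with the SLOW gate disabled, assuming the usual lustre/tests test-framework environment variables (the install path below is an assumption; adjust for your setup):

    # Hypothetical path to the installed test scripts; use lustre/tests in a source tree if preferred
    cd /usr/lib64/lustre/tests
    # ONLY=5 restricts the run to subtest 5; SLOW=yes keeps it from being skipped as a SLOW test
    SLOW=yes ONLY=5 sh replay-ost-single.sh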
| Comments |
| Comment by Artem Blagodarenko (Inactive) [ 05/Jun/13 ] |
|
Could you please run this test (it is marked as SLOW) and check whether it fails? |
| Comment by Artem Blagodarenko (Inactive) [ 05/Jun/13 ] |
|
Issue related to |
| Comment by Keith Mannthey (Inactive) [ 05/Jun/13 ] |
|
This test does run, just not with every patch review. You can do a subtest search in Maloo (it is a little slow, but it works): https://maloo.whamcloud.com/sub_tests/query

This is a 2.4-RC1 run that passed:

There was some trouble with this test a while ago. Please see |
| Comment by Keith Mannthey (Inactive) [ 05/Jun/13 ] |
|
Earlier encounter of this issue. |
| Comment by Andreas Dilger [ 10/Jun/13 ] |
|
Artem, are you actually hitting this with the 2.4.0 release code, or is this a 2.1 branch with the 4MB patch applied? |
| Comment by Artem Blagodarenko (Inactive) [ 10/Jun/13 ] |
|
Andreas, after we applied http://review.whamcloud.com/#change,5532 this problem is gone. Thanks! |
| Comment by Andreas Dilger [ 10/Jun/13 ] |
|
Artem, in the future, if the problem being reported does not actually match the release version as tagged in git, please at a minimum include a description of the version in the Environment section of the bug (e.g. "git describe" output for your tree, including the change numbers of relevant patches already applied), and ideally a pointer to a git repo with the actual tree being tested.

Filing a bug marked "2.4.0" and described as "master branch" in June, after the 2.4.0 release, but missing a patch that actually landed on master in March, before the 2.3.62 tag, makes it difficult for us to determine what the actual problem is. I'm glad that Keith could isolate this problem so quickly with the Maloo test logs, and that the patch fixed the problem for you, but for other bugs this may not be so easily done. |
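For example (a hypothetical sketch; the output and tag placeholder below are illustrative, not taken from this ticket), the Environment field could include something like:

    # Record the exact code under test
    $ git describe
    2.1.5-17-g1a2b3c4        # hypothetical output: base tag, local patch count, commit hash
    # List locally applied patches that are not in the base tag
    $ git log --oneline --no-merges <base-tag>..HEAD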