[LU-11729] ARM: sanity test_810: BAD WRITE CHECKSUM with adler Created: 04/Dec/18 Updated: 23/Sep/19 Resolved: 17/Sep/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Dongyang Li |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | arm |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/a298010c-f721-11e8-86c0-52540065bddc

The test_810 added for

osc.lustre-OST0006-osc-ffff800039d7f800.checksum_type=adler
fail_loc=0x411
dd: error writing '/mnt/lustre/f810.sanity': Input/output error
6bf5f3489c417a2e6f9e223278d93278 /mnt/lustre/f810.sanity != d375c4c8a12ae6de34e09e696c3725b1 /mnt/lustre/f810.sanity

The client console logs show that the checksums do not match between the client and the server, so there is still some kind of alignment problem:

LustreError: 132-0: lustre-OST0000-osc-ffff800039d7f800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.9.3.137@tcp inode [0x200006991:0xc:0x0] object 0x0:27458 extent [10240-20479], original client csum 6a00237 (type 20), server csum ab0036d (type 20), client csum now 6a00237
LustreError: 22024:0:(osc_request.c:1923:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff80003069f300 x1618848577499536/t42949672971(42949672971) o4->lustre-OST0000-osc-ffff800039d7f800@10.9.3.137@tcp:6/4 lens 488/416 e 0 to 0 dl 1543857802 ref 2 fl Interpret:RM/0/0 rc 0/0
LustreError: 22024:0:(osc_request.c:2048:brw_interpret()) lustre-OST0000-osc-ffff800039d7f800: too many resent retries for object: 0:27458, rc = -11. |
| Comments |
| Comment by Andreas Dilger [ 04/Dec/18 ] |
|
I thought this would be a 100% failure on ARM, but it seems that only a few tests are failing and the majority of test_810 runs pass. |
| Comment by Li Xi [ 04/Dec/18 ] |
|
Is it possible that T10PI4K (type=20) is not always selected as the fastest checksum type? In that case, only a few test runs would select T10PI4K as the checksum type, and those runs would fail 100% of the time. I think this failure might be caused by the bug fixed by patch: The client is ARM with a 64K page size, right? I feel that more testing with a 64KB-page client and a 4KB-page server would help us find more problems like this... |
| Comment by Peter Jones [ 04/Dec/18 ] |
|
ok - assuming to be a duplicate of |
| Comment by Andreas Dilger [ 05/Dec/18 ] |
|
This is definitely not a duplicate of the T10 checksum bug. The sanity test_810 explicitly uses adler as the checksum type. |
| Comment by Li Xi [ 05/Dec/18 ] |
|
> This is definitely not a duplicate of the T10 checksum bug. The sanity test_810 explicitly uses adler as the checksum type.

Oh, that is strange then, because according to the following log, the checksum type used is T10PI4K (type 0x20), not adler (type 0x2). I guess the problem might be caused by a delay or failure in setting the checksum type?

LustreError: 132-0: lustre-OST0000-osc-ffff800039d7f800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.9.3.137@tcp inode [0x200006991:0xc:0x0] object 0x0:27458 extent [10240-20479], original client csum 6a00237 (type 20), server csum ab0036d (type 20), client csum now 6a00237 |
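The reading of the log line in the comment above can be sketched as a small parser. This is only an illustration: `decode_bad_checksum` and `CKSUM_TYPE_NAMES` are hypothetical names, not part of Lustre, and the sketch assumes (as the comment does) that the `type` field and the checksums are printed in hexadecimal. The table lists only the two type values named in this ticket; other types are left out rather than guessed.

```python
import re

# Only the two checksum type values named in this ticket (assumption:
# printed in hex). Other types exist but are intentionally not listed.
CKSUM_TYPE_NAMES = {0x2: "adler", 0x20: "T10PI4K"}

# Matches the checksum fields of a "BAD WRITE CHECKSUM" console message.
_CSUM_RE = re.compile(
    r"original client csum (?P<client>[0-9a-f]+) \(type (?P<ctype>[0-9a-f]+)\), "
    r"server csum (?P<server>[0-9a-f]+) \(type (?P<stype>[0-9a-f]+)\), "
    r"client csum now (?P<now>[0-9a-f]+)"
)


def decode_bad_checksum(line):
    """Return the checksum fields of a log line as a dict, or None."""
    m = _CSUM_RE.search(line)
    if m is None:
        return None
    ctype = int(m.group("ctype"), 16)
    return {
        "type": ctype,
        "type_name": CKSUM_TYPE_NAMES.get(ctype, "unknown"),
        "client": int(m.group("client"), 16),
        "server": int(m.group("server"), 16),
        "client_now": int(m.group("now"), 16),
    }


# Tail of the console message quoted in this ticket.
LOG = ("extent [10240-20479], original client csum 6a00237 (type 20), "
       "server csum ab0036d (type 20), client csum now 6a00237")

info = decode_bad_checksum(LOG)
print(info["type_name"], hex(info["type"]))
# The client's recomputed checksum equals its original one, so the page
# did not change on the client; only the server's value disagrees.
print(info["client"] == info["client_now"], info["client"] == info["server"])
```

Under these assumptions the type decodes to 0x20 rather than 0x2, which is the discrepancy pointed out above.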
| Comment by Andreas Dilger [ 05/Dec/18 ] |
|
You are right, it may be a duplicate of the T10-PI ticket then. It may be that "lfs set_param" is trying to set the checksum type to adler, but this type is not available? It should always be one of the supported checksum types, but possibly this has been lost from the code. Before this bug is closed again it would make sense to improve test_810 to test all of the available checksum types listed from "lctl get_param osc.*OST0000*.checksum_type" to ensure they are all working for this test case. |
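The suggestion above, iterating over every checksum type the OSC reports, needs the parameter value split into type names first. A minimal sketch follows, assuming the value is a space-separated list of names with the currently active one wrapped in square brackets; that format, the sample string, and the helper name `parse_checksum_types` are assumptions for illustration, not taken from this ticket.

```python
def parse_checksum_types(output):
    """Split a checksum_type parameter value into (available, active).

    Assumed input shape: "param.name=type1 [type2] type3", where the
    bracketed entry is the active type. The "param.name=" prefix is optional.
    """
    value = output.split("=", 1)[-1]
    available = []
    active = None
    for token in value.split():
        name = token.strip("[]")
        if token.startswith("[") and token.endswith("]"):
            active = name
        available.append(name)
    return available, active


# Hypothetical sample value, for illustration only.
sample = "osc.lustre-OST0000-osc-ffff800039d7f800.checksum_type=crc32 [adler] crc32c"
types, active = parse_checksum_types(sample)
for name in types:
    # A real test would set this type on the OSC and rerun the write check.
    print("would test checksum type:", name)
```

Looping over the parsed list instead of hard-coding adler would also catch the case where the requested type is silently not applied, since the active entry can be compared against the type that was requested.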
| Comment by Gerrit Updater [ 14/Dec/18 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33864 |
| Comment by Gerrit Updater [ 17/Dec/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33864/ |
| Comment by Peter Jones [ 07/Jan/19 ] |
|
Dongyang, could you please investigate? Thanks, Peter |
| Comment by Gerrit Updater [ 16/Jan/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34043 |
| Comment by Gerrit Updater [ 16/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34043/ |
| Comment by Peter Jones [ 17/Sep/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 17/Sep/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36205 |
| Comment by Gerrit Updater [ 23/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36205/ |