[LU-11729] ARM: sanity test_810: BAD WRITE CHECKSUM with adler
Created: 04/Dec/18  Updated: 23/Sep/19  Resolved: 17/Sep/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug
Priority: Critical
Reporter: Maloo
Assignee: Dongyang Li
Resolution: Fixed
Votes: 0
Labels: arm

Issue Links:
Related
is related to LU-11663 corrupt data after page-unaligned wri... Resolved
is related to LU-11697 BAD WRITE CHECKSUM with t10ip4K and t... Resolved
is related to LU-10300 Can the Lustre 2.10.x clients support... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/a298010c-f721-11e8-86c0-52540065bddc

The test_810 added for LU-11663 fails on ARM on ldiskfs with the following test error:

osc.lustre-OST0006-osc-ffff800039d7f800.checksum_type=adler
fail_loc=0x411
dd: error writing '/mnt/lustre/f810.sanity': Input/output error
6bf5f3489c417a2e6f9e223278d93278  /mnt/lustre/f810.sanity != d375c4c8a12ae6de34e09e696c3725b1  /mnt/lustre/f810.sanity

The client console logs show the checksums do not match between the client and server so there is still some kind of alignment problem:

LustreError: 132-0: lustre-OST0000-osc-ffff800039d7f800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.9.3.137@tcp inode [0x200006991:0xc:0x0] object 0x0:27458 extent [10240-20479], original client csum 6a00237 (type 20), server csum ab0036d (type 20), client csum now 6a00237
LustreError: 22024:0:(osc_request.c:1923:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff80003069f300 x1618848577499536/t42949672971(42949672971) o4->lustre-OST0000-osc-ffff800039d7f800@10.9.3.137@tcp:6/4 lens 488/416 e 0 to 0 dl 1543857802 ref 2 fl Interpret:RM/0/0 rc 0/0
LustreError: 22024:0:(osc_request.c:2048:brw_interpret()) lustre-OST0000-osc-ffff800039d7f800: too many resent retries for object: 0:27458, rc = -11.

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity test_810 - 6bf5f3489c417a2e6f9e223278d93278 /mnt/lustre/f810.sanity != d375c4c8a12ae6de34e09e696c3725b1 /mnt/lustre/f810.sanity



 Comments   
Comment by Andreas Dilger [ 04/Dec/18 ]

I thought this would be a 100% failure on ARM, but it seems that only a few runs fail and the majority of test_810 runs pass.

Comment by Li Xi [ 04/Dec/18 ]

Is it possible that T10PI4K (type=20) is not always measured as the fastest checksum type? In that case, only a few test runs would select T10PI4K as the checksum type, and those runs would fail 100% of the time.

I think this failure might be caused by the bug fixed by patch:
https://review.whamcloud.com/#/c/33727/4

The client is ARM with a 64KB page size, right? I feel that more testing with a 64KB-page client and a 4KB-page server would help us find more problems like this.
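To make the alignment question concrete, the extent [10240-20479] from the BAD WRITE CHECKSUM log in the description can be split at sector boundaries. This is only an arithmetic sketch, not the Lustre checksum code: the function name `extent_chunks` is hypothetical, and the 4096-byte sector and 65536-byte page sizes are assumptions taken from the T10PI4K name and the 64K page size mentioned above.

```shell
# Split the inclusive byte extent [start, end] at multiples of "unit" and
# print one "offset length" pair per piece.
# Hypothetical helper for illustration only; not part of Lustre.
extent_chunks() {
    local start=$1 end=$2 unit=$3
    local pos=$start next
    while [ "$pos" -le "$end" ]; do
        # Next unit boundary strictly after pos
        next=$(( (pos / unit + 1) * unit ))
        # Do not run past the end of the extent
        if [ "$next" -gt $((end + 1)) ]; then
            next=$((end + 1))
        fi
        echo "$pos $((next - pos))"
        pos=$next
    done
}

# Extent from the console log, split at an assumed 4096-byte sector size
extent_chunks 10240 20479 4096
# Same extent against an assumed 65536-byte client page
extent_chunks 10240 20479 65536
```

With a 4096-byte unit the extent starts 2048 bytes into a sector, so the first piece is a partial sector followed by two full ones, while the whole extent fits inside a single 64KB page.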

Comment by Peter Jones [ 04/Dec/18 ]

OK - assuming this is a duplicate of LU-11697 unless proven otherwise

Comment by Andreas Dilger [ 05/Dec/18 ]

This is definitely not a duplicate of the T10 checksum bug. The sanity test_810 explicitly uses adler as the checksum type.

Comment by Li Xi [ 05/Dec/18 ]

> This is definitely not a duplicate of the T10 checksum bug. The sanity test_810 explicitly uses adler as the checksum type.

Oh, that is strange then, because according to the following log, the checksum type used is T10PI4K (type 0x20), not adler (type 0x2).

I guess the problem might be caused by a delay or failure in setting the checksum type?

LustreError: 132-0: lustre-OST0000-osc-ffff800039d7f800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.9.3.137@tcp inode [0x200006991:0xc:0x0] object 0x0:27458 extent [10240-20479], original client csum 6a00237 (type 20), server csum ab0036d (type 20), client csum now 6a00237
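As a rough sketch of the check described above, the `(type NN)` field can be pulled out of a BAD WRITE CHECKSUM line and compared with the expected type. The helper names are hypothetical, and the mapping covers only the two values named in this comment (0x2 for adler, 0x20 for T10PI4K, written here as `t10ip4K`, which is an assumption about the parameter spelling); any other value is reported as unknown rather than guessed.

```shell
# Print the hexadecimal checksum type the client reported in a
# BAD WRITE CHECKSUM console line. Hypothetical helper for illustration.
csum_type_from_log() {
    printf '%s\n' "$1" |
        sed -n 's/.*original client csum [0-9a-f]* (type \([0-9a-f]*\)).*/\1/p'
}

# Map a type value to a name. Only the two values discussed in this
# ticket are listed; everything else is deliberately left unknown.
csum_type_name() {
    case "$1" in
    2) echo adler ;;
    20) echo t10ip4K ;;
    *) echo unknown ;;
    esac
}

# Shortened copy of the log line quoted above
line='BAD WRITE CHECKSUM: original client csum 6a00237 (type 20), server csum ab0036d (type 20), client csum now 6a00237'
csum_type_name "$(csum_type_from_log "$line")"
```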

Comment by Andreas Dilger [ 05/Dec/18 ]

You are right, it may be a duplicate of the T10-PI ticket then. It may be that "lctl set_param" is trying to set the checksum type to adler, but this type is not available? It should always be one of the supported checksum types, but possibly this has been lost from the code.

Before this bug is closed again it would make sense to improve test_810 to test all of the available checksum types listed from "lctl get_param osc.*OST0000*.checksum_type" to ensure they are all working for this test case.
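A minimal sketch of what such a loop could look like is below. It is not the test that landed: the function names are hypothetical, and it assumes that `lctl get_param -n` prints the types on one line separated by spaces with the active type in square brackets (for example `crc32 adler [crc32c]`). It also checks that each `lctl set_param` actually took effect, since a type that silently fails to apply would explain the log above.

```shell
# Turn a checksum_type listing into one type name per line, dropping the
# [brackets] assumed to mark the active type.
list_checksum_types() {
    printf '%s\n' "$1" | sed 's/[][]//g' | tr -s ' ' '\n'
}

# Print the bracketed (active) type from a checksum_type listing.
active_checksum_type() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# For every type listed under "param", select it, confirm it is active,
# then run the remaining arguments as the test command.
# Hypothetical helper; requires lctl on a Lustre client.
check_all_checksum_types() {
    local param=$1; shift
    local listing type active
    listing=$(lctl get_param -n "$param" | head -n 1)
    for type in $(list_checksum_types "$listing"); do
        lctl set_param "$param=$type" || return 1
        active=$(active_checksum_type "$(lctl get_param -n "$param" | head -n 1)")
        if [ "$active" != "$type" ]; then
            echo "checksum_type is '$active', wanted '$type'" >&2
            return 1
        fi
        "$@" || return 1
    done
}
```

On a client it might be invoked as `check_all_checksum_types 'osc.*OST0000*.checksum_type' some_write_test`, where `some_write_test` stands in for the unaligned write and compare steps of test_810.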

Comment by Gerrit Updater [ 14/Dec/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33864
Subject: LU-11729 tests: skip sanity test 810 for ARM
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e5498fc90e2b7809a00220b0cf18c1ac9a730a86

Comment by Gerrit Updater [ 17/Dec/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33864/
Subject: LU-11729 tests: skip sanity test 810 for ARM
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a6239c48da38ff0da4564da496766deebc88923f

Comment by Peter Jones [ 07/Jan/19 ]

Dongyang

Could you please investigate?

Thanks

Peter

Comment by Gerrit Updater [ 16/Jan/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34043
Subject: LU-11729 tests: verify checksum types in sanity test_810
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6cbddb97cf05cdd2c9d229cf218912e6881cc64a

Comment by Gerrit Updater [ 16/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34043/
Subject: LU-11729 obdclass: align to T10 sector size when generating guard
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 98ceaf854bb4738305769c5cd1df556ee99aa859

Comment by Peter Jones [ 17/Sep/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 17/Sep/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36205
Subject: LU-11729 obdclass: align to T10 sector size when generating guard
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: c78fa4e0f024d4823c6d867d04c108293f6d0859

Comment by Gerrit Updater [ 23/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36205/
Subject: LU-11729 obdclass: align to T10 sector size when generating guard
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 38e83f124a41b633da02073b76cf20495bef3919
