ARM: sanity test_810: BAD WRITE CHECKSUM with adler

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/a298010c-f721-11e8-86c0-52540065bddc

      The test_810 added for LU-11663 fails on ARM on ldiskfs with the following test error:

      osc.lustre-OST0006-osc-ffff800039d7f800.checksum_type=adler
      fail_loc=0x411
      dd: error writing '/mnt/lustre/f810.sanity': Input/output error
      6bf5f3489c417a2e6f9e223278d93278  /mnt/lustre/f810.sanity != d375c4c8a12ae6de34e09e696c3725b1  /mnt/lustre/f810.sanity
      

      The client console logs show that the checksums computed by the client and the server do not match, so there is still some kind of alignment problem:

      LustreError: 132-0: lustre-OST0000-osc-ffff800039d7f800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.9.3.137@tcp inode [0x200006991:0xc:0x0] object 0x0:27458 extent [10240-20479], original client csum 6a00237 (type 20), server csum ab0036d (type 20), client csum now 6a00237
      LustreError: 22024:0:(osc_request.c:1923:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff80003069f300 x1618848577499536/t42949672971(42949672971) o4->lustre-OST0000-osc-ffff800039d7f800@10.9.3.137@tcp:6/4 lens 488/416 e 0 to 0 dl 1543857802 ref 2 fl Interpret:RM/0/0 rc 0/0
      LustreError: 22024:0:(osc_request.c:2048:brw_interpret()) lustre-OST0000-osc-ffff800039d7f800: too many resent retries for object: 0:27458, rc = -11.
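
To make the fields of the BAD WRITE CHECKSUM message above easier to compare, here is a minimal Python sketch that extracts them with a regular expression. The name `parse_bad_checksum` and the pattern are illustrative assumptions modelled only on the single log line quoted in this ticket; they are not part of Lustre or its test framework.

```python
import re

# Pattern modelled on the one BAD WRITE CHECKSUM line quoted in this ticket.
# Checksums and the checksum type are printed in hex without a 0x prefix.
BAD_CSUM_RE = re.compile(
    r"extent \[(?P<start>\d+)-(?P<end>\d+)\], "
    r"original client csum (?P<client>[0-9a-f]+) \(type (?P<ctype>[0-9a-f]+)\), "
    r"server csum (?P<server>[0-9a-f]+) \(type (?P<stype>[0-9a-f]+)\), "
    r"client csum now (?P<now>[0-9a-f]+)"
)


def parse_bad_checksum(line):
    """Return extent, checksums and checksum types from one log line, or None."""
    m = BAD_CSUM_RE.search(line)
    if m is None:
        return None
    return {
        "extent": (int(m.group("start")), int(m.group("end"))),
        "client_csum": int(m.group("client"), 16),
        "server_csum": int(m.group("server"), 16),
        "client_csum_now": int(m.group("now"), 16),
        "client_type": int(m.group("ctype"), 16),
        "server_type": int(m.group("stype"), 16),
    }


# The log line from this ticket, copied verbatim.
LINE = (
    "LustreError: 132-0: lustre-OST0000-osc-ffff800039d7f800: BAD WRITE CHECKSUM: "
    "changed in transit before arrival at OST: from 10.9.3.137@tcp inode "
    "[0x200006991:0xc:0x0] object 0x0:27458 extent [10240-20479], "
    "original client csum 6a00237 (type 20), server csum ab0036d (type 20), "
    "client csum now 6a00237"
)

info = parse_bad_checksum(LINE)
# The client's recomputed checksum still equals its original checksum,
# while the server computed a different value over the same extent.
print(info["client_csum"] == info["client_csum_now"])
print(info["client_csum"] != info["server_csum"])
```

In this line the client's checksum is unchanged on recomputation, which is consistent with the description's reading that the client and server disagree about the checksum rather than the client data changing underneath the write.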
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_810 - 6bf5f3489c417a2e6f9e223278d93278 /mnt/lustre/f810.sanity != d375c4c8a12ae6de34e09e696c3725b1 /mnt/lustre/f810.sanity

          Activity

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34043
            Subject: LU-11729 tests: verify checksum types in sanity test_810
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6cbddb97cf05cdd2c9d229cf218912e6881cc64a

            pjones Peter Jones added a comment -

            Dongyang

            Could you please investigate?

            Thanks

            Peter


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33864/
            Subject: LU-11729 tests: skip sanity test 810 for ARM
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a6239c48da38ff0da4564da496766deebc88923f

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33864
            Subject: LU-11729 tests: skip sanity test 810 for ARM
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e5498fc90e2b7809a00220b0cf18c1ac9a730a86

            adilger Andreas Dilger added a comment - - edited

            You are right, it may be a duplicate of the T10-PI ticket then. It may be that "lctl set_param" is trying to set the checksum type to adler, but this type is not available? It should always be one of the supported checksum types, but possibly this has been lost from the code.

            Before this bug is closed again it would make sense to improve test_810 to test all of the available checksum types listed from "lctl get_param osc.*OST0000*.checksum_type" to ensure they are all working for this test case.
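
The suggestion above (cycle through every available checksum type and confirm that each one actually takes effect) can be sketched in Python as follows. This is only an illustration, not the change made to test_810: it assumes the parameter value is a space-separated list in which the active type is wrapped in square brackets, the sample string is made up for the example, and the helper names `parse_checksum_types` and `setting_took_effect` are hypothetical.

```python
def parse_checksum_types(value):
    """Split a checksum_type value into (all_types, active_type).

    Assumes the active type is the one wrapped in [brackets].
    """
    all_types = []
    active = None
    for token in value.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        all_types.append(token)
    return all_types, active


def setting_took_effect(requested, value_after_set):
    """True if the type we asked for is the one now marked active."""
    _, active = parse_checksum_types(value_after_set)
    return active == requested


# Illustrative sample value only; a real run would read it with
# "lctl get_param -n osc.*OST0000*.checksum_type".
sample = "crc32 adler [crc32c]"
types, active = parse_checksum_types(sample)

for name in types:
    # Command a test loop could issue before writing and verifying the file.
    print("lctl set_param osc.*OST0000*.checksum_type=%s" % name)
```

Checking `setting_took_effect` after each `set_param` would also catch the case raised elsewhere in this ticket, where the requested type (adler) is not the one reported in the checksum error.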

            lixi_wc Li Xi added a comment - - edited

            > This is definitely not a duplicate of the T10 checksum bug. The sanity test_810 explicitly uses adler as the checksum type.

            Oh, it is strange then, because according to the following log, the checksum type used is T10PI4K (type 0x20), not adler (type 0x2).

            I guess the problem might be caused by the delay/failure of setting checksum type?

            LustreError: 132-0: lustre-OST0000-osc-ffff800039d7f800: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.9.3.137@tcp inode [0x200006991:0xc:0x0] object 0x0:27458 extent [10240-20479], original client csum 6a00237 (type 20), server csum ab0036d (type 20), client csum now 6a00237
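
The point about "type 20" is easy to misread, because the log prints the type in hexadecimal without a 0x prefix. A small Python sketch of the decoding follows; the table deliberately contains only the two values named in this comment, and `cksum_type_name` is a hypothetical helper, not a Lustre function.

```python
# Only the two checksum types named in this ticket are listed here;
# any other value is reported as unknown rather than guessed.
CKSUM_TYPE_NAMES = {
    0x2: "adler",
    0x20: "T10PI4K",
}


def cksum_type_name(log_field):
    """Map the hex 'type' field of a checksum log message to a name."""
    value = int(log_field, 16)  # the log prints hex without the 0x prefix
    return CKSUM_TYPE_NAMES.get(value, "unknown (0x%x)" % value)


print(cksum_type_name("20"))
print(cksum_type_name("2"))
```

Read this way, "(type 20)" in the error message is 0x20 (decimal 32), which is why the log does not match the adler type that test_810 requested.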
            

            adilger Andreas Dilger added a comment -

            This is definitely not a duplicate of the T10 checksum bug. The sanity test_810 explicitly uses adler as the checksum type.

            pjones Peter Jones added a comment -

            ok - assuming to be a duplicate of LU-11697 unless proven otherwise

            lixi_wc Li Xi added a comment -

            Is it possible that T10PI4K (type=20) is not always measured as the fastest checksum type? In that case, only a few test runs would select T10PI4K as the checksum type, and those runs would fail 100% of the time.

            I think this failure might be caused by the bug fixed by patch:
            https://review.whamcloud.com/#/c/33727/4

            The client is ARM with a 64KB page size, right? I feel that more tests with a 64KB-page client and a 4KB-page server will help us to find more problems like this...
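
As a worked example of the page-size mismatch mentioned above, the following Python sketch checks how the failing extent [10240-20479] from the error message lines up with 4KB and 64KB pages. The two page sizes come from this comment; the helper `page_span` is hypothetical, and the arithmetic only shows the alignment, it does not establish the cause of the failure.

```python
def page_span(start, end, page_size):
    """Describe how an inclusive byte extent maps onto pages of page_size.

    Returns (first_page, last_page, start_aligned, end_aligned).
    """
    first_page = start // page_size
    last_page = end // page_size
    start_aligned = start % page_size == 0
    end_aligned = (end + 1) % page_size == 0
    return first_page, last_page, start_aligned, end_aligned


EXTENT = (10240, 20479)  # extent from the BAD WRITE CHECKSUM message
length = EXTENT[1] - EXTENT[0] + 1  # 10240 bytes

for size in (4096, 65536):
    print(size, page_span(EXTENT[0], EXTENT[1], size))
```

With 4KB pages the extent touches three pages and starts 2048 bytes into the first of them, whereas with 64KB pages the whole extent sits inside a single page, so the two sides would be splitting the same data along different page boundaries.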


            adilger Andreas Dilger added a comment -

            I thought this would be a 100% failure on ARM, but it seems that only a few runs are failing and the majority of test_810 runs pass.

            People

              dongyang Dongyang Li
              maloo Maloo