[LU-4937] LustreError: 2926:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: - Unable to send checksum

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Affects Version/s: Lustre 2.4.2
    • Environment: Lustre 2.4.2
      Red Hat Enterprise Linux Server release 6.5 (Santiago)
    • 3
    • 13646

    Description

      We noticed the following error on three of the Lustre clients. We would like to know why it is happening and whether it needs a fix.

      Apr 4 15:34:07 uc1n274 kernel: LustreError: 2926:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: server 172.26.8.15@o2ib set the 'checksum' bit, but didn't send a checksum. Not fatal, but please notify on http://bugs.whamcloud.com/
      Apr 4 16:38:55 uc1n484 kernel: LustreError: 2981:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: server 172.26.8.15@o2ib set the 'checksum' bit, but didn't send a checksum. Not fatal, but please notify on http://bugs.whamcloud.com/
      Apr 5 01:14:44 uc1n251 kernel: LustreError: 2914:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: server 172.26.8.15@o2ib set the 'checksum' bit, but didn't send a checksum. Not fatal, but please notify on http://bugs.whamcloud.com/

        Activity

          jlevi Jodi Levi (Inactive) added a comment -

          Yes. We are tracking this externally to land in b2_5 as well.
          Thank you!

          lflis Lukasz Flis added a comment -

          @Jodi: Could you please land the patch to b2_5 also?

          jlevi Jodi Levi (Inactive) added a comment -

          Patch landed to Master. Please reopen ticket if more work is needed.

          lflis Lukasz Flis added a comment -

          @Andreas:
          It seems that crc32 is much faster than adler32 on AMD 6276 CPUs:

          Using crypto hash: crc32 (crc32-pclmul) speed 3196 MB/s
          Using crypto hash: adler32 (adler32-zlib) speed 2335 MB/s
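
A small helper can pick the fastest algorithm out of log lines like the two above. This is only a sketch: the function name `fastest_hash` is hypothetical, and it assumes every line has the exact "Using crypto hash: NAME (IMPL) speed N MB/s" shape shown in this comment.

```sh
# Print the hash name with the highest reported speed.
# Reads "Using crypto hash: ..." lines on stdin; other lines are ignored.
fastest_hash() {
    awk '/Using crypto hash:/ {
        for (i = 1; i < NF; i++) {
            if ($i == "hash:") name = $(i + 1)       # token after "hash:"
            if ($i == "speed") speed = $(i + 1) + 0  # numeric MB/s value
        }
        if (speed > best) { best = speed; winner = name }
    }
    END { if (winner != "") print winner }'
}

# Sample input copied from the comment above.
printf '%s\n' \
    'Using crypto hash: crc32 (crc32-pclmul) speed 3196 MB/s' \
    'Using crypto hash: adler32 (adler32-zlib) speed 2335 MB/s' | fastest_hash
```

With the two sample lines this prints crc32, matching the observation in the comment.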

          adilger Andreas Dilger added a comment -

          Lukasz, you will probably get better performance with adler32 for the checksum on the AMD nodes instead of crc32:

          lctl set_param osc.*.checksum_type=adler32

          Bobijam, I don't know enough details of the CRC32c checksum to say for sure whether "0xffffffff" is a valid value or not. I know that for most checksums "0" is an invalid value because this would mean that the checksum for any length of zero bytes would be the same, which is why they always initialize the starting value to 0xffffffff. The use of ~0 to indicate that the checksum was not sent was originally added in 1.4.x, when the checksum feature was new, and is no longer needed. A mismatch between the client and server checksum (as is done with your patch) will detect corruption, and there is no reason to special-case the 0xffffffff checksum value.

          I've also submitted a version of your patch for master: http://review.whamcloud.com/10354

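
The difference between the old ~0 convention and a plain comparison can be modelled in a few lines of shell. This is an illustrative sketch only, not Lustre client code: the names `legacy_verdict` and `patched_verdict` are hypothetical, checksums are passed as plain integers, and it assumes (as the thread leaves open for CRC32c) that a genuine checksum could equal 0xffffffff.

```sh
# Hypothetical model of the two client-side checks; not Lustre source code.
NO_CKSUM=$((0xffffffff))   # the ~0 value once used to mean "checksum not sent"

# Legacy behaviour as described in the thread: ~0 from the server is
# reported as a protocol error before any comparison takes place.
legacy_verdict() {
    client=$(($1))
    server=$(($2))
    if [ "$server" -eq "$NO_CKSUM" ]; then
        echo "protocol-error"
    elif [ "$client" -ne "$server" ]; then
        echo "mismatch"
    else
        echo "ok"
    fi
}

# Patched behaviour: no special value, only a client/server comparison.
patched_verdict() {
    client=$(($1))
    server=$(($2))
    if [ "$client" -ne "$server" ]; then
        echo "mismatch"
    else
        echo "ok"
    fi
}

legacy_verdict  0xffffffff 0xffffffff   # both sides agree on ~0
patched_verdict 0xffffffff 0xffffffff
patched_verdict 0x12345678 0xffffffff   # real disagreement
```

In this model the legacy check flags the first case as a protocol error even though both sides agree, while the plain comparison accepts it and still reports the third case as a mismatch.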
          bobijam Zhenyu Xu added a comment -

          Andreas,

          Is it possible that crc32c has an issue in this case?

          lflis Lukasz Flis added a comment -

          I have found the difference that makes the AMD nodes unaffected by the problem:

          Intel nodes are using: crc32c
          AMD nodes default to: crc32

          As a workaround we switched the Intel nodes to crc32 (which is a bit slower in that case) and the problem is gone:

          for i in /proc/fs/lustre/osc/*/checksum_type; do echo crc32 > "$i"; done
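
Before switching types it can help to record which checksum type each OSC is currently using. The sketch below assumes that each checksum_type file lists the available types on one line with the active one in square brackets, for example "crc32 adler [crc32c]"; that format and the helper name `active_checksum_type` are assumptions, not taken from this ticket.

```sh
# Print the active checksum type (the bracketed token) from each input line.
# Assumed input format, one line per OSC: "crc32 adler [crc32c]"
active_checksum_type() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Example with a sample line. On a client the input would come from the
# same files the workaround loop writes to, e.g.:
#   cat /proc/fs/lustre/osc/*/checksum_type | active_checksum_type | sort | uniq -c
echo 'crc32 adler [crc32c]' | active_checksum_type
```

Lines without a bracketed token produce no output, so an unexpected file format shows up as missing entries rather than wrong ones.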

          lflis Lukasz Flis added a comment -

          This issue also affects 2.5.1 clients, so I'd suggest updating the affected version list.

          lflis Lukasz Flis added a comment -

          I will ask one of our users to verify the sanity of results generated by a patched client.

          The most interesting question is why an AMD-based client with 256GB RAM, an identical client version, and identical inputs is not affected by the ~0 problem.

          Assuming that the CRC algorithm works the same way on the client and the server, the only explanation that comes to my mind is different data alignment/content of the requests, but I'm just a user here.

          Is there anything else we can check here?

          bobijam Zhenyu Xu added a comment -

          I'm no expert on the checksum algorithm. If it's possible to get a ~0 checksum from the algorithm, I prefer to think this is not data corruption. What is your opinion from your Gaussian application: is the generated data correct and valid?

          People

            bobijam Zhenyu Xu
            rganesan@ddn.com Rajeshwaran Ganesan
            Votes: 2
            Watchers: 9
