LU-4937: LustreError: 2926:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: - Unable to send checksum

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.3
    • Affects Version/s: Lustre 2.4.2
    • Environment: Lustre 2.4.2, Red Hat Enterprise Linux Server release 6.5 (Santiago)
    • Severity: 3
    • 13646

    Description

      We noticed the following error on three of the Lustre clients. We would like to know why it is happening and whether it needs a fix:

      Apr 4 15:34:07 uc1n274 kernel: LustreError: 2926:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: server 172.26.8.15@o2ib set the 'checksum' bit, but didn't send a checksum. Not fatal, but please notify on http://bugs.whamcloud.com/
      Apr 4 16:38:55 uc1n484 kernel: LustreError: 2981:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: server 172.26.8.15@o2ib set the 'checksum' bit, but didn't send a checksum. Not fatal, but please notify on http://bugs.whamcloud.com/
      Apr 5 01:14:44 uc1n251 kernel: LustreError: 2914:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: server 172.26.8.15@o2ib set the 'checksum' bit, but didn't send a checksum. Not fatal, but please notify on http://bugs.whamcloud.com/

      Attachments

        Activity

          [LU-4937] LustreError: 2926:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: - Unable to send checksum
          lflis Lukasz Flis added a comment -

          This issue affects 2.5.1 clients as well, so I'd suggest updating the affected-version list.

          lflis Lukasz Flis added a comment -

          I will ask one of our users to verify the sanity of the results generated by a patched client.

          The most interesting question is why an AMD-based client with 256 GB of RAM, running an identical client version against identical inputs, is not affected by the ~0 problem.

          Assuming the CRC algorithm works the same way on the client and the server, the only explanation that comes to mind is different data alignment/content in the requests, but I'm just a user here.

          Is there anything else we can check here?
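          As a quick cross-check of that reasoning: CRC32c is a pure function of the byte values, so buffer alignment by itself cannot change a checksum; only the data content (and thus which byte patterns end up in each request) can. A minimal user-space sketch of this, using a plain bitwise CRC32c rather than Lustre's cfs_crypto code:

          #include <stdint.h>
          #include <stdio.h>
          #include <string.h>

          /* Bitwise CRC32c (Castagnoli, reflected polynomial 0x82F63B78),
           * the same algorithm as the "crc32c" hash named in the logs. */
          static uint32_t crc32c(const uint8_t *buf, size_t len)
          {
                  uint32_t crc = 0xFFFFFFFFu;
                  for (size_t i = 0; i < len; i++) {
                          crc ^= buf[i];
                          for (int b = 0; b < 8; b++)
                                  crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78u : 0);
                  }
                  return crc ^ 0xFFFFFFFFu;
          }

          int main(void)
          {
                  const char *msg = "identical bytes, different alignment";
                  size_t len = strlen(msg);
                  uint8_t backing[64];

                  /* CRC the same bytes at an aligned and a deliberately
                   * misaligned address: the results must be identical. */
                  memcpy(backing, msg, len);
                  uint32_t aligned = crc32c(backing, len);
                  memcpy(backing + 3, msg, len);
                  uint32_t misaligned = crc32c(backing + 3, len);

                  printf("aligned %08x, misaligned %08x\n", aligned, misaligned);
                  return aligned == misaligned ? 0 : 1;
          }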

          bobijam Zhenyu Xu added a comment -

          I'm no expert on the checksum algorithm, but if it is possible to get a ~0 checksum from the algorithm, I'm inclined to think this is not data corruption. What is your assessment from your Gaussian application: is the generated data correct and valid?

          lflis Lukasz Flis added a comment -

          Applied http://review.whamcloud.com/10309 to the client.
          Our testing shows that the I/O issue is gone.

          Can we consider this patch the final solution?
          Are we sure that the checksum problem is not the result of data corruption?

          Thanks
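          On the second question: the client already recomputes its own checksum over the received bulk data, so even a legitimate server checksum of ~0 can be verified like any other value. A minimal sketch of that disambiguation idea (an illustration only, not the actual logic of change 10309; the verify_bulk_checksum() helper is hypothetical):

          #include <stdint.h>
          #include <stdio.h>

          /* Compare a recomputed client checksum against the server's value
           * instead of treating ~0 as "no checksum was sent".  A mismatch
           * means corruption; a match is fine even when the (perfectly
           * legal) CRC32c value happens to be 0xffffffff. */
          static int verify_bulk_checksum(uint32_t server_cksum, uint32_t client_cksum)
          {
                  if (server_cksum != client_cksum) {
                          fprintf(stderr, "checksum mismatch: server %08x != client %08x\n",
                                  server_cksum, client_cksum);
                          return -1;      /* caller resends or fails the read */
                  }
                  return 0;               /* data verified */
          }

          int main(void)
          {
                  /* A server checksum of 0xffffffff passes when the client
                   * computes the same value over the same bytes. */
                  return verify_bulk_checksum(0xFFFFFFFFu, 0xFFFFFFFFu);
          }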

          bobijam Zhenyu Xu added a comment -

          If a client read pops this error, the read can return an error. Would you mind trying this patch to see whether the issue persists with it? http://review.whamcloud.com/10309

          lflis Lukasz Flis added a comment -

          It looks like some kind of data corruption. After the second occurrence of the log message, the Gaussian application exits with an I/O error:

          [pid 3283] read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1600) = 1600
          [pid 3283] read(4, "\307\317\377\377\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0ZN phthalocyanine, MP2/6-31G* op"..., 40856) = 40856
          [pid 3283] read(4, ">\10]\341\275\240\1@\31-u\221%8\347?*\321\377n]\226\4@\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0C,\303Z\225@\2@\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1787568) = 1787568
          [pid 3283] read(4, 0x7fff2f949b68, 8) = -1 EIO (Input/output error)

          bobijam Zhenyu Xu added a comment -

          This line raised my eyebrows:

          00000010:00008000:1.0:1399928810.277985:0:15665:0:(ost_handler.c:909:ost_brw_read()) checksum at read origin: ffffffff

          It shows that the checksum computed on the OST for the bulk data to be transferred to the client is ~0, which defeats the client-side checking code:

          osc_brw_fini_request
          ...
                          if (server_cksum == ~0 && rc > 0) {
                                  CERROR("Protocol error: server %s set the 'checksum' "
                                         "bit, but didn't send a checksum.  Not fatal, "
                                         "but please notify on http://bugs.whamcloud.com/\n",
                                         libcfs_nid2str(peer->nid));
                          }
          ...
          

          I am no expert here, but is it possible for the checksum of a chunk of data to be ~0?
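          For what it is worth: yes. CRC32c maps 4-byte messages one-to-one onto the 32-bit checksum space, so exactly one 4-byte input checksums to 0xffffffff, and any longer chunk of random data hits it with probability about 2^-32. The self-contained user-space sketch below (a table-driven CRC32c, not Lustre code) finds such an input by exhaustive search; the standard "123456789" test vector is checked first to confirm the implementation matches the crc32c used on the OST:

          #include <stdint.h>
          #include <stdio.h>

          static uint32_t T[256];

          /* Build the table for CRC32c (Castagnoli, reflected poly 0x82F63B78). */
          static void crc32c_init(void)
          {
                  for (uint32_t i = 0; i < 256; i++) {
                          uint32_t r = i;
                          for (int b = 0; b < 8; b++)
                                  r = (r >> 1) ^ ((r & 1) ? 0x82F63B78u : 0);
                          T[i] = r;
                  }
          }

          static uint32_t crc32c(const uint8_t *buf, size_t len)
          {
                  uint32_t crc = 0xFFFFFFFFu;
                  for (size_t i = 0; i < len; i++)
                          crc = (crc >> 8) ^ T[(crc ^ buf[i]) & 0xFF];
                  return crc ^ 0xFFFFFFFFu;
          }

          int main(void)
          {
                  crc32c_init();

                  /* Sanity check against the standard CRC32c test vector. */
                  if (crc32c((const uint8_t *)"123456789", 9) != 0xE3069283u) {
                          fprintf(stderr, "crc32c self-test failed\n");
                          return 1;
                  }

                  /* CRC32c is a bijection on 4-byte inputs, so exactly one of
                   * the 2^32 candidates checksums to ~0.  Exhaustive search;
                   * this can take a minute or two. */
                  for (uint64_t u = 0; u <= 0xFFFFFFFFull; u++) {
                          uint8_t b[4] = { (uint8_t)u, (uint8_t)(u >> 8),
                                           (uint8_t)(u >> 16), (uint8_t)(u >> 24) };
                          if (crc32c(b, 4) == 0xFFFFFFFFu) {
                                  printf("crc32c(%02x %02x %02x %02x) = ffffffff\n",
                                         b[0], b[1], b[2], b[3]);
                                  return 0;
                          }
                  }
                  return 1;       /* unreachable: a preimage must exist */
          }

          So a server checksum of ffffffff can be a perfectly valid CRC32c of the bulk data, and using ~0 as a "checksum missing" sentinel, as the quoted client code does, is ambiguous by construction.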

          lflis Lukasz Flis added a comment -

          Additional info:

          • The problem has been observed on clients with 12 cores and 16-24 GB of RAM; it is easy to reproduce there.
          • It has never occurred on clients with 64 cores and 256 GB of RAM, nor were we able to reproduce it on that type of node.
          lflis Lukasz Flis added a comment -

          Uploaded another log set:
          LU-4937-good.tar.gz
          This time it is the view from the OST and the client for the same events:

          00000001:00000040:1.0:1399928810.277979:0:15665:0:(linux-crypto.c:82:cfs_crypto_hash_alloc()) Using crypto hash: crc32c (crc32c-intel) speed 2673 MB/s
          00000010:00000040:1.0:1399928810.277982:0:15665:0:(ost_handler.c:626:ost_checksum_bulk()) Checksum for algo crc32c
          00000010:00008000:1.0:1399928810.277985:0:15665:0:(ost_handler.c:909:ost_brw_read()) checksum at read origin: ffffffff
          00010000:00000040:1.0:1399928810.278234:0:15665:0:(ldlm_resource.c:1223:ldlm_resource_getref()) getref res: ffff88016d6c25c0 count: 2
          00010000:00000040:1.0:1399928810.278251:0:15665:0:(ldlm_resource.c:1262:ldlm_resource_putref()) putref res: ffff88016d6c25c0 count: 1
          00000020:00000040:1.0:1399928810.278255:0:15665:0:(genops.c:814:class_export_put()) PUTting export ffff880528ffd400 : new refcount 7
          00010000:00000040:1.0:1399928810.278258:0:15665:0:(ldlm_lib.c:2377:target_committed_to_req()) last_committed 38665664797, transno 0, xid 1467919094435880
          00000100:00000040:1.0:1399928810.278263:0:15665:0:(connection.c:141:ptlrpc_connection_addref()) conn=ffff880605fb16c0 refcount 4 to 172.16.202.160@o2ib
          00000100:00000040:1.0:1399928810.278266:0:15665:0:(niobuf.c:64:ptl_send_buf()) conn=ffff880605fb16c0 id 12345-172.16.202.160@o2ib
          00000100:00000040:1.0:1399928810.278276:0:15665:0:(connection.c:127:ptlrpc_connection_put()) PUT conn=ffff880605fb16c0 refcount 3 to 172.16.202.160@o2ib
          00000100:00000040:1.0:1399928810.278280:0:15665:0:(lustre_net.h:3283:ptlrpc_rqphase_move()) @@@ move req "Interpret" > "Complete" req@ffff88058e0fb000 x1467919094435880/t0(0) o3>4b54d2df-5ff6-9771-58f4-76c9ae6ad354@172.16.202.160@o2ib:0/0 lens 488/400 e 0 to 0 dl 1399928816 ref 1 fl Interpret:/0/0 rc 4096/4096
          00000100:00000040:1.0:1399928810.278288:0:15665:0:(service.c:1046:ptlrpc_server_finish_active_request()) RPC PUTting export ffff880528ffd400 : new rpc_count 0

          lflis Lukasz Flis added a comment - edited

          I have uploaded two logs: one for the OST and one from the client.
          The logs are from different runs but show the same bug. We are able to reproduce it every time.

          Files have been uploaded to FTP: uploads/LU-4937

          Flags: rpctrace,info,page

          LU-cksum-client-2.5.1.cyfronet.log.gz
          LU-cksum-ost-2.5.1.cyfronet.log.gz

          Could you please have a look? If something is missing, please let me know.
          The problem is present in 2.4 as well as in 2.5.1 clients.

          Affected client IP: 172.16.202.160

          bobijam Zhenyu Xu added a comment -

          Please add +info as well.


          People

            Assignee: bobijam Zhenyu Xu
            Reporter: rganesan@ddn.com Rajeshwaran Ganesan
            Votes: 2
            Watchers: 9

            Dates

              Created:
              Updated:
              Resolved: