Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4937

LustreError: 2926:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: - Unable to send checksum

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.6.0, Lustre 2.5.3
    • Lustre 2.4.2
    • Lustre 2.4.2
      Red Hat Enterprise Linux Server release 6.5 (Santiago)
    • 3
    • 13646

    Description

      We noticed the following error on three of the lustre clients, we like to know why its happening and does it need any fix..

      Apr 4 15:34:07 uc1n274 kernel: LustreError: 2926:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: server 172.26.8.15@o2ib set the 'checksum' bit, but didn't send a checksum. Not fatal, but please notify on http://bugs.whamcloud.com/
      Apr 4 16:38:55 uc1n484 kernel: LustreError: 2981:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: server 172.26.8.15@o2ib set the 'checksum' bit, but didn't send a checksum. Not fatal, but please notify on http://bugs.whamcloud.com/
      Apr 5 01:14:44 uc1n251 kernel: LustreError: 2914:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: server 172.26.8.15@o2ib set the 'checksum' bit, but didn't send a checksum. Not fatal, but please notify on http://bugs.whamcloud.com/

      Attachments

        Activity

          [LU-4937] LustreError: 2926:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: - Unable to send checksum
          bobijam Zhenyu Xu added a comment -

          I'm no expert on the checksum algorithm, if it's possible to get a ~0 checksum from the algorithm, I prefer to think that's not a data corruption here. What's your opinion from your gaussian application, does the data generated correct and valid?

          bobijam Zhenyu Xu added a comment - I'm no expert on the checksum algorithm, if it's possible to get a ~0 checksum from the algorithm, I prefer to think that's not a data corruption here. What's your opinion from your gaussian application, does the data generated correct and valid?
          lflis Lukasz Flis added a comment -

          Applied http://review.whamcloud.com/10309 to the client.
          Our testing shows that I/O issue is gone.

          Can we assume this patch as final solution?
          Are we sure that the checksum problem is not a result of a data corruption?

          Thanks

          lflis Lukasz Flis added a comment - Applied http://review.whamcloud.com/10309 to the client. Our testing shows that I/O issue is gone. Can we assume this patch as final solution? Are we sure that the checksum problem is not a result of a data corruption? Thanks
          bobijam Zhenyu Xu added a comment -

          If client read pops this error, the read could return error. Would you mind trying this patch to see whether this issue persists w/ it? http://review.whamcloud.com/10309

          bobijam Zhenyu Xu added a comment - If client read pops this error, the read could return error. Would you mind trying this patch to see whether this issue persists w/ it? http://review.whamcloud.com/10309
          lflis Lukasz Flis added a comment -

          It looks like a kind of data corruption. After the second occurence of a log message gaussian application is exiting with I/O error:

          [pid 3283] read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1600) = 1600
          [pid 3283] read(4, "\307\317\377\377\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0ZN phthalocyanine, MP2/6-31G* op"..., 40856) = 40856
          [pid 3283] read(4, ">\10]\341\275\240\1@\31-u\221%8\347?*\321\377n]\226\4@\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0C,\303Z\225@\2@\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1787568) = 1787568
          [pid 3283] read(4, 0x7fff2f949b68, 8) = -1 EIO (Input/output error)

          lflis Lukasz Flis added a comment - It looks like a kind of data corruption. After the second occurence of a log message gaussian application is exiting with I/O error: [pid 3283] read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1600) = 1600 [pid 3283] read(4, "\307\317\377\377\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0ZN phthalocyanine, MP2/6-31G* op"..., 40856) = 40856 [pid 3283] read(4, ">\10]\341\275\240\1@\31-u\221%8\347?*\321\377n]\226\4@\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0C,\303Z\225@\2@\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1787568) = 1787568 [pid 3283] read(4, 0x7fff2f949b68, 8) = -1 EIO (Input/output error)
          bobijam Zhenyu Xu added a comment -

          this line raised my eyebrows

          00000010:00008000:1.0:1399928810.277985:0:15665:0:(ost_handler.c:909:ost_brw_read()) checksum at read origin: ffffffff

          this shows that the checksum for the bulk data on the OST to be transferred to client is ~0, which makes the client checking code invalid

          osc_brw_fini_request
          ...
                          if (server_cksum == ~0 && rc > 0) {
                                  CERROR("Protocol error: server %s set the 'checksum' "
                                         "bit, but didn't send a checksum.  Not fatal, "
                                         "but please notify on http://bugs.whamcloud.com/\n",
                                         libcfs_nid2str(peer->nid));
                          }
          ...
          

          I am no expert here, but is it possible that the checksum of a chunk of data to be ~0?

          bobijam Zhenyu Xu added a comment - this line raised my eyebrows 00000010:00008000:1.0:1399928810.277985:0:15665:0:(ost_handler.c:909:ost_brw_read()) checksum at read origin: ffffffff this shows that the checksum for the bulk data on the OST to be transferred to client is ~0, which makes the client checking code invalid osc_brw_fini_request ... if (server_cksum == ~0 && rc > 0) { CERROR( "Protocol error: server %s set the 'checksum' " "bit, but didn't send a checksum. Not fatal, " "but please notify on http: //bugs.whamcloud.com/\n" , libcfs_nid2str(peer->nid)); } ... I am no expert here, but is it possible that the checksum of a chunk of data to be ~0?
          lflis Lukasz Flis added a comment -

          Additional info:

          • problem has been observed on clients with 12 cores and 16-24GB of RAM, easy to reproduce
          • it never occured on clients with 64 cores and 256GB nor we were able to reproduce it on this type of node
          lflis Lukasz Flis added a comment - Additional info: problem has been observed on clients with 12 cores and 16-24GB of RAM, easy to reproduce it never occured on clients with 64 cores and 256GB nor we were able to reproduce it on this type of node
          lflis Lukasz Flis added a comment -

          Uploaded another log set:
          LU-4937-good.tar.gz
          this time it's the view from ost and client for the same events

          00000001:00000040:1.0:1399928810.277979:0:15665:0:(linux-crypto.c:82:cfs_crypto_hash_alloc()) Using crypto hash: crc32c (crc32c-intel) speed 2673 MB/s
          00000010:00000040:1.0:1399928810.277982:0:15665:0:(ost_handler.c:626:ost_checksum_bulk()) Checksum for algo crc32c
          00000010:00008000:1.0:1399928810.277985:0:15665:0:(ost_handler.c:909:ost_brw_read()) checksum at read origin: ffffffff
          00010000:00000040:1.0:1399928810.278234:0:15665:0:(ldlm_resource.c:1223:ldlm_resource_getref()) getref res: ffff88016d6c25c0 count: 2
          00010000:00000040:1.0:1399928810.278251:0:15665:0:(ldlm_resource.c:1262:ldlm_resource_putref()) putref res: ffff88016d6c25c0 count: 1
          00000020:00000040:1.0:1399928810.278255:0:15665:0:(genops.c:814:class_export_put()) PUTting export ffff880528ffd400 : new refcount 7
          00010000:00000040:1.0:1399928810.278258:0:15665:0:(ldlm_lib.c:2377:target_committed_to_req()) last_committed 38665664797, transno 0, xid 1467919094435880
          00000100:00000040:1.0:1399928810.278263:0:15665:0:(connection.c:141:ptlrpc_connection_addref()) conn=ffff880605fb16c0 refcount 4 to 172.16.202.160@o2ib
          00000100:00000040:1.0:1399928810.278266:0:15665:0:(niobuf.c:64:ptl_send_buf()) conn=ffff880605fb16c0 id 12345-172.16.202.160@o2ib
          00000100:00000040:1.0:1399928810.278276:0:15665:0:(connection.c:127:ptlrpc_connection_put()) PUT conn=ffff880605fb16c0 refcount 3 to 172.16.202.160@o2ib
          00000100:00000040:1.0:1399928810.278280:0:15665:0:(lustre_net.h:3283:ptlrpc_rqphase_move()) @@@ move req "Interpret" > "Complete" req@ffff88058e0fb000 x1467919094435880/t0(0) o3>4b54d2df-5ff6-9771-58f4-76c9ae6ad354@172.16.202.160@o2ib:0/0 lens 488/400 e 0 to 0 dl 1399928816 ref 1 fl Interpret:/0/0 rc 4096/4096
          00000100:00000040:1.0:1399928810.278288:0:15665:0:(service.c:1046:ptlrpc_server_finish_active_request()) RPC PUTting export ffff880528ffd400 : new rpc_count 0

          lflis Lukasz Flis added a comment - Uploaded another log set: LU-4937 -good.tar.gz this time it's the view from ost and client for the same events 00000001:00000040:1.0:1399928810.277979:0:15665:0:(linux-crypto.c:82:cfs_crypto_hash_alloc()) Using crypto hash: crc32c (crc32c-intel) speed 2673 MB/s 00000010:00000040:1.0:1399928810.277982:0:15665:0:(ost_handler.c:626:ost_checksum_bulk()) Checksum for algo crc32c 00000010:00008000:1.0:1399928810.277985:0:15665:0:(ost_handler.c:909:ost_brw_read()) checksum at read origin: ffffffff 00010000:00000040:1.0:1399928810.278234:0:15665:0:(ldlm_resource.c:1223:ldlm_resource_getref()) getref res: ffff88016d6c25c0 count: 2 00010000:00000040:1.0:1399928810.278251:0:15665:0:(ldlm_resource.c:1262:ldlm_resource_putref()) putref res: ffff88016d6c25c0 count: 1 00000020:00000040:1.0:1399928810.278255:0:15665:0:(genops.c:814:class_export_put()) PUTting export ffff880528ffd400 : new refcount 7 00010000:00000040:1.0:1399928810.278258:0:15665:0:(ldlm_lib.c:2377:target_committed_to_req()) last_committed 38665664797, transno 0, xid 1467919094435880 00000100:00000040:1.0:1399928810.278263:0:15665:0:(connection.c:141:ptlrpc_connection_addref()) conn=ffff880605fb16c0 refcount 4 to 172.16.202.160@o2ib 00000100:00000040:1.0:1399928810.278266:0:15665:0:(niobuf.c:64:ptl_send_buf()) conn=ffff880605fb16c0 id 12345-172.16.202.160@o2ib 00000100:00000040:1.0:1399928810.278276:0:15665:0:(connection.c:127:ptlrpc_connection_put()) PUT conn=ffff880605fb16c0 refcount 3 to 172.16.202.160@o2ib 00000100:00000040:1.0:1399928810.278280:0:15665:0:(lustre_net.h:3283:ptlrpc_rqphase_move()) @@@ move req "Interpret" > "Complete" req@ffff88058e0fb000 x1467919094435880/t0(0) o3 >4b54d2df-5ff6-9771-58f4-76c9ae6ad354@172.16.202.160@o2ib:0/0 lens 488/400 e 0 to 0 dl 1399928816 ref 1 fl Interpret:/0/0 rc 4096/4096 00000100:00000040:1.0:1399928810.278288:0:15665:0:(service.c:1046:ptlrpc_server_finish_active_request()) RPC PUTting export ffff880528ffd400 : new rpc_count 0
          lflis Lukasz Flis added a comment - - edited

          I have uploaded two logs: one for OST and one from Client.
          The logs are from different runs but show the same bug. We are able to reproduce it
          every time.

          Files have been uploaded to FTP: uploads/LU-4937

          Flags: rpctrace,info,page

          LU-cksum-client-2.5.1.cyfronet.log.gz
          LU-cksum-ost-2.5.1.cyfronet.log.gz

          Could you please have a look. If something is missing please let me know.
          Problem is present in 2.4 as well as in 2.5.1 clients

          Affected client IP: 172.16.202.160

          lflis Lukasz Flis added a comment - - edited I have uploaded two logs: one for OST and one from Client. The logs are from different runs but show the same bug. We are able to reproduce it every time. Files have been uploaded to FTP: uploads/ LU-4937 Flags: rpctrace,info,page LU-cksum-client-2.5.1.cyfronet.log.gz LU-cksum-ost-2.5.1.cyfronet.log.gz Could you please have a look. If something is missing please let me know. Problem is present in 2.4 as well as in 2.5.1 clients Affected client IP: 172.16.202.160
          bobijam Zhenyu Xu added a comment -

          please add +info as well.

          bobijam Zhenyu Xu added a comment - please add +info as well.
          lflis Lukasz Flis added a comment -

          We can confirm the same problem exists with 2.4.1 client and OSS at 2.5.1 and 2.4.x

          It happens during Gaussian application runs.
          Do you need any additional debug info other than +page?

          lflis Lukasz Flis added a comment - We can confirm the same problem exists with 2.4.1 client and OSS at 2.5.1 and 2.4.x It happens during Gaussian application runs. Do you need any additional debug info other than +page?
          bobijam Zhenyu Xu added a comment -

          The log collected does not reveal relevant information in it. Please enable "+page" debug mask on OST and try to reproduce the issue and collect OST log again.

          Since the error on client is from code as follows

                  if (body->oa.o_valid & OBD_MD_FLCKSUM) {
                          static int cksum_counter;
                          __u32      server_cksum = body->oa.o_cksum;
          ...
                          if (server_cksum == ~0 && rc > 0) {
                                  CERROR("Protocol error: server %s set the 'checksum' "
                                         "bit, but didn't send a checksum.  Not fatal, "
                                         "but please notify on http://bugs.whamcloud.com/\n",
                                         libcfs_nid2str(peer->nid));
                          }
          

          which means the server_cksum is -1 which is unlikely, I've checked OST code, it set server cksum when it set OBD_MD_FLCKSUM flag, I want to check the server side of story

          ost_brw_read()
                  if (body->oa.o_valid & OBD_MD_FLCKSUM) {
                          cksum_type_t cksum_type =
                                  cksum_type_unpack(repbody->oa.o_valid & OBD_MD_FLFLAGS ?
                                                    repbody->oa.o_flags : 0);
                          repbody->oa.o_flags = cksum_type_pack(cksum_type);
                          repbody->oa.o_valid = OBD_MD_FLCKSUM | OBD_MD_FLFLAGS;
                          repbody->oa.o_cksum = ost_checksum_bulk(desc, OST_READ,cksum_type);
                          CDEBUG(D_PAGE, "checksum at read origin: %x\n",
                                 repbody->oa.o_cksum);
                  }
          
          bobijam Zhenyu Xu added a comment - The log collected does not reveal relevant information in it. Please enable "+page" debug mask on OST and try to reproduce the issue and collect OST log again. Since the error on client is from code as follows if (body->oa.o_valid & OBD_MD_FLCKSUM) { static int cksum_counter; __u32 server_cksum = body->oa.o_cksum; ... if (server_cksum == ~0 && rc > 0) { CERROR( "Protocol error: server %s set the 'checksum' " "bit, but didn't send a checksum. Not fatal, " "but please notify on http: //bugs.whamcloud.com/\n" , libcfs_nid2str(peer->nid)); } which means the server_cksum is -1 which is unlikely, I've checked OST code, it set server cksum when it set OBD_MD_FLCKSUM flag, I want to check the server side of story ost_brw_read() if (body->oa.o_valid & OBD_MD_FLCKSUM) { cksum_type_t cksum_type = cksum_type_unpack(repbody->oa.o_valid & OBD_MD_FLFLAGS ? repbody->oa.o_flags : 0); repbody->oa.o_flags = cksum_type_pack(cksum_type); repbody->oa.o_valid = OBD_MD_FLCKSUM | OBD_MD_FLFLAGS; repbody->oa.o_cksum = ost_checksum_bulk(desc, OST_READ,cksum_type); CDEBUG(D_PAGE, "checksum at read origin: %x\n" , repbody->oa.o_cksum); }

          People

            bobijam Zhenyu Xu
            rganesan@ddn.com Rajeshwaran Ganesan
            Votes:
            2 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: