[LU-4937]  LustreError: 2926:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: - Unable to send checksum Created: 21/Apr/14  Updated: 31/Aug/15  Resolved: 29/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: Lustre 2.6.0, Lustre 2.5.3

Type: Bug Priority: Minor
Reporter: Rajeshwaran Ganesan Assignee: Zhenyu Xu
Resolution: Fixed Votes: 2
Labels: llnl, mn4
Environment:

Lustre 2.4.2
Red Hat Enterprise Linux Server release 6.5 (Santiago)


Issue Links:
Related
Severity: 3
Rank (Obsolete): 13646

 Description   

We noticed the following error on three of the Lustre clients. We would like to know why it is happening and whether it needs any fix:

Apr 4 15:34:07 uc1n274 kernel: LustreError: 2926:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: server 172.26.8.15@o2ib set the 'checksum' bit, but didn't send a checksum. Not fatal, but please notify on http://bugs.whamcloud.com/
Apr 4 16:38:55 uc1n484 kernel: LustreError: 2981:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: server 172.26.8.15@o2ib set the 'checksum' bit, but didn't send a checksum. Not fatal, but please notify on http://bugs.whamcloud.com/
Apr 5 01:14:44 uc1n251 kernel: LustreError: 2914:0:(osc_request.c:1608:osc_brw_fini_request()) Protocol error: server 172.26.8.15@o2ib set the 'checksum' bit, but didn't send a checksum. Not fatal, but please notify on http://bugs.whamcloud.com/



 Comments   
Comment by Peter Jones [ 21/Apr/14 ]

Bobijam

Could you please comment on this one?

Thanks

Peter

Comment by Zhenyu Xu [ 23/Apr/14 ]

What Lustre versions are in use on the client and the OST server?

Comment by Rajeshwaran Ganesan [ 23/Apr/14 ]

Hello,
OST/MDS servers are on 2.4.1; clients are on 2.4.2.

I have uploaded the client logs to ftp.whamcloud.com.

Thanks

Comment by Zhenyu Xu [ 23/Apr/14 ]

Please also upload the OST logs.

By the way, I couldn't find the client log:

ls -a ftp/uploads/LU-4937
. ..

Comment by Rajeshwaran Ganesan [ 23/Apr/14 ]

Please try now..!

Comment by Zhenyu Xu [ 23/Apr/14 ]

I still need the OST logs.

Comment by Rajeshwaran Ganesan [ 29/Apr/14 ]

I have added the OST logs to the FTP site; please verify.

Comment by Zhenyu Xu [ 29/Apr/14 ]

The collected log does not reveal any relevant information. Please enable the "+page" debug mask on the OST, try to reproduce the issue, and collect the OST log again.
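
For reference, a minimal sketch of enabling and collecting that debug data with lctl on the OSS (assuming a standard Lustre 2.x setup; the dump file path is only an example):

lctl set_param debug=+page      # add the D_PAGE debug flag
lctl set_param debug=+info      # D_INFO is requested later in this ticket as well
lctl clear                      # clear the existing kernel debug buffer
# ... reproduce the issue ...
lctl dk /tmp/ost-debug.log      # dump the kernel debug log to a file for upload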

The error on the client comes from the following code:

        if (body->oa.o_valid & OBD_MD_FLCKSUM) {
                static int cksum_counter;
                __u32      server_cksum = body->oa.o_cksum;
...
                if (server_cksum == ~0 && rc > 0) {
                        CERROR("Protocol error: server %s set the 'checksum' "
                               "bit, but didn't send a checksum.  Not fatal, "
                               "but please notify on http://bugs.whamcloud.com/\n",
                               libcfs_nid2str(peer->nid));
                }

This means server_cksum is ~0 (-1), which is unlikely. I've checked the OST code: it sets the server checksum whenever it sets the OBD_MD_FLCKSUM flag, so I want to check the server side of the story:

ost_brw_read()
        if (body->oa.o_valid & OBD_MD_FLCKSUM) {
                cksum_type_t cksum_type =
                        cksum_type_unpack(repbody->oa.o_valid & OBD_MD_FLFLAGS ?
                                          repbody->oa.o_flags : 0);
                repbody->oa.o_flags = cksum_type_pack(cksum_type);
                repbody->oa.o_valid = OBD_MD_FLCKSUM | OBD_MD_FLFLAGS;
                repbody->oa.o_cksum = ost_checksum_bulk(desc, OST_READ,cksum_type);
                CDEBUG(D_PAGE, "checksum at read origin: %x\n",
                       repbody->oa.o_cksum);
        }
Comment by Lukasz Flis [ 30/Apr/14 ]

We can confirm that the same problem exists with a 2.4.1 client and OSSes at 2.5.1 and 2.4.x.

It happens during Gaussian application runs.
Do you need any additional debug info other than +page?

Comment by Zhenyu Xu [ 30/Apr/14 ]

please add +info as well.

Comment by Lukasz Flis [ 12/May/14 ]

I have uploaded two logs: one from the OST and one from the client.
The logs are from different runs but show the same bug; we are able to reproduce it every time.

Files have been uploaded to FTP: uploads/LU-4937

Flags: rpctrace,info,page

LU-cksum-client-2.5.1.cyfronet.log.gz
LU-cksum-ost-2.5.1.cyfronet.log.gz

Could you please have a look? If something is missing, please let me know.
The problem is present in 2.4 clients as well as in 2.5.1 clients.

Affected client IP: 172.16.202.160

Comment by Lukasz Flis [ 12/May/14 ]

Uploaded another log set: LU-4937-good.tar.gz
This time it shows the view from both the OST and the client for the same events:

00000001:00000040:1.0:1399928810.277979:0:15665:0:(linux-crypto.c:82:cfs_crypto_hash_alloc()) Using crypto hash: crc32c (crc32c-intel) speed 2673 MB/s
00000010:00000040:1.0:1399928810.277982:0:15665:0:(ost_handler.c:626:ost_checksum_bulk()) Checksum for algo crc32c
00000010:00008000:1.0:1399928810.277985:0:15665:0:(ost_handler.c:909:ost_brw_read()) checksum at read origin: ffffffff
00010000:00000040:1.0:1399928810.278234:0:15665:0:(ldlm_resource.c:1223:ldlm_resource_getref()) getref res: ffff88016d6c25c0 count: 2
00010000:00000040:1.0:1399928810.278251:0:15665:0:(ldlm_resource.c:1262:ldlm_resource_putref()) putref res: ffff88016d6c25c0 count: 1
00000020:00000040:1.0:1399928810.278255:0:15665:0:(genops.c:814:class_export_put()) PUTting export ffff880528ffd400 : new refcount 7
00010000:00000040:1.0:1399928810.278258:0:15665:0:(ldlm_lib.c:2377:target_committed_to_req()) last_committed 38665664797, transno 0, xid 1467919094435880
00000100:00000040:1.0:1399928810.278263:0:15665:0:(connection.c:141:ptlrpc_connection_addref()) conn=ffff880605fb16c0 refcount 4 to 172.16.202.160@o2ib
00000100:00000040:1.0:1399928810.278266:0:15665:0:(niobuf.c:64:ptl_send_buf()) conn=ffff880605fb16c0 id 12345-172.16.202.160@o2ib
00000100:00000040:1.0:1399928810.278276:0:15665:0:(connection.c:127:ptlrpc_connection_put()) PUT conn=ffff880605fb16c0 refcount 3 to 172.16.202.160@o2ib
00000100:00000040:1.0:1399928810.278280:0:15665:0:(lustre_net.h:3283:ptlrpc_rqphase_move()) @@@ move req "Interpret" > "Complete" req@ffff88058e0fb000 x1467919094435880/t0(0) o3>4b54d2df-5ff6-9771-58f4-76c9ae6ad354@172.16.202.160@o2ib:0/0 lens 488/400 e 0 to 0 dl 1399928816 ref 1 fl Interpret:/0/0 rc 4096/4096
00000100:00000040:1.0:1399928810.278288:0:15665:0:(service.c:1046:ptlrpc_server_finish_active_request()) RPC PUTting export ffff880528ffd400 : new rpc_count 0

Comment by Lukasz Flis [ 12/May/14 ]

Additional info:

  • The problem has been observed on clients with 12 cores and 16-24 GB of RAM, where it is easy to reproduce.
  • It never occurred on clients with 64 cores and 256 GB of RAM, nor were we able to reproduce it on that type of node.

Comment by Zhenyu Xu [ 13/May/14 ]

This line raised my eyebrows:

00000010:00008000:1.0:1399928810.277985:0:15665:0:(ost_handler.c:909:ost_brw_read()) checksum at read origin: ffffffff

This shows that the checksum computed on the OST for the bulk data to be transferred to the client is ~0, which invalidates the assumption in the client checking code:

osc_brw_fini_request
...
                if (server_cksum == ~0 && rc > 0) {
                        CERROR("Protocol error: server %s set the 'checksum' "
                               "bit, but didn't send a checksum.  Not fatal, "
                               "but please notify on http://bugs.whamcloud.com/\n",
                               libcfs_nid2str(peer->nid));
                }
...

I am no expert here, but is it possible for the checksum of a chunk of data to be ~0?
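
For context on that question, here is a minimal standalone sketch (not Lustre code) of the bit-by-bit CRC-32C (Castagnoli) computation that the crc32c/crc32c-intel drivers accelerate. Nothing in the algorithm reserves or remaps 0xffffffff as an output value; the "123456789" buffer below is just the standard check vector, not data from this ticket:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Bit-by-bit CRC-32C: reflected polynomial 0x82F63B78 (Castagnoli),
 * initial value 0xFFFFFFFF, final XOR 0xFFFFFFFF. */
static uint32_t crc32c(const uint8_t *buf, size_t len)
{
        uint32_t crc = 0xFFFFFFFFu;

        while (len--) {
                crc ^= *buf++;
                for (int k = 0; k < 8; k++)
                        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
        }
        return crc ^ 0xFFFFFFFFu;
}

int main(void)
{
        const char *check = "123456789";

        /* Standard CRC-32C check value for "123456789" is 0xE3069283. */
        printf("crc32c(\"%s\") = %08x\n", check,
               (unsigned)crc32c((const uint8_t *)check, strlen(check)));

        /* The result is an ordinary 32-bit value; 0xffffffff is not
         * excluded by the algorithm, so a valid data block can hash to ~0
         * and trip the client-side sentinel check quoted above. */
        return 0;
}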

Comment by Lukasz Flis [ 13/May/14 ]

It looks like a kind of data corruption. After the second occurrence of the log message, the Gaussian application exits with an I/O error:

[pid 3283] read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1600) = 1600
[pid 3283] read(4, "\307\317\377\377\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0ZN phthalocyanine, MP2/6-31G* op"..., 40856) = 40856
[pid 3283] read(4, ">\10]\341\275\240\1@\31-u\221%8\347?*\321\377n]\226\4@\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0C,\303Z\225@\2@\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1787568) = 1787568
[pid 3283] read(4, 0x7fff2f949b68, 8) = -1 EIO (Input/output error)

Comment by Zhenyu Xu [ 13/May/14 ]

If a client read hits this error, the read can return an error. Would you mind trying this patch to see whether the issue persists with it? http://review.whamcloud.com/10309

Comment by Lukasz Flis [ 13/May/14 ]

Applied http://review.whamcloud.com/10309 to the client.
Our testing shows that the I/O issue is gone.

Can we consider this patch the final solution?
Are we sure that the checksum problem is not the result of data corruption?

Thanks

Comment by Zhenyu Xu [ 13/May/14 ]

I'm no expert on the checksum algorithm, but if it is possible for the algorithm to produce a ~0 checksum, I prefer to think this is not data corruption. What is your view from the Gaussian application side: is the data it generates correct and valid?

Comment by Lukasz Flis [ 13/May/14 ]

I will ask one of our users to verify the sanity of the results generated by a patched client.

The most interesting thing is why an AMD-based client with 256 GB of RAM, an identical client version, and identical inputs is not affected by the ~0 problem.

Assuming the CRC algorithm works in the same way on the client and the server, the only explanation that comes to my mind is different data alignment/content in the requests, but I'm just a user here.

Is there anything else we can check here?

Comment by Lukasz Flis [ 13/May/14 ]

This issue also affects 2.5.1 clients, so I'd suggest updating the affected version list.

Comment by Lukasz Flis [ 14/May/14 ]

I have found the difference that makes the AMD nodes unaffected by the problem:

Intel nodes are using: crc32c
AMD nodes default to: crc32

As a workaround we switched the Intel nodes to crc32 (which is a bit slower in that case) and the problem is gone:

for i in /proc/fs/lustre/osc/*/checksum_type;do echo crc32 > $i; done;
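
The same workaround should also be achievable via lctl (mirroring the set_param syntax Andreas uses below for adler32):

lctl set_param osc.*.checksum_type=crc32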

Comment by Zhenyu Xu [ 15/May/14 ]

Andreas,

Is it possible that crc32c has an issue in this case?

Comment by Andreas Dilger [ 16/May/14 ]

Lukasz, you will probably get better performance with adler32 for the checksum on the AMD nodes instead of crc32:

lctl set_param osc.*.checksum_type=adler32

Bobijam, I don't know enough details of the CRC32c checksum to say for sure whether "0xffffffff" is a valid value or not. I know that for most checksums "0" is an invalid value, because this would mean that the checksum for any length of zero bytes would be the same, which is why they always initialize the starting value to 0xffffffff. The use of ~0 to indicate that the checksum was not sent was originally added in 1.4.x when the checksum feature was new, and is no longer needed. A mismatch between the client and server checksums (as is done with your patch) will detect corruption, and there is no reason to special-case the 0xffffffff checksum value.

I've also submitted a version of your patch for master: http://review.whamcloud.com/10354
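
As a rough illustration of the direction described above (a sketch only, not the content of either patch; the helper name and error handling are placeholders), the client-side check then reduces to comparing the server-supplied checksum against the checksum the client recomputes over the received bulk data, with no ~0 special case:

        if (body->oa.o_valid & OBD_MD_FLCKSUM) {
                __u32 server_cksum = body->oa.o_cksum;
                /* Recompute over the pages actually received. Placeholder
                 * call: the real code derives the checksum type from
                 * oa.o_flags and walks the bulk pages. */
                __u32 client_cksum = checksum_of_received_bulk(req);

                if (client_cksum != server_cksum) {
                        CERROR("checksum mismatch: client %x != server %x\n",
                               client_cksum, server_cksum);
                        rc = -EAGAIN; /* placeholder error handling */
                }
        }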

Comment by Lukasz Flis [ 19/May/14 ]

@Andreas:
It seems that crc32 is much faster than adler32 on AMD 6276 CPUs:

Using crypto hash: crc32 (crc32-pclmul) speed 3196 MB/s
Using crypto hash: adler32 (adler32-zlib) speed 2335 MB/s

Comment by Jodi Levi (Inactive) [ 29/May/14 ]

Patch landed to master. Please reopen the ticket if more work is needed.

Comment by Lukasz Flis [ 17/Jul/14 ]

@Jodi: Could you please land the patch to b2_5 also?

Comment by Jodi Levi (Inactive) [ 17/Jul/14 ]

Yes. We are tracking this externally to land in b2_5 as well.
Thank you!
