[LU-3123] A client could not communicate with an OSS due to Timed out RDMA Created: 08/Apr/13  Updated: 02/Apr/14  Resolved: 02/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.8
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Mitsuhiro Nishizawa Assignee: Bruno Faccini (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None

Attachments: Text File client_oss_log.tar.gz     Text File crash.log    
Severity: 3
Rank (Obsolete): 7585

 Description   

One of the clients could not communicate with one of the OSSs due to a "Timed out RDMA" error.

kernel: LustreError: 9948:0:(o2iblnd_cb.c:2914:kiblnd_check_txs()) Timed out tx: active_txs, 2 seconds
kernel: LustreError: 9948:0:(o2iblnd_cb.c:2977:kiblnd_check_conns()) Timed out RDMA with 172.26.8.140@o2ib (32)
kernel: LustreError: 9948:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8104e84f4000

On the server side,

kernel: LustreError: 19792:0:(o2iblnd_cb.c:2914:kiblnd_check_txs()) Timed out tx: active_txs, 3 seconds
kernel: LustreError: 19792:0:(o2iblnd_cb.c:2977:kiblnd_check_conns()) Timed out RDMA with 172.26.10.84@o2ib (18)
kernel: LustreError: 19789:0:(events.c:381:server_bulk_callback()) event type 4, status -5, desc ffff8101b15d4000
kernel: LustreError: 19788:0:(events.c:381:server_bulk_callback()) event type 4, status -5, desc ffff8103505f2b80
kernel: LustreError: 19788:0:(events.c:381:server_bulk_callback()) event type 2, status -5, desc ffff8103505f2b80
kernel: LustreError: 21321:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 0(16384)  req@ffff8101419a3400 x1427389785285422/t0 o4->bfe63770-9dc7-fabc-fd87-625dec42ca0c@NET_0x50000ac1a0a54_UUID:0/0 lens 448/416 e 0 to 0 dl 1364114285 ref 1 fl Interpret:/0/0 rc 0/0
kernel: Lustre: 21321:0:(ost_handler.c:1224:ost_brw_write()) share3-OST000e: ignoring bulk IO comm error with bfe63770-9dc7-fabc-fd87-625dec42ca0c@NET_0x50000ac1a0a54_UUID id 12345-172.26.10.84@o2ib - client will retry
kernel: Lustre: 19982:0:(ldlm_lib.c:574:target_handle_reconnect()) share3-OST000f: bfe63770-9dc7-fabc-fd87-625dec42ca0c reconnecting
kernel: Lustre: 19982:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 2 previous similar messages
kernel: Lustre: 20014:0:(ldlm_lib.c:574:target_handle_reconnect()) share3-OST000c: bfe63770-9dc7-fabc-fd87-625dec42ca0c reconnecting
kernel: Lustre: 19981:0:(ldlm_lib.c:874:target_handle_connect()) share3-OST000d: refuse reconnection from bfe63770-9dc7-fabc-fd87-625dec42ca0c@172.26.10.84@o2ib to 0xffff81040d190c00; still busy with 1 active RPCs
kernel: LustreError: 19981:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16)  req@ffff810212b87000 x1427389785285592/t0 o8->bfe63770-9dc7-fabc-fd87-625dec42ca0c@NET_0x50000ac1a0a54_UUID:0/0 lens 368/264 e 0 to 0 dl 1364114377 ref 1 fl Interpret:/0/0 rc -16/0

Ping (over IPoIB) and ibping were okay, but "lctl ping" to this OSS was failing.
"lctl ping" to the other OSSs was okay.

This issue was finally resolved by rebooting the client.

Would you please check whether we can say this is a network issue or whether something is wrong on the Lustre side?

Attached are the messages and debug logs collected from the client and the OSS.

Regards,



 Comments   
Comment by Mitsuhiro Nishizawa [ 08/Apr/13 ]

A core dump was captured when this client was rebooted.
As the core dump itself is too large, here is the output of some simple crash commands.

Comment by Bruno Faccini (Inactive) [ 08/Apr/13 ]

What network/IB topology do you use to connect these clients and OSSs? Also, did you run any pure IB troubleshooting tools to see whether any error/stats counters are incrementing on the involved HCAs/switches/boards in the fabric?
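
(As an illustration only, not from the original ticket: error counters on the local HCA and across the fabric can typically be inspected with the standard infiniband-diags tools, assuming they are installed on a node with fabric access; exact tool names and options depend on the OFED/infiniband-diags version in use.)

# Local HCA/port state and basic info
ibstat

# Performance and error counters for the local port
perfquery

# Scan the fabric and report ports whose error counters exceed thresholds
ibqueryerrors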

Comment by Bruno Faccini (Inactive) [ 08/Apr/13 ]

The attached crash-dump extracts show no hung/spinning threads that might indicate an LNET dysfunction, and the attached client+OSS dmesg/Lustre debug logs indicate flaky communication between the OSS and the client, causing multiple and recurring reconnections and recoveries since at least Sun Mar 24 03:39:09 PDT 2013, as far as I can find in the provided logs/info.

So we definitely need you to also investigate whether any problems/errors are reported by your various IB fabric elements. BTW, the OSS reports network errors with at least 2 clients (172.26.10.84@o2ib, 172.26.12.181@o2ib) during the same period of time.

Comment by Bruno Faccini (Inactive) [ 29/Apr/13 ]

Hello Mitsuhiro,
Have you been able to troubleshoot your IB fabric as I indicated in my last update?
Does the problem/situation still occur?

Comment by Mitsuhiro Nishizawa [ 30/Apr/13 ]

Bruno, thanks for your comment. The same kind of problem occurred again, and our customer has replaced the cables. Currently, they are watching to see whether it recurs. Please proceed to close this ticket at this time. Thank you.
