[LU-3123] A client could not communicate with an OSS due to Timed out RDMA Created: 08/Apr/13 Updated: 02/Apr/14 Resolved: 02/Apr/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Mitsuhiro Nishizawa | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 7585 |
| Description |
|
One of the clients could not communicate with one of the OSSs due to a "Timed out RDMA" error.

kernel: LustreError: 9948:0:(o2iblnd_cb.c:2914:kiblnd_check_txs()) Timed out tx: active_txs, 2 seconds
kernel: LustreError: 9948:0:(o2iblnd_cb.c:2977:kiblnd_check_conns()) Timed out RDMA with 172.26.8.140@o2ib (32)
kernel: LustreError: 9948:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8104e84f4000

On the server side:

kernel: LustreError: 19792:0:(o2iblnd_cb.c:2914:kiblnd_check_txs()) Timed out tx: active_txs, 3 seconds
kernel: LustreError: 19792:0:(o2iblnd_cb.c:2977:kiblnd_check_conns()) Timed out RDMA with 172.26.10.84@o2ib (18)
kernel: LustreError: 19789:0:(events.c:381:server_bulk_callback()) event type 4, status -5, desc ffff8101b15d4000
kernel: LustreError: 19788:0:(events.c:381:server_bulk_callback()) event type 4, status -5, desc ffff8103505f2b80
kernel: LustreError: 19788:0:(events.c:381:server_bulk_callback()) event type 2, status -5, desc ffff8103505f2b80
kernel: LustreError: 21321:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 0(16384) req@ffff8101419a3400 x1427389785285422/t0 o4->bfe63770-9dc7-fabc-fd87-625dec42ca0c@NET_0x50000ac1a0a54_UUID:0/0 lens 448/416 e 0 to 0 dl 1364114285 ref 1 fl Interpret:/0/0 rc 0/0
kernel: Lustre: 21321:0:(ost_handler.c:1224:ost_brw_write()) share3-OST000e: ignoring bulk IO comm error with bfe63770-9dc7-fabc-fd87-625dec42ca0c@NET_0x50000ac1a0a54_UUID id 12345-172.26.10.84@o2ib - client will retry
kernel: Lustre: 19982:0:(ldlm_lib.c:574:target_handle_reconnect()) share3-OST000f: bfe63770-9dc7-fabc-fd87-625dec42ca0c reconnecting
kernel: Lustre: 19982:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 2 previous similar messages
kernel: Lustre: 20014:0:(ldlm_lib.c:574:target_handle_reconnect()) share3-OST000c: bfe63770-9dc7-fabc-fd87-625dec42ca0c reconnecting
kernel: Lustre: 19981:0:(ldlm_lib.c:874:target_handle_connect()) share3-OST000d: refuse reconnection from bfe63770-9dc7-fabc-fd87-625dec42ca0c@172.26.10.84@o2ib to 0xffff81040d190c00; still busy with 1 active RPCs
kernel: LustreError: 19981:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16) req@ffff810212b87000 x1427389785285592/t0 o8->bfe63770-9dc7-fabc-fd87-625dec42ca0c@NET_0x50000ac1a0a54_UUID:0/0 lens 368/264 e 0 to 0 dl 1364114377 ref 1 fl Interpret:/0/0 rc -16/0

Ping (IPoIB) and ibping were okay, but "lctl ping" to this OSS was failing. The issue was finally resolved by rebooting the client.

Would you please check whether this is a network issue or something wrong on the Lustre side? Attached are the messages and debug logs collected from the client and the OSS.

Regards, |
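When reading the log excerpts above, it can help to decode the numeric fields by hand. The sketch below is only an illustration, not part of Lustre. It assumes the hex value in `NET_0x..._UUID` is a 64-bit LNet NID whose high 32 bits hold the network (type in the upper 16 bits, instance number in the lower 16) and whose low 32 bits hold the IPv4 address. The helper names `decode_status` and `decode_net_uuid` are hypothetical, and the errno table is hard-coded with the Linux values that appear in this ticket.

```python
import ipaddress

# Linux errno values that appear in the log excerpts (hard-coded so the
# sketch gives the same answer on any platform).
LINUX_ERRNO = {5: "EIO", 16: "EBUSY", 103: "ECONNABORTED"}

# LNet network-type numbers; assumption: only two common types are listed.
LND_TYPES = {2: "tcp", 5: "o2ib"}


def decode_status(status):
    """Map a negative kernel status such as -103 to its errno name."""
    return LINUX_ERRNO.get(-status, "unknown")


def decode_net_uuid(uuid):
    """Turn 'NET_0x50000ac1a0a54_UUID' into a NID string (hypothetical helper)."""
    hex_part = uuid[len("NET_"):-len("_UUID")]
    nid = int(hex_part, 16)
    addr = nid & 0xFFFFFFFF   # low 32 bits: IPv4 address
    net = nid >> 32           # high 32 bits: network
    lnd_type = net >> 16      # network type
    net_num = net & 0xFFFF    # network instance number
    name = LND_TYPES.get(lnd_type, "lnd%d" % lnd_type)
    suffix = name if net_num == 0 else "%s%d" % (name, net_num)
    return "%s@%s" % (ipaddress.IPv4Address(addr), suffix)


if __name__ == "__main__":
    print(decode_net_uuid("NET_0x50000ac1a0a54_UUID"))
    for status in (-103, -5, -16):
        print(status, decode_status(status))
```

Under these assumptions the UUID in the OSS messages decodes to the same NID as the client named in the "Timed out RDMA" line, and the three statuses map to ECONNABORTED, EIO, and EBUSY respectively.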
| Comments |
| Comment by Mitsuhiro Nishizawa [ 08/Apr/13 ] |
|
A core dump was captured when this client was rebooted. |
| Comment by Bruno Faccini (Inactive) [ 08/Apr/13 ] |
|
What network/IB topology connects these clients and OSSs? Also, did you run any pure IB troubleshooting tools, for example to check whether any error or statistics counters are incrementing on the HCAs, switches, or boards involved in the fabric? |
| Comment by Bruno Faccini (Inactive) [ 08/Apr/13 ] |
|
The attached crash-dump extracts show no hung or spinning threads that would indicate an LNET dysfunction. The attached client and OSS dmesg and Lustre debug logs indicate flaky communication between the OSS and the client, causing multiple recurring reconnections and recoveries since at least Sun Mar 24 03:39:09 PDT 2013, as far as I can find in the provided logs. So we definitely need you to also check whether any problems or errors are reported by the different elements of your IB fabric. BTW, the OSS reports network errors with at least 2 clients (172.26.10.84@o2ib, 172.26.12.181@o2ib) during the same period. |
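One way to check how many peers a node is timing out against is to tally the "Timed out RDMA with ..." messages per NID. The following is a minimal sketch, assuming the syslog lines have the same shape as those quoted in the description; the function name `timed_out_peers` is hypothetical and the regular expression only covers o2ib NIDs.

```python
import re
from collections import Counter

# Matches the peer NID in kiblnd_check_conns() timeout messages.
RDMA_TIMEOUT = re.compile(r"Timed out RDMA with (\S+@o2ib\d*)")


def timed_out_peers(lines):
    """Count 'Timed out RDMA' messages per peer NID (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = RDMA_TIMEOUT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # The single OSS line quoted in the ticket description.
    sample = [
        "kernel: LustreError: 19792:0:(o2iblnd_cb.c:2977:kiblnd_check_conns()) "
        "Timed out RDMA with 172.26.10.84@o2ib (18)",
    ]
    for nid, count in timed_out_peers(sample).most_common():
        print(nid, count)
```

Running this over the full OSS messages file, rather than the single quoted line, would show whether the timeouts cluster on one client or are spread across several, which helps separate a single bad cable or HCA from a wider fabric problem.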
| Comment by Bruno Faccini (Inactive) [ 29/Apr/13 ] |
|
Hello Mitsuhiro, |
| Comment by Mitsuhiro Nishizawa [ 30/Apr/13 ] |
|
Bruno, thanks for your comment. The same kind of problem occurred again and our customer has replaced the cables. They are currently watching to see whether it recurs. Please proceed to close this ticket at this time. Thank you. |