Details
Bug
Resolution: Not a Bug
Minor
None
Lustre 1.8.8
None
3
7585
Description
One of the clients could not communicate with one of the OSSs because of RDMA timeout errors. The client logged:
kernel: LustreError: 9948:0:(o2iblnd_cb.c:2914:kiblnd_check_txs()) Timed out tx: active_txs, 2 seconds
kernel: LustreError: 9948:0:(o2iblnd_cb.c:2977:kiblnd_check_conns()) Timed out RDMA with 172.26.8.140@o2ib (32)
kernel: LustreError: 9948:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8104e84f4000
On the server side,
kernel: LustreError: 19792:0:(o2iblnd_cb.c:2914:kiblnd_check_txs()) Timed out tx: active_txs, 3 seconds
kernel: LustreError: 19792:0:(o2iblnd_cb.c:2977:kiblnd_check_conns()) Timed out RDMA with 172.26.10.84@o2ib (18)
kernel: LustreError: 19789:0:(events.c:381:server_bulk_callback()) event type 4, status -5, desc ffff8101b15d4000
kernel: LustreError: 19788:0:(events.c:381:server_bulk_callback()) event type 4, status -5, desc ffff8103505f2b80
kernel: LustreError: 19788:0:(events.c:381:server_bulk_callback()) event type 2, status -5, desc ffff8103505f2b80
kernel: LustreError: 21321:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 0(16384) req@ffff8101419a3400 x1427389785285422/t0 o4->bfe63770-9dc7-fabc-fd87-625dec42ca0c@NET_0x50000ac1a0a54_UUID:0/0 lens 448/416 e 0 to 0 dl 1364114285 ref 1 fl Interpret:/0/0 rc 0/0
kernel: Lustre: 21321:0:(ost_handler.c:1224:ost_brw_write()) share3-OST000e: ignoring bulk IO comm error with bfe63770-9dc7-fabc-fd87-625dec42ca0c@NET_0x50000ac1a0a54_UUID id 12345-172.26.10.84@o2ib - client will retry
kernel: Lustre: 19982:0:(ldlm_lib.c:574:target_handle_reconnect()) share3-OST000f: bfe63770-9dc7-fabc-fd87-625dec42ca0c reconnecting
kernel: Lustre: 19982:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 2 previous similar messages
kernel: Lustre: 20014:0:(ldlm_lib.c:574:target_handle_reconnect()) share3-OST000c: bfe63770-9dc7-fabc-fd87-625dec42ca0c reconnecting
kernel: Lustre: 19981:0:(ldlm_lib.c:874:target_handle_connect()) share3-OST000d: refuse reconnection from bfe63770-9dc7-fabc-fd87-625dec42ca0c@172.26.10.84@o2ib to 0xffff81040d190c00; still busy with 1 active RPCs
kernel: LustreError: 19981:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16) req@ffff810212b87000 x1427389785285592/t0 o8->bfe63770-9dc7-fabc-fd87-625dec42ca0c@NET_0x50000ac1a0a54_UUID:0/0 lens 368/264 e 0 to 0 dl 1364114377 ref 1 fl Interpret:/0/0 rc -16/0
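The negative status values in the logs above are Linux errno codes. As a reading aid, here is a minimal Python sketch that pulls them out of a log line and names them. The helper `decode_status` and the table `LINUX_ERRNO` are hypothetical names introduced for illustration, not part of Lustre, and the table only covers the three codes that appear in this ticket.

```python
import re

# Linux errno values seen in the logs above. Hard-coded (an assumption of this
# sketch) so the result does not depend on the platform's own errno table.
LINUX_ERRNO = {
    5: ("EIO", "I/O error"),
    16: ("EBUSY", "Device or resource busy"),
    103: ("ECONNABORTED", "Software caused connection abort"),
}

# Matches fields such as "status -103" or "rc -16" in a Lustre log line.
STATUS_RE = re.compile(r"\b(?:status|rc) -(\d+)")


def decode_status(line):
    """Return a (code, name, description) tuple for each negative status in line."""
    results = []
    for match in STATUS_RE.finditer(line):
        code = int(match.group(1))
        name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in table"))
        results.append((-code, name, text))
    return results


if __name__ == "__main__":
    sample = "client_bulk_callback()) event type 1, status -103, desc ffff8104e84f4000"
    for code, name, text in decode_status(sample):
        print(code, name, text)
```

Read this way, the client's bulk callback failed with ECONNABORTED (-103), the server's bulk callbacks failed with EIO (-5), and the refused reconnection returned EBUSY (-16), which matches the "still busy with 1 active RPCs" message.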
Ping over IPoIB and ibping both worked, but "lctl ping" to this OSS failed.
"lctl ping" to the other OSSs worked.
The issue was finally resolved by rebooting the client.
Could you please check whether this is a network issue or a problem on the Lustre side?
Attached are the messages and debug logs collected from the client and the OSS.
Regards,