  1. Lustre
  2. LU-3123

A client could not communicate with an OSS due to Timed out RDMA

Details

    • Bug
    • Resolution: Not a Bug
    • Minor
    • None
    • Lustre 1.8.8
    • None
    • 3
    • 7585

    Description

      One of the clients could not communicate with one of the OSSs due to a "Timed out RDMA" error.

      kernel: LustreError: 9948:0:(o2iblnd_cb.c:2914:kiblnd_check_txs()) Timed out tx: active_txs, 2 seconds
      kernel: LustreError: 9948:0:(o2iblnd_cb.c:2977:kiblnd_check_conns()) Timed out RDMA with 172.26.8.140@o2ib (32)
      kernel: LustreError: 9948:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff8104e84f4000
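
The negative status values in these messages are kernel error codes. As a minimal sketch (the helper name `decode_status`, the regular expression, and the small lookup table are illustrative assumptions, not part of Lustre), they can be mapped to their Linux errno names:

```python
import re

# Linux errno values for the codes that appear in this report.
# Hard-coded so the lookup does not depend on the local platform.
LINUX_ERRNO = {
    5: "EIO",
    16: "EBUSY",
    103: "ECONNABORTED",
}

# Assumed field formats: "status -103" or "rc -16/0".
_STATUS_RE = re.compile(r"\b(?:status|rc)\s+(-?\d+)")


def decode_status(line):
    """Return (code, errno_name) for the first status/rc field, or None."""
    match = _STATUS_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    if code == 0:
        return code, "OK"
    return code, LINUX_ERRNO.get(abs(code), "UNKNOWN")


client_line = (
    "kernel: LustreError: 9948:0:(events.c:198:client_bulk_callback()) "
    "event type 1, status -103, desc ffff8104e84f4000"
)
print(decode_status(client_line))
```

For the client-side callback line this gives `(-103, 'ECONNABORTED')`; the same lookup reads the server-side `status -5` as `EIO` and `rc -16` as `EBUSY`.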
      

      On the server side,

      kernel: LustreError: 19792:0:(o2iblnd_cb.c:2914:kiblnd_check_txs()) Timed out tx: active_txs, 3 seconds
      kernel: LustreError: 19792:0:(o2iblnd_cb.c:2977:kiblnd_check_conns()) Timed out RDMA with 172.26.10.84@o2ib (18)
      kernel: LustreError: 19789:0:(events.c:381:server_bulk_callback()) event type 4, status -5, desc ffff8101b15d4000
      kernel: LustreError: 19788:0:(events.c:381:server_bulk_callback()) event type 4, status -5, desc ffff8103505f2b80
      kernel: LustreError: 19788:0:(events.c:381:server_bulk_callback()) event type 2, status -5, desc ffff8103505f2b80
      kernel: LustreError: 21321:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 0(16384)  req@ffff8101419a3400 x1427389785285422/t0 o4->bfe63770-9dc7-fabc-fd87-625dec42ca0c@NET_0x50000ac1a0a54_UUID:0/0 lens 448/416 e 0 to 0 dl 1364114285 ref 1 fl Interpret:/0/0 rc 0/0
      kernel: Lustre: 21321:0:(ost_handler.c:1224:ost_brw_write()) share3-OST000e: ignoring bulk IO comm error with bfe63770-9dc7-fabc-fd87-625dec42ca0c@NET_0x50000ac1a0a54_UUID id 12345-172.26.10.84@o2ib - client will retry
      kernel: Lustre: 19982:0:(ldlm_lib.c:574:target_handle_reconnect()) share3-OST000f: bfe63770-9dc7-fabc-fd87-625dec42ca0c reconnecting
      kernel: Lustre: 19982:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 2 previous similar messages
      kernel: Lustre: 20014:0:(ldlm_lib.c:574:target_handle_reconnect()) share3-OST000c: bfe63770-9dc7-fabc-fd87-625dec42ca0c reconnecting
      kernel: Lustre: 19981:0:(ldlm_lib.c:874:target_handle_connect()) share3-OST000d: refuse reconnection from bfe63770-9dc7-fabc-fd87-625dec42ca0c@172.26.10.84@o2ib to 0xffff81040d190c00; still busy with 1 active RPCs
      kernel: LustreError: 19981:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16)  req@ffff810212b87000 x1427389785285592/t0 o8->bfe63770-9dc7-fabc-fd87-625dec42ca0c@NET_0x50000ac1a0a54_UUID:0/0 lens 368/264 e 0 to 0 dl 1364114377 ref 1 fl Interpret:/0/0 rc -16/0
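
The `NET_0x50000ac1a0a54_UUID` part of the peer name appears to encode the client's LNet NID. A rough sketch, assuming the hex value is a 64-bit NID with the network (LND type and network number) in the upper 32 bits and the IPv4 address in the lower 32 bits, and assuming LND type 5 corresponds to o2ib and type 2 to tcp; the function name `decode_net_uuid` is hypothetical:

```python
import re
import socket
import struct

# Assumed LND type numbers; only the two needed for illustration are listed.
LND_NAMES = {2: "tcp", 5: "o2ib"}


def decode_net_uuid(uuid):
    """Turn a 'NET_0x<hex>_UUID' string into an 'addr@net' style NID."""
    match = re.fullmatch(r"NET_0x([0-9a-fA-F]+)_UUID", uuid)
    if match is None:
        raise ValueError("unexpected UUID format: %r" % uuid)
    nid = int(match.group(1), 16)
    net = nid >> 32              # upper 32 bits: network
    addr = nid & 0xFFFFFFFF      # lower 32 bits: IPv4 address
    lnd_type = net >> 16         # upper 16 bits of the network: LND type
    net_num = net & 0xFFFF       # lower 16 bits of the network: network number
    ip = socket.inet_ntoa(struct.pack("!I", addr))
    name = LND_NAMES.get(lnd_type, "lnd%d" % lnd_type)
    suffix = str(net_num) if net_num else ""
    return "%s@%s%s" % (ip, name, suffix)


print(decode_net_uuid("NET_0x50000ac1a0a54_UUID"))
```

Under these assumptions the value decodes to `172.26.10.84@o2ib`, which matches the client NID reported in the "Timed out RDMA" and "refuse reconnection" messages.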
      

      Ping over IPoIB and ibping both succeeded, but "lctl ping" to this OSS failed.
      "lctl ping" to other OSSs was okay.

      This issue was finally resolved by rebooting the client.

      Could you please check whether this is a network issue or a problem on the Lustre side?

      Attached are the messages and debug logs collected from the client and the OSS.

      Regards,

          People

            bfaccini Bruno Faccini (Inactive)
            mnishizawa Mitsuhiro Nishizawa