Lustre / LU-13972

kiblnd can continue attempting to reconnect indefinitely.

    Description

      As shown in the log below, the connection attempt is rejected and immediately retried:

      00000800:00000100:6.0:1600464412.044215:0:40:0:(o2iblnd_cb.c:2906:kiblnd_rejected()) 172.16.0.44@o2ib rejected: no listener at 987
      00000800:00000100:6.0:1600464412.045753:0:40:0:(o2iblnd_cb.c:2880:kiblnd_check_reconnect()) 172.16.0.44@o2ib: reconnect (invalid service id), 12, 12, msg_size: 4096, queue_depth: 8/-1, max_frags: 256/-1
      00000800:00000100:6.0:1600464412.045755:0:40:0:(o2iblnd_cb.c:2906:kiblnd_rejected()) 172.16.0.44@o2ib rejected: no listener at 987
      00000800:00000100:6.0:1600464412.047336:0:40:0:(o2iblnd_cb.c:2880:kiblnd_check_reconnect()) 172.16.0.44@o2ib: reconnect (invalid service id), 12, 12, msg_size: 4096, queue_depth: 8/-1, max_frags: 256/-1
      

      The o2iblnd can get into a loop attempting to reconnect to a node that is not up. The loop only ends when the connection timeout kicks in.

      There are two potential solutions. One is to add a new module parameter that controls how many times to attempt a reconnect before we fail.

      Another option, which I prefer, is to use the existing retry_count o2iblnd module parameter to limit the number of connection retries.

      It's currently used for:

      retry_count
      The maximum number of times that a data transfer operation
      should be retried on the connection when an error occurs. This setting
      controls the number of times to retry send, RDMA, and atomic operations
      when timeouts occur. Applies only to RDMA_PS_TCP.

      I believe it can also be used to limit the reconnection attempts performed by the o2iblnd.
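
A minimal sketch of the preferred option, to make the idea concrete. It is not the actual o2iblnd code: `struct peer_state`, `reconnect_allowed()` and `reconnect_reset()` are hypothetical names, and the `retry_count` variable here only stands in for the module parameter (its initial value is illustrative). The assumption is that a per-peer counter is incremented each time a reject would trigger a reconnect, and cleared once a connection succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the o2iblnd retry_count module parameter (value is illustrative). */
static int retry_count = 5;

/* Hypothetical per-peer state; the real o2iblnd peer structure differs. */
struct peer_state {
	int reconnect_attempts;	/* consecutive reconnects since last success */
};

/*
 * Called when a reject would normally trigger a reconnect.
 * Returns true if another attempt is allowed, false once the
 * retry_count budget is used up and the connection should fail.
 */
static bool reconnect_allowed(struct peer_state *peer)
{
	assert(peer != NULL);

	if (peer->reconnect_attempts >= retry_count)
		return false;

	peer->reconnect_attempts++;
	return true;
}

/* Called when a connection is established, so the budget starts fresh. */
static void reconnect_reset(struct peer_state *peer)
{
	assert(peer != NULL);
	peer->reconnect_attempts = 0;
}
```

With this shape, the reject path would check `reconnect_allowed()` before scheduling another connect, and fail the pending messages instead of looping when it returns false.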

          People

            ashehata Amir Shehata (Inactive)