Lustre / LU-17258

socklnd connection type not established upon connection race

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0

    Description

      The following assertion was triggered on one of our clusters:

      socklnd_cb.c:1950:ksocknal_connect()) ASSERTION( (wanted & ((((1UL))) << (3))) != 0 ) failed:
      socklnd_cb.c:1950:ksocknal_connect()) LBUG
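
The asserted expression is the preprocessor expansion of a single-bit test: `1UL << 3` is the mask 8, so the check fails whenever bit 3 of `wanted` is clear. A minimal sketch of that test follows; `bit_is_set` and `assertion_holds` are hypothetical helpers written for illustration, not socklnd functions.

```c
#include <stdbool.h>

/* Hypothetical helper: true when bit `n` of `mask` is set. */
static bool bit_is_set(unsigned long mask, unsigned int n)
{
	return (mask & (1UL << n)) != 0;
}

/*
 * Mirrors the asserted expression from the log line,
 * (wanted & (1UL << 3)) != 0, with the redundant parentheses
 * left over from macro expansion removed.
 */
static bool assertion_holds(unsigned long wanted)
{
	return bit_is_set(wanted, 3);
}
```

For example, `assertion_holds(8UL)` is true, while `assertion_holds(4UL)` and `assertion_holds(0UL)` are false.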

      From the crash dumps, we can see that the conn_cb was in the following state:

      struct ksock_conn_cb {
      ...
      ksnr_scheduled = 1,
      ksnr_connecting = 1,
      ksnr_connected = 10,
      ksnr_deleted = 0,
      ksnr_ctrl_conn_count = 1,
      ksnr_blki_conn_count = 1,
      ksnr_blko_conn_count = 0,
      ksnr_conn_count = 2,
      ksnr_max_conns = 8,
      ksnr_busy_retry_count = 3
      }
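
The `ksnr_connected` value reads most naturally as a bitmask: 10 is binary 1010, so bits 1 and 3 are set. The sketch below decodes it under two assumptions that do not come from this ticket: that connection types are numbered control = 1, bulk-in = 2, bulk-out = 3, and that the connect path derives `wanted` by clearing the already-connected bits from the mask of typed connections. All `EX_`/`ex_` names are hypothetical.

```c
/* Assumed numbering of typed connections (not taken from the ticket). */
enum ex_conn_type {
	EX_CONN_CONTROL  = 1,
	EX_CONN_BULK_IN  = 2,
	EX_CONN_BULK_OUT = 3
};

#define EX_BIT(n) (1UL << (n))

/* Mask of every typed connection a conn_cb could still need. */
#define EX_TYPED_MASK (EX_BIT(EX_CONN_CONTROL) | \
		       EX_BIT(EX_CONN_BULK_IN) | \
		       EX_BIT(EX_CONN_BULK_OUT))

/*
 * Assumed derivation: the types still wanted are those in the typed
 * mask that are not yet marked as connected.
 */
static unsigned long ex_wanted(unsigned long connected)
{
	return EX_TYPED_MASK & ~connected;
}
```

With `ksnr_connected = 10` from the dump, `ex_wanted(10UL)` yields 4: only the assumed bulk-in bit remains and bit 3 is clear, which is consistent with the failed assertion. Under the same assumed numbering, the counters in the dump (one bulk-in connection, zero bulk-out) disagree with bit 3 being set in `ksnr_connected`.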

      The debug log shows that the two peers hit a connection race three times, which accounts for ksnr_busy_retry_count = 3 in the conn_cb.

      hornc has suggested a fix for this, which we will submit shortly.

            People

              ssmirnov Serguei Smirnov
              nangelinas Nikitas Angelinas