Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-16124

Wrong condition in patch "LU-13782 lnet: Have LNet routers monitor the ni_fatal flag" ?

    XMLWordPrintable

Details

    • Bug
    • Resolution: Not a Bug
    • Minor
    • None
    • None
    • None
    • lustre 2.12.8_ddn14
    • 3
    • 9223372036854775807

    Description

      While browsing the LNet code, I stumbled on the function lnet_update_ni_status_locked() showing the following:

                      if (now < net->net_last_alive + timeout)
                              goto check_ni_fatal;               
      
                      spin_lock(&net->net_lock);
                      /* re-check with lock */
                      if (now < net->net_last_alive + timeout) {
                              spin_unlock(&net->net_lock);
                              goto check_ni_fatal;
                      }
                      spin_unlock(&net->net_lock);
      [...]
      check_ni_fatal:
                      list_for_each_entry(ni, &net->net_ni_list, ni_netlist) {
                              /* lnet_ni_set_status() will perform the same check of
                               * ni_status while holding the ni lock. We can safely
                               * check ni_status without that lock because it is only
                               * written to under net_lock/EX and our caller is
                               * holding a net lock.
                               */
                              if (atomic_read(&ni->ni_fatal_error_on) &&
                                  ni->ni_status &&
                                  ni->ni_status->ns_status != LNET_NI_STATUS_DOWN &&
                                  lnet_ni_set_status(ni, LNET_NI_STATUS_DOWN))
                                      push = true;
                      } 

      It seems to me that the condition (now < ni->ni_last_alive + timeout) is inverted, as it will result in setting the interface status to down if the timeout was not yet reached.

      This was introduced by the patch https://review.whamcloud.com/#/c/39353/ which changed the behavior without changing the condition.

      The code was previously this:

                      if (now < ni->ni_last_alive + timeout)
                              continue;
                      lnet_ni_lock(ni);
                      /* re-check with lock */
                      if (now < ni->ni_last_alive + timeout) {
                              lnet_ni_unlock(ni);
                              continue;
                      }
       

      Do I get this wrong ?

      Attachments

        Activity

          People

            hornc Chris Horn
            spiechurski Sebastien Piechurski
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: