Details
-
Bug
-
Resolution: Not a Bug
-
Minor
-
None
-
None
-
None
-
lustre 2.12.8_ddn14
-
3
-
9223372036854775807
Description
While browsing the LNet code, I stumbled on the function lnet_update_ni_status_locked() showing the following:
if (now < net->net_last_alive + timeout) goto check_ni_fatal; spin_lock(&net->net_lock); /* re-check with lock */ if (now < net->net_last_alive + timeout) { spin_unlock(&net->net_lock); goto check_ni_fatal; } spin_unlock(&net->net_lock); [...] check_ni_fatal: list_for_each_entry(ni, &net->net_ni_list, ni_netlist) { /* lnet_ni_set_status() will perform the same check of * ni_status while holding the ni lock. We can safely * check ni_status without that lock because it is only * written to under net_lock/EX and our caller is * holding a net lock. */ if (atomic_read(&ni->ni_fatal_error_on) && ni->ni_status && ni->ni_status->ns_status != LNET_NI_STATUS_DOWN && lnet_ni_set_status(ni, LNET_NI_STATUS_DOWN)) push = true; }
It seems to me that the condition (now < ni->ni_last_alive + timeout) is inverted, as it will result in setting the interface status to down if the timeout was not yet reached.
This was introduced by the patch https://review.whamcloud.com/#/c/39353/ which changed the behavior without changing the condition.
The code was previously this:
if (now < ni->ni_last_alive + timeout) continue; lnet_ni_lock(ni); /* re-check with lock */ if (now < ni->ni_last_alive + timeout) { lnet_ni_unlock(ni); continue; }
Do I get this wrong ?