[LU-16124] Wrong condition in patch "LU-13782 lnet: Have LNet routers monitor the ni_fatal flag" ? Created: 29/Aug/22 Updated: 30/Aug/22 Resolved: 30/Aug/22 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sebastien Piechurski | Assignee: | Chris Horn |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
lustre 2.12.8_ddn14 |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
While browsing the LNet code, I stumbled on the function lnet_update_ni_status_locked() showing the following: if (now < net->net_last_alive + timeout) goto check_ni_fatal; spin_lock(&net->net_lock); /* re-check with lock */ if (now < net->net_last_alive + timeout) { spin_unlock(&net->net_lock); goto check_ni_fatal; } spin_unlock(&net->net_lock); [...] check_ni_fatal: list_for_each_entry(ni, &net->net_ni_list, ni_netlist) { /* lnet_ni_set_status() will perform the same check of * ni_status while holding the ni lock. We can safely * check ni_status without that lock because it is only * written to under net_lock/EX and our caller is * holding a net lock. */ if (atomic_read(&ni->ni_fatal_error_on) && ni->ni_status && ni->ni_status->ns_status != LNET_NI_STATUS_DOWN && lnet_ni_set_status(ni, LNET_NI_STATUS_DOWN)) push = true; } It seems to me that the condition (now < ni->ni_last_alive + timeout) is inverted, as it will result in setting the interface status to down if the timeout was not yet reached. This was introduced by the patch https://review.whamcloud.com/#/c/39353/ which changed the behavior without changing the condition. The code was previously this: if (now < ni->ni_last_alive + timeout) continue; lnet_ni_lock(ni); /* re-check with lock */ if (now < ni->ni_last_alive + timeout) { lnet_ni_unlock(ni); continue; } Do I get this wrong ? |
| Comments |
| Comment by Peter Jones [ 29/Aug/22 ] |
|
Chris Do you agree with Sebastien's analysis of your patch? Peter |
| Comment by Chris Horn [ 30/Aug/22 ] |
|
Thanks for the report Sebastien.
This code does not appear in the lnet_update_ni_status_locked() function. I assume you meant (now < net->net_last_alive + timeout) ? If so, your analysis would be correct if the ni_status was unconditionally set to down by the code at the check_ni_fatal label. However, we can see that ni status is only set to down by this code if atomic_read(&ni->ni_fatal_error_on) returns non-zero. Thus, the code in lnet_update_ni_status_locked() should be correct. |
| Comment by Sebastien Piechurski [ 30/Aug/22 ] |
|
Ok, thanks Chris for the answer. I guess I was confused by the use of the goto, usually used for error handling. Even though the set status to down was conditional, I was still thinking it was something meant to be executed in case the timeout was reached. Glad I was wrong then !
|