[LU-16124] Wrong condition in patch "LU-13782 lnet: Have LNet routers monitor the ni_fatal flag" ? Created: 29/Aug/22  Updated: 30/Aug/22  Resolved: 30/Aug/22

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Sebastien Piechurski Assignee: Chris Horn
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

lustre 2.12.8_ddn14


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While browsing the LNet code, I stumbled on the function lnet_update_ni_status_locked() showing the following:

                if (now < net->net_last_alive + timeout)
                        goto check_ni_fatal;               

                spin_lock(&net->net_lock);
                /* re-check with lock */
                if (now < net->net_last_alive + timeout) {
                        spin_unlock(&net->net_lock);
                        goto check_ni_fatal;
                }
                spin_unlock(&net->net_lock);
[...]
check_ni_fatal:
                list_for_each_entry(ni, &net->net_ni_list, ni_netlist) {
                        /* lnet_ni_set_status() will perform the same check of
                         * ni_status while holding the ni lock. We can safely
                         * check ni_status without that lock because it is only
                         * written to under net_lock/EX and our caller is
                         * holding a net lock.
                         */
                        if (atomic_read(&ni->ni_fatal_error_on) &&
                            ni->ni_status &&
                            ni->ni_status->ns_status != LNET_NI_STATUS_DOWN &&
                            lnet_ni_set_status(ni, LNET_NI_STATUS_DOWN))
                                push = true;
                } 

It seems to me that the condition (now < ni->ni_last_alive + timeout) is inverted, as it will result in setting the interface status to down if the timeout was not yet reached.

This was introduced by the patch https://review.whamcloud.com/#/c/39353/ which changed the behavior without changing the condition.

The code was previously this:

                if (now < ni->ni_last_alive + timeout)
                        continue;
                lnet_ni_lock(ni);
                /* re-check with lock */
                if (now < ni->ni_last_alive + timeout) {
                        lnet_ni_unlock(ni);
                        continue;
                }
 

Do I get this wrong ?



 Comments   
Comment by Peter Jones [ 29/Aug/22 ]

Chris

Do you agree with Sebastien's analysis of your patch?

Peter

Comment by Chris Horn [ 30/Aug/22 ]

Thanks for the report Sebastien.

It seems to me that the condition (now < ni->ni_last_alive + timeout) is inverted

This code does not appear in the lnet_update_ni_status_locked() function. I assume you meant (now < net->net_last_alive + timeout) ?

If so, your analysis would be correct if the ni_status was unconditionally set to down by the code at the check_ni_fatal label. However, we can see that ni status is only set to down by this code if atomic_read(&ni->ni_fatal_error_on) returns non-zero. Thus, the code in lnet_update_ni_status_locked() should be correct.

Comment by Sebastien Piechurski [ 30/Aug/22 ]

Ok, thanks Chris for the answer.

I guess I was confused by the use of the goto, usually used for error handling. Even though the set status to down was conditional, I was still thinking it was something meant to be executed in case the timeout was reached.

Glad I was wrong then !

 

Generated at Sat Feb 10 03:24:13 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.