Loading...

XML

Word

Printable

Details

Type: Bug
Resolution: Not a Bug
Priority: Minor
Fix Version/s: None
Affects Version/s: None
Labels:
None
Environment:
lustre 2.12.8_ddn14

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

While browsing the LNet code, I stumbled on the function lnet_update_ni_status_locked() showing the following:

                if (now < net->net_last_alive + timeout)
                        goto check_ni_fatal;               

                spin_lock(&net->net_lock);
                /* re-check with lock */
                if (now < net->net_last_alive + timeout) {
                        spin_unlock(&net->net_lock);
                        goto check_ni_fatal;
                }
                spin_unlock(&net->net_lock);
[...]
check_ni_fatal:
                list_for_each_entry(ni, &net->net_ni_list, ni_netlist) {
                        /* lnet_ni_set_status() will perform the same check of
                         * ni_status while holding the ni lock. We can safely
                         * check ni_status without that lock because it is only
                         * written to under net_lock/EX and our caller is
                         * holding a net lock.
                         */
                        if (atomic_read(&ni->ni_fatal_error_on) &&
                            ni->ni_status &&
                            ni->ni_status->ns_status != LNET_NI_STATUS_DOWN &&
                            lnet_ni_set_status(ni, LNET_NI_STATUS_DOWN))
                                push = true;
                }

It seems to me that the condition (now < ni->ni_last_alive + timeout) is inverted, as it will result in setting the interface status to down if the timeout was not yet reached.

This was introduced by the patch https://review.whamcloud.com/#/c/39353/ which changed the behavior without changing the condition.

The code was previously this:

                if (now < ni->ni_last_alive + timeout)
                        continue;
                lnet_ni_lock(ni);
                /* re-check with lock */
                if (now < ni->ni_last_alive + timeout) {
                        lnet_ni_unlock(ni);
                        continue;
                }

Do I get this wrong ?

Attachments

Activity

People

Assignee:: Chris Horn

Reporter:: Sebastien Piechurski

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 29/Aug/22 4:14 PM

Updated:: 30/Aug/22 5:32 PM

Resolved:: 30/Aug/22 5:32 PM