[LU-8553] incorrect fix in LU-7558 Created: 26/Aug/16  Updated: 22/Jul/18  Resolved: 22/Jul/18

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Alexey Lyashkov Assignee: Mikhail Pershin
Resolution: Not a Bug Votes: 0
Labels: ptlrpc

Severity: 3
Rank (Obsolete): 9223372036854775807

 Comments   
Comment by Alexey Lyashkov [ 26/Aug/16 ]

Grr.. LU-7558 tries to fix a race between two connection interpret functions. It assumes that a new connection attempt is started while the first connect interpret is still executing. But solving that race does not need a new flag; it only needs a refined conditional in the ptlrpc_connect_import function. Currently that function blocks a connect attempt under several conditions:

        if (imp->imp_state == LUSTRE_IMP_CLOSED) {
                spin_unlock(&imp->imp_lock);
                CERROR("can't connect to a closed import\n");
                RETURN(-EINVAL);
        } else if (imp->imp_state == LUSTRE_IMP_FULL) {
                spin_unlock(&imp->imp_lock);
                CERROR("already connected\n");
                RETURN(0);
        } else if (imp->imp_state == LUSTRE_IMP_CONNECTING) {
                spin_unlock(&imp->imp_lock);
                CERROR("already connecting\n");
                RETURN(-EALREADY);
        }

This leaves open a race in which a new connect is sent while the import has switched to recovery. Changing the last conditional to something like imp_state != LUSTRE_IMP_DISCON => return -EALREADY solves the same bug without introducing a new flag.

But that patch doesn't solve the second problem anyway. The second problem was introduced when many ptlrpcd threads were added as part of the SMP improvement work. The first connect interpret is scheduled to ptlrpcd thread X, but that thread may be blocked for some time while it processes other interprets, such as IO processing, so we can get into a situation where the client sends a second connect while the first is still in flight. It should be easy to hit by reconnecting with the lctl tool, which sends a parallel connect request, or through a recovery vs pinger race: the first connect is sent when a replay request fails, and the second is sent from the pinger. The first thread may be blocked on canceling unused locks in that case.
I think checking imp_conn_cnt in the connect interpret (in the same way the server side does) will solve this problem, because it does not allow a stale connect interpret to execute.

Comment by Mikhail Pershin [ 22/Jul/18 ]

Closing this issue as outdated. Alexey, feel free to reopen it if you think the problem still exists.
