[LU-14981] ldlm_handle_cp_callback() RETURN() among GOTO(out, ...)s. Created: 03/Sep/21  Updated: 18/Oct/21  Resolved: 18/Oct/21

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: John Hammond Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

7fff052c930da4822c3b2a13d130da7473a20a58 (https://review.whamcloud.com/19898 LU-7791 ldlm: signal vs CP callback race) adds to ldlm_handle_cp_callback() a RETURN() among many GOTO(out, ...)s. As a result there is no wake_up(&lock->l_waitq).

        if (!ldlm_res_eq(&dlm_req->lock_desc.l_resource.lr_name,
                         &lock->l_resource->lr_name)) {
                ldlm_resource_unlink_lock(lock);
                unlock_res_and_lock(lock);
                rc = ldlm_lock_change_resource(ns, lock,
                                &dlm_req->lock_desc.l_resource.lr_name);
		if (rc < 0) {
                        LDLM_ERROR(lock, "Failed to allocate resource");
                        GOTO(out, rc);
                }
                LDLM_DEBUG(lock, "completion AST, new resource");
                lock_res_and_lock(lock);
        }

        if (ldlm_is_failed(lock)) {
		unlock_res_and_lock(lock);
                LDLM_LOCK_RELEASE(lock);
                RETURN(-EINVAL);
        }

        if (ldlm_is_destroyed(lock) ||
            ldlm_is_granted(lock)) {
                /* b=11300: the lock has already been granted */
                unlock_res_and_lock(lock);
                LDLM_DEBUG(lock, "Double grant race happened");
                GOTO(out, rc = 0);
        }

...
out:
        if (rc < 0) {
                lock_res_and_lock(lock);
		ldlm_set_failed(lock);
                unlock_res_and_lock(lock);
                wake_up(&lock->l_waitq);
        }
        LDLM_LOCK_RELEASE(lock);

        return 0;
}

Is this correct?



 Comments   
Comment by Oleg Drokin [ 03/Sep/21 ]

This appears to be correct, if you check the out path, we set the failed flag on the lock first and then wake up all others.

The RETURN call is from the failed check so it came from a racing thread that just set this failed flag and woke us up.

the only other place that sets the failed flag is in cleanup_resource() when we are cleaning up all resources anyway and it wakes the queue via the call to either completion callback or cancel.

Comment by Oleg Drokin [ 03/Sep/21 ]

Just to clarify, if there's a desire to insert a wakeup call in that exit path for whatever "just in case" or future-proofing reasons - I'd have no objections to this.

It's just after the code inspection I do not see a bug here.

Generated at Sat Feb 10 03:14:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.