Loading...

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.6.0, Lustre 2.5.4
Affects Version/s: Lustre 2.4.0
Labels:
- LB
- llnl

Severity:
3
Rank (Obsolete):
6847

Description

If a successful reply is lost for an intent lock request, MDS will not correctly recover from this situation on resend.

The cause of which seems to be the code of ldlm_handle_enqueue0() and mdt_intent_fixup_resent()

 int ldlm_handle_enqueue0(struct ldlm_namespace *ns,
                         struct ptlrpc_request *req,
                         const struct ldlm_request *dlm_req,
                         const struct ldlm_callback_suite *cbs)
{
...
        /* The lock's callback data might be set in the policy function */
        lock = ldlm_lock_create(ns, &dlm_req->lock_desc.l_resource.lr_name,
                                dlm_req->lock_desc.l_resource.lr_type,
                                dlm_req->lock_desc.l_req_mode,
                                cbs, NULL, 0);
...
        lock->l_export = class_export_lock_get(req->rq_export, lock);
        if (lock->l_export->exp_lock_hash) {
                cfs_hash_add(lock->l_export->exp_lock_hash,
                             &lock->l_remote_handle,
                             &lock->l_exp_hash);
        }
...
        err = ldlm_lock_enqueue(ns, &lock, cookie, &flags);
...
}

static void mdt_intent_fixup_resent(struct mdt_thread_info *info,
                                    struct ldlm_lock *new_lock,
                                    struct ldlm_lock **old_lock,
                                    struct mdt_lock_handle *lh)
{
        struct ptlrpc_request  *req = mdt_info_req(info);
        struct obd_export      *exp = req->rq_export;
        struct lustre_handle    remote_hdl;
        struct ldlm_request    *dlmreq;
        struct ldlm_lock       *lock;

        if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                return;

        dlmreq = req_capsule_client_get(info->mti_pill, &RMF_DLM_REQ);
        remote_hdl = dlmreq->lock_handle[0];

        lock = cfs_hash_lookup(exp->exp_lock_hash, &remote_hdl);
        if (lock) {
                if (lock != new_lock) {
...
}

On resend, ldlm_handle_enqueue0() add the new lock into hash even though there's already a granted lock with the same remote handle. mdt_intent_fixup_resent() will find the newly added lock in hash and ignore it. This will cause to an enqueue request on the newly created lock, deadlock and client eviction.

Alexey thinks that the problem has existed since we moved from the correct code:

static void fixup_handle_for_resent_req(struct ptlrpc_request *req, int offset,
                                        struct ldlm_lock *new_lock,
                                        struct ldlm_lock **old_lock,
                                        struct lustre_handle *lockh)
{
        struct obd_export *exp = req->rq_export;
        struct ldlm_request *dlmreq =
                lustre_msg_buf(req->rq_reqmsg, offset, sizeof(*dlmreq));
        struct lustre_handle remote_hdl = dlmreq->lock_handle[0];
        struct list_head *iter;

        if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT))
                return;

        spin_lock(&exp->exp_ldlm_data.led_lock);
        list_for_each(iter, &exp->exp_ldlm_data.led_held_locks) {
                struct ldlm_lock *lock;
                lock = list_entry(iter, struct ldlm_lock, l_export_chain);
                if (lock == new_lock)
                        continue; <==================== N.B.
                if (lock->l_remote_handle.cookie == remote_hdl.cookie) {
                        lockh->cookie = lock->l_handle.h_cookie;
                        LDLM_DEBUG(lock, "restoring lock cookie");
                        DEBUG_REQ(D_DLMTRACE, req,"restoring lock cookie "LPX64,
                                  lockh->cookie);
                        if (old_lock)
                                *old_lock = LDLM_LOCK_GET(lock);
                        spin_unlock(&exp->exp_ldlm_data.led_lock);
                        return;
                }
        }
...
}

Logs for this issue will follow.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

hash-bug.log
1.98 MB
18/Feb/13 4:08 PM

Issue Links

is related to

LU-5266 LBUG on Failover -ldlm_process_extent_lock()) ASSERTION( lock->l_granted_mode != lock->l_req_mode )

Resolved

LU-4584 Lock revocation process fails consistently

Resolved

LU-5314 Lustre 2.4.2 MDS hit LBUG and crash

Resolved

is related to

LU-5530 MDS thread lockup witrh patched 2.5 server

Resolved

mdt_intent_fixup_resent() cannot find the proper lock in hash

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates