[LU-8514] tgt_main.c:121:tgt_cancel_slc_locks()) ASSERTION( lock->l_client_cookie != 0 ) failed
Created: 18/Aug/16  Updated: 14/Mar/17  Resolved: 02/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug
Priority: Critical
Reporter: Christopher Morrone
Assignee: Lai Siyao
Resolution: Fixed
Votes: 0
Labels: llnl

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Running Lustre version 2.8.0_0.0.llnlpreview.33 (see repo lustre-release-fe-llnl), we hit the following assertion on one of our MDS nodes (out of 16) in our 2.8 DNE testbed:

2016-08-18 05:45:32 [47956.538091] LustreError: 17552:0:(llog_cat.c:385:llog_cat_current_log()) lquake-MDT0004-osd: next log does not exist!
2016-08-18 05:45:32 [47956.550273] LustreError: 17552:0:(update_trans.c:1006:top_trans_stop()) lquake-MDT0004-osd: write updates failed: rc = -14
2016-08-18 05:45:32 [47956.592024] LustreError: 15805:0:(tgt_main.c:121:tgt_cancel_slc_locks()) ASSERTION( lock->l_client_cookie != 0 ) failed:
2016-08-18 05:45:32 [47956.604585] LustreError: 15805:0:(tgt_main.c:121:tgt_cancel_slc_locks()) LBUG
2016-08-18 05:45:32 [47956.612773] Pid: 15805, comm: tx_commit_cb

Note that the "next log does not exist!" message was introduced by the patch from LU-7800, Change-Id: I2343023c1f3109c077c98d78d3669377d95ed42f, Patch-Set: 6.
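For readers decoding the log above: the "rc = -14" reported by top_trans_stop() follows the kernel convention of returning zero or a negative errno, and on Linux errno 14 is EFAULT. A minimal userspace sketch for decoding such values follows; describe_rc() is a hypothetical helper written for this note, not a Lustre function.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Lustre): describe a kernel-style
 * return code, which is either 0 or a negative errno value.
 */
static const char *describe_rc(int rc)
{
        assert(rc <= 0);        /* positive values are not valid here */
        return rc == 0 ? "success" : strerror(-rc);
}
```

Calling describe_rc(-14) returns the platform's message text for EFAULT.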



 Comments   
Comment by Peter Jones [ 19/Aug/16 ]

Lai

Could you please look into this one?

Thanks

Peter

Comment by Di Wang [ 19/Aug/16 ]

Lai, it looks like we should not save slc locks if the transaction fails?

static void mdt_save_remote_lock(struct mdt_thread_info *info,
                                 struct lustre_handle *h, enum ldlm_mode mode,
                                 int decref)
{
        ENTRY;

        if (lustre_handle_is_used(h)) {
                if (decref || !info->mti_has_trans ||
                    !(mode & (LCK_PW | LCK_EX))) {    /* ----> check for transaction failure here */
                        ldlm_lock_decref_and_cancel(h, mode);
                } else {
                        struct ldlm_lock *lock = ldlm_handle2lock(h);
                        struct ptlrpc_request *req = mdt_info_req(info);

                        LASSERT(req != NULL);
                        tgt_save_slc_lock(lock, req->rq_transno);
                        ldlm_lock_decref(h, mode);
                }
                h->cookie = 0ull;
        }

        EXIT;
}

Comment by Lai Siyao [ 22/Aug/16 ]

IMHO the cause is that transaction failure is not checked in the mdt handlers. Do you know why it's always ignored?

Comment by Christopher Morrone [ 22/Aug/16 ]

We hit this a couple more times. It now hits almost as soon as the MDS is brought back up. This is currently our worst blocker in testing on the DNE testbed.

Comment by Di Wang [ 22/Aug/16 ]

"IMHO the cause is that transaction failure is not checked in the mdt handlers. Do you know why it's always ignored?"

Usually, if the transaction fails, info->mti_has_trans should not be set. Unfortunately, in this case the local transaction succeeds (i.e. info->mti_has_trans is set) but the remote transaction fails, so I think you need to add another flag or something similar to check for this failure.
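The extra check suggested above can be sketched as a standalone model of the decision in mdt_save_remote_lock(). This is an illustration only and is not taken from the patch that was later merged: the trans_failed field, the model_* names, and the lock-mode bit values are all assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative lock-mode bits; NOT the real LCK_* values from Lustre. */
enum model_lock_mode {
        MODEL_LCK_EX = 1 << 0,
        MODEL_LCK_PW = 1 << 1,
        MODEL_LCK_PR = 1 << 2,
};

/* Minimal stand-in for struct mdt_thread_info. */
struct model_thread_info {
        bool has_trans;     /* mirrors info->mti_has_trans */
        bool trans_failed;  /* HYPOTHETICAL flag: set when any part of the
                             * transaction, local or remote, fails */
};

enum model_lock_action {
        MODEL_CANCEL_LOCK,  /* the ldlm_lock_decref_and_cancel() branch */
        MODEL_SAVE_LOCK,    /* the tgt_save_slc_lock() branch */
};

/*
 * Decide whether the lock would be saved or cancelled. The only change
 * from the condition quoted earlier is the added trans_failed test.
 */
static enum model_lock_action
model_save_remote_lock(const struct model_thread_info *info, int mode,
                       bool decref)
{
        assert(info != NULL);

        if (decref || !info->has_trans || info->trans_failed ||
            !(mode & (MODEL_LCK_PW | MODEL_LCK_EX)))
                return MODEL_CANCEL_LOCK;

        return MODEL_SAVE_LOCK;
}
```

With has_trans set and trans_failed also set, which is the situation described in this ticket, the model returns MODEL_CANCEL_LOCK instead of saving the lock.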

Comment by Gerrit Updater [ 23/Aug/16 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/22071
Subject: LU-8514 mdd: transaction failure should be checked
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b1d88d35ee5b156aed9a3c4ab61fedc8f1821257

Comment by Gerrit Updater [ 02/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22071/
Subject: LU-8514 mdd: transaction failure should be checked
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e1ace3751f9add26b3f01aad9c278b6bfca8f739

Comment by Peter Jones [ 02/Sep/16 ]

Landed for 2.9
