[LU-8514] tgt_main.c:121:tgt_cancel_slc_locks()) ASSERTION( lock->l_client_cookie != 0 ) failed Created: 18/Aug/16 Updated: 14/Mar/17 Resolved: 02/Sep/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Christopher Morrone | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Severity: | 3 | ||||
| Description |
|
Running Lustre version 2.8.0_0.0.llnlpreview.33 (see repo lustre-release-fe-llnl), we hit the following assertion on one of our MDS nodes (out of 16) in our 2.8 DNE testbed:

2016-08-18 05:45:32 [47956.538091] LustreError: 17552:0:(llog_cat.c:385:llog_cat_current_log()) lquake-MDT0004-osd: next log does not exist!
2016-08-18 05:45:32 [47956.550273] LustreError: 17552:0:(update_trans.c:1006:top_trans_stop()) lquake-MDT0004-osd: write updates failed: rc = -14
2016-08-18 05:45:32 [47956.592024] LustreError: 15805:0:(tgt_main.c:121:tgt_cancel_slc_locks()) ASSERTION( lock->l_client_cookie != 0 ) failed:
2016-08-18 05:45:32 [47956.604585] LustreError: 15805:0:(tgt_main.c:121:tgt_cancel_slc_locks()) LBUG
2016-08-18 05:45:32 [47956.612773] Pid: 15805, comm: tx_commit_cb

Note that the "next log does not exist!" message was introduced by the patch from |
| Comments |
| Comment by Peter Jones [ 19/Aug/16 ] |
|
Lai, could you please look into this one? Thanks, Peter |
| Comment by Di Wang [ 19/Aug/16 ] |
|
Lai, it looks like we should not save slc locks if the transaction fails?

static void mdt_save_remote_lock(struct mdt_thread_info *info,
				 struct lustre_handle *h, enum ldlm_mode mode,
				 int decref)
{
	ENTRY;
	if (lustre_handle_is_used(h)) {
		/* <---- check for transaction failure here */
		if (decref || !info->mti_has_trans ||
		    !(mode & (LCK_PW | LCK_EX))) {
			ldlm_lock_decref_and_cancel(h, mode);
		} else {
			struct ldlm_lock *lock = ldlm_handle2lock(h);
			struct ptlrpc_request *req = mdt_info_req(info);

			LASSERT(req != NULL);
			tgt_save_slc_lock(lock, req->rq_transno);
			ldlm_lock_decref(h, mode);
		}
		h->cookie = 0ull;
	}
	EXIT;
}
|
| Comment by Lai Siyao [ 22/Aug/16 ] |
|
IMHO the cause is that transaction failure is not checked in the mdt handlers. Do you know why it's always ignored? |
| Comment by Christopher Morrone [ 22/Aug/16 ] |
|
We hit this a couple more times. It now hits almost as soon as the MDS is brought back up. This is currently our worst blocker in testing on the DNE testbed. |
| Comment by Di Wang [ 22/Aug/16 ] |
"IMHO the cause is that transaction failure is not checked in the mdt handlers. Do you know why it's always ignored?"

Usually, if the transaction fails, info->mti_has_trans should not be set. Unfortunately, in this case the local transaction succeeds (i.e. info->mti_has_trans is set) but the remote transaction fails, so I think you need to add another flag or something similar to check for this failure. |
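For readers following the thread, the extra check suggested above can be modeled in isolation. This is a minimal sketch under stated assumptions, not the patch that eventually landed: `struct txn_state`, the `remote_failed` flag, `remote_lock_action()`, and the `MODE_PW`/`MODE_EX` values are hypothetical stand-ins, and only `has_trans` corresponds to a field named in the ticket (`info->mti_has_trans`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the lock modes; not the real Lustre values. */
#define MODE_PW 0x1u
#define MODE_EX 0x2u

/* What the caller should do with the remote lock handle. */
enum lock_action {
	LOCK_CANCEL_NOW,        /* drop the reference and cancel immediately */
	LOCK_SAVE_UNTIL_COMMIT  /* keep the lock until the transaction commits */
};

/* Hypothetical per-request state. */
struct txn_state {
	bool has_trans;     /* models info->mti_has_trans: local transaction started */
	bool remote_failed; /* assumed extra flag: the remote update failed */
};

/*
 * Decide whether a remote lock may be saved for commit-time cancellation.
 * The lock is saved only when a transaction exists, the remote part of it
 * did not fail, and the lock mode is a write mode (PW or EX).
 */
static enum lock_action remote_lock_action(const struct txn_state *st,
					   unsigned int mode, bool decref)
{
	assert(st != NULL);

	if (decref || !st->has_trans || st->remote_failed ||
	    !(mode & (MODE_PW | MODE_EX)))
		return LOCK_CANCEL_NOW;

	return LOCK_SAVE_UNTIL_COMMIT;
}
```

With this model, a request whose local transaction succeeded but whose remote update failed (`has_trans` set and `remote_failed` set) cancels the lock immediately instead of saving it, which is the case described in this ticket.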
| Comment by Gerrit Updater [ 23/Aug/16 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/22071 |
| Comment by Gerrit Updater [ 02/Sep/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22071/ |
| Comment by Peter Jones [ 02/Sep/16 ] |
|
Landed for 2.9 |