[LU-7173] ldlm_lock_destroy_internal() LBUG encountered during 2.8 large scale testing Created: 16/Sep/15 Updated: 02/Dec/22 Resolved: 01/Feb/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | James A Simmons | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Environment: |
Clients running Lustre 2.7.57 plus patches against a Lustre 2.5.4 server back end. |
||
| Issue Links: |
|
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
While running a simulated user workload our MDS crashed due to the following:

<3>[ 5913.638287] LustreError: 15340:0:(ldlm_lock.c:371:ldlm_lock_destroy_internal()) ### lock still on resource ns: mdt-atlas1-MDT0000_UUID lock: ffff883fd37b3700/0xb9db2cef5d1e4139 lrc: 3/0,0 mode: CR/CR res: [0x200252cc7:0x3:0x0].0 bits 0x8 rrc: 2 type: IBT flags: 0x50000000000000 nid: 9310@gni100 remote: 0xd3f59509a44985ef expref: 60 pid: 15340 timeout: 0 lvb_type: 3
<0>[ 5913.674996] LustreError: 15340:0:(ldlm_lock.c:372:ldlm_lock_destroy_internal()) LBUG
<4>[ 5913.683915] Pid: 15340, comm: mdt03_052
<4>[ 5913.688327]
<4>[ 5913.688327] Call Trace:
<4>[ 5913.692965] [<ffffffffa0430895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4>[ 5913.700886] [<ffffffffa0430e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4>[ 5913.707968] [<ffffffffa0707871>] ldlm_lock_destroy_internal+0x251/0x2c0 [ptlrpc]
<4>[ 5913.716573] [<ffffffffa07092b5>] ldlm_lock_destroy+0x35/0x130 [ptlrpc]
<4>[ 5913.724118] [<ffffffffa070a311>] ldlm_lock_enqueue+0x161/0x980 [ptlrpc]
<4>[ 5913.731760] [<ffffffffa0733e9b>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
<4>[ 5913.739797] [<ffffffffa0d6f1a6>] mdt_enqueue+0x46/0xe0 [mdt]
<4>[ 5913.746362] [<ffffffffa0d7401a>] mdt_handle_common+0x52a/0x1470 [mdt]
<4>[ 5913.753804] [<ffffffffa0db0615>] mds_regular_handle+0x15/0x20 [mdt]
<4>[ 5913.761064] [<ffffffffa0762f55>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
<4>[ 5913.769866] [<ffffffffa0442785>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
<4>[ 5913.777632] [<ffffffffa075b929>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
<4>[ 5913.785377] [<ffffffffa07656dd>] ptlrpc_main+0xaed/0x1930 [ptlrpc]
<4>[ 5913.792557] [<ffffffffa0764bf0>] ? ptlrpc_main+0x0/0x1930 [ptlrpc]
<4>[ 5913.799689] [<ffffffff8109e78e>] kthread+0x9e/0xc0
<4>[ 5913.805270] [<ffffffff8100c28a>] child_rip+0xa/0x20
<4>[ 5913.810950] [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
<4>[ 5913.816622] [<ffffffff8100c280>] ? child_rip+0x0/0x20 |
| Comments |
| Comment by Peter Jones [ 16/Sep/15 ] |
|
Bobijam, could you please look into this issue? Thanks, Peter |
| Comment by Andreas Dilger [ 16/Sep/15 ] |
|
Have you run this same workload with 2.5.4 clients without problems? This is a server-side bug, and it shouldn't be possible for the client to induce this kind of failure regardless of what it is doing. The only thing I can think of is that the 2.7.57 client is causing the MDS to follow some error-handling path that isn't exercised by 2.5 clients. |
| Comment by James A Simmons [ 16/Sep/15 ] |
|
Never seen this problem with the Cray 2.5 clients or the Lustre 2.7 clients we run with. This only happened with the pre-2.8 clients. |
| Comment by James A Simmons [ 16/Sep/15 ] |
|
Ugh. Your FTP server died on me, so I uploaded the vmcore files and dmesgs at http://www.infradead.org/~jsimmons/2015-09-15.tgz |
| Comment by Zhenyu Xu [ 18/Sep/15 ] |
|
git commit 5517eab06eb99e4ecb66be251a10e70c37547610 added layout lock handling, esp.

diff --git a/lustre/ldlm/ldlm_inodebits.c b/lustre/ldlm/ldlm_inodebits.c
index e10a654..b68dd59 100644
--- a/lustre/ldlm/ldlm_inodebits.c
+++ b/lustre/ldlm/ldlm_inodebits.c
@@ -190,8 +190,12 @@ int ldlm_process_inodebits_lock(struct ldlm_lock *lock, __u64 *flags,
 	LASSERT(cfs_list_empty(&res->lr_converting));
 	check_res_locked(res);

-	if (!first_enq) {
-		LASSERT(work_list != NULL);
+	/* (*flags & LDLM_FL_BLOCK_NOWAIT) is for layout lock right now. */
+	if (!first_enq || (*flags & LDLM_FL_BLOCK_NOWAIT)) {
+		*err = ELDLM_LOCK_ABORTED;
+		if (*flags & LDLM_FL_BLOCK_NOWAIT)
+			*err = ELDLM_LOCK_WOULDBLOCK;
+
 		rc = ldlm_inodebits_compat_queue(&res->lr_granted, lock, NULL);
 		if (!rc)
 			RETURN(LDLM_ITER_STOP);
@@ -201,6 +205,8 @@ int ldlm_process_inodebits_lock(struct ldlm_lock *lock, __u64 *flags,

 	ldlm_resource_unlink_lock(lock);
 	ldlm_grant_lock(lock, work_list);
+
+	*err = ELDLM_OK;
 	RETURN(LDLM_ITER_CONTINUE);
 }

and in ldlm_lock_enqueue()

	/* policies are not executed on the client or during replay */
	if ((*flags & (LDLM_FL_HAS_INTENT|LDLM_FL_REPLAY)) == LDLM_FL_HAS_INTENT &&
	    !local && ns->ns_policy) {
		rc = ns->ns_policy(ns, lockp, cookie, lock->l_req_mode, *flags,
				   NULL);
		if (rc == ELDLM_LOCK_REPLACED) {
			....
		} else if (rc != ELDLM_OK ||
			   (rc == ELDLM_OK && (*flags & LDLM_FL_INTENT_ONLY))) {
			ldlm_lock_destroy(lock);
			RETURN(rc);
		}
	}

I suspect that some intent-only lock has been granted, while ldlm_lock_destroy() takes a non-granted lock as its precondition. |
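To make that precondition concrete: ldlm_lock_destroy() expects the lock to already be off its resource's lists, and a granted lock is still linked there. Below is a tiny self-contained userspace analogue of the check that fires; the names and one-field struct are simplified stand-ins, not the actual Lustre source.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in: 'linked' models membership on the resource's
 * granted/waiting lists. */
struct sim_lock {
	int linked;
};

/* Analogue of the check in ldlm_lock_destroy_internal(): destroying a
 * lock that is still on a resource list is a fatal invariant violation. */
static void sim_lock_destroy(struct sim_lock *lk)
{
	if (lk->linked) {
		fprintf(stderr, "### lock still on resource\n");
		abort();	/* LBUG() analogue */
	}
	free(lk);
}

int main(void)
{
	struct sim_lock *lk = calloc(1, sizeof(*lk));

	lk->linked = 1;		/* the lock was granted and linked */
	sim_lock_destroy(lk);	/* aborts, matching the LBUG above */
	return 0;
}

If the policy function grants the lock and the enqueue path still reaches ldlm_lock_destroy(), this is exactly the state the LBUG guards against.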
| Comment by Zhenyu Xu [ 18/Sep/15 ] |
|
Two of the vmcore-dmesg.txt files show that the problematic locks are granted (mode: CR/CR) LAYOUT locks (lvb_type: 3 -> LVB_T_LAYOUT):

<3>[ 5913.638287] LustreError: 15340:0:(ldlm_lock.c:371:ldlm_lock_destroy_internal()) ### lock still on resource ns: mdt-atlas1-MDT0000_UUID lock: ffff883fd37b3700/0xb9db2cef5d1e4139 lrc: 3/0,0 mode: CR/CR res: [0x200252cc7:0x3:0x0].0 bits 0x8 rrc: 2 type: IBT flags: 0x50000000000000 nid: 9310@gni100 remote: 0xd3f59509a44985ef expref: 60 pid: 15340 timeout: 0 lvb_type: 3

<3>[ 969.928837] LustreError: 16016:0:(ldlm_lock.c:371:ldlm_lock_destroy_internal()) ### lock still on resource ns: mdt-atlas1-MDT0000_UUID lock: ffff881fbaae99c0/0xcc3ee5780a4e2eea lrc: 3/0,0 mode: CR/CR res: [0x20024d1ef:0x2e:0x0].0 bits 0x8 rrc: 1 type: IBT flags: 0x50000000000000 nid: 91@gni100 remote: 0xb81bd76a7d6191aa expref: 57 pid: 16016 timeout: 0 lvb_type: 3 |
| Comment by Zhenyu Xu [ 18/Sep/15 ] |
|
Hi James, is it easy to reproduce? |
| Comment by James A Simmons [ 18/Sep/15 ] |
|
Not so easy to reproduce. We saw it only during our test shot on Titan running the 2.8 clients with a 2.5.4 server back end. It only showed up when we ran our test harness. After the first couple of runs the problem stopped showing up. |
| Comment by Gerrit Updater [ 21/Sep/15 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/16497 |
| Comment by James A Simmons [ 21/Sep/15 ] |
|
The patch looks good, but it will not be until Oct 13th that we get to test it out. |
| Comment by James A Simmons [ 07/Oct/15 ] |
|
We have done a bunch of testing of this patch at various scales before our Titan test shot, and we are seeing a very large metadata performance improvement with it.

Results with patch:
Directory creation: 37714.322 11084.963 26771.564 9055.694

Results before patch:
Directory creation: 5657.386 4416.307 5086.521 376.908 |
| Comment by Andreas Dilger [ 08/Oct/15 ] |
|
Well, there is an improvement in directory creation and removal performance with the patch, but directory stat and file creation and removal performance are down, and the results are much more variable between runs. |
| Comment by Vitaly Fertman [ 13/Oct/15 ] |
|
Who said it is about intent_only? Is there a recovery? Is it DNE? |
| Comment by James A Simmons [ 13/Oct/15 ] |
|
I had a discussion about this patch with Oleg. The patch is only supposed to touch the recovery path, but we saw this LBUG during an application run. We will see if this patch resolves our problem on our Oct 27 test shot. |
| Comment by Oleg Drokin [ 15/Oct/15 ] |
|
I had a lengthy discussion with Vitaly about what appears to be the same bug they are hitting (much closer to master, though). Since there's no recovery at play here, we believe the intent_only flag might be a red herring and instead the problem is the other one, in ldlm_lock_enqueue():

	rc = ns->ns_policy(ns, lockp, cookie, lock->l_req_mode, *flags,
			   NULL);
	if (rc == ELDLM_LOCK_REPLACED) {
		...
	} else if (rc != ELDLM_OK ||
		   (rc == ELDLM_OK && (*flags & LDLM_FL_INTENT_ONLY))) {
		ldlm_lock_destroy(lock);
		RETURN(rc);
	}

This is correctly identified by Bobijam, but we believe the case here is that of rc != ELDLM_OK (which is supported by crashdump data from this other affected site); for the record, rc is -2, or -ENOENT. So imagine we have this situation: an intent request comes in (getattr in this case), it is handled, and then a reply is sent (along with a lock), but the reply is then lost. The mystery was what might cause a successful request to fail on retry, esp. with ENOENT, with a lock in place, no less. So it appears the sequence of events is roughly:

I would like to verify this theory on the ORNL crashdump, but unfortunately it does not contain the kernel vmlinux and lustre modules with symbols, so I cannot. |
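To make the suspected sequence concrete, here is a small self-contained simulation of that control flow; every name is a hypothetical stand-in mirroring the snippet above, not the real server API.

#include <stdio.h>
#include <stdlib.h>

#define ELDLM_OK 0

struct sim_lock {
	int granted;	/* granted and still linked on the resource */
};

/* Stand-in for ns->ns_policy(): the first pass grants the lock and
 * succeeds, but the reply to the client is then lost; on the resend
 * the policy fails with -ENOENT while the lock from the first pass
 * is still granted. */
static int intent_policy(struct sim_lock *lk, int is_resend)
{
	if (!is_resend) {
		lk->granted = 1;
		return ELDLM_OK;
	}
	return -2;	/* -ENOENT, as seen in the crashdump */
}

/* Mirrors the quoted branch: any non-OK policy result destroys the
 * lock, with no special handling for a granted lock on a resend. */
static void lock_enqueue(struct sim_lock *lk, int is_resend)
{
	int rc = intent_policy(lk, is_resend);

	if (rc != ELDLM_OK) {
		if (lk->granted) {
			fprintf(stderr, "LBUG: lock still on resource\n");
			exit(1);
		}
		free(lk);
	}
}

int main(void)
{
	struct sim_lock *lk = calloc(1, sizeof(*lk));

	lock_enqueue(lk, 0);	/* original request: granted, reply lost */
	lock_enqueue(lk, 1);	/* resend finds the same lock, policy fails */
	return 0;
}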
| Comment by Oleg Drokin [ 15/Oct/15 ] |
|
The idea for a fix is to additionally check whether the failed request is a resend (with a granted lock), and in that case to "soft cancel" it, i.e. to send a blocking AST. This is needed because it's possible the other reply made it to the client while we were working here, so the client knows about the lock, and if we cancel it outright there's a bit of state inconsistency. |
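A rough shape of that idea, as a sketch under assumed, hypothetical names (the actual fix is the patch in review, not this code):

#include <stdio.h>

struct srv_lock {
	int granted;
};

/* Stub: in the real server this would queue an LDLM blocking callback
 * to the client holding the lock, prompting it to cancel cleanly. */
static void send_blocking_ast(struct srv_lock *lk)
{
	printf("blocking AST sent; client cancels the lock itself\n");
}

/* Sketch of the proposed error path: "soft cancel" a granted resent
 * lock via a blocking AST; destroy outright only when the lock was
 * never granted, so no client-visible state is torn down silently. */
static int enqueue_error_path(struct srv_lock *lk, int rc, int is_resent)
{
	if (is_resent && lk->granted)
		send_blocking_ast(lk);
	else
		printf("lock destroyed (never granted, safe to free)\n");
	return rc;
}

int main(void)
{
	struct srv_lock lk = { .granted = 1 };

	enqueue_error_path(&lk, -2, 1);	/* resend case: soft cancel */
	return 0;
}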
| Comment by Oleg Drokin [ 15/Oct/15 ] |
|
So, I got the symbols information from James and the picture got a bit clouded. What I see is similar: it's a resend, there is rc -2 returned from the policy function, and there are no reply buffers created, hinting at a very early failure. Now the really strange part:

crash> p *((struct mdt_thread_info *)0xffff8840296cb000)->mti_pill
$9 = {
rc_req = 0xffff884045bd3000,
rc_fmt = 0xffffffffa080c9a0 <RQF_MDS_REINT_SETXATTR>,
rc_loc = RCL_SERVER,
rc_area = {{4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}, {4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}
}
So somehow this request got the wrong request format assigned, which is really weird. |
| Comment by Oleg Drokin [ 16/Oct/15 ] |
|
So I think it would be an interesting experiment to perform a forced resend for an IT_LAYOUT intent and see if it happens to fail every time. First with 2.5.4 on both ends (easier to organize) and then with 2.5.4 on the server and master on the client. I envision a manual test to be something like: Then set OBD_FAIL_LDLM_REPLY (0x30c, or really 0x8000030c to make it one time) as fail_loc on the MDS. If it crashes, then we know it happens on every such resend and it should be easy to figure out afterwards; if not, we'll need some more investigation. In any case I do not have time for this right now as I am departing for a conference tomorrow. James, if you have time to perform an experiment like this, that would be great as a first step, as I think Bobijam is busy with another conference in the next few days. |
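For reference on the fail_loc arithmetic in that suggestion: the 0x80000000 bit makes a fail_loc fire a single time. Below is a simplified model of that behaviour; it is illustrative only, and the real implementation in Lustre's obd_support.h differs in detail.

#include <stdio.h>

#define FAIL_ONCE       0x80000000u	/* fire once, then disarm */
#define FAIL_LDLM_REPLY 0x0000030cu	/* drop the LDLM reply */

static unsigned int fail_loc;	/* set on the MDS via lctl set_param */

/* Simplified check: match the low bits against the requested fail id;
 * if the one-shot bit is set, disarm after the first hit. */
static int fail_check(unsigned int id)
{
	if ((fail_loc & ~FAIL_ONCE) != id)
		return 0;
	if (fail_loc & FAIL_ONCE)
		fail_loc = 0;
	return 1;
}

int main(void)
{
	fail_loc = FAIL_ONCE | FAIL_LDLM_REPLY;	/* i.e. 0x8000030c */

	printf("first reply dropped:  %d\n", fail_check(FAIL_LDLM_REPLY));	/* 1 */
	printf("second reply dropped: %d\n", fail_check(FAIL_LDLM_REPLY));	/* 0 */
	return 0;
}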
| Comment by James A Simmons [ 16/Oct/15 ] |
|
I just tried your test using a 2.8 client <-> 2.8 server. No crash happened. I will work with the admins to set up a 2.8 client to test against our 2.5 server setup. I did see timeouts on the client and MDS side. |
| Comment by Vitaly Fertman [ 16/Oct/15 ] |
|
due to |
| Comment by Gerrit Updater [ 07/Dec/15 ] |
|
Vitaly Fertman (vitaly.fertman@seagate.com) uploaded a new patch: http://review.whamcloud.com/17501 |
| Comment by Gerrit Updater [ 21/Dec/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17501/ |
| Comment by Saurabh Tandan (Inactive) [ 19/Jan/16 ] |
|
Another instance found for interop: 2.7.1 Server/EL6.7 Client |
| Comment by Saurabh Tandan (Inactive) [ 19/Jan/16 ] |
|
Another instance found for interop: 2.5.5 Server/EL6.7 Client |
| Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ] |
|
Another instance found for interop: 2.5.5 Server/EL7 Client |
| Comment by Peter Jones [ 01/Feb/16 ] |
|
It looks like this fix has already landed to master for 2.8. |