[LU-7173] ldlm_lock_destroy_internal() LBUG encountered during 2.8 large scale testing Created: 16/Sep/15  Updated: 02/Dec/22  Resolved: 01/Feb/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: James A Simmons Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: patch
Environment:

Clients running Lustre 2.7.57 plus patches against a Lustre 2.5.4 server back end.


Issue Links:
Duplicate
Related
is related to LU-6334 racer test_1: ldlm_lock_destroy_inter... Resolved
is related to LU-7535 mdt_intent_layout does not care about... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While running a simulated user workload, our MDS crashed with the following:

<3>[ 5913.638287] LustreError: 15340:0:(ldlm_lock.c:371:ldlm_lock_destroy_internal()) ### lock still on resource ns: mdt-atlas1-MDT0000_UUID lock: ffff883fd37b3700/0xb9db2cef5d1e4139 lrc: 3/0,0 mode: CR/CR res: [0x200252cc7:0x3:0x0].0 bits 0x8 rrc: 2 type: IBT flags: 0x50000000000000 nid: 9310@gni100 remote: 0xd3f59509a44985ef expref: 60 pid: 15340 timeout: 0 lvb_type: 3
<0>[ 5913.674996] LustreError: 15340:0:(ldlm_lock.c:372:ldlm_lock_destroy_internal()) LBUG
<4>[ 5913.683915] Pid: 15340, comm: mdt03_052
<4>[ 5913.688327] 
<4>[ 5913.688327] Call Trace:
<4>[ 5913.692965]  [<ffffffffa0430895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4>[ 5913.700886]  [<ffffffffa0430e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4>[ 5913.707968]  [<ffffffffa0707871>] ldlm_lock_destroy_internal+0x251/0x2c0 [ptlrpc]
<4>[ 5913.716573]  [<ffffffffa07092b5>] ldlm_lock_destroy+0x35/0x130 [ptlrpc]
<4>[ 5913.724118]  [<ffffffffa070a311>] ldlm_lock_enqueue+0x161/0x980 [ptlrpc]
<4>[ 5913.731760]  [<ffffffffa0733e9b>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
<4>[ 5913.739797]  [<ffffffffa0d6f1a6>] mdt_enqueue+0x46/0xe0 [mdt]
<4>[ 5913.746362]  [<ffffffffa0d7401a>] mdt_handle_common+0x52a/0x1470 [mdt]
<4>[ 5913.753804]  [<ffffffffa0db0615>] mds_regular_handle+0x15/0x20 [mdt]
<4>[ 5913.761064]  [<ffffffffa0762f55>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
<4>[ 5913.769866]  [<ffffffffa0442785>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
<4>[ 5913.777632]  [<ffffffffa075b929>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
<4>[ 5913.785377]  [<ffffffffa07656dd>] ptlrpc_main+0xaed/0x1930 [ptlrpc]
<4>[ 5913.792557]  [<ffffffffa0764bf0>] ? ptlrpc_main+0x0/0x1930 [ptlrpc]
<4>[ 5913.799689]  [<ffffffff8109e78e>] kthread+0x9e/0xc0
<4>[ 5913.805270]  [<ffffffff8100c28a>] child_rip+0xa/0x20
<4>[ 5913.810950]  [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
<4>[ 5913.816622]  [<ffffffff8100c280>] ? child_rip+0x0/0x20
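
For reference, the LBUG is produced by a sanity check in ldlm_lock_destroy_internal() (ldlm_lock.c:371). Roughly paraphrased (a sketch, not an exact quote of any particular release), the check looks like this:

        static void ldlm_lock_destroy_internal(struct ldlm_lock *lock)
        {
                if (lock->l_readers || lock->l_writers) {
                        LDLM_ERROR(lock, "lock still has references");
                        LBUG();
                }

                /* A lock must be unlinked from its resource before it is
                 * destroyed; a granted lock still linked via l_res_link on
                 * the resource's granted list trips this check and produces
                 * the "lock still on resource" LBUG seen above. */
                if (!list_empty(&lock->l_res_link)) {
                        LDLM_ERROR(lock, "lock still on resource");
                        LBUG();
                }

                /* ... rest of the destroy path ... */
        }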


 Comments   
Comment by Peter Jones [ 16/Sep/15 ]

Bobijam

Could you please look into this issue?

Thanks

Peter

Comment by Andreas Dilger [ 16/Sep/15 ]

Have you run this similar workload with 2.5.4 clients without problem? This is a server-side bug and it shouldn't be possible for the client to induce this kind of failure regardless of what it is doing. The only thing I can think of is that the 2.7.57 client is causing the MDS to follow some error handling path that isn't exercised by 2.5 clients.

Comment by James A Simmons [ 16/Sep/15 ]

We have never seen this problem with the Cray 2.5 clients or the Lustre 2.7 clients we run with. It only happened with the pre-2.8 clients.

Comment by James A Simmons [ 16/Sep/15 ]

Ugh. Your FTP server died on me, so I uploaded the vmcore files and dmesg logs at http://www.infradead.org/~jsimmons/2015-09-15.tgz

Comment by Zhenyu Xu [ 18/Sep/15 ]

Git commit 5517eab06eb99e4ecb66be251a10e70c37547610 added layout lock handling, especially:

diff --git a/lustre/ldlm/ldlm_inodebits.c b/lustre/ldlm/ldlm_inodebits.c
index e10a654..b68dd59 100644
--- a/lustre/ldlm/ldlm_inodebits.c
+++ b/lustre/ldlm/ldlm_inodebits.c
@@ -190,8 +190,12 @@ int ldlm_process_inodebits_lock(struct ldlm_lock *lock, __u64 *flags,
         LASSERT(cfs_list_empty(&res->lr_converting));
         check_res_locked(res);
 
-        if (!first_enq) {
-                LASSERT(work_list != NULL);
+       /* (*flags & LDLM_FL_BLOCK_NOWAIT) is for layout lock right now. */
+        if (!first_enq || (*flags & LDLM_FL_BLOCK_NOWAIT)) {
+               *err = ELDLM_LOCK_ABORTED;
+               if (*flags & LDLM_FL_BLOCK_NOWAIT)
+                       *err = ELDLM_LOCK_WOULDBLOCK;
+
                 rc = ldlm_inodebits_compat_queue(&res->lr_granted, lock, NULL);
                 if (!rc)
                         RETURN(LDLM_ITER_STOP);
@@ -201,6 +205,8 @@ int ldlm_process_inodebits_lock(struct ldlm_lock *lock, __u64 *flags,
 
                 ldlm_resource_unlink_lock(lock);
                 ldlm_grant_lock(lock, work_list);
+
+               *err = ELDLM_OK;
                 RETURN(LDLM_ITER_CONTINUE);
         }

and in ldlm_lock_enqueue()

        /* policies are not executed on the client or during replay */
        if ((*flags & (LDLM_FL_HAS_INTENT|LDLM_FL_REPLAY)) == LDLM_FL_HAS_INTENT
            && !local && ns->ns_policy) {
                rc = ns->ns_policy(ns, lockp, cookie, lock->l_req_mode, *flags,
                                   NULL);
                if (rc == ELDLM_LOCK_REPLACED) {
                     ....
                } else if (rc != ELDLM_OK ||
                           (rc == ELDLM_OK && (*flags & LDLM_FL_INTENT_ONLY))) {
                        ldlm_lock_destroy(lock);
                        RETURN(rc);
                }
        }

I suspect that some intent-only lock has been granted, while ldlm_lock_destroy() takes a non-granted lock as its precondition.
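
Condensing the two excerpts above into one annotated sketch (the annotations are mine, not literal code), the suspected path is:

        rc = ns->ns_policy(ns, lockp, cookie, lock->l_req_mode, *flags, NULL);
        /* -> ldlm_process_inodebits_lock() finds no conflict, calls
         *    ldlm_grant_lock(), which links the lock onto res->lr_granted,
         *    and leaves *err (and hence rc) at ELDLM_OK. */

        if (rc != ELDLM_OK || (rc == ELDLM_OK && (*flags & LDLM_FL_INTENT_ONLY))) {
                /* rc == ELDLM_OK but LDLM_FL_INTENT_ONLY is set, so we land
                 * here with a lock that is still on the resource's granted
                 * list; ldlm_lock_destroy() -> ldlm_lock_destroy_internal()
                 * then hits the "lock still on resource" assertion. */
                ldlm_lock_destroy(lock);
                RETURN(rc);
        }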

Comment by Zhenyu Xu [ 18/Sep/15 ]

Two of the vmcore-dmesg.txt files show that the problematic locks are granted (mode: CR/CR) LAYOUT locks (lvb_type: 3 -> LVB_T_LAYOUT):

<3>[ 5913.638287] LustreError: 15340:0:(ldlm_lock.c:371:ldlm_lock_destroy_internal()) ### lock still on resource ns: mdt-atlas1-MDT0000_UUID lock: ffff883fd37b3700/0xb9db2cef5d1e4139 lrc: 3/0,0 mode: CR/CR res: [0x200252cc7:0x3:0x0].0 bits 0x8 rrc: 2 type: IBT flags: 0x50000000000000 nid: 9310@gni100 remote: 0xd3f59509a44985ef expref: 60 pid: 15340 timeout: 0 lvb_type: 3
<3>[  969.928837] LustreError: 16016:0:(ldlm_lock.c:371:ldlm_lock_destroy_internal()) ### lock still on resource ns: mdt-atlas1-MDT0000_UUID lock: ffff881fbaae99c0/0xcc3ee5780a4e2eea lrc: 3/0,0 mode: CR/CR res: [0x20024d1ef:0x2e:0x0].0 bits 0x8 rrc: 1 type: IBT flags: 0x50000000000000 nid: 91@gni100 remote: 0xb81bd76a7d6191aa expref: 57 pid: 16016 timeout: 0 lvb_type: 3

Comment by Zhenyu Xu [ 18/Sep/15 ]

Hi James,

Is it easy to reproduce?

Comment by James A Simmons [ 18/Sep/15 ]

Not so easy to reproduce. We saw it only during our test shot on Titan running the 2.8 clients against a 2.5.4 server back end. It only showed up when we ran our test harness, and after the first couple of runs the problem stopped showing up. I have never seen it at smaller scales.

Comment by Gerrit Updater [ 21/Sep/15 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/16497
Subject: LU-7173 ldlm: do not grant intent-only lock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9702522581eaf7ad309e8c91931d571b9cad0b7b

Comment by James A Simmons [ 21/Sep/15 ]

The patch looks good, but it will not be until Oct 13th that we get to test it out.

Comment by James A Simmons [ 07/Oct/15 ]

We have done a bunch of testing of this patch at various scales before our Titan test shot, and we are seeing a very large metadata performance improvement with it.

Results with the patch:

Directory creation: 37714.322 11084.963 26771.564 9055.694
Directory stat: 58904.478 26689.126 48141.812 9343.208
Directory removal: 34290.774 12088.699 22193.915 6303.121
File creation: 4937.247 43.594 4020.814 1064.732
File stat: 796.591 124.390 476.913 149.635
File read: 30091.505 19214.287 26420.152 3238.929
File removal: 4577.256 1845.895 4072.557 591.356

Results before the patch:

Directory creation: 5657.386 4416.307 5086.521 376.908
Directory stat: 74770.631 72558.713 73347.269 544.856
Directory removal: 7990.972 6751.102 7465.554 382.254
File creation: 6867.935 6227.613 6558.801 201.736
File read: 28801.818 27562.517 28219.511 412.373
File removal: 5889.433 4705.258 5566.932 314.018

Comment by Andreas Dilger [ 08/Oct/15 ]

Well, there is an improvement in directory creation and removal performance with the patch, but directory stat and file creation and removal performance are down, and the results are much more variable between runs.

Comment by Vitaly Fertman [ 13/Oct/15 ]

Who said it is about intent_only? Is there recovery involved? Is it DNE?
I have a similar dump, and there it is an intent_getattr resend: for some reason getattr created a new lock, and the previous update lock is to be destroyed due to LOCK_REPLACED.

Comment by James A Simmons [ 13/Oct/15 ]

I had a discussion about this patch with Oleg. The patch is only supposed to touch the recovery path, but we saw this LBUG during an application run. We will see if this patch resolves our problem on our Oct 27 test shot.

Comment by Oleg Drokin [ 15/Oct/15 ]

I had a lengthy discussion with Vitaly about what appears to be the same bug they are hitting (much closer to master, though).

Since there's no recovery at play here, we believe the intent_only flag might be a red herring and the problem is instead the other case:

ldlm_lock_enqueue:

                rc = ns->ns_policy(ns, lockp, cookie, lock->l_req_mode, *flags,
                                   NULL);
                if (rc == ELDLM_LOCK_REPLACED) {
...
                } else if (rc != ELDLM_OK ||
                           (rc == ELDLM_OK && (*flags & LDLM_FL_INTENT_ONLY))) {
                        ldlm_lock_destroy(lock);
                        RETURN(rc);
                }

This is correctly identified by Bobijam, but we believe the case here is that of rc != ELDLM_OK (which is supported by crashdump data from the other affected site). For the record, rc is -2, i.e. -ENOENT.
The other bit of info we gathered from there is that the request is a RESEND (but not a replay).

So imagine we have this situation: an intent request comes in (getattr in this case), it is handled, and a reply is sent (along with a lock), but the reply is then lost.
The client resends the request, and the MDT fixup code finds the old lock and substitutes it in.
Now the policy function fails and returns an rc that is not ELDLM_OK. This prompts us to call ldlm_lock_destroy() on the already granted lock and trigger this assertion.

The mystery was: what might cause a previously successful request to fail on retry, especially with ENOENT while a lock is in place, no less?
We noticed that the request at the crash point did not contain reply buffers, which means it failed very early in processing, in body_unpack to be precise. This function does a lookup and can return -2 if the object is not there.
The request is also a getattr_by_fid.

So it appears the sequence of events is roughly:
a file is opened;
the file is unlinked;
the open-handle holder does an fstat and causes a getattr_by_fid request;
the reply is lost;
the file is closed; // how this could happen while the getattr has not finished is not entirely clear - perhaps it was not fstat but some other way of getting at the file that raced with the unlink?
the getattr_by_fid is resent - the object is no longer there, and mdt_unpack_req_pack_rep->mdt_body_unpack->mdt_object_init->...->osd_object_init() returns -ENOENT.

I would like to verify this theory on the ORNL crashdump, but unfortunately it does not contain the kernel vmlinux and lustre modules with symbols, so I cannot.
Can we have that additional data please?

Comment by Oleg Drokin [ 15/Oct/15 ]

The idea for a fix is to additionally check whether the failed request is a resend (with a granted lock) and, in that case, "soft cancel" the lock - i.e. send a blocking AST.

This is needed because it is possible that the other reply made it to the client while we are working here, so the client knows about the lock, and if we cancel it outright there is a bit of state inconsistency.
If the client does not know about the lock, it will respond with an error and the lock will be canceled immediately.
In any case there is no need to wait here for the lock to be canceled, and we can then proceed with returning the error to the client.
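
A minimal sketch of that idea, with an invented helper called from the enqueue error path (illustrative only, not the patch that eventually landed):

        /* Hypothetical helper: on an enqueue error, decide between destroying
         * the lock and "soft cancelling" it with a blocking AST as described
         * above. */
        static void enqueue_error_handle_resent_lock(struct ptlrpc_request *req,
                                                     struct ldlm_lock *lock)
        {
                if ((lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT) &&
                    lock->l_granted_mode == lock->l_req_mode) {
                        struct ldlm_lock_desc desc;

                        /* The client may already hold this lock from the lost
                         * reply, so ask it to cancel via a blocking AST; if it
                         * never saw the lock it answers with an error and the
                         * lock is cancelled immediately.  Either way there is
                         * no need to wait before returning the error. */
                        ldlm_lock2desc(lock, &desc);
                        lock->l_blocking_ast(lock, &desc, lock->l_ast_data,
                                             LDLM_CB_BLOCKING);
                } else {
                        /* The lock was never granted: destroy it as before. */
                        ldlm_lock_destroy(lock);
                }
        }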

Comment by Oleg Drokin [ 15/Oct/15 ]

So, I got the symbol information from James and the picture got a bit clouded.

What I see that is similar: it's a resend, rc -2 is returned from the policy function, and no reply buffers were created, hinting at a very early failure.
The difference: the request is an ldlm layout intent.

Now the really strange part:

crash> p *(( struct mdt_thread_info *)0xffff8840296cb000)->mti_pill
$9 = {
  rc_req = 0xffff884045bd3000, 
  rc_fmt = 0xffffffffa080c9a0 <RQF_MDS_REINT_SETXATTR>, 
  rc_loc = RCL_SERVER, 
  rc_area = {{4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}, {4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295}}
}

So somehow this request got the wrong request format assigned, which is really weird.
Both crashes show this same wrong request format, so it is not pure coincidence.

Comment by Oleg Drokin [ 16/Oct/15 ]

So I think it would be an interesting experiment to force a resend of an IT_LAYOUT intent and see if it fails every time. First with 2.5.4 on both ends (easier to organize), and then with 2.5.4 on the server and master on the client.

I envision a manual test to be something like:
dd if=/dev/urandom of=/mnt/lustre/file1 bs=1024
press ^Z to pause that process
cp /etc/passwd /mnt/lustre/file2
lfs swap_layouts /mnt/lustre/file1 /mnt/lustre/file2
This will invalidate the layout lock, so the next write from the paused dd will trigger an intent lock request.

Then set OBD_FAIL_LDLM_REPLY (0x30c, or really 0x8000030c to make it one-shot) as the fail_loc on the MDS.
Unpause the dd (with fg) and make sure the fail_loc did trigger on the MDS (this fail_loc does not work properly in master, so you need to make sure it works in 2.5.4).

If it crashes, then we know it happens every time such a resend occurs and it should be easy to figure out afterwards; if not, we will need to investigate further.

In any case I do not have time for this right now as I am departing for a conference tomorrow.

James, if you have time to perform an experiment like this, that would be great as a first step, as I think Bobijam is busy with another conference in the next few days.

Comment by James A Simmons [ 16/Oct/15 ]

I just tried your test with a 2.8 client <-> 2.8 server. No crash happened. I will work with the admins to set up a 2.8 client to test against our 2.5 server setup. I did see timeouts on the client and MDS side.

Comment by Vitaly Fertman [ 16/Oct/15 ]

Due to LU-5604, OBD_FAIL_LDLM_REPLY does not work; you have to use OBD_FAIL_MDS_LDLM_REPLY_NET.

Comment by Gerrit Updater [ 07/Dec/15 ]

Vitaly Fertman (vitaly.fertman@seagate.com) uploaded a new patch: http://review.whamcloud.com/17501
Subject: LU-7173 mdt: intent vs unlink race
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c38c254ccad0e1a4b3a5eab99a6d3e09fd6f021f

Comment by Gerrit Updater [ 21/Dec/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17501/
Subject: LU-7173 mdt: intent vs unlink race
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9b520c36ddff687c51bea026d78380eb85981971

Comment by Saurabh Tandan (Inactive) [ 19/Jan/16 ]

Another instance found for interop : 2.7.1 Server/EL6.7 Client
Server: 2.7.1, b2_7_fe/34
Client: master, build# 3303, RHEL 6.7
https://testing.hpdd.intel.com/test_sets/1dbf619c-bb05-11e5-9137-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 19/Jan/16 ]

Another instance found for interop : 2.5.5 Server/EL6.7 Client
Server: 2.5.5, b2_5_fe/62
Client: master, build# 3303, RHEL 6.7
https://testing.hpdd.intel.com/test_sets/0ac277da-bb25-11e5-861c-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ]

Another instance found for interop : 2.5.5 Server/EL7 Client
Server: 2.5.5, b2_5_fe/62
Client: master, build# 3303, RHEL 7
https://testing.hpdd.intel.com/test_sets/76574e5a-bb0a-11e5-87b4-5254006e85c2

Comment by Peter Jones [ 01/Feb/16 ]

It looks like this fix has already landed to master for 2.8.
