[LU-4179] LBUG ASSERTION( !lustre_handle_is_used(&lhc->mlh_reg_lh) ) failed: Created: 29/Oct/13  Updated: 31/Dec/13  Resolved: 03/Dec/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.4.1
Fix Version/s: Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: server
Environment:

our source is at git://github.com/jlan/lustre-nas.git


Attachments: File nasa-mtd_open-v1.patch    
Issue Links:
Related
is related to LU-3607 Interop 2.3.0<->2.5 failure on test s... Resolved
Severity: 3
Epic: server
Rank (Obsolete): 11313

 Description   

We have the crash dumps, but they require analysis by US personnel only.

LustreError: 6425:0:(mdt_open.c:1690:mdt_reint_open()) ASSERTION( !lustre_handle_is_used(&lhc->mlh_reg_lh) ) failed:
LustreError: 6425:0:(mdt_open.c:1690:mdt_reint_open()) LBUG
<4>Pid: 6425, comm: mdt01_002
<4>
<4>Call Trace:
<4> [<ffffffffa041f895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa041fe97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0ca7553>] mdt_reint_open+0x1973/0x20c0 [mdt]
<4> [<ffffffffa0ca832c>] mdt_reconstruct_open+0x68c/0xc30 [mdt]
<4> [<ffffffffa072d6a6>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
<4> [<ffffffffa07061ae>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
<4> [<ffffffffa0c9b195>] mdt_reconstruct+0x45/0x120 [mdt]
<4> [<ffffffffa0c76cfb>] mdt_reint_internal+0x6bb/0x780 [mdt]
<4> [<ffffffffa0c7708d>] mdt_intent_reint+0x1ed/0x520 [mdt]
<4> [<ffffffffa0c74f3e>] mdt_intent_policy+0x39e/0x720 [mdt]
<4> [<ffffffffa06bd7e1>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
<4> [<ffffffffa06e424f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
<4> [<ffffffffa0c753c6>] mdt_enqueue+0x46/0xe0 [mdt]
<4> [<ffffffffa0c7bab7>] mdt_handle_common+0x647/0x16d0 [mdt]
<4> [<ffffffffa0706c0c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
<4> [<ffffffffa0cb5295>] mds_regular_handle+0x15/0x20 [mdt]
<4> [<ffffffffa0716428>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
<4> [<ffffffffa04205de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
<4> [<ffffffffa0431dbf>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
<4> [<ffffffffa070d789>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
<4> [<ffffffff810557f3>] ? __wake_up+0x53/0x70
<4> [<ffffffffa07177be>] ptlrpc_main+0xace/0x1700 [ptlrpc]
<4> [<ffffffffa0716cf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
<4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
<4> [<ffffffffa0716cf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
<4> [<ffffffffa0716cf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 6425, comm: mdt01_002 Tainted: G --------------- T 2.6.32-358.6.2.el6.20130607.x86_64.lustre240 #1
<4>Call Trace:
<4> [<ffffffff8153e8da>] ? panic+0xa7/0x190
<4> [<ffffffffa041feeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0ca7553>] ? mdt_reint_open+0x1973/0x20c0 [mdt]
<4> [<ffffffffa0ca832c>] ? mdt_reconstruct_open+0x68c/0xc30 [mdt]
<4> [<ffffffffa072d6a6>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
<4> [<ffffffffa07061ae>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
<4> [<ffffffffa0c9b195>] ? mdt_reconstruct+0x45/0x120 [mdt]
<4> [<ffffffffa0c76cfb>] ? mdt_reint_internal+0x6bb/0x780 [mdt]
<4> [<ffffffffa0c7708d>] ? mdt_intent_reint+0x1ed/0x520 [mdt]
<4> [<ffffffffa0c74f3e>] ? mdt_intent_policy+0x39e/0x720 [mdt]
<4> [<ffffffffa06bd7e1>] ? ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
<4> [<ffffffffa06e424f>] ? ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
<4> [<ffffffffa0c753c6>] ? mdt_enqueue+0x46/0xe0 [mdt]
<4> [<ffffffffa0c7bab7>] ? mdt_handle_common+0x647/0x16d0 [mdt]
<4> [<ffffffffa0706c0c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
<4> [<ffffffffa0cb5295>] ? mds_regular_handle+0x15/0x20 [mdt]
<4> [<ffffffffa0716428>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
<4> [<ffffffffa04205de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
<4> [<ffffffffa0431dbf>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
<4> [<ffffffffa070d789>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
<4> [<ffffffff810557f3>] ? __wake_up+0x53/0x70
<4> [<ffffffffa07177be>] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
<4> [<ffffffffa0716cf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
<4> [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
<4> [<ffffffffa0716cf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
<4> [<ffffffffa0716cf0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20



 Comments   
Comment by Mahmoud Hanafi [ 29/Oct/13 ]

Correction: Affected version should be 2.4, not 2.1.5.

Comment by Peter Jones [ 29/Oct/13 ]

Keith is a US citizen and so should be ok to look over the crash dump.

Comment by Keith Mannthey (Inactive) [ 29/Oct/13 ]

I have sent email asking for the steps to access the crashdump.

Comment by Keith Mannthey (Inactive) [ 29/Oct/13 ]

Can you confirm both Lustre Server and Client Version?

Comment by Mahmoud Hanafi [ 29/Oct/13 ]

Clients are 2.1.5, with a few at 2.3.
Servers are 2.4.0-3nas.

See git://github.com/jlan/lustre-nas.git for our source tree.

I will try to get you the crash dumps.

Comment by Jay Lan (Inactive) [ 30/Oct/13 ]

If you are looking for the source on GitHub,
look for the nas-2.1.5 branch for the 2.1.5 code and the nas-2.4.0-1 branch for the 2.4.0-3nasS code.

Comment by Keith Mannthey (Inactive) [ 31/Oct/13 ]

Is DNE being used?

How often are you hitting this issue? Can you say when it started?

There is no easy answer from the crashdump and it is in a complicated area of the code. I have unwound the stack and have the struct mdt_thread_info *info being passed into mdt_reint_open. I am going to see if I can find the mdt_lock_handle tomorrow and continue to dig into this issue.

Comment by Keith Mannthey (Inactive) [ 31/Oct/13 ]

I was able to extract a good Lustre Debug log from the crash. The system was busy with service.c:1079:ptlrpc_update_export_timer() during the time of the error. I don't know if this is important or not yet.

Comment by Mahmoud Hanafi [ 31/Oct/13 ]

We are not using DNE.

We are hitting this bug at least once a day, so it is number one on our priority list to be fixed.

Comment by Keith Mannthey (Inactive) [ 01/Nov/13 ]

I have attached a possible fix from Di Wang. Please test it and report back.

If you get an LASSERT with this patch applied please send a fresh crashdump.

Comment by Keith Mannthey (Inactive) [ 01/Nov/13 ]

A patch against master can be tracked here: http://review.whamcloud.com/8142

Comment by Jay Lan (Inactive) [ 01/Nov/13 ]

I applied the patch against 2.4.0 and built. Hit build errors:

/usr/src/redhat/BUILD/lustre-2.4.0/lustre/mdt/mdt_open.c: In function 'mdt_reint_open':
/usr/src/redhat/BUILD/lustre-2.4.0/lustre/mdt/mdt_open.c:1741: error: invalid storage class for function 'mdt_mfd_closed'
cc1: warnings being treated as errors
/usr/src/redhat/BUILD/lustre-2.4.0/lustre/mdt/mdt_open.c:1740: error: ISO C90 forbids mixed declarations and code
/usr/src/redhat/BUILD/lustre-2.4.0/lustre/mdt/mdt_open.c:2004: error: expected declaration or statement at end of input
make[8]: *** [/usr/src/redhat/BUILD/lustre-2.4.0/lustre/mdt/mdt_open.o] Error 1
make[7]: *** [/usr/src/redhat/BUILD/lustre-2.4.0/lustre/mdt] Error 2
make[7]: *** Waiting for unfinished jobs....
make[6]: *** [/usr/src/redhat/BUILD/lustre-2.4.0/lustre] Error 2
make[5]: *** [_module_/usr/src/redhat/BUILD/lustre-2.4.0] Error 2

Comment by Keith Mannthey (Inactive) [ 01/Nov/13 ]

Sorry Jay, let me work with this patch some more. We see this with the master build as well.

Comment by Di Wang [ 01/Nov/13 ]

Hmm, the patch posted on http://review.whamcloud.com/8142 is missing a "}":

diff --git a/lustre/mdt/mdt_open.c b/lustre/mdt/mdt_open.c
index b4057a0..570d07c 100644
--- a/lustre/mdt/mdt_open.c
+++ b/lustre/mdt/mdt_open.c
@@ -1841,17 +1841,18 @@ int mdt_reint_open(struct mdt_thread_info *info, struct mdt_lock_handle *lhc)
                }
         }
 
-        LASSERT(!lustre_handle_is_used(&lhc->mlh_reg_lh));
-
-       /* get openlock if this is not replay and if a client requested it */
-       if (!req_is_replay(req)) {
-               rc = mdt_object_open_lock(info, child, lhc, &ibits);
-               if (rc != 0)
-                       GOTO(out_child_unlock, result = rc);
-               else if (create_flags & MDS_OPEN_LOCK)
-                       mdt_set_disposition(info, ldlm_rep, DISP_OPEN_LOCK);
+       if (lustre_handle_is_used(&lhc->mlh_reg_lh)) {
+               LASSERT((lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT));
+       } else {
+               /* get openlock if this is not replay and if a client requested it */
+               if (!req_is_replay(req)) {
+                       rc = mdt_object_open_lock(info, child, lhc, &ibits);
+                       if (rc != 0)
+                               GOTO(out_child_unlock, result = rc);
+                       else if (create_flags & MDS_OPEN_LOCK)
+                               mdt_set_disposition(info, ldlm_rep, DISP_OPEN_LOCK);
+               }
        }
-
        /* Try to open it now. */
        rc = mdt_finish_open(info, parent, child, create_flags,
                             created, ldlm_rep);

But please hold off on trying it for a bit, since I need to revisit this code. Thanks!

Comment by Di Wang [ 02/Nov/13 ]

I just updated the patch http://review.whamcloud.com/#/c/8142/ , please try again. Thanks.

Comment by Jay Lan (Inactive) [ 02/Nov/13 ]

Hi Di,

The new patch cannot be applied to 2.4.0.

Your new patch contains this line:
GOTO(out_child_unlock, result = rc);
but my code does not have an out_child_unlock label.
There is an "out_child" label in the routine, but I cannot tell what other changes need to be made.

Could you write a patch that is applicable on top of 2.4.0?

Thanks!

Comment by Di Wang [ 02/Nov/13 ]

http://review.whamcloud.com/8145 is the one for b2_4, and the patch should apply to 2.4.0 as well; please try it.

Comment by Peter Jones [ 02/Nov/13 ]

Jay

Just to be clear - please only use the patch in a test environment until we have completed our validation

Thanks

Peter

Comment by Mahmoud Hanafi [ 04/Nov/13 ]

What is the risk with this patch? We can't fully test it in a test environment because we haven't isolated the trigger. We would need to deploy it on production to fully test it.

Comment by Di Wang [ 04/Nov/13 ]

IMHO, the patch itself is fine, but we cannot say it is "safe" until it goes through all of our validation processes, i.e. review, passing internal tests, and landing, as Peter mentioned.

Comment by Di Wang [ 05/Nov/13 ]

http://review.whamcloud.com/8145 has the wrong tag ("based on master"); that patch is actually based on b2_4, so I resubmitted a new one, http://review.whamcloud.com/8173. Please track that one.

Comment by Jay Lan (Inactive) [ 05/Nov/13 ]

Hmm, I had no problem patching and building based on
http://review.whamcloud.com/8145 in my b2_4 build environment.

I compared the one in 8145 with the one in 8173; the patches seem identical to me?

Comment by Di Wang [ 05/Nov/13 ]

Yes, they are the same; it was just a tag issue. If you pull out the patch and build it yourself, you can ignore this.

Comment by Jian Yu [ 26/Nov/13 ]

Lustre client: http://build.whamcloud.com/job/lustre-b2_4/57/
Lustre server: http://build.whamcloud.com/job/lustre-b2_3/41/ (2.3.0)

The racer test also hit this LBUG:
https://maloo.whamcloud.com/test_sets/d1007e84-54b0-11e3-9029-52540035b04c

Comment by Mahmoud Hanafi [ 03/Dec/13 ]

Patch applied and we haven't seen this bug since.
Note: NASA BUILD 2.4.0.4-1nasS

Please close

Comment by Peter Jones [ 03/Dec/13 ]

Great - thanks Mahmoud!

Generated at Sat Feb 10 01:40:22 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.