[LU-2943] LBUG mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) ) Created: 11/Mar/13  Updated: 20/Nov/13  Resolved: 20/Nov/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Diego Moreno (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 2
Labels: mn1

Attachments: File lascaux2420.console.log.mdsissue     Text File trace_debug_mdt_reconstruct_open_assertion.txt    
Issue Links:
Related
is related to LU-3987 LBUG ASSERTION( (!(rc < 0) || (lustr... Resolved
Severity: 3
Rank (Obsolete): 7064

 Description   

This issue has already been hit on lustre 2.2 (see LU-1702). Traces are exactly the same as for LU-1702.

It's been hit four consecutive times so it seems quite easy to reproduce.

2013-03-06 16:05:01 LustreError: 31751:0:(mdt_open.c:1023:mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) ) failed:
2013-03-06 16:05:01 LustreError: 31751:0:(mdt_open.c:1023:mdt_reconstruct_open()) LBUG
2013-03-06 16:05:01 Pid: 31751, comm: mdt_145
2013-03-06 16:05:01
2013-03-06 16:05:01 Call Trace:
2013-03-06 16:05:01 [<ffffffffa04a27f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2013-03-06 16:05:01 [<ffffffffa04a2e07>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-03-06 16:05:01 [<ffffffffa0d9ed87>] mdt_reconstruct_open+0x7c7/0xa80 [mdt]
2013-03-06 16:05:01 [<ffffffffa0d908c5>] mdt_reconstruct+0x45/0x120 [mdt]
2013-03-06 16:05:01 [<ffffffffa0d7d099>] mdt_reint_internal+0x709/0x8e0 [mdt]
2013-03-06 16:05:01 [<ffffffffa0d7d53d>] mdt_intent_reint+0x1ed/0x500 [mdt]
2013-03-06 16:05:01 [<ffffffffa0d7bc09>] mdt_intent_policy+0x379/0x690 [mdt]
2013-03-06 16:05:01 [<ffffffffa06ca3c1>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
2013-03-06 16:05:01 [<ffffffffa06f03dd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
2013-03-06 16:05:01 [<ffffffffa0d7c586>] mdt_enqueue+0x46/0x130 [mdt]
2013-03-06 16:05:01 [<ffffffffa0d71762>] mdt_handle_common+0x932/0x1750 [mdt]
2013-03-06 16:05:01 [<ffffffffa0d72655>] mdt_regular_handle+0x15/0x20 [mdt]
2013-03-06 16:05:01 [<ffffffffa071f4f6>] ptlrpc_main+0xd16/0x1a80 [ptlrpc]
2013-03-06 16:05:01 [<ffffffff810017cc>] ? __switch_to+0x1ac/0x320
2013-03-06 16:05:01 [<ffffffffa071e7e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
2013-03-06 16:05:01 [<ffffffff8100412a>] child_rip+0xa/0x20
2013-03-06 16:05:01 [<ffffffffa071e7e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
2013-03-06 16:05:01 [<ffffffffa071e7e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
2013-03-06 16:05:01 [<ffffffff81004120>] ? child_rip+0x0/0x20

In the crash, the file that triggered the LBUG is a file created by MPI-IO.

The onsite support team made the following analysis:

The return status (rc) is -EREMOTE (-66) and it seems the disposition mask was DISP_IT_EXECD | DISP_LOOKUP_EXECD | DISP_LOOKUP_POS | DISP_OPEN_OPEN | DISP_OPEN_LOCK. Based on this information, it is possible that, just prior to the LBUG, the MDS ran mdt_reint_open() and got -EREMOTE in return.

So mdt_reint_open() would return -EREMOTE, and mdt_reconstruct_open() does not take into account that, in the case of an -EREMOTE return, no msg transno is set ...

The attached file contains the struct mdt_thread_info data that triggered the LBUG, as well as the req data (struct ptlrpc_request) and the lcd data (struct lsd_client_data).
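
To make the failed invariant explicit, here is a minimal userspace sketch of the condition the ASSERTION encodes. This is illustration only, not the actual mdt_open.c code; the function name is invented and standard assert() stands in for LASSERT().

/* Sketch of the invariant behind the LBUG -- illustration only. */
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* mdt_reconstruct_open() effectively assumes that a saved negative result
 * implies that no transaction number is carried in the rebuilt reply. */
static void check_reconstruct_invariant(int rc, uint64_t reply_transno)
{
        /* same condition as the failed ASSERTION in the ticket title */
        assert(!(rc < 0) || reply_transno == 0);
}

int main(void)
{
        check_reconstruct_invariant(0, 12345);      /* success + transno: holds */
        check_reconstruct_invariant(-ENOENT, 0);    /* failure, no transno: holds */
        /* The reported crash corresponds to this combination: rc is the
         * negative errno -EREMOTE (-66) while the reply still carries a
         * non-zero transno, so the assertion fires. */
        check_reconstruct_invariant(-EREMOTE, 12346);
        return 0;
}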



 Comments   
Comment by Bruno Faccini (Inactive) [ 11/Mar/13 ]

Hello Diego,
Thanks for submitting this issue along with these details and info.
Will get back to you after reviewing all of this.
Best Regards.
Bruno.

Comment by Oleg Drokin [ 12/Mar/13 ]

I think this might be somewhat related to LU-2275, which I fixed for 2.3; perhaps you should try the patches from there.

Additionally, how can there be EREMOTE in 2.1.4? It's not like we really have DNE there.

If it's easy to reproduce, I wonder what's your reproducer method?

Comment by Alexandre Louvet [ 12/Mar/13 ]

> If it's easy to reproduce, I wonder what's your reproducer method?

Right now we don't know. We got it several times in a few days but we were not able to identify the root cause (thousands of nodes and hundreds of jobs running ...). Since then we moved back to 2.1.3, where we don't have such a problem. We will continue to investigate and try to reproduce it on a test cluster.

Comment by Bruno Faccini (Inactive) [ 12/Mar/13 ]

Yes, I agree EREMOTE with 2.1.4 is strange! Diego, can you also provide, if any, the list of patches that may have been added on top of 2.1.4?
Thanks.

Comment by Aurelien Degremont (Inactive) [ 12/Mar/13 ]

> If it's easy to reproduce, I wonder what's your reproducer method?

In fact, it is not so easy to reproduce.
The problem is, as long as you do not reboot the clients involved in this issue, you can restart the MDT as many times as you want and you will hit this crash for sure. The MDT was restarted 4 times within a 1-hour timeframe and we hit this bug very quickly after each restart.

I'm not sure simply retrying would be enough to hit this bug again. Not sure what the client was really doing to trigger this...

Comment by Oleg Drokin [ 12/Mar/13 ]

A quick search for EREMOTE in b2_1 returns LU-1397; there's also a debug patch at http://review.whamcloud.com/#change,2793, so please take a look there too.

Comment by Aurelien Degremont (Inactive) [ 12/Mar/13 ]

> Yes, I agree EREMOTE with 2.1.4 is strange!

Yes, I had the same reaction, but the code is there, and has been for a long time according to git blame ("Openlock cache forward port", 2008-08).
If you look at mdt_reint_open():1427:

/* get openlock if this is not replay and if a client requested it */
if (!req_is_replay(req) && create_flags & MDS_OPEN_LOCK) {
        ldlm_mode_t lm;

        if (create_flags & FMODE_WRITE)
                lm = LCK_CW;
        else if (create_flags & MDS_FMODE_EXEC)
                lm = LCK_PR;
        else
                lm = LCK_CR;
        mdt_lock_handle_init(lhc);
        mdt_lock_reg_init(lhc, lm);
        rc = mdt_object_lock(info, child, lhc,
                             MDS_INODELOCK_LOOKUP | MDS_INODELOCK_OPEN,
                             MDT_CROSS_LOCK);
        if (rc) {
                result = rc;
                GOTO(out_child, result);
        } else {
                result = -EREMOTE;
                mdt_set_disposition(info, ldlm_rep, DISP_OPEN_LOCK);
        }
}

You can see that, under some conditions, result can be set to -EREMOTE and returned to the caller by mdt_reint_open(), with absolutely no link to DNE.
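
To make the interplay concrete, here is a minimal, hypothetical sketch (not the actual Lustre code and not the LU-2927 fix) of the idea that -EREMOTE, which the excerpt above produces when the open lock is successfully granted, should not be treated like an ordinary open failure; the helper name is invented.

/* Hypothetical helper -- NOT the actual LU-2927/LU-2943 patch. */
#include <errno.h>
#include <stdbool.h>

/* In the excerpt above, result is set to -EREMOTE precisely when the open
 * lock was granted (DISP_OPEN_LOCK set), so a negative value alone does not
 * mean the open failed. */
static bool open_result_is_real_error(int rc)
{
        return rc < 0 && rc != -EREMOTE;
}

Any real fix of course has to be made in the actual transno/reconstruct handling; see the LU-2927 back-port discussed later in this ticket.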

Comment by Patrick Valentin (Inactive) [ 12/Mar/13 ]

Hello Bruno,
here is the list of patches that were applied on top of 2.1.4.

Below is the first set of patches, present on top of both 2.1.3 and 2.1.4. As said above, when the customer moved back to 2.1.3, the problem no longer appeared.

ORNL-22 general ptlrpcd threads pool support
        From branch b2_1 (id: 71350744808a2791d6b623bfb24623052322380d)

LU-1144 ptlrpc: implement a NUMA aware ptlrpcd binding policy
        This patch is a backport on lustre 2.1 of the master branch patch.

LU-1110 fid: add full support for open-by-fid
        This patch is a backport on lustre 2.1 of the master branch patch.

LU-645  Avoid unnecessary dentry rehashing
        This patch is a backport on lustre 2.1 of the b1_8 branch patch.

LU-1331 changelog: allow changelog to extend record
        This patch is a backport on lustre 2.1 of the master branch patch.

LU-1448 llite: Prevent NULL pointer dereference on disabled OSC
        This patch is a backport on lustre 2.1 of the master branch patch.

LU-1714 lnet: Properly initialize sg_magic value
        This patch is a backport on lustre 2.1 of the master branch patch.

And here is the second set of patches, which is only on top of 2.1.4:

LU-1887 ptlrpc: grant shrink rpc format is special
        From branch b2_1 (id: 1de6014a19aae85ad92fc00265f9aeb86fb7f0cb)

LU-2613 mdt: update disk for fake transactions
        This patch is coming from "review.whamcloud.com/#change,5143"
        patch set 2, which is still in "Review in Progress" status.

LU-2624 ptlrpc: improve stop of ptlrpcd threads
        This patch is a backport on lustre 2.1 of the master branch patch.

LU-2683 lov: release all locks in closure to release sublock
        This patch is coming from "review.whamcloud.com/#change,5208" patch
        set 2, which was in "Review in Progress" status.
        It is now in master branch since 2013-03-04.

LU-1666 obdclass: reduce lock contention on coh_page_guard
        From branch b2_1 (id: 3d63043afdbf9842ce763bcff1efa30472ec3881)

LU-744  obdclass: revise cl_page refcount
        From branch b2_1 (id: 17f83b93481932e3476b076651ab60e1fbd15136)

Note: I also found ticket LU-2927, which reports the same LBUG on Lustre 2.4.0.

Comment by Bruno Faccini (Inactive) [ 12/Mar/13 ]

Thanks, Patrick.

Aurelien, it is true that the EREMOTE code is already there, and has been for quite a long time! And this puzzles me, because I still cannot understand why you did not hit this before.

Also, info->mti_spec.sp_cr_flags is MDS_OPEN_OWNEROVERRIDE|MDS_OPEN_LOCK, which should come from an NFS export ... So we may have only triggered a very rare open-reconstruct need for an NFSd client request, as in the LU-2927 case too?

Comment by Aurelien Degremont (Inactive) [ 13/Mar/13 ]

Bruno,

The fact that we hit this bug with 2.1.4 (Bull 227) and 2.4 (LU-2927) makes me think some recent change introduced a race condition with this code. For compatibility reasons some patches from 2.4 were backported to 2.1.4, and Bull backported other patches too. But I've looked at the differences between 2.1.3 and 2.1.4 and found no candidate for such a regression...

Strange....

Comment by Bruno Faccini (Inactive) [ 13/Mar/13 ]

Hello Aurelien,

Yes, this looked strange to me as well, and this is why I rather think you hit a very rare situation that triggers an old and still-present bug. Can you try to determine from the crash-dump which FS and client/node were involved? I wonder if it could be a Lustre FS that one of your clients then re-exports via NFS?

BTW, according to my colleagues, it seems that the EREMOTE usage for the open_lock feature may be avoidable, so I may come back with a patch proposal.

Comment by Aurelien Degremont (Inactive) [ 13/Mar/13 ]

The filesystem is our scratch fs on TERA-100, which is absolutely not re-exported via NFS.

The client that seems involved was a classical compute client using MPI-IO, as previously said (not at all sure this is related).

Comment by Bruno Faccini (Inactive) [ 14/Mar/13 ]

Yes, that is right, this can definitely also happen outside of an NFS-export scenario ...

So I think that you finally experienced a somewhat rare open-reconstruct/recovery scenario that triggered the bug (like LU-1702 in 2.2 and LU-2927 in 2.4), and what cleared the situation was the client reboot for the downgrade ...

But let's wait for the fix currently being investigated as part of LU-2927 to come up with a definitive solution. We also need to understand why our specific auto-tests (replay-single/test_55 and recovery-small/test_53) do not trigger it.

Comment by Sebastien Buisson (Inactive) [ 25/Mar/13 ]

Hi,

Now that a fix for LU-2927 has landed in 2.4, what do we do with this problem?
Can you advise?

Thanks,
Sebastien.

Comment by Bruno Faccini (Inactive) [ 26/Mar/13 ]

The master/2.4 fix landed from LU-2927 is at http://review.whamcloud.com/#change,5694; I will evaluate the b2_1 back-port.

Comment by Antoine Percher [ 28/Mar/13 ]

I have found the client node that is most certainly the one that sent the failing request,
and in its log we can see the stack of the request:

2013-03-06 16:03:04 INFO: task %%A197:17742 blocked for more than 120 seconds.
2013-03-06 16:03:04 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2013-03-06 16:03:04 %%A197 D 0000000000000001 0 17742 17718 0x00000000
2013-03-06 16:03:04 ffff8806f08eb658 0000000000000086 0000000000000000 ffffffffa0686d24
2013-03-06 16:03:04 0000000000000008 ffff88080f3e3c00 ffff8806f08eb628 ffffffffa06872e7
2013-03-06 16:03:04 ffff88080775da70 ffff8806f08ebfd8 000000000000db00 ffff88080775da70
2013-03-06 16:03:04 Call Trace:
2013-03-06 16:03:04 [<ffffffff81485ab5>] schedule_timeout+0x205/0x2d0
2013-03-06 16:03:04 [<ffffffff81486a25>] __down+0x75/0xc0
2013-03-06 16:03:04 [<ffffffff81081611>] down+0x41/0x50
2013-03-06 16:03:04 [<ffffffffa08e4ae4>] mdc_enqueue+0x4e4/0x13a0 [mdc]
2013-03-06 16:03:04 [<ffffffffa08e5bc0>] mdc_intent_lock+0x220/0x630 [mdc]
2013-03-06 16:03:04 [<ffffffffa0b61920>] lmv_intent_open+0x2d0/0x10a0 [lmv]
2013-03-06 16:03:04 [<ffffffffa0b62980>] lmv_intent_lock+0x290/0x360 [lmv]
2013-03-06 16:03:04 [<ffffffffa0a2cbd8>] ll_revalidate_it+0x2b8/0x15b0 [lustre]
2013-03-06 16:03:04 [<ffffffffa0a2e003>] ll_revalidate_nd+0x133/0x3a0 [lustre]
2013-03-06 16:03:04 [<ffffffff81171612>] do_lookup+0x62/0x1e0
2013-03-06 16:03:04 [<ffffffff81172b4a>] path_walk+0x6a/0xe0
2013-03-06 16:03:04 [<ffffffff81172d1b>] do_path_lookup+0x5b/0xa0
2013-03-06 16:03:04 [<ffffffff81173c4b>] do_filp_open+0xfb/0xd00
2013-03-06 16:03:04 [<ffffffff811607b9>] do_sys_open+0x69/0x140
2013-03-06 16:03:04 [<ffffffff811608d0>] sys_open+0x20/0x30
2013-03-06 16:03:04 [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b

I have also attached a file with the complete client log.

Comment by Sebastien Buisson (Inactive) [ 04/Apr/13 ]

Hi Bruno,

We tried to backport the fix from LU-2927 to b2_1, but we failed. Have you been able to make progress on this?
For the record, this issue is the most disruptive one at CEA at the moment.

TIA,
Sebastien.

Comment by Bruno Faccini (Inactive) [ 05/Apr/13 ]

Yes, I am working on it, and as you pointed out it is not an easy one; I will keep you updated.

Comment by Bruno Faccini (Inactive) [ 05/Apr/13 ]

Seb,
Just to share information and get this done as quickly as possible: did you experience regressions in your porting attempts? And if so, of what kind?
I have one build in progress, and will let you know about testing progress soon.

Comment by Bruno Faccini (Inactive) [ 07/Apr/13 ]

The b2_1 port/patch http://review.whamcloud.com/5954 has been submitted and successfully passed auto-tests.

Comment by Patrick Valentin (Inactive) [ 09/Apr/13 ]

Bruno,
I agree with your b2_1 patch. I tried to backport the b2_4 fix from LU-2927, but there are quite a lot of changes around mdt_open_by_fid_lock(), and I was afraid of missing something.

Comment by Alexandre Louvet [ 23/Aug/13 ]

Bruno, sorry to say that just after installing the patch, we got a lot of crashes on 3 large clusters.

The LBUG message is:
(mdt_handler.c:3411:mdt_intent_reint()) ASSERTION( lustre_handle_is_used(&lhc->mlh_reg_lh) ) failed:
(mdt_handler.c:3411:mdt_intent_reint()) LBUG

followed by this stack:
Call Trace:
[<ffffffffa041a7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa041ae07>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0c0a841>] mdt_intent_reint+0x4f1/0x530 [mdt]
[<ffffffffa0c08c09>] mdt_intent_policy+0x379/0x690 [mdt]
[<ffffffffa06653c1>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
[<ffffffffa068b3dd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
[<ffffffffa0c09586>] mdt_enqueue+0x46/0x130 [mdt]
[<ffffffffa0bfe762>] mdt_handle_common+0x932/0x1750 [mdt]
[<ffffffffa0bff655>] mdt_regular_handle+0x15/0x20 [mdt]
[<ffffffffa06ba4f6>] ptlrpc_main+0xd16/0x1a80 [ptlrpc]
[<ffffffff810017cc>] ? __switch_to+0x1ac/0x320
[<ffffffffa06b97e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
[<ffffffff8100412a>] child_rip+0xa/0x20
[<ffffffffa06b97e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
[<ffffffffa06b97e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
[<ffffffff81004120>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG
Pid: 60412, comm: mdt_440 Not tainted 2.6.32-220.23.1.bl6.Bull.28.8.x86_64 0000001
Call Trace:
[<ffffffff81484650>] ? panic+0x78/0x143
[<ffffffffa041ae5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[<ffffffffa0c0a841>] ? mdt_intent_reint+0x4f1/0x530 [mdt]
[<ffffffffa0c08c09>] ? mdt_intent_policy+0x379/0x690 [mdt]
[<ffffffffa06653c1>] ? ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
[<ffffffffa068b3dd>] ? ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
[<ffffffffa0c09586>] ? mdt_enqueue+0x46/0x130 [mdt]
[<ffffffffa0bfe762>] ? mdt_handle_common+0x932/0x1750 [mdt]
[<ffffffffa0bff655>] ? mdt_regular_handle+0x15/0x20 [mdt]
[<ffffffffa06ba4f6>] ? ptlrpc_main+0xd16/0x1a80 [ptlrpc]
[<ffffffff810017cc>] ? __switch_to+0x1ac/0x320
[<ffffffffa06b97e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
[<ffffffff8100412a>] ? child_rip+0xa/0x20
[<ffffffffa06b97e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
[<ffffffffa06b97e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
[<ffffffff81004120>] ? child_rip+0x0/0x20

That was observed on a system running Lustre 2.1.5 plus the following patches:

  • ORNL-22 general ptlrpcd threads pool support
  • LU-1144 implement a NUMA aware ptlrpcd binding policy
  • LU-1110 MDS Oops in osd_xattr_get() during file open by FID
  • LU-2613 too much unreclaimable slab space
  • LU-2624 ptlrpc fix thread stop
  • LU-2683 client deadlock in cl_lock_mutex_get
  • LU-2943 LBUG in mdt_reconstruct_open()

I agree this is not the same context as previously, but it is located exactly where the patch modifies the source code.
The patch was removed on the 3 affected clusters and since then, stability is back.

Alex.

Comment by Bruno Faccini (Inactive) [ 26/Aug/13 ]

Hmm, sorry about that; due to your report I am currently looking again at the original/master patch from LU-2927. It seems that I missed some case in my 1st back-port, and I am testing a new version. Will keep you updated soon.

Comment by Bruno Faccini (Inactive) [ 29/Aug/13 ]

The new version (patch-set #3) of the b2_1 port/patch http://review.whamcloud.com/5954 has been submitted and successfully passed auto-tests.

Comment by Alexandre Louvet [ 05/Sep/13 ]

What is the current status of the latest patch?

Comment by Bruno Faccini (Inactive) [ 09/Sep/13 ]

Hello Alex,
The reviewers approved my patch, so it should be integrated soon, but in the meantime is there a possibility for you to temporarily integrate it and test it under a production workload? I know it is not easy to set up for you since it affects the MDS side, but I have no idea how to reproduce this locally.

Comment by Sebastien Buisson (Inactive) [ 09/Sep/13 ]

Hi Bruno,

Patchset #3 of http://review.whamcloud.com/5954 was rolled out at CEA for test purposes at the end of last week.
Hopefully we will have news soon.

Cheers,
Sebastien.

Comment by Alexandre Louvet [ 09/Sep/13 ]

Hello Bruno,
The new package with your fix was delivered last Friday, once we got the approval to pick up your patch.
What remains is that we have to find a time frame to install it on the system.

I'll keep you informed.

Cheers,
Alex.

Comment by Bruno Faccini (Inactive) [ 18/Nov/13 ]

Hello Alex and Seb, do you have any update for this ticket?
Bye,
Bruno.

Comment by Sebastien Buisson (Inactive) [ 20/Nov/13 ]

Hi Bruno,

The support team confirms that your patch does fix the issue.
Thank you!

Sebastien.

Comment by Bruno Faccini (Inactive) [ 20/Nov/13 ]

Cool, thanks for your update Seb. So I am marking this ticket as Fixed.
