[LU-1520] client fails MDS connection and stack threads on another client Created: 14/Jun/12  Updated: 29/Apr/16  Resolved: 29/Apr/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.7
Fix Version/s: Lustre 1.8.9

Type: Bug Priority: Minor
Reporter: Shuichi Ihara (Inactive) Assignee: Hongchao Zhang
Resolution: Won't Fix Votes: 0
Labels: None

Issue Links:
Related
is related to LU-6529 Server side lock limits to avoid unne... Closed
Severity: 3
Rank (Obsolete): 7592

 Description   

A client (cluster1) failed its connection to the MDS and recovered, but then failed the connection again for some reason.

Jun 11 11:28:45 cluster1 kernel: Lustre: 30906:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1402727385081248 sent from lustre-MDT0000-mdc-
ffff880c06249800 to NID 192.168.3.45@o2ib 995s ago has timed out (995s prior to deadline).
Jun 11 11:28:45 cluster1 kernel:  req@ffff880293aaf800 x1402727385081248/t0 o101->lustre-MDT0000_UUID@192.168.3.45@o2ib:12/10 lens 560/1616 e 3 to 1 dl 
1339381725 ref 1 fl Rpc:/0/0 rc 0/0

A few hours later, call traces showed up on another client (cluster3).

Jun 11 15:03:10 cluster3 kernel: Call Trace:
Jun 11 15:03:10 cluster3 kernel: [<ffffffff814dbcd5>] schedule_timeout+0x215/0x2e0
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa086808d>] ? lustre_msg_early_size+0x6d/0x70 [ptlrpc]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0996244>] ? mdc_intent_open_pack+0x364/0x530 [mdc]
Jun 11 15:03:10 cluster3 kernel: [<ffffffff8115a1ae>] ? cache_alloc_refill+0x9e/0x240
Jun 11 15:03:10 cluster3 kernel: [<ffffffff814dcbf2>] __down+0x72/0xb0
Jun 11 15:03:10 cluster3 kernel: [<ffffffff81093f61>] down+0x41/0x50
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0997173>] mdc_enqueue+0x283/0xa20 [mdc]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa081fbef>] ? __ldlm_handle2lock+0x9f/0x3d0 [ptlrpc]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa081fbef>] ? __ldlm_handle2lock+0x9f/0x3d0 [ptlrpc]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa09987d2>] mdc_intent_lock+0x102/0x440 [mdc]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0853e90>] ? ptlrpc_req_finished+0x10/0x20 [ptlrpc]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0a431a5>] ? ll_lookup_it+0x405/0x870 [lustre]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0a40490>] ? ll_mdc_blocking_ast+0x0/0x5f0 [lustre]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0a402ee>] ? ll_prepare_mdc_op_data+0xbe/0x120 [lustre]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0a40490>] ? ll_mdc_blocking_ast+0x0/0x5f0 [lustre]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa083f770>] ? ldlm_completion_ast+0x0/0x8a0 [ptlrpc]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0a402ee>] ? ll_prepare_mdc_op_data+0xbe/0x120 [lustre]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0a430b5>] ll_lookup_it+0x315/0x870 [lustre]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0a40490>] ? ll_mdc_blocking_ast+0x0/0x5f0 [lustre]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa06f97c1>] ? cfs_alloc+0x91/0xf0 [libcfs]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0a43ac8>] ll_lookup_nd+0x88/0x470 [lustre]
Jun 11 15:03:10 cluster3 kernel: [<ffffffff8118ad4e>] ? d_alloc+0x13e/0x1b0
Jun 11 15:03:10 cluster3 kernel: [<ffffffff81181c02>] __lookup_hash+0x102/0x160
Jun 11 15:03:10 cluster3 kernel: [<ffffffff81181d3a>] lookup_hash+0x3a/0x50
Jun 11 15:03:10 cluster3 kernel: [<ffffffff81182768>] do_filp_open+0x2c8/0xd90
Jun 11 15:03:10 cluster3 kernel: [<ffffffff8118f1e2>] ? alloc_fd+0x92/0x160
Jun 11 15:03:10 cluster3 kernel: [<ffffffff8116f989>] do_sys_open+0x69/0x140
Jun 11 15:03:10 cluster3 kernel: [<ffffffff8116faa0>] sys_open+0x20/0x30
Jun 11 15:03:10 cluster3 kernel: [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b

I will upload all the log files soon.



 Comments   
Comment by Shuichi Ihara (Inactive) [ 14/Jun/12 ]

All log files are on /uploads/LU-1520.

Thanks!

Comment by Peter Jones [ 14/Jun/12 ]

Hi Hongchao

Could you please look into this issue?

Thanks

Peter

Comment by Hongchao Zhang [ 15/Jun/12 ]

The call traces seen on cluster3 are caused by waiting for the rpc_lock in mdc_enqueue, which is the result of the poor performance
of the MDS (mds01); it is running low on memory and could be degraded further by ldlm_pools_shrink (BZ24419 or LU-607?).
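
To make the failure mode concrete: if all metadata RPCs from one client are serialized behind a single rpc_lock, then one request stuck on a slow MDS blocks every other thread at that lock, which matches the mdc_enqueue -> down -> schedule_timeout stacks above. Below is a minimal user-space sketch of that serialization pattern (pthreads, not Lustre code; open_path and issue_metadata_rpc are made-up names for illustration):

/*
 * Minimal sketch of the failure mode: all metadata RPCs are serialized
 * behind one lock, so a single request that hangs on a slow MDS stalls
 * every other thread at the lock acquisition.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t rpc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Pretend to send one synchronous RPC to the MDS; thread 0 "hangs". */
static void issue_metadata_rpc(long id)
{
    sleep(id == 0 ? 30 : 1);    /* thread 0 models the stuck request */
}

static void *open_path(void *arg)
{
    long id = (long)arg;

    printf("thread %ld: waiting for rpc_lock\n", id);
    pthread_mutex_lock(&rpc_lock);      /* analogous to the down() in the trace */
    printf("thread %ld: sending intent-open RPC\n", id);
    issue_metadata_rpc(id);
    pthread_mutex_unlock(&rpc_lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[4];

    for (long i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, open_path, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

While thread 0 sits in its slow RPC, threads 1-3 are all blocked in pthread_mutex_lock, just as the cluster3 threads are blocked in down() inside mdc_enqueue.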

Comment by Shuichi Ihara (Inactive) [ 18/Jun/12 ]

Hongchao,

It seems LU-607 was landed in b1_8 once, but the patches were reverted by Johann.
http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=447794d5ebb71dbd39d7378944c3c9eeb230f8d0

Any reason why this was reverted?
Also, the patches for BZ24419 were not landed in the -wc branch.

Which patches are worth trying?

Please advise.

Ihara

Comment by Shuichi Ihara (Inactive) [ 19/Jun/12 ]

I just read LU-607 again, and it seems the patch for LU-607 introduced regressions and was reverted.
So this problem is still not fixed.
We are seeing a similar situation a couple of times a month. Please investigate how to avoid this problem.

Comment by Hongchao Zhang [ 26/Jun/12 ]

Hi, Ihara

Sorry for the delayed response!
The patch ported from BZ24419 was reverted because it caused an LASSERT. The patch tracked at BZ24419 improves the performance of LDLM shrinking,
which could mitigate this issue. I will port and test the newest patch from BZ24419 to check whether it can fix the issue.
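
For background on why a faster shrinker could help: under memory pressure the LDLM pool shrinker walks the unused-lock LRU and cancels cached locks, and if that walk is slow the MDS stays short of memory and slow to respond. The sketch below shows only the generic "scan the LRU and drop unused entries" idea, not the actual BZ24419 patch; struct dlm_lock, lru_shrink and cancel_lock are placeholder names:

/*
 * Generic illustration of shrinking an unused-lock LRU under memory
 * pressure. Not Lustre code; all names are placeholders.
 */
#include <stdio.h>
#include <stdlib.h>

struct dlm_lock {
    int refcount;              /* 0 means the lock is unused/cached */
    struct dlm_lock *next;     /* singly linked LRU list            */
};

static struct dlm_lock *lru_head;
static int lru_len;

static void cancel_lock(struct dlm_lock *lock)
{
    free(lock);                /* stands in for cancelling the lock */
}

/* Scan at most nr_to_scan LRU entries and cancel the unused ones. */
static int lru_shrink(int nr_to_scan)
{
    struct dlm_lock **pp = &lru_head;
    int freed = 0;

    while (*pp != NULL && nr_to_scan-- > 0) {
        struct dlm_lock *lock = *pp;

        if (lock->refcount == 0) {
            *pp = lock->next;          /* unlink and cancel */
            cancel_lock(lock);
            lru_len--;
            freed++;
        } else {
            pp = &lock->next;          /* still in use, skip */
        }
    }
    return freed;
}

int main(void)
{
    /* Populate a fake LRU; every 4th lock is still in use. */
    for (int i = 0; i < 1000; i++) {
        struct dlm_lock *lock = calloc(1, sizeof(*lock));
        lock->refcount = (i % 4 == 0);
        lock->next = lru_head;
        lru_head = lock;
        lru_len++;
    }

    int freed = lru_shrink(500);
    printf("freed %d locks, %d left in LRU\n", freed, lru_len);
    return 0;
}

The performance question the patch addresses is how quickly a pass like this can release memory when the kernel asks for it; the sketch only shows the shape of the scan.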

Comment by Shuichi Ihara (Inactive) [ 28/Jun/12 ]

Hi Hongchao,
Could you please port the patch to b1_8?

Comment by Hongchao Zhang [ 28/Jun/12 ]

Hi Ihara
Okay, I will port it to b1_8.

Comment by Hongchao Zhang [ 04/Jul/12 ]

The patch is tracked at http://review.whamcloud.com/#change,3270

Comment by Shuichi Ihara (Inactive) [ 11/Jul/12 ]

Thanks!
Does it need to be reviewed by someone before we try to test this patch?

Comment by Shuichi Ihara (Inactive) [ 19/Jul/12 ]

Hongchao, we will be applying the backported patches at the customer site, but before applying them, I wonder if someone could review them. Please.

Ihara

Comment by Hongchao Zhang [ 19/Jul/12 ]

oh, yes, sorry!! I missed your previous comment, I'll do it right now.

Comment by Kit Westneat (Inactive) [ 29/Aug/12 ]

Hello, I was wondering what the status of this patch is. It appears that there were some suggested changes; were those ever made? Can we get this landed?

Comment by Hongchao Zhang [ 03/Sep/12 ]

Hi, the updated patch is being created and tested; I will upload it soon.

Comment by Hongchao Zhang [ 04/Sep/12 ]

The updated patch has been pushed: http://review.whamcloud.com/#change,3270

Comment by Kit Westneat (Inactive) [ 13/Sep/12 ]

Hello, can we get an update? It looks like it has two +1 reviews; can it be landed? Thanks, Kit

Comment by Peter Jones [ 13/Sep/12 ]

Kit

The b1_8 version is ready to land, but the master version is still being worked on at http://review.whamcloud.com/#change,3859. We try to land to master first so as to avoid deltas arising between 1.8.x and 2.x.

Peter

Comment by Kit Westneat (Inactive) [ 13/Sep/12 ]

Ah ok, thanks.

Comment by Shuichi Ihara (Inactive) [ 14/Sep/12 ]

Peter,
Thanks for clarifying this, but nobody is listed as a reviewer on http://review.whamcloud.com/#change,3859.
I really want these patches to be reviewed and landed, into b1_8 as well; otherwise we will need to apply the patches on top of b1_8 to solve the current customer issue.

Comment by Kit Westneat (Inactive) [ 26/Sep/12 ]

It looks like the Maloo testing hit LU-479?

Comment by Hongchao Zhang [ 15/Oct/12 ]

There is a bug in the previous patch, which caused sanity subtest 124a to fail.
The updated patch has been pushed to Gerrit.

Comment by Kit Westneat (Inactive) [ 15/Oct/12 ]

Hi Hongchao,
Will the 1.8 version need to be updated too?

Comment by Hongchao Zhang [ 15/Oct/12 ]

No, the 1.8.x version has no such problem.

Comment by Kit Westneat (Inactive) [ 08/Apr/13 ]

Any updates on the master port of this patch? We are carrying the 1.8.x version in our 1.8.9 build, but it would be nice to integrate it into the Intel version. The last update on the changeset was in October.

Comment by James A Simmons [ 16/Sep/15 ]

Really old ticket. Peter, we should close this as well.

Comment by Peter Jones [ 16/Sep/15 ]

Probably we should defer to DDN on that?

Comment by John Fuchs-Chesney (Inactive) [ 01/Apr/16 ]

Hello Ihara,

Do you want us to keep this open, or can we go ahead and mark it as resolved/won't fix?

Thanks,
~ jfc.

Comment by John Fuchs-Chesney (Inactive) [ 29/Apr/16 ]

Hello Ihara,

We have marked this as Resolved/Won't fix.

Thanks,
~ jfc.
