[LU-1520] client fails MDS connection and stack threads on another client
| Created: | 14/Jun/12 | Updated: | 29/Apr/16 | Resolved: | 29/Apr/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.7 |
| Fix Version/s: | Lustre 1.8.9 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 7592 |
| Description |
|
A client (cluster1) failed its connection to the MDS and recovered, but then failed the connection again for some reason.

Jun 11 11:28:45 cluster1 kernel: Lustre: 30906:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1402727385081248 sent from lustre-MDT0000-mdc-ffff880c06249800 to NID 192.168.3.45@o2ib 995s ago has timed out (995s prior to deadline).
Jun 11 11:28:45 cluster1 kernel: req@ffff880293aaf800 x1402727385081248/t0 o101->lustre-MDT0000_UUID@192.168.3.45@o2ib:12/10 lens 560/1616 e 3 to 1 dl 1339381725 ref 1 fl Rpc:/0/0 rc 0/0

A few hours later, call traces showed up on another client (cluster3).

Jun 11 15:03:10 cluster3 kernel: Call Trace:
Jun 11 15:03:10 cluster3 kernel: [<ffffffff814dbcd5>] schedule_timeout+0x215/0x2e0
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa086808d>] ? lustre_msg_early_size+0x6d/0x70 [ptlrpc]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0996244>] ? mdc_intent_open_pack+0x364/0x530 [mdc]
Jun 11 15:03:10 cluster3 kernel: [<ffffffff8115a1ae>] ? cache_alloc_refill+0x9e/0x240
Jun 11 15:03:10 cluster3 kernel: [<ffffffff814dcbf2>] __down+0x72/0xb0
Jun 11 15:03:10 cluster3 kernel: [<ffffffff81093f61>] down+0x41/0x50
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0997173>] mdc_enqueue+0x283/0xa20 [mdc]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa081fbef>] ? __ldlm_handle2lock+0x9f/0x3d0 [ptlrpc]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa081fbef>] ? __ldlm_handle2lock+0x9f/0x3d0 [ptlrpc]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa09987d2>] mdc_intent_lock+0x102/0x440 [mdc]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0853e90>] ? ptlrpc_req_finished+0x10/0x20 [ptlrpc]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0a431a5>] ? ll_lookup_it+0x405/0x870 [lustre]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0a40490>] ? ll_mdc_blocking_ast+0x0/0x5f0 [lustre]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0a402ee>] ? ll_prepare_mdc_op_data+0xbe/0x120 [lustre]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0a40490>] ? ll_mdc_blocking_ast+0x0/0x5f0 [lustre]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa083f770>] ? ldlm_completion_ast+0x0/0x8a0 [ptlrpc]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0a402ee>] ? ll_prepare_mdc_op_data+0xbe/0x120 [lustre]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0a430b5>] ll_lookup_it+0x315/0x870 [lustre]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0a40490>] ? ll_mdc_blocking_ast+0x0/0x5f0 [lustre]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa06f97c1>] ? cfs_alloc+0x91/0xf0 [libcfs]
Jun 11 15:03:10 cluster3 kernel: [<ffffffffa0a43ac8>] ll_lookup_nd+0x88/0x470 [lustre]
Jun 11 15:03:10 cluster3 kernel: [<ffffffff8118ad4e>] ? d_alloc+0x13e/0x1b0
Jun 11 15:03:10 cluster3 kernel: [<ffffffff81181c02>] __lookup_hash+0x102/0x160
Jun 11 15:03:10 cluster3 kernel: [<ffffffff81181d3a>] lookup_hash+0x3a/0x50
Jun 11 15:03:10 cluster3 kernel: [<ffffffff81182768>] do_filp_open+0x2c8/0xd90
Jun 11 15:03:10 cluster3 kernel: [<ffffffff8118f1e2>] ? alloc_fd+0x92/0x160
Jun 11 15:03:10 cluster3 kernel: [<ffffffff8116f989>] do_sys_open+0x69/0x140
Jun 11 15:03:10 cluster3 kernel: [<ffffffff8116faa0>] sys_open+0x20/0x30
Jun 11 15:03:10 cluster3 kernel: [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b

I will upload all the log files soon. |
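The cluster3 trace shows the open path (sys_open -> do_filp_open -> ll_lookup_it -> mdc_intent_lock -> mdc_enqueue) sleeping in down()/__down, i.e. the thread is blocked on a semaphore inside mdc_enqueue before its own RPC is ever sent. As a rough illustration of that failure shape only (a minimal userspace sketch, not Lustre code; every name in it is hypothetical), the program below funnels several threads through one semaphore and lets the holder stall, producing the same pile-up of waiters:

```c
/*
 * Hypothetical userspace illustration (not Lustre code): several worker
 * threads funnel their "metadata requests" through a single semaphore.
 * If the thread currently holding it stalls (sleep() stands in for an
 * MDS request that never completes), every other worker piles up in
 * sem_wait() -- the same shape as the down()/__down frames above.
 * Build with: gcc -pthread demo.c
 */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

static sem_t rpc_lock;  /* stands in for a single per-mount serializing lock */

static void *worker(void *arg)
{
        long id = (long)arg;

        sem_wait(&rpc_lock);            /* analogous to the down() in the trace */
        printf("worker %ld: got the lock\n", id);
        if (id == 0)
                sleep(60);              /* worker 0 stalls while holding the lock */
        sem_post(&rpc_lock);
        return NULL;
}

int main(void)
{
        pthread_t tid[8];
        long i;

        sem_init(&rpc_lock, 0, 1);      /* binary semaphore: one request at a time */
        for (i = 0; i < 8; i++)
                pthread_create(&tid[i], NULL, worker, (void *)i);
        for (i = 0; i < 8; i++)
                pthread_join(tid[i], NULL);
        sem_destroy(&rpc_lock);
        return 0;
}
```

With the stalled holder standing in for the request that timed out on cluster1's MDC, every later open on the same mount ends up waiting on the lock rather than on its own RPC, which matches the hours-later hangs reported on cluster3.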
| Comments |
| Comment by Shuichi Ihara (Inactive) [ 14/Jun/12 ] |
|
All the log files are on /uploads/ Thanks! |
| Comment by Peter Jones [ 14/Jun/12 ] |
|
Hi Hongchao, could you please look into this issue? Thanks, Peter |
| Comment by Hongchao Zhang [ 15/Jun/12 ] |
|
The call traces seen on cluster3 are caused by waiting for the rpc_lock in mdc_enqueue, which is a result of the bad performance. |
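Purely as a generic illustration of the alternative to blocking forever on such a serializing lock (this does not describe the actual fix tracked later in this ticket at http://review.whamcloud.com/#change,3270, and all names below are hypothetical), a bounded wait returns with ETIMEDOUT instead of sleeping indefinitely while the MDS request is stuck:

```c
/*
 * Generic sketch only, not the Lustre patch: a bounded wait on the
 * same kind of serializing lock lets the caller give up (or retry)
 * after a deadline instead of hanging for hours in down().
 * Build with: gcc -pthread timedwait.c
 */
#include <errno.h>
#include <semaphore.h>
#include <stdio.h>
#include <time.h>

/* Try to take the lock, but give up after "seconds" seconds. */
static int acquire_with_timeout(sem_t *lock, int seconds)
{
        struct timespec deadline;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += seconds;

        if (sem_timedwait(lock, &deadline) == -1) {
                if (errno == ETIMEDOUT)
                        fprintf(stderr, "lock not acquired within %d s, giving up\n",
                                seconds);
                return -errno;
        }
        return 0;
}

int main(void)
{
        sem_t lock;

        sem_init(&lock, 0, 0);            /* starts "held": nobody ever releases it */
        acquire_with_timeout(&lock, 2);   /* returns after ~2 s instead of hanging */
        sem_destroy(&lock);
        return 0;
}
```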
| Comment by Shuichi Ihara (Inactive) [ 18/Jun/12 ] |
|
Hongchao, It seems Any reason why this was reverted? Which patches are worth trying? Please advise. Ihara |
| Comment by Shuichi Ihara (Inactive) [ 19/Jun/12 ] |
|
Just read |
| Comment by Hongchao Zhang [ 26/Jun/12 ] |
|
Hi Ihara, sorry for the delayed response! |
| Comment by Shuichi Ihara (Inactive) [ 28/Jun/12 ] |
|
Hi Hongchao, |
| Comment by Hongchao Zhang [ 28/Jun/12 ] |
|
Hi Ihara |
| Comment by Hongchao Zhang [ 04/Jul/12 ] |
|
the patch is tracked at http://review.whamcloud.com/#change,3270 |
| Comment by Shuichi Ihara (Inactive) [ 11/Jul/12 ] |
|
Thanks! |
| Comment by Shuichi Ihara (Inactive) [ 19/Jul/12 ] |
|
Hongchao, we will be applying the backported patches at the customer site, but before applying them, I wonder if someone could review them. Please. Ihara |
| Comment by Hongchao Zhang [ 19/Jul/12 ] |
|
Oh yes, sorry! I missed your previous comment; I'll do it right now. |
| Comment by Kit Westneat (Inactive) [ 29/Aug/12 ] |
|
Hello, I was wondering what the status of this patch is. It appears that there were some suggested changes; were those ever addressed? Can we get this landed? |
| Comment by Hongchao Zhang [ 03/Sep/12 ] |
|
Hi, the updated patch is being created and tested, and I will upload it soon. |
| Comment by Hongchao Zhang [ 04/Sep/12 ] |
|
The updated patch has been pushed: http://review.whamcloud.com/#change,3270 |
| Comment by Kit Westneat (Inactive) [ 13/Sep/12 ] |
|
Hello, can we get an update? It looks like it has two +1 reviews; can it be landed? Thanks, Kit |
| Comment by Peter Jones [ 13/Sep/12 ] |
|
Kit, the b1_8 version is ready to land, but the master version is still being worked on: http://review.whamcloud.com/#change,3859. We try to land to master first so as to avoid deltas arising between 1.8.x and 2.x. Peter |
| Comment by Kit Westneat (Inactive) [ 13/Sep/12 ] |
|
Ah ok, thanks. |
| Comment by Shuichi Ihara (Inactive) [ 14/Sep/12 ] |
|
Peter, |
| Comment by Kit Westneat (Inactive) [ 26/Sep/12 ] |
|
It looks like the Maloo testing hit |
| Comment by Hongchao Zhang [ 15/Oct/12 ] |
|
There is a bug in the previous patch, which caused sanity subtest 124a to fail. |
| Comment by Kit Westneat (Inactive) [ 15/Oct/12 ] |
|
Hi Hongchao, |
| Comment by Hongchao Zhang [ 15/Oct/12 ] |
|
no, the 1.8.x version has no such problem. |
| Comment by Kit Westneat (Inactive) [ 08/Apr/13 ] |
|
Any updates on the master port of this patch? We are carrying the 1.8.x version in our 1.8.9 build, but it would be nice to integrate it into the Intel version. The last update on the changeset was in October. |
| Comment by James A Simmons [ 16/Sep/15 ] |
|
Really old ticket. Peter, we should close this as well. |
| Comment by Peter Jones [ 16/Sep/15 ] |
|
Probably we should defer to DDN on that? |
| Comment by John Fuchs-Chesney (Inactive) [ 01/Apr/16 ] |
|
Hello Ihara, Do you want us to keep this open, or can we go ahead and mark it as resolved/won't fix? Thanks, |
| Comment by John Fuchs-Chesney (Inactive) [ 29/Apr/16 ] |
|
Hello Ihara, We have marked this as Resolved/Won't fix. Thanks, |