[LU-793] Reconnections should not be refused when there is a request in progress from this client. Created: 25/Oct/11 Updated: 03/Jun/19 Resolved: 18/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 2.2.0, Lustre 2.4.0, Lustre 1.8.6 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | JL | ||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5914 |
| Description |
|
While this was originally a useful workaround, it created a lot of other unintended problems. This code must be disabled; instead, we should simply avoid handling several duplicate requests at the same time. |
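To make the intended replacement concrete, here is a minimal sketch (not the actual ptlrpc code; the structure and helper names are invented for illustration) of what refusing to handle a duplicate in-flight request, rather than refusing the reconnection itself, could look like:

#include <stdbool.h>
#include <stdint.h>

struct srv_request {
        uint64_t            req_xid;   /* client-assigned request XID */
        struct srv_request *req_next;  /* next request being processed */
};

/* Return true if a request with the same XID is already in processing;
 * the duplicate would then simply not be handled a second time, while
 * the reconnection itself is accepted. */
static bool duplicate_in_flight(const struct srv_request *in_flight,
                                uint64_t xid)
{
        const struct srv_request *req;

        for (req = in_flight; req != NULL; req = req->req_next)
                if (req->req_xid == xid)
                        return true;
        return false;
}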
| Comments |
| Comment by Cory Spitz [ 28/Mar/12 ] |
|
Vitaly shared that a patch is under review at http://review.whamcloud.com/#change,1616 |
| Comment by Iurii Golovach (Inactive) [ 04/Apr/12 ] |
|
Hi Oleg, since we are interested in this patch, could you provide us with the following information: thank you. |
| Comment by Oleg Drokin [ 04/Apr/12 ] |
|
Well, I cannot predict how well the inspections will pass. Also see my comment for patchset 3 on Mar 13th. |
| Comment by Iurii Golovach (Inactive) [ 10/Apr/12 ] |
|
Hi Oleg. Is it possible to speed up the review process somehow? Or could we do the review in Gerrit from our side, since we are interested in this code? Thank you, |
| Comment by Nathan Rutman [ 02/May/12 ] |
|
In the Description for this bug, does "this was a useful workaround" refer to the http://review.whamcloud.com/#change,1616 patch? |
| Comment by Oleg Drokin [ 02/May/12 ] |
|
The workaround being referred to is the refusal of reconnection. The patch in change 1616 is originally mine, and I still work on it as I have time. Andreas' comments are definitely valid, and I think the request switching is needed; I liked this idea from the start, and now Andreas seems to like it too, so it needs to be done, I guess. |
| Comment by Cory Spitz [ 14/Aug/12 ] |
|
http://review.whamcloud.com/#change,1616 needs rebasing. |
| Comment by Christopher Morrone [ 22/Oct/12 ] |
|
Long standing bug, duplicate of |
| Comment by Iurii Golovach (Inactive) [ 06/Nov/12 ] |
|
This diagram shows the cases in which we drop the new packet and the cases in which we process it. |
| Comment by Jodi Levi (Inactive) [ 20/Dec/12 ] |
|
Mike, |
| Comment by Mikhail Pershin [ 06/Jan/13 ] |
|
http://review.whamcloud.com/#change,4960: I've just updated the old patch to be compatible with the new master code. It provides the same functionality as the old one so far, without any attempt to use new reply buffers while answering an older request. |
| Comment by Cliff White (Inactive) [ 15/Jan/13 ] |
|
I am still experiencing this issue on Hyperion, especially using mdtest file-per-process. In the last test, we started seeing these timeouts prior to the actual error.

Jan 14 22:35:08 hyperion-rst6 kernel: LNet: Service thread pid 11870 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan 14 22:35:08 hyperion-rst6 kernel: Pid: 11870, comm: mdt03_001
Jan 14 22:35:08 hyperion-rst6 kernel:
Jan 14 22:35:08 hyperion-rst6 kernel: Call Trace:
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa06cf072>] start_this_handle+0x282/0x500 [jbd2]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffff81092170>] ? autoremove_wake_function+0x0/0x40
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa06cf4f0>] jbd2_journal_start+0xd0/0x110 [jbd2]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa072bab8>] ldiskfs_journal_start_sb+0x58/0x90 [ldiskfs]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa072be7c>] ldiskfs_dquot_initialize+0x4c/0xc0 [ldiskfs]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0708590>] ? ldiskfs_delete_inode+0x0/0x250 [ldiskfs]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffff811968a3>] generic_delete_inode+0x173/0x1d0
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffff81196965>] generic_drop_inode+0x65/0x80
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffff811957b2>] iput+0x62/0x70
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0e3fc24>] osd_object_delete+0x1d4/0x2f0 [osd_ldiskfs]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa07e6059>] lu_object_free+0x89/0x1a0 [obdclass]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa03a5351>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa03aa5d2>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa07e690d>] lu_object_put+0xad/0x320 [obdclass]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f2c69d>] mdt_object_unlock_put+0x3d/0x110 [mdt]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f4e0d7>] mdt_reint_unlink+0x637/0x850 [mdt]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0965fce>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f48cf1>] mdt_reint_rec+0x41/0xe0 [mdt]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f423a3>] mdt_reint_internal+0x4e3/0x7d0 [mdt]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f426d4>] mdt_reint+0x44/0xe0 [mdt]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f31a72>] mdt_handle_common+0x8e2/0x1680 [mdt]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa0f67a25>] mds_regular_handle+0x15/0x20 [mdt]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa09761ec>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa039564e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa096d669>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffff81053463>] ? __wake_up+0x53/0x70
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa0977776>] ptlrpc_main+0xbb6/0x1950 [ptlrpc]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa0976bc0>] ? ptlrpc_main+0x0/0x1950 [ptlrpc]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa0976bc0>] ? ptlrpc_main+0x0/0x1950 [ptlrpc]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa0976bc0>] ? ptlrpc_main+0x0/0x1950 [ptlrpc]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20 |
| Comment by Cliff White (Inactive) [ 15/Jan/13 ] |
|
Here is a call trace from a client, while the MDS is reporting the RPC issue:

2013-01-15 09:45:57 INFO: task mdtest:6078 blocked for more than 120 seconds.
2013-01-15 09:45:57 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2013-01-15 09:45:57 mdtest D 0000000000000003 0 6078 6074 0x00000000
2013-01-15 09:45:57 ffff88017ede7c58 0000000000000086 0000000000000000 0000000000000086
2013-01-15 09:45:57 ffff88030aa62a20 ffff8801c5851960 ffffffff81aa5740 0000000000000286
2013-01-15 09:45:57 ffff8801bba69ab8 ffff88017ede7fd8 000000000000fb88 ffff8801bba69ab8
2013-01-15 09:45:57 Call Trace:
2013-01-15 09:45:57 [<ffffffff814ff6fe>] __mutex_lock_slowpath+0x13e/0x180
2013-01-15 09:45:57 [<ffffffff814ff59b>] mutex_lock+0x2b/0x50
2013-01-15 09:45:57 [<ffffffffa07eb80c>] mdc_reint+0x3c/0x3b0 [mdc]
2013-01-15 09:45:57 [<ffffffffa07ec880>] mdc_unlink+0x1b0/0x500 [mdc]
2013-01-15 09:45:57 [<ffffffffa0a5feb9>] lmv_unlink+0x199/0x7e0 [lmv]
2013-01-15 09:45:57 [<ffffffffa097dba6>] ll_unlink+0x176/0x670 [lustre]
2013-01-15 09:45:57 [<ffffffff8118923f>] vfs_unlink+0x9f/0xe0
2013-01-15 09:45:57 [<ffffffff81187f8a>] ? lookup_hash+0x3a/0x50
2013-01-15 09:45:57 [<ffffffff8118b773>] do_unlinkat+0x183/0x1c0
2013-01-15 09:45:57 [<ffffffff8119a960>] ? mntput_no_expire+0x30/0x110
2013-01-15 09:45:57 [<ffffffff8100c5b5>] ? math_state_restore+0x45/0x60
2013-01-15 09:45:57 [<ffffffff8118b7c6>] sys_unlink+0x16/0x20
2013-01-15 09:45:57 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b |
| Comment by Christopher Morrone [ 15/Jan/13 ] |
|
Cliff, the refused-reconnection problem is the after-effect of whatever problem happened, and there are many ways it can be triggered. You should open a separate bug report for the initial problem of the server going unresponsive, or whatever happened. |
| Comment by Iurii Golovach (Inactive) [ 17/Jan/13 ] |
|
Hi gents, here is a patch version for this issue from our team: |
| Comment by Iurii Golovach (Inactive) [ 18/Jan/13 ] |
|
Since Andreas recommended merging Mikhail's patch and mine, I updated http://review.whamcloud.com/#change,4960 with an improvement request. |
| Comment by Cory Spitz [ 06/Mar/13 ] |
|
Change #4960 failed autotest. Do you just need to resubmit? |
| Comment by Mikhail Pershin [ 10/Mar/13 ] |
|
retriggered |
| Comment by Mikhail Pershin [ 02/Sep/13 ] |
|
This patch doesn't work properly with bulk resends, because bulks always use a new XID, even in the RESENT case. Therefore we cannot match the original request, which might be processed on the server at the same time. It is not clear how to solve this in a simple way in the context of this patch; it looks like the patch should be reworked and will be more complex, e.g. we might need to change the protocol and store the original XID in the bulk along with the new one.

int ptlrpc_register_bulk(struct ptlrpc_request *req)
{
        ...
        /* An XID is only used for a single request from the client.
         * For retried bulk transfers, a new XID will be allocated in
         * ptlrpc_check_set() if it needs to be resent, so it is not
         * using the same RDMA match bits after an error. */
        ...
}

and

void ptlrpc_resend_req(struct ptlrpc_request *req)
{
        DEBUG_REQ(D_HA, req, "going to resend");
        lustre_msg_set_handle(req->rq_reqmsg, &(struct lustre_handle){ 0 });
        req->rq_status = -EAGAIN;

        spin_lock(&req->rq_lock);
        req->rq_resend = 1;
        req->rq_net_err = 0;
        req->rq_timedout = 0;
        if (req->rq_bulk) {
                __u64 old_xid = req->rq_xid;

                /* ensure previous bulk fails */
                req->rq_xid = ptlrpc_next_xid();
                CDEBUG(D_HA, "resend bulk old x"LPU64" new x"LPU64"\n",
                       old_xid, req->rq_xid);
        }
        ptlrpc_client_wake_req(req);
        spin_unlock(&req->rq_lock);
} |
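For illustration only, a hedged sketch of the "store the original XID in the bulk along with the new one" idea mentioned above; the field and function names (rq_orig_xid, client_resend_bulk, server_matches_resent_bulk) are hypothetical and not part of the Lustre API:

typedef unsigned long long __u64;  /* kernel type, declared here so the sketch is self-contained */

struct bulk_req_ids {
        __u64 rq_xid;       /* current XID, regenerated on every resend */
        __u64 rq_orig_xid;  /* XID of the very first send, never changed */
};

/* On resend the client would remember the first XID and still switch to
 * a new one for the RDMA match bits. */
static void client_resend_bulk(struct bulk_req_ids *ids, __u64 new_xid)
{
        if (ids->rq_orig_xid == 0)
                ids->rq_orig_xid = ids->rq_xid;  /* record original XID once */
        ids->rq_xid = new_xid;                   /* fresh match bits for the resend */
}

/* The server could then recognize that a RESENT bulk request duplicates one
 * it is still processing, by comparing against the original XID. */
static int server_matches_resent_bulk(__u64 in_flight_xid,
                                      const struct bulk_req_ids *ids)
{
        return in_flight_xid == ids->rq_orig_xid;
}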
| Comment by Mikhail Pershin [ 11/Sep/13 ] |
|
The patch was refreshed: http://review.whamcloud.com/#/c/4960/. It doesn't handle bulk requests for now; this will be solved in a follow-up patch. |
| Comment by Peter Jones [ 02/Oct/13 ] |
|
This is still a support priority, but we need to finalize the fix before we consider it for inclusion in a release |
| Comment by Mikhail Pershin [ 25/Oct/13 ] |
|
The patch is refreshed; it now handles all requests, including bulk. That requires protocol changes and works only with new clients; old clients will be handled as before, returning -EBUSY on a connect request if there is another request in processing. |
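As a rough illustration of the compatibility handling described in this comment (the connect flag and function names below are hypothetical, and note that the protocol change was later dropped, see the next comment): new clients that advertise the feature get the new reconnect behavior, while old clients are still refused with -EBUSY when a request is in processing.

#include <errno.h>

/* Hypothetical connect feature flag; not an actual OBD_CONNECT_* value. */
#define CONNECT_FLAG_NEW_RESEND   0x1ULL

/* Decide how to treat a reconnect from a client: new clients are accepted
 * even with requests in processing, old clients get the legacy -EBUSY. */
static int reconnect_policy(unsigned long long client_connect_flags,
                            int requests_in_processing)
{
        if (client_connect_flags & CONNECT_FLAG_NEW_RESEND)
                return 0;                              /* accept reconnect */
        return requests_in_processing ? -EBUSY : 0;    /* legacy behavior */
}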
| Comment by Mikhail Pershin [ 27/Nov/13 ] |
|
The patch was updated again, and I hope it addresses all cases, including bulk requests. It doesn't change the protocol now. Chris, I expect this patch will land on master soon; can you try it and see how it helps? |
| Comment by Christopher Morrone [ 27/Nov/13 ] |
|
Unfortunately, no, we won't be able to find out if it helps for some time. We are doing a major upgrade to Lustre 2.4 over the next 2-3 weeks on the SCF machines, but this patch missed the window for inclusion in that distribution. We will have to work it into the pipeline for the next upgrade. Can you explain a little more about what the patch will do? I see "Bulk requests are aborted upon reconnection by comparing connection count of request and export." in the patch comment. What happens when the bulk requests are aborted? Will the client transparently resend them? Also, what happens if there is more than one RPC outstanding? Is the client able to reconnect in that case? |
| Comment by Mikhail Pershin [ 27/Nov/13 ] |
|
The behavior is almost the same as before for bulks. Currently, all pending bulks are aborted if a new reconnect arrives from the client, and the reconnect is refused with -EBUSY until there are no more active requests; this is how it was handled before the patch. With this patch we accept the reconnect even if there are active requests, and all bulks from the last connection are aborted. Basically it is the same behavior as before, but now the connection count is checked instead of a specific flag. |
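A minimal sketch of the connection-count check described above, with invented structure and function names (the real patch works on the ptlrpc export and request structures): each reconnect bumps the export's connection count, the reconnect is accepted, and any bulk whose request arrived under an older count is aborted.

#include <stddef.h>

struct export_sketch {
        int exp_conn_cnt;      /* incremented on every (re)connect */
};

struct bulk_sketch {
        int rq_conn_cnt;       /* export connection count when the request arrived */
};

/* Handle a reconnect: accept it (no -EBUSY), bump the connection count,
 * and abort every bulk that was received under the previous connection. */
static void handle_reconnect(struct export_sketch *exp,
                             struct bulk_sketch *bulks, size_t nbulks,
                             void (*abort_bulk)(struct bulk_sketch *))
{
        size_t i;

        exp->exp_conn_cnt++;   /* new connection generation */
        for (i = 0; i < nbulks; i++)
                if (bulks[i].rq_conn_cnt < exp->exp_conn_cnt)
                        abort_bulk(&bulks[i]);   /* stale bulk from the old connection */
}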
| Comment by Andreas Dilger [ 11/Dec/13 ] |
|
I think this patch introduced a timeout in conf-sanity ( |
| Comment by Peter Jones [ 08/Feb/14 ] |
|
Mike, could you please clarify what LLNL would need to port in order to use this fix on b2_4? Thanks, Peter |
| Comment by Mikhail Pershin [ 10/Feb/14 ] |
| Comment by Peter Jones [ 11/Feb/14 ] |
|
Backports to b2_4 http://review.whamcloud.com/#/c/9209/ |
| Comment by Manish Patel (Inactive) [ 03/Mar/14 ] |
|
Hi, we are seeing similar issues on the Lustre 2.1.6 release. Is this patch compatible with the 2.1.x release, and if so, can it be backported to branch b2_1? Thank you, |
| Comment by Christopher Morrone [ 02/Jul/14 ] |
|
The patch http://review.whamcloud.com/#/c/9211/ was determined to not be needed for b2_4. |