[LU-793] Reconnections should not be refused when there is a request in progress from this client. - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.6.0
Affects Version/s: Lustre 2.1.0, Lustre 2.2.0, Lustre 2.4.0, Lustre 1.8.6
Labels:
- JL

Severity:
3
Rank (Obsolete):
5914

Description

While originally this was a useful workaround, it created a lot of other unintended problems.

This code must be disabled and instead we just should disable handling several duplicate requests at the same time.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

LU-793.jpg
39 kB
06/Nov/12 9:39 AM

Issue Links

duplicates

LU-7 Reconnect server->client connection

Resolved

is related to

LU-2621 SIngle client timeout hangs MDS -related to LU-793

Resolved

LU-4349 conf-sanity test_47: test failed to respond and timed out

Resolved

LU-4458 Interop 2.5.0<->2.6 failure on test suite recovery-small test_9

Resolved

LU-18072 Lock cancel resending overwhelms ldlm canceld thread

Resolved

LU-7 Reconnect server->client connection

Resolved

LU-4359 1.8.9 clients has endless bulk IO timeouts with 2.4.1 servers

Resolved

LU-2429 easy to find bad client

Resolved

LU-4480 cfs_fail_timeout id 214 sleeping for -1000ms

Closed

Trackbacks

Lustre 1.8.x known issues tracker While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA

(4 is related to, 1 Trackbacks)

Activity

[LU-793] Reconnections should not be refused when there is a request in progress from this client.

Mikhail Pershin added a comment - 02/Sep/13 5:39 AM - edited

This patch doesn't work properly with bulk resends because bulks use new XID always, even for RESENT case. Therefore we cannot match original request that might be processed on server at the same time. It is not clear how to solve this in simple way in context of this patch, looks like patch should be reworked and will be more complex, e.g. we might need to change protocol and store original XID in bulk along with new one.
Btw, the comment about bulk XID and replated code:

int ptlrpc_register_bulk(struct ptlrpc_request *req)
{
...

	/* An XID is only used for a single request from the client.
	 * For retried bulk transfers, a new XID will be allocated in
	 * in ptlrpc_check_set() if it needs to be resent, so it is not
	 * using the same RDMA match bits after an error.
	 */
...
}

and

void ptlrpc_resend_req(struct ptlrpc_request *req)
{
        DEBUG_REQ(D_HA, req, "going to resend");
        lustre_msg_set_handle(req->rq_reqmsg, &(struct lustre_handle){ 0 });
        req->rq_status = -EAGAIN;

	spin_lock(&req->rq_lock);
        req->rq_resend = 1;
        req->rq_net_err = 0;
        req->rq_timedout = 0;
        if (req->rq_bulk) {
                __u64 old_xid = req->rq_xid;

                /* ensure previous bulk fails */
                req->rq_xid = ptlrpc_next_xid();
                CDEBUG(D_HA, "resend bulk old x"LPU64" new x"LPU64"\n",
                       old_xid, req->rq_xid);
        }
        ptlrpc_client_wake_req(req);
	spin_unlock(&req->rq_lock);
}

Mikhail Pershin added a comment - 02/Sep/13 5:39 AM - edited This patch doesn't work properly with bulk resends because bulks use new XID always, even for RESENT case. Therefore we cannot match original request that might be processed on server at the same time. It is not clear how to solve this in simple way in context of this patch, looks like patch should be reworked and will be more complex, e.g. we might need to change protocol and store original XID in bulk along with new one. Btw, the comment about bulk XID and replated code: int ptlrpc_register_bulk(struct ptlrpc_request *req) { ... /* An XID is only used for a single request from the client. * For retried bulk transfers, a new XID will be allocated in * in ptlrpc_check_set() if it needs to be resent, so it is not * using the same RDMA match bits after an error. */ ... } and void ptlrpc_resend_req(struct ptlrpc_request *req) { DEBUG_REQ(D_HA, req, "going to resend" ); lustre_msg_set_handle(req->rq_reqmsg, &(struct lustre_handle){ 0 }); req->rq_status = -EAGAIN; spin_lock(&req->rq_lock); req->rq_resend = 1; req->rq_net_err = 0; req->rq_timedout = 0; if (req->rq_bulk) { __u64 old_xid = req->rq_xid; /* ensure previous bulk fails */ req->rq_xid = ptlrpc_next_xid(); CDEBUG(D_HA, "resend bulk old x" LPU64 " new x" LPU64 "\n" , old_xid, req->rq_xid); } ptlrpc_client_wake_req(req); spin_unlock(&req->rq_lock); }

Mikhail Pershin added a comment - 10/Mar/13 11:16 PM

retriggered

Mikhail Pershin added a comment - 10/Mar/13 11:16 PM retriggered

Cory Spitz added a comment - 06/Mar/13 3:17 PM

Change #4960 failed autotest. Do you just need to resubmit?

Cory Spitz added a comment - 06/Mar/13 3:17 PM Change #4960 failed autotest. Do you just need to resubmit?

Iurii Golovach (Inactive) added a comment - 18/Jan/13 11:05 AM

Since Andreas recommended to merge Mikhail patch and mine I updated http://review.whamcloud.com/#change,4960 with improvement request.

Iurii Golovach (Inactive) added a comment - 18/Jan/13 11:05 AM Since Andreas recommended to merge Mikhail patch and mine I updated http://review.whamcloud.com/#change,4960 with improvement request.

Iurii Golovach (Inactive) added a comment - 17/Jan/13 5:42 PM

Hi gents,

here is a patch version for this issue from our team:
http://review.whamcloud.com/#change,5054

Iurii Golovach (Inactive) added a comment - 17/Jan/13 5:42 PM Hi gents, here is a patch version for this issue from our team: http://review.whamcloud.com/#change,5054

Christopher Morrone (Inactive) added a comment - 15/Jan/13 1:59 PM - edited

Cliff, the refused-reconnection problem is the after-effect of whatever problem happened, and there are many ways it can be triggered. You should open a separate bug report for the initial problem of the server going unresponsive, or whatever happened.

Christopher Morrone (Inactive) added a comment - 15/Jan/13 1:59 PM - edited Cliff, the refused-reconnection problem is the after-effect of whatever problem happened, and there are many ways it can be triggered. You should open a separate bug report for the initial problem of the server going unresponsive, or whatever happened.

Cliff White (Inactive) added a comment - 15/Jan/13 1:52 PM

Here is a call trace from a client, while the MDS is reporting the RPC issue:

2013-01-15 09:45:57 INFO: task mdtest:6078 blocked for more than 120 seconds.
2013-01-15 09:45:57 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2013-01-15 09:45:57 mdtest        D 0000000000000003     0  6078   6074 0x00000000
2013-01-15 09:45:57  ffff88017ede7c58 0000000000000086 0000000000000000 0000000000000086
2013-01-15 09:45:57  ffff88030aa62a20 ffff8801c5851960 ffffffff81aa5740 0000000000000286
2013-01-15 09:45:57  ffff8801bba69ab8 ffff88017ede7fd8 000000000000fb88 ffff8801bba69ab8
2013-01-15 09:45:57 Call Trace:
2013-01-15 09:45:57  [<ffffffff814ff6fe>] __mutex_lock_slowpath+0x13e/0x180
2013-01-15 09:45:57  [<ffffffff814ff59b>] mutex_lock+0x2b/0x50
2013-01-15 09:45:57  [<ffffffffa07eb80c>] mdc_reint+0x3c/0x3b0 [mdc]
2013-01-15 09:45:57  [<ffffffffa07ec880>] mdc_unlink+0x1b0/0x500 [mdc]
2013-01-15 09:45:57  [<ffffffffa0a5feb9>] lmv_unlink+0x199/0x7e0 [lmv]
2013-01-15 09:45:57  [<ffffffffa097dba6>] ll_unlink+0x176/0x670 [lustre]
2013-01-15 09:45:57  [<ffffffff8118923f>] vfs_unlink+0x9f/0xe0
2013-01-15 09:45:57  [<ffffffff81187f8a>] ? lookup_hash+0x3a/0x50
2013-01-15 09:45:57  [<ffffffff8118b773>] do_unlinkat+0x183/0x1c0
2013-01-15 09:45:57  [<ffffffff8119a960>] ? mntput_no_expire+0x30/0x110
2013-01-15 09:45:57  [<ffffffff8100c5b5>] ? math_state_restore+0x45/0x60
2013-01-15 09:45:57  [<ffffffff8118b7c6>] sys_unlink+0x16/0x20
2013-01-15 09:45:57  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

Cliff White (Inactive) added a comment - 15/Jan/13 1:52 PM Here is a call trace from a client, while the MDS is reporting the RPC issue: 2013-01-15 09:45:57 INFO: task mdtest:6078 blocked for more than 120 seconds. 2013-01-15 09:45:57 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2013-01-15 09:45:57 mdtest D 0000000000000003 0 6078 6074 0x00000000 2013-01-15 09:45:57 ffff88017ede7c58 0000000000000086 0000000000000000 0000000000000086 2013-01-15 09:45:57 ffff88030aa62a20 ffff8801c5851960 ffffffff81aa5740 0000000000000286 2013-01-15 09:45:57 ffff8801bba69ab8 ffff88017ede7fd8 000000000000fb88 ffff8801bba69ab8 2013-01-15 09:45:57 Call Trace: 2013-01-15 09:45:57 [<ffffffff814ff6fe>] __mutex_lock_slowpath+0x13e/0x180 2013-01-15 09:45:57 [<ffffffff814ff59b>] mutex_lock+0x2b/0x50 2013-01-15 09:45:57 [<ffffffffa07eb80c>] mdc_reint+0x3c/0x3b0 [mdc] 2013-01-15 09:45:57 [<ffffffffa07ec880>] mdc_unlink+0x1b0/0x500 [mdc] 2013-01-15 09:45:57 [<ffffffffa0a5feb9>] lmv_unlink+0x199/0x7e0 [lmv] 2013-01-15 09:45:57 [<ffffffffa097dba6>] ll_unlink+0x176/0x670 [lustre] 2013-01-15 09:45:57 [<ffffffff8118923f>] vfs_unlink+0x9f/0xe0 2013-01-15 09:45:57 [<ffffffff81187f8a>] ? lookup_hash+0x3a/0x50 2013-01-15 09:45:57 [<ffffffff8118b773>] do_unlinkat+0x183/0x1c0 2013-01-15 09:45:57 [<ffffffff8119a960>] ? mntput_no_expire+0x30/0x110 2013-01-15 09:45:57 [<ffffffff8100c5b5>] ? math_state_restore+0x45/0x60 2013-01-15 09:45:57 [<ffffffff8118b7c6>] sys_unlink+0x16/0x20 2013-01-15 09:45:57 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

Cliff White (Inactive) added a comment - 15/Jan/13 1:41 PM

I am still experiencing this issue on Hyperion, especially using mdtest file-per-process. In the last test, starting seeing these timeouts prior to the actual error.

Jan 14 22:35:08 hyperion-rst6 kernel: LNet: Service thread pid 11870 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan 14 22:35:08 hyperion-rst6 kernel: Pid: 11870, comm: mdt03_001
Jan 14 22:35:08 hyperion-rst6 kernel:
Jan 14 22:35:08 hyperion-rst6 kernel: Call Trace:
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa06cf072>] start_this_handle+0x282/0x500 [jbd2]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffff81092170>] ? autoremove_wake_function+0x0/0x40
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa06cf4f0>] jbd2_journal_start+0xd0/0x110 [jbd2]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa072bab8>] ldiskfs_journal_start_sb+0x58/0x90 [ldiskfs]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa072be7c>] ldiskfs_dquot_initialize+0x4c/0xc0 [ldiskfs]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0708590>] ? ldiskfs_delete_inode+0x0/0x250 [ldiskfs]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffff811968a3>] generic_delete_inode+0x173/0x1d0
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffff81196965>] generic_drop_inode+0x65/0x80
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffff811957b2>] iput+0x62/0x70
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0e3fc24>] osd_object_delete+0x1d4/0x2f0 [osd_ldiskfs]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa07e6059>] lu_object_free+0x89/0x1a0 [obdclass]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa03a5351>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa03aa5d2>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa07e690d>] lu_object_put+0xad/0x320 [obdclass] 
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f2c69d>] mdt_object_unlock_put+0x3d/0x110 [mdt]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f4e0d7>] mdt_reint_unlink+0x637/0x850 [mdt]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0965fce>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f48cf1>] mdt_reint_rec+0x41/0xe0 [mdt]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f423a3>] mdt_reint_internal+0x4e3/0x7d0 [mdt]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f426d4>] mdt_reint+0x44/0xe0 [mdt]
Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f31a72>] mdt_handle_common+0x8e2/0x1680 [mdt]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa0f67a25>] mds_regular_handle+0x15/0x20 [mdt]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa09761ec>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa039564e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa096d669>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffff81053463>] ? __wake_up+0x53/0x70
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa0977776>] ptlrpc_main+0xbb6/0x1950 [ptlrpc]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa0976bc0>] ? ptlrpc_main+0x0/0x1950 [ptlrpc]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa0976bc0>] ? ptlrpc_main+0x0/0x1950 [ptlrpc]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa0976bc0>] ? ptlrpc_main+0x0/0x1950 [ptlrpc]
Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20

Cliff White (Inactive) added a comment - 15/Jan/13 1:41 PM I am still experiencing this issue on Hyperion, especially using mdtest file-per-process. In the last test, starting seeing these timeouts prior to the actual error. Jan 14 22:35:08 hyperion-rst6 kernel: LNet: Service thread pid 11870 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jan 14 22:35:08 hyperion-rst6 kernel: Pid: 11870, comm: mdt03_001 Jan 14 22:35:08 hyperion-rst6 kernel: Jan 14 22:35:08 hyperion-rst6 kernel: Call Trace: Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa06cf072>] start_this_handle+0x282/0x500 [jbd2] Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffff81092170>] ? autoremove_wake_function+0x0/0x40 Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa06cf4f0>] jbd2_journal_start+0xd0/0x110 [jbd2] Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa072bab8>] ldiskfs_journal_start_sb+0x58/0x90 [ldiskfs] Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa072be7c>] ldiskfs_dquot_initialize+0x4c/0xc0 [ldiskfs] Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0708590>] ? ldiskfs_delete_inode+0x0/0x250 [ldiskfs] Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffff811968a3>] generic_delete_inode+0x173/0x1d0 Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffff81196965>] generic_drop_inode+0x65/0x80 Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffff811957b2>] iput+0x62/0x70 Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0e3fc24>] osd_object_delete+0x1d4/0x2f0 [osd_ldiskfs] Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa07e6059>] lu_object_free+0x89/0x1a0 [obdclass] Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa03a5351>] ? libcfs_debug_msg+0x41/0x50 [libcfs] Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa03aa5d2>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs] Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa07e690d>] lu_object_put+0xad/0x320 [obdclass] Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f2c69d>] mdt_object_unlock_put+0x3d/0x110 [mdt] Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f4e0d7>] mdt_reint_unlink+0x637/0x850 [mdt] Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0965fce>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc] Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f48cf1>] mdt_reint_rec+0x41/0xe0 [mdt] Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f423a3>] mdt_reint_internal+0x4e3/0x7d0 [mdt] Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f426d4>] mdt_reint+0x44/0xe0 [mdt] Jan 14 22:35:08 hyperion-rst6 kernel: [<ffffffffa0f31a72>] mdt_handle_common+0x8e2/0x1680 [mdt] Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa0f67a25>] mds_regular_handle+0x15/0x20 [mdt] Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa09761ec>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc] Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa039564e>] ? cfs_timer_arm+0xe/0x10 [libcfs] Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa096d669>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffff81053463>] ? __wake_up+0x53/0x70 Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa0977776>] ptlrpc_main+0xbb6/0x1950 [ptlrpc] Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa0976bc0>] ? ptlrpc_main+0x0/0x1950 [ptlrpc] Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20 Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa0976bc0>] ? ptlrpc_main+0x0/0x1950 [ptlrpc] Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffffa0976bc0>] ? ptlrpc_main+0x0/0x1950 [ptlrpc] Jan 14 22:35:09 hyperion-rst6 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20

Mikhail Pershin added a comment - 06/Jan/13 9:49 AM

http://review.whamcloud.com/#change,4960

I've just updated old patch to be compatible with new master code. It provides the same functionality as old one so far, without any attempts to use new reply buffers while answering on older request.

Mikhail Pershin added a comment - 06/Jan/13 9:49 AM http://review.whamcloud.com/#change,4960 I've just updated old patch to be compatible with new master code. It provides the same functionality as old one so far, without any attempts to use new reply buffers while answering on older request.

Jodi Levi (Inactive) added a comment - 20/Dec/12 2:57 PM

Mike,
Could you please have a look at this one?
Thank you!

Jodi Levi (Inactive) added a comment - 20/Dec/12 2:57 PM Mike, Could you please have a look at this one? Thank you!

Iurii Golovach (Inactive) added a comment - 06/Nov/12 9:39 AM

This scheme shows cases when we drop new packet and when we process it.

Iurii Golovach (Inactive) added a comment - 06/Nov/12 9:39 AM This scheme shows cases when we drop new packet and when we process it.

People

Assignee:: Mikhail Pershin

Reporter:: Oleg Drokin

Votes:: 0 Vote for this issue

Watchers:: 24 Start watching this issue

Dates

Created:: 25/Oct/11 11:41 AM

Updated:: 29/Jul/24 5:21 PM

Resolved:: 18/Feb/14 10:03 PM