[LU-5341] Intermittent hangs waiting for RPCs in unregistering phase to timeout Created: 14/Jul/14  Updated: 28/Jul/16  Resolved: 28/Jul/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Ann Koehler (Inactive) Assignee: Bob Glossman (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-5259 request gets stuck in UNREGISTERING p... Resolved
is related to LU-5073 lustre don't able to unload modules i... Resolved
Severity: 3
Rank (Obsolete): 14898

 Description   

Regression introduced by the patch for LU-5073, http://review.whamcloud.com/10353

See also LU-5259, http://review.whamcloud.com/10846

When reply_in_callback executes before request_out_callback, the task sending the RPC can hang in ptlrpc_set_wait until the request times out. In the example below, an mdc_close call takes 55 seconds to complete even though the server handles the request promptly.
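For orientation, the waiting side is ptlrpc_set_wait in client.c, which sleeps roughly as sketched below (a paraphrase of the 2.6-era code, not verbatim source):

        /* The sender sleeps until ptlrpc_check_set() reports the set
         * complete, or until the timeout handler ptlrpc_expired_set()
         * fires (seen 55 seconds later in the trace below).  Only
         * ptlrpc_client_wake_req() re-runs the check any earlier. */
        lwi = LWI_TIMEOUT(cfs_time_seconds(timeout), ptlrpc_expired_set, set);
        rc = l_wait_event(set->set_waitq, ptlrpc_check_set(NULL, set), &lwi);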

Client:
> 00000002:00000001:18.0:1403799444.471726:0:14636:0:(mdc_request.c:829:mdc_close()) Process entered
...
> 00000100:00000001:18.0:1403799444.848775:0:14636:0:(client.c:1382:ptlrpc_send_new_req()) Process entered
> 00000100:00000040:18.0:1403799444.848776:0:14636:0:(lustre_net.h:3250:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc"  req@ffff880fffb10c00 x1471331744844100/t0(0) o35->snx11063-MDT0000-mdc-ffff880835a8a400@10.149.150.4@o2ib4000:23/10 lens 392/15080 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
...
> 00000100:00000200:18.0:1403799444.852847:0:14636:0:(events.c:100:reply_in_callback()) @@@ type 6, status 0  req@ffff880fffb10c00 x1471331744844100/t0(0) o35->snx11063-MDT0000-mdc-ffff880835a8a400@10.149.150.4@o2ib4000:23/10 lens 392/15080 e 0 to 0 dl 1403799539 ref 3 fl Rpc:R/0/ffffffff rc 0/-1
...
> 00000100:00000040:18.0:1403799444.852858:0:14636:0:(lustre_net.h:3250:ptlrpc_rqphase_move()) @@@ move req "Rpc" -> "Unregistering"  req@ffff880fffb10c00 x1471331744844100/t0(0) o35->snx11063-MDT0000-mdc-ffff880835a8a400@10.149.150.4@o2ib4000:23/10 lens 392/15080 e 0 to 0 dl 1403799539 ref 3 fl Rpc:R/0/ffffffff rc 0/-1
...
> 00000100:00000001:18.0:1403799499.847659:0:14636:0:(client.c:1929:ptlrpc_expired_set()) Process entered
> 00000100:00000001:18.0:1403799499.847661:0:14636:0:(client.c:1965:ptlrpc_expired_set()) Process leaving (rc=1 : 1 : 1)
...
> 00000100:00000040:18.0:1403799499.847669:0:14636:0:(lustre_net.h:3250:ptlrpc_rqphase_move()) @@@ move req "Unregistering" -> "Rpc"  req@ffff880fffb10c00 x1471331744844100/t0(0) o35->snx11063-MDT0000-mdc-ffff880835a8a400@10.149.150.4@o2ib4000:23/10 lens 392/15080 e 0 to 0 dl 1403799539 ref 2 fl Unregistering:R/0/ffffffff rc 0/-1
...
> 00000002:00000001:18.0:1403799444.472256:0:14636:0:(mdc_request.c:922:mdc_close()) Process leaving (rc=0 : 0 : 0)


MDS:
> 00000100:00100000:14.0:1403799444.858578:0:20307:0:(service.c:1886:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_178:0a613492-ed84-267d-d415-d0983ee768af+19:14636:x1471331744844100:12345-253@gni1:35
> 00000100:00100000:14.0:1403799444.858660:0:20307:0:(service.c:1930:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_178:0a613492-ed84-267d-d415-d0983ee768af+19:14636:x1471331744844100:12345-253@gni1:35 Request procesed in 89us (129us total) trans 193287303580 rc  0/0

When reply_in_callback executes, rq_req_unlink is still set, so the RPC is moved to the Unregistering phase until request_out_callback has a chance to clear the flag. But when request_out_callback is finally invoked, it arrives with an LNET_EVENT_SEND event:

 00000100:00000200:2:1404844097.216045:0:5109:0:(events.c:68:request_out_callback()) @@@ type 5, status 0  req@ffff8801fb962c00 x1473065104065052/t0(0) o35->snx11014-MDT0000-mdc-ffff88040f1a6000@10.10.100.4@o2ib6000:23/10 lens 392/2536 e 0 to 0 dl 1404844281 ref 3 fl Unregistering:R/0/ffffffff rc 0/-1

/* Note: this trace is from a different dump but shows the event type and RPC state. Apologies for mixing data from different dumps, but I don't have a single debug log that shows the whole story. */
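Paraphrasing the LU-5073 logic (a sketch, not verbatim source; see http://review.whamcloud.com/10353 for the exact change), the reply path parks the request roughly like this:

        /* The reply has arrived, but LNet has not yet signalled the unlink
         * of the request-out buffer, so the request cannot complete yet. */
        if (req->rq_req_unlink)
                ptlrpc_rqphase_move(req, RQ_PHASE_UNREGISTERING);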

With a send event, request_out_callback does not call ptlrpc_client_wake_req, so the mdc_close task waiting in ptlrpc_set_wait is not woken, ptlrpc_check_set is not invoked, and mdc_close waits until the RPC times out before continuing.

void request_out_callback(lnet_event_t *ev)
{
...
        DEBUG_REQ(D_NET, req, "type %d, status %d", ev->type, ev->status);
...
        if (ev->unlinked)
                req->rq_req_unlink = 0;

        if (ev->type == LNET_EVENT_UNLINK || ev->status != 0) {
                /* Failed send: make it seem like the reply timed out, just
                 * like failing sends in client.c does currently... */
                req->rq_net_err = 1;
                ptlrpc_client_wake_req(req);
        }
        /* A successful LNET_EVENT_SEND takes neither branch above, so the
         * task parked in ptlrpc_set_wait is never woken. */
...
}
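For illustration only, one way to close this window would be to wake the waiter whenever the send buffer is unlinked, not just on failure. This is a sketch of the idea, not the landed change (see http://review.whamcloud.com/11090 and LU-5696 for the actual fixes):

        if (ev->unlinked) {
                req->rq_req_unlink = 0;
                /* Sketch only: the request may be parked in the
                 * Unregistering phase waiting for exactly this unlink,
                 * so wake the waiter instead of letting ptlrpc_set_wait
                 * sleep until the RPC deadline. */
                ptlrpc_client_wake_req(req);
        }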

Note that waiting for the RPC timeout does not appear to cause any functional errors, but the performance penalty is significant.



 Comments   
Comment by John Hammond [ 14/Jul/14 ]

I've noticed that insanity tests 0 and 1 (and possibly others) fail about 50% of the time when run locally. I bisected that to the landing of http://review.whamcloud.com/10846.

== insanity test 1: MDS/MDS failure == 11:51:49 (1404751909)
Stopping /mnt/mds1 (opts:) on t
Failover mds1 to t
Stopping /mnt/mds2 (opts:) on t
Reintegrating MDS2
11:52:26 (1404751946) waiting for t network 900 secs ...
11:52:26 (1404751946) network interface is UP
Starting mds2:   -o loop /tmp/lustre-mdt2 /mnt/mds2
Started lustre-MDT0001
11:52:41 (1404751961) waiting for t network 900 secs ...
11:52:41 (1404751961) network interface is UP
Starting mds1:   -o loop /tmp/lustre-mdt1 /mnt/mds1
mount.lustre: mount /dev/loop1 at /mnt/mds1 failed: Input/output error
Is the MGS running?
Start of /tmp/lustre-mdt1 on mds1 failed 5
 insanity test_1: @@@@@@ FAIL: test_1 failed with 5
  Trace dump:
  = lustre/tests/../tests/test-framework.sh:4504:error_noexit()
  = lustre/tests/../tests/test-framework.sh:4535:error()
  = lustre/tests/../tests/test-framework.sh:4781:run_one()
  = lustre/tests/../tests/test-framework.sh:4816:run_one_logged()
  = lustre/tests/../tests/test-framework.sh:4636:run_test()
  = lustre/tests/insanity.sh:207:main()
Dumping lctl log to /tmp/test_logs/1404751907/insanity.test_1.*.1404751982.log
Dumping logs only on local client.
Comment by Ann Koehler (Inactive) [ 14/Jul/14 ]

Patch available at http://review.whamcloud.com/11090

This patch is an expedient fix that minimizes perturbations to the ptlrpc state machine. In the longer term, the whole set of issues associated with LU-5073 warrants an architectural review.

Comment by Cliff White (Inactive) [ 19/Sep/14 ]

The patch has a negative review; can you address those concerns?

Comment by Chris Horn [ 05/Dec/14 ]

Looks like this was addressed in LU-5696.

Comment by Ann Koehler (Inactive) [ 28/Jul/16 ]

Closing since the mod was superseded by LU-5696.
