Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5341

Intermittent hangs waiting for RPCs in unregistering phase to timeout

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.6.0
    • 3
    • 14898

    Description

      Regression introduced by the patch for LU-5073, http://review.whamcloud.com/10353

      See also LU-5259, http://review.whamcloud.com/10846

      When reply_in_callback executes before request_out_callback, the task sending the rpc can hang in ptlrpc_set_wait until the request times out. An example is a call to mdc_close; in this example the mdc_close takes 55 seconds to complete even though the server handles the request promptly.

      Client:
      > 00000002:00000001:18.0:1403799444.471726:0:14636:0:(mdc_request.c:829:mdc_close()) Process entered
      ...
      > 00000100:00000001:18.0:1403799444.848775:0:14636:0:(client.c:1382:ptlrpc_send_new_req()) Process entered
      > 00000100:00000040:18.0:1403799444.848776:0:14636:0:(lustre_net.h:3250:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc"  req@ffff880fffb10c00 x1471331744844100/t0(0) o35->snx11063-MDT0000-mdc-ffff880835a8a400@10.149.150.4@o2ib4000:23/10 lens 392/15080 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
      ...
      > 00000100:00000200:18.0:1403799444.852847:0:14636:0:(events.c:100:reply_in_callback()) @@@ type 6, status 0  req@ffff880fffb10c00 x1471331744844100/t0(0) o35->snx11063-MDT0000-mdc-ffff880835a8a400@10.149.150.4@o2ib4000:23/10 lens 392/15080 e 0 to 0 dl 1403799539 ref 3 fl Rpc:R/0/ffffffff rc 0/-1
      ...
      > 00000100:00000040:18.0:1403799444.852858:0:14636:0:(lustre_net.h:3250:ptlrpc_rqphase_move()) @@@ move req "Rpc" -> "Unregistering"  req@ffff880fffb10c00 x1471331744844100/t0(0) o35->snx11063-MDT0000-mdc-ffff880835a8a400@10.149.150.4@o2ib4000:23/10 lens 392/15080 e 0 to 0 dl 1403799539 ref 3 fl Rpc:R/0/ffffffff rc 0/-1
      ...
      > 00000100:00000001:18.0:1403799499.847659:0:14636:0:(client.c:1929:ptlrpc_expired_set()) Process entered
      > 00000100:00000001:18.0:1403799499.847661:0:14636:0:(client.c:1965:ptlrpc_expired_set()) Process leaving (rc=1 : 1 : 1)
      ...
      > 00000100:00000040:18.0:1403799499.847669:0:14636:0:(lustre_net.h:3250:ptlrpc_rqphase_move()) @@@ move req "Unregistering" -> "Rpc"  req@ffff880fffb10c00 x1471331744844100/t0(0) o35->snx11063-MDT0000-mdc-ffff880835a8a400@10.149.150.4@o2ib4000:23/10 lens 392/15080 e 0 to 0 dl 1403799539 ref 2 fl Unregistering:R/0/ffffffff rc 0/-1
      ...
      > 00000002:00000001:18.0:1403799444.472256:0:14636:0:(mdc_request.c:922:mdc_close()) Process leaving (rc=0 : 0 : 0)
      
      
      MDS:
      > 00000100:00100000:14.0:1403799444.858578:0:20307:0:(service.c:1886:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_178:0a613492-ed84-267d-d415-d0983ee768af+19:14636:x1471331744844100:12345-253@gni1:35
      > 00000100:00100000:14.0:1403799444.858660:0:20307:0:(service.c:1930:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_178:0a613492-ed84-267d-d415-d0983ee768af+19:14636:x1471331744844100:12345-253@gni1:35 Request procesed in 89us (129us total) trans 193287303580 rc  0/0
      

      When reply_in_callback executes, rq_req_unlink is still set so the rpc is moved to the unregistering phase until request_out_callback has a chance to reset the flag. But when request_out_callback is finally invoked it is with an LNET_EVENT_SEND event:

       00000100:00000200:2:1404844097.216045:0:5109:0:(events.c:68:request_out_callback()) @@@ type 5, status 0  req@ffff8801fb962c00 x1473065104065052/t0(0) o35->snx11014-MDT0000-mdc-ffff88040f1a6000@10.10.100.4@o2ib6000:23/10 lens 392/2536 e 0 to 0 dl 1404844281 ref 3 fl Unregistering:R/0/ffffffff rc 0/-1
      

      /* Note: trace is from a different dump but shows the event type and rpc state. Apologies for mixing data from different dumps but I don't have a single debug log that shows the whole story.) */

      With a send event, request_out_callback does not call ptlrpc_client_wake_req, so the mdc_close task waiting in ptlrpc_set_wait is not woken up, ptlrpc_check_set is not invoked, and mdc_close waits until the rpc times out before continuing.

      void request_out_callback(lnet_event_t *ev)
      {
      ...
              DEBUG_REQ(D_NET, req, "type %d, status %d", ev->type, ev->status);
      ....
              if (ev->unlinked)
                      req->rq_req_unlink = 0;
      
              if (ev->type == LNET_EVENT_UNLINK || ev->status != 0) {
      
                      /* Failed send: make it seem like the reply timed out, just
                       * like failing sends in client.c does currently...  */
      
                      req->rq_net_err = 1;
                      ptlrpc_client_wake_req(req);
              }   
      

      Note, waiting for the rpc timeout does not appear to cause any functional errors but the performance penalty is significant.

      Attachments

        Issue Links

          Activity

            [LU-5341] Intermittent hangs waiting for RPCs in unregistering phase to timeout

            People

              bogl Bob Glossman (Inactive)
              amk Ann Koehler (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: