Lustre / LU-5116

Race between resend and reply processing

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.6.0, Lustre 2.5.2
    • Lustre 2.4.1, Lustre 2.5.0, Lustre 2.6.0
    • 3
    • 14104

    Description

      The server evicts the client after receiving an invalid request:

      00000100:00100000:9.0:1400505646.197736:0:83755:0:(service.c:1734:ptlrpc_server_handle_req_in()) got req x1468534034672908
      00000100:00020000:9.0:1400505646.197738:0:83755:0:(service.c:975:ptlrpc_check_req()) @@@ Invalid replay without recovery  req@ffff88079c2b0850 x1468534034672908/t0(88947828) o4->7f3cf026-15bd-c61a-088c-a943e5bce2bf@335@gni1:0/0 lens 488/0 e 0 to 0 dl 0 ref 1 fl New:/6/ffffffff rc 0/-1
      00000020:00080000:9.0:1400505646.221792:0:83755:0:(genops.c:1391:class_fail_export()) disconnecting export ffff8805acac6400/7f3cf026-15bd-c61a-088c-a943e5bce2bf
      00000020:00000080:10.0:1400505646.221811:0:83755:0:(genops.c:1229:class_disconnect()) disconnect: cookie 0xa74fa39ba3a7cd61
      00000020:00010000:10.0:1400505646.221817:0:83755:0:(genops.c:1746:obd_stale_export_put()) Put export ffff8805acac6400: total 1
      00000100:00080000:10.0:1400505646.221820:0:83755:0:(import.c:1502:ptlrpc_cleanup_imp()) ffff88054809a800 ^W: changing import state from FULL to CLOSED
      

      On the client side we can see a race:

      00000100:00080000:22.0:1400505646.246037:0:19252:0:(client.c:2487:ptlrpc_resend_req()) @@@ going to resend  req@ffff880ffea86000 x1468534034670388/t88947827(88947827) o4->snx11063-OST0050-osc-ffff881039a22400@10.149.150.25@o2ib4008:6/4 lens 488/416 e 2 to 0 dl 1400505782 ref 2 fl Interpret:R/4/0 rc 0/0
      

      The client is about to resend the request, but the request already has the req->rq_replied flag set (Interpret:R) and the MSG_REPLAY flag set in req->rq_reqmsg (/4).

      There was a disconnect/reconnect on the client side (LNet error), and no recovery happened.

      The race is between ptlrpc_check_set() and the reconnect path that calls ptlrpc_resend_req(). In ptlrpc_check_set(), after after_reply() has run but before the following code executes, the request is still on imp->imp_sending_list and already has the MSG_REPLAY flag set:

                      if (!cfs_list_empty(&req->rq_list)) {
                              cfs_list_del_init(&req->rq_list);
                              cfs_atomic_dec(&imp->imp_inflight);                    
                      }
      

      The reconnect code walks this list to resend requests. So it can happen that a request receives its reply and after_reply() processes it and sets MSG_REPLAY, but ptlrpc_resend_req() then sets the rq_resend flag and the request is resent. When such a request, still carrying the MSG_REPLAY flag, reaches the server, it causes the client to be evicted.
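
      The interleaving described above can be replayed in a small, single-threaded model. This is only a sketch under stated assumptions: every name prefixed with sim_ is hypothetical and stands in for the real ptlrpc structures and functions, the flag value 0x4 is taken from the "/4" field in the client log, and the "guard" option shows one possible mitigation (skipping requests that already have a reply), not necessarily the change that resolved this issue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical value, chosen to match the "/4" field in the client log. */
#define SIM_MSG_REPLAY 0x4u

/* Simplified stand-in for a request; not the real struct ptlrpc_request. */
struct sim_req {
    bool on_sending_list;   /* still linked on the import's sending list */
    bool replied;           /* models req->rq_replied */
    bool resend;            /* models req->rq_resend */
    unsigned int msg_flags; /* models the flags carried in the request message */
};

/* Models the reply-processing step that marks the request for replay. */
static void sim_after_reply(struct sim_req *r)
{
    assert(r != NULL);
    r->msg_flags |= SIM_MSG_REPLAY;
}

/* Models the later step that unlinks the request from the sending list. */
static void sim_unlink(struct sim_req *r)
{
    assert(r != NULL);
    r->on_sending_list = false;
}

/*
 * Models the reconnect path visiting one request on the sending list.
 * With guard == true, a request that already has a reply is skipped.
 * Returns true if the request was marked for resend.
 */
static bool sim_reconnect_resend(struct sim_req *r, bool guard)
{
    assert(r != NULL);
    if (!r->on_sending_list)
        return false;
    if (guard && r->replied)
        return false;
    r->resend = true;
    return true;
}

/*
 * Replays the problematic ordering: reply processed, reconnect runs,
 * and only then is the request unlinked. Returns true if a request
 * carrying SIM_MSG_REPLAY ends up marked for resend.
 */
static bool sim_race(bool guard)
{
    struct sim_req r = { .on_sending_list = true, .replied = true };

    sim_after_reply(&r);                            /* reply path, step 1 */
    bool resent = sim_reconnect_resend(&r, guard);  /* reconnect interleaves */
    sim_unlink(&r);                                 /* reply path, step 2 */

    return resent && (r.msg_flags & SIM_MSG_REPLAY) != 0;
}
```

      In this model, sim_race(false) reports that a request flagged for replay gets resent, which corresponds to the request the server rejects. sim_race(true) reports that the guarded reconnect path leaves the request for the reply path to finish. The real code is concurrent, so any actual guard would also need to be checked under the appropriate lock.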

            People

              wc-triage WC Triage
              aboyko Alexander Boyko