Lustre / LU-15190

ptlrpc_server_check_resend_in_progress() can miss duplicate RPC


Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.15.0
    • Upstream
    • None
    • 3

    Description

      ptlrpc_server_check_resend_in_progress() has the following check at the beginning:

              if (!(lustre_msg_get_flags(req->rq_reqmsg) & MSG_RESENT) ||
                  (atomic_read(&req->rq_export->exp_rpc_count) == 0))
                      return NULL;
      

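      I think this can let duplicate RPCs through: when no RPC from the export is in progress at the moment (exp_rpc_count == 0), e.g. under high load with a deep incoming queue, the function returns NULL without looking for the original request.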
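
A minimal sketch of the suspected failure mode, as a user-space toy model rather than the actual ptlrpc code. All names here (toy_req, toy_export, TOY_MSG_RESENT, the toy_* functions) are hypothetical, and the model assumes the original request is still waiting in the queue while rpc_count is 0. The variant without the shortcut is included only for contrast; it is not meant to describe the actual fix.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define TOY_MSG_RESENT 0x2	/* stand-in flag for this sketch only */

struct toy_req {
	uint64_t xid;
	uint32_t flags;
};

struct toy_export {
	int rpc_count;			/* requests being handled right now */
	const struct toy_req *queued;	/* requests still waiting in the queue */
	size_t nr_queued;
};

/* Linear scan of the queue for an earlier request with the same XID. */
static const struct toy_req *
toy_find_queued(const struct toy_export *exp, uint64_t xid)
{
	for (size_t i = 0; i < exp->nr_queued; i++)
		if (exp->queued[i].xid == xid)
			return &exp->queued[i];
	return NULL;
}

/* Same shape as the check quoted above: return early when the request
 * is not a resend OR when nothing from this export is being handled. */
static const struct toy_req *
toy_check_with_shortcut(const struct toy_export *exp,
			const struct toy_req *req)
{
	assert(exp != NULL && req != NULL);
	if (!(req->flags & TOY_MSG_RESENT) || exp->rpc_count == 0)
		return NULL;
	return toy_find_queued(exp, req->xid);
}

/* Hypothetical variant for contrast: only the resend flag is tested. */
static const struct toy_req *
toy_check_without_shortcut(const struct toy_export *exp,
			   const struct toy_req *req)
{
	assert(exp != NULL && req != NULL);
	if (!(req->flags & TOY_MSG_RESENT))
		return NULL;
	return toy_find_queued(exp, req->xid);
}
```

In this model, a resend whose original is queued but not yet being handled is reported as "no duplicate" by toy_check_with_shortcut(), while the variant without the rpc_count test finds it.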

      There is a crash dump supporting this theory. In that dump I was able to find lots of duplicates (up to 14). For example:

      crash> p *(struct ptlrpc_request *)(0xffff9887c6c75ee0-0x60)
        rq_reqmsg = 0xffff9887c7f34000, 
        rq_xid = 1709012909603712, 
        rq_export = 0xffff988805ee5400, 
        rq_peer = {
          nid = 1407418002966021, 
      
      crash> p *(struct ptlrpc_request *)(0xffff987eb906a8e0-0x60)
        rq_reqmsg = 0xffff987e7145c148, 
        rq_xid = 1709012909603712, 
        rq_export = 0xffff988805ee5400, 
        rq_peer = {
          nid = 1407418002966021, 
      
      crash> ptlrpc_request_dump (0xffff98745d3a5a60-0x60)
      req: 0xffff9875002d6520, xid: 2531069376, opc: 103, flags: 2, buf2: 0xffff9875002d6600/104
      crash> ptlrpc_request_dump (0xffff98771dd06360-0x60)
      req: 0xffff9884e2218520, xid: 2531069376, opc: 103, flags: 2, buf2: 0xffff9884e2218600/104
      crash> ptlrpc_request_dump (0xffff9878a80c3ae0-0x60)
      req: 0xffff9878ae8ae148, xid: 2531069376, opc: 103, flags: 2, buf2: 0xffff9878ae8ae228/104
      crash> ptlrpc_request_dump (0xffff98789c049b60-0x60)
      req: 0xffff98789c7403d8, xid: 2531069376, opc: 103, flags: 2, buf2: 0xffff98789c7404b8/104
      
      crash> p ((struct ldlm_request *)0xffff9875002d6600)->lock_handle
          cookie = 13969718594132579448
      crash> p ((struct ldlm_request *)0xffff9884e2218600)->lock_handle
          cookie = 13969718594132579448
      crash> p ((struct ldlm_request *)0xffff9878ae8ae228)->lock_handle
          cookie = 13969718594132579448
      crash> p ((struct ldlm_request *)0xffff98789c7404b8)->lock_handle
          cookie = 13969718594132579448
      

      Note the same XID and the same lock handle.

      I dumped all RPCs from the export's HP list and checked the XIDs:

      $ cat xid-sorted-list.txt | wc -l
      877858
      $ cat xid-sorted-list.txt | uniq |wc -l
      213480
      

      i.e. about 3/4 of all RPCs (664378 of 877858) were duplicates.
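
The arithmetic behind that ratio can be reproduced with a short sketch. The helper names are hypothetical, and the sketch assumes the XID list is sorted (as the file name suggests), since `uniq` only collapses adjacent equal lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count distinct values in an array sorted by XID, i.e. what
 * `uniq | wc -l` reports on sorted input. */
static size_t count_unique_sorted(const uint64_t *xids, size_t n)
{
	size_t unique = 0;

	assert(xids != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (i == 0 || xids[i] != xids[i - 1])
			unique++;
	return unique;
}

/* Share of entries that repeat an XID already seen earlier in the list. */
static double duplicate_fraction(size_t total, size_t unique)
{
	assert(unique <= total);
	return total ? (double)(total - unique) / (double)total : 0.0;
}
```

With the two counts above, duplicate_fraction(877858, 213480) lands between 0.75 and 0.76.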

      Given that ptlrpc_server_check_resend_in_progress() checks for duplicates with a linear scan under a single spinlock, each check takes a long time, and many CPUs were spinning for seconds.
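
To illustrate why a linear scan per check becomes expensive on a long list, here is a single-threaded toy model (hypothetical names, not the ptlrpc implementation). It assumes the worst case where every incoming request triggers a scan and no match is found. In the real code the scan runs under a spinlock, so other CPUs would additionally wait while it proceeds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One duplicate check: walk the list until a matching XID is found.
 * Returns the number of entries compared. */
static size_t toy_scan(const uint64_t *list, size_t len, uint64_t xid,
		       int *found)
{
	size_t compared = 0;

	assert(found != NULL);
	*found = 0;
	for (size_t i = 0; i < len; i++) {
		compared++;
		if (list[i] == xid) {
			*found = 1;
			break;
		}
	}
	return compared;
}

/* Queue nr requests with distinct XIDs, checking each new one against
 * all earlier ones. The caller provides room for nr entries in list. */
static uint64_t toy_total_comparisons(uint64_t *list, size_t nr)
{
	uint64_t total = 0;

	for (size_t i = 0; i < nr; i++) {
		int found;

		total += toy_scan(list, i, (uint64_t)i + 1, &found);
		assert(!found);		/* XIDs are distinct by construction */
		list[i] = (uint64_t)i + 1;
	}
	return total;
}
```

In this model the total work grows as n(n-1)/2 for n queued requests, so a list with hundreds of thousands of entries makes every additional check costly.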

            People

              bzzz Alex Zhuravlev