Lustre / LU-3443

performance impact of mdc_rpc_lock serialization

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0, Lustre 2.1.4
    • Severity: 3
    • Rank: 8584

    Description

      Serialization of in-flight RPCs on mdc_rpc_lock makes non-modifying operations, such as calling open() on a directory, vulnerable to blocking due to a slow backend MDS filesystem. In particular, users may see long delays running ls when LDLM_ENQUEUE requests get blocked behind long-lived metadata-modifying MDS_REINT requests.

      For example, in LU-3442, RPCs involving writing llog records experienced long service times due to a misbehaving backend filesystem. Therefore RPCs for operations like create(), unlink() and rename() would stay in flight for many seconds on the client. Unfortunately, these long-lived in-flight RPCs prevent LDLM_ENQUEUE requests for open() on a directory from being sent, due to the mdc_get_rpc_lock() call in mdc_enqueue(). Once issued, the LDLM_ENQUEUE request completes almost immediately since it doesn't involve synchronous I/O on the backend. It would be desirable if such non-modifying operations could be shielded from the effects of slow synchronous operations.
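The head-of-line blocking described above can be illustrated with a deliberately simplified timing model. This is a sketch, not Lustre code: the function names, the `MAX_SLOTS` bound, and the service times in the usage note are hypothetical, and the model assumes all requests are queued at t = 0 and issued in queue order.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLOTS 64 /* hypothetical upper bound for this model only */

/* Completion time (ms) of request idx when requests are sent strictly
 * one at a time, in queue order, as under a single serializing lock. */
long finish_serialized(const long *service_ms, size_t n, size_t idx)
{
    long t = 0;

    assert(idx < n);
    for (size_t i = 0; i <= idx; i++)
        t += service_ms[i];
    return t;
}

/* Completion time (ms) of request idx when up to `slots` requests may be
 * in flight at once; each request starts as soon as any slot frees up. */
long finish_with_slots(const long *service_ms, size_t n, size_t idx,
                       size_t slots)
{
    long free_at[MAX_SLOTS] = {0}; /* time at which each slot is next free */

    assert(idx < n && slots >= 1 && slots <= MAX_SLOTS);
    for (size_t i = 0; i <= idx; i++) {
        size_t best = 0;

        /* pick the slot that becomes free earliest */
        for (size_t s = 1; s < slots; s++)
            if (free_at[s] < free_at[best])
                best = s;
        free_at[best] += service_ms[i]; /* start when free, run to completion */
        if (i == idx)
            return free_at[best];
    }
    return -1; /* not reached */
}
```

With three slow modifying requests of 5000 ms each queued ahead of a 1 ms enqueue, the serialized model completes the enqueue at 15001 ms, while a model with 8 slots completes it at 1 ms.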

      To that end, it would be helpful to clarify what the mdc_rpc_lock is protecting. mdc_enqueue() has this to say:

      /* It is important to obtain rpc_lock first (if applicable), so that
       * threads that are serialised with rpc_lock are not polluting our
       * rpcs in flight counter. We do not do flock request limiting, though*/
      if (it) {
              mdc_get_rpc_lock(obddev->u.cli.cl_rpc_lock, it);
              rc = mdc_enter_request(&obddev->u.cli);

      but it's not clear to me what is meant by "polluting", and why the counter can't be protected by a separate lock that need not be held across the entire network request.
      I also observe that the in-flight RPC counter for an OBD import rarely exceeds 1 or 2, and never approaches the upper limit of 8. So it seems we are not doing a good job of keeping a full pipeline of in-flight RPCs.
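To make the "separate lock" idea concrete, here is a minimal user-space sketch of an in-flight counter guarded by its own mutex, which is held only while the counter is updated and never across the request itself. It uses POSIX threads rather than kernel primitives, and `struct rpc_slots` and the `rpc_slots_*` functions are hypothetical names, not part of the Lustre code base. It does not address whatever ordering guarantees mdc_rpc_lock may be providing for modifying requests.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical in-flight limiter; not the Lustre implementation. */
struct rpc_slots {
    pthread_mutex_t lock;      /* protects in_flight only */
    pthread_cond_t  slot_free; /* signalled when a slot is released */
    int in_flight;
    int max_in_flight;
};

void rpc_slots_init(struct rpc_slots *s, int max_in_flight)
{
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->slot_free, NULL);
    s->in_flight = 0;
    s->max_in_flight = max_in_flight;
}

/* Take a slot without blocking; returns false if the pipeline is full. */
bool rpc_slots_try_enter(struct rpc_slots *s)
{
    bool ok = false;

    pthread_mutex_lock(&s->lock);
    if (s->in_flight < s->max_in_flight) {
        s->in_flight++;
        ok = true;
    }
    pthread_mutex_unlock(&s->lock);
    return ok;
}

/* Block until a slot is available, then take it. The mutex is dropped
 * before the caller sends its request. */
void rpc_slots_enter(struct rpc_slots *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->in_flight >= s->max_in_flight)
        pthread_cond_wait(&s->slot_free, &s->lock);
    s->in_flight++;
    pthread_mutex_unlock(&s->lock);
}

/* Release a slot once the reply has been received. */
void rpc_slots_exit(struct rpc_slots *s)
{
    pthread_mutex_lock(&s->lock);
    assert(s->in_flight > 0);
    s->in_flight--;
    pthread_cond_signal(&s->slot_free);
    pthread_mutex_unlock(&s->lock);
}
```

A caller would bracket each request with `rpc_slots_enter()` and `rpc_slots_exit()`, so a slow request occupies one slot but does not prevent other threads from taking the remaining ones.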

      LLNL-bug-id: TOSS-2084

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: nedbass Ned Bass (Inactive)