
[LU-5319] Support multiple slots per client in last_rcvd file


    Description

      While running the mdtest benchmark, I observed that file creation and unlink operations from a single Lustre client quickly saturate at around 8000 IOPS: the maximum is reached with as few as 4 tasks in parallel.
      When using several Lustre mount points on a single client node, the file creation and unlink rates do scale with the number of tasks, up to the 16 cores of my client node.

      Looking at the code, it appears that most metadata operations are serialized by a mutex in the MDC layer.
      In the mdc_reint() routine, request posting is protected by mdc_get_rpc_lock() and mdc_put_rpc_lock(), where the lock is:
      struct client_obd -> struct mdc_rpc_lock *cl_rpc_lock -> struct mutex rpcl_mutex.
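
      For reference, here is a minimal sketch of that serialization point, modeled on mdc_reint(); exact signatures and helpers vary across Lustre versions, so treat it as illustrative rather than the actual code:

      /* Illustrative sketch: every filesystem-modifying MDC request takes
       * the same per-import mutex, so reint RPCs are fully serialized. */
      static int mdc_reint(struct ptlrpc_request *request,
                           struct mdc_rpc_lock *rpc_lock, int level)
      {
              int rc;

              request->rq_send_state = level;

              mdc_get_rpc_lock(rpc_lock, NULL);  /* mutex_lock(&rpc_lock->rpcl_mutex) */
              rc = ptlrpc_queue_wait(request);   /* at most one reint RPC in flight */
              mdc_put_rpc_lock(rpc_lock, NULL);  /* mutex_unlock(&rpc_lock->rpcl_mutex) */

              return rc;
      }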

      After an email discussion with Andreas Dilger, it appears that the limitation is actually on the MDS, since it cannot handle more than a single filesystem-modifying RPC per client at a time. There is only one slot in the MDT last_rcvd file for each client to save the reply state in case the reply is lost.

      The aim of this ticket is to implement multiple slots per client in the last_rcvd file so that several filesystem-modifying RPCs can be handled in parallel.
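
      For context, the per-client record in last_rcvd looks roughly like the following (abridged from struct lsd_client_data in lustre/include/lustre_disk.h; the exact field list varies by version). Note the single (transno, xid, result) triple, i.e. a single reply slot, with one extra dedicated slot for close requests:

      /* Abridged sketch of the on-disk per-client slot in last_rcvd. */
      struct lsd_client_data {
              __u8    lcd_uuid[40];            /* client UUID */
              __u64   lcd_last_transno;        /* last completed transaction */
              __u64   lcd_last_xid;            /* xid of that transaction */
              __u32   lcd_last_result;         /* result of the last RPC */
              __u32   lcd_last_data;           /* per-operation data */
              /* MDS_CLOSE requests use a second, dedicated slot */
              __u64   lcd_last_close_transno;
              __u64   lcd_last_close_xid;
              __u32   lcd_last_close_result;
              __u32   lcd_last_close_data;
              /* ... version-recovery fields and padding omitted ... */
      };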

      The single-client metadata performance should be significantly improved while still ensuring a safe recovery mechanism.

      Attachments

      Issue Links

      Activity
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-14144 [ LU-14144 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-7410 [ LU-7410 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-6864 [ LU-6864 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is duplicated by DELL-242 [ DELL-242 ]
            jgmitter Joseph Gmitter (Inactive) made changes -
            Remote Link Original: This issue links to "Page (HPDD Community Wiki)" [ 15849 ] New: This issue links to "Page (HPDD Community Wiki)" [ 15849 ]
            pichong Gregoire Pichon made changes -
            Link New: This issue is related to LU-7729 [ LU-7729 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to LDEV-37 [ LDEV-37 ]
            pichong Gregoire Pichon made changes -
            Link New: This issue is related to LU-7408 [ LU-7408 ]

            niu Niu Yawei (Inactive) added a comment -

            The multi-slots implementation introduced a regression; see LU-5951.

            To find the unreplied requests by scanning the sending/delayed lists, the current multi-slots implementation moved the xid assignment from the request packing stage to the request sending stage. However, that broke the original mechanism used to coordinate timestamp updates on OST objects, which matters for operations that can arrive out of order, such as setattr, truncate and write.

            To fix this regression, LU-5951 moved the xid assignment back to the request packing stage and introduced an unreplied list to track all unreplied requests. A brief description of the LU-5951 patch follows, starting with a sketch of the packing-time assignment.
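
            The sketch below is modeled on ptlrpc_assign_next_xid() from the patch (simplified; ptlrpc_add_unreplied() is sketched further down):

            /* Simplified: xid assignment and unreplied-list insertion now
             * happen together, at request packing time. */
            static void ptlrpc_assign_next_xid(struct ptlrpc_request *req)
            {
                    struct obd_import *imp = req->rq_import;

                    spin_lock(&imp->imp_lock);
                    req->rq_xid = ptlrpc_next_xid(); /* monotonically increasing */
                    ptlrpc_add_unreplied(req);       /* enters the sorted list */
                    spin_unlock(&imp->imp_lock);
            }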

            obd_import->imp_unreplied_list is introduced to track all unreplied requests. The list is sorted by xid, so the client can derive the known maximal replied xid from the first element in the list.

            obd_import->imp_known_replied_xid is introduced for sanity-check purposes; it is updated along with imp_unreplied_list.
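
            A rough sketch of this bookkeeping, modeled on the ptlrpc_add_unreplied() and ptlrpc_known_replied_xid() helpers from the patch (simplified; in the real code the caller holds imp_lock):

            /* Keep imp_unreplied_list sorted by xid in ascending order. */
            static void ptlrpc_add_unreplied(struct ptlrpc_request *req)
            {
                    struct obd_import *imp = req->rq_import;
                    struct list_head *tmp;
                    struct ptlrpc_request *iter;

                    /* Walk backwards: a new xid is normally the largest,
                     * so the insertion point is found immediately. */
                    list_for_each_prev(tmp, &imp->imp_unreplied_list) {
                            iter = list_entry(tmp, struct ptlrpc_request,
                                              rq_unreplied_list);
                            if (iter->rq_xid < req->rq_xid) {
                                    list_add(&req->rq_unreplied_list, tmp);
                                    return;
                            }
                    }
                    list_add(&req->rq_unreplied_list, &imp->imp_unreplied_list);
            }

            static __u64 ptlrpc_known_replied_xid(struct obd_import *imp)
            {
                    struct ptlrpc_request *req;

                    if (list_empty(&imp->imp_unreplied_list))
                            return 0;

                    /* Every xid below the smallest unreplied xid is replied. */
                    req = list_entry(imp->imp_unreplied_list.next,
                                     struct ptlrpc_request, rq_unreplied_list);
                    return req->rq_xid - 1;
            }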

            Once a request is built, it is inserted into the unreplied list; when the reply is seen by the client, or the request is about to be freed, the request is removed from the list. Two tricky points are worth mentioning here:

            1. Replay requests need to be added back to the unreplied list before sending. Instead of adding them back one by one during replay, we chose to add them all back together before replay starts, which makes strict sanity checking easier and is less bug-prone.

            2. The sanity checks on the server side are strengthened considerably. To satisfy the stricter checks, connect and disconnect requests no longer carry the known replied xid; see the comments in ptlrpc_send_new_req() for details. A sketch of how a regular request advertises that xid follows below.
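
            To illustrate point 2, here is a hypothetical helper showing how a regular (non-connect) request could advertise the known replied xid at send time, so the server can release the reply slots below it; the real logic lives inline in ptlrpc_send_new_req() and differs in detail:

            /* Hypothetical helper, illustrative only. */
            static void ptlrpc_stamp_known_replied_xid(struct ptlrpc_request *req)
            {
                    struct obd_import *imp = req->rq_import;
                    __u64 min_xid;

                    spin_lock(&imp->imp_lock);
                    min_xid = ptlrpc_known_replied_xid(imp);
                    spin_unlock(&imp->imp_lock);

                    /* The server may reclaim reply slots for every xid
                     * at or below min_xid. */
                    lustre_msg_set_last_xid(req->rq_reqmsg, min_xid);
            }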

            jgmitter Joseph Gmitter (Inactive) made changes -
            Link New: This issue is related to JFC-15 [ JFC-15 ]

            People

              bzzz Alex Zhuravlev
              pichong Gregoire Pichon
              Votes: 0
              Watchers: 34
