
Support multiple slots per client in last_rcvd file

Details

    • 14856

    Description

      While running the mdtest benchmark, I have observed that file creation and unlink operations from a single Lustre client quickly saturate at around 8000 IOPS: the maximum is reached with as few as 4 parallel tasks.
      When using several Lustre mount points on a single client node, the file creation and unlink rates do scale with the number of tasks, up to the 16 cores of my client node.

      Looking at the code, it appears that most metadata operations are serialized by a mutex in the MDC layer.
      In the mdc_reint() routine, request posting is protected by mdc_get_rpc_lock() and mdc_put_rpc_lock(), where the lock is:
      struct client_obd -> struct mdc_rpc_lock *cl_rpc_lock -> struct mutex rpcl_mutex.

      After an email discussion with Andreas Dilger, it appears that the limitation is actually on the MDS, since it cannot handle more than a single filesystem-modifying RPC at one time. There is only one slot in the MDT last_rcvd file for each client to save the state for the reply in case it is lost.

      The aim of this ticket is to implement multiple slots per client in the last_rcvd file so that several filesystem-modifying RPCs can be handled in parallel.

      The single-client metadata performance should be significantly improved while still ensuring a safe recovery mechanism.

      Attachments

        Issue Links

          Activity

            [LU-5319] Support multiple slots per client in last_rcvd file

            The multi-slots implementation introduced a regression, see LU-5951.

            To find the unreplied requests by scanning the sending/delayed lists, the current multi-slots implementation moved xid assignment from the request packing stage to the request sending stage. However, that breaks the original mechanism used to coordinate timestamp updates on OST objects (needed for some out-of-order operations, such as setattr, truncate and write).

            To fix this regression, LU-5951 moved xid assignment back to the request packing stage and introduced an unreplied list to track all the unreplied requests. The following is a brief description of the LU-5951 patch:

            obd_import->imp_unreplied_list is introduced to track all the unreplied requests. All requests in the list are sorted by xid, so the client can obtain the maximal known replied xid by checking the first element in the list.

            obd_import->imp_known_replied_xid is introduced for sanity-check purposes; it is updated along with imp_unreplied_list.

            Once a request is built, it is inserted into the unreplied list; when the reply is seen by the client, or the request is about to be freed, the request is removed from the list. Two tricky points are worth mentioning here:

            1. Replay requests need to be added back to the unreplied list before sending. Instead of adding them back one by one during replay, we choose to add them all back together before replay starts; that makes strict sanity checking easier and is less bug-prone.

            2. The sanity check on the server side is strengthened considerably. To satisfy the stricter check, connect and disconnect requests no longer carry the known replied xid; see the comments in ptlrpc_send_new_req() for details.

            niu Niu Yawei (Inactive) added a comment -
            pjones Peter Jones added a comment -

            Landed for 2.8


            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14861/
            Subject: LU-5319 tests: testcases for multiple modify RPCs feature
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c2d27a0f12688c0d029880919f8b002e557b540c

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14862/
            Subject: LU-5319 utils: update lr_reader to display additional data
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 6460ae59bf6e5175797dc66ecbe560eebc8b6333

            gerrit Gerrit Updater added a comment -

            Here is version 2 of the test plan document, updated with test results.

            pichong Gregoire Pichon added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14860/
            Subject: LU-5319 mdt: support multiple modify RCPs in parallel
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5fc7aa3687daca5c14b0e479c58146e0987daf7f

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14374/
            Subject: LU-5319 mdc: manage number of modify RPCs in flight
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1fc013f90175d1e50d7a22b404ad6abd31a43e38

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14793/
            Subject: LU-5319 ptlrpc: embed highest XID in each request
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: bf3e7f67cb33f3b4e0590ef8af3843ac53d0a4e8

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14153/
            Subject: LU-5319 mdc: add max modify RPCs in flight variable
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 60c05ea9f66f9bd3f5fd35942a12edb1e311c455

            gerrit Gerrit Updater added a comment -

            Grégoire, it looks like your patches are regularly causing conf-sanity test_32[abd] to fail or time out. This is not the case with other patches tested recently, so it looks like there is a regression in your patches 14860 (test failure) and 14861 (test timeout).

            One failure in 14860 is:

            LustreError: 14561:0:(class_obd.c:684:cleanup_obdclass()) obd_memory max: 53634235, leaked: 152
            shadow-10vm8: 
            shadow-10vm8: Memory leaks detected
            

            Could you please investigate.

            adilger Andreas Dilger added a comment -

            People

              bzzz Alex Zhuravlev
              pichong Gregoire Pichon
              Votes: 0
              Watchers: 34

              Dates

                Created:
                Updated:
                Resolved: