
Quiesce client mountpoints from the server

Details

    • New Feature
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.4.3
    • 15964

    Description

      To minimize user disruption, NASA performs some system maintenance "live". Typical maintenance includes adding new compute nodes or reconfiguring the IB fabric. During such times, user jobs are suspended via PBS. Although suspending user jobs minimizes Lustre usage, it does not stop all Lustre client/server activity. Therefore NASA requires:
      1. A mechanism to halt and block all Lustre client I/O.
      2. A way to halt client/server keep-alive pings and all other network traffic.
      3. Clients should be able to recover after the quiesce without eviction.

      Attachments

        1. Dynamic Congestion Control - Qian.docx
          482 kB
          Andreas Dilger
        2. SC09-Simplified-Interop.pdf
          366 kB
          Andreas Dilger

        Issue Links

          Activity

            [LU-5703] Quiesce client mountpoints from the server
            mrasobarnett Matt Rásó-Barnett made changes -
            Remote Link New: This issue links to "Page (Whamcloud Community Wiki)" [ 40839 ]

            adilger Andreas Dilger added a comment - Related to Patrick's recent comment, I've attached Dynamic Congestion Control - Qian.docx, which is a paper that Yingjin worked on long ago to have the servers dynamically manage the client max_rpcs_in_flight, in a manner similar to grants (i.e. a constant flow of "RPC credits" out to clients that need them, withdrawn from other clients that are not active). In addition to "stop all client RPCs", this could also be used to temporarily boost performance for busy clients.
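
            Editorial sketch (not from the attached paper or from Lustre code): the "RPC credits" idea above can be illustrated with a toy allocator that splits a fixed credit pool across clients in proportion to their demand, gives nothing to idle clients, and treats a pool of zero as a quiesce. The function name allocate_credits and the proportional policy are assumptions made only for illustration.

```python
def allocate_credits(total, demand):
    """Split `total` RPC credits across clients in proportion to demand.

    `demand` maps a (hypothetical) client name to the number of RPCs it
    would like in flight. Idle clients (demand 0) receive no credits, and
    total == 0 models a server-initiated quiesce.
    """
    grants = {client: 0 for client in demand}
    active = {c: d for c, d in demand.items() if d > 0}
    if total <= 0 or not active:
        return grants

    total_demand = sum(active.values())
    remaining = total
    # First pass: proportional share, never more than the client asked for.
    for client, wanted in sorted(active.items()):
        share = min(wanted, total * wanted // total_demand)
        grants[client] = share
        remaining -= share

    # Second pass: hand out credits left over from rounding, one at a time,
    # to clients that still want more.
    while remaining > 0:
        hungry = [c for c in sorted(active) if grants[c] < active[c]]
        if not hungry:
            break
        for client in hungry:
            if remaining == 0:
                break
            grants[client] += 1
            remaining -= 1
    return grants


if __name__ == "__main__":
    print(allocate_credits(8, {"a": 6, "b": 2, "c": 0}))
    print(allocate_credits(0, {"a": 6, "b": 2, "c": 0}))
```

            Raising the pool size for a short time would correspond to the "temporarily boost performance" case; a real design would also need to handle credit revocation for RPCs already in flight, which this sketch ignores.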
            adilger Andreas Dilger made changes -
            Attachment New: Dynamic Congestion Control - Qian.docx [ 57069 ]

            paf0186 Patrick Farrell added a comment - I think this could probably be done fairly easily with some tweaks to the RPC code to allow the server to temporarily set max_rpcs_in_flight for data and metadata to zero.  It wouldn't guarantee total silence - that would require unmounting - but it might be useful if folks didn't mind taking an additional performance stoppage.  Interesting.
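
            Editorial sketch (a toy model, not Lustre's RPC code): the effect of temporarily setting an in-flight limit to zero can be illustrated with a small queue-based throttle. New requests are queued rather than sent while the limit is zero, and they drain once the limit is restored. The class RpcThrottle and its methods are hypothetical names used only for illustration.

```python
class RpcThrottle:
    """Toy client-side limiter on the number of RPCs in flight."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.queued = []

    def _dispatch(self):
        # Send queued RPCs while there is room under the current limit.
        sent = []
        while self.queued and self.in_flight < self.max_in_flight:
            sent.append(self.queued.pop(0))
            self.in_flight += 1
        return sent

    def submit(self, rpc):
        """Queue an RPC and return the list of RPCs sent as a result."""
        self.queued.append(rpc)
        return self._dispatch()

    def complete(self):
        """Mark one in-flight RPC as finished, then try to send more."""
        if self.in_flight > 0:
            self.in_flight -= 1
        return self._dispatch()

    def set_limit(self, limit):
        """Change the limit; 0 blocks all new RPCs until it is raised."""
        self.max_in_flight = limit
        return self._dispatch()


if __name__ == "__main__":
    throttle = RpcThrottle(2)
    print(throttle.submit("read-1"))   # sent immediately
    print(throttle.set_limit(0))       # quiesce: nothing new is sent
    print(throttle.submit("write-1"))  # queued, not sent
    throttle.complete()                # read-1 finishes; still quiesced
    print(throttle.set_limit(2))       # resume: write-1 is sent
```

            As the comment notes, this kind of limit would not give total silence: RPCs already in flight still complete, and traffic outside the limiter (such as keep-alive pings) is not covered by this model.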
            adilger Andreas Dilger made changes -
            Labels New: medium
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-15250 [ LU-15250 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-13521 [ LU-13521 ]
            adilger Andreas Dilger made changes -
            Attachment New: SC09-Simplified-Interop.pdf [ 34026 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-13010 [ LU-13010 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-3290 [ LU-3290 ]

            People

              pjones Peter Jones
              mhanafi Mahmoud Hanafi
              Votes: 1
              Watchers: 16
