Lustre / LU-2731

Speed up the run time of "stop_services" function in lustre init script

Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • Lustre 2.4.0
    • Lustre 2.4.0, Lustre 2.1.4
    • None
    • 6635

    Description

      The "stop_services" function in the lustre init script can be sped up by tearing down the services in parallel instead of one at a time. This can greatly reduce the time it takes to bring down a Lustre server. I wrote a patch to demonstrate this and tested it on a Lustre 2.1-based OSS with up to 32 ldiskfs OSTs.

      First the patch to the init script:

      diff --git i/lustre/scripts/lustre w/lustre/scripts/lustre                      
      index b97951f..eeb5941 100644                                                   
      --- i/lustre/scripts/lustre                                                     
      +++ w/lustre/scripts/lustre                                                     
      @@ -498,6 +498,7 @@ stop_services ()                                            
       {                                                                              
              local labels=$*                                                         
              local result=0                                                          
      +       local pids=""                                                           
              local dir dev label                                                     
                                                                                     
              for label in $labels; do                                                
      @@ -512,9 +513,22 @@ stop_services ()                                           
                              # no error                                              
                              continue                                                
                      fi                                                              
      +                                                                               
                      echo "Unmounting $dir"                                          
      -               umount $dir || result=2                                         
      +               umount $dir &                                                   
      +                                                                               
      +               if [ -z "$pids" ]; then                                         
      +                       pids="$!"                                               
      +               else                                                            
      +                       pids="$pids $!"                                         
      +               fi                                                              
              done                                                                    
      +                                                                               
      +       # wait for all umount processes to complete, report any errors          
      +       for pid in $pids; do                                                    
      +               wait $pid || result=2                                           
      +       done                                                                    
      +                                                                               
              # double check!                                                         
              for label in $labels; do                                                
                      if mountpt_is_active $label; then                               
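
The pattern the patch relies on (start each job in the background, remember its PID from `$!`, then `wait` on every PID and fold the exit statuses into one result) can be tried without a Lustre server. The following is only a minimal sketch of that pattern, not part of the patch: `run_parallel` is a hypothetical helper name, and `sleep`, `true`, and `false` stand in for the real `umount $dir` calls.

```shell
#!/bin/bash
# Hypothetical helper illustrating the background-and-wait pattern.
# Each argument is a command string; in the init script this would be
# "umount $dir" for one mounted service.
run_parallel() {
    local result=0
    local pids=""
    local cmd pid

    for cmd in "$@"; do
        sh -c "$cmd" &          # start the job without blocking
        pids="$pids $!"         # $! is the PID of the job just started
    done

    # wait returns the exit status of the given PID, so a failure in any
    # job is still reported, as with the original "umount ... || result=2"
    for pid in $pids; do
        wait "$pid" || result=2
    done

    return $result
}

# Three one-second jobs run concurrently instead of back to back.
run_parallel "sleep 1" "sleep 1" "sleep 1"
echo "exit status: $?"
```

Running the jobs concurrently means the total time is governed by the slowest job rather than the sum of all jobs, which is consistent with the nearly flat "stop" timings reported for the patched script.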
      

      The testing I performed shows this patch considerably improves the time it takes to "stop" the OSS. The raw numbers are below.

      Here is the timing information I gathered using the time command when running /etc/init.d/lustre start and /etc/init.d/lustre stop:

      $ time /etc/init.d/lustre start # (w/o patch)
      
      +-----------+------------+-----------+-----------+                              
      | # of OSTs |    real    |    user   |    sys    |                              
      +-----------+------------+-----------+-----------+                              
      |      1    | 0m  2.184s | 0m 0.162s | 0m 0.077s |                              
      |      2    | 0m  4.285s | 0m 0.281s | 0m 0.148s |                              
      |      4    | 0m  8.508s | 0m 0.500s | 0m 0.302s |                              
      |      8    | 0m 16.961s | 0m 1.017s | 0m 0.568s |                              
      |     16    | 0m 33.884s | 0m 1.964s | 0m 1.176s |                              
      |     32    | 1m  7.744s | 0m 3.986s | 0m 2.280s |                              
      +-----------+------------+-----------+-----------+                              
      
      $ time /etc/init.d/lustre stop # (w/o patch)
      
      +-----------+------------+-----------+-----------+                              
      | # of OSTs |    real    |    user   |    sys    |                              
      +-----------+------------+-----------+-----------+                              
      |     1     | 0m  4.758s | 0m 0.072s | 0m 0.030s |                              
      |     2     | 0m  9.018s | 0m 0.118s | 0m 0.049s |                              
      |     4     | 0m 18.813s | 0m 0.185s | 0m 0.083s |                              
      |     8     | 0m 37.586s | 0m 0.337s | 0m 0.141s |                              
      |    16     | 1m 16.092s | 0m 0.597s | 0m 0.263s |                              
      |    32     | 2m 37.550s | 0m 1.181s | 0m 0.403s |                              
      +-----------+------------+-----------+-----------+                              
      

      Here is the timing information gathered the same way as above, but with my patch applied (all else being equal):

      $ time /etc/init.d/lustre start # (w/ patch)
      
      +-----------+------------+-----------+-----------+                              
      | # of OSTs |    real    |    user   |    sys    |                              
      +-----------+------------+-----------+-----------+                              
      |      1    | 0m  2.183s | 0m 0.158s | 0m 0.083s |                              
      |      2    | 0m  4.282s | 0m 0.274s | 0m 0.153s |                              
      |      4    | 0m  8.519s | 0m 0.510s | 0m 0.303s |                              
      |      8    | 0m 16.966s | 0m 1.019s | 0m 0.583s |                              
      |     16    | 0m 33.878s | 0m 1.984s | 0m 1.154s |                              
      |     32    | 1m  7.745s | 0m 3.944s | 0m 2.322s |                              
      +-----------+------------+-----------+-----------+                              
      
      $ time /etc/init.d/lustre stop # (w/ patch)
      
      +-----------+------------+-----------+-----------+                              
      | # of OSTs |    real    |    user   |    sys    |                              
      +-----------+------------+-----------+-----------+                              
      |      1    | 0m  4.566s | 0m 0.075s | 0m 0.023s |                              
      |      2    | 0m  4.857s | 0m 0.105s | 0m 0.070s |                              
      |      4    | 0m  4.777s | 0m 0.175s | 0m 0.064s |                              
      |      8    | 0m  5.449s | 0m 0.323s | 0m 0.153s |                              
      |     16    | 0m  5.862s | 0m 0.606s | 0m 0.208s |                              
      |     32    | 0m  6.307s | 0m 1.183s | 0m 0.811s |                              
      +-----------+------------+-----------+-----------+                              
      

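      This is a drastic improvement in the time it takes for /etc/init.d/lustre stop to complete as the number of OSTs increases: with 32 OSTs, the stop time drops from 2m 37.550s to 0m 6.307s (roughly 25 times faster), and it grows only slightly with the OST count instead of linearly. The start times are unchanged, as expected, since the patch only touches stop_services.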
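
To make the comparison concrete, the "real" columns of the two stop tables can be turned into per-row speedup factors. The snippet below is a small sketch for that arithmetic only; `to_seconds` and `speedup` are hypothetical helper names, and the input rows are copied from the tables above with the padding spaces removed.

```shell
#!/bin/bash
# Convert a time(1)-style value such as "2m37.550s" into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of unpatched to patched wall-clock time, to two decimals.
speedup() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.2f\n", before / after }'
}

# Columns: number of OSTs, real time w/o patch, real time w/ patch.
while read -r osts before after; do
    b=$(to_seconds "$before")
    a=$(to_seconds "$after")
    echo "$osts OSTs: ${b}s -> ${a}s, speedup $(speedup "$b" "$a")x"
done <<'EOF'
1 0m4.758s 0m4.566s
2 0m9.018s 0m4.857s
4 0m18.813s 0m4.777s
8 0m37.586s 0m5.449s
16 1m16.092s 0m5.862s
32 2m37.550s 0m6.307s
EOF
```

The factor approaches the number of OSTs only loosely, because the patched run still pays the cost of the slowest single unmount plus the fixed overhead of the script.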

          People

            emoly.liu Emoly Liu
            prakash Prakash Surya (Inactive)