[LU-2731] Speed up the run time of "stop_services" function in lustre init script Created: 31/Jan/13  Updated: 22/Mar/13  Resolved: 22/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.4
Fix Version/s: Lustre 2.4.0

Type: Improvement Priority: Minor
Reporter: Prakash Surya (Inactive) Assignee: Emoly Liu
Resolution: Fixed Votes: 0
Labels: None

Rank (Obsolete): 6635

 Description   

The "stop_services" function in the lustre init script can be sped up if it parallelizes the teardown of each service. This can have a huge positive performance impact on the time it takes to bring down a Lustre server. I wrote a patch to demonstrate this, and tested it on a Lustre 2.1 based OSS with up to 32 ldiskfs OSTs.

First, the patch to the init script:

diff --git i/lustre/scripts/lustre w/lustre/scripts/lustre                      
index b97951f..eeb5941 100644                                                   
--- i/lustre/scripts/lustre                                                     
+++ w/lustre/scripts/lustre                                                     
@@ -498,6 +498,7 @@ stop_services ()                                            
 {                                                                              
        local labels=$*                                                         
        local result=0                                                          
+       local pids=""                                                           
        local dir dev label                                                     
                                                                               
        for label in $labels; do                                                
@@ -512,9 +513,22 @@ stop_services ()                                           
                        # no error                                              
                        continue                                                
                fi                                                              
+                                                                               
                echo "Unmounting $dir"                                          
-               umount $dir || result=2                                         
+               umount $dir &                                                   
+                                                                               
+               if [ -z "$pids" ]; then                                         
+                       pids="$!"                                               
+               else                                                            
+                       pids="$pids $!"                                         
+               fi                                                              
        done                                                                    
+                                                                               
+       # wait for all umount processes to complete, report any errors          
+       for pid in $pids; do                                                    
+               wait $pid || result=2                                           
+       done                                                                    
+                                                                               
        # double check!                                                         
        for label in $labels; do                                                
                if mountpt_is_active $label; then                               
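
For reference, here is the same pattern pulled out into a minimal standalone sketch. The unmount_all helper and the /mnt/lustre/ost* mount points are made up for illustration; only the background-umount-plus-wait structure mirrors the patch:

#!/bin/sh
# Sketch of the parallel-unmount pattern from the patch above. Each
# umount runs in the background; the per-PID wait loop then collects
# the exit statuses so failures are still reported.

unmount_all ()
{
        local result=0
        local pids=""
        local dir

        for dir in "$@"; do
                echo "Unmounting $dir"
                umount "$dir" &         # tear down in the background
                pids="$pids $!"         # remember the PID for the wait loop
        done

        # wait for all umount processes to complete, report any errors
        for pid in $pids; do
                wait $pid || result=2
        done

        return $result
}

# hypothetical usage with three OST mount points
unmount_all /mnt/lustre/ost0 /mnt/lustre/ost1 /mnt/lustre/ost2

Two details worth noting: a bare "wait" with no arguments would also block until every umount finished, but it discards the children's exit statuses, so waiting on each PID individually is what lets a failed umount still propagate through $result (matching the serial "umount $dir || result=2" behavior). The sketch also collapses the patch's if/else PID bookkeeping into a single append, since the leading space in $pids is harmless when the variable is word-split by the for loop.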

The testing I performed shows this patch considerably improves the time it takes to "stop" the OSS. The raw numbers are below.

Here is the timing information I gathered using the time command when running /etc/init.d/lustre start and /etc/init.d/lustre stop:

$ time /etc/init.d/lustre start # (w/o patch)

+-----------+------------+-----------+-----------+                              
| # of OSTs |    real    |    user   |    sys    |                              
+-----------+------------+-----------+-----------+                              
|     1     | 0m  2.184s | 0m 0.162s | 0m 0.077s |
|     2     | 0m  4.285s | 0m 0.281s | 0m 0.148s |
|     4     | 0m  8.508s | 0m 0.500s | 0m 0.302s |
|     8     | 0m 16.961s | 0m 1.017s | 0m 0.568s |
|    16     | 0m 33.884s | 0m 1.964s | 0m 1.176s |
|    32     | 1m  7.744s | 0m 3.986s | 0m 2.280s |
+-----------+------------+-----------+-----------+                              
$ time /etc/init.d/lustre stop # (w/o patch)

+-----------+------------+-----------+-----------+                              
| # of OSTs |    real    |    user   |    sys    |                              
+-----------+------------+-----------+-----------+                              
|     1     | 0m  4.758s | 0m 0.072s | 0m 0.030s |                              
|     2     | 0m  9.018s | 0m 0.118s | 0m 0.049s |                              
|     4     | 0m 18.813s | 0m 0.185s | 0m 0.083s |                              
|     8     | 0m 37.586s | 0m 0.337s | 0m 0.141s |                              
|    16     | 1m 16.092s | 0m 0.597s | 0m 0.263s |                              
|    32     | 2m 37.550s | 0m 1.181s | 0m 0.403s |                              
+-----------+------------+-----------+-----------+                              

Here is the timing information gathered the same way as above, but with my patch applied (all else being equal):

$ time /etc/init.d/lustre start # (w/ patch)

+-----------+------------+-----------+-----------+                              
| # of OSTs |    real    |    user   |    sys    |                              
+-----------+------------+-----------+-----------+                              
|     1     | 0m  2.183s | 0m 0.158s | 0m 0.083s |
|     2     | 0m  4.282s | 0m 0.274s | 0m 0.153s |
|     4     | 0m  8.519s | 0m 0.510s | 0m 0.303s |
|     8     | 0m 16.966s | 0m 1.019s | 0m 0.583s |
|    16     | 0m 33.878s | 0m 1.984s | 0m 1.154s |
|    32     | 1m  7.745s | 0m 3.944s | 0m 2.322s |
+-----------+------------+-----------+-----------+                              
$ time /etc/init.d/lustre stop # (w/ patch)

+-----------+------------+-----------+-----------+                              
| # of OSTs |    real    |    user   |    sys    |                              
+-----------+------------+-----------+-----------+                              
|      1    | 0m  4.566s | 0m 0.075s | 0m 0.023s |                              
|      2    | 0m  4.857s | 0m 0.105s | 0m 0.070s |                              
|      4    | 0m  4.777s | 0m 0.175s | 0m 0.064s |                              
|      8    | 0m  5.449s | 0m 0.323s | 0m 0.153s |                              
|     16    | 0m  5.862s | 0m 0.606s | 0m 0.208s |                              
|     32    | 0m  6.307s | 0m 1.183s | 0m 0.811s |                              
+-----------+------------+-----------+-----------+                              

This is a drastic improvement in the time it takes for /etc/init.d/lustre stop to complete as the number of OSTs increases: without the patch the stop time grows roughly linearly at about 4.7 seconds per OST (2m 37s at 32 OSTs), while with the patch the umounts overlap and the total stays nearly flat at about 5-6 seconds regardless of the OST count.



 Comments   
Comment by Prakash Surya (Inactive) [ 31/Jan/13 ]

Please see: http://review.whamcloud.com/5235

Comment by Peter Jones [ 04/Feb/13 ]

Thanks for the patch Prakash!

Emoly

Could you please take care of this patch?

Thanks

Peter

Comment by Emoly Liu [ 04/Feb/13 ]

OK.

Comment by Peter Jones [ 22/Mar/13 ]

Landed for 2.4
