[LU-5608] Performance regression of removal operation with mdtest stride option Created: 11/Sep/14  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Wang Shilong (Inactive) Assignee: Lai Siyao
Resolution: Incomplete Votes: 0
Labels: None

Attachments: File mdtest-HEAD-f2bf8ac.tar    
Severity: 3
Rank (Obsolete): 15687

 Description   

While comparing a Lustre 1.8 series client with a client built from the latest master (the server runs the same 2.5 series in both cases), we found a large regression in file removal performance.

Testing command is:

  mpirun -bind-to core:overload-allowed --map-by ppr:32:node --allow-run-as-root -np 512 -hostfile ./hostfile.32 ./mdtest -n 2000 -i 3 -p 10 -u -d /lustre_{0-31}/mdtest.out -F -N 32

File removal performance, 1.8 versus master:
61476.175 ops/second vs. 39640.455 ops/second.

Big regression, isn't it?

Note that the '-N' option of mdtest is required here; the problem seems to be reproducible only with multiple clients. The attachment is the modified mdtest source code, which should help reproduce the problem.



 Comments   
Comment by Shuichi Ihara (Inactive) [ 12/Sep/14 ]

Here are benchmark results with the master branch and 1.8.9 on the clients. The server is running lustre-2.5.
mdtest supports a stride option (-N <n>), which avoids hitting the client's cached locks for the files it created. The regression happens when the "-N" option is enabled on lustre-2.6 clients: performance drops by more than 45% compared to the lustre-1.8 client.

32 clients, 64 processes
No stride
# mdtest -n 16384 -i 3 -p 10 -d /lustre_0/mdtest.out -F -u 

Stride=2, because two mdtest threads are running on the same client
# mdtest -n 16384 -i 3 -p 10 -d /lustre_0/mdtest.out -F -u -N 2 
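The effect of the stride on which files each task touches can be sketched as follows. This is only an illustration, assuming that with -N <n> each task stats/reads/removes the files created by the task <n> ranks away (wrapping around), and that ranks are placed consecutively on each client; the function names are hypothetical and are not taken from the mdtest source.

```python
def stride_target(rank: int, stride: int, nprocs: int) -> int:
    """Rank whose files `rank` operates on after the create phase (assumed mapping)."""
    return (rank + stride) % nprocs


def node_of(rank: int, procs_per_node: int) -> int:
    """Client node hosting `rank`, assuming consecutive rank placement per node."""
    return rank // procs_per_node


def is_remote(rank: int, stride: int, nprocs: int, procs_per_node: int) -> bool:
    """True if the files being operated on were created on a different client."""
    target = stride_target(rank, stride, nprocs)
    return node_of(target, procs_per_node) != node_of(rank, procs_per_node)


# 32 clients x 2 processes, as in the runs above.
nprocs, procs_per_node = 64, 2
for stride in (0, 2):
    remote = sum(is_remote(r, stride, nprocs, procs_per_node) for r in range(nprocs))
    print(f"stride={stride}: {remote}/{nprocs} tasks operate on another client's files")
```

Under these assumptions, a stride equal to the number of processes per client means every task works on files created by a different client, so locks cached by the creating client cannot be reused.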

No Stride

                File Creation  File Stat  File Read  File removal
1.8.9 client            90548     224747     166181        102922
2.6.62 client           83073     195469     128705        102793

Stride=2

                File Creation  File Stat  File Read  File removal
1.8.9 client            83908     224870     162568         51753
2.6.62 client           87455     205767     156613         33621
Comment by Cory Spitz [ 12/Sep/14 ]

This seems to be related (or a duplicate) of https://jira.hpdd.intel.com/browse/LU-1167 and https://jira.hpdd.intel.com/browse/LU-3308.

Comment by Shuichi Ihara (Inactive) [ 13/Sep/14 ]

I don't know what type of metadata workload LU-1167 and LU-3308 used, but I think this is a different issue.
LU-5608 seems to be related to the layout lock, which had not yet been introduced in lustre-2.1 or 2.2. LU-1167 and LU-3308 didn't mention it because the layout lock was not available at that time.
I will post another set of benchmark results to confirm that our assumption is correct.

Comment by Shuichi Ihara (Inactive) [ 13/Sep/14 ]

lustre-1.8.9 doesn't have the layout lock. So, to check whether the layout lock might be related, I applied the following patch to force-disable the layout lock on a 2.6.52 client.

Index: lustre-release.git/lustre/llite/llite_lib.c
===================================================================
--- lustre-release.git.orig/lustre/llite/llite_lib.c
+++ lustre-release.git/lustre/llite/llite_lib.c
@@ -211,7 +211,7 @@ static int client_common_fill_super(stru
                                   OBD_CONNECT_FULL20   | OBD_CONNECT_64BITHASH|
 				  OBD_CONNECT_EINPROGRESS |
 				  OBD_CONNECT_JOBSTATS | OBD_CONNECT_LVB_TYPE |
-				  OBD_CONNECT_LAYOUTLOCK | OBD_CONNECT_PINGLESS |
+				  OBD_CONNECT_PINGLESS |
 				  OBD_CONNECT_MAX_EASIZE |
 				  OBD_CONNECT_FLOCK_DEAD |
 				  OBD_CONNECT_DISP_STRIPE | OBD_CONNECT_LFSCK |
@@ -416,7 +416,6 @@ static int client_common_fill_super(stru
                                   OBD_CONNECT_MAXBYTES |
 				  OBD_CONNECT_EINPROGRESS |
 				  OBD_CONNECT_JOBSTATS | OBD_CONNECT_LVB_TYPE |
-				  OBD_CONNECT_LAYOUTLOCK |
 				  OBD_CONNECT_PINGLESS | OBD_CONNECT_LFSCK;
 
         if (sbi->ll_flags & LL_SBI_SOM_PREVIEW)

Here are the test results. 32 clients, 64 processes, 1M files for creation/stat/removal.

No Stride

                        File Creation  File Stat  File Read  File removal
1.8.9 client                    90548     224747     166181        102922
2.6.62 client                   83073     195469     128705        102793
patched 2.6.62 client           89381     186802     120346         83731

Stride=2

                        File Creation  File Stat  File Read  File removal
1.8.9 client                    83908     224870     162568         51753
2.6.62 client                   87455     205767     156613         33621
patched 2.6.62 client           89275     182131     153786         49672

With stride enabled, "File removal" performance improved significantly and is close to the lustre-1.8.9 numbers.
However, without stride, "File removal" performance dropped, and "File stat" performance also dropped when stride is enabled.
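To put numbers on these observations, the relative change of "File removal" against the 1.8.9 client can be computed directly from the tables above. A minimal sketch; the figures are copied from the tables, and the helper name pct_change is only illustrative.

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of `value` versus `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# "File removal" column from the tables above, keyed by client.
removal = {
    "No stride": {"1.8.9": 102922, "2.6.62": 102793, "patched 2.6.62": 83731},
    "Stride=2": {"1.8.9": 51753, "2.6.62": 33621, "patched 2.6.62": 49672},
}

for mode, results in removal.items():
    base = results["1.8.9"]
    for client in ("2.6.62", "patched 2.6.62"):
        change = pct_change(base, results[client])
        print(f"{mode:10s} {client:15s} {change:+6.1f}% vs 1.8.9")
```

With these figures, the unpatched 2.6.62 client removes files about 35% slower than 1.8.9 at Stride=2, while the patched client is within about 4%; without stride the patched client is about 19% slower.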

Comment by Peter Jones [ 13/Sep/14 ]

Lai

Could you please comment?

Thanks

Peter

Comment by Lai Siyao [ 18/Sep/14 ]

This looks to be caused by statahead: with the mdtest stride option, statahead does not help and only adds overhead. In the current statahead implementation, each stat tries to start statahead, which fails because the stat entry is not the first directory entry, and that causes even more overhead. Hopefully LU-3270 can help with this issue, because a patch for it disables statahead after a previous statahead failure.

Could you disable statahead on master client, and run this test again?
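For reference, a rough sketch of how statahead could be switched off on each client before rerunning the test. It assumes lctl is in PATH and that the client exposes the llite.*.statahead_max tunable, where 0 disables statahead; the helper names are hypothetical, and by default the script only prints the command instead of running it.

```python
import subprocess


def statahead_cmd(max_entries: int) -> list[str]:
    """Build the lctl command line that sets statahead_max on all llite mounts."""
    if max_entries < 0:
        raise ValueError("statahead_max must be >= 0")
    return ["lctl", "set_param", f"llite.*.statahead_max={max_entries}"]


def set_statahead(max_entries: int, dry_run: bool = True) -> list[str]:
    """Return the command; execute it only when dry_run is False (needs root)."""
    cmd = statahead_cmd(max_entries)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


# Dry run: show what would be executed on a client to disable statahead.
print(" ".join(set_statahead(0)))
```

The setting is per client and does not persist across a remount, so it would have to be applied on every client node before the mdtest run.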

In the meantime, I'll run this test against the LU-3270 code to verify as well.

Generated at Sat Feb 10 01:52:57 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.