[LU-1408] single client's performance regression test Created: 15/May/12  Updated: 02/Jun/14  Resolved: 20/Jul/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.3.0, Lustre 2.6.0

Type: Bug Priority: Blocker
Reporter: Shuichi Ihara (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None
Environment:

lustre-2.2, b2_1 and lustre-1.8.7
CentOS6.2 on both servers and clients


Attachments: Microsoft Word single-client-perforamnce-LU1408-rev2.xlsx     Microsoft Word single-client-perforamnce-LU1408.xlsx     File test-script.sh    
Issue Links:
Related
is related to LU-1413 difference of single client's perform... Resolved
is related to LU-744 Single client's performance degradati... Resolved
is related to LU-969 2.1 client stack overruns Resolved
Severity: 3
Rank (Obsolete): 4598

 Description   

This is a single client performance regression on 2.2 compared to 2.1.2 or 1.8.x.

I filed LU-744 earlier for another single-client performance regression, but that regression occurred on 2.1.x as well as 2.2, and only when the total file size is larger than the client's memory.

So this regression may be unrelated to LU-744: it is an additional regression on 2.2 that shows up even when the total file size is smaller than the client's memory.



 Comments   
Comment by Shuichi Ihara (Inactive) [ 15/May/12 ]

test script with IOR

Comment by Shuichi Ihara (Inactive) [ 15/May/12 ]

Here are the test script and initial single-client performance results on various Lustre versions.
http://jira.whamcloud.com/secure/attachment/11360/test-script.sh
http://jira.whamcloud.com/secure/attachment/11303/lustre-singleclient-comparison.xlsx

Servers are running lustre-2.2 on CentOS6.2; the tests simply exercise each checksum algorithm against the various Lustre client versions.
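
For reference, the checksum knobs being exercised are the standard per-OSC client parameters; a minimal sketch of toggling them with lctl (parameter names are stock Lustre, the algorithm value is just an example):

# lctl set_param osc.*.checksums=0           # disable wire checksums
# lctl set_param osc.*.checksums=1           # re-enable wire checksums
# lctl get_param osc.*.checksum_type         # list available algorithms
# lctl set_param osc.*.checksum_type=adler   # select one, e.g. adler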

Comment by Shuichi Ihara (Inactive) [ 16/May/12 ]

While testing for LU-1408, I saw another large performance difference between the b2_1 branch and the 2.1.2RC0 tag on a single client. Please refer to LU-1413.

Comment by Peter Jones [ 16/May/12 ]

Oleg

Could you please suggest some steps here and perhaps someone else can assist in executing them?

Thanks

Peter

Comment by Shuichi Ihara (Inactive) [ 20/May/12 ]

This regression is related to LU-1413.

As I commented on LU-1413, we also saw the single-client performance regression on the latest b2_1 branch, but none at the 2.1.2RC0 tag. The regression starts at commit b9cbe3616b6e0b44c7835b1aec65befb85f848f9 (LU-969 debug: reduce stack usage). Please see LU-1413 for those test results.

http://jira.whamcloud.com/secure/attachment/11303/lustre-singleclient-comparison.xlsx
These are my results on b2_2, b2_1, and 1.8.7; I used the 2.1.2RC0 tag for the b2_1 testing.

This is why the 2.1.2 numbers were faster than b2_2: LU-969 had not yet landed in 2.1.2RC0, but had landed in b2_2.

To verify, I removed the LU-969 commit from b2_2 and measured the performance; a revert sketch appears at the end of this comment.
(IOR with 4 processes, checksums disabled)

               WRITE(GB/s)  READ(GB/s)
b2_2              1.4          1.4
b2_2/wo LU969     2.4          2.8
b2_1              1.4          1.4
b2_1/wo LU969     2.7          3.2

I will try to run full testing on FDR InfiniBand.
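
For reference, removing the commit locally can be done with a revert; a minimal sketch (the hash is the LU-969 commit named above, branch handling assumed):

# git checkout b2_2
# git revert --no-edit b9cbe3616b6e0b44c7835b1aec65befb85f848f9

After that, rebuild and reinstall the client as usual.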

Comment by Shuichi Ihara (Inactive) [ 20/May/12 ]

These are benchmark results for 2.1.2RC0, b2_2, and b2_2 without the LU-969 patches.
Servers: RHEL6.2 with Lustre 2.2. Client: RHEL6.2 (96GB memory, 12 CPU cores) on QDR InfiniBand.

All numbers improved once the LU-969 patches were removed.

Comment by Shuichi Ihara (Inactive) [ 20/May/12 ]

Benchmark results with the correct parameters.

Comment by Shuichi Ihara (Inactive) [ 20/May/12 ]

http://jira.whamcloud.com/secure/attachment/11413/single-client-perforamnce-LU1408-rev2.xlsx

These are the corrected benchmark results with the right parameters; max_rpcs_in_flight=256 was missing from the previous runs with checksums enabled.
From the new results, with the LU-969 patches removed from b2_2, all numbers are nearly on par with 2.1.2RC0's.
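
For reference, max_rpcs_in_flight is the client's per-OST RPC concurrency limit (the stock default is 8); a minimal sketch of setting it with lctl:

# lctl set_param osc.*.max_rpcs_in_flight=256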

I will run the benchmark on FDR InfiniBand to see the maximum single-client performance on b2_2.

Comment by Peter Jones [ 20/May/12 ]

Ihara

Thanks for this investigation work!

Hongchao

Could you please look into why the LU-969 changes are negatively affecting performance?

Thanks

Peter

Comment by Hongchao Zhang [ 21/May/12 ]

This could be related to the modification of __CHECK_STACK.

Hi Ihara,
could you please check whether there are many more logs on the server side when testing with the LU-969 patches? Thanks very much!

Comment by Oleg Drokin [ 21/May/12 ]

I suspect the root problem is that we now fill the debug structure every time, regardless of whether the message will actually be emitted.

So perhaps if we move the structure filling to after the check, all should be fine?

Something like this:

diff --git a/libcfs/include/libcfs/libcfs_debug.h b/libcfs/include/libcfs/libcfs_debug.h
index 8a366f9..72171ad 100644
--- a/libcfs/include/libcfs/libcfs_debug.h
+++ b/libcfs/include/libcfs/libcfs_debug.h
@@ -203,12 +203,13 @@ static inline int cfs_cdebug_show(unsigned int mask, unsigned int subsystem)
 
 #define __CDEBUG(cdls, mask, format, ...)                               \
 do {                                                                    \
-        LIBCFS_DEBUG_MSG_DATA_DECL(msgdata, mask, cdls);                \
                                                                         \
-        CFS_CHECK_STACK(&msgdata, mask, cdls);                          \
+        if (cfs_cdebug_show(mask, DEBUG_SUBSYSTEM)) {                   \
+                LIBCFS_DEBUG_MSG_DATA_DECL(msgdata, mask, cdls);        \
                                                                         \
-        if (cfs_cdebug_show(mask, DEBUG_SUBSYSTEM))                     \
+                CFS_CHECK_STACK(&msgdata, mask, cdls);                  \
                 libcfs_debug_msg(&msgdata, format, ## __VA_ARGS__);     \
+        }                                                               \
 } while (0)
 
 #define CDEBUG(mask, format, ...) __CDEBUG(NULL, mask, format, ## __VA_ARGS__)
diff --git a/lustre/include/cl_object.h b/lustre/include/cl_object.h
index 47782be..ad1cf7a 100644
--- a/lustre/include/cl_object.h
+++ b/lustre/include/cl_object.h
@@ -1065,9 +1065,9 @@ struct cl_page_operations {
  */
 #define CL_PAGE_DEBUG(mask, env, page, format, ...)                     \
 do {                                                                    \
-        LIBCFS_DEBUG_MSG_DATA_DECL(msgdata, mask, NULL);                \
                                                                         \
         if (cfs_cdebug_show(mask, DEBUG_SUBSYSTEM)) {                   \
+                LIBCFS_DEBUG_MSG_DATA_DECL(msgdata, mask, NULL);        \
                 cl_page_print(env, &msgdata, lu_cdebug_printer, page);  \
                 CDEBUG(mask, format , ## __VA_ARGS__);                  \
         }                                                               \
@@ -1078,9 +1078,9 @@ do {                                                                    \
  */
 #define CL_PAGE_HEADER(mask, env, page, format, ...)                          \
 do {                                                                          \
-        LIBCFS_DEBUG_MSG_DATA_DECL(msgdata, mask, NULL);                      \
                                                                               \
         if (cfs_cdebug_show(mask, DEBUG_SUBSYSTEM)) {                         \
+                LIBCFS_DEBUG_MSG_DATA_DECL(msgdata, mask, NULL);              \
                 cl_page_header_print(env, &msgdata, lu_cdebug_printer, page); \
                 CDEBUG(mask, format , ## __VA_ARGS__);                        \
         }                                                                     \
@@ -1789,9 +1789,9 @@ struct cl_lock_operations {
 
 #define CL_LOCK_DEBUG(mask, env, lock, format, ...)                     \
 do {                                                                    \
-        LIBCFS_DEBUG_MSG_DATA_DECL(msgdata, mask, NULL);                \
                                                                         \
         if (cfs_cdebug_show(mask, DEBUG_SUBSYSTEM)) {                   \
+                LIBCFS_DEBUG_MSG_DATA_DECL(msgdata, mask, NULL);        \
                 cl_lock_print(env, &msgdata, lu_cdebug_printer, lock);  \
                 CDEBUG(mask, format , ## __VA_ARGS__);                  \
         }                                                               \
diff --git a/lustre/include/lu_object.h b/lustre/include/lu_object.h
index 0fd61fb..b97a249 100644
--- a/lustre/include/lu_object.h
+++ b/lustre/include/lu_object.h
@@ -763,9 +763,9 @@ int lu_cdebug_printer(const struct lu_env *env,
  */
 #define LU_OBJECT_DEBUG(mask, env, object, format, ...)                   \
 do {                                                                      \
-        LIBCFS_DEBUG_MSG_DATA_DECL(msgdata, mask, NULL);                  \
                                                                           \
         if (cfs_cdebug_show(mask, DEBUG_SUBSYSTEM)) {                     \
+                LIBCFS_DEBUG_MSG_DATA_DECL(msgdata, mask, NULL);          \
                 lu_object_print(env, &msgdata, lu_cdebug_printer, object);\
                 CDEBUG(mask, format , ## __VA_ARGS__);                    \
         }                                                                 \
@@ -776,9 +776,9 @@ do {                                                                      \
  */
 #define LU_OBJECT_HEADER(mask, env, object, format, ...)                \
 do {                                                                    \
-        LIBCFS_DEBUG_MSG_DATA_DECL(msgdata, mask, NULL);                \
                                                                         \
         if (cfs_cdebug_show(mask, DEBUG_SUBSYSTEM)) {                   \
+                LIBCFS_DEBUG_MSG_DATA_DECL(msgdata, mask, NULL);        \
                 lu_object_header_print(env, &msgdata, lu_cdebug_printer,\
                                        (object)->lo_header);            \
                 lu_cdebug_printer(env, &msgdata, "\n");                 \
Comment by Hongchao Zhang [ 22/May/12 ]

On Toro (1 client, 1 MDT, 6 OSTs), across several tests of b2_2 and the previous b2_1 (without the LU-969 patch), the read/write performance is almost the same, so the effect of the LU-969 patch could be related to your test environment.

Could you please test with the following patch (on top of the LU-969 patch)? Thanks in advance!

diff --git a/libcfs/include/libcfs/linux/libcfs.h b/libcfs/include/libcfs/linux/libcfs.h
index ce07e80..0dadd84 100644
--- a/libcfs/include/libcfs/linux/libcfs.h
+++ b/libcfs/include/libcfs/linux/libcfs.h
@@ -79,7 +79,8 @@
 
 #define __CHECK_STACK(msgdata, mask, cdls)                              \
 do {                                                                    \
-        if (unlikely(CDEBUG_STACK() > libcfs_stack)) {                  \
+        if (unlikely(CDEBUG_STACK() > 3 * THREAD_SIZE / 4 &&            \
+                     CDEBUG_STACK() > libcfs_stack)) {                  \
                 libcfs_stack = CDEBUG_STACK();                          \
                 (msgdata)->msg_mask = D_WARNING;                        \
                 (msgdata)->msg_cdls = NULL;                             \
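
As I read the patch, the added constant comparison short-circuits on the common path, so the global high-water mark is only consulted once the stack is already above 3/4 of THREAD_SIZE. A standalone C sketch of that logic (names are illustrative, not the actual Lustre symbols):

/* Illustrative only: stack_used stands in for CDEBUG_STACK(),
 * watermark for the global libcfs_stack. */
static unsigned long watermark;

static inline void check_stack_sketch(unsigned long stack_used,
                                      unsigned long thread_size)
{
        /* The first test is a cheap compare against a constant, so the
         * shared 'watermark' is read (and possibly updated) only when
         * the stack already exceeds 3/4 of the thread stack size. */
        if (stack_used > 3 * thread_size / 4 && stack_used > watermark)
                watermark = stack_used;
}
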
Comment by Hongchao Zhang [ 23/May/12 ]

Hi Ihara,

could you please test this patch (http://review.whamcloud.com/#change,2889) when you get a chance? Thanks!

Comment by James A Simmons [ 23/May/12 ]

ORNL is also testing this patch at this time on our test beds. Will post results soon. Thanks for the patch.

Comment by Shuichi Ihara (Inactive) [ 23/May/12 ]

Hongchao, I'm very sorry; our test system is shut down this week, and I was looking for another system, but no luck.

James, I appreciate your help with testing!

Comment by Shuichi Ihara (Inactive) [ 23/May/12 ]

James,
For testing, here is my IOR command. It's basic, but the block size needs to be tuned against the process count so that the aggregate file size stays below the client's memory.
mpirun -np 4 IOR -b Xg -t 1m -F -C -w -r -e -vv -o /lustre/ior.out/file
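
As a worked example, the rule of thumb is to keep np * b below the client's memory, so this test stays separate from the LU-744 (file larger than RAM) case; a sketch assuming the 96 GB client described earlier:

# 4 processes, 8 GB block each: aggregate 32 GB, well under 96 GB of RAM
mpirun -np 4 IOR -b 8g -t 1m -F -C -w -r -e -vv -o /lustre/ior.out/file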

Comment by James A Simmons [ 23/May/12 ]

Got it. So does this problem still exist if the block size is greater than the client memory? I did an earlier run with a block size much larger than the client memory, and in that case I got similar results with and without the patch.

Comment by Shuichi Ihara (Inactive) [ 23/May/12 ]

Yes, when the file size is larger than the client's memory, that problem still exists; it was filed as LU-744.

Comment by Cliff White (Inactive) [ 24/May/12 ]

Tests run on build 6296, 106 client nodes: http://review.whamcloud.com/#change,2889

Kernel: 2.6.32-220.13.1.el6_lustre.g10a847d.x86_64
Lustre: jenkins-arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel-6296-gc1ba127-PRISTINE-2.6.32-220.13.1.el6_lustre.g10a847d.x86_64

parallel-scale.test_iorssf.test_log.hyperion244.log
        clients            = 848 (8 per node)
        aggregate filesize = 848 GiB
        blocksize          = 1 GiB
        xfersize           = 1 MiB
 Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
 write        5992.72    5561.75    5837.06     195.23  148.93573 0 848 8 3 0 0 1 0 0 1 1073741824 1048576 910533066752 POSIX 0
 read         5092.44    4625.07    4896.03     197.95  177.65520 0 848 8 3 0 0 1 0 0 1 1073741824 1048576 910533066752 POSIX 0

 Finished: Thu May 24 11:40:39 2012

parallel-scale.test_iorfpp.test_log.hyperion244.log
        clients            = 848 (8 per node)
        aggregate filesize = 848 GiB
        blocksize          = 1 GiB
        xfersize           = 1 MiB
 Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
 write        6858.42    6656.15    6766.77      83.66  128.34560 0 848 8 3 1 0 1 0 0 1 1073741824 1048576 910533066752 POSIX 0
 read         6196.60    6069.53    6142.16      53.44  141.38652 0 848 8 3 1 0 1 0 0 1 1073741824 1048576 910533066752 POSIX 0

 Finished: Thu May 24 11:58:08 2012
Comment by Cliff White (Inactive) [ 24/May/12 ]

Tests run on build 6296, 50 client nodes (includes patch):

Kernel: 2.6.32-220.13.1.el6_lustre.g10a847d.x86_64
Lustre: jenkins-arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel-6296-gc1ba127-PRISTINE-2.6.32-220.13.1.el6_lustre.g10a847d.x86_64

parallel-scale.test_iorssf.test_log.hyperion244.log
        clients            = 400 (8 per node)
        aggregate filesize = 400 GiB
        blocksize          = 1 GiB
        xfersize           = 1 MiB
 Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
 write         716.36     610.43     659.39      43.61  623.87034 0 400 8 3 0 0 1 0 0 1 1073741824 1048576 429496729600 POSIX 0
 read         5380.47    5285.04    5336.81      39.38   76.75416 0 400 8 3 0 0 1 0 0 1 1073741824 1048576 429496729600 POSIX 0

 Finished: Thu May 24 14:16:16 2012

parallel-scale.test_iorfpp.test_log.hyperion244.log
        clients            = 400 (8 per node)
        aggregate filesize = 400 GiB
        blocksize          = 1 GiB
        xfersize           = 1 MiB
 Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
 write        6763.45     176.72    4552.72    3094.36  813.10982 0 400 8 3 1 0 1 0 0 1 1073741824 1048576 429496729600 POSIX 0
 read         5981.72    5336.92    5766.69     303.89   71.23411 0 400 8 3 1 0 1 0 0 1 1073741824 1048576 429496729600 POSIX 0

 Finished: Thu May 24 15:02:52 2012
Comment by Cliff White (Inactive) [ 24/May/12 ]

Tip of 2.1, includes patch, 50 client nodes:

Kernel: 2.6.32-220.17.1.el6_lustre.g636ddbf.x86_64
Lustre: jenkins-g696f7f2-PRISTINE-2.6.32-220.17.1.el6_lustre.g636ddbf.x86_64

parallel-scale.test_iorssf.test_log.hyperion244.log
        clients            = 400 (8 per node)
        aggregate filesize = 400 GiB
        blocksize          = 1 GiB
        xfersize           = 1 MiB
 Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
 write        5179.29    4777.87    4998.78     166.35   82.03194 0 400 8 3 0 0 1 0 0 1 1073741824 1048576 429496729600 POSIX 0
 read         5334.62    4381.00    4986.10     429.51   82.80132 0 400 8 3 0 0 1 0 0 1 1073741824 1048576 429496729600 POSIX 0

 Finished: Thu May 24 19:36:41 2012

parallel-scale.test_iorfpp.test_log.hyperion244.log
        clients            = 400 (8 per node)
        aggregate filesize = 400 GiB
        blocksize          = 1 GiB
        xfersize           = 1 MiB
 Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
 write        6397.31    6191.68    6273.38      89.10   65.30478 0 400 8 3 1 0 1 0 0 1 1073741824 1048576 429496729600 POSIX 0
 read         6289.52    5801.98    5992.59     212.76   68.43547 0 400 8 3 1 0 1 0 0 1 1073741824 1048576 429496729600 POSIX 0

 Finished: Thu May 24 19:44:19 2012
Comment by Cliff White (Inactive) [ 24/May/12 ]

Tip of 2.1 (build #81)

Kernel: 2.6.32-220.17.1.el6_lustre.g636ddbf.x86_64
Lustre: jenkins-g696f7f2-PRISTINE-2.6.32-220.17.1.el6_lustre.g636ddbf.x86_64

parallel-scale.test_iorssf.test_log.hyperion244.log
        clients            = 840 (8 per node)
        aggregate filesize = 840 GiB
        blocksize          = 1 GiB
        xfersize           = 1 MiB
 Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
 write        5770.81    5199.94    5412.42     254.87  159.26572 0 840 8 3 0 0 1 0 0 1 1073741824 1048576 901943132160 POSIX 0
 read         5151.68    5057.77    5117.10      42.14  168.10677 0 840 8 3 0 0 1 0 0 1 1073741824 1048576 901943132160 POSIX 0

 Finished: Thu May 24 20:06:38 2012

parallel-scale.test_iorfpp.test_log.hyperion244.log
        clients            = 840 (8 per node)
        aggregate filesize = 840 GiB
        blocksize          = 1 GiB
        xfersize           = 1 MiB
 Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
 write        6177.47    5906.30    6009.84     119.62  143.18126 0 840 8 3 1 0 1 0 0 1 1073741824 1048576 901943132160 POSIX 0
 read         6144.79    5520.26    5738.26     287.71  150.26389 0 840 8 3 1 0 1 0 0 1 1073741824 1048576 901943132160 POSIX 0

 Finished: Thu May 24 20:23:21 2012
Comment by Cliff White (Inactive) [ 25/May/12 ]

http://review.whamcloud.com/2901 - build without the offending patch, 50 client nodes

Kernel: 2.6.32-220.17.1.el6_lustre.g636ddbf.x86_64
Lustre: jenkins-ga944961-PRISTINE-2.6.32-220.17.1.el6_lustre.g636ddbf.x86_64

parallel-scale.test_iorssf.test_log.hyperion244.log
        clients            = 400 (8 per node)
        aggregate filesize = 400 GiB
        blocksize          = 1 GiB
        xfersize           = 1 MiB
 Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
 write        5489.07    4984.54    5268.71     210.86   77.86901 0 400 8 3 0 0 1 0 0 1 1073741824 1048576 429496729600 POSIX 0
 read         5410.81    5175.37    5257.66     108.40   77.93804 0 400 8 3 0 0 1 0 0 1 1073741824 1048576 429496729600 POSIX 0

 Finished: Thu May 24 22:09:52 2012

parallel-scale.test_iorfpp.test_log.hyperion244.log
        clients            = 400 (8 per node)
        aggregate filesize = 400 GiB
        blocksize          = 1 GiB
        xfersize           = 1 MiB
 Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
 write        6250.72    3615.96    5349.70    1226.26   81.68548 0 400 8 3 1 0 1 0 0 1 1073741824 1048576 429496729600 POSIX 0
 read         6329.04    5786.43    6014.79     229.69   68.19637 0 400 8 3 1 0 1 0 0 1 1073741824 1048576 429496729600 POSIX 0

 Finished: Thu May 24 22:18:16 2012
Comment by Cliff White (Inactive) [ 25/May/12 ]

105 client nodes

Kernel: 2.6.32-220.17.1.el6_lustre.g636ddbf.x86_64
Lustre: jenkins-ga944961-PRISTINE-2.6.32-220.17.1.el6_lustre.g636ddbf.x86_64

parallel-scale.test_iorssf.test_log.hyperion244.log
        clients            = 840 (8 per node)
        aggregate filesize = 840 GiB
        blocksize          = 1 GiB
        xfersize           = 1 MiB
 Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
 write        5372.92    5306.97    5340.69      26.95  161.06181 0 840 8 3 0 0 1 0 0 1 1073741824 1048576 901943132160 POSIX 0
 read         5136.44    5110.41    5121.05      11.14  167.96625 0 840 8 3 0 0 1 0 0 1 1073741824 1048576 901943132160 POSIX 0

 Finished: Thu May 24 22:39:19 2012

parallel-scale.test_iorfpp.test_log.hyperion244.log
        clients            = 840 (8 per node)
        aggregate filesize = 840 GiB
        blocksize          = 1 GiB
        xfersize           = 1 MiB
 Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
 write        5720.72    5560.95    5632.40      66.31  152.73755 0 840 8 3 1 0 1 0 0 1 1073741824 1048576 901943132160 POSIX 0
 read         6225.76    5433.65    5778.62     331.35  149.33184 0 840 8 3 1 0 1 0 0 1 1073741824 1048576 901943132160 POSIX 0

 Finished: Thu May 24 22:56:19 2012
Comment by Shuichi Ihara (Inactive) [ 25/May/12 ]

Please test on a single client with multiple threads, instead of multiple nodes.
The original problem here is a single-client performance regression. So even if we get better performance across multiple clients, we still need to make sure the single-client regression is really gone with the patches; see the hostfile sketch below.
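
For example, with Open MPI all ranks can be pinned to one node through a hostfile; a minimal sketch (node name and slot count are placeholders):

# echo "client01 slots=16" > one_node
# mpirun -np 16 --hostfile one_node IOR -b 4g -t 1m -F -C -w -r -e -vv -o /lustre/ior.out/file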

Comment by James A Simmons [ 25/May/12 ]

Oh, I see. Like Cliff, I was not seeing a really big difference with and without the LU-1408 patch, and also like Cliff I was testing with multiple nodes at various thread counts per node. Thank you for clarifying this. I will do another round of testing to see if this makes a difference.

Comment by James A Simmons [ 25/May/12 ]

Finished a run against master (2.2.53). Cliff, can you try running on a single client as well? The results I got were:

*************************************************************************************
No LU-1408 patch
*************************************************************************************

16 Threads one single node
Operation   Max(MiB)   Min(MiB)  Mean(MiB)   StdDev   Max(OPs)  Min(OPs)  Mean(OPs)  StdDev    Mean(s)
---------   --------   --------  ---------   ------   --------  --------  ---------  ------   --------
write         607.24     568.29     596.03    14.05     151.81    142.07     149.01    3.51  110.01801
read          723.09     700.82     714.84     8.14     180.77    175.20     178.71    2.04   91.69101

Max Write: 607.24 MiB/sec (636.74 MB/sec)
Max Read: 723.09 MiB/sec (758.21 MB/sec)

*************************************************************************************
With LU-1408 patch
*************************************************************************************
16 Threads one single node
Operation   Max(MiB)   Min(MiB)  Mean(MiB)   StdDev   Max(OPs)  Min(OPs)  Mean(OPs)  StdDev    Mean(s)
---------   --------   --------  ---------   ------   --------  --------  ---------  ------   --------
write         611.51     445.17     569.68    62.68     152.88    111.29     142.42   15.67  116.72391
read          771.25     746.29     757.22     9.74     192.81    186.57     189.30    2.44   86.56250

Max Write: 611.51 MiB/sec (641.21 MB/sec)
Max Read: 771.25 MiB/sec (808.71 MB/sec)

Comment by Shuichi Ihara (Inactive) [ 25/May/12 ]

Hmm... these numbers seem much lower than what I saw, even with the patch. Are the server and clients connected with QDR InfiniBand? How many CPU cores does the client have? And did you disable checksums, just in case?

Comment by James A Simmons [ 25/May/12 ]

Yes, the fabric is QDR. Each node has 8 Intel Xeon CPU cores (model E5520) at 2.27GHz. Actually, I just realized that for this set of tests I had checksums on. No matter; with checksums on or off the results are pretty close to each other.

Comment by Shuichi Ihara (Inactive) [ 26/May/12 ]

OK, our system is back and I tested the LU-1408 patches.
Tested on a client (X5675, 3.07GHz, 48GB memory, QDR InfiniBand), RHEL6.2, lustre-2.1.2-RC1 (both servers and client).

Confirmed the patches fix the performance regression. Here are the test results.

===== without patch (original 2.1.2-RC1) =====
# mpirun -np 4 /root/IOR -b 8g -t 1m -F -C -w -e -k -vv -o /lustre/file

... snip ...

Max Write: 1401.19 MiB/sec (1469.26 MB/sec)

# mpirun -np 4 /root/IOR -b 8g -t 1m -F -C -r -e -vv -o /lustre/file

... snip ...

Max Read:  1510.69 MiB/sec (1584.07 MB/sec)
===== with LU-1408 patch =====
# mpirun -np 4 /root/IOR -b 8g -t 1m -F -C -w -e -k -vv -o /lustre/file

... snip ...

Max Write: 2578.54 MiB/sec (2703.80 MB/sec)

# pdsh -a "sync; echo 3 > /proc/sys/vm/drop_caches"

# mpirun -np 4 /root/IOR -b 8g -t 1m -F -C -r -e -vv -o /lustre/file

... snip ...

Max Read:  2663.67 MiB/sec (2793.06 MB/sec)
Comment by Oleg Drokin [ 28/May/12 ]

Thanks for confirming the results.

Can you please tell me which patches you tested? Just the one in Gerrit?

Comment by Shuichi Ihara (Inactive) [ 28/May/12 ]

I tested patch set 2 on http://review.whamcloud.com/#change,2889.
After my testing was done and verified, I wanted to set the "verified" flag as a manual tester, but I couldn't find how.
Any advice on how we can do that?

Comment by Hongchao Zhang [ 29/May/12 ]

Hi Ihara

could you please test patch set 3 on http://review.whamcloud.com/#change,2889? Patch set 2 disables the stack check entirely, but it needs to stay enabled for non-x86-64 architectures. Thanks!

Comment by Hongchao Zhang [ 31/May/12 ]

Hi Ihara,

The patch has been updated; could you please test patch set 5 on http://review.whamcloud.com/#change,2889? Thanks!

Comment by Hongchao Zhang [ 05/Jun/12 ]

Hi Ihara,

Have you tested patch set 5 at http://review.whamcloud.com/#change,2889? It is a little different from patch set 2. Thanks!

Comment by Shuichi Ihara (Inactive) [ 05/Jun/12 ]

Hongchao, sorry for the delay... I will test the patch soon and will keep you updated here once my testing is done.
Any advice on how I can add a "manual test" flag on http://review.whamcloud.com/#change,2889 after the test is done, if the result is OK?

Comment by Hongchao Zhang [ 05/Jun/12 ]

Hi Ihara, I have added you as one of the reviewers of the patch; you can add review feedback according to the test results. Thanks!

Comment by Shuichi Ihara (Inactive) [ 05/Jun/12 ]

Hi Hongchao,

Just tested the patch with 2.1.2-RC1 (since the LU-969 patches were dropped in 2.1.2-RC2); the latest patch seems to be OK.

2.1.2-RC0 without any patches
# mpirun -np 4 /work/tools/bin/IOR -b 8g -t 1m -F -C -w -e -k -vv -o /lustre/file

Max Write: 2537.56 MiB/sec (2660.83 MB/sec)

# pdsh -a "sync; echo 3 > /proc/sys/vm/drop_caches"

# mpirun -np 4 /work/tools/bin/IOR -b 8g -t 1m -F -C -r -k -vv -o /lustre/file

Max Read:  2848.48 MiB/sec (2986.85 MB/sec)

2.1.2-RC1 + patches
# mpirun -np 4 /work/tools/bin/IOR -b 8g -t 1m -F -C -w -e -k -vv -o /lustre/file

Max Write: 2468.03 MiB/sec (2587.92 MB/sec)

# pdsh -a "sync; echo 3 > /proc/sys/vm/drop_caches"

# mpirun -np 4 /work/tools/bin/IOR -b 8g -t 1m -F -C -r -k -vv -o /lustre/file

Max Read:  2881.06 MiB/sec (3021.01 MB/sec)

Comment by Peter Jones [ 20/Jul/12 ]

Landed for 2.3
