[LU-744] Single client's performance degradation on 2.1 Created: 09/Oct/11  Updated: 13/Mar/14  Resolved: 06/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0, Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Shuichi Ihara (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate Votes: 1
Labels: None

Attachments: Microsoft Word 2.4 Single Client 3May2013.xlsx     PDF File 574.1.pdf     File ior-256gb.tar.gz     File ior-32gb.tar.gz     File lu744-20120909.tar.gz     File lu744-20120915-02.tar.gz     File lu744-20120915.tar.gz     File lu744-20121111.tar.gz     File lu744-20121113.tar.gz     File lu744-20121117.tar.gz     File lu744-20130104-02.tar.gz     File lu744-20130104.tar.gz     File lu744-dls-20121113.tar.gz     File orig-collectl.out     File orig-ior.out     File orig-opreport-l.out     File patched-collectl.out     File patched-ior.out     File patched-opreport-l.out     Microsoft Word single-client-performance.xlsx     Zip Archive stats-1.8.zip     Zip Archive stats-2.1.zip     Zip Archive test-patchset-2.zip     Zip Archive test2-various-version.zip    
Issue Links:
Duplicate
Related
is related to LU-1408 single client's performance regressio... Resolved
is related to LU-1413 difference of single client's perform... Resolved
is related to LU-3321 2.x single thread/process throughput ... Resolved
is related to LU-141 port lustre client page cache shrinke... Resolved
is related to LU-1201 Lustre crypto hash cleanup Resolved
Severity: 3
Rank (Obsolete): 4018

 Description   

During performance testing on Lustre 2.1, I saw a single-client performance degradation.
Here are IOR results on a single client with 2.1 and with lustre-1.8.6.80 for comparison.
I ran IOR (IOR -t 1m -b 32g -w -r -vv -F -o /lustre/ior.out/file) on the single client with 1, 2, 4 and 8 processes.

Write(MiB/sec)
Processes  v1.8.6.80     v2.1
1             446.25    411.43
2             808.53    761.30
4            1484.18   1151.41
8            1967.42   1172.06

Read(MiB/sec)
Processes  v1.8.6.80     v2.1
1             823.90    595.71
2            1449.49   1071.76
4            2502.49   1517.79
8            3133.43   1746.30

Tested on the same infrastructure (hardware and network). Checksums were turned off on the client in both tests.
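
For reference, a minimal sketch of how this process-count sweep could be scripted (the MPI launcher and the cache-drop step are assumptions for illustration; the ticket does not record exactly how the processes were launched):

  # Hypothetical sweep over 1/2/4/8 IOR processes with the options quoted above.
  for np in 1 2 4 8; do
      sync; echo 3 > /proc/sys/vm/drop_caches   # avoid client cache effects between runs
      mpirun -np $np IOR -t 1m -b 32g -w -r -vv -F -o /lustre/ior.out/file
  done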



 Comments   
Comment by Shuichi Ihara (Inactive) [ 09/Oct/11 ]
Here are the IOR results (posted again).

Write(MiB/sec)
v1.8.6.80    v2.1
  446.25    411.43
  808.53    761.30
 1484.18   1151.41
 1967.42   1172.06

Read(MiB/sec)
v1.8.6.80    v2.1
  823.90    595.71
 1449.49   1071.76
 2502.49   1517.79
 3133.43   1746.30

During testing I saw high CPU usage from the ptlrpcd-brw and kswapd processes on 2.1. kswapd showed up occasionally in the 1.8 testing, but not frequently. However, with 2.1, kswapd's CPU usage is consistently high.

(during write testing)
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6922 root      16   0     0    0    0 R 77.5  0.0  13:37.23 ptlrpcd-brw
  409 root      11  -5     0    0    0 R 67.5  0.0  19:26.72 kswapd1
  408 root      10  -5     0    0    0 R 64.5  0.0  20:09.53 kswapd0
13897 root      15   0  190m 7528 2840 R 36.3  0.1   0:52.97 IOR
13898 root      15   0  190m 7516 2828 S 35.6  0.1   0:52.70 IOR
13900 root      15   0  190m 7536 2844 S 35.3  0.1   0:52.12 IOR
13899 root      15   0  191m 7528 2836 S 34.6  0.1   0:54.06 IOR
13902 root      15   0  191m 7524 2828 S 34.6  0.1   0:53.32 IOR
13895 root      15   0  190m 7688 2992 S 33.9  0.1   0:52.92 IOR
13901 root      15   0  191m 7520 2832 R 33.3  0.1   0:53.05 IOR
13896 root      15   0  190m 7516 2832 S 32.9  0.1   0:53.15 IOR
  406 root      15   0     0    0    0 R  4.7  0.0   0:28.83 pdflush
 6915 root      15   0     0    0    0 S  1.0  0.0   0:16.27 kiblnd_sd_02
 6916 root      15   0     0    0    0 S  1.0  0.0   0:16.33 kiblnd_sd_03
 6917 root      15   0     0    0    0 S  1.0  0.0   0:16.17 kiblnd_sd_04
 6918 root      15   0     0    0    0 S  1.0  0.0   0:16.26 kiblnd_sd_05
 6919 root      15   0     0    0    0 S  1.0  0.0   0:16.29 kiblnd_sd_06
 6920 root      15   0     0    0    0 S  1.0  0.0   0:16.33 kiblnd_sd_07
 6913 root      15   0     0    0    0 S  0.7  0.0   0:16.28 kiblnd_sd_00
 6914 root      15   0     0    0    0 S  0.7  0.0   0:16.15 kiblnd_sd_01
13921 root      15   0 12896 1220  824 R  0.3  0.0   0:00.14 top

(during read testing)
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
13896 root      18   0  190m 7540 2856 R 88.3  0.1   1:35.29 IOR
  409 root      10  -5     0    0    0 R 86.6  0.0  20:44.50 kswapd1
13901 root      18   0  191m 7572 2884 R 83.9  0.1   1:40.79 IOR
  408 root      10  -5     0    0    0 R 83.3  0.0  21:23.82 kswapd0
13899 root      18   0  191m 7668 2920 R 81.3  0.1   1:43.45 IOR
13902 root      18   0  191m 7544 2848 R 79.6  0.1   1:43.58 IOR
13898 root      19   0  190m 7536 2848 R 72.7  0.1   1:43.15 IOR
13895 root      18   0  190m 7860 3104 R 70.7  0.1   1:32.06 IOR
 6922 root      15   0     0    0    0 R 66.0  0.0  14:53.78 ptlrpcd-brw
13900 root      23   0  190m 7552 2860 R 48.4  0.1   1:39.15 IOR
13897 root      23   0  190m 7584 2896 R 22.6  0.1   1:33.74 IOR
 6913 root      15   0     0    0    0 S  1.7  0.0   0:17.39 kiblnd_sd_00
 6914 root      15   0     0    0    0 S  1.7  0.0   0:17.24 kiblnd_sd_01
 6917 root      15   0     0    0    0 S  1.7  0.0   0:17.31 kiblnd_sd_04
 6916 root      15   0     0    0    0 S  1.3  0.0   0:17.44 kiblnd_sd_03
 6918 root      15   0     0    0    0 S  1.3  0.0   0:17.40 kiblnd_sd_05
 6919 root      15   0     0    0    0 S  1.3  0.0   0:17.41 kiblnd_sd_06
 6920 root      15   0     0    0    0 S  1.3  0.0   0:17.45 kiblnd_sd_07
 6915 root      15   0     0    0    0 S  1.0  0.0   0:17.39 kiblnd_sd_02
13924 root      15   0 12896 1220  824 R  0.3  0.0   0:00.17 top
    1 root      15   0 10372  632  540 S  0.0  0.0   0:01.66 init

Note: I turned off the Lustre checksum in this testing, so this is not caused by checksum overhead.

Comment by Jinshan Xiong (Inactive) [ 10/Oct/11 ]

A key difference between 2.1 and 1.8 is that there is no cached-memory (max_dirty_mb) limitation in 2.1. This can cause high kswapd CPU usage, but I'm not sure it is the root cause of the performance degradation for this IO-intensive workload. For the read case, the first thing we need to know is the RPC size.

Can you please collect the following information for both the read and write cases, on 1.8 and 2.1 specifically (see the sketch below):
1. `vmstat 1` while running IOR;
2. `lctl get_param osc.lustre-OST*-osc-ffff*.rpc_stats` after IOR finishes (make sure the stats are cleared before starting the test);
3. ideally, oprofile running during the test.

Thanks.
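
For reference, a minimal collection sketch for the items above (assumptions: the rpc_stats path matches the pattern above, and writing a value to rpc_stats resets the counters on this version; adjust paths to the actual system):

  # Clear per-OSC RPC stats before the run (assumption: writing 0 resets them).
  lctl set_param osc.*.rpc_stats=0

  # Log vmstat for the duration of the IOR run.
  vmstat 1 > vmstat-ior.log &
  VMSTAT_PID=$!

  # ... run IOR here ...

  kill $VMSTAT_PID

  # Dump the accumulated RPC size/in-flight histograms after IOR finishes.
  lctl get_param osc.lustre-OST*-osc-ffff*.rpc_stats > rpc_stats.log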

Comment by Oleg Drokin [ 10/Oct/11 ]

I wonder what the raw speed capability of the link is?

We have a caching bug in 1.8 that manifests itself as reads that are too fast if you have just done the writes, even when more was written than fits in RAM.
Often reads come out faster than the link speed - that's how we noticed.

Comment by Shuichi Ihara (Inactive) [ 10/Oct/11 ]

The network between server and client is QDR InfiniBand, so the numbers should be reasonable. Also, the client has only 12GB of memory and I'm writing more data (256GB) than the memory size - there is no cache effect here.
By the way, the servers (4 OSSs) have 16GB of memory each, but the read cache is turned off. Even with it on, the total file size (256GB) is still larger than the servers' combined memory.
I'm going to collect the data Jay requested.

Comment by Oleg Drokin [ 10/Oct/11 ]

I understand your test size is bigger than the client RAM.
In our Oracle testing we found that there is a caching bug in 1.8 that leads to old write data not being discarded. So later on when you read the data, some significant part of it comes from client cache.

This may or may not contribute to the problem you are seeing, of course, and I see the writes are also somewhat slower, which cannot be explained by the caching problem we saw.

Also, just to confirm, this is 4x QDR, right? That can deliver up to 4 gigabytes/sec of useful bandwidth.

Getting the data Jinshan requested is a good start indeed.

Comment by Shuichi Ihara (Inactive) [ 11/Oct/11 ]

Attached are all the stats I collected on 2.1.
I ran the IOR write and read phases separately and captured vmstat, oprofile and rpc_stats during the benchmark.

I will run the same benchmark on 1.8.

Comment by Shuichi Ihara (Inactive) [ 27/Oct/11 ]

Collected vmstat and rpc_stats during the IOR benchmark with lustre-1.8.

oprofile didn't work on this kernel, due to the following error message when I ran opreport:

opreport error: basic_string::_S_construct NULL not valid

Comment by Shuichi Ihara (Inactive) [ 06/Jan/12 ]

Please have a look at the log files and oprofile output, and let me know if you need more information.

Comment by Peter Jones [ 06/Jan/12 ]

Reassign to Jinshan

Comment by Shuichi Ihara (Inactive) [ 09/Feb/12 ]

Tested again on the current master branch. The write numbers are a little improved, but the read numbers are the same and there is still a big gap compared to 1.8.x. I think the current master code should have multiple ptlrpc threads, right? But that doesn't seem to help single-client performance yet.

Processes  Write(MB/s)  Read(MB/s)
1              515          644
2             1041         1172
4             1438         1529
8             1601         1683

Comment by Shuichi Ihara (Inactive) [ 12/Feb/12 ]

From more testing and monitoring of storage IO statistics, it looks like performance is good as long as the total file size is less than the client's memory.
As the total file size approaches and then exceeds the client's memory size, client performance goes down.

Comment by Jinshan Xiong (Inactive) [ 14/Feb/12 ]

We'll address this in 2.3, because of the IO engine work taking place under LU-1030.

Comment by Eric Barton (Inactive) [ 24/Mar/12 ]

Can we confirm this is a client-side issue - e.g. by measuring 1.8 and 2.x clients v. 2.x servers?

Comment by Jinshan Xiong (Inactive) [ 26/Mar/12 ]

To eeb: from what Ihara has seen, performance drops a lot once the file size being written exceeds the memory size, so I think this is likely a client-side issue. However, I haven't seen any other reports of this kind of problem; one reason may be that it goes unnoticed, or that other sites can't generate IO this fast.

Regarding LU-1030, to clarify: it won't itself fix this issue, but it changes OSC behavior so much that any attempt to address the performance issue before it lands is of little use.

I have actually run performance benchmarks many times and did not see this issue. I guess one reason is that I can't generate such high-speed IO with the hardware in our lab.

Comment by Shuichi Ihara (Inactive) [ 30/Mar/12 ]

Here is what I demonstrated. I ran IOR on a single client (12 threads) and collected memory usage and Lustre throughput on the client during the IO. This was tested with Lustre 2.2 RC2 on both the servers and the client.
Once the client's memory usage reaches the total memory size, Lustre IO performance goes down.

# IOR -o /lustre/ior.out/file -b 8g -t 1m -F -C -w -e -vv -k
# sync;echo 3 > /proc/sys/vm/drop_caches
# IOR -o /lustre/ior.out/file -b 8g -t 1m -F -C -r -e -vv -k
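
For reference, a minimal wrapper sketch of the procedure above with collectl logging in the background (log file names are hypothetical; assumes collectl and IOR are in PATH, /lustre is the client mount, and the launcher for the 12 threads is omitted here just as in the commands above):

  collectl -scml > collectl-write.log &    # memory and Lustre client throughput during the write
  MON=$!
  IOR -o /lustre/ior.out/file -b 8g -t 1m -F -C -w -e -vv -k
  kill $MON

  sync; echo 3 > /proc/sys/vm/drop_caches  # drop caches between the write and read phases

  collectl -scml > collectl-read.log &
  MON=$!
  IOR -o /lustre/ior.out/file -b 8g -t 1m -F -C -r -e -vv -k
  kill $MON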

Write Performance

# collectl -scml
waiting for 1 second sample...
#<--------CPU--------><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
   0   0    66     83  46G    0  15M   5M 118M  32M       0      0        0      0
   0   0   152    110  46G    0  15M   5M 118M  32M       0      0        0      0
   0   0  1605    809  46G    0  17M   6M 118M  33M       0      0        0      0
   2   1  8353  19105  46G    0  22M   8M 119M  68M       0      0        0      0
  39  39 14389  22964  44G    0   1G   1G 405M  91M       0      0  1277952   1248
  96  96 29362  47501  41G    0   3G   3G   1G  92M       0      0  2676736   2614
  96  96 29109  46887  38G    0   6G   6G   1G  92M       0      0  2678784   2616
  95  95 28936  46208  35G    0   8G   8G   2G  92M       0      0  2669568   2607
  96  96 28813  46264  32G    0  11G  11G   2G  92M       0      0  2683904   2621
  96  96 27957  43106  29G    0  13G  13G   3G  92M       0      0  2603008   2542
  96  96 29186  47093  25G    0  16G  16G   3G  92M       0      0  2673664   2611
  96  96 28878  46397  22G    0  19G  19G   4G  92M       0      0  2670592   2608
  96  96 28736  46291  19G    0  21G  21G   4G  92M       0      0  2670592   2608
  95  95 29202  47151  16G    0  24G  24G   5G  92M       0      0  2673664   2611
  96  96 27200  42103  13G    0  26G  26G   5G  92M       0      0  2608128   2547
  96  95 28900  46153  10G    0  29G  29G   6G  92M       0      0  2671616   2609
  96  96 28962  46393   7G    0  31G  31G   7G  92M       0      0  2661376   2599
  96  96 28982  46711   4G    0  34G  34G   7G  92M       0      0  2650112   2588
  96  96 27615  43289   1G    0  36G  36G   8G  92M       0      0  2530304   2471
  98  98 27524  34935 183M    0  37G  37G   8G  92M       0      0  1996800   1950
  99  99 24298  30965 227M    0  37G  37G   8G  92M       0      0  1708032   1668
 100 100 24578  31559 276M    0  37G  37G   8G  92M       0      0  1694720   1655
#<--------CPU--------><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
 100  99 24758  32204 194M    0  37G  37G   8G  92M       0      0  1708032   1668
  99  99 24367  30946 184M    0  37G  37G   8G  92M       0      0  1689600   1650
 100 100 24772  31223 222M    0  37G  37G   8G  92M       0      0  1709056   1669
  99  99 24742  31196 224M    0  37G  37G   8G  92M       0      0  1680751   1641
 100 100 24502  31292 285M    0  37G  37G   8G  92M       0      0  1729218   1689
  98  98 23817  31563 186M    0  37G  37G   8G  92M       0      0  1754112   1713
  99  99 26300  32065 203M    0  37G  37G   8G  92M       0      0  1696096   1656
 100  99 23777  30225 274M    0  37G  37G   8G  92M       0      0  1704617   1665
  99  99 24663  31760 259M    0  37G  37G   8G  92M       0      0  1740800   1700
 100 100 24885  32234 221M    0  37G  37G   8G  92M       0      0  1721344   1681
  99  99 23912  30622 206M    0  37G  37G   8G  92M       0      0  1732608   1692
  99  99 25136  32743 184M    0  37G  37G   8G  92M       0      0  1748992   1708
  99  99 24931  31094 218M    0  37G  37G   8G  92M       0      0  1679360   1640
  99  99 28119  33561 221M    0  37G  37G   8G  92M       0      0  1709056   1669
 100 100 24796  32077 201M    0  37G  37G   8G  92M       0      0  1703936   1664
 100  99 24805  32263 196M    0  37G  37G   8G  92M       0      0  1715506   1675
 100 100 24191  30959 185M    0  37G  37G   8G  92M       0      0  1696386   1657
 100  99 23907  30445 203M    0  37G  37G   8G  92M       0      0  1696768   1657
 100 100 24488  31350 276M    0  37G  37G   8G  92M       0      0  1665024   1626
 100  99 28522  32064 231M    0  37G  37G   8G  91M       0      0  1717248   1677
  99  99 24475  30399 296M    0  37G  37G   8G  90M       0      0  1657856   1619
 100 100 24613  31539 232M    0  37G  37G   8G  90M       0      0  1717248   1677
#<--------CPU--------><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
  99  99 23441  29388 203M    0  37G  37G   8G  90M       0      0  1691648   1652
  99  99 24446  30918 194M    0  37G  37G   8G  90M       0      0  1666048   1627
  99  99 23687  29693 224M    0  37G  37G   8G  90M       0      0  1708032   1668
 100 100 24159  30834 207M    0  37G  37G   8G  90M       0      0  1713152   1673
  99  99 23732  29601 260M    0  37G  37G   8G  90M       0      0  1652736   1614
 100 100 24107  30571 259M    0  37G  37G   8G  90M       0      0  1705984   1666
  99  98 27459  31224 268M    0  37G  37G   8G  88M       0      0  1613824   1576
  99  96 24363  31208 190M    0  37G  37G   8G  87M       0      0  1603584   1566
  99  95 23538  28255 206M    0  37G  37G   8G  87M       0      0  1478656   1444
  99  93 22302  26656 242M    0  37G  37G   8G  85M       0      0  1384448   1352
  99  90 20754  22468 217M    0  37G  37G   8G  81M       0      0  1137664   1111
  99  86 17894  16390 216M    0  37G  37G   8G  80M       0      0   833536    814
  99  85 16683  12804 219M    0  37G  37G   8G  80M       0      0   685056    669
  80  67 14944  12993 277M    0  37G  37G   8G  34M       0      0   415744    406
   0   0    67     78 279M    0  37G  37G   8G  32M       0      0        0      0
   0   0    66     83 279M    0  37G  37G   8G  32M       0      0        0      0
   0   0    73     82 279M    0  37G  37G   8G  32M       0      0        0      0
   0   0    58     75 280M    0  37G  37G   8G  32M       0      0        0      0
   0   0    72     79 280M    0  37G  37G   8G  32M       0      0        0      0
   0   0    53     75 281M    0  37G  37G   8G  32M       0      0        0      0

Read Performance

# collectl -scml
waiting for 1 second sample...
#<--------CPU--------><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
   0   0    57     67  46G    0  16M   5M 117M  32M       0      0        0      0
   0   0   125     85  46G    0  16M   5M 117M  32M       0      0        0      0
   2   1  9829  19730  46G    0  22M   8M 119M  68M       0      0        0      0
   1   0   821   1872  46G    0  48M  31M 124M  91M    7168      7        0      0
  91  91 30999  56887  42G    0   2G   2G 722M  92M 2331648   2277        0      0
 100  99 33243  57527  39G    0   5G   5G   1G  92M 2747392   2683        0      0
 100  99 33049  57389  36G    0   7G   7G   1G  92M 2745344   2681        0      0
 100 100 33696  58307  33G    0  10G  10G   2G  92M 2745344   2681        0      0
 100  99 32735  56514  30G    0  13G  13G   2G  92M 2745344   2681        0      0
 100  99 34001  58043  26G    0  15G  15G   3G  92M 2732032   2668        0      0
 100  99 33038  57275  23G    0  18G  18G   4G  92M 2745344   2681        0      0
 100 100 33305  58068  20G    0  21G  21G   4G  92M 2755584   2691        0      0
 100  99 32786  56625  17G    0  23G  23G   5G  92M 2743296   2679        0      0
 100  99 32977  57459  14G    0  26G  26G   5G  92M 2742272   2678        0      0
 100  99 32748  56989  10G    0  28G  28G   6G  92M 2747392   2683        0      0
 100  99 33028  57293   7G    0  31G  31G   6G  92M 2753536   2689        0      0
 100  99 32779  56924   4G    0  34G  34G   7G  92M 2720768   2657        0      0
 100  99 31996  54526   1G    0  36G  36G   8G  92M 2591744   2531        0      0
  99  99 31920  45036 200M    0  37G  37G   8G  92M 2096128   2047        0      0
  99  99 26673  38911 185M    0  37G  37G   8G  92M 1813504   1771        0      0
 100 100 26120  38482 183M    0  37G  37G   8G  92M 1853440   1810        0      0
  99  99 26358  38794 185M    0  37G  37G   8G  92M 1819648   1777        0      0
#<--------CPU--------><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
  99  99 27138  40461 226M    0  37G  37G   8G  92M 1903616   1859        0      0
  99  99 27660  41331 188M    0  37G  37G   8G  92M 1892352   1848        0      0
  99  99 26490  38244 218M    0  37G  37G   8G  92M 1785856   1744        0      0
  99  99 27106  40421 190M    0  37G  37G   8G  92M 1820672   1778        0      0
  99  99 26841  40338 251M    0  37G  37G   8G  92M 1804288   1762        0      0
 100  99 26798  39658 187M    0  37G  37G   8G  92M 1831129   1788        0      0
  99  99 27658  41055 217M    0  37G  37G   8G  92M 1872721   1829        0      0
  99  99 27175  40097 240M    0  37G  37G   8G  92M 1830912   1788        0      0
  99  99 27205  40167 253M    0  37G  37G   8G  92M 1846272   1803        0      0
  99  99 27506  41196 231M    0  37G  37G   8G  92M 1861632   1818        0      0
  99  99 29622  41786 250M    0  37G  37G   8G  92M 1835008   1792        0      0
  99  99 27734  41179 238M    0  37G  37G   8G  92M 1894400   1850        0      0
  99  99 28140  40126 260M    0  37G  37G   8G  92M 1799168   1757        0      0
 100 100 26986  39996 301M    0  37G  37G   8G  92M 1825792   1783        0      0
  99  99 28804  41224 195M    0  37G  37G   8G  92M 1841152   1798        0      0
 100  99 28819  41024 209M    0  37G  37G   8G  92M 1795072   1753        0      0
 100 100 26511  38828 227M    0  37G  37G   8G  92M 1797120   1755        0      0
  99  99 31510  40406 206M    0  37G  37G   8G  91M 1826816   1784        0      0
  99  99 27219  39596 202M    0  37G  37G   8G  90M 1814761   1772        0      0
  98  98 27858  39520 216M    0  37G  37G   8G  90M 1831720   1789        0      0
  99  99 27691  39656 270M    0  37G  37G   8G  90M 1830912   1788        0      0
  99  99 26331  37845 242M    0  37G  37G   8G  90M 1778688   1737        0      0
#<--------CPU--------><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
 100  99 25922  37428 214M    0  37G  37G   8G  90M 1738752   1698        0      0
 100  99 25756  37430 246M    0  37G  37G   8G  90M 1766400   1725        0      0
  99  99 26469  39634 184M    0  37G  37G   8G  90M 1866752   1823        0      0
  99  99 25477  36527 257M    0  37G  37G   8G  90M 1738752   1698        0      0
  99  99 26725  39618 248M    0  37G  37G   8G  90M 1840128   1797        0      0
  99  99 25512  36383 261M    0  37G  37G   8G  90M 1755428   1714        0      0
  99  98 27001  36997 193M    0  37G  37G   8G  87M 1857346   1814        0      0
  99  93 24895  33872 202M    0  37G  37G   8G  84M 1631232   1593        0      0
  64  57 13832  15585 274M    0  37G  37G   8G  34M  711680    695        0      0
   0   0    62     76 276M    0  37G  37G   8G  32M       0      0        0      0
   0   0    76     84 276M    0  37G  37G   8G  32M       0      0        0      0
Comment by Shuichi Ihara (Inactive) [ 30/Mar/12 ]

Here is another set of test results, with the servers running 2.2 but the client running 1.8.7. (Checksums are disabled.)

write test

# collectl -scml
waiting for 1 second sample...
#<--------CPU--------><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
   0   0  1041    165  46G    0 104M  89M 127M  25M       0      0        0      0
   2   1  3580  31174  46G    0 105M  90M 127M  57M       0      0        0      0
   9   8  1357   1582  45G    0 863M 848M 197M  83M       0      0   739336    722
  25  25  7239  31394  42G    0   3G   3G 481M  83M       0      0  2985870   2916
  27  27  7219  33245  39G    0   6G   6G 810M  84M       0      0  3274752   3198
  29  28  7069  33014  35G    0   8G   9G   1G  84M       0      0  3337973   3260
  29  29  7190  31910  32G    0  11G  13G   1G  84M       0      0  3326976   3249
  30  29  7064  32743  29G    0  13G  16G   1G  84M       0      0  3352576   3274
  29  29  6250  26734  31G    0  14G  14G   1G  84M       0      0  2642944   2581
  31  31  6881  32248  28G    0  16G  16G   1G  84M       0      0  3367936   3289
  33  33  7000  31743  28G    0  17G  16G   1G  84M       0      0  3303424   3226
  33  32  6981  31856  26G    0  18G  18G   1G  84M       0      0  3352576   3274
  36  36  6858  31487  25G    0  18G  19G   1G  84M       0      0  3354624   3276
  28  28  6219  26187  25G    0  19G  19G   1G  84M       0      0  2723840   2660
  31  31  7111  33539  23G    0  21G  21G   2G  84M       0      0  3350528   3272
  40  40  7015  31281  24G    0  19G  19G   2G  84M       0      0  3338240   3260
  38  38  6909  30377  25G    0  19G  19G   2G  84M       0      0  3331072   3253
  29  29  6945  31835  22G    0  22G  22G   2G  84M       0      0  3314688   3237
  37  37  6427  26715  24G    0  20G  20G   2G  84M       0      0  2864945   2798
  37  37  6656  29771  24G    0  20G  20G   2G  84M       0      0  3322379   3245
  35  35  6789  30431  23G    0  21G  21G   2G  84M       0      0  3369257   3290
  38  38  6850  30454  23G    0  21G  21G   2G  84M       0      0  3315469   3238
#<--------CPU--------><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
  41  41  7005  29638  25G    0  19G  19G   2G  84M       0      0  3340288   3262
  36  36  6219  25574  26G    0  18G  18G   1G  84M       0      0  2834194   2768
  35  35  6753  30124  25G    0  19G  19G   1G  84M       0      0  3354624   3276
  41  41  6823  30396  27G    0  17G  17G   1G  84M       0      0  3360768   3282
  34  34  6876  30411  25G    0  19G  19G   2G  84M       0      0  3308544   3231
  35  33  7409  36583  22G    0  22G  21G   2G  83M       0      0  3203964   3129
  51  48  7302  27904  23G    0  21G  21G   2G  82M       0      0  2851619   2785
  53  49  8236  38922  23G    0  21G  21G   2G  80M       0      0  3172372   3098
  73  63  7348  27435  22G    0  22G  22G   2G  79M       0      0  2996569   2926
  73  63  7155  26827  22G    0  22G  22G   2G  79M       0      0  3010498   2940
  67  57  7516  30396  19G    0  25G  25G   2G  79M       0      0  3147625   3074
  65  55  7078  27202  16G    0  27G  27G   2G  79M       0      0  2726901   2663
  41  34  5189  29696  15G    0  28G  28G   2G  27M       0      0  1465344   1431
   0   0  1015     81  15G    0  28G  28G   2G  27M       0      0        0      0
   0   0  1002     60  15G    0  28G  28G   2G  25M       0      0        0      0
   0   0  1003     42  15G    0  28G  28G   2G  25M       0      0        0      0
   0   0  1002     50  15G    0  28G  28G   2G  25M       0      0        0      0
   0   0  1006     40  15G    0  28G  28G   2G  25M       0      0        0      0

Read test

# collectl -scml
waiting for 1 second sample...
#<--------CPU--------><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
   0   0  1490    161  46G    0  18M  14M 115M  24M       0      0        0      0
   2   0  7539  19019  46G    0  24M  17M 117M  58M       0      0        0      0
  14  13  7356  29789  45G    0   1G   1G 214M  82M 1507328   1472        0      0
  27  27 14755  62755  42G    0   3G   3G 467M  84M 3478528   3397        0      0
  26  26 15313  65755  39G    0   6G   6G 784M  84M 3476480   3395        0      0
  25  25 15640  65148  35G    0  10G  10G   1G  84M 3471760   3390        0      0
  26  26 15144  63964  31G    0  13G  13G   1G  84M 3474432   3393        0      0
  26  26 15456  65563  28G    0  16G  16G   1G  84M 3462777   3382        0      0
  32  32 15122  64064  27G    0  17G  17G   1G  84M 3481600   3400        0      0
  28  28 14963  62571  24G    0  20G  19G   1G  84M 3471360   3390        0      0
  39  39 14595  61223  27G    0  17G  17G   1G  84M 3484672   3403        0      0
  33  33 14488  61248  26G    0  18G  18G   1G  84M 3482624   3401        0      0
  34  34 14629  61651  26G    0  19G  19G   1G  84M 3477504   3396        0      0
  31  31 14135  59877  23G    0  20G  20G   2G  84M 3473408   3392        0      0
  29  29 14266  61184  21G    0  23G  23G   2G  84M 3458048   3377        0      0
  41  41 14247  59518  24G    0  20G  20G   2G  84M 3504128   3422        0      0
  30  30 13945  60057  22G    0  22G  22G   2G  84M 3464823   3384        0      0
  36  36 14128  62603  22G    0  22G  22G   2G  84M 3479960   3398        0      0
  41  41 13759  59870  24G    0  20G  20G   2G  84M 3483648   3402        0      0
  36  36 14322  62899  24G    0  20G  20G   2G  84M 3457024   3376        0      0
  38  38 14106  60171  25G    0  19G  19G   1G  84M 3508811   3427        0      0
  31  31 13888  60317  23G    0  21G  21G   2G  84M 3483447   3402        0      0
#<--------CPU--------><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
  36  36 13804  59705  23G    0  21G  21G   2G  84M 3480576   3399        0      0
  35  35 13627  58586  23G    0  21G  21G   2G  84M 3474432   3393        0      0
  38  38 14811  62796  23G    0  20G  20G   2G  84M 3490397   3409        0      0
  38  38 14070  60788  24G    0  20G  20G   2G  84M 3468288   3387        0      0
  44  43 14348  61167  24G    0  20G  20G   2G  82M 3470961   3390        0      0
  57  51 14323  59216  23G    0  21G  21G   2G  81M 3462534   3381        0      0
  66  58 13738  57619  21G    0  23G  23G   2G  79M 3464823   3384        0      0
  79  67 13357  55589  19G    0  24G  24G   2G  78M 3427996   3348        0      0
  76  64 13451  56659  16G    0  28G  28G   2G  77M 3432144   3352        0      0
  65  52  8608  35219  14G    0  29G  29G   2G  27M 1908736   1864        0      0
   0   0  1004     40  14G    0  29G  29G   2G  27M       0      0        0      0
   0   0  1004     56  14G    0  29G  29G   2G  24M       0      0        0      0
   0   0  1003     42  14G    0  29G  29G   2G  24M       0      0        0      0
   0   0  1004     46  14G    0  29G  29G   2G  24M       0      0        0      0
   0   0  1004     42  14G    0  29G  29G   2G  24M       0      0        0      0
Comment by Jinshan Xiong (Inactive) [ 30/Mar/12 ]

I guess this is because there is no LRU for async pages in 2.x clients. The LRU mechanism in the 1.8 client is far too complex, so my idea is to limit the number of cached pages at the OSC layer.

Comment by Jinshan Xiong (Inactive) [ 06/Apr/12 ]

I'm working on a workaround patch to limit the max caching pages per OSC.

Comment by Jinshan Xiong (Inactive) [ 11/Apr/12 ]

Hi Ihara,

Can you please try the patch at http://review.whamcloud.com/2514 to see if it helps? Please note that this patch is for debugging purposes only and shouldn't be applied to a production system. Also, please collect memory usage statistics and oprofile results while you're running the test. Thanks.

Comment by Minh Diep [ 12/Apr/12 ]

Here is the data from running IOR file-per-process on Hyperion.

Server: lustre 2.1.0/rhel5/x86_64
Clients tested: lustre 2.2.0/rhel5/x86_64 and lustre 1.8.7/rhel5/x86_64
Number of clients: 1
Transfer size: 1M
Block size: 192G / #threads
Total file size: 192G

Write:

Threads   1.8.7   2.2.0
 1          178     245
 2          296     328
 4          568     434
 8          608     492
12          778     498
16          677     489

Read:

Threads   1.8.7   2.2.0
 1          154     250
 2          272     329
 4          393     426
 8          402     447
12          400     454
16          391     455

Comment by Jinshan Xiong (Inactive) [ 12/Apr/12 ]

Thank you, Minh. I guess fast IO is necessary to reproduce this problem. How many OSS nodes are there on Hyperion, and what is their peak IO speed?

Comment by Christopher Morrone [ 12/Apr/12 ]

I don't know what they are currently using, but we have more than enough hardware available to swamp one client's QDR IB link.

I know there are at least 18 (soon to be 20) NetApp 60-bay enclosures with dual controllers, so the hyperion folks can set up more Lustre servers if needed.

Comment by Shuichi Ihara (Inactive) [ 13/Apr/12 ]

Jay,
Thank you for posting the patches for testing/debugging.
I'm traveling now, but will test these patches as soon as I'm back.

Comment by Minh Diep [ 13/Apr/12 ]

On chaos4, we have 4 OSSs, each with 2 LUNs connected to a DDN 9550.

Here is an obdfilter-survey from one of the OSSs:

Tue Apr 10 11:42:35 PDT 2012 Obdfilter-survey for case=disk from hyperion1155
ost 2 sz 33554432K rsz 1024K obj 2 thr 2 write 293.83 [ 133.86, 157.84] rewrite 284.56 [ 127.87, 149.85] read 307.96 [ 137.86, 165.83]
ost 2 sz 33554432K rsz 1024K obj 2 thr 4 write 414.07 [ 156.84, 271.72] rewrite 387.44 [ 131.86, 239.76] read 384.67 [ 155.84, 263.73]
ost 2 sz 33554432K rsz 1024K obj 2 thr 8 write 513.81 [ 142.85, 364.63] rewrite 480.19 [ 189.97, 272.73] read 437.79 [ 181.90, 249.75]
ost 2 sz 33554432K rsz 1024K obj 2 thr 16 write 545.77 [ 158.84, 363.63] rewrite 491.09 [ 151.84, 319.67] read 454.01 [ 205.79, 247.75]
ost 2 sz 33554432K rsz 1024K obj 2 thr 32 write 554.57 [ 126.87, 403.59] rewrite 469.79 [ 153.84, 306.69] read 427.92 [ 194.80, 233.76]
ost 2 sz 33554432K rsz 1024K obj 4 thr 4 write 359.17 [ 120.88, 252.74] rewrite 410.65 [ 151.85, 239.76] read 417.68 [ 169.83, 227.77]
ost 2 sz 33554432K rsz 1024K obj 4 thr 8 write 506.46 [ 160.84, 411.58] rewrite 506.73 [ 173.82, 324.67] read 459.12 [ 191.80, 255.74]
ost 2 sz 33554432K rsz 1024K obj 4 thr 16 write 587.72 [ 225.84, 362.63] rewrite 528.92 [ 178.82, 318.68] read 443.22 [ 197.80, 249.75]
ost 2 sz 33554432K rsz 1024K obj 4 thr 32 write 572.57 [ 190.81, 394.60] rewrite 497.53 [ 185.81, 316.68] read 400.52 [ 144.85, 331.66]
ost 2 sz 33554432K rsz 1024K obj 8 thr 8 write 476.39 [ 191.80, 268.75] rewrite 531.77 [ 202.79, 314.68] read 449.59 [ 180.82, 260.73]
ost 2 sz 33554432K rsz 1024K obj 8 thr 16 write 530.49 [ 187.81, 313.68] rewrite 529.15 [ 188.81, 323.67] read 463.28 [ 191.80, 253.74]
ost 2 sz 33554432K rsz 1024K obj 8 thr 32 write 572.40 [ 203.79, 335.66] rewrite 486.10 [ 162.83, 309.69] read 431.22 [ 189.81, 254.74]
ost 2 sz 33554432K rsz 1024K obj 16 thr 16 write 520.57 [ 194.80, 306.69] rewrite 469.09 [ 134.86, 291.70] read 460.47 [ 199.80, 254.74]
ost 2 sz 33554432K rsz 1024K obj 16 thr 32 write 600.63 [ 199.80, 355.64] rewrite 492.24 [ 127.87, 323.68] read 428.18 [ 103.89, 249.75]
ost 2 sz 33554432K rsz 1024K obj 32 thr 32 write 567.20 [ 225.77, 313.68] rewrite 457.29 [ 147.85, 296.70] read 428.57 [ 183.81, 236.76]
done!

Comment by Shuichi Ihara (Inactive) [ 16/Apr/12 ]

Attached are the IOR results, memory usage and oprofile output from running the IOR benchmark on the original 2.2 and the patched 2.2.
Tested on a single client (QDR InfiniBand, 48GB memory) with 12 IOR threads. The total file size is 384GB (32GB x 12 threads).

original 2.2
Max Write: 1708.26 MiB/sec (1791.24 MB/sec)
Max Read:  1656.73 MiB/sec (1737.21 MB/sec)

patched 2.2
Max Write: 2028.24 MiB/sec (2126.76 MB/sec)
Max Read:  2179.34 MiB/sec (2285.21 MB/sec)
Comment by Jinshan Xiong (Inactive) [ 16/Apr/12 ]

Thanks for the test, Ihara.

Currently each OSC uses at most 128M of memory, and that looks too small in your case, especially for reads.

Please try patch set 2 (http://review.whamcloud.com/2514), where you can set how much memory will be used for the cache, for example:

lctl set_param osc.<osc1>.max_cache_mb=256

to make osc1 use 256M of memory for caching.

Also, you forgot to tell oprofile where to find the Lustre object files, so it couldn't map addresses to symbol names.
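
A small usage sketch for the tunable above, applying the limit to every OSC with the wildcard syntax used elsewhere in this ticket and reading the values back (the 256 value is only an example):

  # Set the per-OSC cache limit from the debug patch on all OSCs, then verify.
  lctl set_param osc.*.max_cache_mb=256
  lctl get_param osc.*.max_cache_mb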

Comment by Shuichi Ihara (Inactive) [ 16/Apr/12 ]

Hi Jay,
I just tested the second patch set, with max_cache_mb set to 256, 512, 1024 and 2048.

cache_mb   Write(MB/sec) Read(MB/sec)
   256     2108.90       2263.33
   512     2189.78       2266.49
  1024     2353.56       2318.94
  2048     2330.62       2313.96

It still looks lower than the case where the file size is less than the client's memory size. The attachment includes IOR results, memory usage and oprofile results for each test.

Sorry about the previous oprofile results - I had not pointed oprofile at the kernel modules. Does this one contain what you need?

Comment by Jinshan Xiong (Inactive) [ 16/Apr/12 ]

Hi Ihara, thanks. Can you please refresh my memory: what was the performance when writing/reading a file smaller than memory size?

It looks like the performance data varies a lot from run to run - is this because different clients were used? In that case, it may make more sense to run patched/unpatched/b1_8, along with the case where the file size is less than memory, on the same kind of client to make comparison easier.

Yes, this oprofile result is better, but it would be even better to print the instruction addresses the CPU was busy on (I forget the opreport option). However, I suspect contention on the client_obd_list_lock will be significant in this case, and I'm fixing that in LU-1030.

Comment by Shuichi Ihara (Inactive) [ 17/Apr/12 ]

Regarding the previous 2.2 numbers: I tested 2.2 without the patch on the same client and confirmed that performance is better when the file size is less than the client's memory.

Anyway, I just tested various versions again on the same hardware.
Server: lustre-2.2
Client: lustre-1.8.7.80, lustre-2.1.1 (with and without patch), 2.2.0 (with and without patch)
Checksums disabled.
lctl set_param osc.<osc1>.max_cache_mb=1024m (when using a patched client)

Here is results.

Version     Write(MB/s)  Read(MB/s)
1.8.7.80      3030        3589
2.1.1         1843        2466
2.1.1/patch   1863        2384
2.2           2012        2151      
2.2/patch     2360        2398

The test is simple: it runs IOR on a single client with 12 threads.
IOR.bin -o /lustre/ior.out/file -b 96g -t 1m -F -C -w -e -vv -k (write)
sync; echo 3 > /proc/sys/vm/drop_caches (on all servers and the client)
IOR.bin -o /lustre/ior.out/file -b ${fsize}g -t 1m -F -C -r -e -vv (read)

The attachment includes all IOR results, oprofile output and memory usage.

You can see the following results in 2.2.0/collectl.out.
While there is enough free memory, write speed is around 3GB/sec, but once usage reaches the memory size, write speed drops to 2GB/sec.

#<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
   0   0  1289    799  45G  21M  47M  19M 182M  72M       0      0        0      0
   0   0   927    644  45G  21M  47M  19M 182M  72M       0      0        0      0
   1   1 12546   1805  45G  21M  51M  22M 182M 106M       0      0        0      0
  47  46  389K  57064  43G  23M   2G   2G 723M 138M       0      0  2433024   2376
  63  62  495K  68661  39G  23M   5G   5G   1G 140M       0      0  3064832   2993
  64  63  491K  64459  35G  23M   8G   8G   2G 140M       0      0  3028992   2958
  64  63  474K  62071  32G  23M  11G  11G   2G 140M       0      0  3011584   2941
  64  63  472K  61732  28G  23M  13G  13G   3G 140M       0      0  3014656   2944
  72  72  526K  46775  25G  23M  16G  16G   3G 140M       0      0  2934784   2866
  63  63  456K  57693  22G  23M  19G  19G   4G 140M       0      0  2902016   2834
  64  63  458K  62259  18G  23M  22G  22G   5G 140M       0      0  2960384   2891
  63  63  454K  67528  15G  23M  25G  25G   5G 140M       0      0  3014656   2944
  63  63  429K  62035  11G  23M  28G  28G   6G 140M       0      0  2989056   2919
  71  71  467K  49752   8G  23M  30G  30G   6G 140M       0      0  2865152   2798
  64  63  425K  61280   4G  23M  33G  33G   7G 140M       0      0  2962432   2893
  64  64  440K  60282   1G  23M  36G  36G   8G 140M       0      0  2936832   2868
  63  63  395K  43352 226M  19M  37G  37G   8G 140M       0      0  2059264   2011
  64  64  366K  34137 185M  18M  37G  37G   8G 140M       0      0  1609728   1572

With the patches, free memory holds steady, but write speed stays around 2.3GB/sec.

#<----CPU[HYPER]-----><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
   0   0   993    650  45G  17M  44M  20M 181M  72M       0      0        0      0
   1   0 10327   1534  45G  17M  47M  23M 181M 107M       0      0        0      0
  19  19  149K  19335  44G  19M 956M 933M 389M 132M       0      0   920576    899
  64  64  458K  62063  41G  19M   3G   3G   1G 136M       0      0  2990080   2920
  64  64  383K  62531  37G  19M   6G   6G   1G 136M       0      0  3001344   2931
  64  63  400K  64377  34G  19M   9G   9G   2G 136M       0      0  3001344   2931
  63  63  372K  60830  30G  19M  12G  12G   2G 136M       0      0  2925568   2857
  65  64  372K  43364  31G  19M  12G  12G   2G 136M       0      0  2252800   2200
  63  63  350K  50420  31G  19M  12G  12G   2G 136M       0      0  2247680   2195
  61  60  336K  54894  31G  19M  12G  12G   2G 136M       0      0  2313216   2259
  60  60  336K  55181  31G  20M  12G  12G   2G 136M       0      0  2310144   2256
  61  60  338K  55468  31G  20M  12G  12G   2G 136M       0      0  2306048   2252
  63  63  351K  51601  31G  20M  12G  12G   2G 136M       0      0  2278400   2225
  61  60  318K  49503  31G  20M  12G  12G   2G 137M       0      0  2285568   2232
  61  60  332K  50634  31G  20M  12G  12G   2G 137M       0      0  2300928   2247
  61  60  332K  50507  31G  20M  12G  12G   2G 137M       0      0  2302721   2249
  61  61  350K  53943  31G  20M  12G  12G   2G 137M       0      0  2296056   2242
  64  63  336K  50402  31G  20M  12G  12G   2G 137M       0      0  2273280   2220
  61  60  296K  49321  31G  20M  12G  12G   2G 137M       0      0  2291712   2238
Comment by Andreas Dilger [ 10/May/12 ]

In bug LU-1201 there is also http://jira.whamcloud.com/secure/attachment/11303/lustre-singleclient-comparison.xlsx with a good comparison of single-client performance with and without checksums enabled.

Comment by Andreas Dilger [ 06/Jun/12 ]

The problem I see with this patch is that it is moving in the wrong direction. Administrators want to be able to specify the cache limit for all Lustre filesystems on a node, while adding a cache limit per OSC doesn't really improve anything for them.

At a site like LLNL, they have over 3000 OSCs on the client, so any per-OSC limit will either have to be so small that it hurts performance, or it will be so large that it is much larger than the total RAM, and effectively no limit at all and only adding extra overhead to do useless LRU management.

I'd rather see more effort put into understanding why Lustre IO pages do not work well with the Linux VM page cache management, and fix that. This will provide global cache management, avoid memory pressure for all users, and will keep improving as the Linux page cache management improves.

Comment by Jinshan Xiong (Inactive) [ 07/Jun/12 ]

From the oprofile and other stats it was obvious that the CPU was busy evicting pages (kswapd used 100% CPU). Based on this, I think the problem is that kswapd couldn't free cached pages as fast as the RPC engine wrote dirty pages back (otherwise the writing processes would be blocked waiting on obd_dirty_pages). Eventually there were no free pages left in the system and the writing processes were choked up freeing pages themselves, which degraded write performance a lot.

An obvious way to fix this is to free pages while the processes are producing them; this way we distribute the overhead of freeing pages across every write syscall and also limit the total memory consumed by Lustre. This is why I worked out this patch, and since I agree with the problem you mentioned I didn't start the landing process. In any case, with this patch applied, performance improved ~20% and I no longer see 100% CPU time in kswapd.

However, this is based on an educated guess and I'm not sure it is correct. Can you please elaborate on your idea? I would be happy to verify and implement it. Thanks.

Comment by Andreas Dilger [ 08/Jun/12 ]

In the oprofile results during page eviction, are there any functions that show up as being very expensive that might be optimized? In the past there were code paths in CLIO that did too many expensive locking operations, and there may still be some paths that can be improved. My gut feeling is that we are keeping pages "active" somehow (references, pgcache_balance, etc) that makes it harder for the kernel to clear Lustre pages. I tried looking into this a bit, but there aren't very good statistics for seeing how many pages are in use (pgcache_balance is missing from 2.x clients, and dump_page_cache is too verbose).

Also, in the normal kernel code paths, I believe that kswapd is rarely doing page cleaning. Instead, this is normally handled on a per-block-device basis, so that it can be done in parallel on both the block devices and the CPUs. Is there some way that we could get ptlrpcd to do page cleaning itself, or re-introduce the per-CPU LRU as was done in 1.8?

Comment by Jinshan Xiong (Inactive) [ 12/Jun/12 ]

From the oprofile results, the busiest part was osc_teardown_async_page(), which is called to destroy a cached page. I did find that some cl_env-related code ate a lot of CPU time as well, but replacing it by caching cl_env in journal_info didn't help, so I believe that is not the problem.

I don't think the number of dirty pages is the problem, because they're limited by obd_dirty_pages and cl_dirty_max per OSC, which is at most 384M on Ihara's node. So I guess the problem is that kswapd was evicting cached pages too slowly - remember that kswapd is per NUMA node (correct me if I'm wrong); in other words, if there were a per-CPU kswapd daemon, we wouldn't see this problem at all.

The per-CPU LRU code was complex in 1.8 - sorry about that, since I was the one who implemented it. So I wanted to work out something simpler to address the problem, but first of all I needed to know that the patch does fix it (it did: in Ihara's test the performance no longer dropped when the file size exceeded memory size). Based on this result, the next step will be to address the problem when there are many OSTs.

Comment by Gregoire Pichon [ 25/Jun/12 ]

We have also seen this single-client performance degradation at customer sites (Tera100, for instance) and in Bull's R&D lab, and would be interested in having a fix provided for b2_1.

Please note that we can help test new versions of a patch when they are available.

Comment by Jinshan Xiong (Inactive) [ 25/Jun/12 ]

Hi Pichon, I'm working on making the fix for this issue production-ready.

As usual, before working on new hardware we need to understand the performance of the current code, so that we can tell later whether our fix is really working.

Can you please run the performance benchmark with the following branches/patches (see the sketch below):

1. 1.8
2. 2.1
3. 2.1 + LU-744 (patch 2929)
4. master
5. master + LU-744 (patch 2514)

Please make sure the file size is far bigger than the memory size, and run the test with the optimal block size (stripe_size * number of OSTs) and with different thread counts.
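
For reference, a hedged sketch of one benchmark pass to repeat per client version listed above (the thread counts, block size and mount point are illustrative; pick the block size so the aggregate file size far exceeds client RAM, per the note above):

  # One pass: sweep thread counts, dropping caches between the write and read phases.
  for nthreads in 1 2 4 8 16; do
      mpirun -np $nthreads IOR -b 64g -t 1m -F -C -w -e -vv -k -o /lustre/ior.out/file
      sync; echo 3 > /proc/sys/vm/drop_caches
      mpirun -np $nthreads IOR -b 64g -t 1m -F -C -r -e -vv -o /lustre/ior.out/file
  done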

Comment by Gregoire Pichon [ 29/Jun/12 ]

Jinshan,

I will not be able to run the performance benchmark on the 1.8 release, since that release has never been integrated into the Bull distribution.

However, we can still compare single-client performance in Lustre 2.1 with and without the patch, with both a small application workload (application memory + application page cache less than the client's memory) and a large one (twice the client's memory).

Once we have integrated lustre master (in a few weeks) we could run the same kind of tests.

By the way, I think Shuichi Ihara has already provided comprehensive results on various branches. Do you need additional information or results to help your current development?

Comment by Jinshan Xiong (Inactive) [ 29/Jun/12 ]

I think a performance benchmark on master is necessary because the new RPC engine is not landed in 2.1. That work affects performance a lot, so I think we should base the performance improvement on it.

The reason I also want performance results on 1.8 is that people always compare performance between 1.8 and 2.x, so it would be helpful to have those numbers.

Just in case: when I talk about different versions, I only mean the version on the client side. You can use the same server version the whole time, and you will need only one client to run these tests.

Comment by Shuichi Ihara (Inactive) [ 01/Jul/12 ]

Sorry for the long pause on this work; I will resume these benchmarks this week.
I will keep posting updates here.

Thanks!

Comment by Shuichi Ihara (Inactive) [ 08/Jul/12 ]

Hi Jay,

Here are new test results. I tested on a Sandy Bridge server with PCIe gen3 and FDR, which means we have
more bandwidth. But we are seeing an even bigger performance gap between b1_8 and master (and also b2_1).

This is still single-client testing: I ran 12 IOR threads on the single client,
with each thread writing/reading an 84GB file (total file size > 1TB).

Please see the attachment. Here is a quick summary:

1. With a b1_8 client, regardless of whether the servers run 2.1.2 or master, we get mostly the same
performance as b1_8 (server and client).
2. With 2.1.2 and master clients, we see a 42-55% performance regression.
3. The LU-744 patch for 2.1.2 didn't help.
4. The LU-744 patch for master helps a little, but there is still a 34% regression.

Comment by Jinshan Xiong (Inactive) [ 09/Jul/12 ]

Ihara,

Thank you very much for the test results - this is helpful.

From the test results, the new RPC engine improved performance significantly on your node; the purpose of LU-744 is to avoid the frustration of writing a file larger than the memory size.

The next step for me is to generalize the LU-744 patch so that it works with a large number of OSCs. I will check with the senior engineers at Whamcloud on the solution. After that, I will focus on the performance issue again.

Comment by Jinshan Xiong (Inactive) [ 19/Jul/12 ]

I've pushed patch set 13 to http://review.whamcloud.com/2514; this patch should address the problem of having too many OSCs. Please give it a try.

Comment by James A Simmons [ 20/Jul/12 ]

So is this work going to go into 2.3?

Comment by Peter Jones [ 20/Jul/12 ]

James

We'd love to have this in 2.3 (and even 2.1.3) but we'll have to see when it is ready. At the moment we are still iterating to find a fix that is suitable for production.

Peter

Comment by Jinshan Xiong (Inactive) [ 20/Jul/12 ]

Hi James,

I just pushed patch set 15 which is pretty close to production use. Please give it a try if you're interested.

Comment by James A Simmons [ 20/Jul/12 ]

Integrated into our image. Will do regression testing then a performance evaluation after.

Comment by Shuichi Ihara (Inactive) [ 05/Aug/12 ]

Tested again with 1.8.8 and master with the latest LU-744 patches (patch set 16), but there was no big performance improvement compared to the previous results.

4 x Server: E5-2670 2.6GHz, 16 CPU cores, 64GB memory, FDR InfiniBand, lustre-master (2.2.92) + LU-744 patches
1 x Client: E5-2670 2.6GHz, 16 CPU cores, 64GB memory, FDR InfiniBand, lustre-master (2.2.92) + LU-744 patches or lustre-1.8.8

Lustre params:
lctl set_param osc.*.max_rpcs_in_flight=256
lctl set_param osc.*.checksums=0

Write/Read total 1TB files (64GB x 16 threads)
# mpirun -np 16 IOR -b 64g -t 1m -F -C -w -r -e -vv -o /lustre/ior.out/file

lustre-1.8.8 based client
Max Write: 4580.87 MiB/sec (4803.39 MB/sec)
Max Read:  3794.79 MiB/sec (3979.12 MB/sec)

master(2.2.92)+LU-744 (patchset 16) patches
Max Write: 2661.68 MiB/sec (2790.97 MB/sec)
Max Read:  2100.93 MiB/sec (2202.98 MB/sec)

Comment by Jinshan Xiong (Inactive) [ 17/Aug/12 ]

Hi Ihara,

LLNL has seen a huge performance improvement with the patch at http://review.whamcloud.com/3627. Can you please apply that patch along with the LU-744 patches and see if it helps on your machine?

Comment by Shuichi Ihara (Inactive) [ 19/Aug/12 ]

will test and update results soon. Thanks!

Comment by Shuichi Ihara (Inactive) [ 26/Aug/12 ]

Hi Jay,

Tested the latest two LU-744 patches and LU-1666 on the single client (16 CPUs, 64GB memory, FDR).

We saw some improvement with the patches, but the overall behavior was the same: while the total file size (32GB) is smaller than the client's memory we get 5.1GB/sec, but once there is no free memory left on the client, performance drops to 2.7GB/sec.

I collected oprofile and collectl data (memory usage and client throughput) during both IOR runs (total file sizes of 32GB and 256GB).

Comment by Jinshan Xiong (Inactive) [ 30/Aug/12 ]

From the opreport for the 256GB run, obdclass.ko in aggregate consumes 37.3% of the CPU. However, it lacks function names - I guess you missed some switches for opreport.

Usually I run opreport as follows:

opreport -alwdg -p /lib/modules/`uname -r`/updates/kernel/fs/lustre -s sample -o out.txt
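
For context, a hedged sketch of a full oprofile session around an IOR run (assumes the legacy opcontrol front end; the vmlinux and module paths are illustrative and must match the installed debuginfo):

  opcontrol --init
  opcontrol --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux
  opcontrol --start

  # ... run IOR here ...

  opcontrol --stop
  opcontrol --dump
  opreport -alwdg -p /lib/modules/$(uname -r)/updates/kernel/fs/lustre -s sample -o out.txt
  opcontrol --shutdown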

Comment by Nathan Rutman [ 07/Sep/12 ]

We've noticed that client memory swapping with 2.x causes a significant performance loss. Attached is a graph of some "dd" operations against Lustre, with and without sysctl vm.drop_caches=1 in between. The scales are memory bytes versus time.

Comment by Jinshan Xiong (Inactive) [ 07/Sep/12 ]

Hi Nathan, there is an LRU for llite pages in 1.x, so it would be interesting to figure out whether it is the Lustre LRU or the kernel's page-reclaim process that frees most of the pages in 1.8. It would also be helpful to see how this behaves with the patch in this ticket.

Comment by Shuichi Ihara (Inactive) [ 08/Sep/12 ]

Attached are re-test results (with the latest LU-744 and LU-1666 patches), writing a total file size (256GB) four times larger than the client's memory. The fixed opreport outputs are included.

Comment by Jinshan Xiong (Inactive) [ 10/Sep/12 ]

Hi Ihara, I pushed a combined patch at http://review.whamcloud.com/3924 with some changes to remove contention on the cs_pages stats. Please benchmark it and collect collectl stats and oprofile output with the -alwdgp switches. Thanks.

Comment by Shuichi Ihara (Inactive) [ 14/Sep/12 ]

Hi Jay, sure, I will test with the latest RPMs and get back to you soon.

Comment by Shuichi Ihara (Inactive) [ 15/Sep/12 ]
#<--------CPU--------><-----------Memory-----------><--------Lustre Client-------->
#cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map  KBRead  Reads  KBWrite Writes
...
  98  97  513K  74444  43G   4M  14G  14G   3G 189M       0      0  6201344   6056
  96  95  498K  72091  36G   4M  20G  20G   4G 189M       0      0  6027264   5886
  95  94  493K  71878  29G   4M  26G  26G   5G 189M       0      0  6011904   5871
  97  97  503K  64727  22G   4M  31G  31G   6G 189M       0      0  6089728   5947
  96  95  488K  56319  15G   4M  37G  37G   8G 189M       0      0  6054912   5913
  96  95  487K  56600   8G   4M  43G  43G   9G 189M       0      0  6083584   5941
...

As the output above shows, while the client still has more than about 4GB of free memory, this is also really improved: 5.9GB/sec per client, which is essentially the bandwidth I measured between server and client in RDMA bandwidth tests and LNet selftest over FDR!

But after the client exceeds its memory size, throughput goes down to 4GB/sec. That is also a big improvement over previous results, but still slower than b1_8. I'm attaching all the collected information (collectl, IOR results and opreport output).

Comment by Andreas Dilger [ 15/Sep/12 ]

5.9GB/sec per client that is actually I got bandwidth between server and client on the RDMA bandwidth testing and lnet selftesting on FDR!

But, after the client exceeds the memory size, it goes down to 4GB/sec.

Ihara, I think you & Jinshan just set a new record for single-client IO performance with Lustre.

Looking at the memory usage, it does seem that most of the memory is in inactive, but doesn't even start to get cleaned up in the 7s it takes to fill the memory, let alone being cleaned up at the rate that Lustre is writing it. I'm assuming that the collectl output above for "Lustre Client" is real data RPCs sent over the network, since it definitely shouldn't be caching nearly so much data, so it isn't a case of "data going directly into cache, then getting slower when it starts writing out cache".

Also of interest is that the "Slab" usage is growing by 1GB/s. That is 1/6 of the memory used by the cache, and a sign of high memory overhead from Lustre for each page of dirty data and/or RPCs. While not directly related to this bug, if Lustre used less memory for itself it would delay the time before the memory ran out...

Comment by Jinshan Xiong (Inactive) [ 15/Sep/12 ]

It looks good. Can you please add this patch, http://review.whamcloud.com/4001, and run the benchmark again? Thanks.

Comment by Jinshan Xiong (Inactive) [ 15/Sep/12 ]

Ihara, can you please tell me the configuration of llite.*.max_cached_mb?

Comment by Jinshan Xiong (Inactive) [ 15/Sep/12 ]

Also of interest is that the "Slab" usage is growing by 1GB/s. That is 1/6 of the memory used by the cache, and a sign of high memory overhead from Lustre for each page of dirty data and/or RPCs. While not directly related to this bug, if Lustre used less memory for itself it would delay the time before the memory ran out...

I suspect this is because those Lustre pages remain on the kernel's LRU list even after the OSCs try to discard them. So my recent patch removes them from the kernel's LRU and frees them proactively.

Comment by Shuichi Ihara (Inactive) [ 15/Sep/12 ]

Jinshan, here is the current max_cached_mb setting on the client. I will try your new patches.

  # lctl get_param llite.*.max_cached_mb
    llite.lustre-ffff88106a250c00.max_cached_mb=
    users: 24
    max_cached_mb: 48398
    used_mb: 48398
    unused_mb: 0
    reclaim_count: 0

Comment by Jinshan Xiong (Inactive) [ 15/Sep/12 ]

I see. Please try the patch anyway, though it may not help your case.

Comment by Shuichi Ihara (Inactive) [ 15/Sep/12 ]

Jinshan,

I just applied the new patches as well, but they didn't help very much. Attached are the test results after the patches were applied.

Comment by Robin Humble [ 26/Sep/12 ]

the above seems to be mostly about big streaming i/o. should I open a new bug for random i/o problems, or does it fit into this discussion?

I've been doing some 2.1.3 pre-rollout testing, and there seems to be a client problem with small random reads. Performance is considerably worse on 2.1.3 clients than on 1.8.8 clients: about a 35x slowdown for 4k random read I/O.

Tests use the same files on a RHEL6 x86_64 2.1.3 server (stock 2.1.3 RPMs), QDR IB fabric, a single disk or md8+2 LUN for an OST, with all client & server VFS caches dropped between trials.

Checksums on or off and client RPCs in flight at 8 or 32 make little difference. I've also tried unmounting the fs from the 1.8.8 client between tests to make sure there's no hidden caching, but that didn't change anything.
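
(For reference, this is roughly how those settings are toggled between runs; a sketch using the standard client tunables, which is my assumption of what was varied above.)

# wire checksums on/off on the client
lctl set_param osc.*.checksums=0
# 8 vs 32 RPCs in flight per OSC
lctl set_param osc.*.max_rpcs_in_flight=32
# drop client VFS caches between trials
echo 3 > /proc/sys/vm/drop_caches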

random read ->
IOR -a POSIX -C -r -F -k -e -t $i -z -b 1024m -o /mnt/yo96/rjh/blah

i/o size   client version   single disk lun   md8+2 lun
i=4k       1.8.8            20.70 MB/s        22.0 MB/s
i=4k       2.1.3             0.55 MB/s         0.6 MB/s
i=1M       1.8.8            87 MB/s           137 MB/s
i=1M       2.1.3            63 MB/s            83 MB/s

Although these numbers are for a single process, the same trend applies when the IOR is scaled up to 8 processes/node and to multiple nodes.

Comment by Andreas Dilger [ 26/Sep/12 ]

Robin, it would be better to file the random IO issue as a separate bug. This one is already very long and complex, and it is likely that the solution to the random IO performance will be different than what is being implemented here.

Comment by Robin Humble [ 26/Sep/12 ]

10-4. I've created LU-2032 for it.

Comment by Gregoire Pichon [ 19/Oct/12 ]

Hi,

Here are the measurements I made with different Lustre versions.
I don't see any improvement with the patch in lustre 2.1.
In lustre 2.3, results are good. The patch gives more than 40% improvement.

By the way, I don't understand why the read performance is lower than write (although obdfilter performance is better in read than in write).

Hardware configuration:
30 OSTs
2 OSS (2 sockets, 16 cores, 32GB memory, 2xIB FDR, 4xFC8-2port)
1 Client (2 sockets, 16 cores, 32 GB memory, 1xIB FDR)

Software configuration:
OSS
lustre2.1.3 : lustre 2.1.3 + ORNL-22 + a few other patches

Client
lustre1.8.8-wc1 : standard lustre1.8.8-wc1 client rpms
lustre2.1.3 : lustre2.1.3 + ORNL-22 + a few other patches
lustre2.1.3+lu-744 : same as lustre2.1.3 plus patch #2929
lustre2.2.93 : lustre 2.2.93
lustre2.2.93+lu-744 : lustre 2.2.93 plus patches #3924 and #4001

In the lustre1.8.8-wc1, lustre2.1.3+lu-744 and lustre2.2.93+lu-744 configurations, I left the default value for max_cached_mb (24GiB).

IOR file per process, 16 processes, blockSize=4GiB, xfersize=1MiB, fsync=1.
This gives an aggregate filesize of 64 GiB.
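
(For reference, such a run would be launched roughly as below; a sketch, with the MPI launcher and mount point as placeholders.)

mpirun -np 16 IOR -a POSIX -w -r -F -e -t 1m -b 4g -o /mnt/lustre/ior.out/file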

                       write   read  (MiB/s)
lustre1.8.8-wc1        4307    2478
lustre2.1.3            2341    1975
lustre2.1.3+lu-744     2351    1958
lustre2.2.93           2427    1988
lustre2.2.93+lu-744    3571    2808

With the last configuration, I have results for several max_cached_mb settings.

max_cached_mb   write   read  (MiB/s)
1024            2956    1621
2048            3028    2341
4096            3036    2388
8192            3245    2499
16384           3398    3069
24576           3575    3032
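
(For anyone repeating this sweep, max_cached_mb can be changed at runtime on the client; a sketch using the values from the table above.)

for mb in 1024 2048 4096 8192 16384 24576; do
    lctl set_param llite.*.max_cached_mb=$mb
    # rerun the IOR job described above and record write/read MiB/s
done
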
Comment by Andreas Dilger [ 20/Oct/12 ]

Gregoire, if it isn't too much to ask, could you please also try the current master client (2.3.53+). It already has the LU-744 patch landed, so it should be at least as fast as 2.2.93 + LU-744.

Comment by Gregoire Pichon [ 22/Oct/12 ]

Here are the results with lustre 2.3.53+ (master up to patch a9444bc)

                       write   read  (MiB/s)
lustre2.3.53+          2582    2134

Results are not as good as with the lustre2.2.93+lu-744 configuration described above.
In fact, master does not include all the fixes contained in patches #3924 and #4001:
the fix related to coh_page_guard (LU-1666 obdclass: reduce lock contention on coh_page_guard),
the fix that removes some stats overhead (LU-744 osc: remove stats),
and the fix that removes LRU pages voluntarily (LU-744 llite: remove LRU pages voluntarily).

Comment by Jinshan Xiong (Inactive) [ 22/Oct/12 ]

For b2_1, we probably need the new IO engine to boost performance. I will work out a production-ready patch for removing the stats so that we can get better performance on 2.3 and master.

Comment by Shuichi Ihara (Inactive) [ 31/Oct/12 ]

Hi Jinshan, any updates on this, or is there anything we can do to see how much the performance improves?

Comment by Jinshan Xiong (Inactive) [ 31/Oct/12 ]

There is one more patch that needs productizing; I will finish it soon.

Comment by Frederik Ferner (Inactive) [ 05/Nov/12 ]

I'm quite interested in these patches, as I'm currently trying to implement a file system where all traffic goes over Ethernet, with the OSSes attached via (dual bonded) 10GigE. A small number of clients connected via 10GigE should each be able to write at 900MB/s from a single stream. Currently, with a 1.8.8 client writing to 2.3.0 OSSes and network checksums turned off, I get about 700MB/s. Upgrading the client to 2.3.0 I don't seem to get above 450MB/s, and checksums don't make much difference here (IOR 1M block size).

So far I've not had much luck running the patches attached to this ticket without OOMs on my client.

Comment by Jinshan Xiong (Inactive) [ 05/Nov/12 ]

Hi Ihara, I pushed two patches to address the stats problem: http://review.whamcloud.com/4471 and http://review.whamcloud.com/4472

Can you please give them a try? Please collect stats while you're running with the patches, thanks.

Hi Frederik, can you please try patches http://review.whamcloud.com/4245, http://review.whamcloud.com/4374 and http://review.whamcloud.com/4375? They may solve your problem if you're hitting the same one as LLNL.

Comment by Frederik Ferner (Inactive) [ 08/Nov/12 ]

Using those patches, I managed to compile a client from the git master branch and run my IOR benchmark. It didn't improve performance, but my client didn't suffer OOMs either. I've not added any other patches on top of master (as of Monday evening: commit 82297027514416985a5557cfe154e174014804ba), as none of them seemed to apply cleanly. Were you expecting me to see higher performance? Are there any other patches I should test?

Frederik

Comment by Jinshan Xiong (Inactive) [ 08/Nov/12 ]

Can you please describe the test environment in detail and tell me the specific performance numbers before and after applying the patches? Also, please collect performance data with oprofile and collectl as Ihara did.

There are two new patches (4471 and 4472) I submitted yesterday; can you please also give them a try?

Comment by Shuichi Ihara (Inactive) [ 09/Nov/12 ]

Jinshan, so just the two patches (4471 and 4472) on top of master is fine? Then collect stats during the IOR run. No need to apply any other patches to master for this debugging, right?

Comment by Jinshan Xiong (Inactive) [ 09/Nov/12 ]

Yes, only those two on master.

Comment by Shuichi Ihara (Inactive) [ 10/Nov/12 ]

Hi Jinshan, I just ran the same testing after applying the two patches (4471 and 4472) to master. Please check all the results and statistics.

Comment by Jinshan Xiong (Inactive) [ 12/Nov/12 ]

Hi Ihara, I still saw high contention in cl_page_put and the stats. Can you please try patch 4519, where I disabled stats completely? For the cl_page_put() part, I will think about a way to solve it.

Comment by Frederik Ferner (Inactive) [ 13/Nov/12 ]

Jinshan,

Apologies for not providing the information from the start; I've also now realised that this might be better suited to a new ticket, so let me know if you would prefer me to open one.

My current test setup is a small file system with all servers on Lustre 2.3: 2 OSSes, 6 OSTs in total (3 per OSS). All servers and test clients are attached via 10GigE. Network throughput has been tested: the test client can send at 1100MB/s to each server in turn using netperf, and LNET selftest throughput also reaches 1100MB/s sending from one client to both servers at the same time.

I've now repeated a small test with IOR and different versions on the clients. The test client only has 4GB RAM; in my tests on 2.3.54 (master up to commit 8229702 with patches 4245, 4374, 4375, 4471, 4472) I can write small files relatively fast, but 4GB files are slow. I've not tested reading, as this is not my main concern at the moment. (I'm hoping to achieve 900MB/s sustained write speed over 10GigE from a single process to accommodate a new detector we will commission early next year; my hope was that 2.X clients would provide higher single-thread performance than 1.8.)

IOR command used:
ior -o /mnt/play01/tmp/stripe-all/ior_dat -w -k -t1m -b 4g -i 1 -e

client details                             write speed [MiB/s]
1.8.8, checksums on                        487.61
1.8.8, checksums off                       592.90
2.3.0, checksums on                        440.36
2.3.0, checksums off                       441.63
2.3.54+patches, checksums on                30.21
2.3.54+patches, checksums off               34.12
2.3.54+patches, checksums on, 1GB file     313.47

opreport and collectl output for all the tests with 4GB files are attached in lu744-dls-20121113.tar.gz

Let me know if you need anything else or if I need to run oprofile differently as I wasn't familiar with oprofile before.

Comment by Andreas Dilger [ 13/Nov/12 ]

Frederik, I'm assuming for your test results that you are running the same version on both the client and server? Would it also be possible for you to test 2.3.0 clients with 2.3.54 servers and vice versa? That would allow us to isolate whether the slowdown seen with 2.3.54 is due to changes in the client or the server.

Comment by Frederik Ferner (Inactive) [ 13/Nov/12 ]

So far all these tests have been done with 2.3.0 on the servers. I've not tried 2.3.54 on any of my test servers yet. I'll try to find some time over the next few days.

Comment by Shuichi Ihara (Inactive) [ 16/Nov/12 ]

Jinshan, I tested master + patch 4519 on both the servers and the client, but it still seems to give the same results.

Comment by Jinshan Xiong (Inactive) [ 16/Nov/12 ]

Frederik, sorry for the delayed response. From the test results, it looks like there may be some issues with the LU-2139 patches. You can see it in the collectl stats:

46 45 72551 10187 2G 10M 410M 399M 188M 162M 0 0 379904 371
47 47 74169 7109 2G 11M 786M 775M 269M 162M 0 0 385024 376
25 25 40289 4190 2G 11M 982M 971M 312M 162M 0 0 200704 196
6 6 8440 229 2G 11M 982M 971M 313M 162M 0 0 0 0
7 7 10545 249 2G 11M 982M 971M 313M 164M 0 0 0 0

.....

(20 seconds later)

7 7 10639 241 2G 11M 983M 973M 311M 163M 0 0 0 0
9 8 12408 236 2G 11M 983M 973M 311M 163M 0 0 0 0
34 34 53022 4218 1G 11M 1G 1G 357M 163M 0 0 258048 252
50 50 77645 7414 1G 11M 1G 1G 447M 163M 0 0 422912 413

There was IO activity for 2 or 3 seconds, then things stayed quiet for around 20 seconds, then IO started again. It seems like the LRU budget was running out, so the OSC had to wait for the commit on the OST to finish.

I will work on this. Thanks for testing.

Comment by Jinshan Xiong (Inactive) [ 16/Nov/12 ]

Hi Ihara, I saw significant CPU usage in the libraries mca_btl_sm.so (11.7%) and libopen-pal.so.0.0.0 (4.7%), but in the performance data shown on Sep 5 they only consumed 0.13% and 0.05%. They are OpenMPI libraries. Did you do any upgrade on these libraries?

Anyway, I revised patch 4519 and restored 4472 to remove the memory stalls; please apply them in your next benchmark. However, we have to figure out why the OpenMPI libraries consume so much CPU before we can see the performance improvement.

Comment by Shuichi Ihara (Inactive) [ 17/Nov/12 ]

Jinshan,

Yes, I upgraded the MPI library a couple of weeks ago. I also found a hardware problem and fixed it. Now mca_btl_sm_component_progress consumes less CPU, though it's still high compared to the previous library...

This attachment includes three test results

1. master without any patches
2. master + 4519 (2nd patch) + 4472 (2nd patch)
3. master + 4519 (2nd patch) + 4472 (2nd patch), running MPI with pthreads instead of shared memory.

The patches reduce CPU consumption and improve performance, but performance still drops once the client has no free memory.

Comment by Prakash Surya (Inactive) [ 19/Nov/12 ]

Jinshan, Frederik: when using the LU-2139 patches on the client but not on the server, it is normal to see the IO pause/stall you are seeing. I'm not sure if this is what's happening here, but what can happen is:

1. Client performs IO
2. Client receives completion callback for bulk RPC
3. Bulk pages now clean but "unstable" (uncommitted on OST)
4. NR_UNSTABLE_NFS incremented for each unstable page (due to http://review.whamcloud.com/4245)
5. NR_UNSTABLE_NFS grows larger than (background_thresh + dirty_thresh)/2
6. Kernel stalls IO waiting for NR_UNSTABLE_NFS to decrease (via kernel function: balance_dirty_pages)
7. Client receives a Lustre ping sometime in the future (around 20 seconds later?), updating last_committed
8. Bulk pages now "stable" on client and can be reclaimed, lowering NR_UNSTABLE_NFS
9. Go back to step 1.

Reading the above comments, it looks like the LU-2139 patches are working as intended (avoiding OOMs at the cost of performance). Although I admit the performance is terrible when you hit the NR_UNSTABLE_NFS limit and the kernel halts all IO (but it is better than OOM, IMO). To improve on this, http://review.whamcloud.com/4375 needs to be applied to both clients and servers. This will allow the server to proactively commit bulk pages as they come in, hopefully preventing the client from exhausting its memory with unstable pages and avoiding the "stall" in balance_dirty_pages. With it applied to the server, I'd expect NR_UNSTABLE_NFS to remain "low", and the 4GB file speeds to match the 1GB speeds.

Please keep in mind, the LU-2139 patches are all experimental and subject to change.

On the client, with the LU-2139 patches applied, you might find it interesting to watch lctl get_param llite.*.unstable_stats and cat /proc/meminfo | grep NFS_Unstable as the test is running.

For example:

$ watch -n0.1 'lctl get_param llite.*.unstable_stats'
$ watch -n0.1 'cat /proc/meminfo | grep NFS_Unstable'

Those will give you an idea of the number of unstable pages the client has at a given time. If that value gets "high" (the exact value depends on your dirty limits, but probably around 1/4 of RAM), then what I detailed above is most likely the cause of the bad performance.
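
(A very rough way to estimate where that stall point sits on a given client; a sketch only, since the kernel derives the real thresholds from dirtyable memory rather than MemTotal, and some setups use dirty_bytes instead of dirty_ratio.)

# approximate (background_thresh + dirty_thresh)/2 from the vm tunables
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
dirty=$(cat /proc/sys/vm/dirty_ratio)
bg=$(cat /proc/sys/vm/dirty_background_ratio)
echo "approx stall threshold: $(( mem_kb * (dirty + bg) / 200 )) kB"
# compare against the unstable page count while IOR is running
grep NFS_Unstable /proc/meminfo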

Comment by Jinshan Xiong (Inactive) [ 19/Nov/12 ]

Hi Ihara, this is because the CPU is still under contention, so the performance dropped when the housekeeping work started. Can you please run the benchmark one more time with patches 4519, 4472 and 4617? This should help a little bit.

Comment by Jinshan Xiong (Inactive) [ 02/Jan/13 ]

There is a new patch for performance tuning at: http://review.whamcloud.com/4943. Please give it a try.

Comment by Jinshan Xiong (Inactive) [ 02/Jan/13 ]

My next patch will remove the top cache of cl_page.

Comment by Shuichi Ihara (Inactive) [ 03/Jan/13 ]

Jinshan,

I just tested http://review.whamcloud.com/4943

The attachment includes all results and oprofile output.
It looks clearly better than the previous numbers, but I wonder if we could get even better performance, since we sometimes reach 5.6GB/sec (see collectl.out); I'd like to stay around those numbers.

Comment by Prakash Surya (Inactive) [ 03/Jan/13 ]

It might help with interpreting the opreport data if the -p option is used. According to the opreport man page:

       --image-path / -p [paths]
              Comma-separated list of additional paths to search for binaries.  This is needed to find modules in kernels 2.6 and upwards.

Without it, external module symbols don't get resolved:

samples  %        image name               app name                 symbol name
6340482  25.2096  obdclass                 obdclass                 /obdclass
3473020  13.8087  osc                      osc                      /osc
1972900   7.8442  lustre                   lustre                   /lustre
1374077   5.4633  vmlinux                  vmlinux                  copy_user_generic_string
842569    3.3500  lov                      lov                      /lov
551880    2.1943  libcfs                   libcfs                   /libcfs

Although the opreport-alwdg-p_lustre.out file seems to have all the useful bits.
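
(For example, something along these lines should resolve the Lustre module symbols; the module paths below are a guess at where the lustre-modules package installs the .ko files, so adjust as needed.)

# point opreport at the directories containing the kernel and lustre modules
opreport -l -p /lib/modules/$(uname -r)/updates,/lib/modules/$(uname -r)/extra > opreport-l.out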

Comment by Jinshan Xiong (Inactive) [ 03/Jan/13 ]

CPU is still a bottleneck. The write speed dropped after the OSC LRU cache stepped in, which immediately drove CPU usage to 100%. Let me see if I can optimize it.

Comment by Jinshan Xiong (Inactive) [ 03/Jan/13 ]

Hi Ihara, what's the performance of b1_8 again on the same platform?

Comment by Andreas Dilger [ 03/Jan/13 ]

Ihara, could you please extract out the performance numbers for this patch and the previous ones in a small table like was done for the previous tests?

Comment by Shuichi Ihara (Inactive) [ 03/Jan/13 ]

OK, I tested the client again with b1_8, master, and master+4943, and for this round I ran multiple iterations of IOR.

Configuration
8 x OSS : 2 x E5-2670 (2.6GHz), 64GB memory, Centos6.3+master(2.3.58)/w FDR, total 32 OSTs
1 x Client : 2 x E5-2680 (2.7GHz), 64GB memory, Centos6.3/w FDR (tested with b1_8, master and master+patch as patchless client)

nproc=12
                   iteration=1   iteration=2   iteration=3
master(2.3.58)     3547 MiB/s    2754 MiB/s    2633 MiB/s
master+patch(4943) 3775 MiB/s    3407 MiB/s    2841 MiB/s  
b1_8               4212 MiB/s    4012 MiB/s    3750 MiB/s

nproc=16
                   iteration=1   iteration=2   iteration=3
master(2.3.58)     3617 MiB/s    3286 MiB/s    3149 MiB/s
master+patch(4943) 4077 MiB/s    3269 MiB/s    3511 MiB/s  
b1_8               4851 MiB/s    4255 MiB/s    4277 MiB/s
Comment by Shuichi Ihara (Inactive) [ 03/Jan/13 ]

The new test results include b1_8, master and master+patch.

Comment by Gregoire Pichon [ 23/Jan/13 ]

Jinshan,

What is the status of the patch http://review.whamcloud.com/#change,2929 you posted several months ago for the b2_1 release?
Why has it never landed?

I have made some measurements and the results are significant: from 4% to 50% improvement, depending on the platform I tested on.

Here are the results.

Hardware configuration:
30 OSTs
2 OSS : 4 sockets, 32 cores, 64GB memory, 2xIB, 4xFC8-2port
ClientA : 4 sockets Nehalem-EX, 32 cores, 64GB memory, 1xIB
ClientB : 2 sockets SandyBridge-EP, 16 cores, 64GB memory, 1xIB
Interconnect is QDR Infiniband

Software configuration:
kernel 2.6.32-220
lustre 2.1.3 + ORNL-22 + a few other patches

IOR file per process, blockSize=4GiB, xfersize=1MiB, fsync=1.
This gives an aggregate filesize of 120 GiB.

          #tasks    write   read   configuration
ClientA       30     1121   1079   lustre 2.1.3
ClientA       30     1782   1413   lustre 2.1.3 + #2929

ClientB       16     2482   2149   lustre 2.1.3
ClientB       16     2616   2244   lustre 2.1.3 + #2929
Comment by Prakash Surya (Inactive) [ 23/Jan/13 ]

Gregoire, that's interesting. I wouldn't immediately expect #2929 to make much of a performance impact. How many iterations did you run? I'm curious if those numbers are within the natural variance of the test, or if they're actually because of the changes in #2929. Jinshan, would you expect performance to increase because of that patch?

Comment by Andreas Dilger [ 20/Feb/13 ]

Jinshan,
with http://review.whamcloud.com/4943 landed to master, are there any patches left to land under this bug, or can it be closed?

Comment by Shuichi Ihara (Inactive) [ 20/Feb/13 ]

Andreas,

As far as I have tested, 4943 helped improve performance, but even with that patch applied, performance is still lower than b1_8.

Comment by Jinshan Xiong (Inactive) [ 20/Feb/13 ]

All patches have been landed. More work is also needed.

Comment by Cliff White (Inactive) [ 06/May/13 ]

Tested single-client performance against 2.3.64 servers; client versions tested: 1.8.8, 2.1.5, 2.3.0, 2.3.64

Comment by Shuichi Ihara (Inactive) [ 07/May/13 ]

Cliff, what are the servers' CPU type and memory size? What IOR options and file size? Performance depends on the client's specs, the network and the storage.
We are getting much better performance on the current master, although the 1.8 client is still faster on some numbers.
I will post the latest numbers here.

Comment by Cliff White (Inactive) [ 07/May/13 ]

Servers are Intel Xeon, 64GB RAM. IOR options were taken from this bug: -t 1m -b 32g.
The client had the same/similar CPU and 64GB RAM.

Comment by Peter Jones [ 06/Feb/14 ]

This should have been addressed by LU-3321
