Lustre / LU-744

Single client's performance degradation on 2.1

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.2.0, Lustre 2.3.0
    • Labels: None
    • Severity: 3
    • Rank: 4018

    Description

      During performance testing on Lustre 2.1, I saw a performance degradation on a single client.
      Here are the IOR results on a single client with 2.1, and with Lustre 1.8.6.80 for comparison.
      I ran IOR (IOR -t 1m -b 32g -w -r -vv -F -o /lustre/ior.out/file) on the single client with 1, 2, 4, and 8 processes.
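      (A minimal reproduction sketch, assuming Open MPI's mpirun as the launcher and IOR on the PATH; only the IOR arguments above come from this ticket:)

          # Run the same IOR command line at 1, 2, 4, and 8 processes.
          # mpirun as the launcher is an assumption; adjust to your MPI stack.
          for np in 1 2 4 8; do
              mpirun -np $np IOR -t 1m -b 32g -w -r -vv -F -o /lustre/ior.out/file
          done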

      Write (MiB/sec)
      Processes   v1.8.6.80   v2.1
      1            446.25      411.43
      2            808.53      761.30
      4           1484.18     1151.41
      8           1967.42     1172.06

      Read (MiB/sec)
      Processes   v1.8.6.80   v2.1
      1            823.90      595.71
      2           1449.49     1071.76
      4           2502.49     1517.79
      8           3133.43     1746.30

      Both versions were tested on the same infrastructure (hardware and network); checksums were disabled on the client in both runs.
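
      (For reference, a sketch of how client checksums are typically disabled; the standard osc.*.checksums tunable is assumed, as the ticket does not show the exact command used:)

          # Disable client-side data checksums on all OSC devices.
          lctl set_param osc.*.checksums=0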

      Attachments

        1. 2.4 Single Client 3May2013.xlsx
          34 kB
        2. 574.1.pdf
          169 kB
        3. ior-256gb.tar.gz
          32 kB
        4. ior-32gb.tar.gz
          24 kB
        5. lu744-20120909.tar.gz
          883 kB
        6. lu744-20120915.tar.gz
          874 kB
        7. lu744-20120915-02.tar.gz
          1.02 MB
        8. lu744-20121111.tar.gz
          849 kB
        9. lu744-20121113.tar.gz
          846 kB
        10. lu744-20121117.tar.gz
          2.45 MB
        11. lu744-20130104.tar.gz
          915 kB
        12. lu744-20130104-02.tar.gz
          26 kB
        13. lu744-dls-20121113.tar.gz
          10 kB
        14. orig-collectl.out
          81 kB
        15. orig-ior.out
          2 kB
        16. orig-opreport-l.out
          146 kB
        17. patched-collectl.out
          34 kB
        18. patched-ior.out
          2 kB
        19. patched-opreport-l.out
          137 kB
        20. single-client-performance.xlsx
          42 kB
        21. stats-1.8.zip
          14 kB
        22. stats-2.1.zip
          64 kB
        23. test2-various-version.zip
          264 kB
        24. test-patchset-2.zip
          147 kB

        Issue Links

          Activity

            Ihara, could you please extract out the performance numbers for this patch and the previous ones in a small table like was done for the previous tests?

            adilger Andreas Dilger added a comment

            Hi Ihara, what's the performance of b1_8 again on the same platform?

            jay Jinshan Xiong (Inactive) added a comment

            CPU is still a bottleneck. The write speed dropped after the OSC LRU cache stepped in and immediately drove CPU usage to 100%. Let me see if I can optimize it.
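
            (A hedged monitoring sketch, not from the ticket: llite.*.max_cached_mb is the standard client cache-limit parameter, though its output format varies by Lustre version:)

                # Watch client cache occupancy alongside CPU while the benchmark
                # runs, to correlate LRU cache growth with CPU saturation.
                watch -n1 'lctl get_param llite.*.max_cached_mb; top -bn1 | head -4'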

            jay Jinshan Xiong (Inactive) added a comment - edited

            It might help with interpreting the opreport data if the -p option is used. According to the opreport man page:

                   --image-path / -p [paths]
                          Comma-separated list of additional paths to search for binaries.  This is needed to find modules in kernels 2.6 and upwards.
            

            Without it, external module symbols don't get resolved:

            samples  %        image name               app name                 symbol name
            6340482  25.2096  obdclass                 obdclass                 /obdclass
            3473020  13.8087  osc                      osc                      /osc
            1972900   7.8442  lustre                   lustre                   /lustre
            1374077   5.4633  vmlinux                  vmlinux                  copy_user_generic_string
            842569    3.3500  lov                      lov                      /lov
            551880    2.1943  libcfs                   libcfs                   /libcfs
            

            Although the opreport-alwdg-p_lustre.out file seems to have all the useful bits.
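
            (For example, something like the following; the module search path is an assumption based on a typical layout, not taken from this ticket:)

                # Point opreport at the kernel module tree so Lustre module
                # symbols resolve (adjust the path to your installation).
                opreport -l -p /lib/modules/$(uname -r) > opreport-l.out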

            prakash Prakash Surya (Inactive) added a comment

            Jinshan,

            I just tested http://review.whamcloud.com/4943

            The attachment includes all results and oprofile output.
            It looks clearly better than the previous numbers, but I wonder if we could get even better performance, since we sometimes hit 5.6 GB/sec (see collectl.out); I'd like to keep throughput around those numbers.

            ihara Shuichi Ihara (Inactive) added a comment

            My next patch will be to remove the top cache of cl_page.

            jay Jinshan Xiong (Inactive) added a comment

            There is a new patch for performance tuning at http://review.whamcloud.com/4943. Please give it a try.
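
            (A sketch of how a Gerrit change such as 4943 is typically fetched for testing; the refs/changes path follows Gerrit's convention, and the patchset number is an assumption:)

                # Gerrit refs are refs/changes/<last 2 digits>/<change>/<patchset>;
                # patchset 1 here is an assumption.
                git fetch http://review.whamcloud.com/lustre refs/changes/43/4943/1
                git cherry-pick FETCH_HEAD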

            jay Jinshan Xiong (Inactive) added a comment

            Hi Ihara, this is because the CPU is still under contention, so the performance dropped when the housekeeping work started. Can you please run the benchmark one more time with patches 4519, 4472, and 4617? This should help a little bit.

            jay Jinshan Xiong (Inactive) added a comment

            Jinshan, Frederik, When using the LU-2139 patches on the client but not on the server, it is normal to see the IO pause/stall as you are seeing. I'm not sure if this is what's happening here, but what can happen is:

            1. Client performs IO
            2. Client receives completion callback for bulk RPC
            3. Bulk pages now clean but "unstable" (uncommitted on OST)
            4. NR_UNSTABLE_NFS incremented for each unstable page (due to http://review.whamcloud.com/4245)
            5. NR_UNSTABLE_NFS grows larger than (background_thresh + dirty_thresh)/2 (see the sketch after this list)
            6. Kernel stalls IO waiting for NR_UNSTABLE_NFS to decrease (via kernel function: balance_dirty_pages)
            7. Client receives Lustre ping sometime in future (around 20 seconds later?), updating last_committed
            8. Bulk pages now "stable" on client and can be reclaimed, lowering NR_UNSTABLE_NFS
            9. Go back to step 1.
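
            (A rough sketch of estimating that stall threshold from /proc, assuming the ratio-based vm tunables are in effect; the kernel computes the real thresholds against dirtyable memory, so this MemTotal-based figure is only an approximation:)

                # Approximate (background_thresh + dirty_thresh)/2 in kB.
                mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
                bg_kb=$(( mem_kb * $(cat /proc/sys/vm/dirty_background_ratio) / 100 ))
                dirty_kb=$(( mem_kb * $(cat /proc/sys/vm/dirty_ratio) / 100 ))
                echo "unstable-page stall threshold ~ $(( (bg_kb + dirty_kb) / 2 )) kB"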

            Reading the above comments, it looks like the LU-2139 patches are working as intended (avoiding OOMs at the cost of performance). Although I admit the performance is terrible when you hit the NR_UNSTABLE_NFS limit and the kernel halts all IO (but it is better than OOM, IMO). To improve on this, http://review.whamcloud.com/4375 needs to be applied to both clients and servers. This will allow the server to proactively commit bulk pages as they come in, hopefully preventing the client from exhausting its memory with unstable pages and avoiding the "stall" in balance_dirty_pages. With it applied to the server, I'd expect NR_UNSTABLE_NFS to remain "low", and the 4GB file speeds to reflect the 1GB speeds.

            Please keep in mind, the LU-2139 patches are all experimental and subject to change.

            On the client, with the LU-2139 patches applied, you might find it interesting to watch lctl get_param llite.*.unstable_stats and cat /proc/meminfo | grep NFS_Unstable as the test is running.

            For example:

            $ watch -n0.1 'lctl get_param llite.*.unstable_stats'
            $ watch -n0.1 'cat /proc/meminfo | grep NFS_Unstable'
            

            Those will give you an idea of the number of unstable pages the client has at a given time. If that value gets "high" (the exact value depends on your dirty limits, but probably around 1/4 of RAM), then what I detailed above is most likely the cause of the bad performance.

            prakash Prakash Surya (Inactive) added a comment - edited

            Jinshan,

            Yes, I upgraded the MPI library a couple of weeks ago. I also found a hardware problem and fixed it. Now mca_btl_sm_component_progress consumes less CPU; it's still high compared to the previous library, though...

            This attachment includes three test results:

            1. master without any patches
            2. master + 4519 (2nd patch) + 4472 (2nd patch)
            3. master + 4519 (2nd patch) + 4472 (2nd patch), running MPI with pthreads instead of shared memory (see the sketch below)
            

            The patches reduce CPU consumption and improve the performance, but performance still drops once the client has no free memory.
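
            (For reference, a hedged sketch of taking Open MPI's shared-memory BTL out of the picture, which may be related to what test 3 did; the MCA flag is standard Open MPI syntax but was not shown in the ticket:)

                # Exclude the shared-memory BTL so ranks fall back to tcp/self
                # (assumption: Open MPI launcher; IOR arguments from the ticket).
                mpirun -np 8 --mca btl ^sm IOR -t 1m -b 32g -w -r -vv -F -o /lustre/ior.out/file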

            ihara Shuichi Ihara (Inactive) added a comment

            Hi Ihara, I saw significant CPU usage in the libraries mca_btl_sm.so (11.7%) and libopen-pal.so.0.0.0 (4.7%), but in the performance data shown on Sep 5 they consumed only 0.13% and 0.05%. They are Open MPI libraries. Did you do any upgrade on these libraries?

            Anyway, I revised patch 4519 and restored 4472 to remove the memory stalls; please apply them in your next benchmark. However, we have to figure out why the Open MPI libraries consumed so much CPU before we can see the performance improvement.

            jay Jinshan Xiong (Inactive) added a comment

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: ihara Shuichi Ihara (Inactive)
              Votes: 1
              Watchers: 35

              Dates

                Created:
                Updated:
                Resolved: