[LU-1447] MDS Load average Created: 30/May/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: None

Type: Task Priority: Minor
Reporter: Fabio Verzelloni Assignee: Oleg Drokin
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

MDS HW
----------------------------------------------------------------------------------------------------
Linux XXXX.admin.cscs.ch 2.6.32-220.7.1.el6_lustre.g9c8f747.x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
Vendor ID: AuthenticAMD
CPU family: 16
64 GB RAM
Interconnect IB 40Gb/s

MDT LSI 5480 Pikes Peak
SSDs SLC
----------------------------------------------------------------------------------------------------

OSS HW
----------------------------------------------------------------------------------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
Vendor ID: GenuineIntel
CPU family: 6
64 GB RAM
Interconnect IB 40Gb/s

OST LSI 7900
----------------------------------------------------------------------------------------------------

1 MDS + 1 fail over
12 OSS - 6 OST per OSS


Attachments: ps_D, top.png

 Description   

Dear support,
we are running Lustre 2.2 on our system and we noticed that, even with low usage of the file system, the MDS load average is always around 10. By comparison, on Lustre 1.8.4 the MDS stays around a load average of 1.80 even under heavy I/O. Is this the normal behavior of Lustre 2.2?



 Comments   
Comment by Oleg Drokin [ 30/May/12 ]

I see that this is not an issue of high CPU usage; rather, some of the threads are sleeping in D state.
The top snapshot is not long enough to show them.
Can you please run ps ax, filter out only the threads in D state, and post the output here?
(e.g. ps ax | grep D)
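
For reference, a slightly more targeted way to grab only the D-state threads (a sketch assuming the stock procps ps; the plain grep above works too, but also matches unrelated lines containing a capital D):

# list PID, state, wait channel and command for threads whose state starts with D
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'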

Comment by Fabio Verzelloni [ 31/May/12 ]

Yesterday evening we had a hang of the file system (ticket http://jira.whamcloud.com/browse/LU-1451) and now the load average is back to 'normal' (load average: 0.26, 0.24, 0.30), even during heavy I/O, but we are seeing a drop in performance (http://jira.whamcloud.com/browse/LU-1455).

Comment by Fabio Verzelloni [ 31/May/12 ]

This is the output of ps -ax | grep D from the whole cluster after it had been running for a while.

Comment by Oleg Drokin [ 31/May/12 ]

Based on this output, it seems that weisshorn07, 09, 13 ... are having serious overload issues (likely induced by the disk subsystems there). Any chance you can survey your disk subsystem to see what's going on? I suspect it is not really happy with a lot of parallel IO going on. Also, since some of the other OSSes are less busy, it appears the IO is not distributed all that evenly.

In a lot of cases with disk subsystems that are 'weak' with respect to parallel IO, limiting the maximum number of OST IO threads should help the situation, I think, by reducing the overload.
Ongoing work on NRS (Network Request Scheduler) should help even more by allowing the number of in-progress RPCs per target to be limited.
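
As a sketch of how the OST IO thread limit mentioned above could be applied (using the standard Lustre tunables matching the proc layout in this ticket; 256 is only an example value, not a recommendation):

# on each OSS, at runtime (already-started threads are typically not stopped, but no new ones are created):
lctl set_param ost.OSS.ost.threads_max=256
# to make it persistent across reboots, e.g. in /etc/modprobe.d/lustre.conf:
options ost oss_num_threads=256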

The MDS does not have any processes in D state, and I assume that at the time this snapshot was taken the MDS load average was pretty small?

Comment by Fabio Verzelloni [ 01/Jun/12 ]

The disk HW is:
LSI 7900
6 controllers
8 enclosures per controller
RAID 6 - SATA 7.2K RPM

Our 'max_rpcs_in_flight' on the MDS is:
[root@weisshorn01 lustre]# cat ./osc/scratch-OST0047-osc-MDT0000/max_rpcs_in_flight
8

and the threads_[max,min,started] are:

[root@weisshorn01 lustre]# cat ./mgs/MGS/mgs/threads_max
32
[root@weisshorn01 lustre]# cat ./mgs/MGS/mgs/threads_started
32
[root@weisshorn01 lustre]# cat ./mgs/MGS/mgs/threads_min
3

[root@weisshorn01 scratch-MDT0000]# cat ./mdt_mds/threads_max
512
[root@weisshorn01 scratch-MDT0000]# pwd
/proc/fs/lustre/mdt/scratch-MDT0000
[root@weisshorn01 scratch-MDT0000]# cat ./mdt_mds/threads_min
2
[root@weisshorn01 scratch-MDT0000]# cat ./mdt_mds/threads_started
2

on the OSS:

[root@weisshorn03 lustre]# cat ./ost/OSS/ost/threads_max
512
[root@weisshorn03 lustre]# cat ./ost/OSS/ost/threads_min
128
[root@weisshorn03 lustre]# cat ./ost/OSS/ost/threads_started
512

on the client/MDS side the max_rpcs_in_flight is:

[root@weisshorn01 lustre]# cat ./osc/scratch-OST0005-osc-MDT0000/max_rpcs_in_flight
8
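
For what it's worth, the same values can be collected in one pass with lctl get_param (a sketch, assuming the proc layout shown above):

# on the MDS:
lctl get_param mdt.*.mdt_mds.threads_* osc.*.max_rpcs_in_flight
# on each OSS:
lctl get_param ost.OSS.ost.threads_*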

So far we have not seen the high load average on the MDS anymore. Instead, when we run a benchmark with a block size of 4096k, the load average on the OSS goes to 200-300, and with 'top' we see the '%wa' (IO wait) value increasing in some cases.
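
For reference, one way to see which block devices the %wa is coming from during such a run (a sketch, assuming sysstat's iostat is installed on the OSSes):

# per-device utilization, queue size and service times, refreshed every 5 seconds
iostat -x 5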

Do you have any suggestions about the right tuning based on our hardware/configuration? Also on the client side? (Cray XE6, ~1500 nodes)
Thanks

Fabio

Comment by Oleg Drokin [ 01/Jun/12 ]

Well, it's somewhat expected that as you increase the write activity, load average on OSTs goes up.

Essentially, what is going on is that every write RPC (in 1M chunks) from every client consumes one OSS IO thread.
If that thread blocks doing IO, it adds +1 to the LA.
The more threads in this state, the higher the LA. Additionally, a lot of disk controllers do not cope well with heavy parallel IO (because it is, in effect, highly random IO from their perspective).

So solutions for you are multiple:
1. If it's just the high LA that bothers you, and the rest of the system performs well and the speed is acceptable, then just ignore the LA.
2. If you think your system could perform better under load (I am not really familiar with that particular LSI controller; you might want to speak with the LSI guys to find out the optimal IO pattern for it), you might try decreasing the OSS max thread number (to 256, 128 and so on) and see what impact it has.
If you gather some more stats from your disk controllers to see the IO pattern as they see it, we might also spot whether something looks out of place (e.g. a lot of small IO would not be good). What is the load-generating application doing?
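
One way to get a first look at the IO pattern as the OSTs themselves see it (a sketch, assuming the standard obdfilter brw_stats proc files):

# on each OSS: per-OST histogram of disk IO sizes, fragmentation, etc.
lctl get_param obdfilter.*.brw_stats

A large fraction of IOs well below 1 MB in these histograms would point at exactly the kind of small/random IO mentioned above.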

There is mostly no tuning you can do on the client side that would relieve the situation without also hurting, for example, single-client performance, so I suggest you concentrate on the servers here. (The possible exception is the read-ahead settings, when you expect a lot of small read traffic.)
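
For completeness, the read-ahead knobs referred to here live on the clients under llite (a sketch; the value used is purely illustrative):

lctl get_param llite.*.max_read_ahead_mb llite.*.max_read_ahead_per_file_mb
# e.g. to change the global read-ahead window on a client:
lctl set_param llite.*.max_read_ahead_mb=64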

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
