[LU-5995] Apparent scale issue with 2.5.2 clients to 2.5.3 servers Created: 05/Dec/14 Updated: 07/Jun/16 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.2, Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Major |
| Reporter: | Joe Mervini | Assignee: | Oleg Drokin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Compute Clusters - Sun X6275 blades, QDR IB torus |
||
| Epic/Theme: | Performance |
| Rank (Obsolete): | 16720 |
| Description |
|
I am running into a significant performance issue when running IOR, primarily against the file system mentioned in the environment, although it is being observed on other global Lustre file systems as well. On the smaller of the two systems, with a single router, I get near wire speed running IOR on 32 nodes with 8 threads. On the larger system I am getting only approximately 10% of what I expect to see going through the routers: where I expect ~10GB/sec to the file system, I am seeing only roughly 6% of that performance. I have run both netperf and lnet_selftest from the routers to the servers; netperf shows basically wire speed. lnet_selftest shows approximately the same result when using a concurrency of 8, but with a concurrency of 1 I see about half. Loads on the servers and routers are insignificant. I have increased the credits on both the servers and gateways with no observable impact (although the changes have been implemented on only 2 gateways, not on the routers, due to this being a production environment). Credit changes have not been made on the client side (again, because this is a production system). I am unsure what to try next, nor do I know whether there is a known compatibility issue between these client and server versions. Any help would be greatly appreciated. |
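For reference, an lnet_selftest run of the kind described above (routers to OSSs, comparing concurrency settings) might look roughly like the sketch below. The NIDs, group names, and batch name are placeholders, not values taken from this system:

```shell
# Sketch of an lnet_selftest bulk test between routers and servers.
# Substitute real o2ib NIDs for the placeholder addresses.
export LST_SESSION=$$
lst new_session rw_test
lst add_group routers 10.0.0.[1-12]@o2ib
lst add_group servers 10.0.1.[1-8]@o2ib
lst add_batch bulk
# Re-run with --concurrency 1 vs 8 to reproduce the comparison above.
lst add_test --batch bulk --concurrency 8 \
    --from routers --to servers brw write size=1M
lst run bulk
lst stat servers & sleep 30; kill $!
lst stop bulk
lst end_session
```

Requires the lnet_selftest module loaded on all participating nodes; `lst stat` reports observed RPC rates and bandwidth while the batch runs.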
| Comments |
| Comment by Peter Jones [ 16/Dec/14 ] |
|
Oleg, what do you advise here? Peter |
| Comment by Oleg Drokin [ 16/Dec/14 ] |
|
There should not be any compatibility issues between the 2.4 and 2.5 versions. I assume you have already looked into the router statistics to ensure the load is evenly distributed across all the routers, and is not going through only one or two of them, making them the bottleneck for the entire path (the same goes for your file striping, but I guess you have checked that too already?). When you say "lnet_selftest show approximately the same result", do you mean that you get wire speed in such a test as well? |
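For completeness, the striping Oleg mentions can be checked from any client with `lfs`; the path below is a placeholder for the IOR target directory:

```shell
# Show the current stripe layout of the IOR target (placeholder path).
lfs getstripe /mnt/lustre/ior_testdir
# To stripe new files in the directory across all OSTs (-c -1 = all):
lfs setstripe -c -1 /mnt/lustre/ior_testdir
# Per-router LNET traffic counters, for checking balance across routers
# (location valid for Lustre of this era):
cat /proc/sys/lnet/stats
```

If IOR writes one file per task with a stripe count of 1, uneven OST placement alone can produce the kind of imbalance being discussed.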
| Comment by Joe Mervini [ 16/Dec/14 ] |
|
The data through the (12) routers was relatively balanced; I was monitoring the router stats during all my testing. The lnet_selftest performance was consistent between the 2 systems, in that the results on a single router were similar, and the average went up on the larger system when additional routers were added to the test (with lst running only between the router nodes themselves and the Lustre servers, as opposed to running lst on clients through the routers). One thing we were wondering about is credits and peer_credits, and whether their settings might be a factor. To be honest, we haven't really touched the module load options in years, with the exception of adding new networks and routers. The load-time options we are using are shown below:
###############################################################################
options lnet tiny_router_buffers=4096
|
| Comment by Amir Shehata (Inactive) [ 20/Dec/14 ] |
|
Here are the tunings that might affect performance: you can try manipulating these on the clients and servers to see if they increase performance. |
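The specific list appears not to have survived the export, but the LNET and ko2iblnd tunables usually adjusted in this situation look like the following. The values are illustrative only, not recommendations for this system:

```
# /etc/modprobe.d/lustre.conf -- illustrative values, not recommendations
# Router buffer pools (set on routers):
options lnet tiny_router_buffers=4096 small_router_buffers=8192 large_router_buffers=1024
# o2iblnd credits (set on clients and servers; must match across peers):
options ko2iblnd credits=256 peer_credits=16 concurrent_sends=16
```

Changes to these take effect only after the modules are reloaded, which is why such experiments are awkward on a production system like the one described.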
| Comment by Liang Zhen (Inactive) [ 21/Dec/14 ] |
|
Hi Joe, did you try lnet_selftest with all 32 clients and all servers (OSSs), or at least 12 servers? What was the average per-server performance in that case? Also, when you ran IOR on the large system, were the IO requests from the clients spread evenly over all the servers? |