[LU-4518] OSS1 provides less performance than OSS2 Created: 21/Jan/14  Updated: 03/Mar/14  Resolved: 03/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Chakravarthy Nagarajan (Inactive) Assignee: Malcolm Cowe (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

CentOS-6.4
Mellanox OFED-2.0-3.0


Attachments: Zip Archive messages.zip     Text File sgpdd_oss1.txt     Text File sgpdd_oss2.txt    
Severity: 3
Rank (Obsolete): 12360

 Description   

Hi,

We have 24 OSTs load-balanced between the OSS1 and OSS2 servers. We see degraded performance both when the 24 OSTs are load-balanced across OSS1 and OSS2 and when all OSTs are handled by OSS1 alone, but we see 3X the performance when OSS2 handles all of the OSTs.

I've attached the logs from both nodes for your reference. I appreciate your help.



 Comments   
Comment by John Fuchs-Chesney (Inactive) [ 23/Jan/14 ]

Malcolm,
Can you take a look at this please. It sounds like it may be quite specific to the Wipro configuration.

Thanks.

Comment by Malcolm Cowe (Inactive) [ 23/Jan/14 ]

Hi,

Are you able to eliminate hardware as a source of the problem? A difference in performance between OSS1 and OSS2 often indicates a difference in hardware or software. If the two servers are identical in configuration, then there may be a fault in the OSS1 server or in its network or storage connections.

Some low level testing of the storage access and the network interfaces may help to highlight any differences in performance.

Benchmarking low-level disk performance from each OSS can be done with tools such as VDBench or Lustre's sgpdd-survey (from the lustre-iokit package). These are destructive tests and will destroy any existing content on the target storage, but they are effective tools for measuring baseline performance.
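
As a rough illustration, an sgpdd-survey run on each OSS might look like the sketch below. The device names are placeholders for the sg devices behind your OSTs, and the parameter names should be checked against the script shipped with your lustre-iokit version. Again, this overwrites the listed devices, so only run it before the OSTs are formatted or on scratch LUNs.

    # DESTRUCTIVE: overwrites data on the listed devices.
    # /dev/sg2 and /dev/sg3 are placeholders for the sg devices behind the OSTs.
    size=8192 crglo=1 crghi=16 thrlo=1 thrhi=64 \
        scsidevs="/dev/sg2 /dev/sg3" \
        sgpdd-survey

Run it with identical parameters on OSS1 and OSS2 and compare the read and write throughput reported for matching region/thread counts.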

To check the network connection, the ib_read_bw command-line utility provides a very simple and quick test for InfiniBand networks. I would suggest running this between OSS1 and the MDS, as well as between OSS2 and the MDS for comparison. There are more complex tests, such as lnet_selftest, but the ib_read_bw and ib_write_bw tools give a good indication of basic IB point-to-point connection throughput. One can check for errors in the IB HCA counters by using the perfquery command on each node. If you see symbol errors (or other error indicators), then there may be a problem with the IB cable, the HCA (on either the source or target node) or the switch.
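
As a concrete sketch (the hostname mds01 is a placeholder for your actual MDS):

    # On the MDS, start the server side of the bandwidth test:
    ib_read_bw

    # On OSS1, then repeat from OSS2, run the client side against the MDS;
    # start a fresh ib_read_bw server on the MDS for each run:
    ib_read_bw mds01

    # Repeat the same pattern with ib_write_bw for the write direction.

    # On each node, dump the local HCA port counters and look for
    # SymbolErrorCounter, LinkDownedCounter or PortRcvErrors incrementing:
    perfquery

With no arguments, ib_read_bw listens as the server and perfquery reports the counters of the local port; comparing the OSS1 and OSS2 numbers side by side should make any asymmetry obvious.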

You may already have conducted a hardware survey, and we'll continue to review the logs, but since each server should be operating with the same performance it is important to be certain that the hardware is functioning correctly (DIMMs, CPUs, network, storage).

Comment by Chakravarthy Nagarajan (Inactive) [ 23/Jan/14 ]

Hi Malcolm,

Thanks for the response.

I have already covered all of the areas you mentioned except hardware. I found that read performance is lower on OSS1 compared to OSS2. Please find attached the sgpdd-survey results; I would appreciate your input. There is not much difference in the ib_read_bw, ib_write_bw and latency tests, and lnet_selftest does not show any difference from the client to the individual OSS servers either.

Meanwhile, let me check the hardware and get back to you.

Comment by Malcolm Cowe (Inactive) [ 23/Jan/14 ]

Based on your information, it would seem that the logical place to start looking is the connection between OSS1 and the storage controller. Slot placement of the HBA can also be a factor: if the two servers are identical, make sure that the PCIe cards are in the same slots on each machine.
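
One quick way to compare this, as a sketch (the PCI address 0000:05:00.0 is a placeholder; find the real address of the HBA first, and run as root so the link capabilities are shown):

    # Locate the storage HBA (adjust the pattern to match your card):
    lspci | grep -i -E 'fibre|sas|raid'

    # Show the negotiated PCIe link for that address and compare LnkSta
    # (speed and width, e.g. "Speed 8GT/s, Width x8") between OSS1 and OSS2:
    lspci -vv -s 0000:05:00.0 | grep -E 'LnkCap|LnkSta'

A card seated in a slot that negotiates a narrower width or lower speed on OSS1 than on OSS2 would be consistent with the read throughput gap you are seeing.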

Comment by Chakravarthy Nagarajan (Inactive) [ 03/Mar/14 ]

Sorry for the delayed response. This looks to be a hardware issue, so please close this ticket.

Thanks for your help.

Comment by John Fuchs-Chesney (Inactive) [ 03/Mar/14 ]

Marked as resolved per feedback from Nagarajan/Customer.
Thanks,
~ jfc.
