[LU-4518] OSS1 provides lower performance than OSS2 Created: 21/Jan/14 Updated: 03/Mar/14 Resolved: 03/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Chakravarthy Nagarajan (Inactive) | Assignee: | Malcolm Cowe (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS-6.4 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 12360 |
| Description |
|
Hi, We have 24 OSTs load-balanced between the OSS1 and OSS2 servers. Performance is degraded both when the 24 OSTs are load-balanced across OSS1/OSS2 and when all of the OSTs are handled by OSS1 alone, but we see roughly 3X better performance when OSS2 handles all of the OSTs. I've attached the logs from both nodes for your reference. Appreciate your help. |
| Comments |
| Comment by John Fuchs-Chesney (Inactive) [ 23/Jan/14 ] |
|
Malcolm, Thanks. |
| Comment by Malcolm Cowe (Inactive) [ 23/Jan/14 ] |
|
Hi, Are you able to eliminate hardware as a source of the problem? A difference in performance between OSS1 and OSS2 often indicates a difference in hardware or software. If the two servers are identical in configuration, then there may be a fault in the OSS1 server or in its network or storage connections. Some low-level testing of the storage access and of the network interfaces may help to highlight any differences in performance.

Baseline disk performance from each OSS can be measured with applications such as VDBench or Lustre's sgpdd-survey tool. These tests are destructive and will destroy any existing content on the target storage, but they are effective tools for measuring baseline performance.

To check the network connection, the ib_read_bw command line utility provides a very simple and quick test for InfiniBand networks. I would suggest running it between OSS1 and the MDS, as well as between OSS2 and the MDS, for comparison. There are more complex tests, such as lnet_selftest, but ib_read_bw and ib_write_bw give a good indication of basic IB point-to-point throughput. A sketch of these checks is included below.

You can check for errors in the IB HCA counters with the perfquery command on each node. If you see symbol errors (or other error indicators), there may be a problem with the IB cable, the HCA (on either the source or the target node), or the switch.

You may already have conducted a hardware survey, and we'll continue to review the logs, but since each server should be operating at the same performance it is important to be certain that the hardware is functioning correctly (DIMMs, CPUs, network, storage). |
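For reference, a minimal sketch of the low-level checks described above. The device names, hostname, and test parameters are placeholders and should be adjusted to the actual configuration; sgpdd-survey overwrites the listed devices.

    # Baseline raw disk throughput with sgpdd-survey (from lustre-iokit).
    # DESTRUCTIVE: overwrites the devices listed in scsidevs. Run on OSS1, then on OSS2,
    # and compare the reported MB/s for the same region and thread counts.
    size=8192 crghi=16 thrhi=16 scsidevs="/dev/sdb /dev/sdc" sgpdd-survey

    # Point-to-point InfiniBand bandwidth between each OSS and the MDS.
    ib_read_bw                  # on the MDS (acts as the server side)
    ib_read_bw mds-ib0          # on OSS1, then repeat from OSS2; mds-ib0 is a placeholder hostname
    ib_write_bw                 # same pattern for write bandwidth

    # HCA error counters on each node; non-zero symbol or link error counters
    # point at cabling, HCA, or switch faults.
    perfquery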
| Comment by Chakravarthy Nagarajan (Inactive) [ 23/Jan/14 ] |
|
Hi Malcolm, Thanks for the response. I have already covered all of the areas you mentioned except the hardware. I've found that the read performance on OSS1 is lower compared to OSS2. Please find the sgpdd-survey results attached; I would appreciate your input. There is not much difference in the ib_read_bw, ib_write_bw, or latency tests, and lnet_selftest does not show any difference from the client to the individual OSS servers either. Meanwhile, let me check the hardware and get back to you. |
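For completeness, a minimal sketch of an lnet_selftest read run of the kind referred to above, using the lst utility. The NIDs are placeholders; the lnet_selftest module must be loaded on the client, the OSS under test, and the console node, and the same batch should be repeated against OSS1 and OSS2 to compare rates.

    modprobe lnet_selftest                     # on the console node (and on the test nodes)
    export LST_SESSION=$$
    lst new_session read_compare
    lst add_group clients 192.168.1.10@o2ib    # placeholder client NID
    lst add_group servers 192.168.1.21@o2ib    # placeholder OSS NID (repeat with the other OSS)
    lst add_batch bulk_read
    lst add_test --batch bulk_read --from clients --to servers brw read size=1M
    lst run bulk_read
    lst stat servers                           # prints transfer rates; interrupt with Ctrl-C
    lst end_session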
| Comment by Malcolm Cowe (Inactive) [ 23/Jan/14 ] |
|
Based on your information, the logical place to start looking is the connection between OSS1 and the storage controller. Slot placement of the HBA can also be a factor: if the two servers are the same, make sure that the PCIe cards are in the same slots on each machine (see the sketch below). |
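A minimal sketch of how slot placement and the negotiated PCIe link can be compared on the two servers; the PCI address 07:00.0 and the Mellanox vendor string are placeholders for the actual HBA/HCA.

    # Identify the PCI address of the HBA/HCA on each OSS
    lspci | grep -i mellanox

    # Compare the negotiated link speed and width between OSS1 and OSS2;
    # a card that trained at a lower speed or narrower width will show it in LnkSta.
    lspci -s 07:00.0 -vv | grep -E 'LnkCap|LnkSta'

    # Map PCI bus addresses to physical slot designations
    dmidecode -t slot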
| Comment by Chakravarthy Nagarajan (Inactive) [ 03/Mar/14 ] |
|
Sorry for the delayed response; this looks to have been a hardware issue. Please close this ticket. Thanks for your help. |
| Comment by John Fuchs-Chesney (Inactive) [ 03/Mar/14 ] |
|
Marked as resolved per feedback from Nagarajan/Customer. |