[LU-6177] LFSCK 4: namespace LFSCK scalability - Whamcloud Community JIRA

Details

Type: Technical task
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.8.0
Affects Version/s: Lustre 2.7.0
Labels:
None

Rank (Obsolete):
17278

Description

Currently, for a routine namespace LFSCK check with no inconsistencies to repair, the best aggregate performance is achieved with a 4-MDT configuration. As more MDTs are added, performance decreases. This is contrary to our expectations and should be resolved.

Activity

nasf (Inactive) added a comment - 12/Feb/15 1:57 PM

As the number of MDTs increased, the waiting time (the wait for MDT0 described in my earlier comment) increased as well, so the aggregate performance does not scale as expected.

Alex Zhuravlev added a comment - 12/Feb/15 1:36 PM

Even so, that should give us performance multiplied by (#MDTs - 1); it shouldn't stop scaling?

nasf (Inactive) added a comment - 12/Feb/15 1:32 PM - edited

It should, but unfortunately, because of an issue in the test script, the master MDT-object of each striped directory is always created on MDT0, so the object counts on the MDTs are unexpectedly unbalanced.

On the other hand, we should not assume that every MDT has the same processing capability. We still need to adjust the method used to calculate performance.
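A toy model may make the imbalance effect concrete. It is a sketch under loudly stated assumptions, not LFSCK code: a fixed workload is spread evenly over the MDTs, an extra batch of objects (standing in for the striped-directory master objects) all lands on MDT0, and every MDT scans at the same rate, which the comment above cautions against assuming in general. The function names and numbers are hypothetical. Because each MDT's elapsed time runs until MDT0 finishes, the summed per-MDT rate reduces to total objects divided by MDT0's time.

```python
# Toy model, not Lustre code: a fixed workload of `files` objects is spread
# evenly over n MDTs, while `extra_on_mdt0` additional objects (standing in
# for striped-directory master objects) all land on MDT0.
# Assumption: every MDT scans at the same rate (objects per second).


def slowest_mdt_time(files: float, extra_on_mdt0: float, n_mdts: int,
                     rate: float) -> float:
    """Seconds until MDT0, which holds the most objects, ends stage one."""
    return (files / n_mdts + extra_on_mdt0) / rate


def measured_aggregate(files: float, extra_on_mdt0: float, n_mdts: int,
                       rate: float) -> float:
    """Aggregate rate when each MDT's elapsed time includes waiting for MDT0.

    Summing objects / elapsed over all MDTs equals total objects divided by
    the slowest MDT's time, because every MDT shares that elapsed time.
    """
    total = files + extra_on_mdt0
    return total / slowest_mdt_time(files, extra_on_mdt0, n_mdts, rate)


def wait_fraction(files: float, extra_on_mdt0: float, n_mdts: int) -> float:
    """Share of a non-MDT0 MDT's elapsed time spent idle, waiting for MDT0."""
    own = files / n_mdts
    return extra_on_mdt0 / (own + extra_on_mdt0)


if __name__ == "__main__":
    FILES, EXTRA, RATE = 1_000_000, 1_000_000, 10_000  # hypothetical numbers
    for n in (2, 4, 6, 8):
        agg = measured_aggregate(FILES, EXTRA, n, RATE)
        print(f"{n} MDTs: measured {agg:8.0f} obj/s, ideal {n * RATE:6d} obj/s, "
              f"waiting {wait_fraction(FILES, EXTRA, n):.0%}")
```

With these hypothetical numbers the measured aggregate only creeps from about 13,300 objects/s at 2 MDTs to about 17,800 at 8 MDTs, and it can never exceed 20,000 objects/s however many MDTs are added, while the ideal grows from 20,000 to 80,000 objects/s. The MDTs other than MDT0 spend a growing share of their elapsed time waiting (67% at 2 MDTs, 89% at 8).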

Andreas Dilger added a comment - 12/Feb/15 2:13 AM

Shouldn't the number of files per MDT be about the same? Should the test config balance file creation across the MDTs? I thought the top-level directories were spread across all MDTs and then all the files were created in those directories?

nasf (Inactive) added a comment - 11/Feb/15 2:55 AM

The main reason for the poor aggregate namespace LFSCK performance is that the performance calculation method is unsuitable. After studying the test data, I found that MDT0 always scanned more objects than the other MDTs. As a result, the other MDTs had to wait for MDT0 to finish its first-stage scanning, and their measured performance became very low because the long wait for MDT0 was counted against them.

In fact, for each MDT, the real performance should be calculated as the number of scanned objects divided by the scanning time, excluding the time spent waiting after the first-stage scanning. With this new calculation method, the real performance of each MDT is approximately equal. I will make a patch for that and re-test the performance.
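The proposed calculation can be sketched as follows. MDTStageOneStats, its fields, and the idea that aggregate performance is the sum of per-MDT rates are assumptions for illustration, not the actual LFSCK counters or the patch that was eventually made. The only difference between the two rate functions is whether the idle wait after the first-stage scan counts as elapsed time.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class MDTStageOneStats:
    """Hypothetical per-MDT counters for the first-stage namespace scan."""
    name: str
    scanned: int      # objects scanned during the first stage
    scan_time: float  # seconds spent actively scanning
    wait_time: float  # seconds spent idle afterwards, waiting for other MDTs


def _rate(scanned: int, seconds: float, name: str) -> float:
    if seconds <= 0:
        raise ValueError(f"{name}: elapsed time must be positive")
    return scanned / seconds


def rate_including_wait(s: MDTStageOneStats) -> float:
    """Old-style rate: the idle wait after stage one counts as elapsed time."""
    return _rate(s.scanned, s.scan_time + s.wait_time, s.name)


def rate_excluding_wait(s: MDTStageOneStats) -> float:
    """Proposed rate: scanned objects divided by scanning time only."""
    return _rate(s.scanned, s.scan_time, s.name)


def aggregate(stats: Iterable[MDTStageOneStats],
              rate_fn: Callable[[MDTStageOneStats], float]) -> float:
    """Aggregate performance, assumed here to be the sum of per-MDT rates."""
    return sum(rate_fn(s) for s in stats)


if __name__ == "__main__":
    # Hypothetical run: MDT0 holds extra objects; MDT1-MDT3 finish early and wait.
    stats = [MDTStageOneStats("MDT0", 1_250_000, 125.0, 0.0)] + [
        MDTStageOneStats(f"MDT{i}", 250_000, 25.0, 100.0) for i in range(1, 4)
    ]
    for s in stats:
        print(f"{s.name}: {rate_including_wait(s):7.0f} obj/s with wait, "
              f"{rate_excluding_wait(s):7.0f} obj/s without")
    print(f"aggregate: {aggregate(stats, rate_including_wait):.0f} obj/s with wait, "
          f"{aggregate(stats, rate_excluding_wait):.0f} obj/s without")
```

With these made-up numbers, including the wait reports 16,000 objects/s in aggregate (MDT1-MDT3 appear to manage only 2,000 objects/s each). Excluding it reports 40,000 objects/s, with every MDT at about 10,000 objects/s, mirroring the comment's observation that the real per-MDT performance is approximately equal. Because each MDT's rate comes from its own scan time, this method also does not assume that all MDTs have the same processing capability.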

Andreas Dilger added a comment - 29/Jan/15 2:27 AM

I don't think it is only a matter of performance going down after 4 MDTs. The biggest issue is that aggregate performance isn't scaling at all when new MDTs are added. With only a small percentage of cross-MDT and hard-linked objects, most of the MDT namespace scanning should be local to the MDT and the aggregate scanning performance should scale almost linearly with the addition of each MDT.

Since the performance was flat for 2-6 MDTs, either:
- the performance results are actually per-MDT and not aggregate, or
- there is some kind of bottleneck or too much communication between MDTs that is preventing scaling.

People

Assignee: nasf (Inactive)

Reporter: nasf (Inactive)

Votes: 0

Watchers: 6

Dates

Created: 29/Jan/15 1:39 AM

Updated: 01/May/15 3:57 AM

Resolved: 01/May/15 3:57 AM