LFSCK 4: improve LFSCK performance (LU-6361)

[LU-6177] LFSCK 4: namespace LFSCK scalability Created: 29/Jan/15  Updated: 01/May/15  Resolved: 01/May/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.8.0

Type: Technical task Priority: Major
Reporter: nasf (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Rank (Obsolete): 17278

 Description   

Currently, for a routine namespace LFSCK check with no inconsistencies to repair, the best aggregate performance is achieved with a 4-MDT configuration. As more MDTs are added, the performance decreases. This is contrary to our expectation and should be resolved.



 Comments   
Comment by Andreas Dilger [ 29/Jan/15 ]

I don't think it is only a matter of performance going down after 4 MDTs. The biggest issue is that aggregate performance isn't scaling at all when new MDTs are added. With only a small percentage of cross-MDT and hard-linked objects, most of the MDT namespace scanning should be local to the MDT and the aggregate scanning performance should scale almost linearly with the addition of each MDT.

Since the performance was flat for 2-6 MDTs, either:

  • the performance results are actually per-MDT and not aggregate
  • there is some kind of bottleneck or too much communication between MDTs that is preventing scaling.
Comment by nasf (Inactive) [ 11/Feb/15 ]

The main reason for the bad aggregated namespace LFSCK performance is that the performance calculation method is not suitable. After studying the test data, I found that MDT0 always scanned more objects than the other MDTs. That caused the other MDTs to wait for MDT0 to finish its first-stage scanning, so their reported performance became very low because of the long time spent waiting for MDT0.

In fact, for each MDT, the real performance should be calculated as the number of scanned objects divided by the scanning time, excluding the waiting time after the first-stage scanning. With this new calculation method, the real performance of each MDT is approximately equal. I will make a patch for that and re-run the performance tests.
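A minimal sketch of the adjusted calculation (the struct, field, and function names below are illustrative only, not the actual LFSCK data structures):

    /* Illustrative only: per-MDT namespace LFSCK speed, excluding the idle
     * time a slave MDT spends waiting for MDT0 after its own first-stage
     * scan has completed. */
    struct mdt_scan_stats {
            unsigned long long objects_scanned; /* objects checked on this MDT  */
            unsigned long long scan_seconds;    /* time spent actively scanning */
            unsigned long long wait_seconds;    /* idle time waiting for MDT0   */
    };

    /* Old method: objects / (scan_seconds + wait_seconds), which understates
     * the slave MDTs.  New method: count only the time spent scanning. */
    static unsigned long long real_scan_rate(const struct mdt_scan_stats *s)
    {
            return s->scan_seconds ? s->objects_scanned / s->scan_seconds : 0;
    }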

Comment by Andreas Dilger [ 12/Feb/15 ]

Shouldn't the number of files per MDT be about the same? Shouldn't the test config create files in a balanced way? I thought the top-level directories are spread across all MDTs and then all the files are created in those directories.

Comment by nasf (Inactive) [ 12/Feb/15 ]

It should be, but unfortunately, because of a test script issue, the master MDT-object of a striped directory is always created on MDT0, so the object counts on the MDTs are unexpectedly unbalanced.

On the other hand, we should not assume that every MDT has the same processing capability. We still need to adjust the performance calculation method.

Comment by Alex Zhuravlev [ 12/Feb/15 ]

Even so, that should give us performance multiplied by (#MDTs - 1); it shouldn't stop scaling, should it?

Comment by nasf (Inactive) [ 12/Feb/15 ]

As the number of MDTs increased, the waiting time (described above) also increased, so the aggregated performance does not scale as expected.
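A toy model of the effect (purely illustrative, with hypothetical names; this is not LFSCK code): when the aggregate rate is computed against wall-clock time and the wall clock is pinned to the most heavily loaded MDT, adding MDTs mostly adds idle time rather than throughput.

    /* Toy model: aggregate objects/sec when every MDT must wait for the
     * most heavily loaded MDT (MDT0) to finish its first-stage scan. */
    static double model_aggregate_rate(unsigned int n_mdts,
                                       double per_mdt_rate,    /* objects/sec   */
                                       double objs_per_mdt,    /* on each MDT   */
                                       double extra_objs_mdt0) /* MDT0's excess */
    {
            /* Wall-clock time is determined by MDT0, which holds the extra objects. */
            double wall_clock = (objs_per_mdt + extra_objs_mdt0) / per_mdt_rate;
            double total_objs = n_mdts * objs_per_mdt + extra_objs_mdt0;

            /* As the MDT0 excess grows, this flattens instead of scaling. */
            return total_objs / wall_clock;
    }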

Comment by Andreas Dilger [ 12/Feb/15 ]

It should be, but unfortunately, because of a test script issue, the master MDT-object of a striped directory is always created on MDT0, so the object counts on the MDTs are unexpectedly unbalanced.

Is that because all of the striped directories are created at the top level directory (on MDT0)? Otherwise, I would think that the master MDT object should be on the same MDT as the parent directory. If not, I think that is a bug in the DNE code.

Secondly, even if the master MDT object of each striped directory is on MDT0, this should only be a few thousand more objects, but the actual files created inside the striped directories should be balanced evenly across all MDTs, or again this would be a bug in the DNE code.

Comment by nasf (Inactive) [ 17/Feb/15 ]

The striped directories were created under each sub-directory. The master MDT-object of a striped directory should reside on the same MDT as its parent directory, but because of a test script issue it was always created on MDT0. In addition, the test scripts did not handle remote sub-directories properly, which caused the remote sub-directories to also be unbalanced among the MDTs. I have fixed the test scripts so that they are balanced.

Comment by Gerrit Updater [ 09/Mar/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/14014
Subject: LU-6177 lfsck: calculate the phase2 time correctly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: edf9f948ad9f5c86ddf1a891dae8ce0cdde07593

Comment by nasf (Inactive) [ 09/Mar/15 ]

The above patch fixes a serious issue that caused the reported phase2 time to be longer than the time actually used by the second-stage scanning.
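A rough illustration of that kind of accounting error (hypothetical names only; the real fix is in the patch above): if the phase2 duration is measured from the end of this MDT's first-stage scan rather than from the moment second-stage scanning actually begins, the reported phase2 time also includes the waiting period.

    /* Hypothetical illustration, not the actual patch. */
    static unsigned long long phase2_duration(unsigned long long phase1_done,
                                              unsigned long long phase2_start,
                                              unsigned long long phase2_end)
    {
            /* Wrong: phase2_end - phase1_done also counts the time spent
             * waiting for the other MDTs to finish their first-stage scan. */
            (void)phase1_done;

            /* Right: only the second-stage scanning itself. */
            return phase2_end - phase2_start;
    }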

Comment by Gerrit Updater [ 01/May/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14014/
Subject: LU-6177 lfsck: calculate the phase2 time correctly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0f4875343e22bcdfe18708806e172aa234da23a6

Comment by nasf (Inactive) [ 01/May/15 ]

Related patches have been landed on master.
