[LUDOC-244] LFSCK Adjustment Interface documentation is inconsistent
Created: 12/Jun/14  Updated: 11/Nov/14  Resolved: 11/Nov/14

Status: Resolved
Project: Lustre Documentation
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Ryan Haasken
Assignee: nasf (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LUDOC-254 Complete Lustre Manual updates for LF... Closed
Severity: 3
Rank (Obsolete): 14395

 Description   

The LFSCK Adjustment Interface documentation in the Lustre manual (section 28.4.3) has a couple of inconsistencies.

In section 28.4.3.1, the lfsck_speed_limit example shows it being set to 'N', while the table of possible values shows that it should be set to 0 or a positive integer.

The same thing goes for section 28.4.3.2, in which the auto_scrub example shows it being set to 'N', while the table shows possible values of 0 or a positive integer.

I think the integer guidance is correct, but I'll need to verify. Then the examples should be updated to use nonnegative integers instead of 'N'.

Additionally, the two options in the table for auto_scrub also seem to be saying the same thing to me:

"Do not start OI Scrub automatically."

VS.

"Manually start OI Scrub if needed."

Not starting something automatically implies it will need to be started manually, so these are saying the same thing.

Finally, it wouldn't hurt to change the "Synopsis" headings to "Example usage" or something and to update "Mount Options" to "Auto Scrub".



 Comments   
Comment by Ryan Haasken [ 13/Jun/14 ]

Oh, I see. In the synopsis of each command, 'N' is just an arbitrary non-negative integer. I guess I interpreted it as being shorthand for "No" in the auto_scrub case. Perhaps putting that N in a <replaceable> element will make it clearer that it is just a placeholder for a non-negative integer.

However, I think the description of LFSCK behavior when auto_scrub is set to a positive integer should still be updated. When set to a positive integer, this sets dev->no_scrub=0, which allows the OI scrub to be triggered automatically by RPC in osd_fid_lookup(). I would suggest something like "Automatically start OI Scrub if inconsistency detected in OI lookup."
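To make the placeholder reading concrete, here is a minimal sketch of how a value written for auto_scrub could be interpreted. The function name `parse_auto_scrub` is hypothetical and is not part of Lustre; it only illustrates the "0 or a positive integer" rule from the table and shows that a literal 'N' is not a valid value.

```python
def parse_auto_scrub(value: str) -> bool:
    """Return True if OI scrub may be triggered automatically.

    Hypothetical helper for illustration only: 0 disables the automatic
    trigger, any positive integer enables it.
    """
    n = int(value)  # a literal 'N' is not an integer and raises ValueError
    if n < 0:
        raise ValueError("auto_scrub expects 0 or a positive integer")
    return n > 0


if __name__ == "__main__":
    for raw in ("0", "1", "N"):
        try:
            print(raw, "->", parse_auto_scrub(raw))
        except ValueError:
            print(raw, "-> rejected")
```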

Comment by Ryan Haasken [ 13/Jun/14 ]

I'm not entirely clear on "Section 28.4.3.2. Mount options". It seems that the auto_scrub parameter/noscrub mount option will activate an OI scrub in two different situations.

The first situation would be when an MDT file-level restore is detected. The second situation would be when inconsistency is detected during OI lookup. I took these situations from the OI Scrub solution architecture document.

Both the auto_scrub parameter and the noscrub mount option use the same underlying osd_device->no_scrub data, and this data controls whether OI scrub is triggered in either of these situations. Am I understanding this correctly?

If I am understanding this correctly, I think this section should be renamed "Auto scrub", and it should describe the set_param way of setting auto_scrub as well as the mount option way of setting no_scrub. It should describe the two situations in which an OI scrub will be triggered automatically. Can anybody with some LFSCK expertise weigh in on this?

Comment by Ryan Haasken [ 13/Jun/14 ]

Should the noscrub mount option also be described in the section on mount.lustre options (Section 37.15.3)?

I think the manual should also give an example of setting the noscrub mount option in section 28.4.3.2.

Comment by nasf (Inactive) [ 14/Jun/14 ]

As you said, there are two switches that let the administrator control whether OI scrub is triggered automatically when an inconsistent OI mapping or a file-level backup/restore is detected:

1) The lproc parameter "auto_scrub". This is a dynamic interface: the administrator can change it at any time after the system is online. The default value is non-zero, which enables the auto-trigger mechanism. If the administrator sets it to zero via "lctl conf_param" (or "lctl set_param -P"), the change is remembered and stays in effect across mount/umount until the administrator changes it again.

2) The server mount option "-o noscrub". This controls OI scrub at mount time. To guarantee that OI scrub will NOT be triggered automatically during the mount, the administrator can specify "-o noscrub", which forces the "auto_scrub" parameter to zero. It would be better to explain this in the mount.lustre section of the manual.

Comment by Ryan Haasken [ 16/Jun/14 ]

Thank you for the clarification. I have another question for you. If the "noscrub" mount option is set, then will "auto_scrub" remain set to zero while the file system is mounted? Meaning, will automatic scrub be disabled when an inconsistent OI mapping is detected in addition to being disabled at mount time?

Comment by nasf (Inactive) [ 17/Jun/14 ]

Currently, if the "noscrub" mount option is specified, the "auto_scrub" parameter is forced to zero when the OSD processes the mount. Until you next change "auto_scrub" via the lproc interface, it stays zero and OI scrub will not be triggered automatically even if an OI inconsistency is detected. If you want, you can set "auto_scrub" to a non-zero value after the mount, and the OI scrub auto-trigger mechanism will be enabled again.
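The interaction described above can be sketched as a small state model. This is a toy illustration, not Lustre code: the class `OsdDevice` and its methods are hypothetical names, and the only facts taken from this ticket are that a single no_scrub flag backs both switches, that the default enables auto-triggering, and that "-o noscrub" forces auto_scrub to zero at mount. Persistence of settings made with "lctl conf_param" is deliberately not modeled.

```python
class OsdDevice:
    """Toy stand-in for the per-device scrub state discussed in this ticket."""

    def __init__(self, mount_options=()):
        # Default: auto_scrub is non-zero, so automatic triggering is enabled.
        self.no_scrub = False
        if "noscrub" in mount_options:
            # "-o noscrub" forces auto_scrub to zero while the mount is processed.
            self.no_scrub = True

    @property
    def auto_scrub(self) -> int:
        # View of the same flag as the 0 / non-zero parameter value.
        return 0 if self.no_scrub else 1

    def set_auto_scrub(self, n: int) -> None:
        # Models changing the parameter after mount: 0 disables, positive enables.
        if n < 0:
            raise ValueError("auto_scrub expects 0 or a positive integer")
        self.no_scrub = (n == 0)

    def may_auto_trigger(self) -> bool:
        # OI scrub may start automatically only while no_scrub is clear.
        return not self.no_scrub


if __name__ == "__main__":
    dev = OsdDevice(mount_options=("noscrub",))
    print(dev.auto_scrub, dev.may_auto_trigger())
    dev.set_auto_scrub(1)  # re-enable after the mount
    print(dev.auto_scrub, dev.may_auto_trigger())
```

The point of keeping one flag behind both views is that the mount option and the parameter cannot disagree: whichever was applied last decides the behavior.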

Comment by nasf (Inactive) [ 10/Sep/14 ]

Ryan, is there anything else to update in the Lustre manual about LFSCK/OI_scrub? Would you like me to make a documentation patch, or will you do it yourself? Thanks!

Comment by Ryan Haasken [ 11/Sep/14 ]

Yes, thanks for answering my questions. I'll submit a documentation patch shortly, and I would appreciate it if you could review.

Is the noscrub mount option a Lustre server-specific mount.lustre option that should be documented in section 37.15.3?

Comment by Ryan Haasken [ 11/Sep/14 ]

Here is a patch. Can you please review?

http://review.whamcloud.com/#/c/11886/

One other thing that ought to be explained in the Lustre manual is the meaning of phases 1 and 2 within both the namespace LFSCK and the layout LFSCK. Can you please explain? It's not very clear to me from reading the LFSCK design docs either.

Comment by nasf (Inactive) [ 12/Oct/14 ]

Currently, LFSCK uses two-phase scanning to guarantee that all inconsistencies can be handled completely and efficiently.

1) The first-stage scanning
There is an LFSCK main engine on every MDT/OST involved in the LFSCK. During the first-stage scanning, each LFSCK main engine scans its local device via low-layer object-table-based iteration, which uses a linear scanning method and guarantees that all objects related to this server (MDT or OST) are checked. Sometimes, however, the LFSCK cannot know whether an object is inconsistent, or how to repair the inconsistency, until the first-stage scanning has finished. In those cases the LFSCK needs the second-stage scanning.

2) The second-stage scanning
During the first-stage scanning, some uncertain objects are recorded, depending on the LFSCK type.

2.1) For namespace LFSCK, objects with multiple hard links, with multiple linkEA entries, with a remote parent, and so on, are recorded in the namespace LFSCK tracing file. Then, in the second-stage scanning, the namespace LFSCK scans the objects in the tracing file in turn and handles the uncertain inconsistencies.

2.2) For layout LFSCK, the OST-objects that are not referenced by any MDT-object are recorded in a bitmap. When the LFSCK moves to the second-stage scanning, the OST-objects in that bitmap are re-scanned to check whether they really are orphans.
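As a rough illustration of the layout case in 2.2, the two stages can be sketched as below. This is a simplified model, not the LFSCK implementation: the function names, the use of plain integer object IDs, and the idea that the set of known references is still incomplete during the first stage are all assumptions made for the example, and a Python set stands in for the bitmap.

```python
def first_stage(ost_objects, refs_known_so_far):
    """Linear scan of all local objects.

    Records every OST-object that is not (yet) known to be referenced by an
    MDT-object. The returned set plays the role of the bitmap.
    """
    return {obj for obj in ost_objects if obj not in refs_known_so_far}


def second_stage(candidates, all_refs):
    """Re-scan only the recorded candidates.

    An object is treated as a real orphan only if it is still unreferenced
    once the complete set of references is available.
    """
    return {obj for obj in candidates if obj not in all_refs}


if __name__ == "__main__":
    ost_objects = [1, 2, 3, 4]
    refs_during_scan = {1, 2}       # references seen while stage 1 was running
    refs_after_scan = {1, 2, 3}     # complete references once stage 1 finished

    candidates = first_stage(ost_objects, refs_during_scan)
    orphans = second_stage(candidates, refs_after_scan)
    print(sorted(candidates), sorted(orphans))
```

In this toy run object 3 is uncertain after the first stage but is cleared in the second, while object 4 remains unreferenced and is reported as an orphan.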

I am not sure whether this is what you wanted. Please let me know if anything is still unclear.

Comment by Ryan Haasken [ 14/Oct/14 ]

Thanks, that is the explanation I was looking for. I think the explanation of the two phases should be added to the Lustre manual so that the user can understand the meaning of the "[Checked|Updated|Failed] Phase[1|2]" output from the LFSCK status interface. I think it would be appropriate to describe these phases in the "Description" sections of the "LFSCK status of namespace via procfs" and "LFSCK status of layout via procfs" sections.

Comment by Ryan Haasken [ 11/Nov/14 ]

http://review.whamcloud.com/#/c/11886/ has landed, so this ticket can be resolved. I've opened LUDOC-261 to track the separate issue of the missing documentation on the phases of the LFSCK namespace and layout checks.
