[LU-8742] lfsck > 1000 seconds Created: 20/Oct/16  Updated: 05/Mar/17  Resolved: 05/Mar/17

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Cliff White (Inactive) Assignee: nasf (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Soak cluster, tip of master with one patch


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We did a clean reboot of the entire cluster and restarted. After the first MDS failover, we had a very long lfsck session (> 1000 seconds).

2016-10-19 12:16:18,529:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 945s
2016-10-19 12:27:01,370:fsmgmt.fsmgmt:ERROR    lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0
2016-10-19 12:27:01,370:fsmgmt.fsmgmt:ERROR    lfsck found errors lola-8/soaked-MDT0000: repaired_dangling: 0
2016-10-19 12:27:01,370:fsmgmt.fsmgmt:ERROR    lfsck found errors lola-8/soaked-MDT0000: repaired_unmatched_pair: 0
2016-10-19 12:27:01,370:fsmgmt.fsmgmt:ERROR    lfsck found errors lola-8/soaked-MDT0000: repaired_multiple_referenced: 0
2016-10-19 12:27:01,371:fsmgmt.fsmgmt:ERROR    lfsck found errors lola-8/soaked-MDT0000: repaired_orphan: 0
2016-10-19 12:27:01,371:fsmgmt.fsmgmt:ERROR    lfsck found errors lola-8/soaked-MDT0000: repaired_inconsistent_owner: 164693
2016-10-19 12:27:01,371:fsmgmt.fsmgmt:ERROR    lfsck found errors lola-8/soaked-MDT0000: repaired_others: 0

Does the value for 'repaired_inconsistent_owner:' indicate the number of errors? Seems rather a lot, if so.
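For triage, the non-zero counters can be separated from the zero ones with a short script. This is only a sketch, assuming the log lines keep the exact "lfsck found errors node/target: counter: value" shape shown above; the function name and regular expression are illustrative and not part of the soak framework.

```python
import re

# Assumed line shape, taken from the excerpt above:
#   "... lfsck found errors <node>/<target>: <counter>: <value>"
LINE_RE = re.compile(
    r"lfsck found errors (?P<node>[^/\s]+)/(?P<target>[^:\s]+): "
    r"(?P<counter>\w+): (?P<value>\d+)"
)


def nonzero_counters(lines):
    """Return {(target, counter): value} for counters greater than zero."""
    result = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # e.g. the "still in progress" INFO line
        value = int(match.group("value"))
        if value > 0:
            result[(match.group("target"), match.group("counter"))] = value
    return result


SAMPLE = [
    "2016-10-19 12:16:18,529:fsmgmt.fsmgmt:INFO     lfsck still in progress for soaked-MDT0000 after 945s",
    "2016-10-19 12:27:01,370:fsmgmt.fsmgmt:ERROR    lfsck found errors lola-8/soaked-MDT0000: lf_repaired: 0",
    "2016-10-19 12:27:01,371:fsmgmt.fsmgmt:ERROR    lfsck found errors lola-8/soaked-MDT0000: repaired_inconsistent_owner: 164693",
    "2016-10-19 12:27:01,371:fsmgmt.fsmgmt:ERROR    lfsck found errors lola-8/soaked-MDT0000: repaired_others: 0",
]

print(nonzero_counters(SAMPLE))
```

On the excerpt above, only repaired_inconsistent_owner is non-zero; the other lines are logged at ERROR level even though their counters are 0.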



 Comments   
Comment by Joseph Gmitter (Inactive) [ 20/Oct/16 ]

Hi Fan Yong,

Could you please look into this issue?

thanks.
Joe

Comment by nasf (Inactive) [ 25/Oct/16 ]

The "repaired_inconsistent_owner" counter means that the layout LFSCK found and repaired inconsistent owner information between an MDT-object and an OST-object. A count of 164693 means there were 164693 inconsistent pairs, which is usually greater than the number of bad files. Generally, such inconsistent owners are related to chown/chgrp operations: the chown/chgrp operation updates the MDT-side owner in sync mode, then generates OST-side setattr requests that are handled in async mode. When these async OST-side setattr requests are handled depends on system scheduling. In any case, there is an interval between the MDT-side owner change and the OST-side owner change, and the layout LFSCK can detect an inconsistency if it scans during that interval.
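The timing window described above can be illustrated with a toy model. This is not Lustre code: the class, its methods, and the idea that each file has one OST-object per stripe are simplifying assumptions made only to show why a scan between the sync MDT update and the async OST updates sees mismatched pairs, and why the pair count can exceed the file count.

```python
from collections import deque


class ToyCluster:
    """Toy model of sync MDT owner updates and queued OST setattr requests."""

    def __init__(self):
        self.mdt_owner = {}     # file name -> uid stored on the MDT-object
        self.ost_owner = {}     # (file name, stripe index) -> uid on the OST-object
        self.pending = deque()  # queued async OST-side setattr requests

    def create(self, name, uid, stripes):
        self.mdt_owner[name] = uid
        for index in range(stripes):
            self.ost_owner[(name, index)] = uid

    def chown(self, name, uid):
        # MDT side is updated immediately (sync mode).
        self.mdt_owner[name] = uid
        # OST side is only queued here (async mode).
        for key in self.ost_owner:
            if key[0] == name:
                self.pending.append((key, uid))

    def flush(self):
        # Apply all queued OST-side setattr requests.
        while self.pending:
            key, uid = self.pending.popleft()
            self.ost_owner[key] = uid

    def inconsistent_pairs(self):
        # Count MDT-object/OST-object pairs whose owners differ.
        return sum(
            1
            for (name, _), uid in self.ost_owner.items()
            if self.mdt_owner[name] != uid
        )


cluster = ToyCluster()
cluster.create("file_a", uid=1000, stripes=4)
cluster.chown("file_a", uid=2000)
during_window = cluster.inconsistent_pairs()  # scan before the async requests land
cluster.flush()
after_flush = cluster.inconsistent_pairs()    # scan after they have been applied
print(during_window, after_flush)
```

In this model, one chown on a file with four stripes shows up as four inconsistent pairs if the scan runs inside the window, and as none once the queued requests have been applied.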

As for why the LFSCK took more than 1000 seconds, that depends on the number of objects in the system. How did you trigger the LFSCK? Was it triggered automatically, or manually by some command? Would you please show me the output of "lctl get_param -n mdd.${MDT_DEV}.lfsck_namespace" and "lctl get_param -n mdd.${MDT_DEV}.lfsck_layout" on the MDTs? Thanks!
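A small helper can gather both outputs in one go on an MDS. This is a sketch that assumes lctl is on the PATH and that the parameter output is made of "key: value" lines; the helper names are hypothetical, and the sample text below is made up for illustration rather than real lctl output.

```python
import subprocess


def parse_key_values(text):
    """Parse 'key: value' lines into a dict; lines without a colon are skipped."""
    parsed = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            parsed[key.strip()] = value.strip()
    return parsed


def read_lfsck(mdt_dev, kind):
    """Run 'lctl get_param -n mdd.<mdt_dev>.lfsck_<kind>' and parse the result.

    kind is 'namespace' or 'layout', matching the two parameters named above.
    """
    if kind not in ("namespace", "layout"):
        raise ValueError("kind must be 'namespace' or 'layout'")
    param = "mdd.{}.lfsck_{}".format(mdt_dev, kind)
    completed = subprocess.run(
        ["lctl", "get_param", "-n", param],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_key_values(completed.stdout)


# Illustrative input only, not captured from a real system.
EXAMPLE_TEXT = "status: completed\nrepaired_inconsistent_owner: 164693\n"
example = parse_key_values(EXAMPLE_TEXT)
print(example)

# On an MDS, usage would look like:
#   layout = read_lfsck("soaked-MDT0000", "layout")
#   namespace = read_lfsck("soaked-MDT0000", "namespace")
```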

Comment by nasf (Inactive) [ 01/Nov/16 ]

Cliff,

Any feedback/logs for this ticket? Thanks!
