[LU-9670] Advise on e2fsck fixing for OST backend Created: 15/Jun/17  Updated: 25/Aug/17  Resolved: 23/Aug/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: James A Simmons Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre 2.8 servers using ldiskfs in a RHEL6.9 environment.


Attachments: File atlas-oss_e2fsck_output.tgz     File bad_luns.out    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Recently our main production file system experienced an outage which has
required a run of e2fsck on the back-end OSTs. During the run we encountered
some issues which might lead to data loss. We would like to ask Intel engineers
who have a better understanding of the ext4 filesystem to look at the logs
and report back what will be lost and how safe it is to repair. There are
two logs attached to this ticket. One is for raw data and the other, bad_luns.out,
is the one we are most concerned about.



 Comments   
Comment by Peter Jones [ 15/Jun/17 ]

Fan Yong

Could you please advise on this one?

Thanks

Peter

Comment by Andreas Dilger [ 16/Jun/17 ]

James, nothing in the e2fsck output looks especially unusual, if one assumes that the RAID controllers lost their cache. There may be more or fewer things for e2fsck to fix once the journal is replayed (assuming these e2fsck runs were done on a live filesystem).
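
For reference, a read-only pass can be used to gauge the damage before committing to any repair; $OST_dev below is a placeholder. Keep in mind that "e2fsck -n" skips journal replay, so its report can look worse than what a real repair run would actually need to fix:

# read-only dry run: reports problems, writes nothing, does NOT replay the journal
e2fsck -fn $OST_dev
# repair run: replays the journal first, then answers "yes" to the remaining fixes
e2fsck -fy $OST_dev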

Comment by nasf (Inactive) [ 17/Jun/17 ]

I have checked the hundreds of logs. Although there are a lot of inconsistencies, most of them are quota accounting inconsistencies, which are not fatal and will not cause data loss. Furthermore, you can force the quota to be rechecked by disabling and re-enabling it after Lustre is mounted:

lctl conf_param $FSNAME.quota.{mdt,ost}=none
lctl conf_param $FSNAME.quota.{mdt,ost}=ug
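
For reference, conf_param is run on the MGS node. Once quota has been re-enabled and the re-accounting has finished, the result can be spot-checked; the user name and client mount point below are placeholders, and the quota_slave parameter path is from memory, so adjust it to your version:

lctl get_param osd-*.*.quota_slave.info     # on the servers: quota slave status per target
lfs quota -u someuser /mnt/lustre           # on a client: usage and limits for one user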

If you are afraid of losing any data, you can make some checks in advance before letting e2fsck repair anything. For example:

1. Deleted inode XXX has zero dtime.
That means inode XXX has a zero nlink count, but its dtime is not set. You can get the FID-in-LMA via:

debugfs -c -R "stat <XXX>" $OST_dev

With the FID, we can calculate its original namepath on the OST. It is expected that the original namepath has already been removed (because the target OST object was destroyed).

debugfs -c -R "stat $namepath" $OST_dev

But if, unfortunately, the original namepath is still there, then we can use "debugfs sif <XXX> links_count 1", and e2fsck will link the inode back into the original OST namespace. In fact, it is NOT recommended to fix the inconsistency like that manually, because if the original inode really was destroyed and its blocks have been reassigned to other inodes, then recovering it by force may cause blocks to be doubly referenced and lead to data corruption or loss. So it should only be used in some rare, unfortunate corner cases.
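
To illustrate the FID-to-namepath step, here is a minimal sketch for the common case of an object in sequence 0, which lives under O/0/d<objid % 32>/<objid> on the OST (objects from other sequences sit under their own O/<seq> directory); the object id below is a made-up example:

OID=9029                              # object id decoded from the FID in the LMA xattr
namepath="O/0/d$((OID % 32))/$OID"    # -> O/0/d5/9029
debugfs -c -R "stat $namepath" $OST_dev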

2. Inode bitmap differences: -NNN
That means the inode is not in use, but is marked as used in the inode bitmap. If inode NNN was in use before the crash, then it is lost.
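
If you want to see whether such an inode held anything before e2fsck clears the bitmap bit, it can be inspected first; NNN is a placeholder for the inode number reported by e2fsck:

debugfs -c -R "stat <NNN>" $OST_dev     # shows link count, size and timestamps, if anything remains
debugfs -c -R "ncheck NNN" $OST_dev     # tries to resolve the inode number back to a pathname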

3. Block bitmap differences: (NNN-MMM)
That means the blocks are not in use, but are marked as used in the block bitmap. Since no inode references these blocks, just let e2fsck fix the bitmap.
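
If you want extra confidence before the bitmap is fixed, debugfs can report which inode, if any, owns each block; NNN and MMM are placeholders for the block numbers reported by e2fsck:

debugfs -c -R "icheck NNN MMM" $OST_dev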

4. Inode XXX, i_blocks is NNN, should be MMM.
That means i_blocks does not match the real block usage. In such a case, we have to trust the real block usage, even if i_blocks was correct before the crash. Just let e2fsck fix the i_blocks.
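
If you want a record of the pre-repair values, the inode can be dumped first; XXX is a placeholder for the inode number reported by e2fsck:

debugfs -c -R "stat <XXX>" $OST_dev     # the Size: and Blockcount: fields are the current i_size and i_blocks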

5. Inode XXX, i_size is NNN, should be MMM.
That means the i_size is smaller than the real block usage. Trust the real block usage and let e2fsck fix the i_size (the debugfs stat shown above also records the pre-repair size). If the inconsistency was caused by a truncate (shrink) operation that was interrupted by the crash, then some stale data may be recovered at the tail of the file.

...

Anyway, as Andreas commented, most of the inconsistencies may disappear after the journal is replayed.

Comment by nasf (Inactive) [ 07/Aug/17 ]

Is there anything more I can do for this ticket?

Comment by Brad Hoagland (Inactive) [ 18/Aug/17 ]

Hi simmonsja,

Is there anything more you'd like us to do for this ticket?

Regards,

Brad

Comment by Jesse Hanley [ 23/Aug/17 ]

Brad,

We can close this out. Thanks for the help.

Comment by Brad Hoagland (Inactive) [ 23/Aug/17 ]

Thanks, Jesse
