[LU-860] Lustre quota inconsistencies after multiple usages of LU-601 work-around Created: 16/Nov/11  Updated: 05/Jan/12  Resolved: 05/Jan/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Patrick Valentin (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

linux-2.6.32-71.24.1


Attachments: File Client_side     File MDT_side     File OST_side    
Severity: 3
Rank (Obsolete): 6521

 Description   

Hi,

Some users at the CEA site complain about inconsistencies between the "lfs quota -u" and "du -s" reports.

After long investigations, on-site support finally found that the lost file system space is consumed by orphaned objids on OSTs, as a consequence of the LU-601 work-around.
When it was impossible to restart the MDS (it was systematically asserting in "tgt_recov"), the only solution was to mount the volume in ldiskfs mode and rename the PENDING subdirectory.
Now there are several old "PENDING*" directories, and a lot of orphaned objids belonging to FIDs in these directories.
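
For context, the LU-601 work-around mentioned above amounts to something like the following; the device and mount point names are hypothetical:

    # Stop the MDS, then mount the MDT backing device directly as ldiskfs
    mount -t ldiskfs /dev/mdt_device /mnt/mdt-ldiskfs

    # Move the PENDING directory out of the way so recovery no longer trips over it
    mv /mnt/mdt-ldiskfs/PENDING /mnt/mdt-ldiskfs/PENDING.old

    umount /mnt/mdt-ldiskfs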

In order to recover all this lost space, the support team is asking whether it is safe to run "lfsck", or whether they have to build their own tool to parse all OSTs offline and remove all objids that belong to FIDs in PENDING* directories.

Perhaps the PENDING directory was sometimes removed instead of renamed. In this case, is the recovery identical, or is there something else to do?

TIA
Patrick

Below is the support report, and I have also attached the files containing the traces of the commands executed on Client, MDT and OST.

#context: 

Some time ago, a few users started to report Lustre quota inconsistencies between the "lfs quota -u" report and "du -s" over their full hierarchy/sub-tree. "lfs quotacheck" did not fix the inconsistencies.
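
Concretely, the comparison the users made is along these lines (user name and mount point hypothetical):

    # Space charged to the user according to Lustre quota accounting
    lfs quota -u someuser /mnt/lustre

    # Space actually visible under the user's tree in the namespace
    du -sk /mnt/lustre/home/someuser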

#consequences: 
Quotas are unusable and inaccurate for these users, and a (possibly large) amount of filesystem space is consumed by orphaned objids on OSTs.

#details:

The 1st check made was to confirm that the inconsistencies are due to real (and orphaned) filesystem space/block consumption, and not just a bad quota value.

The 2nd step was to identify that the orphaned objids belong to FIDs in the multiple PENDING* directories on the MDS that were moved aside as part of the LU-601 work-around.

See [Client,MDT,OST]_side files showing the details.

So what can we do now to recover all the space/blocks used by the orphaned objids? Can we safely run "lfsck", or do we need to build our own tool to parse all OSTs offline and remove all objids that belong to FIDs in PENDING* directories?



 Comments   
Comment by Johann Lombardi (Inactive) [ 16/Nov/11 ]

I would not recommend running lfsck, which is time-consuming. What about moving the files of all the PENDING* directories back into the namespace and unlinking them again (with the "unlink" command instead of rm, to avoid the stat) from a Lustre client?
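
A rough sketch of that procedure, assuming the PENDING* files have already been moved back into a directory visible in the client namespace (the directory name below is hypothetical):

    # On a Lustre client: unlink every recovered entry.
    # "unlink" avoids the stat() that rm would issue on each file.
    for f in /mnt/lustre/RECOVERED_PENDING/*; do
        unlink "$f"
    done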

Comment by Peter Jones [ 17/Nov/11 ]

Bobi

Could you please comment on this one?

Thanks

Peter

Comment by Zhenyu Xu [ 26/Nov/11 ]

I think manually unlinking the files is a doable makeshift. Meanwhile, we will try to fix the LU-601 issue to make things right, lest this issue happen again.

Comment by Bruno Faccini (Inactive) [ 08/Dec/11 ]

Hello all,

Just a quick comment on the "moving the files of all the PENDING* directories back to the namespace and unlinking them again" proposal from Johann: having a look at the MDT inodes in the PENDING* directories, I have found that they don't have any "lov" EA !!... So how will the unlink process be able to find the associated orphaned objids we want to remove on the OSTs ?

In case that fails, what do you think about mounting all OSTs and removing all objids referring to any PENDING*/<FID> in their "fid" EA ?

And last, ok about "lfsck" being heavily time-consuming, but you did not answer my original question "Can we safely run lfsck ?", and I will make it more precise: "Is lfsck THE tool pushed/supported by WhamCloud to repair Lustre inconsistencies or not ?"

Bruno.
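
For reference, one way the missing "lov" EA could be checked on the ldiskfs-mounted MDT is with getfattr; the device, mount point and file name below are hypothetical:

    # Mount the MDT backing device read-only as ldiskfs
    mount -t ldiskfs -o ro /dev/mdt_device /mnt/mdt-ldiskfs

    # Dump all extended attributes of one entry in a renamed PENDING directory;
    # a striped file would normally show a trusted.lov attribute here
    getfattr -d -m - -e hex /mnt/mdt-ldiskfs/PENDING.old/some_entry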

Comment by Zhenyu Xu [ 08/Dec/11 ]

Yes, it is safe to run lfsck.

       lfsck is used to check and repair the distributed coherency of a Lustre filesystem.
OPTIONS
       -c     Create (empty) missing OST objects referenced by MDS inodes.

       -d     Delete orphaned objects from the filesystem. Since objects on the OST are often only one of
              several stripes of a file it can be difficult to put multiple objects back together into a
              single usable file.

       -h     Print a brief help message.

       -l     Put orphaned objects into a lost+found directory in the root of the filesystem.

       -n     Do not repair the filesystem, just perform a read-only check (default).

       -v     Verbose operation - more verbosity by specifying the option multiple times.

BTW, it would be tedious work if you wanted to mount all OSTs and find all objects that have the intended FID, since the FID info lies in the "fid" EA of the OST objects.
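
To illustrate why that manual approach is tedious, an offline scan of a single OST would have to walk every object and decode its "fid" EA; a rough sketch, assuming the EA is exposed as trusted.fid on the ldiskfs-mounted OST (all paths hypothetical):

    # Mount one OST read-only as ldiskfs and dump each object's fid EA
    mount -t ldiskfs -o ro /dev/ost_device /mnt/ost-ldiskfs
    find /mnt/ost-ldiskfs/O/0 -type f | while read -r obj; do
        fid=$(getfattr -n trusted.fid -e hex --only-values "$obj" 2>/dev/null)
        echo "$obj $fid"
    done > /tmp/ost_fid_map.txt
    # The resulting map would then have to be matched against the FIDs
    # collected from the PENDING* directories on the MDT.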

Comment by Andreas Dilger [ 09/Dec/11 ]

Note that it is possible to generate the lfsck mdsdb and ostdb while the filesystem is mounted, by running:

e2fsck -fn --mdsdb {mdsdb_file} /dev/{mdsdev}

Note the "-n" option here. The database file may be slightly inconsistent (e.g. contain files that were deleted during the run) but the code should handle this internally.

When the databases are created, please run:

lfsck -nv --mdsdb {mdsdb_file} --ostdb {ostdb_file ...} {lustre mountpoint}

(note again the -n here) to ensure that this is doing what you expect it to (e.g. the number of orphaned objects is reasonable compared to the amount of missing space). The -n option will prevent lfsck from actually making any changes to the filesystem. I would recommend checking several of the OST objects that lfsck thinks should be deleted to get their MDS FID. This should be possible with

debugfs -c -R "stat <O/0/d$((objid % 32))/$objid" /dev/{ostdev}

which should print out the filter_fid xattr with the parent MDS inode number, and then running

debugfs -c -R "ncheck $ino1 $ino2 $ino3 ..." /dev/{mdsdev}

to check whether any MDS inodes reference those objects.
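
Pieced together, the spot-check described above might look like this for a single OST object; the object id, inode numbers and device names are placeholders:

    # 1. On the OST backing device: print the object, including its filter_fid
    #    xattr, which carries the parent MDS inode number
    objid=123456
    debugfs -c -R "stat O/0/d$((objid % 32))/$objid" /dev/ost_device

    # 2. On the MDS backing device: resolve the inode numbers reported above
    #    to pathnames, to see whether any MDS inode still references the object
    debugfs -c -R "ncheck 4211 4212 4213" /dev/mds_device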

Comment by Bruno Faccini (Inactive) [ 13/Dec/11 ]

Yeah !! The "back to namespace" + unlink method has been applied to all "PENDING*/*" files/FIDs, and 40TB have been "magically" recovered/freed with the associated quotas corrected !!!

BTW, I am still puzzled about how the FID<->ObjID[s] relation was reconstructed ??

Comment by Johann Lombardi (Inactive) [ 13/Dec/11 ]

That's the magic of Christmas
More seriously, are you sure that the files under PENDING*/* had no LOV EA? Do you still have the output of debugfs against one of those files?

Comment by Bruno Faccini (Inactive) [ 13/Dec/11 ]

I will try to find it in my logs and attach it ...

In fact it took more time/work than expected to get to Christmas !!...

I need to tell the whole story now ... :

_ instead of moving the "PENDING.old*" content back into the namespace, I moved the directories themselves, unlinked all their content/files, and finally ran "rmdir" on all the dirs !!..

_ it took quite some time to understand the following "live" ... This led to a situation where the original/1st PENDING directory was no longer present to satisfy the multiple conditions/controls/attributes (link EA, inode/generation number in the OI database, ...) checked during FS/MDT start/mount !!!!

But now the FS has finally started, and it seems we are beginning to hear some Christmas songs ...
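
For the record, the sequence described above, as seen from a Lustre client once the PENDING.old* directories had been moved back into the visible namespace (directory names hypothetical):

    # Unlink every file in each recovered directory, then remove the directory itself
    for d in /mnt/lustre/PENDING.old*; do
        for f in "$d"/*; do
            unlink "$f"
        done
        rmdir "$d"
    done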

Comment by Peter Jones [ 05/Jan/12 ]

Bull confirms that this issue is now resolved.
