[LU-11291] recovering from LU-10437 Created: 28/Aug/18  Updated: 05/Oct/22  Resolved: 05/Oct/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Jian Yu
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9463 lcme_flags should be printed in comma... Resolved
is related to LU-10437 sanity-pfl test_8: dbench failed Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

We have some files that were created with a 2.11 client and a 2.10.3 server and that hit the LU-10437 bug. We have since updated our server to 2.10.5, but the old files still can't be seen by the user. This is what we get:

on 2.11 clients + updated 2.10.5 server

$ ls -l
ls: cannot access 'test': Invalid argument
total 1040384
-????????? ? ? ? ? ? test

on 2.10.3 clients + updated 2.10.5 server

ls -l test
-rw------- yyyy xxx 8388608 Aug 28 15:44 test

Only root can view them correctly.
Can we recover those files without copying them?



 Comments   
Comment by Mahmoud Hanafi [ 29/Aug/18 ]

Here is what we get in the logs:


[588862.457231] LustreError: 96988:0:(lcommon_cl.c:187:cl_file_inode_init()) Failure to initialize cl object [0x200000bd6:0xc7d1:0x0]: -22
[588862.457245] LustreError: 96988:0:(llite_lib.c:2357:ll_prep_inode()) new_inode -fatal: rc -22
Comment by Andreas Dilger [ 29/Aug/18 ]

If the problem relates to FLR functionality added in 2.11 as indicated in LU-10437, it is possible that running a layout LFSCK on the MDT would detect and correct this problem. However, the FLR support for LFSCK was only recently landed (commit v2_11_53_0-33-g36ba989 patch https://review.whamcloud.com/32705 "LU-10288 lfsck: layout LFSCK for mirrored file") so that functionality is not available in the MDS version you are using, nor in any released version to date.

My recommendation would be to find the inaccessible files with a 2.11 client, and then use "lfs migrate" on a 2.10 client to fix the layout. Depending on what "lfs getstripe -v" reports for such files (e.g. strange lcme_flags) it may be possible to use something like "lfs find /mnt/lustre --comp-count +1 --comp-flags=stale,prefer,offline" to find these files on a 2.10 client directly. Depending on how many files the lfs find operation locates, it may well be faster to migrate them to clear the flags than to wait for a code fix to be developed, tested, and installed on your system.
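
The find-then-migrate workflow above could be scripted along these lines. This is only a sketch in Python: the helper names, the example paths, and the "-c 1" layout option passed to "lfs migrate" are illustrative assumptions rather than a tested procedure, and the commands are only built and printed, not executed.

```python
import shlex


def build_find_command(mount_point, flags=("stale", "prefer", "offline")):
    """Build the 'lfs find' invocation quoted above (it is not executed here)."""
    return ["lfs", "find", mount_point,
            "--comp-count", "+1",
            "--comp-flags=" + ",".join(flags)]


def build_migrate_commands(paths, migrate_opts=("-c", "1")):
    """Build one 'lfs migrate' command per file.

    migrate_opts is an assumed layout choice for illustration; pick the
    layout that is appropriate for the affected files.
    """
    return [["lfs", "migrate", *migrate_opts, path] for path in paths]


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    print(shlex.join(build_find_command("/mnt/lustre")))
    for cmd in build_migrate_commands(["/mnt/lustre/dir/test"]):
        print(shlex.join(cmd))
```

Printing the commands first makes it possible to check how many files are affected before deciding whether migration is practical.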

Jian,
for future use, it would be desirable for "lfs getstripe" to also print out unknown flags in hex form after it has printed all of the known flags, like "init,prefer,0x18c40", so that we have some forward compatibility when new flags are added. Similarly, "lfs find" should be able to search for flags by hex value in addition to named flags for the same reason. That would allow something like "lfs find --comp-flags 0x7fffffe0 ..." to locate any files with flags that we don't currently have assigned. It might be desirable to allow a modified master "lfs --component-set" to clear some of the offending flags directly from the client without doing the migration, but that is not possible for all of the flags (e.g. stale at least). We might consider allowing the stale flag to be cleared from a component if all of the init'd components in the file are stale.
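
As a rough illustration of the proposed behaviour, the Python sketch below prints the known flag names first and then any leftover bits in hex, and parses a flag list that mixes names and hex values. The bit values in the table are placeholders chosen for the example, not the definitions from the Lustre headers, and the function names are hypothetical.

```python
# Placeholder name/bit pairs for illustration only; the real values live in
# the Lustre headers and may differ between releases.
KNOWN_FLAGS = [("init", 0x10), ("stale", 0x02), ("prefer", 0x08), ("offline", 0x04)]


def format_comp_flags(flags):
    """Return known flag names, followed by any unknown bits as one hex value."""
    names = [name for name, bit in KNOWN_FLAGS if flags & bit]
    known_mask = 0
    for _, bit in KNOWN_FLAGS:
        known_mask |= bit
    unknown = flags & ~known_mask
    if unknown:
        names.append(hex(unknown))
    return ",".join(names)


def parse_comp_flags(spec):
    """Turn a comma-separated list of flag names and/or 0x... values into a mask."""
    table = dict(KNOWN_FLAGS)
    mask = 0
    for token in spec.split(","):
        token = token.strip()
        if token in table:
            mask |= table[token]
        elif token.lower().startswith("0x"):
            mask |= int(token, 16)
        else:
            raise ValueError("unknown component flag: " + token)
    return mask


if __name__ == "__main__":
    print(format_comp_flags(0x10 | 0x08 | 0x18c40))
    print(hex(parse_comp_flags("0x7fffffe0")))
```

Requiring the "0x" prefix for numeric values keeps a flag name that happens to look like hex digits from being misread as a number.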

Comment by Jian Yu [ 30/Aug/18 ]

Sure, Andreas, I'll work on these improvements.
