[LU-9836] Issues with 2.10 upgrade and files missing LMAC_FID_ON_OST flag Created: 04/Aug/17 Updated: 31/Oct/18 Resolved: 25/Jan/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.11.0, Lustre 2.10.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Julien Wallior | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | 3.10.0-514.21.1.el7_lustre.x86_64 |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Last weekend, we upgraded our Lustre filesystem from 2.7 to 2.10. After the upgrade, we were missing about 36M objects. After a lot of troubleshooting, we ended up running e2fsck (which recovered the objects into lost+found) and ll_recover_lost_found_objs (which moved them back to their proper place in the ldiskfs filesystem). It's worth noting that lfsck couldn't recover the objects from lost+found (because of some kind of incompatibility between the objects' EA and lfsck; details follow). A couple of remarks: overall, we just wanted to report this on the mailing list in case someone else runs into this issue, and to see whether we should open bugs about 1.a. and 1.b. We were also curious whether anybody had an explanation for how we got there and whether 2. could explain it. This is pretty dense, but overall it reports 3 issues: |
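The exact commands are not recorded in this ticket; a minimal sketch of that recovery sequence, assuming an unmounted OST backend on /dev/sdX (a placeholder device name):

# repair the ldiskfs backend; orphaned objects get reconnected into lost+found
e2fsck -fy /dev/sdX
# mount the backend as ldiskfs and move the objects back under O/<seq>/d<N>/
mount -t ldiskfs /dev/sdX /mnt/ost
ll_recover_lost_found_objs -d /mnt/ost/lost+found
umount /mnt/ost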
|
| Comments |
| Comment by Peter Jones [ 04/Aug/17 ] |
|
Fan Yong, could you please advise on this issue? Julien, can you please confirm the original configuration before the upgrade? Was this RHEL 6.x and vanilla community 2.7 (i.e. no patches)? Which version of e2fsprogs are you using? Thanks, Peter |
| Comment by Julien Wallior [ 04/Aug/17 ] |
|
Before the upgrade we were on Lustre 2.7 with kernel 2.6.32_504.8.1.el6_lustre.x86_64, i.e. vanilla community 2.7.0 + RHEL 6. Currently we have e2fsprogs-1.42.13.wc5-7.el7.x86_64. |
| Comment by nasf (Inactive) [ 08/Aug/17 ] |
If an OST-object is created under Lustre-2.7, it should have the compat flag LMAC_FID_ON_OST; at least that is true in my local test. On the other hand, since some orphans (255) could be recovered from lost+found, those OST-objects must have the LMAC_FID_ON_OST flag. According to your description, it seems that the remaining non-recovered OST-objects have no LMAC_FID_ON_OST flag, right? If so, then it is difficult to explain why some OST-objects have the LMAC_FID_ON_OST flag but others do not, although all of them were created under Lustre-2.7. Julien, would you please show me two OST-objects via debugfs -c -R "stat": one that was recovered by the initial OI scrub and one that was NOT. Thanks! |
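For reference, a sketch of the kind of check being requested here, with a hypothetical device and object path; the decoded lma xattr in the stat output carries the compat flags, and (assuming the usual lma_compat layout) LMAC_FID_ON_OST is the 0x08 bit:

# hypothetical backend device and object path
debugfs -c -R "stat /O/0/d0/100" /dev/sdX
# look for the decoded xattr line in the output, e.g.:
#   lma: fid=[0x100060000:0x49a0:0x0] compat=18 incompat=0
# compat=18 matches the raw bytes "18 00 00 00", i.e. 0x18, which would include
# the assumed LMAC_FID_ON_OST (0x08) bit; per this ticket, the objects that were
# not recovered were the ones missing that flag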
| Comment by Julien Wallior [ 08/Aug/17 ] |
|
I think we are mixing 2 issues. On one filesystem (prod), ldiskfs was corrupted somehow and we had to run e2fsck to recover lost inodes into lost+found. None of the inodes in lost+found had LMAC_FID_ON_OST and none were recovered by initial_OI_scrub. They look like this: Creation time is July 2017 and we were definitely running 2.7 at that time. The second issue was happening on another filesystem (lab). As we figured out what had happened in prod, we tried to reproduce things in the lab to understand them better. One of the experiments we did was: mount the OST as ldiskfs and move ~500 objects from O/<group>/d<mod>/<obj> to lost+found. At that point we didn't know about the LMAC_FID_ON_OST flag, and in retrospect all the objects had it. When starting the Lustre filesystem, initial_OI_scrub claimed to have recovered 255 objects and we could confirm there were only ~250 objects left in lost+found. We tried umount/mount and it would not recover more objects. Let me know if that helps. |
| Comment by nasf (Inactive) [ 08/Aug/17 ] |
|
For the test on the lab system, did you use Lustre-2.7 or Lustre-2.10 when mounting the OST (and then found ~250 objects unrecovered)? |
| Comment by Julien Wallior [ 08/Aug/17 ] |
|
The lab system was running 2.10 (but I can't say whether the files were created with 2.7 or 2.10). |
| Comment by nasf (Inactive) [ 08/Aug/17 ] |
|
What is the output of debugfs "stat /O/0/d0/18848" on the lab OST0004? |
| Comment by Julien Wallior [ 08/Aug/17 ] |
|
I couldn't find that object on OST0004, but I found it on OST0006 if that helps.

9:28 wallior@lstosstestbal801 /proc/fs/lustre% cat osd-ldiskfs/dlustre-OST0004/mntdev
/dev/mapper/801a
9:29 wallior@lstosstestbal801 /proc/fs/lustre% sudo debugfs -c /dev/mapper/801a
debugfs 1.42.13.wc5 (15-Apr-2016)
/dev/mapper/801a: catastrophic mode - not reading inode or group bitmaps
debugfs: stat /O/0/d0/18848
/O/0/d0/18848: File not found by ext2_lookup
9:30 wallior@lstosstestbal801 /proc/fs/lustre% sudo debugfs -c /dev/mapper/801c
debugfs 1.42.13.wc5 (15-Apr-2016)
/dev/mapper/801c: catastrophic mode - not reading inode or group bitmaps
debugfs: stat /O/0/d0/18848
Inode: 19010   Type: regular   Mode: 0666   Flags: 0x80000
Generation: 2666600371   Version: 0x00000004:000074ba
User: 3162   Group: 200   Size: 4194304
File ACL: 0   Directory ACL: 0
Links: 1   Blockcount: 8192
Fragment: Address: 0   Number: 0   Size: 0
ctime: 0x597f2343:00000000 -- Mon Jul 31 08:32:03 2017
atime: 0x00000000:00000000 -- Wed Dec 31 19:00:00 1969
mtime: 0x597f2343:00000000 -- Mon Jul 31 08:32:03 2017
crtime: 0x597f2340:d8d5faec -- Mon Jul 31 08:32:00 2017
Size of extra inode fields: 32
Extended attributes stored in inode body:
  lma = "18 00 00 00 00 00 00 00 00 00 06 00 01 00 00 00 a0 49 00 00 00 00 00 00 a6 13 00 40 04 00 00 00 fc e7 01 00 00 00 01 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " (64)
  lma: fid=[0x100060000:0x49a0:0x0] compat=18 incompat=0
EXTENTS:
(0-1023):28288000-28289023 |
| Comment by nasf (Inactive) [ 08/Aug/17 ] |
|
"File not found" is the expected result; it means that the OI slot on OST0004 for the object (inode #18991) has not been reused by others. |
| Comment by Julien Wallior [ 08/Aug/17 ] |
|
Oh, I see. Sure, that's because we had moved that object by hand in an attempt to recreate the state prod was in after running e2fsck. So initial_OI_scrub() should be moving that file back into place, no? |
| Comment by nasf (Inactive) [ 08/Aug/17 ] |
|
In theory, it should move the entry from /lost+found back to its original OI slot. I am studying the related logic. |
| Comment by nasf (Inactive) [ 09/Aug/17 ] |
|
Julien, I have run some tests locally. First, I created 300 files under Lustre-2.7 + el6.6 with loop devices. Then I remounted the OST as "ldiskfs" and moved all related OST-objects from the "O/0/dN" directories to "lost+found". Then I stopped the Lustre system, copied (scp) the Lustre devices (loop files) to another server, and mounted them as "lustre" under Lustre-2.10 + el7.3. When the OST mounted, all the OST-objects were recovered from lost+found to their original OI slots, so this did not reproduce your trouble. Since you have both the Lustre-2.7 (prod) and Lustre-2.10 (lab) environments, would you please repeat my test (sketched below) and check whether you can reproduce the issue? If you can reproduce it, would you please check whether the OST-objects on the Lustre-2.7 (prod) system have the LMAC_FID_ON_OST flag before upgrading to Lustre-2.10? If you cannot reproduce the issue with my test, would you please show me how (your way) to reproduce it? Thanks! |
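A condensed sketch of that test sequence, with hypothetical mount points, device names, and hosts; the object names under O/0/dN depend on what was created:

# on the Lustre-2.7 OSS, after creating some files from a client:
umount /mnt/ost0                                   # stop the OST target
mount -t ldiskfs /dev/loop0 /mnt/ost0-ldiskfs
mv /mnt/ost0-ldiskfs/O/0/d*/* /mnt/ost0-ldiskfs/lost+found/
umount /mnt/ost0-ldiskfs

# copy the backing file to the Lustre-2.10 server and mount it as lustre;
# the initial OI scrub is expected to move the entries back to their OI slots
scp ost0.img oss210:/srv/
ssh oss210 mount -t lustre -o loop /srv/ost0.img /mnt/ost0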
| Comment by Julien Wallior [ 10/Aug/17 ] |
|
@nasf – |
| Comment by nasf (Inactive) [ 15/Aug/17 ] |
|
About the LMA size: |
| Comment by Tim McMullan [ 15/Aug/17 ] |
|
Hey @nasf, I've been working with Julien on this in the lab, trying to reproduce it in a simpler environment than what I've been testing with previously. I have been able to reproduce the oi_scrub process on mount not recovering all files in a pure 2.10 environment. This is the procedure I've been following (executed in order, starting with everything unmounted): OSS: CLIENT: OSS: CLIENT: |
| Comment by nasf (Inactive) [ 16/Aug/17 ] |
|
mcmult, Thanks for the update. Questions: Thanks! |
| Comment by Tim McMullan [ 17/Aug/17 ] |
|
Sure thing! I've included the debugfs stat output, both in its correct location and in lost+found. Thanks! |
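For anyone repeating that comparison, a sketch of the debugfs queries involved (device, object path, and inode number are placeholders); entries reconnected by e2fsck show up under lost+found as #<inode>, and debugfs can stat an inode directly with the <N> syntax:

debugfs -c -R "stat /O/0/d0/100" /dev/sdX        # object still in its OI slot
debugfs -c -R "ls -l /lost+found" /dev/sdX       # list the reconnected #<inode> entries
debugfs -c -R "stat <19010>" /dev/sdX            # same object, referenced by inode number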
| Comment by nasf (Inactive) [ 17/Aug/17 ] |
|
mcmult, would you please umount the OST, enable -1 level Lustre kernel debug logging on the OST node, and then mount the OST as "lustre"? Collect the Lustre kernel debug logs just after the mount succeeds (see the sketch below) and show me the logs. Thanks! |
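A sketch of one way to capture that, assuming lctl is available on the OSS and /dev/sdX is the OST backend (both placeholders):

lctl set_param debug=-1               # enable all debug masks
lctl set_param debug_mb=1024          # enlarge the debug buffer (optional)
lctl clear                            # drop anything already buffered
mount -t lustre /dev/sdX /mnt/ost0
lctl dk /tmp/ost0_mount.log           # dump the kernel debug log to a file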
| Comment by Tim McMullan [ 17/Aug/17 ] |
|
I've attached the log (800a_mount.log.gz). |
| Comment by Tim McMullan [ 25/Aug/17 ] |
|
Hey @nasf, just wondering if this log has been helpful. I can also generate a log on a fresh instance of Lustre and capture it while the recovery of files stops partway through, if you would like! Thanks!
|
| Comment by nasf (Inactive) [ 28/Aug/17 ] |
00100000:00000001:7.0:1502990563.221005:0:18533:0:(osd_scrub.c:2331:osd_ios_general_scan()) Process entered
00100000:00000001:7.0:1502990563.221042:0:18533:0:(osd_scrub.c:2105:osd_ios_lf_fill()) Process entered
00100000:00000001:7.0:1502990563.221042:0:18533:0:(osd_scrub.c:2109:osd_ios_lf_fill()) Process leaving (rc=0 : 0 : 0)
00100000:00000001:7.0:1502990563.221043:0:18533:0:(osd_scrub.c:2105:osd_ios_lf_fill()) Process entered
00100000:00000001:7.0:1502990563.221044:0:18533:0:(osd_scrub.c:2109:osd_ios_lf_fill()) Process leaving (rc=0 : 0 : 0)
00100000:00000001:7.0:1502990563.221005:0:18533:0:(osd_scrub.c:2331:osd_ios_general_scan()) Process entered

That means /lost+found only contains the '.' and '..' entries, i.e. it is empty. So if you still have the environment with the partly recovered system, would you please show me the output of:

debugfs -c -R "ls /lost+found/" $device

If /lost+found is not empty, then please re-collect the Lustre kernel debug logs as you did in the comment https://jira.hpdd.intel.com/browse/LU-9836?focusedCommentId=205627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-205627 Thanks! |
| Comment by Tim McMullan [ 28/Aug/17 ] |
|
I've uploaded debugfs_mount_logs.tar.gz.
Thanks! |
| Comment by nasf (Inactive) [ 28/Aug/17 ] |
|
It is strange that debugfs shows /lost+found as not empty, yet the readdir() during mount only found the "." and ".." entries. Currently I am not sure what caused such strange behavior, but since debugfs parses the directory with its own logic rather than the general readdir(), I would suggest mounting the device as "ldiskfs" and then double-checking the /lost+found directory (a sketch follows). Thanks! |
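A sketch of that double check, with a placeholder device and mount point:

mount -t ldiskfs /dev/sdX /mnt/ost-ldiskfs
ls -la /mnt/ost-ldiskfs/lost+found      # this goes through the kernel's readdir() path
umount /mnt/ost-ldiskfs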
| Comment by Tim McMullan [ 28/Aug/17 ] |
|
I just mounted it up as ldiskfs, ran ls, and to make things even stranger I am seeing the same set of objects (ls_output.gz). Thanks! |
| Comment by Gerrit Updater [ 28/Aug/17 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/28757 |
| Comment by nasf (Inactive) [ 28/Aug/17 ] |
|
mcmult, honestly, I am not sure why readdir() cannot return name entries from the non-empty lost+found directory. I made a debug patch, 28757. Would you please try it on your current OST image to see whether the items under lost+found can be recovered? Please re-collect the -1 level debug logs on the OST with the patch applied. Thanks! |
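For reference, one way to pull a Gerrit change like 28757 into a lustre-release tree; the project path and patchset number below are assumptions, so take the exact download command from the change's Gerrit page:

git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/57/28757/1
git checkout -b lu9836-debug FETCH_HEAD
# rebuild and reinstall the server packages, then remount the OST and re-collect the logs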
| Comment by Tim McMullan [ 29/Aug/17 ] |
|
I applied the patch and mounted it up; here is the log from the patched mount: 800a_mount_patched.log.0.gz. Thank you! |
| Comment by nasf (Inactive) [ 30/Aug/17 ] |
|
mcmult, are you using a loop device or a real block device for the test? If it is a loop device, how large is it? And is it possible to upload the 'bad' image? |
| Comment by Tim McMullan [ 30/Aug/17 ] |
|
This is on a real block device and is too large to upload reasonably. I will try to recreate it on a loopback device that is small enough to upload here. |
| Comment by Tim McMullan [ 05/Sep/17 ] |
|
I got it to reproduce on a loopback device; I've attached l210_loop_4g.tar.xz. As an interesting note, I had tried to do this with a much smaller disk and fewer objects, and the recovery process worked correctly: we have been seeing it stop around 250-260 objects, yet when I tried moving just 300 objects into lost+found, all of them were recovered successfully. |
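For anyone trying to reproduce this at a similar scale, a sketch of how such a loop-backed OST image can be put together (the fsname, index, MGS NID, size, and paths below are placeholders, not the values used here):

# create a ~4 GB loop-backed OST image; --device-size is in KB
mkfs.lustre --ost --backfstype=ldiskfs --fsname=lustre --index=4 \
    --mgsnode=192.168.1.1@tcp --device-size=4194304 /srv/ost4.img
mount -t lustre -o loop /srv/ost4.img /mnt/ost4
# write ~1024 objects from a client, then repeat the lost+found move/remount test above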
| Comment by nasf (Inactive) [ 11/Sep/17 ] |
|
mcmult, where have you uploaded the image l210_loop_4g.tar.xz to? What is the smallest image size (and the smallest number of files) with which you can reproduce the issue? |
| Comment by Tim McMullan [ 11/Sep/17 ] |
|
Please check again; I had thought the upload finished, but it hadn't. The file should be here now. Sorry about that! In testing I had skipped straight from 512 MB and 300 files to 4 GB and 1024 files, since I knew that would show the issue. Some of our initial tests were done with 512 files, and that also showed the issue. |
| Comment by Gerrit Updater [ 08/Jan/18 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30770 |
| Comment by nasf (Inactive) [ 08/Jan/18 ] |
|
mcmult, thanks for your help. We found the reason why some orphans cannot be recovered. The patch https://review.whamcloud.com/30770 is master-based, but it is also applicable to b2_10. You can verify it when you have time. Thanks! |
| Comment by Tim McMullan [ 19/Jan/18 ] |
|
Thank you for the patch! We were able to test it and it resolved the issue for us. Thanks again! |
| Comment by Gerrit Updater [ 25/Jan/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30770/ |
| Comment by Gerrit Updater [ 25/Jan/18 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31019 |
| Comment by Gerrit Updater [ 09/Feb/18 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31019/ |