[LU-9209] After power outage MDT would not mount with "bad file descriptor" Created: 13/Mar/17 Updated: 31/Mar/17 Resolved: 31/Mar/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Joe Mervini | Assignee: | Brad Hoagland (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL 6.8/Toss 2.5.5, DDN 7700 storage for MDT |
||
| Attachments: |
|
| Severity: | 1 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We had a fairly significant power outage that took down our storage over the weekend, and one of our MDTs would not mount after the system came back up, failing with a "bad file descriptor" error. The target mounted as ldiskfs and looked "normal", but a subsequent lustre mount yielded the same result. An fsck run with the -n option came back clean, but the target still didn't mount; an fsck run with no options also came back clean, but again it wouldn't mount as lustre. Finally an fsck -fy was run and all hell broke loose. Many duplicate inodes, unattached blocks, multiply-claimed blocks, and other errors were encountered. The fsck restarted 3 times and I observed passes that I've never seen before, like Pass 1b, Pass 1c, and a couple dealing with directories. Before it finally completed, it stuffed >93K files into lost+found. I am familiar with using ll_recover_lost_found_objs to recover OST objects, but I don't know what options are available for an MDT. Looking for advice here. Thanks. |
| Comments |
| Comment by Andreas Dilger [ 14/Mar/17 ] |
|
The ll_recover_lost_found_objs tool is only meant to be used on the OST, not the MDT.

The first question to ask is which version of e2fsprogs/e2fsck you have installed. It is always recommended to use the latest Lustre e2fsprogs version, in this case e2fsprogs-1.42.13.wc5. Do you have a log of the e2fsck run to see what problems it was repairing? This would also potentially help repair the files that were moved into lost+found, if they have pathnames printed in the logs. Also, if you have a backup of the MDT device, even an old one, it may help repair affected directories that existed at the time of the backup.

With newer versions of Lustre (2.7+) it is possible to recover files from lost+found using the Lustre LFSCK namespace repair, since each Lustre file also contains its own filename(s) and the parent directory. It would also be possible to move files from the lost+found directory to a (per user?) lost+found under ROOT so that users can access them.

One concern that was raised by LLNL in the past is that the pass 1b/1c shared-blocks repair may result in exposing data from another user into the blocks of the repaired file, depending on which files were sharing the same blocks. LLNL added an e2fsck option to preferentially clear inodes with shared blocks rather than potentially exposing data to another user.

Finally, I see that this is marked severity "4" (low importance), but did you mean to file this as severity "1" (system down)? |
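For reference, a quick way to confirm the installed e2fsprogs build (either command works on a RHEL/TOSS node):
e2fsck -V          # prints the e2fsck/e2fsprogs version string
rpm -q e2fsprogs   # shows the installed package release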
| Comment by Joe Mervini [ 14/Mar/17 ] |
|
Thanks for the feedback Andreas. We are running the latest version, e2fsck 1.42.13.wc5 (15-Apr-2016). Unfortunately it wasn't until after I started the fsck that I realized that I didn't have the process running in a screen session to capture the output. The one thing that I found interesting is that through the first and second pass there were <40 inodes that seemed to be affected, and those were directories that showed up as directory "UUUUUUUUUUUUUU..." followed by ???. It wasn't until the third pass that things went nuts. In addition, the fsck really took most of the day to run. Examining the entries in the lost+found directory, a large majority of those files are zero length, although there are files and directories that do have data in them. I just don't know what to do about those files. My secondary question is what would cause the shared-block condition, and why would this occur as the result of a power failure? Is it possible that these conditions have been lurking for quite some time and it's just that the newer version of e2fsck exposed them? Having >93K files active or in cache at the time of the event seems improbable. I'm guessing that affected upstream directories could be a contributing cause, but it still seems like a lot. Finally - yes, I did mean to mark it severity 1. 50/50/90 rule applies... |
| Comment by Andreas Dilger [ 14/Mar/17 ] |
|
My first suspicion is that there was a write cache on the controller that was lost during the power outage. If this resulted in blocks being overwritten with bad data (i.e. "UUUUUU...") then e2fsck will do its best to turn whatever mess it is given into a working filesystem. If some directories at the start of the filesystem are lost, then it may dump a lot of files into lost+found, but they may or may not be useful.

None of the files on the MDT will have any data in them, so it isn't surprising that the files in lost+found have zero length. One thing to check would be to use debugfs to "stat" some of the files in l+f to see if they have "fid" and "lov" xattrs, in which case they are Lustre files. If they are directories, they might be identifiable from the contents. If they have neither, then they are not Lustre files, though they could be kept around for some time just in case.

You might consider making a "dd" backup of the MDT in case of further problems. I can attach a script that could migrate the files out of lost+found into better locations, but it normally needs to use the e2fsck log to try and recover the pathnames. |
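As a concrete illustration of that debugfs check (the device path and inode number below are placeholders, not values from this system):
# dump an inode and its extended attributes; Lustre files should show
# xattrs such as "lma"/"fid", "link" or "lov" in the output
debugfs -R 'stat /lost+found/#123456' /dev/mapper/mdt_device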
| Comment by Joe Mervini [ 14/Mar/17 ] |
|
Yeah. I realized my mistake with logging after I started the fsck, but I've always been reluctant to stop an fsck for fear of screwing up the file system. If you could supply the script that would be helpful - even if it's just for the educational value. I don't know the first thing about how the MDT is constructed from a file layout perspective. As far as doing the dd, is the suggestion that there might be a potential hardware issue? |
| Comment by Andreas Dilger [ 14/Mar/17 ] |
|
The "dd" is for belt-and-suspenders reasons. If there is a hardware problem, at least you will save the current state of the MDT as it is now. If there are problems with our recovery actions or other e2fsck issues, then we will have a fallback. In general, I recommend to make a "dd" backup of the MDT at least every few days, to two alternate devices, since the MDT is small enough to back up completely, yet critical for proper operation. While using an LVM snapshot as the source is best from a consistency POV, even using the raw MDT device while it is mounted and in use would give you a reasonable backup that e2fsck can repair, and would be valuable in situations like this to be able to reconstruct the pathnames of files based on their inode numbers (for files that exist in the backup at least). Since the "dd" backup is doing linear reads of the MDT, and linear writes to the backup device, there is relatively low impact to MDT performance and the backup storage device does not need to be expensive (e.g. a couple of 4TB (or whatever) SATA disks with enough capacity to hold a full copy of the MDT). At 150MB/s this might take 4h to dump the whole MDT, which is reasonable for a once-a-week operation. |
| Comment by Andreas Dilger [ 14/Mar/17 ] |
|
Joe, I've attached the ll_fix_mdt_lost_found.sh script. It normally depends on having an e2fsck log to help it along. Without the e2fsck logfile to provide filenames (which is possible with certain types of corruption), it will essentially just move files from the ldiskfs lost+found/#ino to ROOT/home/$user/lost+found/[fid], if the files have a Lustre fid xattr available, otherwise they will be left in the ldiskfs lost+found. The script is not intended as a solution to all kinds of problems, it was written for a specific incident where the MDT lost a number of directories but the filenames were in the e2fsck log file. It would be useful to know if the files in lost+found even have the Lustre fid xattr available, otherwise they may just be trash that e2fsck resurrected from random data written to the filesystem. This can be done with the debugfs "stat" command, as described previously, or if the MDT is mounted as type ldiskfs using "lfs path2fid /mnt/mdt/lost+found/#ino" for a random sample of inodes. |
| Comment by Joe Mervini [ 14/Mar/17 ] |
|
Andreas - Thanks for providing the script. I had been thinking that I wanted to do everything possible to try and recover the lost+found files prior to attempting to mount lustre, but Steve Monk suggested mounting the file system as lustre while there was no IO before trying anything heroic. Bad news... I am still getting "mount.lustre: mount <device> failed: Bad file descriptor". I can still mount the device as ldiskfs. Where do I go from here? |
| Comment by Andreas Dilger [ 14/Mar/17 ] |
|
Please attach the /var/log/messages or dmesg output from your mount attempt. That will hopefully tell us what is still wrong with the filesystem at the Lustre level, and it may be as straightforward as deleting some of the Lustre log files. |
| Comment by Joe Mervini [ 14/Mar/17 ] |
|
Naturally this is on a classified system, so I have to type it in... From syslog:
Lustre: Lustre: Build Version: -9chaos-CHANGED-2.6.32-642.6.2.1.chaos.ch5.5.x86_64
dmesg produced the same output. |
| Comment by Joe Mervini [ 14/Mar/17 ] |
|
Note that this is the same error message I got yesterday prior to running fsck. |
| Comment by Andreas Dilger [ 14/Mar/17 ] |
|
If the OI files are corrupted, they can be deleted from the filesystem mounted via ldiskfs (I'd suggest copying them somewhere like /root first, as a backup) and then OI scrub will rebuild them automatically upon mount. That includes the files oi.16.NN, OI_scrub, lfsck_namespace, lfsck_layout, CATALOGS. That may take some time, depending on how many files are on the MDT (estimate 75-100k files/sec, about 3.5h per billion files). You can monitor progress via lctl get_param osd-ldiskfs.*.oi_scrub on the MDS. The filesystem could be used during this time, but access to files that are not yet restored to the OI files may block, and this would put more load on the MDS, so it is best to limit it to light (recovery investigation) usage during this time if possible. |
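A sketch of that sequence with placeholder device and mount-point names (the exact set of files present can differ by Lustre version):
mount -t ldiskfs /dev/mapper/mdt_device /mnt/mdt_ldiskfs
mkdir -p /root/oi_backup
# keep copies of everything before deleting it
cp -a /mnt/mdt_ldiskfs/oi.16.* /mnt/mdt_ldiskfs/OI_scrub /mnt/mdt_ldiskfs/lfsck_namespace /mnt/mdt_ldiskfs/CATALOGS /root/oi_backup/
rm -f /mnt/mdt_ldiskfs/oi.16.* /mnt/mdt_ldiskfs/OI_scrub /mnt/mdt_ldiskfs/lfsck_namespace /mnt/mdt_ldiskfs/CATALOGS
umount /mnt/mdt_ldiskfs
# after remounting as type lustre, watch the rebuild progress:
lctl get_param osd-ldiskfs.*.oi_scrub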
| Comment by Andreas Dilger [ 14/Mar/17 ] |
|
If OI scrub does not start automatically when the MDS is mounted as type lustre, you can start it manually via lctl lfsck_start -M hscratch-MDT0000 -t scrub -r. It is recommended to enable LFSCK logging to the console to get updates on what LFSCK is doing, via lctl set_param printk=+lfsck. |
| Comment by Joe Mervini [ 14/Mar/17 ] |
|
That's great news! So to be clear - after copying - I can safely remove oi.16.0-63, OI_scrub, CATALOGS, lfsck_namespace? Note that I do not have an lfsck_layout file on the ldiskfs-mounted device. (Sorry if I'm being paranoid, but I don't want to lose the file system.) |
| Comment by Andreas Dilger [ 14/Mar/17 ] |
|
The lfsck_layout file is only used by newer versions of Lustre (2.7+), so it is fine if it is missing. The OI_scrub and lfsck_namespace files are internal log files for LFSCK and will be re-created. The oi.16.* files are what is giving you grief right now, and they will be rebuilt by LFSCK. The CATALOGS file is the list of Lustre recovery logs of unlink and setattr operations; at worst, removing it will result in some OST objects not being unlinked, and they will be cleaned up in the future when you upgrade to Lustre 2.7 with LFSCK layout support. It isn't strictly necessary to remove CATALOGS at this point, but it is likely to also have suffered some corruption, and removing it now just saves us a bit of time. |
| Comment by Joe Mervini [ 14/Mar/17 ] |
|
Does the device need to remain mounted as ldiskfs? I unmounted it after removing the above files, and when I tried to mount via lustre it responded with:
mount.lustre: mount <device> at <mnt-pt> failed: Operation now in progress
dmesg and syslog messages:
hscratch-MDT0000: trigger OI scrub by RPC for (0x1:0x76ca:0x0), rc = 0 (1)
It kind of looks like the OSTs need to be mounted as well for this to proceed? |
| Comment by Joe Mervini [ 14/Mar/17 ] |
|
tunefs.lustre looks normal and specifies both MGS and MDT. |
| Comment by nasf (Inactive) [ 15/Mar/17 ] |
|
The OI scrub runs under "lustre" mode, not "ldiskfs" mode, so we need to mount the MDT device as "lustre". It is suggested to avoid RPCs during the OI scrub, so mounting only the MDT is enough - neither OSTs nor clients are needed. Your latest mount failure seems related to CATALOGS. Have you removed the CATALOGS file as suggested by Andreas in his earlier comment? (It is suggested to back it up outside of Lustre before removing it.) |
| Comment by Joe Mervini [ 15/Mar/17 ] |
|
I thought I had but that will be the first check I'll make in the morning. |
| Comment by Andreas Dilger [ 15/Mar/17 ] |
|
Fan Yong, it definitely looks like it was mounted as type lustre based on the messages. The error -115 = -EINPROGRESS is caused by OI Scrub rebuilding the OI files. It seems that osp_sync_init() is trying to open the LLOG_CATALOGS_OID file by FID instead of by name. Is there some way to bypass the OID lookup using osd_lf_maps in ldiskfs or oids in ZFS (these should really be named the same) to avoid the OI files during early startup? |
| Comment by nasf (Inactive) [ 15/Mar/17 ] |
The issue has already been resolved on master via the patch: Joe, which release are you using? I can backport the patch for you. You can also do that yourself, since it is a small patch. |
| Comment by Joe Mervini [ 15/Mar/17 ] |
|
I had removed the CATALOGS file yesterday, but removed it again today and got similar results to yesterday, where the process began but exited. This time I'm getting error -17 as opposed to -115. We are running a 2.5.5 release of Lustre based on the 2.6.32-642.6.2.1 kernel, if that is what you mean. We are running TOSS, although we did modify our build to include the Intel e2fsprogs. |
| Comment by nasf (Inactive) [ 15/Mar/17 ] |
|
Here is the patch to resolve CATALOGS trouble. Please try. Thanks! |
| Comment by Joe Mervini [ 15/Mar/17 ] |
|
The site isn't accepting my Intel HPDD password (I'm assuming it's the same as my JIRA login) and when I click on "Forgot my password" it says "Unable to reset your password at this time. Please try again later." |
| Comment by Andreas Dilger [ 15/Mar/17 ] |
|
Joe, is there a reason you are not running a long-term maintenance release like EE or 2.5FE? The patch from Fan Yong is based on the b2_5_fe branch, which may be why you can't access it? I can move it over to the b2_5 branch so it is visible to you. |
| Comment by Andreas Dilger [ 15/Mar/17 ] |
|
It looks like Fan Yong's patch is the same as https://review.whamcloud.com/8354 but on the b2_5_fe branch. This was landed to master as v2_5_52_0-56-g907b31c so it would already be included in Lustre 2.6 and later releases. |
| Comment by Joe Mervini [ 15/Mar/17 ] |
|
Thanks Andreas. I'll keep you posted. |
| Comment by Joe Mervini [ 16/Mar/17 ] |
|
The patch was applied using rpmbuild but I am still getting identical results as the first time that I removed the files that Andreas specified. |
| Comment by Joe Mervini [ 16/Mar/17 ] |
|
One thing a coworker mentioned the other day was that when he had issues he'd always run fsck, but he also said that he'd turn off quotas. I have noticed that there are always quota messages when fsck completes, but we've never paid them much attention. Don't know if there's any relationship, but thought I'd mention it anyway. |
| Comment by nasf (Inactive) [ 17/Mar/17 ] |
It is not CATALOGS, but some special llog object. Would you please enable -1 level Lustre kernel debug logs on the MDS and remount the MDT, then collect the Lustre kernel debug logs when the mount fails? Thanks! |
| Comment by Joe Mervini [ 17/Mar/17 ] |
|
Are you talking about setting the lnet.debug level to 1? If not, can you please be specific? I don't see any reference in the Lustre manual to setting kernel debug levels to a numeric value except for lnet. |
| Comment by nasf (Inactive) [ 18/Mar/17 ] |
|
You can enable '-1' level Lustre kernel debug log via "lctl set_param debug=-1" or "echo -1 > /proc/sys/lnet/debug". |
| Comment by Joe Mervini [ 20/Mar/17 ] |
|
I just ran the mount command with debug level set to -1. Here is the output:
LDISKFS-fs (dm-1) mounted filesystem with ordered data mode. quota=on Opts:
NOTE: hscratch-OST003f is a valid OST. |
| Comment by Andreas Dilger [ 20/Mar/17 ] |
|
Hi Joe, I couldn't find the debug log from your test. When you run the MDS mount test and the mount fails, you need to dump the kernel debug log and attach it here:
mds# lctl set_param debug=-1 |
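A hedged sketch of the full capture sequence, consistent with the steps Fan Yong spells out later in this ticket (device and path names are placeholders):
mds# lctl set_param debug=-1
mds# mount -t lustre /dev/mapper/mdt_device /mnt/mdt
mds# lctl dk > /tmp/lustre-debug.log   # dump the kernel debug buffer immediately after the failure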
| Comment by Joe Mervini [ 20/Mar/17 ] |
|
Hi Andreas. Since this is on a classified system, the kernel dump is pending approval to downgrade. Is there anything in particular that I can look for in the meantime? The file is ~157MB, and if I can trim it to messages related to the problem it might speed up the process. |
| Comment by nasf (Inactive) [ 21/Mar/17 ] |
There seems to be some stale information to be cleared after the previous mount failure. |
| Comment by Joe Mervini [ 21/Mar/17 ] |
|
Should I remove the oi.16.*, OI_SCRUB, CATALOGS and lfsck_namespace files again prior to the reboot/mount attempt? |
| Comment by nasf (Inactive) [ 21/Mar/17 ] |
|
No, a reboot alone is enough. Then load the Lustre modules and enable the -1 level Lustre debug as above, and then try to mount the MDT and collect the failure logs. |
| Comment by Joe Mervini [ 21/Mar/17 ] |
|
I have grabbed the kernel dump as well as syslog and dmesg and it's in the process of being reviewed. Are there any other log files that may be needed? |
| Comment by nasf (Inactive) [ 21/Mar/17 ] |
|
If it still fails with -115, then these logs are enough. Please attach the logs. Thanks! |
| Comment by Joe Mervini [ 21/Mar/17 ] |
|
Yes - it still failed with -115. Hopefully the files will be available shortly. |
| Comment by Joe Mervini [ 21/Mar/17 ] |
|
Just uploaded logs and kernel dump. |
| Comment by nasf (Inactive) [ 22/Mar/17 ] |
|
I searched the Lustre debug logs for the mount failure "(osp_sync.c:1042:osp_sync_init()) hscratch-OST003f-osc-MDT0000: can't initialize llog: rc = -115"; unfortunately, the whole log only contains the following information with the string "osp":

# grep osp hmds1_kernel_dump.032117
17326:00000020:00001000:18.0:1490112646.289647:0:6620:0:(lprocfs_status.c:366:__lprocfs_add_vars()) cur_root=osp, cur=num_refs, next=(null), (new)
17335:00000004:00000001:25.0:1490112646.291380:0:6621:0:(osp_precreate.c:862:osp_precreate_thread()) Process entered
17352:00000004:00000010:25.0:1490112646.291446:0:6621:0:(osp_dev.c:1156:osp_key_init()) kmalloced 'value': 568 at ffff88204ed63400.
17368:00000004:00000010:25.0:1490112646.389151:0:6621:0:(osp_dev.c:1156:osp_key_fini()) kfreed 'info': 568 at ffff88204ed63400.
17370:00000004:00000001:25.0:1490112646.389158:0:6621:0:(osp_precreate.c:977:osp_precreate_thread()) Process leaving (rc=0 : 0 : 0)

So the Lustre kernel debug log seems incomplete or overwritten. Would you please re-collect the logs on the MDT as follows:

1) reboot the MDS and load the Lustre modules
2) lctl set_param debug=-1 debug_mb=xxxx
   The "xxxx" is the kernel log buffer size in MB; 1024 or larger is suggested if your MDS node has enough RAM.
3) mount -t lustre $MDT_device $MNT_point
4) lctl dk > /tmp/lustre.log

Please run the "lctl dk" command as quickly as possible after the mount failure in step 4). Thanks! |
| Comment by Joe Mervini [ 22/Mar/17 ] |
|
Before I go through the process of getting the kernel dump so I can send it to you (it is not a simple process), I'd like to make sure that I completely understand what you are instructing me to do. After a reboot I start LNET so I'm able to set the debug level to -1. I then mount the device using mount -t lustre <device> <mount-point>. Then I run lctl dk > <output file>. The only change from what I did yesterday is that I add debug_mb=<some value>. I have 128GB of RAM in the machine. It is diskless, but I am saving the output to an NFS-mounted file system. What value should I use for debug_mb? Also, my LNET configuration includes lots of routes and networks. Since I am working on a dummy OS image, should I restrict the LNET config to only the local network without routes? |
| Comment by nasf (Inactive) [ 22/Mar/17 ] |
|
I think it is enough to set debug_mb to 1024 for collecting the mount logs. As for the network configuration, it will not affect the MDT mount; I assume that your MGS and MDT0 are combined in the same device. |
| Comment by Joe Mervini [ 22/Mar/17 ] |
|
Is the only thing that you are looking for osp-related? I noticed that from your previous message. If so, I can grep and copy that verbatim to the ticket, which will speed the process considerably. |
| Comment by nasf (Inactive) [ 22/Mar/17 ] |
|
No, I also searched for more info, such as the mount entry, but it is not in the log either. Because dmesg reported the "osp_sync_init" failure, it should be in the log. So please send me the original logs. Thanks! |
| Comment by Joe Mervini [ 22/Mar/17 ] |
|
Ok - I have got the files in the queue for download. As soon as I get them I'll send them on. FWIW there is significantly more osp information in the new dump. |
| Comment by Joe Mervini [ 22/Mar/17 ] |
|
The most recent kernel dump and log files have just been uploaded. |
| Comment by nasf (Inactive) [ 23/Mar/17 ] |
00000040:00000001:0.0:1490195709.394118:0:6654:0:(llog.c:971:llog_open()) Process entered
00000040:00000010:0.0:1490195709.394118:0:6654:0:(llog.c:66:llog_alloc_handle()) kmalloced 'loghandle': 184 at ffff88100ab60cc0.
00000040:00000001:0.0:1490195709.394119:0:6654:0:(llog_osd.c:955:llog_osd_open()) Process entered
00000020:00000001:0.0:1490195709.394119:0:6654:0:(local_storage.c:146:ls_device_get()) Process entered
00000020:00000001:0.0:1490195709.394120:0:6654:0:(local_storage.c:151:ls_device_get()) Process leaving via out_ls (rc=18446612201349379520 : -131872360172096 : 0xffff881012d0f5c0)
00000020:00000001:0.0:1490195709.394120:0:6654:0:(local_storage.c:173:ls_device_get()) Process leaving (rc=18446612201349379520 : -131872360172096 : ffff881012d0f5c0)
00000020:00000001:0.0:1490195709.394121:0:6654:0:(lustre_fid.h:719:fid_flatten32()) Process leaving (rc=4268758474 : 4268758474 : fe7015ca)
00000020:00000001:0.0:1490195709.394122:0:6654:0:(lu_object.c:242:lu_object_alloc()) Process entered
00000020:00000010:0.0:1490195709.394123:0:6654:0:(local_storage.c:86:ls_object_alloc()) kmalloced 'o': 152 at ffff88100ab60c00.
00000020:00000001:0.0:1490195709.394123:0:6654:0:(local_storage.c:48:ls_object_init()) Process entered
00000004:00000010:0.0:1490195709.394124:0:6654:0:(osd_handler.c:176:osd_object_alloc()) kmalloced 'mo': 176 at ffff88100ab60b40.
00000020:00000001:0.0:1490195709.394124:0:6654:0:(local_storage.c:58:ls_object_init()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:0.0:1490195709.394125:0:6654:0:(osd_handler.c:507:osd_fid_lookup()) Process entered
00000004:00000001:0.0:1490195709.394125:0:6654:0:(osd_oi.c:497:fid_is_on_ost()) Process entered
00000004:00000001:0.0:1490195709.394125:0:6654:0:(osd_oi.c:505:fid_is_on_ost()) Process leaving (rc=0 : 0 : 0)
00000001:00000001:0.0:1490195709.394126:0:6654:0:(osd_compat.c:886:osd_obj_map_lookup()) Process entered
00000001:00000001:0.0:1490195709.394126:0:6654:0:(osd_compat.c:822:osd_seq_load()) Process entered
00000001:00000001:0.0:1490195709.394127:0:6654:0:(osd_compat.c:830:osd_seq_load()) Process leaving (rc=18446612202096913600 : -131871612638016 : ffff88103f5f6cc0)
00000001:00000001:0.0:1490195709.394162:0:6654:0:(osd_compat.c:931:osd_obj_map_lookup()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:0.0:1490195709.394163:0:6654:0:(osd_handler.c:302:osd_iget_check()) Process entered
00000004:00000001:0.0:1490195709.394163:0:6654:0:(osd_handler.c:394:osd_iget_check()) Process leaving via put (rc=0 : 0 : 0x0)
00000004:00000001:0.0:1490195709.394164:0:6654:0:(osd_handler.c:443:osd_check_lma()) Process entered
00000004:00000002:0.0:1490195709.394164:0:6654:0:(osd_handler.c:484:osd_check_lma()) hscratch-MDT0000: FID [0x1:0x76ca:0x0] != self_fid [0x1:0x77b7:0x0]
00000004:00000001:0.0:1490195709.394165:0:6654:0:(osd_handler.c:488:osd_check_lma()) Process leaving (rc=18446744073709551538 : -78 : ffffffffffffffb2)

It seems that the LAST_ID file for the nameless sequence is corrupted. That means osp_sync_llog_init() tried to generate a new FID [0x1:0x76ca:0x0] in the nameless sequence for the llog, but that FID is already in the OI, and the inode (for the related OI mapping) claims another FID [0x1:0x77b7:0x0]. Currently, I do not know why the LAST_ID is corrupted. But to move things ahead, we can fix the LAST_ID file manually with some large value, such as 0x10000.

1) mount the MDT device as "ldiskfs" |
| Comment by Joe Mervini [ 23/Mar/17 ] |
|
Could you be more explicit with regard to step 3? I'm assuming it would be similar to the discussion of fixing the LAST_ID in the manual but I don't want to make any mistakes. |
| Comment by nasf (Inactive) [ 23/Mar/17 ] |
|
I just uploaded an "LAST_ID.new" file, in which the value is 0x10000, that can be used to replace the corrupted "$MNT_point/O/1/LAST_ID":

3) cp -f LAST_ID.new $MNT_point/O/1/LAST_ID |
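Putting the two comments together, a hedged sketch of the replacement, with placeholder device and mount-point names and the original file kept aside first:
mount -t ldiskfs /dev/mapper/mdt_device /mnt/mdt_ldiskfs
cp /mnt/mdt_ldiskfs/O/1/LAST_ID /root/LAST_ID.corrupted   # preserve the old copy, just in case
cp -f LAST_ID.new /mnt/mdt_ldiskfs/O/1/LAST_ID
umount /mnt/mdt_ldiskfs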
| Comment by Joe Mervini [ 23/Mar/17 ] |
|
Fan - Thank you for the file. I found resources to dd the original drive and started the dd process yesterday. I really wanted to have a backup of the original before I did anything else. This might be a question for Andreas, but I wanted to know if "dd if=<MDT> bs=1M | ssh <remote-node> dd of=<backup device> bs=1M" is a proper procedure. The remote node is connected to the QDR IB fabric and I am using IPoIB for the transfer, but I am only seeing ~25MB/s transfer rates. Is setting the block size to 1M out of bounds for an MDT? |
| Comment by Andreas Dilger [ 23/Mar/17 ] |
|
Using ssh for the transfer is encrypting, and possibly compressing, the data over the network. That would be visible as the SSH thread using 100% CPU on the MDS or the remote node. It would be possible to use a more lightweight cipher (e.g. -c blowfish-cbc, if you don't have hardware AES support) to possibly speed this up, or attach a SATA (even USB?) drive locally to the MDS to do the backup. It wouldn't be a bad idea to have a couple of locally-attached drives to make MDS "dd" backups on a regular basis, though note that you would want to change the disk label (via e2label /dev/sdX hscratch-MDT0bak or similar) if you are automatically locating the filesystem via its label. |
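For illustration, the same pipeline with a lighter cipher (device and host names are placeholders; whether blowfish-cbc is available depends on the installed OpenSSH):
dd if=/dev/mapper/mdt_device bs=1M | ssh -c blowfish-cbc backupnode 'dd of=/dev/sdx bs=1M'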
| Comment by Andreas Dilger [ 23/Mar/17 ] |
|
I just saw the following on the zfs-discuss mailing list:
|
| Comment by Joe Mervini [ 24/Mar/17 ] |
|
Andreas - thanks a lot for the feedback. Since I was getting better than 10Gb/s across the link with iperf, it didn't make any sense that my performance was so far off. The bottleneck explanation fits. I'm going to give hpn-ssh a go on our testbed. The prime reason I wanted to do backups to a remote system is to remove any potential for mistakes with local devices. I'm a little paranoid that way. |
| Comment by Joe Mervini [ 27/Mar/17 ] |
|
I was able to get the file system operational again on Saturday thanks to the new LAST_ID file you provided. (I added a comment to the ticket then, but for some reason it didn't post.) I still have the issue of the large number of files in lost+found, but since I did not capture the output from the fsck I was unable to use the script to relocate those files. Is it possible to simply move the contents of lost+found to a new directory under /ROOT that would be visible to lustre? I'm mostly interested in the directories, only because there is human-readable information there. In any event, thank you so much for your help with this. |
| Comment by nasf (Inactive) [ 27/Mar/17 ] |
|
I assume you have already run OI scrub to rebuild your OI mappings, right? If yes, let's go ahead.

The items under the backend /lost+found are generated by e2fsck as orphans. Since Lustre 2.7, you can run namespace LFSCK to recover MDT orphans back to their original namespace. But for your current Lustre 2.5.5, you have to do that manually; the precondition is that the orphan has a linkEA (parent FID + child name). To make the orphans temporarily visible, you can do the following:

1) mount the system as "lustre"

The difficult point is step 7): currently, there is no special tool for such parsing. Another point is that the parent directory in step 8) may have been deleted or not recovered by e2fsck, so step 8) may fail; in that case, you cannot know the orphan's position in the original namespace. |
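As an aside on the linkEA precondition above, one hedged way to spot-check whether an orphan carries a linkEA at all is to dump its "trusted.link" xattr from an ldiskfs mount of the MDT (path and inode number are placeholders; decoding the binary parent-FID-plus-name payload still has to be done by hand, as noted):
getfattr -e hex -n trusted.link /mnt/mdt_ldiskfs/lost+found/#123456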
| Comment by Joe Mervini [ 27/Mar/17 ] |
|
In terms of the OI scrub, I did not run it manually, since it appears it ran automatically with the lustre mount. The oi_scrub file shows that it ran for 2361 seconds with a status of completed. Are there any additional steps I should take? Right now I have the file system mounted, but I don't have any client connections, so I am free to do any additional work before putting it back into production. And thanks for the detailed information regarding the lost+found files. |
| Comment by Joe Mervini [ 27/Mar/17 ] |
|
Just an update: The file system has been released back to the users. Once again, thank you for the help. I have been able to examine a number of files and directories that were moved out of lost+found, and we are coming up with a procedure to at least identify and rename them. A concern and a question I have go back to what Andreas mentioned early in this ticket regarding LLNL's preferential treatment of shared blocks during fsck: am I correct in assuming that, since the MDT does not technically store data, the conditions and risks associated with Livermore's position would essentially be non-existent? |
| Comment by nasf (Inactive) [ 28/Mar/17 ] |
The OI scrub has been done automatically. You need to do nothing more for OI scrub. |
| Comment by Andreas Dilger [ 30/Mar/17 ] |
|
Joe, you are correct that the MDS does not store any file data, so there is no chance that shared blocks would result in file data being copied to the wrong file. |
| Comment by Joe Mervini [ 31/Mar/17 ] |
|
Thanks for the feedback and for all the help. We're good now. Feel free to close the ticket. |
| Comment by Minh Diep [ 31/Mar/17 ] |
|
Thanks |