[LU-2018] Questions about using lfsck Created: 24/Sep/12 Updated: 24/Sep/15 Resolved: 24/Sep/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.x (1.8.0 - 1.8.5) |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Joe Mervini | Assignee: | Cliff White (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | Sun hardware running mdadm |
| Issue Links: | |
| Rank (Obsolete): | 4120 |
| Description |
|
We have an OST on one of our scratch file systems that was deactivated, and attempts to reactivate it failed with:

[204919.753933] Lustre: scratch1-MDT0000: scratch1-OST0001_UUID now active, resetting orphans

First: Is lfsck the proper tool to recover from this error?

Second: Since this is the first time I have ever used lfsck, I am not sure what to expect. In this particular case there are 32 OSTs on this file system. Following the examples in the manual, I started a read-only run against all OSTs Saturday night, and the following morning, as far as I could judge, it had only gotten through 11 of the 32 OSTs (if it runs sequentially). It reported LOTS of zero-length orphaned inodes. Unfortunately, in this case I didn't pipe the output to a log file, so the information regarding the OST (2/32) that couldn't be reactivated was lost when I ran out of line buffer space. So I stopped the run and restarted it against only that OST. Because it has been running, listing user files, for many hours, I am guessing the answer is no, but is this a valid execution option, or is it required that all OSTs be accounted for?

Third: The db files were generated while the file system was quiesced and after e2fsck had been run cleanly against all targets. Must lfsck be run in an offline mode, or can it be run while the file system is serving clients? Because of my inexperience with lfsck I don't know what to expect in terms of the duration of the run, and since I am using the -n option I will need to run it again with corrective options. Also, when running with corrective options, is -l -c the preferred method? |
| Comments |
| Comment by Joe Mervini [ 24/Sep/12 ] |
|
Just adding some output, since what is going into my log is not what is being seen on my console:

[root@rmmds-scratch2 ~]# lfsck -n -v --mdsdb /lustre-tmp/mdsdb --ostdb /lustre-tmp/scratch {1md1db,1md2db,2md3db,2md4db,3md1db,3md2db,4md3db,4md4db,5md1db,5md2db,6md3db,6md4db,7md1db,7md2db,8md3db,8md4db,9md1db,9md2db,10dm3db,10md4db,11md1db,11md2db,12md3db,12md4db,13md1db,13md2db,14md3db,14md4db,15md1db,15md2db,16md3db,16md4db} /scratch 2>&1 >> /lustre-tmp/lfsck.output-allosts

Log file:

[0] zero-length orphan objid 6050029 |
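One plausible explanation for the log/console mismatch is the redirection order in the command above: with "2>&1 >> file", stderr is duplicated to the terminal before stdout is redirected, so error output never reaches the log. A minimal sketch of capturing both streams, using the same paths as above (the ostdb argument is shown as a placeholder for the long list of db files):

# Redirect stdout to the log first, then point stderr at the same place, so
# that both the normal output and the error/orphan messages land in the file.
lfsck -n -v --mdsdb /lustre-tmp/mdsdb --ostdb <ostdb files...> /scratch \
    >> /lustre-tmp/lfsck.output-allosts 2>&1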
| Comment by Johann Lombardi (Inactive) [ 24/Sep/12 ] |
We usually try to fix problems manually first before running lfsck (which can be time-consuming). That said, it does seem that you have orphan files on the OSTs, which would have to be removed. Andreas, any thoughts on the e2fsck output? |
| Comment by Andreas Dilger [ 24/Sep/12 ] |
|
Joe, AFAIK it is not possible to run lfsck against only a single OST. If you omit the ostdb files from the other OSTs, lfsck (rightly or wrongly) will assume that those OSTs could not generate an ostdb and treat them as having been removed.

I would not recommend that you run lfsck until we determine what the problem is. I don't think lfsck will fix the -EINVAL problem you are seeing, but it isn't possible to know for sure what problem you are hitting without more information. If there are useful console error messages from the OST, please post them here. Alternately (or additionally), running with full debug during the attempted mount of OST0001 would be useful, so we can determine exactly where the -22 error is coming from.

Typical areas to check include whether, on the MDS, {{lctl get_param osc.{fsname}-OST0001-osc.next_id}} (I think - I can't check a system right now; essentially what is in the lov_objids file for this OST) roughly matches the LAST_ID on the OST itself (from {{lctl get_param obdfilter.{fsname}-OST0001.last_id}} on the OSS). These values do not have to be identical, but cannot be too far apart. If this isn't the problem, we'll have to dig further. |
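A minimal sketch of that comparison, using the filesystem name scratch1 from this ticket and the parameter names that appear later in the thread (exact parameter paths can vary between Lustre releases, so treat them as assumptions to verify on your own system):

# On the MDS: the next object id the MDT expects to allocate on OST0001
lctl get_param osc.scratch1-OST0001-osc.prealloc_next_id

# On the OSS serving OST0001: the last object id the OST has actually created
lctl get_param obdfilter.scratch1-OST0001.last_id

# The two values need not be identical, but should differ by less than the
# maximum preallocation window (10000); a larger gap makes the OST refuse the
# orphan cleanup, which is the failure seen here.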
| Comment by Andreas Dilger [ 24/Sep/12 ] |
|
Just to confirm: the inability of the MDS to start using this OST should not affect the ability of clients to read existing files from the OST. It would only prevent new files from being created there. |
| Comment by Joe Mervini [ 24/Sep/12 ] |
|
Andreas, here is the output from lctl get_param on the MDS and the affected OSS:

[root@rmmds-scratch1 ~]# lctl get_param osc.scratch1-OST0001-osc/prealloc_next_id

I'd say these two numbers are pretty far off.

[102794.862025] LDISKFS FS on md2, external journal on md12 |
| Comment by Joe Mervini [ 24/Sep/12 ] |
|
Is there any other information that I can provide to assist you in helping me resolve my activation problem? Beyond that, would it be wise/beneficial to run lfsck -i -c against the file system to clear the dangling and zero-length inodes? |
| Comment by Andreas Dilger [ 24/Sep/12 ] |
|
Joe, you are correct - the MDT and the OST have very different expectations of which object should be allocated next on this OST. The maximum object preallocation window is 10000, so these two numbers should not differ by more than that amount. If they do, the OST refuses to destroy the large number of "orphan" objects, to avoid potentially erasing the whole OST due to a software bug or corruption. In this case the difference is over 23000, though it doesn't appear that either of the two values is corrupted.

Is it possible that the MDS was restored from backup at some point since this OST was taken offline? Alternately (this shouldn't happen), was the OST accidentally attached to some other Lustre filesystem?

The fix for this is either to binary edit the MDT "lov_objids" file to have the same value as the OST (which will potentially leave some orphan objects on the OST), or (slightly more risky, but less effort) to delete the lov_objids file entirely and have the MDT recreate it from the values on each of the OSTs. |
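For reference, a minimal sketch of inspecting the MDT's copy of this file, using the device name and mount point that appear later in this ticket (the MDT must be mounted as type ldiskfs, with the filesystem stopped):

# Mount the MDT backing device as ldiskfs (example device/mount point)
mount -t ldiskfs /dev/md1 /lustre/md1

# Dump lov_objid: each 8-byte little-endian value is the last object id the
# MDT believes it allocated on one OST, stored in OST index order
od -Ax -td8 /lustre/md1/lov_objid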
| Comment by Joe Mervini [ 24/Sep/12 ] |
|
I think that I would opt for the less risky approach. If I do the binary edit and later run lfsck, would that clear the orphaned objects? I am not particularly concerned if they wind up in lost+found when I use the -l option, just as long as the file system is in a better state than it currently is, or was prior to this recent deactivation problem. But on the other hand, what is the risk with deleting the lov_objids? I'm a little bit nervous about doing the binary edit and making a mistake.

Just to be clear, if I do edit the lov_objid file, I am only concerned with scratch1-OST0001, right? And in doing the edit I want to modify the value to be 14563595, correct?

Also, as I mentioned in my previous comment, is there a benefit to running lfsck on the file system once I am able to activate the OST? Currently the file system is offline, with only the backup MDS acting as a client for running lfsck. If there is something to be gained by running lfsck, we are willing to keep the file system unavailable until it completes if necessary.

If you can respond this evening it would be greatly appreciated, so I can get these processes rolling overnight. Thanks. |
| Comment by Joe Mervini [ 24/Sep/12 ] |
|
Also, WRT the binary edit, the manual discusses modifying the data on an OST. Do I just apply the same principles to the MDT? |
| Comment by Joe Mervini [ 24/Sep/12 ] |
|
I have taken down the file system and mounted both the MDT and the affected OST as ldiskfs. Running od on the MDT, this is the output:

[root@rmmds-scratch1 CONFIGS]# od -Ax -td8 /lustre/md1/lov_objid

Granted, I don't really understand what I'm looking at, but 000010 (if that is scratch1-OST0001) doesn't line up with any of the numbers above. With that in mind, I really don't know how to proceed. Guidance would definitely be appreciated. |
| Comment by Peter Jones [ 24/Sep/12 ] |
|
Niu, could you please advise on this one? Thanks. Peter |
| Comment by Andreas Dilger [ 25/Sep/12 ] |
|
Joe, OST0001 is the second number displayed (14540258), at offset 08. Editing this value should not affect the ability to run lfsck later on.

As for running lfsck afterward, this could also be done on the running system, so I'm reluctant to have you keep the system down longer than needed. I didn't know that the system was down this whole time - the inability of the MDS to complete recovery with the OST should not have affected the ability of clients to use the rest of the filesystem. |
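To make the layout of the od output explicit (a sketch based on the offsets discussed above; the default od line width is what makes offset 000010 confusing):

# od -Ax -td8 shows hex file offsets and 8-byte signed decimal values.
# Each 8-byte slot in lov_objid belongs to one OST, so offset = OST index * 8:
#   offset 000000 -> OST0000, 000008 -> OST0001, 000010 -> OST0002, ...
# With the default 16-byte line width, each printed line therefore covers
# two OSTs, not one OSS.
od -Ax -td8 /lustre/md1/lov_objid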
| Comment by Joe Mervini [ 25/Sep/12 ] |
|
So, following Andreas' latest comment, I was able to understand what I was seeing and successfully edited the lov_objid file on the MDS to get the OST active again. The problem has been fixed and the file system is operational again.

For the benefit of anyone reading this who may have had the same problems understanding the documentation in the manual regarding binary edits (Appendix B: How to fix bad LAST_ID on an OST), I am detailing it here with the specifics of my case as an example. (Note that this is for editing the MDT and not the OST, so some of the information provided in the manual doesn't necessarily apply.)

The ids that were identified as being out of sync for the OST were based on error information in dmesg on the deactivated OST:

[204924.149414] LustreError: 17598:0:(filter.c:3173:filter_handle_precreate()) scratch1-OST0001: ignoring bogus orphan destroy request: obdid 14540258 last_id 14563595

In the above, obdid 14540258 represents what the MDS had recorded as the last_id, and last_id 14563595 is what the OST reported. As Andreas mentioned above, the difference between those two numbers is greater than 23000 (by only 337 - but enough to cause the problem!).

The first step I took was to mount the MDT as type ldiskfs:

[root@rmoss-scratch1 /root]# mount -t ldiskfs /dev/md1 /lustre/md1

I then ran the od command in Step 1 to get the objids on the MDT:

[root@rmmds-scratch1 CONFIGS]# od -Ax -td8 /lustre/md1/lov_objid

The thing that confused me initially was that although I could see the number 14540258 on the first line, I didn't understand the offset (and I guess I still have a question about it: does each line represent an OSS?).

Then I mounted the affected OST as ldiskfs:

[root@rmoss-scratch1 scratch1-OST0001]# mount -t ldiskfs /dev/md2 /lustre/md2

I then ran Step 2. I don't really understand the purpose of the command, and it is not clear whether it is supposed to be run on the MDS or the OSS, since the documentation instructs you to mount the OST in Step 4. In any event the output was meaningless to me. (It would be nice if someone could explain it with an example.)

Then, following the instructions in the manual, I ran the debugfs command (Step 3), which gave me the last_id of the OST, which was consistent with the dmesg entry:

[root@rmoss-scratch1 scratch1-OST0001]# debugfs.ldiskfs -c -R 'dump O/0/LAST_ID /tmp/LAST_ID' /dev/md2; od -Ax -td8 /tmp/LAST_ID

I then ran Step 4:

[root@rmoss-scratch1 scratch1-OST0001]# mount -t ldiskfs /dev/md2 /lustre/md2
[root@rmoss-scratch1 scratch1-OST0001]# ls -1s /lustre/md2/
[root@rmoss-scratch1 scratch1-OST0001]# ls -1s /lustre/md2/O/0/d*|grep -v [a-z]|sort -k2 -n > /tmp/obj.md2
[root@rmmds-scratch1 ~]# tail -30 /tmp/obj.md2

As pointed out in the manual, the value of the last_id matched the existing objects, confirming that the problem was on the MDS and that it could be resolved by removing the lov_objid file. However, given Andreas' comment that removing the lov_objid file on the MDS is a little more risky, I opted to edit the file on the MDT.

One thing that is very poorly documented in the manual is the purpose of the hex-to-decimal translation, which in my opinion should be placed in the section where the editing is actually discussed. The last_id is represented as a decimal number when that value is obtained via the "od -Ax -td8" command; however, when you convert the file from binary to ascii the contents are in hex. (This is not explained at all.)

I copied the lov_objid file from the MDT to /tmp as described in the manual. Below is the output without redirection to an output file:

[root@rmmds-scratch1 tmp]# xxd lov_objid

Given the last_id the MDS thought was correct for the deactivated OST (14540258), I first wanted to make sure I knew what I was looking for, so I took 14540258 and converted it to hex:

[root@rmmds-scratch1 tmp]# echo "obase=16; 14540258"|bc

This is the line that represents the first and second OSTs in the system:

But if you notice, there is a byte swap that is not mentioned in the manual, so the DDDDE2 number generated by the bc command must be converted to E2DD DD00 0000 0000 to be properly inserted into the file.

I then converted the correct last_id number to hex, then created the ascii file by running the command "xxd lov_objid lov_objid.asc" and edited it as described in the manual, taking extra care to do the byte swap to "0b39 de00 0000 0000", followed by recreating the binary file with "xxd -r lov_objid.asc lov_objid.new".

[root@rmmds-scratch1 tmp]# vi /tmp/lov_objid.asc

I confirmed that the file was what I expected:

[root@rmmds-scratch1 tmp]# od -Ax -td8 /tmp/lov_objid.new

I then moved the new file into place, unmounted both targets on the MDS and OSS, then mounted only those 2 devices as type lustre and confirmed that the OST came online and was activated. Once that was done I remounted the rest of the OSTs and am presently running lfsck. |
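For anyone repeating this, a condensed sketch of the edit sequence described above, assuming the same device names, mount points, and object ids as in this ticket (adapt them to your own system, keep a backup of lov_objid, and only do this with the filesystem stopped):

# Work on a copy of the MDT's lov_objid (MDT mounted as ldiskfs on /lustre/md1)
cp /lustre/md1/lov_objid /tmp/lov_objid
cp /lustre/md1/lov_objid /tmp/lov_objid.backup

# lov_objid stores one little-endian 64-bit id per OST, so the stale value
# 14540258 (0xDDDDE2) appears on disk as "e2dd dd00 0000 0000" in OST0001's
# slot at offset 0x08, and the OST's LAST_ID 14563595 (0xDE390B) must be
# written back as "0b39 de00 0000 0000"
od -Ax -td8 /tmp/lov_objid
echo "obase=16; 14563595" | bc

# Convert to an editable hex dump, change OST0001's eight bytes, convert back
xxd /tmp/lov_objid /tmp/lov_objid.asc
vi /tmp/lov_objid.asc
xxd -r /tmp/lov_objid.asc /tmp/lov_objid.new

# Verify the decimal value now matches the OST's LAST_ID, then install it
od -Ax -td8 /tmp/lov_objid.new
cp /tmp/lov_objid.new /lustre/md1/lov_objid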
| Comment by Andreas Dilger [ 25/Sep/12 ] |
|
Joe, your commentary here is right on the mark. You are correct that these steps are not adequately described for someone "not skilled in the art". I take these kinds of things for granted, but not many people are as closely familiar with the code and on-disk formats as I am. It definitely makes sense to update the user manual with a better description of the required steps, as you have documented here.

Also, it would be good to have Lustre handle this case more transparently than it does today. This is partially addressed by the patch in https://bugzilla.lustre.org/show_bug.cgi?id=24128, but that needs to be refreshed for the latest Lustre code and landed on a release branch. |
| Comment by Peter Jones [ 26/Sep/12 ] |
|
Reassigning to Cliff to see what improvements can be made to the manual |
| Comment by Andreas Dilger [ 24/Sep/15 ] |
|
The LAST_ID reconstruction described here is now handled automatically by the OSS and LFSCK. |
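For readers on newer releases, the online LFSCK that replaces this manual procedure is driven from the MDS. A sketch of a typical invocation, assuming a Lustre 2.x system with the same filesystem name as this ticket (option names and parameter paths may vary between releases, so verify against your version's documentation):

# Start the layout LFSCK, which cross-checks MDT layouts against OST objects
# and repairs inconsistent LAST_ID values on the OSTs
lctl lfsck_start -M scratch1-MDT0000 -t layout

# Watch progress and results
lctl get_param -n mdd.scratch1-MDT0000.lfsck_layout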