Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 1.8.x (1.8.0 - 1.8.5)
    • None
    • Sun hardware running mdadm
    • 4120

    Description

      We have an OST on one of our scratch file systems that was deactivated and attempts to reactivate it failed with:

      [204919.753933] Lustre: scratch1-MDT0000: scratch1-OST0001_UUID now active, resetting orphans
      [204919.753939] Lustre: Skipped 1 previous similar message
      [204919.754155] LustreError: 10403:0:(osc_create.c:589:osc_create()) scratch1-OST0001-osc: oscc recovery failed: -22
      [204919.754166] Lustre: scratch1-OST0001_UUID: Failed to clear orphan objects on OST: -22
      [204919.754170] Lustre: scratch1-OST0001_UUID: Sync failed deactivating: rc -22

      First: Is lfsck the proper tool to recover from this error?

      Second: Since this is the first time that I have ever used lfsck, I am not sure what to expect. In this particular case there are 32 OSTs on this file system. Following the examples in the manual, I started a read-only run against all OSTs Saturday night, and by the following morning, as far as I could judge, it had only gotten through 11 of the 32 OSTs (if it runs sequentially). It reported LOTS of zero-length orphaned inodes. Unfortunately I didn't pipe the output to a log file, so the information regarding the OST (2/32) that couldn't be reactivated was lost when I ran out of line buffer space. So I stopped the run and restarted it against only that OST. Because it has been running and listing user files for many hours I am guessing the answer is no, but is this a valid execution option, or is it required that all OSTs be accounted for?

      Third: The db files were generated when the file system was quiesced and after e2fsck was run against all targets cleanly. Must lfsck be run in an offline mode, or can it be run while the file system is serving clients? Because of my inexperience with lfsck I don't know what to expect in terms of the duration of the run, and since I am using the -n option I will need to run it again with corrective options. Also, when running with corrective options, is -l -c the preferred method?
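
      For reference, this is roughly the invocation pattern I am following from the manual (the db file names and the client mount point below are placeholders, and this assumes the databases were built beforehand with the Lustre-patched e2fsck --mdsdb/--ostdb options):

      # read-only pass, this time with the output captured to a log file:
      lfsck -n --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-OST0001 /mnt/lustre 2>&1 | tee /tmp/lfsck-ro.log
      # corrective pass once the read-only results look sane
      # (-l sends orphans to lost+found; -c is the other corrective option asked about above):
      lfsck -l -c --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-OST0001 /mnt/lustre 2>&1 | tee /tmp/lfsck-fix.log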

          Activity

            [LU-2018] Questions about using lfsck

            adilger Andreas Dilger added a comment -

            The LAST_ID reconstruction described here is now handled automatically by the OSS and LFSCK.
            pjones Peter Jones added a comment -

            Reassigning to Cliff to see what improvements can be made to the manual

            adilger Andreas Dilger added a comment -

            Joe, your commentary here is right on the mark. You are correct that these steps are not adequately described for someone "not skilled in the art". I take these kinds of things for granted, but not many people are as closely familiar with the code and on-disk formats as I am.

            It definitely makes sense to update the user manual in this case with a better description of the required steps, as you have documented here.

            Also, it would be good to have Lustre handle this case more transparently than it does today. This is partially addressed with the patch in https://bugzilla.lustre.org/show_bug.cgi?id=24128, but this needs to be refreshed for the latest Lustre code, and landed to a release branch.

            jamervi Joe Mervini added a comment -

            So following Andreas' latest comment I was able to understand what I was seeing and successfully edited the lov_objid file on the MDS to get the OST active again. The problem has been fixed and the file system is operational again.

            For the benefit of anyone reading this who may have had the same problems understanding the documentation in the manual regarding binary edits (Appendix B: How to fix bad LAST_ID on an OST), I am detailing it here with the specifics of my case as an example. (Note that this is for editing the MDT and not the OST, so some of the information provided in the manual doesn't necessarily apply.)

            The ids that were identified as being out of sync for the OST were based on error information in dmesg on the deactivated OST:

            [204924.149414] LustreError: 17598:0:(filter.c:3173:filter_handle_precreate()) scratch1-OST0001: ignoring bogus orphan destroy request: obdid 14540258 last_id 14563595

            In the above the obdid 14540258 represents what the MDS had recorded as the last_id, and the last_id 14563595 is what the OST reported. As Andreas mentioned above the difference between those two numbers is greater than 23000 (only by 337 - but enough to cause the problem!)
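
            For what it is worth, the exact gap between the two counters can be checked with simple shell arithmetic:

            # OST last_id minus MDS obdid, using the values from the log line above:
            echo $((14563595 - 14540258))
            23337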

            The first step I took was to mount the MDT as type ldiskfs:

            [root@rmoss-scratch1 /root]# mount -t ldiskfs /dev/md1 /lustre/md1

            I then ran the od command in Step 1 to get the objids on the MDT:

            [root@rmmds-scratch1 CONFIGS]# od -Ax -td8 /lustre/md1/lov_objid
            000000 15685909 14540258
            000010 15932110 14947247
            000020 14515004 14128711
            000030 15000526 15162675
            000040 13640425 14099966
            000050 14681958 14342756
            000060 15165350 14397848
            000070 14549423 14439112
            000080 14908468 14520235
            000090 14317909 15447697
            0000a0 14506040 14566356
            0000b0 14878948 14560476
            0000c0 14593685 14742015
            0000d0 14934824 13734107
            0000e0 14365307 14647258
            0000f0 14255774 14431566
            000100

            The thing that confused me initially was that although I could see the number 14540258 on the first line, I didn't understand the offset (and I guess I still have a question about it: Does each line represent an OSS?)
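
            (Going by Andreas' reply further down that OST0001 is the second value, at offset 08, the file seems to hold one 8-byte entry per OST index, so each od line covers two OSTs rather than one OSS. A minimal way to work out the byte offset of a given OST's slot:)

            # byte offset of an OST's entry in lov_objid = OST index * 8
            printf '0x%x\n' $((1 * 8))    # OST0001 -> 0x8
            printf '0x%x\n' $((2 * 8))    # OST0002 -> 0x10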

            Then I mounted the affected OST ldiskfs:

            [root@rmoss-scratch1 scratch1-OST0001]# mount -t ldiskfs /dev/md2 /lustre/md2

            I then ran Step 2. I don't really understand the purpose of the command and it is not clear whether it is supposed to be run on the MDS or the OSS since the documentation instructs you to mount the OST in Step 4. In any event the output was meaningless to me. (It would be nice if someone could explain it with an example.)

            Then following the instructions in the manual I ran the debugfs command (Step 3) which gave me the last_id of the OST that was consistent with the dmesg entry:

            [root@rmoss-scratch1 scratch1-OST0001]# debugfs.ldiskfs -c -R 'dump O/0/LAST_ID /tmp/LAST_ID' /dev/md2;od -Ax -td8 /tmp/LAST_ID
            debugfs.ldiskfs 1.41.10.sun2-4chaos (23-Jun-2010)
            /dev/md2: catastrophic mode - not reading inode or group bitmaps
            000000 14563595
            000008

            I then ran Step 4:

            [root@rmoss-scratch1 scratch1-OST0001]# mount -t ldiskfs /dev/md2 /lustre/md2

            [root@rmoss-scratch1 scratch1-OST0001]# ls -1s /lustre/md2/
            CONFIGS/ health_check last_rcvd lost+found/ lquota.group lquota.user O/

            [root@rmoss-scratch1 scratch1-OST0001]# ls -1s /lustre/md2/O/0/d*|grep -v [a-z]|sort -k2 -n > /tmp/obj.md2
            (The above command takes some time to run.)

            [root@rmoss-scratch1 ~]# tail -30 /tmp/obj.md2
            0 14563566
            0 14563567
            0 14563568
            0 14563569
            0 14563570
            0 14563571
            0 14563572
            0 14563573
            0 14563574
            0 14563575
            0 14563576
            0 14563577
            0 14563578
            0 14563579
            0 14563580
            0 14563581
            0 14563582
            0 14563583
            0 14563584
            0 14563585
            0 14563586
            0 14563587
            0 14563588
            0 14563589
            0 14563590
            0 14563591
            0 14563592
            0 14563593
            0 14563594
            0 14563595

            As pointed out in the manual, the value of the last_id matched the existing objects, confirming that the problem was on the MDS and that it could be resolved by removing the lov_objid file. However, given Andreas' comment that removing the lov_objid file on the MDS is a little more risky, I opted to edit the file on the MDT.
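
            (For a quick check, the same comparison can be reduced to a single number from the listing generated earlier; it should match the LAST_ID dumped with debugfs:)

            tail -1 /tmp/obj.md2 | awk '{print $2}'
            14563595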

            One thing that is very poorly documented in the manual is the purpose of the HEX to decimal translation, which in my opinion should be placed in the section where the editing is actually discussed.

            The last_id is represented as a decimal number when that value is obtained via the "od -Ax -td8" command. However, when you convert the file from binary to ASCII with xxd, the contents are in HEX. (This is not explained at all.)
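
            (To see the relationship concretely: the 8 bytes xxd shows below for OST0001 decode back to the decimal value that od prints, because the field is stored in little-endian byte order:)

            printf '\xe2\xdd\xdd\x00\x00\x00\x00\x00' | od -Ax -td8    # on these (little-endian) hosts this reads back as 14540258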

            I copied the lov_objid file from the MDT to /tmp as described in the manual. Below is the output without redirection to an output file.

            [root@rmmds-scratch1 tmp]# xxd lov_objid
            0000000: 1559 ef00 0000 0000 e2dd dd00 0000 0000 .Y..............
            0000010: ce1a f300 0000 0000 af13 e400 0000 0000 ................
            0000020: 3c7b dd00 0000 0000 4796 d700 0000 0000 <{......G.......
            0000030: cee3 e400 0000 0000 335d e700 0000 0000 ........3]......
            0000040: e922 d000 0000 0000 fe25 d700 0000 0000 .".......%......
            0000050: 6607 e000 0000 0000 64da da00 0000 0000 f.......d.......
            0000060: a667 e700 0000 0000 98b1 db00 0000 0000 .g..............
            0000070: af01 de00 0000 0000 c852 dc00 0000 0000 .........R......
            0000080: 347c e300 0000 0000 ab8f dd00 0000 0000 4|..............
            0000090: 5579 da00 0000 0000 91b6 eb00 0000 0000 Uy..............
            00000a0: 3858 dd00 0000 0000 d443 de00 0000 0000 8X.......C......
            00000b0: e408 e300 0000 0000 dc2c de00 0000 0000 .........,......
            00000c0: 95ae de00 0000 0000 fff1 e000 0000 0000 ................
            00000d0: 28e3 e300 0000 0000 db90 d100 0000 0000 (...............
            00000e0: 7b32 db00 0000 0000 da7f df00 0000 0000 {2..............
            00000f0: 9e86 d900 0000 0000 4e35 dc00 0000 0000 ........N5......

            Given that the last_id the MDS thought was correct for the deactivated OST was 14540258, I first wanted to make sure I knew what I was looking for. So I took 14540258 and converted it to HEX.

            [root@rmmds-scratch1 tmp]# echo "obase=16; 14540258"|bc
            DDDDE2

            This is the line that represents the first and second OSTs in the system:
            0000000: 1559 ef00 0000 0000 e2dd dd00 0000 0000 .Y..............

            But if you notice, there is a byte swap (the values are stored little-endian) that is not mentioned in the manual, so the DDDDE2 number that is generated by the bc command must be converted to E2DD DD00 0000 0000 to be properly inserted into the file.
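
            (One way to produce the byte-swapped form mechanically rather than by hand - the bytes are the same as in the xxd line above, just printed one per byte:)

            printf '%016x\n' 14540258 | fold -w2 | tac | paste -sd' '
            e2 dd dd 00 00 00 00 00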

            I then converted the correct last_id number to HEX:
            [root@rmmds-scratch1 tmp]# echo "obase=16; 14563595"|bc
            DE390B

            Then I created the ASCII file by running the command "xxd lov_objid lov_objid.asc" and edited it as described in the manual, taking extra care to do the byte swap to "0b39 de00 0000 0000", followed by recreating the binary file with "xxd -r lov_objid.asc lov_objid.new".

            [root@rmmds-scratch1 tmp]# vi /tmp/lov_objid.asc
            <snip>
            0000000: 1559 ef00 0000 0000 0b39 de00 0000 0000 .Y..............
            0000010: ce1a f300 0000 0000 af13 e400 0000 0000 ................
            </snip>
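
            (For anyone wary of hand-editing the hex dump, an equivalent - though untested here - way to produce the patched copy would be to overwrite OST0001's 8 bytes directly; the offset of 8 and the little-endian byte string are the same as in the edit above:)

            cp lov_objid lov_objid.new
            printf '\x0b\x39\xde\x00\x00\x00\x00\x00' | dd of=lov_objid.new bs=1 seek=8 count=8 conv=notrunc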

            I confirmed that the file was what I expected:

            [root@rmmds-scratch1 tmp]# od -Ax -td8 /tmp/lov_objid.new
            000000 15685909 14563595
            000010 15932110 14947247
            000020 14515004 14128711
            000030 15000526 15162675
            000040 13640425 14099966
            000050 14681958 14342756
            000060 15165350 14397848
            000070 14549423 14439112
            000080 14908468 14520235
            000090 14317909 15447697
            0000a0 14506040 14566356
            0000b0 14878948 14560476
            0000c0 14593685 14742015
            0000d0 14934824 13734107
            0000e0 14365307 14647258
            0000f0 14255774 14431566
            000100

            I then moved the new file into place, unmounted both ldiskfs targets (on the MDS and the OSS), then mounted only those 2 devices as type lustre and confirmed that the OST came online and was activated. Once that was done I remounted the rest of the OSTs and am presently running lfsck.

            adilger Andreas Dilger added a comment -

            Joe,
            Apologies for not replying sooner - I'm currently travelling.

            OST0001 is the second number displayed (14540258) at offset 08. Editing this value should not affect the ability to run lfsck later on.

            As for running lfsck afterward, this could also be done on the running system, so I'm reluctant to have you keep the system down longer than needed. I didn't know that the system was down this whole time - the inability of the MDS to complete recovery with the OST should not have affected the ability of clients to use the rest of the filesystem.

            pjones Peter Jones added a comment -

            Niu

            Could you please advise on this one?

            Thanks

            Peter

            jamervi Joe Mervini added a comment -

            I have taken down the file system and mounted both the MDT and the affected OST as ldiskfs. Running od on the MDT, this is the output:

            [root@rmmds-scratch1 CONFIGS]# od -Ax -td8 /lustre/md1/lov_objid
            000000 15685909 14540258
            000010 15932110 14947247
            000020 14515004 14128711
            000030 15000526 15162675
            000040 13640425 14099966
            000050 14681958 14342756
            000060 15165350 14397848
            000070 14549423 14439112
            000080 14908468 14520235
            000090 14317909 15447697
            0000a0 14506040 14566356
            0000b0 14878948 14560476
            0000c0 14593685 14742015
            0000d0 14934824 13734107
            0000e0 14365307 14647258
            0000f0 14255774 14431566
            000100

            Granted, I don't really understand what I'm looking at, but 000010 (if that is scratch1-OST0001) doesn't line up with any of the numbers above.

            With that in mind, I really don't know how to proceed. Guidance would definitely be appreciated.

            jamervi Joe Mervini added a comment -

            Also, WRT the binary edit, the manual discusses modifying the data on an OST. Do I just apply the same principles to the MDT?

            jamervi Joe Mervini added a comment -

            I think that I would opt for the less risky approach. If I do the binary edit and later run lfsck, would that clear the orphaned objects? I am not particularly concerned if they wind up in lost+found when I use the -l option, just as long as the file system is in a better state than it currently is, or was prior to this recent deactivation problem.

            But on the other hand, what is the risk of deleting the lov_objids file? I'm a little bit nervous about doing the binary edit and making a mistake. Just to be clear, if I do edit the lov_objids file I am only concerned with scratch1-OST0001, right? And in doing the edit I want to modify the value to be 14563595, correct?

            Also, as I mentioned in my previous comment, is there a benefit to running lfsck on the file system once I am able to activate the OST? Currently the file system is offline with only the backup MDS acting as a client for running lfsck. If there is something to be gained by running lfsck, we are willing to keep the file system unavailable until it completes if necessary.

            If you can respond this evening it would be greatly appreciated, so I can get these processes rolling overnight.

            Thanks.

            adilger Andreas Dilger added a comment -

            Joe, you are correct - the MDT and the OST have very different expectations of which object should be allocated next on this OST. The maximum object preallocation window is 10000, so these two numbers should not differ by more than that amount. If they do, then the OST refuses to destroy the large number of "orphan" objects, to avoid potentially erasing the whole OST due to a software bug or corruption.

            In this case, the difference is over 23000, though it doesn't appear that the two values are corrupted. Is it possible that the MDS was restored from backup at some point since this OST was taken offline? Alternately (this shouldn't happen) was the OST accidentally attached to some other Lustre filesystem?

            The fix for this is to either binary edit the MDT "lov_objids" file to have the same value as the OST (which will potentially leave some orphan objects on the OST), or (slightly more risky, but less effort) to delete the lov_objids file entirely and have the MDT recreate it from the values on each of the OSTs.
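
            As a quick sanity check of that rule against the numbers in this ticket (the 10000 figure being the preallocation window mentioned above):

            # prints 1, i.e. the gap is larger than the preallocation window,
            # so the OST refuses the orphan destroy:
            echo $(( (14563595 - 14540258) > 10000 ))
            1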


            People

              cliffw Cliff White (Inactive)
              jamervi Joe Mervini
              Votes:
              0
              Watchers:
              8
