Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
    • Labels: None
    • Environment: Sun hardware running mdadm
    • Bugzilla ID: 4120

    Description

      We have an OST on one of our scratch file systems that was deactivated and attempts to reactivate it failed with:

      [204919.753933] Lustre: scratch1-MDT0000: scratch1-OST0001_UUID now active, resetting orphans
      [204919.753939] Lustre: Skipped 1 previous similar message
      [204919.754155] LustreError: 10403:0:(osc_create.c:589:osc_create()) scratch1-OST0001-osc: oscc recovery failed: -22
      [204919.754166] Lustre: scratch1-OST0001_UUID: Failed to clear orphan objects on OST: -22
      [204919.754170] Lustre: scratch1-OST0001_UUID: Sync failed deactivating: rc -22

      First: Is lfsck the proper tool to recover from this error?

      Second: Since this is the first time that I have ever used lfsck I am not sure what to expect. In this particular case there are 32 OSTs on this file system. Following the examples in the manual, I started a read-only run against all OSTs Saturday night, and by the following morning, from what I could judge, it had only gotten through 11 of the 32 OSTs (if it runs sequentially). It reported LOTS of zero-length orphaned inodes. Unfortunately in this case I didn't pipe the output to a log file, so the information regarding the OST (2/32) that couldn't be reactivated was lost when I ran out of line buffer space. So I stopped the run and restarted it against only that OST. Because it has been running for many hours listing user files, I am guessing the answer is no, but is this a valid execution option, or is it required that all OSTs be accounted for?

      Third: The db files were generated while the file system was quiesced, after e2fsck had been run cleanly against all targets. Must lfsck be run in an offline mode, or can it be run while the file system is serving clients? Because of my inexperience with lfsck I don't know what to expect in terms of the duration of the run, and since I am using the -n option I will need to run it again with corrective options. Also, when running with corrective options, is -l -c the preferred method?

      Attachments

        Issue Links

          Activity

            [LU-2018] Questions about using lfsck
            adilger Andreas Dilger made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            adilger Andreas Dilger added a comment -

            The LAST_ID reconstruction described here is now handled automatically by the OSS and LFSCK.
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-14 [ LU-14 ]
            rread Robert Read made changes -
            Severity Original: 3 [ 10022 ]
            rread Robert Read made changes -
            Severity New: 3 [ 10022 ]
            pjones Peter Jones made changes -
            Issue Type Original: Story [ 6 ] New: Bug [ 1 ]
            pjones Peter Jones made changes -
            Assignee Original: Niu Yawei [ niu ] New: Cliff White [ cliffw ]
            pjones Peter Jones added a comment -

            Reassigning to Cliff to see what improvements can be made to the manual


            adilger Andreas Dilger added a comment -

            Joe, your commentary here is right on the mark. You are correct that these steps are not adequately described for someone "not skilled in the art". I take these kinds of things for granted, but not many people are as closely familiar with the code and on-disk formats as I am.

            It definitely makes sense to update the user manual in this case with a better description of the required steps, as you have documented here.

            Also, it would be good to have Lustre handle this case more transparently than it does today. This is partially addressed with the patch in https://bugzilla.lustre.org/show_bug.cgi?id=24128, but this needs to be refreshed for the latest Lustre code and landed to a release branch.
            jamervi Joe Mervini added a comment -

            So following Andreas' latest comment I was able to understand what I was seeing and successfully edited the lov_objid file on the MDS to get the OST active again. The problem has been fixed and the file system is operational again.

            For the benefit of anyone reading this who may have had the same problems understanding the documentation in the manual regarding binary edits (Appendix B: How to fix bad LAST_ID on an OST), I am detailing it here with the specifics of my case as an example. (Note that this is for editing the MDT and not the OST, so some of the information provided in the manual doesn't necessarily apply.)

            The ids that were identified as being out of sync for the OST were based on error information in dmesg on the deactivated OST:

            [204924.149414] LustreError: 17598:0:(filter.c:3173:filter_handle_precreate()) scratch1-OST0001: ignoring bogus orphan destroy request: obdid 14540258 last_id 14563595

            In the above, the obdid 14540258 represents what the MDS had recorded as the last_id, and the last_id 14563595 is what the OST reported. As Andreas mentioned above, the difference between those two numbers is greater than 23000 (only by 337 - but enough to cause the problem!)
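            The arithmetic behind that gap can be checked directly in the shell (values copied from the dmesg line above):

```shell
# Values from the dmesg message above
mds_last_id=14540258   # obdid: the last_id recorded on the MDS
ost_last_id=14563595   # last_id: what the OST itself reports
echo $(( ost_last_id - mds_last_id ))   # prints 23337
```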

            The first step I took was to mount the MDT as type ldiskfs:

            [root@rmoss-scratch1 /root]# mount -t ldiskfs /dev/md1 /lustre/md1

            I then ran the od command in Step 1 to get the objids on the MDT:

            [root@rmmds-scratch1 CONFIGS]# od -Ax -td8 /lustre/md1/lov_objid
            000000 15685909 14540258
            000010 15932110 14947247
            000020 14515004 14128711
            000030 15000526 15162675
            000040 13640425 14099966
            000050 14681958 14342756
            000060 15165350 14397848
            000070 14549423 14439112
            000080 14908468 14520235
            000090 14317909 15447697
            0000a0 14506040 14566356
            0000b0 14878948 14560476
            0000c0 14593685 14742015
            0000d0 14934824 13734107
            0000e0 14365307 14647258
            0000f0 14255774 14431566
            000100

            The thing that confused me initially was that although I could see the number 14540258 on the first line, I didn't understand the offset (and I guess I still have a question about it: Does each line represent an OSS?)
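            For what it's worth, my understanding (an assumption on my part, not something the manual states): lov_objid is simply an array of 8-byte little-endian integers, one per OST, in OST index order. Since "od -Ax -td8" prints 16 bytes (two values) per line, each line covers two OSTs - not one OSS. The byte offset of a given OST's entry is its index times 8:

```shell
# lov_objid holds one 8-byte little-endian value per OST, in index order,
# so OST index i starts at byte offset i*8; each od line (16 bytes)
# therefore shows two OSTs.
ost_index=1   # scratch1-OST0001
printf '0x%06x\n' $(( ost_index * 8 ))   # prints 0x000008
```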

            Then I mounted the affected OST ldiskfs:

            [root@rmoss-scratch1 scratch1-OST0001]# mount -t ldiskfs /dev/md2 /lustre/md2

            I then ran Step 2. I don't really understand the purpose of the command and it is not clear whether it is supposed to be run on the MDS or the OSS since the documentation instructs you to mount the OST in Step 4. In any event the output was meaningless to me. (It would be nice if someone could explain it with an example.)

            Then following the instructions in the manual I ran the debugfs command (Step 3) which gave me the last_id of the OST that was consistent with the dmesg entry:

            [root@rmoss-scratch1 scratch1-OST0001]# debugfs.ldiskfs -c -R 'dump O/0/LAST_ID /tmp/LAST_ID' /dev/md2;od -Ax -td8 /tmp/LAST_ID
            debugfs.ldiskfs 1.41.10.sun2-4chaos (23-Jun-2010)
            /dev/md2: catastrophic mode - not reading inode or group bitmaps
            000000 14563595
            000008

            I then ran Step 4:

            [root@rmoss-scratch1 scratch1-OST0001]# mount -t ldiskfs /dev/md2 /lustre/md2

            [root@rmoss-scratch1 scratch1-OST0001]# ls -1s /lustre/md2/
            CONFIGS/ health_check last_rcvd lost+found/ lquota.group lquota.user O/

            [root@rmoss-scratch1 scratch1-OST0001]# ls -1s /lustre/md2/O/0/d*|grep -v [a-z]|sort -k2 -n > /tmp/obj.md2
            (The above command takes some time to run.)

            [root@rmoss-scratch1 ~]# tail -30 /tmp/obj.md2
            0 14563566
            0 14563567
            0 14563568
            0 14563569
            0 14563570
            0 14563571
            0 14563572
            0 14563573
            0 14563574
            0 14563575
            0 14563576
            0 14563577
            0 14563578
            0 14563579
            0 14563580
            0 14563581
            0 14563582
            0 14563583
            0 14563584
            0 14563585
            0 14563586
            0 14563587
            0 14563588
            0 14563589
            0 14563590
            0 14563591
            0 14563592
            0 14563593
            0 14563594
            0 14563595
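            That comparison can also be scripted. A minimal sketch (the sample file here is hypothetical; on a real system the input would be the /tmp/obj.md2 listing generated above):

```shell
# Simulated check that LAST_ID matches the highest-numbered object.
# /tmp/obj.sample stands in for the real /tmp/obj.md2 listing.
printf '0 14563594\n0 14563595\n' > /tmp/obj.sample
highest=$(tail -1 /tmp/obj.sample | awk '{print $2}')
last_id=14563595   # from the debugfs LAST_ID dump above
[ "$highest" -eq "$last_id" ] && echo "LAST_ID matches highest object"
```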

            As pointed out in the manual, the value of the last_id matched the existing objects, confirming that the problem was on the MDS and that it could be resolved by removing the lov_objid file. However, given Andreas' comment that removing the lov_objid file on the MDS is a little more risky, I opted to edit the file on the MDT.

            One thing that is very poorly documented in the manual is the purpose of the hex-to-decimal translation, and in my opinion it should be placed in the section where the editing is actually discussed.

            The last_id is represented as a decimal number when that value is obtained via the "od -Ax -td8" command. However, when you convert the file from binary to ASCII, the contents are in hex. (This is not explained at all.)

            I copied the lov_objid file from the MDT to /tmp as described in the manual. Below is the output without redirection to an output file.

            [root@rmmds-scratch1 tmp]# xxd lov_objid
            0000000: 1559 ef00 0000 0000 e2dd dd00 0000 0000 .Y..............
            0000010: ce1a f300 0000 0000 af13 e400 0000 0000 ................
            0000020: 3c7b dd00 0000 0000 4796 d700 0000 0000 <{......G.......
            0000030: cee3 e400 0000 0000 335d e700 0000 0000 ........3]......
            0000040: e922 d000 0000 0000 fe25 d700 0000 0000 .".......%......
            0000050: 6607 e000 0000 0000 64da da00 0000 0000 f.......d.......
            0000060: a667 e700 0000 0000 98b1 db00 0000 0000 .g..............
            0000070: af01 de00 0000 0000 c852 dc00 0000 0000 .........R......
            0000080: 347c e300 0000 0000 ab8f dd00 0000 0000 4|..............
            0000090: 5579 da00 0000 0000 91b6 eb00 0000 0000 Uy..............
            00000a0: 3858 dd00 0000 0000 d443 de00 0000 0000 8X.......C......
            00000b0: e408 e300 0000 0000 dc2c de00 0000 0000 .........,......
            00000c0: 95ae de00 0000 0000 fff1 e000 0000 0000 ................
            00000d0: 28e3 e300 0000 0000 db90 d100 0000 0000 (...............
            00000e0: 7b32 db00 0000 0000 da7f df00 0000 0000 {2..............
            00000f0: 9e86 d900 0000 0000 4e35 dc00 0000 0000 ........N5......

            Given that the last_id the MDS thought was correct for the deactivated OST was 14540258, I first wanted to make sure I knew what I was looking for. So I took 14540258 and converted it to hex.

            [root@rmmds-scratch1 tmp]# echo "obase=16; 14540258"|bc
            DDDDE2
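            The same conversion can be done in either direction with plain bash arithmetic, as an alternative to bc:

```shell
printf '%X\n' 14540258    # decimal -> hex: prints DDDDE2
echo $(( 16#DDDDE2 ))     # hex -> decimal: prints 14540258
```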

            This is the line that represents the first and second OSTs in the system:
            0000000: 1559 ef00 0000 0000 e2dd dd00 0000 0000 .Y..............

            But if you notice, there is a byte swap that is not mentioned in the manual: the values are stored little-endian, so the DDDDE2 number generated by the bc command must be converted to "e2dd dd00 0000 0000" to be properly inserted into the file.
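            A small pipeline can produce that byte-swapped (little-endian) string mechanically. This is my own sketch, not something from the manual:

```shell
# Print a value as 8 little-endian bytes, grouped the way xxd shows them.
val=14540258
printf '%016x\n' "$val" | fold -w2 | tac | tr -d '\n' \
  | sed 's/..../& /g; s/ $//'
# prints: e2dd dd00 0000 0000
```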

            I then converted the correct last_id number to HEX:
            [root@rmmds-scratch1 tmp]# echo "obase=16; 14563595"|bc
            DE390B

            I then created the ASCII file by running "xxd lov_objid lov_objid.asc" and edited it as described in the manual, taking extra care to do the byte swap to "0b39 de00 0000 0000", followed by recreating the binary file with "xxd -r lov_objid.asc lov_objid.new".

            [root@rmmds-scratch1 tmp]# vi /tmp/lov_objid.asc
            <snip>
            0000000: 1559 ef00 0000 0000 0b39 de00 0000 0000 .Y..............
            0000010: ce1a f300 0000 0000 af13 e400 0000 0000 ................
            </snip>
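            As an alternative to hand-editing in vi, the substitution can be scripted. A sketch, assuming the same file names and the exact byte strings worked out above:

```shell
# Scripted equivalent of the manual's vi edit: swap the old little-endian
# last_id for the corrected one in the hex dump, then rebuild the binary.
xxd lov_objid > lov_objid.asc
sed -i 's/e2dd dd00 0000 0000/0b39 de00 0000 0000/' lov_objid.asc
xxd -r lov_objid.asc lov_objid.new
```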

            I confirmed that the file was what I expected:

            [root@rmmds-scratch1 tmp]# od -Ax -td8 /tmp/lov_objid.new
            000000 15685909 14563595
            000010 15932110 14947247
            000020 14515004 14128711
            000030 15000526 15162675
            000040 13640425 14099966
            000050 14681958 14342756
            000060 15165350 14397848
            000070 14549423 14439112
            000080 14908468 14520235
            000090 14317909 15447697
            0000a0 14506040 14566356
            0000b0 14878948 14560476
            0000c0 14593685 14742015
            0000d0 14934824 13734107
            0000e0 14365307 14647258
            0000f0 14255774 14431566
            000100

            I then moved the new file into place, unmounted the targets on both the MDS and OSS, mounted only those 2 devices as type lustre, and confirmed that the OST came online and was activated. Once that was done I remounted the rest of the OSTs and am presently running lfsck.


            People

              cliffw Cliff White (Inactive)
              jamervi Joe Mervini
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: