[LU-7611] OSTs become "not healthy" Created: 24/Dec/15  Updated: 28/Mar/16  Resolved: 28/Mar/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Jesse Hanley Assignee: Jian Yu
Resolution: Done Votes: 0
Labels: None
Environment:

RHEL 6.6
2.6.32-504.30.3.el6.atlas.x86_64


Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

We had a hard power outage early this morning. After hardware fixes, we were able to mount one of our file systems, and it appears to be healthy.

On the other file system, 5 OSTs have been reported as unhealthy. The file system mounted fine and clients connected, but soon afterwards I started receiving alerts. I'm in the process of rebooting the nodes so that I can run e2fsck on them. Each node with an unhealthy OST logs messages similar to the following:

[ 2962.001970] LustreError: 24264:0:(tgt_lastrcvd.c:583:tgt_client_new()) atlas2-OST00ca: Failed to write client lcd at idx 19, rc -30
[ 2962.029196] LustreError: 24264:0:(tgt_lastrcvd.c:583:tgt_client_new()) Skipped 140 previous similar messages
[ 3263.951324] LustreError: 24172:0:(ofd_obd.c:1365:ofd_create()) atlas2-OST00ca: unable to precreate: rc = -30
[ 3263.981018] LustreError: 24172:0:(ofd_obd.c:1365:ofd_create()) Skipped 61 previous similar messages
[ 3562.220443] LustreError: 24251:0:(tgt_lastrcvd.c:583:tgt_client_new()) atlas2-OST00ca: Failed to write client lcd at idx 19, rc -30
[ 3562.252970] LustreError: 24251:0:(tgt_lastrcvd.c:583:tgt_client_new()) Skipped 233 previous similar messages
[ 3874.190928] LustreError: 24281:0:(ofd_obd.c:1365:ofd_create()) atlas2-OST00ca: unable to precreate: rc = -30
[ 3874.219280] LustreError: 24281:0:(ofd_obd.c:1365:ofd_create()) Skipped 62 previous similar messages

Is an e2fsck the proper response in this type of situation?

Thanks,

Jesse
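
Note on the log excerpt above: the rc values in Lustre error messages are negated Linux errno codes, so rc -30 corresponds to errno 30. The following is a minimal sketch, assuming a Linux host and Python 3, for decoding such values from log lines. The helper name decode_rc and the regular expression are illustrative assumptions, not part of Lustre or of this ticket.

```python
import errno
import os
import re

# Matches the return code in both forms seen in the log above:
# "rc -30" and "rc = -30". (Illustrative pattern, not a Lustre format spec.)
RC_PATTERN = re.compile(r"\brc\s*=?\s*(-?\d+)")


def decode_rc(log_line):
    """Return (rc, errno_name, description) for a log line, or None.

    Assumes the rc value is a negated errno code of the host platform.
    """
    match = RC_PATTERN.search(log_line)
    if match is None:
        return None
    rc = int(match.group(1))
    code = abs(rc)
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return rc, name, os.strerror(code)


if __name__ == "__main__":
    line = ("LustreError: 24172:0:(ofd_obd.c:1365:ofd_create()) "
            "atlas2-OST00ca: unable to precreate: rc = -30")
    print(decode_rc(line))
```

On Linux, errno 30 is EROFS (read-only file system), which is consistent with the backing file system of the OST having been remounted read-only after detecting an error.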



 Comments   
Comment by Jian Yu [ 24/Dec/15 ]

Yes, Jesse. Please refer to https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.html#dbdoclet.50438225_71141 .

Comment by Jesse Hanley [ 24/Dec/15 ]

Thanks Jian! About to start the e2fsck runs now.

Comment by John Fuchs-Chesney (Inactive) [ 24/Dec/15 ]

Thanks Jesse. I'm asking Jian to watch this ticket, for the time being.

Please let us know how the e2fsck runs proceed for you.

~ jfc.

Comment by Jesse Hanley [ 21/Jan/16 ]

Sorry for the delay on this - we managed to get the file systems back up. We ran e2fscks on all target OSTs during an outage and got the following output (I removed some of the superfluous and [QUOTA WARNING] lines):

atlas1-OST02e8: 1471186/29343744 files (7.2% non-contiguous), 2281519450/3755999232 blocks
Inode 5712358, i_size is 2097152, should be 2138112.  Fix? no


atlas1-OST00aa: 1418001/29343744 files (7.1% non-contiguous), 2267986225/3755999232 blocks
Deleted inode 1184018 has zero dtime.  Fix? no

Inode bitmap differences:  -1184018
Fix? no


atlas1-OST0212: 1530414/29343744 files (7.2% non-contiguous), 2274401960/3755999232 blocks
Deleted inode 94387 has zero dtime.  Fix? no

Inode bitmap differences:  -94387
Fix? no


atlas1-OST0088: 1246831/29343744 files (7.4% non-contiguous), 2518723619/3755999232 blocks
[ERROR] quotaio_tree.c:590:check_reference:: Illegal reference (1527054912 >= 2690) in user quota file. Quota file is probably corrupted.
Please run e2fsck (8) to fix it.
[ERROR] quotaio_tree.c:590:check_reference:: Illegal reference (20480 >= 2690) in user quota file. Quota file is probably corrupted.
Please run e2fsck (8) to fix it.
[ERROR] quotaio_tree.c:590:check_reference:: Illegal reference (897567104 >= 2690) in user quota file. Quota file is probably corrupted.
Please run e2fsck (8) to fix it.
[ERROR] quotaio_tree.c:590:check_reference:: Illegal reference (4096 >= 2690) in user quota file. Quota file is probably corrupted.
Please run e2fsck (8) to fix it.


atlas2-OST033f: 1181549/29343744 files (1.5% non-contiguous), 2164774412/3755999232 blocks
Deleted inode 1791 has zero dtime.  Fix? no

Block bitmap differences:  -(77247232--77247743)
Fix? no

Inode bitmap differences:  -1791
Fix? no


atlas2-OST0228: 1277960/29343744 files (1.6% non-contiguous), 1921932726/3755999232 blocks
Deleted inode 4916172 has zero dtime.  Fix? no

Inode bitmap differences:  -4916172
Fix? no

Do you see anything alarming about this output, or should we be fine to run e2fsck (with -y or -p)?

Thanks,

Jesse
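
Note on the output above: each target's summary line reports used/total inodes and used/total blocks. The following is a minimal sketch, assuming the summary lines keep exactly the layout shown in this comment, for turning them into usage percentages. The function name parse_summary and the regular expression are illustrative assumptions, not an e2fsprogs interface.

```python
import re

# Layout taken from the summary lines quoted above, e.g.
# "atlas1-OST02e8: 1471186/29343744 files (7.2% non-contiguous), N/M blocks"
SUMMARY = re.compile(
    r"^(?P<target>\S+): (?P<inodes_used>\d+)/(?P<inodes_total>\d+) files "
    r"\((?P<noncontig>[\d.]+)% non-contiguous\), "
    r"(?P<blocks_used>\d+)/(?P<blocks_total>\d+) blocks$"
)


def parse_summary(line):
    """Parse one e2fsck summary line into a dict, or return None."""
    match = SUMMARY.match(line.strip())
    if match is None:
        return None
    inodes_used = int(match.group("inodes_used"))
    inodes_total = int(match.group("inodes_total"))
    blocks_used = int(match.group("blocks_used"))
    blocks_total = int(match.group("blocks_total"))
    return {
        "target": match.group("target"),
        "noncontiguous_pct": float(match.group("noncontig")),
        "inode_usage_pct": 100.0 * inodes_used / inodes_total,
        "block_usage_pct": 100.0 * blocks_used / blocks_total,
    }


if __name__ == "__main__":
    sample = ("atlas1-OST02e8: 1471186/29343744 files "
              "(7.2% non-contiguous), 2281519450/3755999232 blocks")
    print(parse_summary(sample))
```

For the atlas1-OST02e8 line, this works out to roughly 5% of inodes and about 61% of blocks in use.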

Comment by Jian Yu [ 21/Jan/16 ]

Hi Niu,

Could you please take a look at the above outputs of e2fsck and advise? Thank you.

Comment by Niu Yawei (Inactive) [ 22/Jan/16 ]

I think it should be fine to run e2fsck -y to repair the system. Please make sure to use the latest recommended e2fsprogs (1.42.13.wc4, I believe), which includes several recent defect fixes.
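
The sequence discussed in this thread (a read-only pass that answers "no" to every prompt, followed by a repair pass with -y) can be sketched as follows. This is a minimal illustration, assuming the target is unmounted. The device path is hypothetical, and the -f flag (force a check even if the file system is marked clean) is an addition not mentioned in the ticket. The exit-status meanings follow the e2fsck(8) man page, where the status is a sum of the listed conditions.

```python
# Exit-status bits documented in e2fsck(8); the status is their sum.
EXIT_BITS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def e2fsck_argv(device, repair=False):
    """Build the e2fsck command line for a read-only or repair pass.

    -n opens the file system read-only and answers "no" to all prompts;
    -y answers "yes" to all prompts. The commands are only built here,
    not executed.
    """
    mode = "-y" if repair else "-n"
    return ["e2fsck", "-f", mode, device]


def describe_exit(code):
    """Translate an e2fsck exit status into human-readable conditions."""
    if code == 0:
        return ["no errors"]
    return [text for bit, text in EXIT_BITS.items() if code & bit]


if __name__ == "__main__":
    device = "/dev/mapper/example-ost"  # hypothetical device path
    print(e2fsck_argv(device))               # read-only check first
    print(e2fsck_argv(device, repair=True))  # then the repair pass
    print(describe_exit(4))
```

Running the read-only pass first, as was done in this ticket, shows what e2fsck would change before any repair is committed.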

Comment by John Fuchs-Chesney (Inactive) [ 25/Mar/16 ]

Hello Jesse,

Do you need any more work done on this ticket?

Many thanks,
~ jfc.

Comment by Jesse Hanley [ 28/Mar/16 ]

Hey John,

We were able to complete the e2fsck runs without an issue. Everything appeared to be in a healthy state afterwards. Please feel free to resolve this, and thanks for the help and advice.


Jesse

Comment by John Fuchs-Chesney (Inactive) [ 28/Mar/16 ]

Thank you Jesse – glad everything worked out well.

~ jfc.
