[LU-7611] OSTs become "not healthy" Created: 24/Dec/15 Updated: 28/Mar/16 Resolved: 28/Mar/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Jesse Hanley | Assignee: | Jian Yu |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL 6.6 |
||
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We had a hard power outage early this morning. After hardware fixes, we were able to mount one of our file systems and it appears to be healthy. On the other file system, 5 OSTs have reported unhealthy. The file system mounted fine and clients connected, but soon after, I started receiving alerts. I'm in the process of rebooting the nodes so that I can e2fsck them.

Each node that has an unhealthy OST logs messages similar to the following:

[ 2962.001970] LustreError: 24264:0:(tgt_lastrcvd.c:583:tgt_client_new()) atlas2-OST00ca: Failed to write client lcd at idx 19, rc -30

Is an e2fsck the proper response in this type of situation? Thanks, |
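For readers triaging similar messages, the sketch below pulls the target name and return code out of the log line quoted above and maps the code to its errno name. It relies on the usual Linux kernel convention that a negative rc is a negated errno value; on Linux, 30 is EROFS (read-only file system). The helper name `decode_lustre_error` and the regular expression are illustrative assumptions, not part of Lustre, and reading EROFS as "the backing file system went read-only after the outage" is an inference the ticket itself does not state.

```python
import errno
import re

# Log line quoted in the ticket description.
LINE = (
    "[ 2962.001970] LustreError: 24264:0:(tgt_lastrcvd.c:583:tgt_client_new()) "
    "atlas2-OST00ca: Failed to write client lcd at idx 19, rc -30"
)

# Hypothetical pattern: a target such as "atlas2-OST00ca", then a trailing "rc <n>".
PATTERN = re.compile(r"(?P<target>\S+-OST[0-9a-fA-F]{4}): .*?rc (?P<rc>-?\d+)")


def decode_lustre_error(line):
    """Return (target, rc, errno_name), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group("rc"))
    # Kernel code returns negated errno values, so look up abs(rc).
    name = errno.errorcode.get(abs(rc), "UNKNOWN")
    return match.group("target"), rc, name


result = decode_lustre_error(LINE)
print(result)
```

Running this over a saved console log, one line at a time, gives a quick list of which targets are failing writes and why, before any e2fsck is scheduled.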
| Comments |
| Comment by Jian Yu [ 24/Dec/15 ] |
|
Yes, Jesse. Please refer to https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.html#dbdoclet.50438225_71141 . |
| Comment by Jesse Hanley [ 24/Dec/15 ] |
|
Thanks Jian! About to start the e2fsck runs now. |
| Comment by John Fuchs-Chesney (Inactive) [ 24/Dec/15 ] |
|
Thanks Jesse. I'm asking Jian to watch this ticket, for the time being. Please let us know how the e2fsck runs proceed for you. ~ jfc. |
| Comment by Jesse Hanley [ 21/Jan/16 ] |
|
Sorry for the delay on this - we managed to get the file systems back up. We ran e2fscks on all target OSTs during an outage and got the following output (I removed some of the superfluous and [QUOTA WARNING] lines):

atlas1-OST02e8: 1471186/29343744 files (7.2% non-contiguous), 2281519450/3755999232 blocks
Inode 5712358, i_size is 2097152, should be 2138112. Fix? no

atlas1-OST00aa: 1418001/29343744 files (7.1% non-contiguous), 2267986225/3755999232 blocks
Deleted inode 1184018 has zero dtime. Fix? no
Inode bitmap differences: -1184018 Fix? no

atlas1-OST0212: 1530414/29343744 files (7.2% non-contiguous), 2274401960/3755999232 blocks
Deleted inode 94387 has zero dtime. Fix? no
Inode bitmap differences: -94387 Fix? no

atlas1-OST0088: 1246831/29343744 files (7.4% non-contiguous), 2518723619/3755999232 blocks
[ERROR] quotaio_tree.c:590:check_reference:: Illegal reference (1527054912 >= 2690) in user quota file. Quota file is probably corrupted. Please run e2fsck (8) to fix it.
[ERROR] quotaio_tree.c:590:check_reference:: Illegal reference (20480 >= 2690) in user quota file. Quota file is probably corrupted. Please run e2fsck (8) to fix it.
[ERROR] quotaio_tree.c:590:check_reference:: Illegal reference (897567104 >= 2690) in user quota file. Quota file is probably corrupted. Please run e2fsck (8) to fix it.
[ERROR] quotaio_tree.c:590:check_reference:: Illegal reference (4096 >= 2690) in user quota file. Quota file is probably corrupted. Please run e2fsck (8) to fix it.

atlas2-OST033f: 1181549/29343744 files (1.5% non-contiguous), 2164774412/3755999232 blocks
Deleted inode 1791 has zero dtime. Fix? no
Block bitmap differences: -(77247232--77247743) Fix? no
Inode bitmap differences: -1791 Fix? no

atlas2-OST0228: 1277960/29343744 files (1.6% non-contiguous), 1921932726/3755999232 blocks
Deleted inode 4916172 has zero dtime. Fix? no
Inode bitmap differences: -4916172 Fix? no

Do you see anything alarming about this output, or should we be fine to run e2fsck (with -y or -p)?
Thanks, |
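When many targets are checked read-only, a short script can condense the e2fsck output above into one row per OST. The following is a minimal sketch, assuming each target's summary line comes first and its findings follow, as in the comment above. The function name `summarize`, the dictionary keys, and the regular expression are hypothetical; the sample text is copied from two of the targets in this ticket.

```python
import re

# Matches the per-target summary line as it appears in the comment above.
SUMMARY = re.compile(
    r"^(?P<target>\S+): (?P<iu>\d+)/(?P<it>\d+) files "
    r"\((?P<frag>[\d.]+)% non-contiguous\), (?P<bu>\d+)/(?P<bt>\d+) blocks$"
)

SAMPLE = """\
atlas1-OST02e8: 1471186/29343744 files (7.2% non-contiguous), 2281519450/3755999232 blocks
Inode 5712358, i_size is 2097152, should be 2138112. Fix? no
atlas2-OST033f: 1181549/29343744 files (1.5% non-contiguous), 2164774412/3755999232 blocks
Deleted inode 1791 has zero dtime. Fix? no
Block bitmap differences: -(77247232--77247743) Fix? no
Inode bitmap differences: -1791 Fix? no
"""


def summarize(text):
    """Group findings under the most recent summary line, keyed by target."""
    report = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = SUMMARY.match(line)
        if match:
            current = match.group("target")
            inodes_used, inodes_total = int(match.group("iu")), int(match.group("it"))
            blocks_used, blocks_total = int(match.group("bu")), int(match.group("bt"))
            report[current] = {
                "inode_pct": round(100 * inodes_used / inodes_total, 1),
                "block_pct": round(100 * blocks_used / blocks_total, 1),
                "unfixed": 0,  # problems the read-only run declined to fix
            }
        elif current is not None:
            # Each "Fix? no" marks one problem left unrepaired.
            report[current]["unfixed"] += line.count("Fix? no")
    return report


report = summarize(SAMPLE)
for target, stats in report.items():
    print(target, stats)
```

Lines that carry no "Fix? no" prompt, such as the quotaio_tree.c [ERROR] messages, are not counted by this sketch and would need their own rule.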
| Comment by Jian Yu [ 21/Jan/16 ] |
|
Hi Niu, Could you please take a look at the above outputs of e2fsck and advise? Thank you. |
| Comment by Niu Yawei (Inactive) [ 22/Jan/16 ] |
|
I think it should be fine to run e2fsck -y to repair the system. Please make sure to use the latest recommended e2fsprogs (1.42.13.wc4, I believe), which includes several recent defect fixes. |
| Comment by John Fuchs-Chesney (Inactive) [ 25/Mar/16 ] |
|
Hello Jesse, Do you need any more work done on this ticket? Many thanks, |
| Comment by Jesse Hanley [ 28/Mar/16 ] |
|
Hey John, We were able to complete the e2fsck runs without an issue. Everything appeared to be in a healthy state afterwards. Please feel free to resolve this, and thanks for the help and advice. |
| Comment by John Fuchs-Chesney (Inactive) [ 28/Mar/16 ] |
|
Thank you Jesse – glad everything worked out well. ~ jfc. |