[LU-3457] After power failure 1 OST failed to come up. Created: 12/Jun/13 Updated: 10/Jul/13 Resolved: 10/Jul/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Joe Mervini | Assignee: | Andreas Dilger |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
TOSS 2.0 Lustre 2.1, DDN SFA10k |
||
| Severity: | 3 |
| Rank (Obsolete): | 8642 |
| Description |
|
We experienced a power failure last night when wind slapped two high-tension lines together, and while restoring everything to working order one OST (luckily only one) would not mount. When I ran fsck on the file system, it put basically EVERYTHING in lost+found and complained loudly about multiply-claimed blocks.

When the fsck completed, I ran ll_recover_lost_found_objs, which restored all but ~84MB. After recreating the CONFIGS directory in the root and moving the CONFIGS files back into it, I re-ran fsck until no more fixes were made, and was then able to get the OST information via tunefs.lustre and mount the file system. In all, about a dozen inodes were affected, and I think all but maybe three were recovered. I think the missing files were the health check and quota files. (After bringing the file system back online, lfs quota -u <file system> failed, saying quotas weren't enabled. I am running lfs quotacheck now.)

In any event, these fsck messages are ones that I've never seen before, and the most concerning is the "boot loader inode" message. As I said, after the first fsck and mounting as ldiskfs there was only the lost+found directory.

Since we are still in recovery mode, I wanted to find out whether there is anything I am missing, or should have done or should still do with this particular OST, and whether there are any concerns I should be on the lookout for before returning the file system to production. Unfortunately this occurred on a system that prevents me from providing any logs (the output below is hand typed). TIA

####
File <The boot loader inode> (inode #5, mod time Mon Jun 10 13:32:09 2013)
clone_file_block: internal error: can't find dup_blk for 3357776535
clone_file_block: internal error: can't find dup_blk for 3357776535
File O/0/d28/6574428 (inode #1624725, mod time Mon Jun 10 18:23:10 2013) |
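The "re-ran fsck until no more fixes were made" step in the description can be sketched as a small loop. This is only an illustration, not the procedure actually used on this OST: the function names, the pass limit, and the device path in the usage note are assumptions. The exit status bits are the ones documented in the e2fsck man page, and `-f` (force a check) and `-y` (answer yes to every prompt) are standard e2fsck options.

```python
import subprocess
from typing import Callable

# Exit status bits documented in the e2fsck man page.
FSCK_OK = 0                 # no errors found
FSCK_CORRECTED = 1          # file system errors were corrected
FSCK_REBOOT = 2             # errors corrected, system should be rebooted
FSCK_UNCORRECTED = 4        # errors were left uncorrected
FSCK_OPERATIONAL_ERROR = 8  # this bit and higher ones mean e2fsck itself failed


def run_e2fsck(device: str) -> int:
    """Run one forced e2fsck pass on an unmounted device and return its exit status."""
    return subprocess.run(["e2fsck", "-f", "-y", device]).returncode


def fsck_until_clean(run_pass: Callable[[], int], max_passes: int = 5) -> int:
    """Repeat fsck passes until one reports no errors.

    Returns the number of passes that were needed. Raises RuntimeError if
    e2fsck itself fails or if the file system is still dirty after max_passes.
    """
    for pass_number in range(1, max_passes + 1):
        status = run_pass()
        if status == FSCK_OK:
            return pass_number
        if status >= FSCK_OPERATIONAL_ERROR:
            raise RuntimeError(f"e2fsck failed with exit status {status}")
        # Status bits 1, 2 or 4: something was fixed or left over, so run again.
    raise RuntimeError(f"file system still not clean after {max_passes} passes")
```

A call might look like `fsck_until_clean(lambda: run_e2fsck("/dev/mapper/ost0"))`, where the device path is hypothetical and the OST must not be mounted. Passing the runner in as a callable keeps the loop logic separate from the actual e2fsck invocation, so it can be exercised without touching a real device.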
| Comments |
| Comment by Peter Jones [ 12/Jun/13 ] |
|
Andreas, could you please advise Sandia? Thanks, Peter |
| Comment by Oleg Drokin [ 12/Jun/13 ] |
|
It seems the objects directory was damaged along with some inode tables, but I see a full path name in your hand-typed message, so based on that not everything should have gone into lost+found. In any case, if you got almost everything back with ll_recover_lost_found_objs, then you probably cannot do much more. Please note that there could still be content damage in files, which fsck cannot correct for obvious reasons. There is no easy way to find it, so I'd treat all objects on that OST as suspect. |
| Comment by Andreas Dilger [ 12/Jun/13 ] |
|
I agree with Oleg - ll_recover_lost_found_objs did its job and rebuilt the filesystem tree. It seems that the beginning of the filesystem must have been corrupted somehow, but I don't think any further recovery is possible at this point. The "boot loader inode" is just one of the reserved inodes at the start of the filesystem (#5), so it was overwritten by garbage along with the root directory (#2) and some of the object directories (which is what caused everything to end up in lost+found). |
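As an illustration of the tree that ll_recover_lost_found_objs rebuilds: the path in the hand-typed output, O/0/d28/6574428, matches the usual ldiskfs OST layout, in which an object lives under O/<sequence>/d<object ID mod 32>/<object ID>. The Python sketch below only reproduces that arithmetic. The function name and the default of 32 subdirectories are assumptions for illustration, not values read from this OST.

```python
def ost_object_path(object_id: int, sequence: int = 0, subdirs: int = 32) -> str:
    """Return the expected on-disk path of an OST object.

    Assumes the common ldiskfs layout in which objects are spread over
    `subdirs` hash directories named d0 .. d<subdirs - 1>, chosen by
    taking the object ID modulo `subdirs`.
    """
    return f"O/{sequence}/d{object_id % subdirs}/{object_id}"
```

For the object named in the description, `ost_object_path(6574428)` gives `O/0/d28/6574428`, since 6574428 mod 32 is 28.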
| Comment by Joe Mervini [ 10/Jul/13 ] |
|
Feel free to close this ticket. |
| Comment by Peter Jones [ 10/Jul/13 ] |
|
ok - thanks Joe! |