[LU-4557] Negative used block number of OST after OSS crashes and reboots Created: 29/Jan/14 Updated: 12/Aug/14 Resolved: 29/Apr/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.2 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Li Xi (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ldiskfs, patch | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 12446 |
| Description |
|
During active I/O on the OSS (e.g. IOR from a client), if the OSS is reset (not unmounted, but e.g. a forced reset), then when the OSS comes back up and mounts all OSTs, it shows strange OST sizes like below.

[root@noss01 mount]# df -h -t lustre

It is easy to reproduce the problem. The script "run.sh" is able to reproduce it on a server named "server1" and a virtual machine named "vm1". After some investigation, we found some facts about this problem. After the problem happens, the OST file system is corrupted. Following is the fsck result.
[QUOTA WARNING] Usage inconsistent for ID 0:actual (1220608, 253) != expected (0, 32)
[QUOTA WARNING] Usage inconsistent for ID 0:actual (1220608, 253) != expected (0, 32)
server1-OST0002: ***** FILE SYSTEM WAS MODIFIED ***** |
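For illustration of the symptom in the title: df derives the "Used" column as total blocks minus free blocks, so a free-blocks counter that overshoots the total (e.g. because it was seeded from bitmaps that journal replay had not yet been applied to) pushes the used count negative. A minimal sketch of the arithmetic, with made-up numbers:

#include <stdio.h>

int main(void)
{
        /* hypothetical statfs results; f_bfree is over-counted */
        long long f_blocks = 3145728;   /* total blocks on the OST   */
        long long f_bfree  = 3200000;   /* inflated free-block count */

        /* df prints "Used" as total - free; here it goes negative */
        printf("used blocks: %lld\n", f_blocks - f_bfree);
        return 0;
}
 |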
| Comments |
| Comment by Li Xi (Inactive) [ 29/Jan/14 ] |
|
Here is a patch which fixes the problem by pre-mounting/unmounting ldiskfs before the OSS starts. Though that patch helps to fix the problem, I don't think it is a perfect solution. There might be better ways to fix this problem. Any ideas? Thanks! |
| Comment by Peter Jones [ 29/Jan/14 ] |
|
Hongchao, could you please comment on this patch? Thanks. Peter |
| Comment by Andreas Dilger [ 29/Jan/14 ] |
|
I think that the mount/unmount is not a proper fix for the problem. We need to understand what is actually going wrong and fix that. The free blocks/inodes values stored in the superblock should never be used directly by the kernel code, since they are not kept up-to-date. Instead, there are percpu counters that are loaded from the bitmaps at mount time and kept updated when blocks are allocated or freed. |
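For reference, a sketch of how the statfs path consumes these counters (modeled loosely on the 2.6.32-era ext4_statfs(); an illustration, not the exact source):

static int ext4_statfs_sketch(struct super_block *sb, struct kstatfs *buf)
{
        struct ext4_sb_info *sbi = EXT4_SB(sb);

        /* free space is reported from the runtime percpu counters,
         * which were seeded from the block/inode bitmaps at mount time */
        buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter) -
                       percpu_counter_sum_positive(&sbi->s_dirtyblocks_counter);
        buf->f_ffree = percpu_counter_sum_positive(&sbi->s_freeinodes_counter);

        /* so if the counters were seeded before journal replay, every
         * statfs afterwards keeps reporting the stale values */
        return 0;
}
 |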
| Comment by Shuichi Ihara (Inactive) [ 31/Jan/14 ] |
|
BTW, we hit this problem in a real situation at a customer site. In order to reproduce it in our lab, we used VMs with the reproducer script that Li Xi posted. |
| Comment by Li Xi (Inactive) [ 06/Feb/14 ] |
|
Yeah, Andreas, I agree on that. I am wondering why the inconsistent free block/inode numbers in the superblock cause further problems on the Lustre OSS. That is strange to me, because the numbers are not used directly. |
| Comment by Hongchao Zhang [ 10/Feb/14 ] |
|
I have tested with two different kernels, 2.6.32-279.2.1 and 2.6.32-358.23.2, and both have the problem in ext4/ldiskfs. The reproducer is to reset the system during active I/O (say, dd), then delete/truncate some files after rebooting the system. More work is needed to check ext4/ldiskfs more deeply to see where the problem is. |
| Comment by Hongchao Zhang [ 10/Feb/14 ] |
|
By printing more debug info while mounting the ldiskfs device, we found that the output of fsck is incomplete:

fsck from util-linux-ng 2.17.2
e2fsck 1.42.6.wc2 (10-Dec-2012)
lustre-OST0002: recovering journal
Setting free blocks count to 2318165 (was 2153045)
lustre-OST0002: clean, 192/184320 files, 827563/3145728 blocks

The debug output:

different free blocks(1): stored = 3176 (counted 2918)
different free blocks(3): stored = 766 (counted 3072)
different free blocks(4): stored = 2048 (counted 0)
different free blocks(5): stored = 1024 (counted 3072)
different free blocks(6): stored = 2048 (counted 0)
different free blocks(9): stored = 1024 (counted 3072)
different free blocks(15): stored = 0 (counted 2048)
different free blocks(19): stored = 0 (counted 32768)
different free blocks(20): stored = 2048 (counted 32768)
different free blocks(21): stored = 2048 (counted 32768)
different free blocks(22): stored = 0 (counted 32768)
different free blocks(23): stored = 0 (counted 32768)
different free blocks(24): stored = 0 (counted 32768)
different free blocks(25): stored = 3072 (counted 31744)
different free blocks(26): stored = 2048 (counted 32768)
different free blocks(27): stored = 3072 (counted 31744)
different free blocks(28): stored = 10496 (counted 32768)

The free blocks counts have become inconsistent in various block groups. |
| Comment by Hongchao Zhang [ 13/Feb/14 ] |
|
This problem is in ext4, which initializes ext4_sb_info->s_freeblocks_counter and ext4_sb_info->s_freeinodes_counter before loading the journal. By moving the following code to after the journal has been loaded, the issue in ext4 is fixed:

err = percpu_counter_init(&sbi->s_freeblocks_counter,
ext4_count_free_blocks(sb));
if (!err) {
err = percpu_counter_init(&sbi->s_freeinodes_counter,
ext4_count_free_inodes(sb));
}
if (!err) {
err = percpu_counter_init(&sbi->s_dirs_counter,
ext4_count_dirs(sb));
}
if (!err) {
err = percpu_counter_init(&sbi->s_dirtyblocks_counter, 0);
}
if (err) {
ext4_msg(sb, KERN_ERR, "insufficient memory");
goto failed_mount4;
}
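For context, a rough sketch of the reordering inside ext4_fill_super() (function names from the 2.6.32 code; the exact placement in the real patch may differ):

/* Problematic order: counters are seeded from bitmaps that journal
 * replay has not yet been applied to.
 *
 *     percpu_counter_init(&sbi->s_freeblocks_counter,
 *                         ext4_count_free_blocks(sb));  <- stale bitmaps
 *     ext4_load_journal(sb, es, journal_devnum);        <- replay fixes bitmaps
 *
 * Fixed order: replay the journal first, then seed the counters from
 * the now-consistent bitmaps.
 *
 *     ext4_load_journal(sb, es, journal_devnum);
 *     percpu_counter_init(&sbi->s_freeblocks_counter,
 *                         ext4_count_free_blocks(sb));  <- correct counts
 */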
This also fixes the problem in Lustre. |
| Comment by Li Xi (Inactive) [ 13/Feb/14 ] |
|
Hi Hongchao, thank you very much for investigating this! Would you please share your Lustre patch which fixes this problem? I'd like to check the result too. |
| Comment by Hongchao Zhang [ 14/Feb/14 ] |
|
The patch is under test; I will push it to Gerrit soon. |
| Comment by Hongchao Zhang [ 14/Feb/14 ] |
|
the initial patch is tracked at http://review.whamcloud.com/#/c/9277/ |
| Comment by Li Xi (Inactive) [ 14/Feb/14 ] |
|
Hi Hongchao, I've checked that your patch works perfectly to fix this problem. Thanks! |
| Comment by James A Simmons [ 14/Feb/14 ] |
|
Is this a problem for SLES11 SP3 as well? |
| Comment by Hongchao Zhang [ 24/Feb/14 ] |
|
SLES11 SP3 uses ext3 by default, and ext4 can only be used in read-only mode. |
| Comment by Bob Glossman (Inactive) [ 25/Feb/14 ] |
|
The problem of ext4 being read-only in SLES has been fixed in our builds for months. See http://review.whamcloud.com/8335. |
| Comment by Andreas Dilger [ 07/Mar/14 ] |
|
Looking at http://review.whamcloud.com/9277 more closely, along with the upstream kernel, it seems that this patch is NOT needed for SLES11 SP2, even though the code appears to be the same as RHEL6. There were two patches applied to the upstream kernel: v2.6.34-rc7-16-g84061e0, which was almost the same as 9277, and v2.6.37-rc1-3-gce7e010, which mostly reverted it and loaded the percpu counters both before and after journal replay. It isn't yet clear why the ce7e010 patch was landed, but the net result is that we should delete sles11sp2/ext4-init-statfs-after-journal.patch and remove it from the sles11sp2 and sles11sp3 series files.

There is also a subtle defect in the 9277 patch: if ldiskfs is ever mounted with "-o nojournal", the initialization of the percpu counters will be skipped. We don't ever run Lustre in that mode, so it isn't seen during our testing.

The correct approach would probably be to replace the current rhel6.3/ext4-init-statfs-after-journal.patch with copies of the upstream commits 84061e0 and ce7e010, so that when RHEL6 backports these fixes it will be clear that our patch is no longer needed. Otherwise, our patch does not conflict when both of those patches are applied. |
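For reference, a sketch of the net upstream behavior after 84061e0 + ce7e010 as described above (an approximation, not the literal commits): the counters are initialized unconditionally before the journal code runs, so even a "-o nojournal" mount gets valid counters, and they are then re-derived after replay:

/* early in ext4_fill_super(): seed the counters so they are valid
 * even when the filesystem is mounted without a journal */
err = percpu_counter_init(&sbi->s_freeblocks_counter,
                          ext4_count_free_blocks(sb));

/* ... journal load/replay happens here ... */

/* after replay the bitmaps are consistent again, so reset the
 * counters to freshly recomputed values */
percpu_counter_set(&sbi->s_freeblocks_counter,
                   ext4_count_free_blocks(sb));
percpu_counter_set(&sbi->s_freeinodes_counter,
                   ext4_count_free_inodes(sb));
percpu_counter_set(&sbi->s_dirs_counter,
                   ext4_count_dirs(sb));
 |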
| Comment by Hongchao Zhang [ 08/Apr/14 ] |
|
RHEL6 has backported commits 84061e0 and ce7e010 in 2.6.32-431.5.1, and our patch (http://review.whamcloud.com/#/c/9277/) had already landed on master; the reverting patch is at http://review.whamcloud.com/#/c/9908/ |
| Comment by Hongchao Zhang [ 11/Apr/14 ] |
|
the patch is updated |
| Comment by Shuichi Ihara (Inactive) [ 11/Apr/14 ] |
|
backport patch for b2_5 http://review.whamcloud.com/9933 |
| Comment by Peter Jones [ 28/Apr/14 ] |
|
The latest patch landed to 2.6. Do I understand correctly that, due to the recent kernel updates, no changes are needed to any maintenance release branches and so this ticket can now be marked as resolved? |
| Comment by Hongchao Zhang [ 29/Apr/14 ] |
|
Yes, it can be closed now. |
| Comment by Peter Jones [ 29/Apr/14 ] |
|
Thanks Hongchao |