[LU-4557] Negative used block number of OST after OSS crashes and reboots Created: 29/Jan/14  Updated: 12/Aug/14  Resolved: 29/Apr/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: Lustre 2.6.0, Lustre 2.5.3

Type: Bug Priority: Major
Reporter: Li Xi (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: ldiskfs, patch

Attachments: File run.sh    
Severity: 3
Rank (Obsolete): 12446

 Description   

During active I/O on an OSS (e.g. IOR from a client), if the OSS is reset (not umounted, but forcibly reset) and then comes back up and mounts all of its OSTs, it shows strange OST sizes like below.

[root@noss01 mount]# df -h -t lustre
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/OST00 22T -17G 22T 0% /mnt/lustre/OST00
/dev/mapper/OST01 22T -19G 22T 0% /mnt/lustre/OST01
/dev/mapper/OST02 22T -17G 22T 0% /mnt/lustre/OST02
/dev/mapper/OST03 22T -19G 22T 0% /mnt/lustre/OST03
/dev/mapper/OST04 22T -17G 22T 0% /mnt/lustre/OST04

It is easy to reproduce the problem. The script "run.sh" is able to reproduce the problem on a server named "server1" and a virtual machine named "vm1".

After some investigation, we found some facts about this problem. First, after the problem happens, the OST file system is corrupted. The following is the fsck result.
===============================================================================

  # fsck -y /dev/sdb3
    fsck from util-linux-ng 2.17.2
    e2fsck 1.42.7.wc1 (12-Apr-2013)
    server1-OST0002 contains a file system with errors, check forced.
    Pass 1: Checking inodes, blocks, and sizes
    Pass 2: Checking directory structure
    Pass 3: Checking directory connectivity
    Pass 4: Checking reference counts
    Pass 5: Checking group summary information
    Free blocks count wrong (560315, counted=490939).
    Fix? yes

[QUOTA WARNING] Usage inconsistent for ID 0:actual (1220608, 253) != expected (0, 32)
Update quota info for quota type 0? yes

[QUOTA WARNING] Usage inconsistent for ID 0:actual (1220608, 253) != expected (0, 32)
Update quota info for quota type 1? yes

server1-OST0002: ***** FILE SYSTEM WAS MODIFIED *****
server1-OST0002: 262/131648 files (0.4% non-contiguous), 35189/526128 blocks
===============================================================================
Second, after the OSS crashes and before the OST is mounted again, fsck shows that the free inode/block counts in the superblock are wrong. That is not a big problem by itself, since fsck is able to fix it easily. Somehow, though, Lustre makes the problem bigger if this small inconsistency is not fixed.
===============================================================================
[root@vm1 ~]# fsck -n /dev/sdb3
fsck from util-linux-ng 2.17.2
e2fsck 1.42.7.wc1 (12-Apr-2013)
Warning: skipping journal recovery because doing a read-only filesystem check.
server1-OST0002: clean, 13/131648 files, 34900/526128 blocks
[root@vm1 ~]# fsck /dev/sdb3
fsck from util-linux-ng 2.17.2
e2fsck 1.42.7.wc1 (12-Apr-2013)
server1-OST0002: recovering journal
Setting free inodes count to 131387 (was 131635)
Setting free blocks count to 420283 (was 491228)
server1-OST0002: clean, 261/131648 files, 105845/526128 blocks
===============================================================================
What's more, after the OSS crashes and before the OST is mounted again, we have two ways to prevent the problem from happening: run fsck on that OST, or mount/umount that OST as ldiskfs.
We also found that this problem is not reproducible on Lustre versions before commit 6a6561972406043efe41ae43b64fd278f360a4b9, simply because versions before that commit do a pre-mount/umount before starting the OST service.



 Comments   
Comment by Li Xi (Inactive) [ 29/Jan/14 ]

Here is a patch which fixes the problem by pre-mount/umount ldiskfs before OSS starts.
http://review.whamcloud.com/9044

Though that patch helps to fix the problem, I don't think that is a perfect solution. There might be better ways to fix this problem. Any ideas? Thanks!

Comment by Peter Jones [ 29/Jan/14 ]

Hongchao

Could you please comment on this patch?

Thanks

Peter

Comment by Andreas Dilger [ 29/Jan/14 ]

I think that the mount/unmount is not a proper fix for the problem. We need to understand what is actually going wrong and fix that.

The free blocks/inodes values stored in the superblock should never be used directly by the kernel code, since they are not kept up-to-date. Instead, there are percpu counters that are loaded from the bitmaps at mount time and kept updated when blocks are allocated or freed.
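For reference, a trimmed sketch of where df's figures come from in the RHEL6-era ext4/ldiskfs (paraphrased from memory, not an exact quote of the kernel source). statfs reports free space from the percpu counters seeded at mount time, so a stale free-blocks field in the on-disk superblock is harmless by itself, but a counter seeded from stale bitmaps is not:

/* Paraphrased from the 2.6.32-era ext4_statfs(); not an exact quote. */
static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
{
        struct super_block *sb = dentry->d_sb;
        struct ext4_sb_info *sbi = EXT4_SB(sb);

        /* ... f_blocks, f_files, etc. elided ... */

        /* df's "free" (and therefore "used") is derived from the percpu
         * counters seeded at mount time, not from the free-blocks field
         * stored in the on-disk superblock. */
        buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter) -
                       percpu_counter_sum_positive(&sbi->s_dirtyblocks_counter);

        return 0;
}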

Comment by Shuichi Ihara (Inactive) [ 31/Jan/14 ]

BTW, we hit this problem in a real situation at a customer site.
For example, when the stonith process kills an OSS via IPMI and another OSS then mounts all of its OSTs for failover, all of the OST sizes are negative numbers, which is more critical.

In order to reproduce this problem in our lab, we used VMs and the reproducer script that Li Xi posted.

Comment by Li Xi (Inactive) [ 06/Feb/14 ]

Yeah, Andreas, I agree with that. I am wondering why the inconsistent free block/inode numbers in the superblock cause further problems for the Lustre OSS. That is strange to me because those numbers are not used directly.

Comment by Hongchao Zhang [ 10/Feb/14 ]

I have tested with two different kernels, 2.6.32-279.2.1 and 2.6.32-358.23.2, and both have the problem.
I also tested plain ext4, and it also shows the problem of a negative "Used" block count.

In ext4/ldiskfs, if the system is reset during active I/O (say, dd) and some files are deleted or truncated after the system reboots, the free block count can become larger than the total disk block count, which causes the negative "Used" value.
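To make the arithmetic concrete, here is a hypothetical illustration (plain C, reusing the figures from the fsck output in the description rather than any new measurement). df prints Used = total - free, so once the free count exceeds the total the result goes negative:

#include <stdio.h>

int main(void)
{
        /* Figures from the fsck output in the description: a 526128-block
         * filesystem whose stored free-blocks count was 560315. */
        long long f_blocks = 526128;   /* total blocks             */
        long long f_bfree  = 560315;   /* over-counted free blocks */

        /* df derives Used = total - free, so an over-counted free value
         * larger than the total shows up as a negative "Used". */
        printf("Used = %lld blocks\n", f_blocks - f_bfree);   /* -34187 */
        return 0;
}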

More work is needed to check ext4/ldiskfs more deeply and see where the problem is.

Comment by Hongchao Zhang [ 10/Feb/14 ]

By printing more debug info while mounting the ldiskfs device, it turns out that what fsck reports (and fixes) is incomplete.
The output of fsck:

fsck from util-linux-ng 2.17.2
e2fsck 1.42.6.wc2 (10-Dec-2012)
lustre-OST0002: recovering journal
Setting free blocks count to 2318165 (was 2153045)
lustre-OST0002: clean, 192/184320 files, 827563/3145728 blocks

the debug output:

different free blocks(1): stored = 3176 (counted 2918)
different free blocks(3): stored = 766 (counted 3072)
different free blocks(4): stored = 2048 (counted 0)
different free blocks(5): stored = 1024 (counted 3072)
different free blocks(6): stored = 2048 (counted 0)
different free blocks(9): stored = 1024 (counted 3072)
different free blocks(15): stored = 0 (counted 2048)
different free blocks(19): stored = 0 (counted 32768)
different free blocks(20): stored = 2048 (counted 32768)
different free blocks(21): stored = 2048 (counted 32768)
different free blocks(22): stored = 0 (counted 32768)
different free blocks(23): stored = 0 (counted 32768)
different free blocks(24): stored = 0 (counted 32768)
different free blocks(25): stored = 3072 (counted 31744)
different free blocks(26): stored = 2048 (counted 32768)
different free blocks(27): stored = 3072 (counted 31744)
different free blocks(28): stored = 10496 (counted 32768)

The free blocks counts are inconsistent across the various block groups.
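For anyone who wants to dump this kind of information themselves, here is a minimal sketch of a per-group debug loop (this is not the actual debug patch used above; it assumes the RHEL6 2.6.32 helpers ext4_get_group_desc() and ext4_free_blks_count()). The "counted" values in the output above come from re-scanning the corresponding on-disk block bitmaps, as e2fsck does:

/* Minimal sketch, not the debug patch used above: print the per-group
 * free-block count stored in each group descriptor.  Comparing it with a
 * count taken from the on-disk block bitmap gives "stored ... (counted ...)"
 * lines like the ones shown above. */
static void dump_group_free_blocks(struct super_block *sb)
{
        struct ext4_sb_info *sbi = EXT4_SB(sb);
        ext4_group_t i;

        for (i = 0; i < sbi->s_groups_count; i++) {
                struct ext4_group_desc *gdp =
                        ext4_get_group_desc(sb, i, NULL);

                if (gdp)
                        printk(KERN_DEBUG "group %u: stored free blocks = %u\n",
                               i, ext4_free_blks_count(sb, gdp));
        }
}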

Comment by Hongchao Zhang [ 13/Feb/14 ]

This problem is in ext4, which initializes ext4_sb_info->s_freeblocks_counter and ext4_sb_info->s_freeinodes_counter before loading the journal;
that is why mounting twice, as in patch http://review.whamcloud.com/9044, works around it.

By moving the following code (which seeds the in-memory percpu counters at mount time) to after the journal has been loaded, the issue in ext4 is fixed:

        err = percpu_counter_init(&sbi->s_freeblocks_counter,
                        ext4_count_free_blocks(sb));
        if (!err) {
                err = percpu_counter_init(&sbi->s_freeinodes_counter,
                                ext4_count_free_inodes(sb));
        }
        if (!err) {
                err = percpu_counter_init(&sbi->s_dirs_counter,
                                ext4_count_dirs(sb));
        }
        if (!err) {
                err = percpu_counter_init(&sbi->s_dirtyblocks_counter, 0);
        }
        if (err) {
                ext4_msg(sb, KERN_ERR, "insufficient memory");
                goto failed_mount4;
        }

the problem in Lustre is also fixed by it.
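For orientation, a simplified sketch of the reordering, assuming the RHEL6 2.6.32 ext4_fill_super() structure (paraphrased, not the literal 9277 patch; see the Gerrit change for the real diff):

/* Sketch of the idea only: replay the journal first, then seed the percpu
 * counters from the (now up-to-date) group descriptors and bitmaps. */
if (EXT4_HAS_COMPAT_FEATURE(sb, EXT4_FEATURE_COMPAT_HAS_JOURNAL)) {
        if (ext4_load_journal(sb, es, journal_devnum))
                goto failed_mount3;
}

err = percpu_counter_init(&sbi->s_freeblocks_counter,
                          ext4_count_free_blocks(sb));
/* ... s_freeinodes_counter, s_dirs_counter and s_dirtyblocks_counter follow,
 * as in the block quoted above ... */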

Comment by Li Xi (Inactive) [ 13/Feb/14 ]

Hi Hongchao,

Thank you very much for investigating this! Would you please share your Lustre patch that fixes this problem? I'd like to check the result too.

Comment by Hongchao Zhang [ 14/Feb/14 ]

The patch is under test; I will push it to Gerrit soon.

Comment by Hongchao Zhang [ 14/Feb/14 ]

the initial patch is tracked at http://review.whamcloud.com/#/c/9277/

Comment by Li Xi (Inactive) [ 14/Feb/14 ]

Hi Hongchao,

I've checked that your patch works perfectly to fix this problem. Thanks!

Comment by James A Simmons [ 14/Feb/14 ]

Is this a problem for SLES11 SP3 as well?

Comment by Hongchao Zhang [ 24/Feb/14 ]

SLES11 SP3 uses ext3 by default, and ext4 can only be used in read-only mode.
This problem does exist according to the ext4 codeline there (kernel version: 3.0.76-0.11.1).

Comment by Bob Glossman (Inactive) [ 25/Feb/14 ]

The problem of ext4 being readonly in SLES has been fixed in our builds for months. See http://review.whamcloud.com/8335, LU-4276.

Comment by Andreas Dilger [ 07/Mar/14 ]

Looking at http://review.whamcloud.com/9277 more closely, along with the upstream kernel, it seems that this patch is NOT needed for SLES11 SP2, even though the code there appears to be the same as RHEL6. There were two patches applied to the upstream kernel: v2.6.34-rc7-16-g84061e0, which was almost the same as 9277, and v2.6.37-rc1-3-gce7e010, which mostly reverted it and loaded the percpu counters both before and after journal replay. It isn't yet clear why the ce7e010 patch was landed, but the net result is that we should delete the sles11sp2/ext4-init-statfs-after-journal.patch and remove it from the sles11sp2 and sles11sp3 series files.

There is also a subtle defect in the 9277 patch, since if ldiskfs is ever mounted with "-o nojournal" the initialization of the percpu counters will be skipped. We don't ever run Lustre in that mode, so it isn't seen during our testing. The correct approach would probably be to replace the current rhel6.3/ext4-init-statfs-after-journal.patch with copies of the upstream commits 84061e0 and ce7e010, so that when RHEL6 backports these fixes it will be clear that our patch is no longer needed. Otherwise, our patch does not conflict when both of those patches are applied.
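For comparison, a rough sketch of the upstream approach described above (84061e0 followed by ce7e010), paraphrased from memory rather than quoted: the counters are initialized early, so even a "-o nojournal" mount gets valid counters, and they are then re-seeded with percpu_counter_set() once journal replay may have changed the on-disk bitmaps:

/* Paraphrased sketch, not an exact quote of commits 84061e0/ce7e010. */

/* 1) Initialize the counters before any journal handling, so that a
 *    "-o nojournal" mount still gets valid counters. */
err = percpu_counter_init(&sbi->s_freeblocks_counter,
                          ext4_count_free_blocks(sb));
/* ... s_freeinodes_counter, s_dirs_counter, s_dirtyblocks_counter ... */

/* 2) After ext4_load_journal() has replayed the journal, re-seed them,
 *    since replay can change the group descriptors and bitmaps. */
percpu_counter_set(&sbi->s_freeblocks_counter, ext4_count_free_blocks(sb));
percpu_counter_set(&sbi->s_freeinodes_counter, ext4_count_free_inodes(sb));
percpu_counter_set(&sbi->s_dirs_counter, ext4_count_dirs(sb));
percpu_counter_set(&sbi->s_dirtyblocks_counter, 0);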

Comment by Hongchao Zhang [ 08/Apr/14 ]

RHEL6 has backported commits 84061e0 and ce7e010 in 2.6.32-431.5.1, and our patch (http://review.whamcloud.com/#/c/9277/) has already landed on master,
so that patch needs to be reverted.

the reverting patch is at http://review.whamcloud.com/#/c/9908/

Comment by Hongchao Zhang [ 11/Apr/14 ]

the patch is updated

Comment by Shuichi Ihara (Inactive) [ 11/Apr/14 ]

backport patch for b2_5 http://review.whamcloud.com/9933

Comment by Peter Jones [ 28/Apr/14 ]

The latest patch landed to 2.6. Do I understand correctly that, due to the recent kernel updates, no changes are needed to any maintenance release branches and so this ticket can now be marked as resolved?

Comment by Hongchao Zhang [ 29/Apr/14 ]

Yes, it can be closed now.

Comment by Peter Jones [ 29/Apr/14 ]

Thanks Hongchao
