
Negative used block number of OST after OSS crashes and reboots

Details


    Description

      During active I/O on the OSS (e.g. IOR from a client), if the OSS is reset (not unmounted, but a forced reset), then when the OSS comes back up and mounts all OSTs, it shows strange OST sizes like below.

      [root@noss01 mount]# df -h -t lustre
      Filesystem Size Used Avail Use% Mounted on
      /dev/mapper/OST00 22T -17G 22T 0% /mnt/lustre/OST00
      /dev/mapper/OST01 22T -19G 22T 0% /mnt/lustre/OST01
      /dev/mapper/OST02 22T -17G 22T 0% /mnt/lustre/OST02
      /dev/mapper/OST03 22T -19G 22T 0% /mnt/lustre/OST03
      /dev/mapper/OST04 22T -17G 22T 0% /mnt/lustre/OST04

      It is easy to reproduce the problem. The script "run.sh" is able to reproduce the problem on a server named "server1" and a virtual machine named "vm1".
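      For reference, a rough reproduction sketch along these lines might look like the following; the device and mount point names are placeholders, and the forced reset is simulated with the sysrq trigger instead of an actual power cut.

      # on the OSS: mount the OST, then crash the node while client I/O (e.g. IOR) is in flight
      mount -t lustre /dev/sdb3 /mnt/lustre/OST02
      echo b > /proc/sysrq-trigger          # immediate reboot, no sync or umount

      # after the OSS comes back up: remount the OST and check the usage
      mount -t lustre /dev/sdb3 /mnt/lustre/OST02
      df -h -t lustre                       # "Used" shows a negative value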

      After some investigation, we found some facts about this problem. After the problem happens, the OST file system is corrupted. The following is the fsck result.
      ===============================================================================

      # fsck -y /dev/sdb3
        fsck from util-linux-ng 2.17.2
        e2fsck 1.42.7.wc1 (12-Apr-2013)
        server1-OST0002 contains a file system with errors, check forced.
        Pass 1: Checking inodes, blocks, and sizes
        Pass 2: Checking directory structure
        Pass 3: Checking directory connectivity
        Pass 4: Checking reference counts
        Pass 5: Checking group summary information
        Free blocks count wrong (560315, counted=490939).
        Fix? yes

      [QUOTA WARNING] Usage inconsistent for ID 0:actual (1220608, 253) != expected (0, 32)
      Update quota info for quota type 0? yes

      [QUOTA WARNING] Usage inconsistent for ID 0:actual (1220608, 253) != expected (0, 32)
      Update quota info for quota type 1? yes

      server1-OST0002: ***** FILE SYSTEM WAS MODIFIED *****
      server1-OST0002: 262/131648 files (0.4% non-contiguous), 35189/526128 blocks
      ===============================================================================
      Second, after the OSS crashes and before the OST is mounted again, fsck shows that the free inode/block counts in the superblock are wrong. That by itself is not a big problem, since fsck is able to fix it easily. Somehow Lustre makes the problem bigger if this small inconsistency is not fixed first.
      ===============================================================================
      [root@vm1 ~]# fsck -n /dev/sdb3
      fsck from util-linux-ng 2.17.2
      e2fsck 1.42.7.wc1 (12-Apr-2013)
      Warning: skipping journal recovery because doing a read-only filesystem check.
      server1-OST0002: clean, 13/131648 files, 34900/526128 blocks
      [root@vm1 ~]# fsck /dev/sdb3
      fsck from util-linux-ng 2.17.2
      e2fsck 1.42.7.wc1 (12-Apr-2013)
      server1-OST0002: recovering journal
      Setting free inodes count to 131387 (was 131635)
      Setting free blocks count to 420283 (was 491228)
      server1-OST0002: clean, 261/131648 files, 105845/526128 blocks
      ===============================================================================
      What's more, after the OSS crashes and before the OST is mounted again, we have two ways to prevent the problem from happening: run fsck on that OST, or mount/umount that OST using ldiskfs, as sketched below.
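      For example, either of the following before the lustre mount avoids the negative usage (the device name and temporary mount point are placeholders):

      # option 1: fsck the OST so the journal is replayed and the superblock counts are corrected
      fsck /dev/sdb3

      # option 2: mount and umount the OST once as ldiskfs before starting the OST service
      mount -t ldiskfs /dev/sdb3 /mnt/tmp
      umount /mnt/tmp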
      We also found that this problem is not reproducible on Lustre versions before commit 6a6561972406043efe41ae43b64fd278f360a4b9, simply because versions before that commit did a pre-mount/umount before starting the OST service.

      Attachments

        Activity

          [LU-4557] Negative used block number of OST after OSS crashes and reboots
          pjones Peter Jones added a comment -

          Thanks Hongchao


          hongchao.zhang Hongchao Zhang added a comment -

          Yes, it can be closed now.
          pjones Peter Jones added a comment -

          The latest patch landed to 2.6. Do I understand correctly that, due to the recent kernel updates, no changes are needed to any maintenance release branches and so this ticket can now be marked as resolved?

          ihara Shuichi Ihara (Inactive) added a comment -

          backport patch for b2_5: http://review.whamcloud.com/9933

          hongchao.zhang Hongchao Zhang added a comment -

          The patch is updated.

          hongchao.zhang Hongchao Zhang added a comment -

          RHEL6 has backported the commits 84061e0 and ce7e010 in 2.6.32-431.5.1, and our patch (http://review.whamcloud.com/#/c/9277/) has already landed on master,
          so that patch needs to be reverted.

          The reverting patch is at http://review.whamcloud.com/#/c/9908/

          adilger Andreas Dilger added a comment -

          Looking at http://review.whamcloud.com/9277 more closely, along with the upstream kernel, it seems that this patch is NOT needed for SLES11SP2, even though the code there appears to be the same as RHEL6. There were two patches applied to the upstream kernel - v2.6.34-rc7-16-g84061e0, which was almost the same as 9277, and v2.6.37-rc1-3-gce7e010, which mostly reverted it and loads the percpu counters both before and after journal replay. It isn't yet clear why the ce7e010 patch was landed, but the net result is that we should delete the sles11sp2/ext4-init-statfs-after-journal.patch and remove it from the sles11sp2 and sles11sp3 series files.

          There is also a subtle defect in the 9277 patch: if ldiskfs is ever mounted with "-o nojournal", the initialization of the percpu counters will be skipped. We never run Lustre in that mode, so it isn't seen during our testing. The correct approach would probably be to replace the current rhel6.3/ext4-init-statfs-after-journal.patch with copies of the upstream commits 84061e0 and ce7e010, so that when RHEL6 backports these fixes it will be clear that our patch is no longer needed. Otherwise, our patch does not conflict when both of those patches are applied.
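          Roughly, that removal would look like the following; the paths and series file names are illustrative of the usual ldiskfs kernel_patches layout, not exact:

          cd ldiskfs/kernel_patches
          git rm patches/sles11sp2/ext4-init-statfs-after-journal.patch
          # drop the entry from the sles11sp2 and sles11sp3 series files
          sed -i '/ext4-init-statfs-after-journal.patch/d' series/*sles11sp2*.series series/*sles11sp3*.series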


          bogl Bob Glossman (Inactive) added a comment -

          The problem of ext4 being readonly in SLES has been fixed in our builds for months. See http://review.whamcloud.com/8335, LU-4276.

          hongchao.zhang Hongchao Zhang added a comment -

          SLES11 SP3 uses ext3 by default, and ext4 can only be used in read-only mode,
          and this problem exists according to the ext4 code (kernel version: 3.0.76-0.11.1).

          simmonsja James A Simmons added a comment -

          Is this a problem for SLES11 SP3 as well?

          People

            hongchao.zhang Hongchao Zhang
            lixi Li Xi (Inactive)
            Votes: 0
            Watchers: 7
