
ZFS-MDT 100% full. Cannot delete files.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.4
    • Affects Version/s: Lustre 2.8.0
    • Environment: CentOS 6.8 2.6.32_504.30.3.el6.x86_64, Lustre 2.8.0 (g0bcd520), ZFS 0.6.5.4-1
    • Severity: 2

    Description

      End Customer: MSU (Michigan State Univ)

      A user generated tons of small files and exhausted the available inodes of the MDT (single MDT, no DNE). Any attempts at deleting files as root fail.

      I looked at LU-8787 and LU-8714, but neither seems to match this case closely enough.

      zdb -d ls15-mds-00.mdt/mdt
      Dataset ls15-mds-00.mdt/mdt [ZPL], ID 66, cr_txg 20442, 2.82T, 280362968 objects

      ls15-mds-00.mdt/mdt 2.82T 0 2.82T /ls15-mds-00.mdt/mdt

      [root@lac-373 roth]# lfs df -i
      UUID                  Inodes     IUsed IFree IUse% Mounted on
      ls15-MDT0000_UUID  280362968 280362968     0  100% /mnt/ls15[MDT:0]
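A monitoring check could catch this condition before the MDT reaches 100%. The following is only a sketch: it parses one data line in the `lfs df -i` layout shown above, and the helper names and the 5% threshold are assumptions made for illustration, not anything taken from Lustre itself.

```python
from typing import NamedTuple


class InodeUsage(NamedTuple):
    uuid: str
    inodes: int
    used: int
    free: int


def parse_lfs_df_inodes(line: str) -> InodeUsage:
    """Parse one data line of `lfs df -i` (UUID Inodes IUsed IFree ...)."""
    fields = line.split()
    return InodeUsage(fields[0], int(fields[1]), int(fields[2]), int(fields[3]))


def is_low_on_inodes(usage: InodeUsage, min_free_fraction: float = 0.05) -> bool:
    """True when the free-inode fraction is below the (hypothetical) threshold."""
    if usage.inodes == 0:
        return True
    return usage.free / usage.inodes < min_free_fraction


# Data line copied from the description above.
sample = "ls15-MDT0000_UUID 280362968 280362968 0 100% /mnt/ls15[MDT:0]"
usage = parse_lfs_df_inodes(sample)
print(usage.uuid, "low on inodes:", is_low_on_inodes(usage))
```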

      But we can't remove any files:

      [root@lac-000 1mk5_5998]# rm tor.mat
      rm: cannot remove `tor.mat': No space left on device

      I'm going to try deregistering the changelog, which might free enough space for the MDT to process some file deletions. If anyone has any other 'best practices', please advise.
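For readers unfamiliar with the changelog step, the sketch below only builds the `lctl` command lines involved and does not run them. The device name `ls15-MDT0000` is inferred from the UUID in the `lfs df -i` output, and the changelog user ID `cl1` is a hypothetical placeholder; the real ID has to come from the `changelog_users` listing. Whether deregistration succeeds on an MDT that is already full is not established in this ticket.

```python
from typing import List


def list_changelog_users_cmd() -> List[str]:
    """Command that lists registered changelog users on the MDS."""
    return ["lctl", "get_param", "mdd.*.changelog_users"]


def deregister_changelog_cmd(mdt_device: str, user_id: str) -> List[str]:
    """Command that deregisters one changelog user (e.g. 'cl1') on an MDT."""
    if not user_id.startswith("cl"):
        raise ValueError("changelog user IDs are expected to look like 'cl1'")
    return ["lctl", "--device", mdt_device, "changelog_deregister", user_id]


# Both names below are assumptions for illustration only.
print(" ".join(list_changelog_users_cmd()))
print(" ".join(deregister_changelog_cmd("ls15-MDT0000", "cl1")))
```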

          Activity

            [LU-8856] ZFS-MDT 100% full. Cannot delete files.

            gerrit Gerrit Updater added a comment -

            Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31443
            Subject: LU-8856 osd: mark specific transactions netfree
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: d47e3892f83d2d1bdb21653381a8d1ec0db68a4a

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31442/
            Subject: Revert "LU-8856 osd: mark specific transactions netfree"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c2caa40bd38e7645dc4ac90552e12e3fb7fde476

            pjones Peter Jones added a comment -

            Reopening due to LU-10732

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: https://review.whamcloud.com/31442
            Subject: Revert "LU-8856 osd: mark specific transactions netfree"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 83be173e2848e3b81d6fb2123d70d0cf614105a8

            pjones Peter Jones added a comment -

            Landed for 2.11

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26930/
            Subject: LU-8856 osd: mark specific transactions netfree
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8d1639b5cf1edbc885876956dcd6189173c00955

            ofaaland Olaf Faaland added a comment - edited

            We've encountered this at LLNL, too.

            For the benefit of other sites that end up looking at this ticket and have Lustre versions without Alex's patches, I'm working up a procedure which I'll put on wiki.lustre.org at http://wiki.lustre.org/ZFS_MDT_ENOSPC_Recovery. It will work on any ZFS >= 0.6.5 using spa_slop_shift mentioned by Alex, above.
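As background on why `spa_slop_shift` helps: ZFS holds back roughly 1/2^spa_slop_shift of the pool as "slop" space (the default shift of 5 reserves about 1/32, or 3.2%), so raising the shift temporarily shrinks that reserve and lets deletions proceed. The arithmetic can be sketched as below. This is a simplification that ignores any minimum slop size ZFS may enforce, and the 2.82 TiB pool size is an assumption borrowed from the dataset size in the description. The function names are illustrative, not ZFS APIs.

```python
def slop_bytes(pool_bytes: int, spa_slop_shift: int = 5) -> int:
    """Approximate slop reservation: pool size shifted right by spa_slop_shift."""
    return pool_bytes >> spa_slop_shift


def reclaimed_bytes(pool_bytes: int, old_shift: int, new_shift: int) -> int:
    """Space made usable by raising the shift from old_shift to new_shift."""
    return slop_bytes(pool_bytes, old_shift) - slop_bytes(pool_bytes, new_shift)


TIB = 1 << 40
GIB = 1 << 30
pool = int(2.82 * TIB)  # assumed pool size, for illustration only

for shift in (5, 6, 7):
    print(f"spa_slop_shift={shift}: ~{slop_bytes(pool, shift) / GIB:.1f} GiB reserved")
print(f"raising 5 -> 6 frees ~{reclaimed_bytes(pool, 5, 6) / GIB:.1f} GiB")
```

On a live system the parameter is exposed under `/sys/module/zfs/parameters/spa_slop_shift`; it should be restored to its original value once enough files have been deleted.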

            bzzz Alex Zhuravlev added a comment -

            Well, I guess we can mark any transaction originating from root with the netfree flag when a special tunable is set?
            If no space can be released, the admin comes in, sets that variable, and does whatever may help with their rights.

            adilger Andreas Dilger added a comment -

            I think in the "reserved with writes" case, since the admin needs to get involved they can hopefully fix the source of the problem that is consuming all the free space (e.g. stale ChangeLog consumer registered) when they delete the emergency file.

            bzzz Alex Zhuravlev added a comment -

            I think this is true for "reserved with writes" as well: changelogs/destroy logs can be quite big, so once that reserved space is released we'll just keep consuming it?
            Correct me if I'm wrong, but I don't really see a big difference.

            adilger Andreas Dilger added a comment -

            I think the two approaches are complementary. We can use the reserved space file for now, and use the "netfree" functionality when it is available.

            The main question about "netfree" is whether this is actually true when we delete an inode on the MDT with ChangeLogs enabled? Even if the dnode is deleted, it may not actually release space (due to shared dnode blocks) and the added ChangeLog record will consume space.

            As a result, even if the netfree functionality is available I think it makes sense to keep the emergency space reservation file around. If we never need to delete it then that is fine too, the amount of space consumed is minimal.
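The emergency reservation file idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the path, the size, and the helper names are hypothetical, and the thread does not specify where on the MDT such a file would live or how large it should be. Random data is written rather than zeros because a zero-filled file may occupy almost no space on a ZFS dataset with compression enabled.

```python
import os
import tempfile


def create_reserve_file(path: str, size_bytes: int, chunk: int = 1 << 20) -> None:
    """Write size_bytes of incompressible data so the space is really consumed."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(os.urandom(n))
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # make sure the blocks are actually allocated


def release_reserve_file(path: str) -> None:
    """Delete the reserve file to give the space back in an emergency."""
    os.remove(path)


# Demonstration in a temporary directory with a tiny, hypothetical size.
with tempfile.TemporaryDirectory() as tmp:
    reserve = os.path.join(tmp, "emergency_reserve")
    create_reserve_file(reserve, 4096)
    print("reserved bytes:", os.path.getsize(reserve))
    release_reserve_file(reserve)
    print("still present:", os.path.exists(reserve))
```

As noted in the comment above, deleting the file only buys time; the admin still has to fix whatever was consuming the space, such as a stale ChangeLog consumer.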

            People

              bzzz Alex Zhuravlev
              aeonjeffj Jeff Johnson (Inactive)