LU-8856: ZFS-MDT 100% full. Cannot delete files.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.4
    • Affects Version/s: Lustre 2.8.0
    • Environment: CentOS 6.8 2.6.32_504.30.3.el6.x86_64, Lustre 2.8.0 (g0bcd520), ZFS 0.6.5.4-1
    • Severity: 2

    Description

      End Customer: MSU (Michigan State Univ)

      A user generated tons of small files and exhausted the available inodes of the MDT (single MDT, no DNE). Any attempts at deleting files as root fail.

      I looked at LU-8787 and LU-8714 but they don't seem to follow this closely enough.

      zdb -d ls15-mds-00.mdt/mdt
      Dataset ls15-mds-00.mdt/mdt [ZPL], ID 66, cr_txg 20442, 2.82T, 280362968 objects

      ls15-mds-00.mdt/mdt 2.82T 0 2.82T /ls15-mds-00.mdt/mdt

      [root@lac-373 roth]# lfs df -i
      UUID                   Inodes      IUsed  IFree  IUse%  Mounted on
      ls15-MDT0000_UUID   280362968  280362968      0   100%  /mnt/ls15[MDT:0]

      But we can't remove any files:

      [root@lac-000 1mk5_5998]# rm tor.mat
      rm: cannot remove `tor.mat': No space left on device

      I'm going to take a stab at deregistering the changelog which might free up enough space to get the MDT able to process some file deletions. If anyone has any other 'best practices' please advise.
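
A quick way to spot the condition shown above is to scan `lfs df -i` output for targets with no free inodes. The following is a minimal sketch, not part of the original report: the function name `full_targets` and the 95% threshold are assumptions chosen for illustration, and the sample text is copied from the output above rather than read from a live filesystem.

```shell
#!/bin/sh
# Print "UUID IFree" for every target whose IUse% is at or above a threshold.
# Expects text in the `lfs df -i` column layout:
#   UUID Inodes IUsed IFree IUse% Mounted on
full_targets() {   # $1 = threshold in percent (hypothetical helper)
    awk -v t="$1" '
        NR > 1 && $1 ~ /_UUID$/ && $5 ~ /%$/ {
            pct = $5
            sub(/%$/, "", pct)          # strip the trailing percent sign
            if (pct + 0 >= t) print $1, $4
        }'
}

# Sample taken from the report; on a client this would be the output of `lfs df -i`.
sample='UUID Inodes IUsed IFree IUse% Mounted on
ls15-MDT0000_UUID 280362968 280362968 0 100% /mnt/ls15[MDT:0]'

printf '%s\n' "$sample" | full_targets 95
```

Run periodically, a check like this could warn before the MDT reaches the point where deletions start failing.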

          Activity


            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26930/
            Subject: LU-8856 osd: mark specific transactions netfree
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8d1639b5cf1edbc885876956dcd6189173c00955

            gerrit Gerrit Updater added a comment


            We've encountered this at LLNL, too.

            For the benefit of other sites that end up looking at this ticket and have Lustre versions without Alex's patches, I'm working up a procedure which I'll put on wiki.lustre.org at http://wiki.lustre.org/ZFS_MDT_ENOSPC_Recovery. It will work on any ZFS >= 0.6.5 using spa_slop_shift mentioned by Alex, above.
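
The wiki procedure itself is not reproduced in this ticket. As a rough sketch of the underlying idea only, the reservation can be shrunk temporarily by raising `spa_slop_shift` on the MDS and restoring it afterwards. This assumes ZFS on Linux exposes the tunable at `/sys/module/zfs/parameters/spa_slop_shift`; the function names and the state-file path are hypothetical, and the commands would need to run as root on the server that has the pool imported.

```shell
#!/bin/sh
# PARAM and STATE can be overridden, e.g. to dry-run against ordinary files.
PARAM=${PARAM:-/sys/module/zfs/parameters/spa_slop_shift}
STATE=${STATE:-/tmp/spa_slop_shift.saved}   # hypothetical location

# Save the current shift and install a larger one.
# A larger shift means a smaller reservation (pool size / 2^shift).
shrink_slop() {   # $1 = new shift value
    [ -e "$STATE" ] && return 1             # refuse to lose an earlier saved value
    cat "$PARAM" > "$STATE" || return 1
    echo "$1" > "$PARAM"
}

# Put the saved shift back once enough files have been deleted.
restore_slop() {
    [ -f "$STATE" ] || return 1
    cat "$STATE" > "$PARAM" && rm -f "$STATE"
}

# Intended use (not executed here):
#   shrink_slop 6      # on the MDS
#   ... delete files from a client ...
#   restore_slop       # on the MDS
```

Keeping the old value in a state file means the restore step does not depend on anyone remembering what the setting was before the emergency.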

            ofaaland Olaf Faaland added a comment (edited)

            Well, I guess we can mark any transaction originating from root with the netfree flag when a special tunable is set?
            If no space can be released, the admin comes in, sets that variable, and does whatever may help using root's rights.

            bzzz Alex Zhuravlev added a comment

            I think in the "reserved with writes" case, since the admin needs to get involved they can hopefully fix the source of the problem that is consuming all the free space (e.g. stale ChangeLog consumer registered) when they delete the emergency file.

            adilger Andreas Dilger added a comment

            I think this is true for "reserved with writes" as well: changelogs/destroy logs can be quite big, so once the reserved space is released we'll just keep consuming it?
            Correct me if I'm wrong, but I don't really see a big difference.

            bzzz Alex Zhuravlev added a comment

            I think the two approaches are complementary. We can use the reserved space file for now, and use the "netfree" functionality when it is available.

            The main question about "netfree" is whether this is actually true when we delete an inode on the MDT with ChangeLogs enabled? Even if the dnode is deleted, it may not actually release space (due to shared dnode blocks) and the added ChangeLog record will consume space.

            As a result, even if the netfree functionality is available I think it makes sense to keep the emergency space reservation file around. If we never need to delete it then that is fine too, the amount of space consumed is minimal.

            adilger Andreas Dilger added a comment
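
The emergency reservation file discussed above can be illustrated generically. This is only a sketch of the idea in an ordinary directory: `reserve_space` and `release_space` are hypothetical helpers, and how the Lustre patch actually places the reservation inside the MDT pool is not shown in this thread. Random data is used on the assumption that compression may be enabled, in which case a zero-filled file would occupy almost no space.

```shell
#!/bin/sh
# Create a file of the requested size that holds space for emergencies.
reserve_space() {   # $1 = file path, $2 = size in MiB
    # Random data so that compression cannot shrink the reservation away.
    dd if=/dev/urandom of="$1" bs=1048576 count="$2" 2>/dev/null
}

# Give the space back when the filesystem has filled up.
release_space() {   # $1 = file path
    rm -f "$1"
}

# Intended use (paths are placeholders, not executed here):
#   reserve_space /path/to/emergency.reserve 64
#   ... filesystem fills up ...
#   release_space /path/to/emergency.reserve
```

As noted in the comment above, releasing the file only buys time. The consumer that filled the pool still has to be dealt with.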

            Unfortunately this capability was added in 0.7. It is not easily available on 0.6, though the majority of the required functionality is in place.

            bzzz Alex Zhuravlev added a comment

            The approach seems to work (in simple cases at least). Here is the test:

            test_803() {
                mkdir $DIR/$tdir
                # fill the MDT until create fails with ENOSPC
                createmany -m $DIR/$tdir/f 10000000000 && error "too big device?"
                # removal must still succeed on the full MDT
                rm -rf $DIR/$tdir || error "rm should succeed after ENOSPC"
            }

            == sanity test 803: OOS == 15:52:41 (1493902361)

            • create 10000 (time 1493902365.56 total 4.41 last 2269.00)
            • create 20000 (time 1493902371.51 total 10.35 last 1681.47)
            • create 30000 (time 1493902377.66 total 16.50 last 1625.92)
            • create 40000 (time 1493902384.95 total 23.80 last 1370.67)
            • create 44433 (time 1493902394.95 total 33.80 last 443.18)
            • create 46390 (time 1493902404.97 total 43.81 last 195.48)
            • create 47995 (time 1493902414.97 total 53.82 last 160.40)
            • create 49455 (time 1493902424.98 total 63.83 last 145.92)
            • create 50000 (time 1493902428.78 total 67.62 last 143.49)
            • create 51395 (time 1493902438.78 total 77.63 last 139.39)
            • create 52925 (time 1493902448.79 total 87.64 last 152.94)
            • create 54468 (time 1493902458.79 total 97.64 last 154.29)
            • create 56076 (time 1493902468.80 total 107.64 last 160.70)
            • create 57716 (time 1493902478.80 total 117.65 last 163.87)
            • create 59290 (time 1493902488.81 total 127.66 last 157.27)
            • create 60000 (time 1493902493.28 total 132.12 last 159.07)
            • create 61487 (time 1493902503.28 total 142.13 last 148.58)
              mknod(/mnt/lustre/d803.sanity/f62098) error: No space left on device
              total: 62098 create in 146.27 seconds: 424.54 ops/second
              Resetting fail_loc on all nodes...done.
              15:56:09 (1493902569) waiting for dual2 network 5 secs ...
              15:56:09 (1493902569) network interface is UP
              PASS 803 (208s)

            Without the patch, rm fails.

            bzzz Alex Zhuravlev added a comment

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/26930
            Subject: LU-8856 osd: mark specific transactions netfree
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 63b56104f21e8e1abfe962bd9ab6b749b67fed3a

            gerrit Gerrit Updater added a comment

            Andreas, there is probably another solution to the problem.

            Basically ZFS reserves some space internally:

             * Normally, we don't allow the last 3.2% (1/(2^spa_slop_shift)) of space in
             * the pool to be consumed. This ensures that we don't run the pool
             * completely out of space, due to unaccounted changes (e.g. to the MOS).
             * It also limits the worst-case time to allocate space. If we have
             * less than this amount of free space, most ZPL operations (e.g. write,
             * create) will return ENOSPC.
             *
             * Certain operations (e.g. file removal, most administrative actions) can
             * use half the slop space. They will only return ENOSPC if less than half
             * the slop space is free. Typically, once the pool has less than the slop
             * space free, the user will use these operations to free up space in the pool.
             * These are the operations that call dsl_pool_adjustedsize() with the netfree
             * argument set to TRUE.

            We can mark any transaction "net free" using dmu_tx_mark_netfree().

            So the very first thing would be to mark transactions involving object destroy.
            Then we could have a procfs tunable so that the sysadmin can turn that on for specific transactions (e.g. those originating from root).

            bzzz Alex Zhuravlev added a comment
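
The slop-space rule quoted from the ZFS source can be worked through with a little arithmetic. The sketch below is a simplified model of that comment, not the ZFS implementation, which has further details. The helper names and the 1 TiB pool size are assumptions for illustration. With a shift of 5 the reservation is 1/32 of the pool, which is the roughly 3% figure the comment mentions.

```shell
#!/bin/sh
# Simplified model of the rule in the quoted comment:
#   slop = pool_size / 2^spa_slop_shift
#   ordinary transactions get ENOSPC when free < slop
#   net-free transactions get ENOSPC only when free < slop / 2
slop_bytes() {    # $1 = pool size in bytes, $2 = spa_slop_shift
    echo $(( $1 >> $2 ))
}

tx_allowed() {    # $1 = free bytes, $2 = slop bytes, $3 = "netfree" or "normal"
    limit=$2
    [ "$3" = netfree ] && limit=$(( $2 / 2 ))
    [ "$1" -ge "$limit" ]
}

pool=$(( 1 << 40 ))             # hypothetical 1 TiB pool
slop=$(slop_bytes "$pool" 5)    # shift 5 -> 1/32 of the pool
echo "slop: $slop bytes"

free=$(( slop * 3 / 4 ))        # below the slop, but above half of it
tx_allowed "$free" "$slop" normal  && echo "create: ok" || echo "create: ENOSPC"
tx_allowed "$free" "$slop" netfree && echo "unlink: ok" || echo "unlink: ENOSPC"
```

In this model a pool with three quarters of its slop space free rejects creates but still accepts removals, which is the window that marking destroy transactions as net free is meant to open.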

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/26868
            Subject: LU-8856 osd: reserve space in zfs pool for emergency
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5920a86e0a1fad6669997483a28fcea95d9a2fce

            gerrit Gerrit Updater added a comment

            People

              bzzz Alex Zhuravlev
              aeonjeffj Jeff Johnson (Inactive)