[LU-8856] ZFS-MDT 100% full. Cannot delete files. - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4
Affects Version/s: Lustre 2.8.0
Labels:
- llnl
Environment:
CentOS 6.8 2.6.32_504.30.3.el6.x86_64, Lustre 2.8.0 (g0bcd520), ZFS 0.6.5.4-1

Epic/Theme:
- zfs
Severity:
2
Epic:
- metadata
- zfs
Rank (Obsolete):
9223372036854775807

Description

End Customer: MSU (Michigan State Univ)

A user generated tons of small files and exhausted the available inodes of the MDT (single MDT, no DNE). Any attempts at deleting files as root fail.

I looked at ~~LU-8787~~ and LU-8714 but they don't seem to follow this closely enough.

zdb -d ls15-mds-00.mdt/mdt
Dataset ls15-mds-00.mdt/mdt [ZPL], ID 66, cr_txg 20442, 2.82T, 280362968 objects

ls15-mds-00.mdt/mdt 2.82T 0 2.82T /ls15-mds-00.mdt/mdt

[root@lac-373 roth]# lfs df -i
UUID Inodes IUsed IFree IUse% Mounted on
ls15-MDT0000_UUID 280362968 280362968 0 100% /mnt/ls15[MDT:0]
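
A minimal sketch for spotting this condition from a script. It assumes the column layout shown in the `lfs df -i` output above (UUID, Inodes, IUsed, IFree, IUse%, mount point). The function name `full_mdts` is hypothetical and not part of Lustre.

```shell
#!/bin/sh
# full_mdts: read `lfs df -i` output on stdin and print the UUID of every
# MDT line whose IFree column (4th field) is 0.
# Assumed layout: UUID Inodes IUsed IFree IUse% Mounted-on
# Example use on a client: lfs df -i /mnt/ls15 | full_mdts
full_mdts() {
    awk '$1 ~ /MDT[0-9a-fA-F]+_UUID$/ && $4 == 0 { print $1 }'
}
```

The header line and any OST lines are skipped because their first field does not match the MDT UUID pattern.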

But we can't remove any files:

[root@lac-000 1mk5_5998]# rm tor.mat
rm: cannot remove `tor.mat': No space left on device

I'm going to take a stab at deregistering the changelog which might free up enough space to get the MDT able to process some file deletions. If anyone has any other 'best practices' please advise.
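
For reference, a sketch of what the changelog deregistration would look like. It only prints the `lctl` commands rather than running them. The changelog user ID (`cl1` in the usage comment) is a placeholder and has to be read from the `changelog_users` output on the MDS. Whether this frees enough space for deletions to proceed is not established in this ticket.

```shell
#!/bin/sh
# changelog_cleanup_cmds MDT USER_ID
# Print (do not execute) the commands to list changelog users on an MDT and
# to deregister one of them. Run the printed commands on the MDS by hand.
# Example: changelog_cleanup_cmds ls15-MDT0000 cl1   (cl1 is a placeholder ID)
changelog_cleanup_cmds() {
    mdt=$1     # MDT device name, e.g. ls15-MDT0000 from the lfs df output
    user=$2    # changelog user ID as reported by changelog_users
    echo "lctl get_param mdd.${mdt}.changelog_users"
    echo "lctl --device ${mdt} changelog_deregister ${user}"
}
```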

Attachments

Issue Links

is related to

LU-10732 sanity-lfsck test_9a: FAIL: (7) Failed to get expected 'completed'

Resolved

is related to

LU-7340 ChangeLogs catalog full condition should be handled more gracefully

Resolved

Activity


Gerrit Updater added a comment - 27/Feb/18 3:41 AM

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26930/
Subject: ~~LU-8856~~ osd: mark specific transactions netfree
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8d1639b5cf1edbc885876956dcd6189173c00955


Olaf Faaland added a comment - 11/Jan/18 7:25 PM - edited

We've encountered this at LLNL, too.

For the benefit of other sites that end up looking at this ticket and have Lustre versions without Alex's patches, I'm working up a procedure which I'll put on wiki.lustre.org at http://wiki.lustre.org/ZFS_MDT_ENOSPC_Recovery. It will work on any ZFS >= 0.6.5 using spa_slop_shift mentioned by Alex, above.


Alex Zhuravlev added a comment - 16/May/17 12:37 PM

Well, I guess we could mark any transaction originating from root with the netfree flag when a special tunable is set?
If no space can be released, the admin comes in, sets that variable, and does whatever may help using root's rights.


Andreas Dilger added a comment - 16/May/17 7:33 AM

I think in the "reserved with writes" case, since the admin needs to get involved they can hopefully fix the source of the problem that is consuming all the free space (e.g. stale ChangeLog consumer registered) when they delete the emergency file.


Alex Zhuravlev added a comment - 04/May/17 5:34 PM

I think this is true for "reserved with writes" as well: changelogs/destroy logs can be quite big, so even once that reserved space is released we'll keep consuming it?
Correct me if I'm wrong, but I don't really see a big difference.


Andreas Dilger added a comment - 04/May/17 5:17 PM

I think the two approaches are complementary. We can use the reserved space file for now, and use the "netfree" functionality when it is available.

The main question about "netfree" is whether this is actually true when we delete an inode on the MDT with ChangeLogs enabled? Even if the dnode is deleted, it may not actually release space (due to shared dnode blocks) and the added ChangeLog record will consume space.

As a result, even if the netfree functionality is available I think it makes sense to keep the emergency space reservation file around. If we never need to delete it then that is fine too, the amount of space consumed is minimal.


Alex Zhuravlev added a comment - 04/May/17 2:29 PM

Unfortunately this capability was added in 0.7. It's not easily available on 0.6, though the majority of the required functionality is in place.


Alex Zhuravlev added a comment - 04/May/17 9:00 AM

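The approach seems to work (in simple cases at least). Here is the test:

test_803() {
	mkdir $DIR/$tdir
	# create empty files until the MDT returns ENOSPC; success would mean the device never filled
	createmany -m $DIR/$tdir/f 10000000000 && error "too big device?"
	# removal must still succeed on the full MDT
	rm -rf $DIR/$tdir || error "rm should succeed after ENOSPC"
}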

== sanity test 803: OOS == 15:52:41 (1493902361)

create 10000 (time 1493902365.56 total 4.41 last 2269.00)
create 20000 (time 1493902371.51 total 10.35 last 1681.47)
create 30000 (time 1493902377.66 total 16.50 last 1625.92)
create 40000 (time 1493902384.95 total 23.80 last 1370.67)
create 44433 (time 1493902394.95 total 33.80 last 443.18)
create 46390 (time 1493902404.97 total 43.81 last 195.48)
create 47995 (time 1493902414.97 total 53.82 last 160.40)
create 49455 (time 1493902424.98 total 63.83 last 145.92)
create 50000 (time 1493902428.78 total 67.62 last 143.49)
create 51395 (time 1493902438.78 total 77.63 last 139.39)
create 52925 (time 1493902448.79 total 87.64 last 152.94)
create 54468 (time 1493902458.79 total 97.64 last 154.29)
create 56076 (time 1493902468.80 total 107.64 last 160.70)
create 57716 (time 1493902478.80 total 117.65 last 163.87)
create 59290 (time 1493902488.81 total 127.66 last 157.27)
create 60000 (time 1493902493.28 total 132.12 last 159.07)
create 61487 (time 1493902503.28 total 142.13 last 148.58)
mknod(/mnt/lustre/d803.sanity/f62098) error: No space left on device
total: 62098 create in 146.27 seconds: 424.54 ops/second
Resetting fail_loc on all nodes...done.
15:56:09 (1493902569) waiting for dual2 network 5 secs ...
15:56:09 (1493902569) network interface is UP
PASS 803 (208s)

w/o the patch rm fails..


Gerrit Updater added a comment - 03/May/17 12:47 PM

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/26930
Subject: ~~LU-8856~~ osd: mark specific transactions netfree
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 63b56104f21e8e1abfe962bd9ab6b749b67fed3a


Alex Zhuravlev added a comment - 03/May/17 12:28 PM

Andreas, probably there is another solution for the problem.

Basically ZFS reserves some space internally:

Normally, we don't allow the last 3.2% (1/(2^spa_slop_shift)) of space in
the pool to be consumed. This ensures that we don't run the pool
completely out of space, due to unaccounted changes (e.g. to the MOS).
It also limits the worst-case time to allocate space. If we have
less than this amount of free space, most ZPL operations (e.g. write,
create) will return ENOSPC.

Certain operations (e.g. file removal, most administrative actions) can
use half the slop space. They will only return ENOSPC if less than half
the slop space is free. Typically, once the pool has less than the slop
space free, the user will use these operations to free up space in the pool.
These are the operations that call dsl_pool_adjustedsize() with the netfree
argument set to TRUE.

We can mark any transaction "net free" using dmu_tx_mark_netfree().

So the very first step would be to mark transactions involving object destroy.
Then we could have a procfs tunable so that the sysadmin can turn that on for specific transactions (e.g. those originating from root).
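
To make the arithmetic in the quoted ZFS comment concrete, here is a rough sketch of the two thresholds it describes. The function names are hypothetical. It assumes the slop is simply pool size divided by 2^spa_slop_shift and ignores any minimum slop size a given ZFS release may enforce.

```shell
#!/bin/sh
# slop_bytes POOL_BYTES SHIFT
# Space normally held back from ordinary operations: pool / 2^shift.
slop_bytes() {
    echo $(( $1 >> $2 ))
}

# netfree_floor POOL_BYTES SHIFT
# Free-space level below which even netfree transactions (e.g. removals)
# would get ENOSPC: half of the slop space.
netfree_floor() {
    echo $(( ($1 >> $2) >> 1 ))
}
```

For a 1 TiB pool (1099511627776 bytes) with a shift of 5, `slop_bytes` gives 34359738368 (32 GiB) and `netfree_floor` gives 17179869184 (16 GiB). Raising the shift by one halves both figures.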


Gerrit Updater added a comment - 27/Apr/17 4:03 PM

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/26868
Subject: ~~LU-8856~~ osd: reserve space in zfs pool for emergency
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5920a86e0a1fad6669997483a28fcea95d9a2fce


People

Assignee: Alex Zhuravlev

Reporter: Jeff Johnson (Inactive)

Votes: 0

Watchers: 8

Dates

Created: 21/Nov/16 6:43 PM

Updated: 06/Feb/24 6:44 AM

Resolved: 15/Mar/18 2:04 PM