[LU-8856] ZFS-MDT 100% full. Cannot delete files. Created: 21/Nov/16  Updated: 06/Feb/24  Resolved: 15/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4

Type: Bug Priority: Critical
Reporter: Jeff Johnson (Inactive) Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

CentOS 6.8 2.6.32_504.30.3.el6.x86_64, Lustre 2.8.0 (g0bcd520), ZFS 0.6.5.4-1


Issue Links:
Duplicate
Related
is related to LU-7340 ChangeLogs catalog full condition sho... Resolved
is related to LU-10732 sanity-lfsck test_9a: FAIL: (7) Faile... Resolved
Epic/Theme: zfs
Severity: 2
Epic: metadata, zfs
Rank (Obsolete): 9223372036854775807

 Description   

End Customer: MSU (Michigan State Univ)

A user generated tons of small files and exhausted the available inodes of the MDT (single MDT, no DNE). Any attempts at deleting files as root fail.

I looked at LU-8787 and LU-8714 but they don't seem to follow this closely enough.

zdb -d ls15-mds-00.mdt/mdt
Dataset ls15-mds-00.mdt/mdt [ZPL], ID 66, cr_txg 20442, 2.82T, 280362968 objects

ls15-mds-00.mdt/mdt 2.82T 0 2.82T /ls15-mds-00.mdt/mdt

[root@lac-373 roth]# lfs df -i
UUID Inodes IUsed IFree IUse% Mounted on
ls15-MDT0000_UUID 280362968 280362968 0 100% /mnt/ls15[MDT:0]

But we can't remove any files:

[root@lac-000 1mk5_5998]# rm tor.mat
rm: cannot remove `tor.mat': No space left on device

I'm going to take a stab at deregistering the changelog which might free up enough space to get the MDT able to process some file deletions. If anyone has any other 'best practices' please advise.



 Comments   
Comment by Andreas Dilger [ 21/Nov/16 ]

If you have ChangeLogs active without an active consumer, then this will definitely consume a lot of space that does not get freed until the ChangeLog is processed or removed. Also, having an active ChangeLog means that some space is needed for the CL record at unlink time.

Do you have any snapshots of this filesystem? If yes, then deleting the oldest snapshot should also free up some space.

It may be that mounting and unmounting the dataset (up to 4 times) will allow old committed transactions to free up space.

If none of these options work, it may be possible to mount the filesystem locally as type zfs and delete some specific files; however, we should discuss that before any such action is taken.

Finally, one option would be to add extra storage to the MDT zpool. However, note that it will not be possible to remove those devices after they are added, so if this is done they should be configured correctly as mirrored VDEV(s) to maintain reliability.
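
For the last option, a dry-run sketch of the pool expansion may help. The helper name mirror_add_cmd and the device names are hypothetical; `zpool add <pool> mirror <dev> <dev>` is the usual ZFS syntax for adding a mirrored vdev, and the pool name is taken from the zdb output in the description. The helper only prints the command, since the addition cannot be reversed.

```shell
# Dry-run helper: prints the command instead of running it, because
# adding a vdev to a pool cannot be undone.
# mirror_add_cmd <pool> <device> <device> [...]
mirror_add_cmd() {
    pool=$1
    shift
    # A mirror needs at least two devices; refuse anything less so a
    # single unprotected disk is never added to the MDT pool by mistake.
    if [ "$#" -lt 2 ]; then
        echo "need at least two devices for a mirrored vdev" >&2
        return 1
    fi
    echo "zpool add $pool mirror $*"
}

# Example with hypothetical device names:
mirror_add_cmd ls15-mds-00.mdt /dev/sdx /dev/sdy
```

Review the printed command before running it by hand on the MDS.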

Comment by Jeff Johnson (Inactive) [ 21/Nov/16 ]

Draining changelogs or deregistering the changelog isn't working. For some reason the changelog doesn't have a user. The user was cl1 and the logs of the robinhood server show it was processing using cl1.

On MDS it appears that there are no unprocessed changelog entries but robinhood was running up until a few months ago so there should be unprocessed changes stored:

# cat /proc/fs/lustre/mdd/ls15-MDT0000/changelog_users
current index: 164447373
ID    index

From a client, as root: (produces lots of output)

# lfs changelog ls15-MDT0000|head
160211907 12LYOUT 01:50:56.131136452 2015.11.01 0x0 t=[0x200002cf8:0x12e9f:0x0]
160211908 12LYOUT 01:50:56.131136452 2015.11.01 0x0 t=[0x200002d22:0x167e9:0x0]
160211909 13TRUNC 01:50:56.132136455 2015.11.01 0xe t=[0x200002ca8:0x8a69:0x0]
160211910 13TRUNC 01:50:56.132136455 2015.11.01 0xe t=[0x200002cb4:0x17a3d:0x0]
160211911 11CLOSE 01:50:56.132136455 2015.11.01 0x42 t=[0x200002c37:0x8577:0x0]
160211912 11CLOSE 01:50:56.133136458 2015.11.01 0x42 t=[0x200002ca8:0x8a69:0x0]
160211913 11CLOSE 01:50:56.133136458 2015.11.01 0x42 t=[0x200002cb4:0x17a3d:0x0]

Trying to clear as root from a client:

# lfs changelog_clear ls15-MDT0000 cl1 0
changelog_clear error: No such file or directory

Trying to deregister from the MDS:

[root@ls15-mds-00.i ~]# lctl --device ls15-MDT0000 changelog_deregister cl1
error: changelog_deregister: No such file or directory

[root@ls15-mds-00.i ~]# lctl --device ls15-MDT0000 changelog_deregister cl0
error: changelog_deregister: expected id of the form cl<num> got 'cl0'
deregister an existing changelog user
usage:	device <mdtname>
	changelog_deregister <id>
run <command> after connecting to device <devno>
--device <devno> <command [args ...]>

Logs from robinhood server showing consumption of changelogs using reader_id 'cl1':

======== General statistics =========
Daemon start time: 2016/07/28 18:48:59
Started modules: log_reader
ChangeLog reader #0:
   fs_name    =   ls15
   mdt_name   =   MDT0000
   reader_id  =   cl1
   records read        = 4235467
   interesting records = 2823646
   suppressed records  = 1411821
   records pending     = 0
   last received            = 2016/07/28 19:29:26
   last read record time    = 2015/10/31 22:28:52.489794
   last read record id      = 164447373
   last pushed record id    = 164447370
   last committed record id = 164447370
   last cleared record id   = 164447370
   read speed               = 0.00 record/sec (0.00 incl. idle time)
   processing speed ratio   = 0.00
   ChangeLog stats:
   MARK: 0, CREAT: 0, MKDIR: 0, HLINK: 0, SLINK: 0, MKNOD: 0, UNLNK: 0, RMDIR: 0, RENME: 0
   RNMTO: 0, OPEN: 0, CLOSE: 1411823, LYOUT: 1411822, TRUNC: 1411822, SATTR: 0, XATTR: 0
   HSM: 0, MTIME: 0, CTIME: 0, ATIME: 0
Comment by Jeff Johnson (Inactive) [ 21/Nov/16 ]

There are no snapshots in the MDT pool.

I was hoping to figure out how to see the changelog file or directory using zdb but I can't seem to find which object ID it might be. With over 2T full there are lots of entries to try and poke at. Is there any sort of default object ID for the changelog file(s) or directory?

By 'dataset' are you referring to unmounting and remounting the LFS server-side targets? Basically take down and remount the server-side of the LFS 3-4 times?

Comment by Jeff Johnson (Inactive) [ 21/Nov/16 ]

We threw hardware at it. Expanded the MDT pool by adding a mirrored vdev and the extra 320GB gave room to move around and delete files.

I'd still like to walk down this ticket so some best practices could be offered in the event a future occurrence doesn't have extra hardware at hand.

Comment by Andreas Dilger [ 21/Nov/16 ]

While we do try to reserve space in the MDT and OST zpools (OSD_STATFS_RESERVED_SIZE), I suspect we are not taking this into account when allocating files on the MDT, only on the OST.

Separately, we need to look into how ChangeLogs are handled when the MDT is "full". The "unused ChangeLog is filling MDT" problem seems to be happening a lot. I think we need to handle this automatically, by tracking how much space the ChangeLog consumes. If the MDT is too full, the oldest ChangeLog user that hasn't been used in some time (a week?) should be unregistered (with a clear LCONSOLE() error message printed) and records purged up to the next CL user. CL deregistration should be repeated in LRU order as needed until enough free space is available or no more unused CL users exist. It shouldn't automatically deregister active CL users (e.g. used less than one day ago) since that could be used as a DoS to deactivate filesystem monitoring tools.

A /proc tunable should be available to disable automatic CL user deregistration, and when this is set users would get ENOSPC instead of success when trying to modify the MDT. This should not be the default behaviour, however, and only used if it is more important to track every filesystem operation than it is to be able to use the filesystem.
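
The selection order described above could be sketched roughly as follows. This is only an illustration of the policy, not the eventual implementation: the function name and the input format (one "<id> <seconds since last use>" pair per line) are hypothetical, since changelog_users in 2.8 reports only an ID and an index. The one-week default comes from the "(a week?)" suggestion.

```shell
# stale_changelog_users [minimum idle seconds]
# Input (stdin): one "<changelog user id> <seconds since last use>" pair
# per line. This input format is hypothetical.
# Output: ids idle longer than the threshold, most idle first (LRU order).
stale_changelog_users() {
    min_idle=${1:-604800}   # default: one week, in seconds
    awk -v min="$min_idle" '$2 + 0 > min + 0 { print $2, $1 }' |
        sort -rn |
        awk '{ print $2 }'
}

# cl1 idle 90 days, cl2 idle 1 hour, cl3 idle 14 days
printf 'cl1 7776000\ncl2 3600\ncl3 1209600\n' | stale_changelog_users
# prints:
# cl1
# cl3
```

Each id printed would then be a candidate for `lctl --device <mdtname> changelog_deregister <id>`, stopping as soon as enough space has been freed. Recently active users such as cl2 are never listed.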

Comment by Peter Jones [ 22/Nov/16 ]

Lai

Could you please assist with this one?

Thanks

Peter

Comment by Andreas Dilger [ 10/Jan/17 ]

Two things need to be done here to handle this problem automatically, since this problem of ChangeLogs filling the MDT has happened several times:

  1. if the MDT is too full and the ChangeLog consumes too much space (see also patch https://review.whamcloud.com/16416 "LU-7156 mdd: add changelog_size to procfs"), and some ChangeLog user hasn't been used in some time (over a week?) it should be unregistered (with a clear LCONSOLE() error message printed) and records purged up to the next CL user (this should happen automatically), repeat for the next CL user as needed. Possibly a /sys/fs/lustre tunable to disable automatic CL user deregistration should be available, and further operations on the filesystem would get ENOSPC instead of success (as it does today), if it really is critical to track every operation, but that should not be the default behavior. The deregistration should not be done for recently active ChangeLog users (< 24h), since this would potentially allow users to disable the ChangeLogs just by filling the MDT, and there is little benefit to removing the CL user if it does not free up much space.
  2. reserve more space for ZFS filesystems, and when the threshold is hit only allow files to be deleted. This needs to be done in conjunction with automatic removal of old ChangeLog, otherwise deleting files will free some space (assuming whole metadnode blocks are freed) but it will also consume space in the ChangeLog.
Comment by Andreas Dilger [ 10/Jan/17 ]

One option for a very simple short-term solution for the ZFS space reservation is to have the MDS or OSD startup check the size of and/or write a 10MB file in the MDS root with a name like IN_CASE_OF_ENOSPC_TRUNCATE_THIS_FILE. This would be large enough that manually truncating the file from a local mount, in an emergency situation like this one, releases enough space to start deleting files again. It isn't high-tech, but is definitely a robust way to reserve space for such an emergency, and in most cases that space won't be missed. The size of the file could be scaled down for small test filesystems below, say, 10GB or skipped completely. Some care would be needed to avoid refilling the file immediately after mount if the MDS is just being mounted after truncating the ICE file and files are being deleted.

However, it shouldn't delay too long in repopulating the file to avoid the situation where there is some runaway user job that continues to fill the filesystem and it gets back into the same situation again immediately.  The benefit of this low-tech approach (vs. an in-memory reservation of space, and selectively blocking all but file/directory removal operations) is that this could be implemented quickly and potentially backported to existing releases with little risk.
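
A minimal sketch of the reserve-file idea, assuming it is run against a locally mounted directory. The function name and the default size handling are illustrative only. The use of random data is also an assumption, made because a file of zeros may take almost no space on a ZFS dataset with compression enabled.

```shell
RESERVE_NAME=IN_CASE_OF_ENOSPC_TRUNCATE_THIS_FILE

# ensure_reserve_file <directory> [size in bytes]
# Creates or refills the reserve file when it is missing or too small.
ensure_reserve_file() {
    dir=$1
    want=${2:-10485760}          # 10 MiB, the size suggested above
    file=$dir/$RESERVE_NAME
    have=0
    if [ -f "$file" ]; then
        have=$(( $(wc -c < "$file") ))
    fi
    if [ "$have" -ge "$want" ]; then
        return 0                 # already populated, nothing to do
    fi
    # Random data rather than zeros, so that compression cannot shrink
    # the file and quietly give the reserved space away.
    head -c "$want" /dev/urandom > "$file"
}

# Usage (hypothetical mount point):
#   ensure_reserve_file /mnt/mdt-local
# In an emergency, release the space with:
#   : > /mnt/mdt-local/IN_CASE_OF_ENOSPC_TRUNCATE_THIS_FILE
```

The sketch refills the file whenever it is called, so it does not address the timing concern above. A real implementation would have to delay the refill after an emergency truncate.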

Comment by Gerrit Updater [ 27/Apr/17 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/26868
Subject: LU-8856 osd: reserve space in zfs pool for emergency
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5920a86e0a1fad6669997483a28fcea95d9a2fce

Comment by Alex Zhuravlev [ 03/May/17 ]

Andreas, probably there is another solution for the problem.

Basically ZFS reserves some space internally:

  /*
   * Normally, we don't allow the last 3.2% (1/(2^spa_slop_shift)) of space in
   * the pool to be consumed. This ensures that we don't run the pool
   * completely out of space, due to unaccounted changes (e.g. to the MOS).
   * It also limits the worst-case time to allocate space. If we have
   * less than this amount of free space, most ZPL operations (e.g. write,
   * create) will return ENOSPC.
   *
   * Certain operations (e.g. file removal, most administrative actions) can
   * use half the slop space. They will only return ENOSPC if less than half
   * the slop space is free. Typically, once the pool has less than the slop
   * space free, the user will use these operations to free up space in the pool.
   * These are the operations that call dsl_pool_adjustedsize() with the netfree
   * argument set to TRUE.
   */

we can mark any transaction "net free" using dmu_tx_mark_netfree()

so the very first thing would be to mark transactions involving object destroy.
then we could have a procfs tunable so that the sysadmin can turn that on for specific transactions (e.g. originated from root).
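
To put numbers on the slop-space rule quoted above, here is a small arithmetic sketch. It assumes the default shift of 5 implied by the 1/(2^spa_slop_shift) figure, where 1/32 is 3.125%, roughly the 3.2% the comment cites. It ignores any minimum or maximum clamp that a particular ZFS version may apply. The function names and the 1 TiB pool size are illustrative.

```shell
# slop_bytes <pool size in bytes> [spa_slop_shift]
# Space that ordinary operations (write, create) may not consume.
slop_bytes() {
    echo $(( $1 >> ${2:-5} ))
}

# netfree_floor_bytes <pool size in bytes> [spa_slop_shift]
# "Net free" operations may use half the slop space, so they only get
# ENOSPC once free space drops below this value.
netfree_floor_bytes() {
    echo $(( $(slop_bytes "$1" "${2:-5}") / 2 ))
}

pool=1099511627776                 # 1 TiB, an arbitrary example size
slop_bytes "$pool"                 # prints 34359738368 (32 GiB)
netfree_floor_bytes "$pool"        # prints 17179869184 (16 GiB)
```

On such a pool, creates would start failing with about 32 GiB still free, while a transaction marked netfree could keep going until roughly 16 GiB remained.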

Comment by Gerrit Updater [ 03/May/17 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/26930
Subject: LU-8856 osd: mark specific transactions netfree
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 63b56104f21e8e1abfe962bd9ab6b749b67fed3a

Comment by Alex Zhuravlev [ 04/May/17 ]

the approach seems to work (in simple cases at least). here is the test:

test_803() {
	mkdir $DIR/$tdir
	createmany -m $DIR/$tdir/f 10000000000 && error "too big device?"
	rm -rf $DIR/$tdir || error "rm should succeed after ENOSPC"
}

== sanity test 803: OOS == 15:52:41 (1493902361)

 - create 10000 (time 1493902365.56 total 4.41 last 2269.00)
 - create 20000 (time 1493902371.51 total 10.35 last 1681.47)
 - create 30000 (time 1493902377.66 total 16.50 last 1625.92)
 - create 40000 (time 1493902384.95 total 23.80 last 1370.67)
 - create 44433 (time 1493902394.95 total 33.80 last 443.18)
 - create 46390 (time 1493902404.97 total 43.81 last 195.48)
 - create 47995 (time 1493902414.97 total 53.82 last 160.40)
 - create 49455 (time 1493902424.98 total 63.83 last 145.92)
 - create 50000 (time 1493902428.78 total 67.62 last 143.49)
 - create 51395 (time 1493902438.78 total 77.63 last 139.39)
 - create 52925 (time 1493902448.79 total 87.64 last 152.94)
 - create 54468 (time 1493902458.79 total 97.64 last 154.29)
 - create 56076 (time 1493902468.80 total 107.64 last 160.70)
 - create 57716 (time 1493902478.80 total 117.65 last 163.87)
 - create 59290 (time 1493902488.81 total 127.66 last 157.27)
 - create 60000 (time 1493902493.28 total 132.12 last 159.07)
 - create 61487 (time 1493902503.28 total 142.13 last 148.58)
mknod(/mnt/lustre/d803.sanity/f62098) error: No space left on device
total: 62098 create in 146.27 seconds: 424.54 ops/second
Resetting fail_loc on all nodes...done.
15:56:09 (1493902569) waiting for dual2 network 5 secs ...
15:56:09 (1493902569) network interface is UP
PASS 803 (208s)

w/o the patch rm fails..

Comment by Alex Zhuravlev [ 04/May/17 ]

unfortunately this capability was added in ZFS 0.7; it's not easily available on 0.6, though the majority of the required functionality is in place.

Comment by Andreas Dilger [ 04/May/17 ]

I think the two approaches are complementary. We can use the reserved space file for now, and use the "netfree" functionality when it is available.

The main question about "netfree" is whether this is actually true when we delete an inode on the MDT with ChangeLogs enabled? Even if the dnode is deleted, it may not actually release space (due to shared dnode blocks) and the added ChangeLog record will consume space.

As a result, even if the netfree functionality is available I think it makes sense to keep the emergency space reservation file around. If we never need to delete it then that is fine too, the amount of space consumed is minimal.

Comment by Alex Zhuravlev [ 04/May/17 ]

I think this is true for "reserved with writes" as well: changelogs/destroy logs can be quite big, so even after the reserved space is released we'll keep consuming it?
correct me if I'm wrong, but I don't really see a big difference.

Comment by Andreas Dilger [ 16/May/17 ]

I think in the "reserved with writes" case, since the admin needs to get involved they can hopefully fix the source of the problem that is consuming all the free space (e.g. stale ChangeLog consumer registered) when they delete the emergency file.

Comment by Alex Zhuravlev [ 16/May/17 ]

well, I guess we can mark any transaction originated from root with the netfree flag when a special tunable is set?
if no space can be released, the admin comes in, sets that variable and does whatever may help with root's rights.

Comment by Olaf Faaland [ 11/Jan/18 ]

We've encountered this at LLNL, too.

For the benefit of other sites that end up looking at this ticket and have Lustre versions without Alex's patches, I'm working up a procedure which I'll put on wiki.lustre.org at http://wiki.lustre.org/ZFS_MDT_ENOSPC_Recovery. It will work on any ZFS >= 0.6.5 using spa_slop_shift mentioned by Alex, above.
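
As a rough illustration of how spa_slop_shift could be used for this (an assumption about the general idea, not a summary of the wiki procedure): a larger shift shrinks the reserved slop space, which may leave room for deletions to proceed. The helper name is hypothetical. /sys/module/zfs/parameters is where ZFS on Linux exposes module parameters, and the directory can be overridden through the ZFS_PARAMS variable, which is also an invention of this sketch.

```shell
# Directory holding ZFS module parameters on Linux; overridable for testing.
ZFS_PARAMS=${ZFS_PARAMS:-/sys/module/zfs/parameters}

# with_slop_shift <temporary shift> <command> [args...]
# Runs the command with spa_slop_shift temporarily changed, then restores
# the previous value whether or not the command succeeded.
with_slop_shift() {
    new=$1
    shift
    old=$(cat "$ZFS_PARAMS/spa_slop_shift") || return 1
    echo "$new" > "$ZFS_PARAMS/spa_slop_shift" || return 1
    "$@" && rc=0 || rc=$?
    echo "$old" > "$ZFS_PARAMS/spa_slop_shift"
    return "$rc"
}

# Usage on the MDS (hypothetical cleanup script, run as root):
#   with_slop_shift 6 ./delete_some_files.sh
```

Restoring the old value afterwards matters, because the slop space is what protects the pool from running completely out of space.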

Comment by Gerrit Updater [ 27/Feb/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26930/
Subject: LU-8856 osd: mark specific transactions netfree
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8d1639b5cf1edbc885876956dcd6189173c00955

Comment by Peter Jones [ 27/Feb/18 ]

Landed for 2.11

Comment by Gerrit Updater [ 27/Feb/18 ]

Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: https://review.whamcloud.com/31442
Subject: Revert "LU-8856 osd: mark specific transactions netfree"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 83be173e2848e3b81d6fb2123d70d0cf614105a8

Comment by Peter Jones [ 27/Feb/18 ]

Reopening due to LU-10732

Comment by Gerrit Updater [ 27/Feb/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31442/
Subject: Revert "LU-8856 osd: mark specific transactions netfree"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c2caa40bd38e7645dc4ac90552e12e3fb7fde476

Comment by Gerrit Updater [ 27/Feb/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31443
Subject: LU-8856 osd: mark specific transactions netfree
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: d47e3892f83d2d1bdb21653381a8d1ec0db68a4a

Comment by Gerrit Updater [ 27/Feb/18 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/31444
Subject: LU-8856 osd: mark specific transactions netfree
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8bc89e5b3aeda1bf15f2ff6fc53651e470c0a6c6

Comment by Gerrit Updater [ 15/Mar/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31444/
Subject: LU-8856 osd: mark specific transactions netfree
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 106abc184d8b57de560dc1874683ce5487dcf30a

Comment by Peter Jones [ 15/Mar/18 ]

Landed for 2.11

Comment by Gerrit Updater [ 23/Mar/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31751
Subject: LU-8856 osd: mark specific transactions netfree
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: aaafe65b47925dd83b2138d55843ac61af2967a8

Comment by Gerrit Updater [ 03/May/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31751/
Subject: LU-8856 osd: mark specific transactions netfree
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: a8c7a32fd7fc54e9717e23f208b40c8ff93b81e4

Generated at Sat Feb 10 02:21:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.