[LU-8856] ZFS-MDT 100% full. Cannot delete files. Created: 21/Nov/16 Updated: 06/Feb/24 Resolved: 15/Mar/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.11.0, Lustre 2.10.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jeff Johnson (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Environment: | CentOS 6.8 (kernel 2.6.32-504.30.3.el6.x86_64), Lustre 2.8.0 (g0bcd520), ZFS 0.6.5.4-1 |
| Epic/Theme: | zfs |
| Severity: | 2 |
| Epic: | metadata, zfs |
| Description |
|
End Customer: MSU (Michigan State Univ)

A user generated tons of small files and exhausted the available inodes of the MDT (single MDT, no DNE). Any attempt to delete files, even as root, fails.

I looked at zdb:

    # zdb -d ls15-mds-00.mdt/mdt
    ls15-mds-00.mdt/mdt  2.82T  0  2.82T  /ls15-mds-00.mdt/mdt

    [root@lac-373 roth]# lfs df -i

But we can't remove any files:

    [root@lac-000 1mk5_5998]# rm tor.mat

I'm going to take a stab at deregistering the changelog, which might free up enough space for the MDT to process some file deletions. If anyone has any other 'best practices' please advise. |
| Comments |
| Comment by Andreas Dilger [ 21/Nov/16 ] |
|
If you have ChangeLogs active without an active consumer, then this will definitely consume a lot of space that does not get freed until the ChangeLog is processed or removed. Also, having an active ChangeLog means that some space is needed for the ChangeLog record at unlink time.

Do you have any snapshots of this filesystem? If yes, then deleting the oldest snapshot should also free up some space.

It may be that mounting and unmounting the dataset (up to 4 times) will allow old committed transactions to free up space.

If none of these options work, it may be possible to mount the filesystem locally as type zfs and delete some specific files; however, we should discuss that before any action like this is taken.

Finally, one option would be to add extra storage to the MDT zpool. However, note that it will not be possible to remove those devices after they are added, so if this is done they should be configured correctly as mirrored VDEV(s) to maintain reliability. |
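A minimal sketch of that last option (the device names here are hypothetical placeholders; as noted above, the addition is permanent):

    # attach a mirrored pair of devices to grow the MDT pool
    zpool add ls15-mds-00.mdt mirror /dev/disk/by-id/diskA /dev/disk/by-id/diskB
    # confirm the new capacity
    zpool list ls15-mds-00.mdt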
| Comment by Jeff Johnson (Inactive) [ 21/Nov/16 ] |
|
Draining changelogs or deregistering the changelog isn't working. For some reason the changelog doesn't have a user. The user was cl1, and the logs of the robinhood server show it was processing using cl1. On the MDS it appears that there are no unprocessed changelog entries, but robinhood was running up until a few months ago, so there should be unprocessed changes stored:

    # cat /proc/fs/lustre/mdd/ls15-MDT0000/changelog_users
    current index: 164447373
    ID    index

From a client, as root (produces lots of output):

    # lfs changelog ls15-MDT0000 | head
    160211907 12LYOUT 01:50:56.131136452 2015.11.01 0x0 t=[0x200002cf8:0x12e9f:0x0]
    160211908 12LYOUT 01:50:56.131136452 2015.11.01 0x0 t=[0x200002d22:0x167e9:0x0]
    160211909 13TRUNC 01:50:56.132136455 2015.11.01 0xe t=[0x200002ca8:0x8a69:0x0]
    160211910 13TRUNC 01:50:56.132136455 2015.11.01 0xe t=[0x200002cb4:0x17a3d:0x0]
    160211911 11CLOSE 01:50:56.132136455 2015.11.01 0x42 t=[0x200002c37:0x8577:0x0]
    160211912 11CLOSE 01:50:56.133136458 2015.11.01 0x42 t=[0x200002ca8:0x8a69:0x0]
    160211913 11CLOSE 01:50:56.133136458 2015.11.01 0x42 t=[0x200002cb4:0x17a3d:0x0]

Trying to clear as root from a client:

    # lfs changelog_clear ls15-MDT0000 cl1 0
    changelog_clear error: No such file or directory

Trying to deregister from the MDS:

    [root@ls15-mds-00.i ~]# lctl --device ls15-MDT0000 changelog_deregister cl1
    error: changelog_deregister: No such file or directory
    [root@ls15-mds-00.i ~]# lctl --device ls15-MDT0000 changelog_deregister cl0
    error: changelog_deregister: expected id of the form cl<num> got 'cl0'
    deregister an existing changelog user
    usage: device <mdtname> changelog_deregister <id>
    run <command> after connecting to device <devno>
    --device <devno> <command [args ...]>

Logs from the robinhood server showing consumption of changelogs using reader_id 'cl1':

    ======== General statistics =========
    Daemon start time: 2016/07/28 18:48:59
    Started modules: log_reader
    ChangeLog reader #0:
        fs_name = ls15
        mdt_name = MDT0000
        reader_id = cl1
        records read = 4235467
        interesting records = 2823646
        suppressed records = 1411821
        records pending = 0
        last received = 2016/07/28 19:29:26
        last read record time = 2015/10/31 22:28:52.489794
        last read record id = 164447373
        last pushed record id = 164447370
        last committed record id = 164447370
        last cleared record id = 164447370
        read speed = 0.00 record/sec (0.00 incl. idle time)
        processing speed ratio = 0.00
    ChangeLog stats:
        MARK: 0, CREAT: 0, MKDIR: 0, HLINK: 0, SLINK: 0, MKNOD: 0, UNLNK: 0, RMDIR: 0, RENME: 0
        RNMTO: 0, OPEN: 0, CLOSE: 1411823, LYOUT: 1411822, TRUNC: 1411822, SATTR: 0, XATTR: 0
        HSM: 0, MTIME: 0, CTIME: 0, ATIME: 0 |
| Comment by Jeff Johnson (Inactive) [ 21/Nov/16 ] |
|
There are no snapshots in the MDT pool.

I was hoping to figure out how to see the changelog file or directory using zdb, but I can't seem to find which object ID it might be. With over 2T full there are lots of entries to try and poke at. Is there any sort of default object ID for the changelog file(s) or directory?

By 'dataset' are you referring to unmounting and remounting the Lustre server-side targets? Basically, take down and remount the server side of the filesystem 3-4 times? |
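For reference, read-only exploration with zdb looks roughly like this (object IDs are per-filesystem, so the second command's argument is a placeholder):

    # enumerate objects in the MDT dataset with sizes (output can be huge)
    zdb -dd ls15-mds-00.mdt/mdt | less
    # once a candidate object ID is found, dump its details
    zdb -ddddd ls15-mds-00.mdt/mdt <object-id>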
| Comment by Jeff Johnson (Inactive) [ 21/Nov/16 ] |
|
We threw hardware at it: expanded the MDT pool by adding a mirrored vdev, and the extra 320GB gave us room to move around and delete files. I'd still like to walk this ticket down so some best practices can be offered in the event a future occurrence doesn't have extra hardware at hand. |
| Comment by Andreas Dilger [ 21/Nov/16 ] |
|
We do try to reserve space in the MDT and OST zpools (OSD_STATFS_RESERVED_SIZE), but I suspect we are not taking this into account when allocating files on the MDT, only on the OST.

Separately, we need to look into how ChangeLogs are handled when the MDT is "full". The "unused ChangeLog is filling the MDT" problem seems to be happening a lot. I think we need to handle this in an automatic manner, by tracking how much space the ChangeLog consumes; if the MDT is too full, the oldest ChangeLog user that hasn't been used in some time (a week?) should be unregistered (with a clear LCONSOLE() error message printed) and records purged up to the next CL user. CL deregistration should be repeated in LRU order as needed until enough free space is available or no more unused CL users exist. It shouldn't automatically deregister recently active CL users (e.g. less than one day old), since that could be used as a DOS to deactivate filesystem monitoring tools.

A /proc tunable should be available to disable automatic CL user deregistration; when this is set, users would get ENOSPC instead of success when trying to modify the MDT. This should not be the default behaviour, however, and only used if it is more important to track every filesystem operation than it is to be able to use the filesystem. A manual version of this cleanup is sketched below. |
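Until that automation exists, the manual equivalent (using the user id cl1 from the robinhood logs above as the example) is:

    # list registered ChangeLog users and their consumption indexes
    cat /proc/fs/lustre/mdd/ls15-MDT0000/changelog_users
    # on the MDS, deregister a stale user, purging records only it still needed
    lctl --device ls15-MDT0000 changelog_deregister cl1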
| Comment by Peter Jones [ 22/Nov/16 ] |
|
Lai, could you please assist with this one? Thanks. Peter |
| Comment by Andreas Dilger [ 10/Jan/17 ] |
|
Two things need to be done here to handle this problem automatically, since this problem of ChangeLogs filling the MDT has happened several times:

1. Reserve some space on the MDT so that unlink/destroy operations can still proceed when the filesystem is otherwise full (a simple way to do this is described in the next comment).
2. Automatically deregister stale ChangeLog users and purge their records, as described in my earlier comment, so that an unconsumed ChangeLog cannot fill the MDT in the first place.
|
| Comment by Andreas Dilger [ 10/Jan/17 ] |
|
One option for a very simple short-term solution for the ZFS space reservation is to have the MDS or OSD startup check the size of and/or write a 10MB reserve file, which can be deleted to free space when the MDT becomes full and repopulated once the situation is resolved. However, it shouldn't delay too long in repopulating the file, to avoid the situation where there is some runaway user job that continues to fill the filesystem and it gets back into the same situation again immediately.

The benefit of this low-tech approach (vs. an in-memory reservation of space, and selectively blocking all but file/directory removal operations) is that this could be implemented quickly and potentially backported to existing releases with little risk. A rough sketch follows. |
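A minimal sketch of that approach, assuming the MDT dataset is temporarily mounted at /mnt/mdt (the path and file name are hypothetical; the real fix would manage this file inside the OSD):

    # pre-fill an emergency reserve file at setup time; /dev/urandom rather
    # than /dev/zero so ZFS compression cannot shrink the reserve to nothing
    dd if=/dev/urandom of=/mnt/mdt/.lustre_reserve bs=1M count=10
    # on ENOSPC, release the reserve so unlinks can proceed
    rm /mnt/mdt/.lustre_reserve
    # ... clean up files and/or stale ChangeLog users ...
    # then promptly repopulate the reserve
    dd if=/dev/urandom of=/mnt/mdt/.lustre_reserve bs=1M count=10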
| Comment by Gerrit Updater [ 27/Apr/17 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/26868 |
| Comment by Alex Zhuravlev [ 03/May/17 ] |
|
Andreas, probably there is another solution for the problem. Basically ZFS reserves some space internally (the slop space controlled by spa_slop_shift), and we can mark any transaction "net free" using dmu_tx_mark_netfree(), so the very first thing would be to mark transactions involving object destroy as net free. |
| Comment by Gerrit Updater [ 03/May/17 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/26930 |
| Comment by Alex Zhuravlev [ 04/May/17 ] |
|
The approach seems to work (in simple cases at least). Here is the test:

    test_803() {
        ...
    }

    == sanity test 803: OOS == 15:52:41 (1493902361)

Without the patch, rm fails. |
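For context, a hedged reproduction sketch (the client mount point and directory name are assumptions, not the actual test_803 body):

    # fill the MDT by creating empty files until create fails with ENOSPC
    mkdir -p /mnt/lustre/d803
    i=0
    while touch /mnt/lustre/d803/f$i 2>/dev/null; do
        i=$((i+1))
    done
    # with the netfree patch, unlink still succeeds on the full MDT
    rm /mnt/lustre/d803/f0 && echo "rm succeeded on full MDT"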
| Comment by Alex Zhuravlev [ 04/May/17 ] |
|
Unfortunately this capability was added in ZFS 0.7; it's not easily available on 0.6, though the majority of the required functionality is in place. |
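One way to check which side of that line a given installation falls on (the path assumes a DKMS-style source install, which is an assumption):

    # look for the dmu_tx_mark_netfree() declaration in the installed ZFS headers
    grep -rn 'dmu_tx_mark_netfree' /usr/src/zfs-*/include/sys/dmu.h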
| Comment by Andreas Dilger [ 04/May/17 ] |
|
I think the two approaches are complementary. We can use the reserved-space file for now, and use the "netfree" functionality when it is available.

The main question about "netfree" is whether it is actually true when we delete an inode on the MDT with ChangeLogs enabled: even if the dnode is deleted, it may not actually release space (due to shared dnode blocks), and the added ChangeLog record will consume space. As a result, even if the netfree functionality is available, I think it makes sense to keep the emergency space reservation file around. If we never need to delete it then that is fine too; the amount of space consumed is minimal. |
| Comment by Alex Zhuravlev [ 04/May/17 ] |
|
I think this is true for the "reserved with writes" approach as well: changelogs/destroy logs can be quite big, so even with that reserve released we'll keep consuming space? |
| Comment by Andreas Dilger [ 16/May/17 ] |
|
I think in the "reserved with writes" case, since the admin needs to get involved to delete the emergency file, they can hopefully also fix the source of the problem that is consuming all the free space (e.g. a stale ChangeLog consumer still registered). |
| Comment by Alex Zhuravlev [ 16/May/17 ] |
|
Well, I guess we could mark any transaction originating from root with the netfree flag when a special tunable is set? |
| Comment by Olaf Faaland [ 11/Jan/18 ] |
|
We've encountered this at LLNL, too. For the benefit of other sites that end up looking at this ticket and have Lustre versions without Alex's patches, I'm working up a procedure which I'll put on wiki.lustre.org at http://wiki.lustre.org/ZFS_MDT_ENOSPC_Recovery. It will work on any ZFS >= 0.6.5, using the spa_slop_shift parameter mentioned by Alex above. |
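The core of that recovery path, sketched with the usual hedging (verify pool health first, and revert afterwards): spa_slop_shift controls how much space ZFS holds back internally (roughly pool size / 2^spa_slop_shift, about 3.2% at the default of 5), so raising it temporarily exposes part of that reserve and lets deletions proceed:

    # check the current value (default 5)
    cat /sys/module/zfs/parameters/spa_slop_shift
    # raise it to shrink the internal reserve, freeing room for unlinks
    echo 6 > /sys/module/zfs/parameters/spa_slop_shift
    # ... delete files / deregister stale ChangeLog users ...
    # restore the default once space is recovered
    echo 5 > /sys/module/zfs/parameters/spa_slop_shift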
| Comment by Gerrit Updater [ 27/Feb/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26930/ |
| Comment by Peter Jones [ 27/Feb/18 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 27/Feb/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: https://review.whamcloud.com/31442 |
| Comment by Peter Jones [ 27/Feb/18 ] |
|
Reopening due to |
| Comment by Gerrit Updater [ 27/Feb/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31442/ |
| Comment by Gerrit Updater [ 27/Feb/18 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31443 |
| Comment by Gerrit Updater [ 27/Feb/18 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/31444 |
| Comment by Gerrit Updater [ 15/Mar/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31444/ |
| Comment by Peter Jones [ 15/Mar/18 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 23/Mar/18 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31751 |
| Comment by Gerrit Updater [ 03/May/18 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31751/ |