[LU-4825] lfs migrate not freeing space on OST Created: 26/Mar/14 Updated: 17/Jan/19 Resolved: 12/Aug/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | Lustre 2.9.0, Lustre 2.10.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Shawn Hall (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ldiskfs |
| Environment: | SLES 11 SP2 clients, CentOS 6.4 servers (DDN packaged) |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 13268 |
| Description |
|
We have some OSTs that we let get out of hand and that have reached 100% capacity. We have offlined them using "lctl --device <device_num> deactivate", along with others that are approaching capacity. Despite having users delete multi-terabyte files and using the lfs_migrate script (with two patches from another ticket applied), no space is being freed on these OSTs.

Our initial guess was that after the layout swap of the "lfs migrate", the old objects were not being deleted from disk because those OSTs were deactivated on the MDS. Therefore, for one OST I re-activated it on the MDS, unmounted it from the OSS, and ran "e2fsck -v -f -p /dev/...", and that seemed to free about 300 GB on the OST. I tried the same procedure on another OST and it did not change anything. The e2fsck output indicates that nothing "happened" in either case. This is a live, production file system, so after yanking two OSTs offline I thought I'd stop testing theories before too many users called. |
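For context, a minimal sketch of the deactivation step described above, assuming a hypothetical OST name lustre-OST0004 and a device number taken from the local "lctl dl" listing on the MDS:

client# lfs df -h                            # identify OSTs at or near 100% capacity
mds# lctl dl | grep OST0004                  # find the MDS-side device number for the full OST (hypothetical index)
mds# lctl --device <device_num> deactivate   # stop new object allocation on that OST; on 2.4+ this also blocks object destroys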
| Comments |
| Comment by Oleg Drokin [ 26/Mar/14 ] |
|
Lustre 2.4+ absolutely needs the OSTs to be connected for objects to be freed; unlike earlier versions, clients no longer even try to free unlinked objects themselves. Running e2fsck alone should not free anything, because it works only on the local device. So those 300 GB were probably freed either because of earlier issues, or because llog replay took care of the objects for you once the MDS actually reconnected to the OST and did a log replay (does 300 GB roughly match the expected freed space?). I expect that simply reactivating the OSTs and letting them reconnect should free the space soon after a successful reconnection, once the sync is complete. |
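A minimal sketch of the reactivation Oleg describes, with a hypothetical OST index and mount point; space should only be returned after the MDS reconnects and the queued destroys are processed:

mds# lctl dl | grep OST0004                  # look up the MDS-side device number again (hypothetical index)
mds# lctl --device <device_num> activate     # let the MDS reconnect so the pending destroys can be sent
client# lfs df -h /mnt/lustre                # usage on the OST should start dropping once the sync completes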
| Comment by Shawn Hall (Inactive) [ 17/Apr/14 ] |
|
Thanks Oleg. After re-enabling the OSTs with lctl, they eventually did gain free space. We didn't get fully in the clear though until we moved some data to another file system. My recommendation from this would be to add some information in the Lustre manual, probably under the "Handling Full OSTs" subchapter. The procedure described says to deactivate, lfs_migrate, and re-activate. Intuition would say that you'd see space freeing up as you lfs_migrate, not after you re-enable. You don't want to re-enable an OST if it's still full. Having a note in there about exactly when space will be freed on OSTs would help clear up any confusion. |
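For reference, a sketch of the manual-style deactivate/migrate/re-activate sequence discussed here, using a hypothetical filesystem name, OST index, and mount point (the patched lfs_migrate mentioned above may take different options):

mds# lctl --device <device_num> deactivate                                          # 1. stop new allocations on the full OST
client# lfs find /mnt/lustre --obd lustre-OST0004_UUID -type f | lfs_migrate -y     # 2. migrate files off that OST
mds# lctl --device <device_num> activate                                            # 3. re-activate; on 2.4+ the old objects are only destroyed after this step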
| Comment by Sean Brisbane [ 03/Oct/14 ] |
|
Dear All,

I have one OST that I am trying to decommission, and a couple of possibly related issues have come up only now that I have created a new file system at version 2.5.2. In my case, no objects are freed even after re-enabling all OSTs and waiting several hours.

1) Files are not being unlinked/deleted from OSTs after references to them are removed, if the file is on an inactive OST.

I deactivate the OST on the MDT to prevent further object allocation. On a 2.1 client, I run lfs_migrate, which uses rsync and in-place creation. I don't see the space usage on the inactive OST decrease or change at all, even by 1 byte. I also don't see the OSTs that are receiving the data get an increase in space usage. If I stop the migration process on the client and re-activate the OST on the MDT node, the space usage of the destination OSTs increases, but the files are still not deleted from the source OST*. When I realized this was happening, I stopped the migration process and did not restart it, so I should have an OST that still has some files on it but is not full. Instead, I have an OST with the same space usage as when I started:

atlas25-OST0079_UUID 14593315264 13507785912 355458364 97% /lustre/atlas25[OST:121]

I wonder if I am missing something, and "lctl --device N deactivate" is not the way to prevent new stripes being created on the OST (the v2.x manual still recommends deactivate-before-migrate)?

2) The lfs find output for atlas25-OST0079 (index 121), run on a Lustre 2.5 client in a directory apparently containing files from that OST. When this output was generated the OST was active on all Lustre servers and clients, and had been so for at least 30 minutes.

Cheers,

*This diagnostic may help explain the above problem. When deactivating an OST, I see in the MDS logs that the MDT node has been evicted by the OST:

Oct 2 21:19:33 pplxlustre25mds4 kernel: LustreError: 167-0: atlas25-OST0079-osc-MDT0000: This client was evicted by atlas25-OST0079; in progress operations using this service will fail.

along with other messages such as: (I'm assuming here that the messages for the OST0079 that I focus on in this email are being skipped) |
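One way to check whether a particular file still has objects on the OST in question is lfs getstripe; the path below is hypothetical:

client# lfs getstripe /lustre/atlas25/path/to/file
# the obdidx column lists the OST indices holding the file's objects (index 121 here corresponds to atlas25-OST0079)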
| Comment by Andreas Dilger [ 09/Jul/15 ] |
|
One problem here is that the documented procedure for migrating objects off an OST is to run "lctl --device XXX deactivate" on the MDS for the OST(s), but this disconnects the MDS from the OST entirely and disables RPC sending at a low level (the RPC layer). It isn't necessarily practical to special-case that code to let only OST_DESTROY RPCs through from the MDS, since the MDS doesn't even know whether the OST is alive or dead at that point.

It seems we need a different method to disable only MDS object creation on the specified OST(s), ideally one that would also work on older versions of Lustre (possibly osp.*.max_precreated=0, osp.*.max_create_count=0, or similar), then update the documentation to reflect the new command for newer versions of Lustre, and possibly backport it to the affected older releases (2.5/2.7).

The other, less preferable, option is to change the meaning of "active=0" so that it only quiesces an OSP connection rather than disconnecting it completely, and then conditionally allow OST_DESTROY RPCs through if the OST is connected but marked active=0; that may cause other problems, however. |
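As a sketch of what the proposed tunable-based approach might look like (this was not the documented procedure at the time of this comment), with a hypothetical filesystem name and OST index:

mds# lctl get_param osp.lustre-OST0004-osc-MDT0000.max_create_count     # record the current value so it can be restored later
mds# lctl set_param osp.lustre-OST0004-osc-MDT0000.max_create_count=0   # stop assigning new objects on this OST while leaving the connection (and object destroys) intact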
| Comment by Kurt J. Strosahl (Inactive) [ 10/Jul/15 ] |
|
Hello, I'm observing a similar occurrence on some of my systems as well. Earlier in the week three of my OSTs reached 97%, so I set them to read-only using "lctl --device <device no> deactivate". Yesterday I was able to add some new OSTs to the system, so I started an lfs_migrate on one of the full OSTs. I was aware that the system wouldn't update the space usage on the OST while it remained read-only, so this morning I set it back to active using "lctl --device <device number> activate". The OSS reported that it was deleting orphan objects, but the space usage didn't go down; after an hour the OST had more data on it than when it was in read-only mode, so I deactivated it again. |
| Comment by Andreas Dilger [ 17/Aug/15 ] |
|
One option that works on a variety of different Lustre versions is to mark an OST as degraded: lctl set_param obdfilter.{OST_name}.degraded=1
This means the MDS will skip the degraded OST(s) during most allocations, but will not skip them if someone requests a widely striped file and there are not enough non-degraded OSTs to fill the request.

I think we need to allow setting osp.*.max_create_count=0 to tell the MDS to skip object precreation on the OST(s), instead of using the old "lctl --device * deactivate" method, so that the MDS can still destroy OST objects for unlinked files. While it appears possible to set max_create_count=0 today, the MDS still tries to create objects on that OST if it is specified via "lfs setstripe -i <idx>", and it waits for a timeout (100s) trying to create files there before moving to the next OST (at <idx + 1>). If max_create_count == 0 then the LOD/OSP should skip this OSP immediately instead of waiting for a full timeout. |
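A possible end-to-end sequence using the degraded flag, as a sketch only; the OST name, mount point, and lfs_migrate options are assumptions:

oss# lctl set_param obdfilter.lustre-OST0004.degraded=1                             # MDS now avoids this OST for most new allocations
client# lfs find /mnt/lustre --obd lustre-OST0004_UUID -type f | lfs_migrate -y     # drain files from the OST
oss# lctl set_param obdfilter.lustre-OST0004.degraded=0                             # restore normal allocation once space has been recovered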
| Comment by Joseph Gmitter (Inactive) [ 17/Aug/15 ] |
|
Hi Lai, |
| Comment by Gerrit Updater [ 20/Aug/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/16032 |
| Comment by Gerrit Updater [ 27/Aug/15 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/16105 |
| Comment by Jian Yu [ 03/Sep/15 ] |
|
I created |
| Comment by Andreas Dilger [ 12/Nov/15 ] |
|
As a temporary workaround on older Lustre versions, before http://review.whamcloud.com/16105 has landed, it is also possible to run:

oss# lctl set_param fail_loc=0x229 fail_val=<ost_index>

on the OSS where the OST to be deactivated is located. This blocks all creates on the specified OST index. It only allows blocking creates on a single OST per OSS at one time (by simulating running out of inodes in the OST_STATFS RPC sent to the MDS), but it avoids the drawbacks of completely deactivating the OST on the MDS (namely that OST objects are not destroyed on deactivated OSTs). It will generate some console spew ("*** cfs_fail_loc=0x229, val=<ost_index>***" every few seconds) and makes the "lfs df -i" output for this OST incorrect (it will report all inodes in use), but it is a workaround after all. |
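A sketch of applying and then clearing this workaround; the OST index value is an assumption for illustration:

oss# lctl set_param fail_loc=0x229 fail_val=4   # block creates on OST index 4 by simulating "no free inodes" in OST_STATFS
(run lfs_migrate from a client as usual)
oss# lctl set_param fail_loc=0                  # clear the failure injection once the migration is finished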
| Comment by Gerrit Updater [ 12/Jan/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16032/ |
| Comment by Gerrit Updater [ 13/May/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/20163 |
| Comment by Gerrit Updater [ 27/Jul/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16105/ |
| Comment by Peter Jones [ 12/Aug/16 ] |
|
The main fix has landed for 2.9. I suggest moving Andreas's cleanup patch to be tracked under a different JIRA ticket reference. |
| Comment by Gerrit Updater [ 28/Feb/17 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/25661 |
| Comment by Gerrit Updater [ 14/Mar/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/20163/ |
| Comment by Gerrit Updater [ 13/Jun/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25661/ |