[LU-4825] lfs migrate not freeing space on OST Created: 26/Mar/14  Updated: 17/Jan/19  Resolved: 12/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: Lustre 2.9.0, Lustre 2.10.0

Type: Bug Priority: Major
Reporter: Shawn Hall (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: ldiskfs
Environment:

SLES 11 SP2 clients, CentOS 6.4 servers (DDN packaged)


Issue Links:
Duplicate
Related
is related to LU-7012 files not being deleted from OST afte... Resolved
is related to LU-5931 Deactivated OST still contains data Resolved
is related to LUDOC-305 "lctl deactivate/activate" does not w... Resolved
is related to LU-11115 OST selection algorithm broken with m... Resolved
is related to LU-4295 removing files on deactivated OST doe... Resolved
is related to LU-11605 create_count stuck in 0 after changei... Resolved
is related to LU-8523 sanity test_311: objs not destroyed a... Resolved
Severity: 3
Rank (Obsolete): 13268

 Description   

We have some OSTs that we let get out of hand, and they have reached 100% capacity. We have offlined them using "lctl --device <device_num> deactivate", along with others that are approaching capacity. Despite having users delete multi-terabyte files, and despite using the lfs_migrate script (with two patches from LU-4293 included to allow it to use "lfs migrate" as root instead of rsync) to migrate over 100 TB of data with the full OSTs deactivated, we are not freeing up any space on the OSTs.
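
For reference, the sequence we are using looks roughly like this (the device number, OST index, and mount point below are placeholders, not our exact values):

mds# lctl dl | grep osc                        # find the device number of the full OST's osc/osp device
mds# lctl --device <device_num> deactivate
client# lfs find --obd <fsname>-OST0025_UUID /mnt/lustre | lfs_migrate -y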

Our initial guess was that, after the layout swap done by "lfs migrate", the old objects were not being deleted from disk because those OSTs were deactivated on the MDS. So for one OST I re-activated it on the MDS, unmounted it from the OSS, and ran "e2fsck -v -f -p /dev/...", which seemed to free about 300 GB on that OST. I tried the same procedure on another OST and it did not change anything. The e2fsck output indicates that nothing "happened" in either case.

This is a live, production file system, so after yanking two OSTs offline I thought I'd stop testing theories before too many users called.



 Comments   
Comment by Oleg Drokin [ 26/Mar/14 ]

2.4+ versions absolutely need the OSTs to be connected to the MDS for objects to be freed. Unlike earlier versions, clients don't even try to free unlinked objects anymore.

An e2fsck by itself should not free anything because it works on the local device only. So those 300 GB freed were probably related to earlier issues, or it was llog replay that took care of the objects for you once the MDS really reconnected to the OST and replayed the log (does 300 GB roughly match the expected freed space?).
Could it be that the other OST did not have anything to unlink?

I expect that just reactivating the OSTs and letting the MDS reconnect to them should free the space soon after a successful reconnection, once the sync is complete.
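
Roughly (the device number as shown by "lctl dl" on the MDS; the values here are placeholders):

mds# lctl --device <device_num> activate
client# lfs df /mnt/lustre                     # watch the OST free space recover once orphan cleanup completes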

Comment by Shawn Hall (Inactive) [ 17/Apr/14 ]

Thanks Oleg. After re-enabling the OSTs with lctl, they eventually did gain free space. We didn't get fully in the clear though until we moved some data to another file system.

My recommendation from this would be to add some information in the Lustre manual, probably under the "Handling Full OSTs" subchapter. The procedure described says to deactivate, lfs_migrate, and re-activate. Intuition would say that you'd see space freeing up as you lfs_migrate, not after you re-enable. You don't want to re-enable an OST if it's still full. Having a note in there about exactly when space will be freed on OSTs would help clear up any confusion.

Comment by Sean Brisbane [ 03/Oct/14 ]

Dear All,

I have one OST that I am trying to decommission, and I have a couple of possibly related issues that have come up only now that I have created a new file system at version 2.5.2. In my case, no objects are freed even after re-enabling all OSTs and waiting several hours.

1) Files are not being unlinked/deleted from OSTs after references to them are removed, if the file is on an inactive OST.
2) lfs find by UUID does not work for some OSTs - apparently only those that have been deactivated and reactivated on the MDT. Find by index works, so I can work around this latter issue.

I deactivate the OST on the MDT to prevent further object allocation:

  1. lctl --device 17 deactivate
  2. grep atlas25-OST0079_UUID /proc/fs/lustre/lov/atlas25-MDT0000-mdtlov/target_obd
    121: atlas25-OST0079_UUID INACTIVE

On a 2.1 client, I run lfs_migrate, which uses rsync and in-place re-creation. I don't see the space usage on the inactive OST decrease or change at all, even by 1 byte. I also don't see the OSTs that are receiving the data get an increase in space usage. If I stop the migration process on the client and re-activate the OST on the MDT node, the space usage of the destination OSTs increases, but the files are still not deleted from the OST*. When I realized this was happening, I stopped the migration process and have not restarted it. Therefore I should have an OST that still has some files on it, but is not full. I actually have an OST with the same space usage as when I started.

atlas25-OST0079_UUID 14593315264 13507785912 355458364 97% /lustre/atlas25[OST:121]

I wonder if I am missing something, and "lctl --device N deactivate" is not the way to prevent new stripes being created on the OST (the manual for v2.X still recommends deactivate-before-migrate)?

2) Now the lfs find issue for atlas25-OST0079 (index 121), on a Lustre 2.5 client, in a directory apparently containing files from that OST. When this output was generated the OST was active on all Lustre servers and clients, and had been so for at least 30 minutes.

  1. lfs find -ost atlas25-OST0079 .
    (no output)
  2. lfs find -ost 121 .
    ./mcatnl_herpp_ggH.17.2.11.3.root.220
    ./mcatnl_herpp_ggH.17.2.11.3.root.203
  3. lfs getstripe ./mcatnl_herpp_ggH.17.2.11.3.root.220
    ./mcatnl_herpp_ggH.17.2.11.3.root.220
    lmm_stripe_count: 1
    lmm_stripe_size: 1048576
    lmm_pattern: 1
    lmm_layout_gen: 0
    lmm_stripe_offset: 121
    obdidx objid objid group
    121 542840 0x84878 0

Cheers,
Sean

*This diagnostic may help with explaining the above problem.

When deactivating an OST, I see the following in the MDS logs, indicating that the MDT node has been evicted by the OST:

Oct 2 21:19:33 pplxlustre25mds4 kernel: LustreError: 167-0: atlas25-OST0079-osc-MDT0000: This client was evicted by atlas25-OST0079; in progress operations using this service will fail.

With other messages such as:
Oct 3 06:04:57 pplxlustre25mds4 kernel: LustreError: 2246:0:(osp_precreate.c:464:osp_precreate_send()) atlas25-OST0047-osc-MDT0000: can't precreate: rc = -28
Oct 3 06:24:57 pplxlustre25mds4 kernel: LustreError: 2252:0:(osp_precreate.c:464:osp_precreate_send()) Skipped 239 previous similar messages

(I'm assuming here that the messages for OST0079, the one I focus on in this email, are being skipped.)

Comment by Andreas Dilger [ 09/Jul/15 ]

One problem here is that the documented procedure for migrating objects off of an OST is to use "lctl --device XXX deactivate" on the MDS for the OST(s). However, this disconnects the MDS from the OST entirely and disables RPC sending at a low level in the code (the RPC layer), so it isn't really practical to special-case that code to allow only OST_DESTROY RPCs through from the MDS, since the MDS doesn't even know whether the OST is alive or dead at that point.

It seems we need a different method to disable only MDS object creation on the specified OST(s), ideally one that would also work on older versions of Lustre (possibly osp.*.max_precreated=0 or osp.*.max_create_count=0 or similar), and then update the documentation to reflect this new command for newer versions of Lustre, and possibly backport this to the older releases that are affected (2.5/2.7). The other option, which is less preferable, is to change the meaning of "active=0" so that it only quiesces an OSP connection rather than disconnecting it completely, and then conditionally allows OST_DESTROY RPCs through if the OST is connected but just marked active=0; however, that may cause other problems.

Comment by Kurt J. Strosahl (Inactive) [ 10/Jul/15 ]

Hello,

I'm observing a similar occurrence on some of my systems as well. Earlier in the week three of my OSTs reached 97%, so I set them to read-only using "lctl --device <device no> deactivate". Yesterday I was able to add some new OSTs to the system, so I started an lfs_migrate on one of the full OSTs. I was aware that the system wouldn't update the space usage on the OST while it remained read-only, so this morning I set it back to active using "lctl --device <device number> activate". The OSS reported that it was deleting orphan objects, but the space usage didn't go down... after an hour the OST had more data on it than when it was in read-only mode, so I deactivated it again.

Comment by Andreas Dilger [ 17/Aug/15 ]

One option that works on a variety of different Lustre versions is to mark an OST as degraded:

lctl set_param obdfilter.{OST_name}.degraded=1

This means that the MDS will skip the degraded OST(s) during most allocations, but it will not skip them if someone requests a widely striped file and there are not enough non-degraded OSTs to fill the request.
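
Once the migration is finished, the flag can presumably be cleared the same way:

lctl set_param obdfilter.{OST_name}.degraded=0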

I think we need to allow setting osp.*.max_create_count=0 to inform the MDS to skip object precreation on the OST(s), instead of using the old lctl --device * deactivate method, so that the MDS can still destroy OST objects for unlinked files. While it appears possible to set max_create_count=0 today, the MDS still tries to create objects on that OST if specified via lfs setstripe -i <idx> and it waits for a timeout (100s) trying to create files there before moving to the next OST (at <idx + 1>).

If max_create_count==0 then the LOD/OSP should skip this OSP immediately instead of waiting for a full timeout.
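
As a sketch of the intended usage (the OSP device name below is illustrative; the old value should be saved and restored rather than assuming a default):

mds# lctl get_param osp.<fsname>-OST0025-osc-MDT0000.max_create_count   # note the current value
mds# lctl set_param osp.<fsname>-OST0025-osc-MDT0000.max_create_count=0
... migrate and/or delete files ...
mds# lctl set_param osp.<fsname>-OST0025-osc-MDT0000.max_create_count=<saved_value>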

Comment by Joseph Gmitter (Inactive) [ 17/Aug/15 ]

Hi Lai,
Can you take a look at this? Please see Andreas' last comment.
Thanks.
Joe

Comment by Gerrit Updater [ 20/Aug/15 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/16032
Subject: LU-4825 osp: rename variables to match /proc entry
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d237665954c2cd3dde39f58b3171f3293676d5a3

Comment by Gerrit Updater [ 27/Aug/15 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/16105
Subject: LU-4825 osp: check max_create_count before use OSP
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 07e2c5d77cd33d3c24c283714b71ea6b7426cac7

Comment by Jian Yu [ 03/Sep/15 ]

I created LUDOC-305 to track the Lustre documentation change.

Comment by Andreas Dilger [ 12/Nov/15 ]

As a temporary workaround on older Lustre versions before http://review.whamcloud.com/16105 is landed, it is also possible to use:

oss# lctl set_param fail_loc=0x229 fail_val=<ost_index>

on the OSS where the OST to be deactivated is located. This will block all creates on the specified OST index.

This only allows blocking creates on a single OST per OSS at one time (by simulating running out of inodes in the OST_STATFS RPC sent to the MDS), but it avoids the drawbacks of completely deactivating the OST on the MDS (namely that OST objects are not destroyed on deactivated OSTs). This will generate some console spew ("*** cfs_fail_loc=0x229, val=<ost_index>***" every few seconds) and causes the "lfs df -i" output for this OST to be incorrect (it will report all inodes in use), but it is a workaround after all.
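
Presumably, once migration is complete, the fault injection can be cleared on the same OSS with:

oss# lctl set_param fail_loc=0 fail_val=0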

Comment by Gerrit Updater [ 12/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16032/
Subject: LU-4825 osp: rename variables to match /proc entry
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 05bf10903eba13db3d152f2725de56243123e7c5

Comment by Gerrit Updater [ 13/May/16 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/20163
Subject: LU-4825 ofd: fix OBD_FAIL_OST_ENOINO/ENOSPC behaviour
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c1e339c769e4b6fc26aefc8e7ffc7f8421dc047d

Comment by Gerrit Updater [ 27/Jul/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16105/
Subject: LU-4825 osp: check max_create_count before use OSP
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: aa1a240338d18201f1047db62b31603e2cffcfe3

Comment by Peter Jones [ 12/Aug/16 ]

The main fix has landed for 2.9. I suggest moving Andreas's cleanup patch to be tracked under a different JIRA ticket reference.

Comment by Gerrit Updater [ 28/Feb/17 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/25661
Subject: LU-4825 utils: improve lfs_migrate usage message
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 081995843296f51829a9cd2bf7ae4eb9442df679

Comment by Gerrit Updater [ 14/Mar/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/20163/
Subject: LU-4825 ofd: fix OBD_FAIL_OST_ENOINO/ENOSPC behaviour
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 659c81ca4bfbbc536260ff15bb31da84d9366791

Comment by Gerrit Updater [ 13/Jun/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25661/
Subject: LU-4825 utils: improve lfs_migrate usage message
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ed8a63c9b83ce9f64df19a15ec362e1edb04a6f4
