Data-on-MDT phase II (LU-10176)

[LU-11421] DoM: manual migration OST-MDT, MDT-MDT Created: 24/Sep/18  Updated: 27/Apr/22  Resolved: 16/Sep/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0

Type: Technical task Priority: Major
Reporter: Andreas Dilger Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: DoM2

Issue Links:
Cloners
Clones LU-10177 DoM: manual migration MDT-OST Resolved
Related
is related to LU-3285 Data on MDT Resolved
is related to LU-10258 lfs mirror command to read/write spec... Resolved
is related to LU-10176 Data-on-MDT phase II Open
is related to LU-15794 Downgrade client fails: LustreError: ... Open
is related to LU-10910 LBUG with "lfs migrate -c 1 <domfile>" Resolved
is related to LU-12935 MDT deadlock on 2.12.3 with DoM; is i... Resolved
is related to LU-15219 DoM: lfs migrate doesn't work as expe... Resolved
is related to LU-10112 FLR: Support DoM component Open
is related to LU-10995 DoM2: allow MDT-only filesystems Open
is related to LU-13612 efficient DoM->OST component migration Resolved

 Description   

Support migration of DoM files with the lfs command. This does not work out of the box for Data-on-MDT files because changing the layout is not enough; the data must be moved as well.

OST-to-MDT and MDT-to-MDT migrations are to be supported. Note that MDT-MDT migration might just be "cp + rename", since the result will be the same.



 Comments   
Comment by Gerrit Updater [ 28/Jun/19 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35359
Subject: LU-11421 dom: manual OST-to-DOM migration via mirroring
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 95a57898961152c2d8dc6dc6e94399b0baffa81f

Comment by Mikhail Pershin [ 30/Jul/19 ]

Adding some explanation here of how OST-DOM migration works:

OST layout + data on OSTs -> DOM layout + data on MDT

  • there is no sense in creating a volatile file, because its data would be tied to its new MDT inode, whereas we want to keep the data with the old inode. We need to read the data from the OST stripes and write it to the inode blocks on the MDT (as for a DOM file)
  • add a new mirror with a DOM component via 'mirror extend'; the DOM component will be 'stale' after that
  • run 'mirror resync' to fill the stale DOM component with data
  • remove the old layout from the mirrored file with 'mirror split'
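
As an illustration only (not taken from the patch), the three mirror steps above can be sketched as a dry run that builds the lfs command lines without executing them. The helper name `ost_to_dom_commands`, the 1M DoM size, the single-stripe OST tail and the assumption that the original layout ends up as mirror ID 1 are all hypothetical; check the real mirror IDs with lfs getstripe before splitting anything.

```python
import shlex
from typing import List


def ost_to_dom_commands(path: str, dom_size: str = "1M",
                        old_mirror_id: int = 1) -> List[List[str]]:
    """Build (but do not run) the lfs commands for the three mirror steps.

    Hypothetical helper: dom_size and old_mirror_id are assumptions that
    must be adapted to the actual file.
    """
    return [
        # Step 1: add a mirror whose first component is on the MDT.
        # The new mirror is stale until it is resynced.
        ["lfs", "mirror", "extend", "-N",
         "-E", dom_size, "--layout", "mdt",
         "-E", "eof", "-c", "1", path],
        # Step 2: copy the data from the OST objects into the stale mirror.
        ["lfs", "mirror", "resync", path],
        # Step 3: split off the old OST-only mirror and destroy it.
        ["lfs", "mirror", "split",
         "--mirror-id", str(old_mirror_id), "-d", path],
    ]


# Print the commands for review instead of executing them.
for argv in ost_to_dom_commands("/mnt/lustre/somefile"):
    print(shlex.join(argv))
```

Printing the commands rather than running them keeps the sketch safe to try on a machine without Lustre; the output can be reviewed and pasted into a shell on a client.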

It is possible to have a DOM component with the same size in different mirrors; it is not 'mirrored' in that case, but since we assume the MDT inode always exists, that is not a problem, I think. Technically the sizes can even differ: the MDT inode would store the largest DoM stripe in that case, but more work would be needed to return the MDT stripe size to the client correctly. Currently it is just the inode size, but it should be limited by the DOM component size of the chosen mirror layout.

Comment by Mikhail Pershin [ 30/Jul/19 ]

Another thing to think about - what sort of OST-striped files should NOT be migrated to DOM files:

  • files whose size already exceeds the proposed DOM component size - I can't imagine a useful reason for such a migration, since there is no benefit from the MDT stripe
  • already mirrored files - a new DOM layout should be added manually as a new mirror in that case, not via lfs migrate
  • not sure about PFL files: if a file was created as PFL, it is expected to grow beyond its current size, so there is also no sense in moving it to the MDT; on the other hand, excluding such files would mean that only plain-layout files can be migrated to the MDT via lfs migrate
Comment by Andreas Dilger [ 30/Jul/19 ]

Adding some explanation here of how OST-DOM migration works:

This explanation should all be included in the patch commit message.

Note that we can't exclude PFL files just on principle, because a filesystem may have a default PFL layout (maybe before the MDT has enough space for DoM, then new large MDTs are added to the filesystem) so all files are PFL. In general, while it is good to have smart behavior by default if no other input is given, I think the kernel should try to avoid overriding decisions made by userspace.

Comment by Mikhail Pershin [ 31/Jul/19 ]

Yes, I think that PFL files should be processed like all others; if the file size is bigger than the DOM component size of the new layout, then lfs migrate should exit with a warning about that and propose using the -f parameter if the user really wants to do that.

Comment by Andreas Dilger [ 31/Jul/19 ]
Yes, I think that PFL files should be processed like all others; if the file size is bigger than the DOM component size of the new layout, then lfs migrate should exit with a warning about that and propose using the -f parameter if the user really wants to do that.

Well, it isn't clear that there is a need/benefit to return an error when the user asks for this. There are reasons for having a DoM component at the start of a file, e.g. for files that have an embedded index/icon/header at the start, so my first choice would be to allow what the user asked for instead of trying to second-guess their request. That said, if the user has not requested a DoM component (e.g. generic "lfs migrate" command) then I'm perfectly happy to drop the DoM component, and PFL in general, and use a plain layout with the stripe_count, stripe_size, and pool from the last instantiated component of the file.
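
To make the fallback concrete, here is a minimal sketch of picking a plain layout from the last instantiated component. The `Component` and `PlainLayout` types and the function name are hypothetical models, not Lustre data structures, and the case where the DoM component is the only instantiated one is not addressed in this thread, so the sketch does not treat it specially.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Component:
    """Hypothetical model of one PFL component."""
    pattern: str            # e.g. "mdt" or "raid0" (labels are assumptions)
    stripe_count: int
    stripe_size: int        # bytes
    pool: Optional[str]
    instantiated: bool


@dataclass
class PlainLayout:
    """Hypothetical model of a plain (non-composite) layout."""
    stripe_count: int
    stripe_size: int
    pool: Optional[str]


def default_migrate_layout(components: List[Component]) -> Optional[PlainLayout]:
    """Copy stripe_count, stripe_size and pool from the last instantiated
    component; return None if no component is instantiated."""
    for comp in reversed(components):
        if comp.instantiated:
            return PlainLayout(comp.stripe_count, comp.stripe_size, comp.pool)
    return None
```

The function only models the choice of parameters; it says nothing about how the data itself would be moved.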

As an aside, I generally dislike using "-f" as "force some action", since often the "-f" argument gets overloaded to mean "force common action A" and also "force dangerous action B", and people just get used to adding "-f" to all of their commands, which can lead to bad things happening (e.g. "rm -rf / *"). It would be better to have a long option like "--force-mdt", but in general I think if the user already asks for a DoM component via "-E 1M --layout mdt ..." then that is enough.

Comment by Andreas Dilger [ 31/Jul/19 ]

Is it also possible to do DoM-to-OST mirroring to drop just the mdt component from a large file? That would essentially need to write the DoM data to the first OST object (second component) in the background, and then add a new xattr operation to drop the mdt component and change the start of the second component to offset 0.
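
The layout side of that idea can be sketched as follows. This models only the extent change (drop the mdt component, restart the next one at offset 0); the `Extent` type and `drop_mdt_component` are hypothetical, and the sketch assumes the DoM data has already been written to the first OST object.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Extent:
    """Hypothetical model of a component extent; end=None means EOF."""
    pattern: str
    start: int
    end: Optional[int]


def drop_mdt_component(components: List[Extent]) -> List[Extent]:
    """Return a layout without the leading mdt component, with the
    following component extended down to offset 0."""
    if len(components) < 2 or components[0].pattern != "mdt":
        raise ValueError("expected an mdt component followed by another one")
    first, second, *rest = components
    if second.start != first.end:
        raise ValueError("second component must start where the mdt one ends")
    # Only the layout changes here; copying the data is a separate step.
    return [replace(second, start=0), *rest]
```

For example, a layout of [mdt 0..1M, raid0 1M..EOF] becomes [raid0 0..EOF].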

Comment by Mikhail Pershin [ 31/Jul/19 ]

Andreas, yes, I tend to agree. Although several DOM optimisations are lost for files with both DOM and OST objects instantiated, some benefits remain, e.g. small random accesses move from the OST to the MDT, and the MDT can generally have faster storage. So it is even simpler for me not to add extra checks to lfs migrate.

As for DOM component removal, I think it has similar benefits for any PFL file: as the file grows, the first, smaller component could be merged into the next one and dropped. Though I am not sure whether we have such a feature now.

Meanwhile, mirroring also makes it possible to increase the DoM component size of DoM files, which is not possible via layout swap.

Comment by Gerrit Updater [ 16/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35359/
Subject: LU-11421 dom: manual OST-to-DOM migration via mirroring
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 44a721b8c10631b52f9ee2fbac1eee8cb775d148

Comment by Peter Jones [ 16/Sep/19 ]

Landed for 2.13

Comment by Stephane Thiell [ 10/Mar/20 ]

Is it planned to backport this patch to b2_12? I'm asking because we have MDTs that are almost full in terms of inodes (mainly due to the DoM ldiskfs space requirement, we ran out of inodes even though each MDT is 18TB). Many DoM files remain on these full MDTs, so we cannot easily migrate directories to other MDTs (LU-13298).
