[LU-17148] enhance directory migration robustness Created: 27/Sep/23  Updated: 02/Dec/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Lai Siyao Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17307 osd_dirent_count() keeps multiple thr... Resolved
Sub-Tasks:
Key | Summary | Type | Status | Assignee
LU-17162 | assign transno to server started tran... | Technical task | Open | Lai Siyao
LU-17163 | save local locks in server start tran... | Technical task | Open | Lai Siyao
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Directory migration involves multiple MDTs, and a server failure during migration can cause files to go missing. This is an inherent fragility of the design, not a fault in the implementation:
1. Directory migration doesn't support replay of client requests, because all xattrs of the source file would need to be packed in the reply and then saved in the request for replay, and the current request format doesn't support this.
2. If inter-MDT recovery is aborted, files may get lost because of 1.
3. Even if client replay were supported, files could still get lost if server recovery is aborted.

To enhance directory migration robustness, we need to introduce redundancy for migrated files:

  • Retain the source files until migration is finished, including both the source dirents and inodes, which may be located on different MDTs. NB the retained files should be invisible to clients, otherwise accessing the migrating directory may fail.
  • Verify the target files against the source files before finishing migration (i.e. before updating the target directory layout). Any extra source file must have been migrated already, with its target copy lost due to a target MDT failure, so the MDT should migrate it again.
  • Make sure the target files are committed to disk before destroying the retained source files.
  • This is complicated, so handle it in the MDD layer rather than in LFSCK: MDD code has a better understanding of migration, and this is also friendlier to users, since the recovery is invisible to them and needs no extra lfsck command.


 Comments   
Comment by Andreas Dilger [ 27/Sep/23 ]

Potentially, such an operation to create a mirrored MDT object could be used as the starting point for implementing full MDT mirroring - Lustre Metadata Redundancy (LU-12310, https://jira.whamcloud.com/secure/attachment/43434/Lustre_Metadata_Redundancy-202112.pptx).

Generated at Sat Feb 10 03:33:02 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.