[LU-14470] striped directory layout mismatch after failover Created: 23/Feb/21  Updated: 28/Sep/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Andriy Skulysh Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16336 LFSCK should fix inconsistencies caus... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
[15965.280047] LustreError: 23882:0:(llite_lib.c:1442:ll_update_lsm_md()) lustre: [0x200008107:0x10653:0x0] dir layout mismatch:
[15965.283219] LustreError: 23882:0:(lustre_lmv.h:99:lsm_md_dump()) magic 0xcd20cd0 stripe count 2 master mdt 0 hash type 0x2 version 0 migrate offset 0 migrate hash 0x0 pool 
[15965.287312] LustreError: 23882:0:(lustre_lmv.h:103:lsm_md_dump()) stripe[0] [0x20000dec0:0x7:0x0]
[15965.289569] LustreError: 23882:0:(lustre_lmv.h:103:lsm_md_dump()) stripe[1] [0x24000e690:0x7:0x0]
[15965.291807] LustreError: 23882:0:(lustre_lmv.h:99:lsm_md_dump()) magic 0xcd20cd0 stripe count 2 master mdt 0 hash type 0x2 version 0 migrate offset 0 migrate hash 0x0 pool 
[15965.295841] LustreError: 23882:0:(lustre_lmv.h:103:lsm_md_dump()) stripe[0] [0x20000e690:0x3:0x0]
[15965.298063] LustreError: 23882:0:(lustre_lmv.h:103:lsm_md_dump()) stripe[1] [0x24000ee60:0x1:0x0]
[15965.310206] LustreError: 23884:0:(lustre_lmv.h:99:lsm_md_dump()) magic 0xcd20cd0 stripe count 2 master mdt 0 hash type 0x2 version 0 migrate offset 0 migrate hash 0x0 pool 
[15965.314355] LustreError: 23884:0:(lustre_lmv.h:103:lsm_md_dump()) stripe[0] [0x20000dec0:0x7:0x0]
[15965.316652] LustreError: 23884:0:(lustre_lmv.h:103:lsm_md_dump()) stripe[1] [0x24000e690:0x7:0x0]
[15965.318881] LustreError: 23884:0:(lustre_lmv.h:99:lsm_md_dump()) magic 0xcd20cd0 stripe count 2 master mdt 0 hash type 0x2 version 0 migrate offset 0 migrate hash 0x0 pool 
[15965.322888] LustreError: 23884:0:(lustre_lmv.h:103:lsm_md_dump()) stripe[0] [0x20000e690:0x3:0x0]
[15965.325121] LustreError: 23884:0:(lustre_lmv.h:103:lsm_md_dump()) stripe[1] [0x24000ee60:0x1:0x0]
[15965.340329] LustreError: 23886:0:(lustre_lmv.h:99:lsm_md_dump()) magic 0xcd20cd0 stripe count 2 master mdt 0 hash type 0x2 version 0 migrate offset 0 migrate hash 0x0 pool 
[15965.344411] LustreError: 23886:0:(lustre_lmv.h:103:lsm_md_dump()) stripe[0] [0x20000dec0:0x7:0x0]
[15965.346655] LustreError: 23886:0:(lustre_lmv.h:103:lsm_md_dump()) stripe[1] [0x24000e690:0x7:0x0]
[15965.348866] LustreError: 23886:0:(lustre_lmv.h:99:lsm_md_dump()) magic 0xcd20cd0 stripe count 2 master mdt 0 hash type 0x2 version 0 migrate offset 0 migrate hash 0x0 pool 
[15965.352827] LustreError: 23886:0:(lustre_lmv.h:103:lsm_md_dump()) stripe[0] [0x20000e690:0x3:0x0]
[15965.355133] LustreError: 23886:0:(lustre_lmv.h:103:lsm_md_dump()) stripe[1] [0x24000ee60:0x1:0x0]
[15965.357439] LustreError: 23886:0:(llite_lib.c:2471:ll_prep_inode()) new_inode -fatal: rc -22

The create request is replayed, but the MDS creates the striped directory shards with new FIDs, so the client fails the layout check.
It can be reproduced by recovery-mds-scale or by a custom test case, which I'll attach later.

This looks like a design flaw to me. The client should replay the create request with the previously allocated FIDs, and the MDS should recreate the directory shards using those client-supplied FIDs.
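
Below is a minimal standalone sketch of the client-side check that produces the "dir layout mismatch" error above: the client compares its cached LMV stripe FIDs against the layout returned after replay, and fails with -EINVAL when the shard FIDs differ. The struct and function names loosely follow lustre_lmv.h and llite_lib.c, but the fields and helpers here are simplified for illustration and are not the real Lustre code.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct lu_fid {
        uint64_t f_seq;
        uint32_t f_oid;
        uint32_t f_ver;
};

/* simplified stand-in for struct lmv_stripe_md */
struct lmv_stripe_md {
        uint32_t      lsm_magic;
        uint32_t      lsm_stripe_count;
        struct lu_fid lsm_stripe_fids[2];   /* per-stripe FIDs */
};

static bool fid_eq(const struct lu_fid *a, const struct lu_fid *b)
{
        return a->f_seq == b->f_seq && a->f_oid == b->f_oid &&
               a->f_ver == b->f_ver;
}

/* the comparison ll_update_lsm_md() performs, reduced to its core */
static bool lsm_md_matches(const struct lmv_stripe_md *old_md,
                           const struct lmv_stripe_md *new_md)
{
        uint32_t i;

        if (old_md->lsm_magic != new_md->lsm_magic ||
            old_md->lsm_stripe_count != new_md->lsm_stripe_count)
                return false;

        for (i = 0; i < old_md->lsm_stripe_count; i++)
                if (!fid_eq(&old_md->lsm_stripe_fids[i],
                            &new_md->lsm_stripe_fids[i]))
                        return false;
        return true;
}

int main(void)
{
        /* FIDs taken from the log above: the shards come back with new FIDs */
        struct lmv_stripe_md cached = {
                .lsm_magic = 0xcd20cd0, .lsm_stripe_count = 2,
                .lsm_stripe_fids = { { 0x20000dec0ULL, 0x7, 0x0 },
                                     { 0x24000e690ULL, 0x7, 0x0 } },
        };
        struct lmv_stripe_md replayed = cached;

        replayed.lsm_stripe_fids[0] = (struct lu_fid){ 0x20000e690ULL, 0x3, 0x0 };
        replayed.lsm_stripe_fids[1] = (struct lu_fid){ 0x24000ee60ULL, 0x1, 0x0 };

        if (!lsm_md_matches(&cached, &replayed))
                printf("dir layout mismatch -> ll_prep_inode() returns -22 (-EINVAL)\n");
        return 0;
}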



 Comments   
Comment by Gerrit Updater [ 23/Feb/21 ]

Andriy Skulysh (c17819@cray.com) uploaded a new patch: https://review.whamcloud.com/41731
Subject: LU-14470 test: striped dir layout mismatch after failover
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 48d6ff02041dd6d8124d364f784e31ec6517bf7c

Comment by Lai Siyao [ 15/Mar/21 ]

Recovery from dual or multiple MDT failures is hard to support.

For striped directory creation, the stripe FIDs are allocated by the MDTs, and these FIDs may not be usable in recovery (the metadata sequence may not be allocated yet), or they may already be in use by other objects (FIDs allocated by other operations during recovery).

But if we don't reuse the FIDs, there are other problems: even if we update the directory layout after replay (instead of reporting an error, as we do now), a subsequent touch under this striped directory can't be replayed because its parent has changed.

Comment by Lai Siyao [ 16/Mar/21 ]

Some distributed transaction replays need all the information stored in the update logs, because these transactions allocate FIDs on the MDTs, e.g. striped directory creation and directory migration (to a striped directory). Such operations can't be replayed from the client side.

One way to improve this is to store update logs on more MDTs than those involved: e.g., if a striped directory is created on MDT0 and MDT1, also store its update logs on MDT2 (this could be configurable, with more backups possible), so that upon recovery all the information can be obtained from MDT2 and the operation successfully replayed.

Comment by Andreas Dilger [ 26/Nov/21 ]

Since the FIDs for the MDT directory stripes are allocated by the MDS, I think there are two options here:

  • have the client replay the mkdir with the original MDT FIDs in the LMV xattr, the same way that client replay of regular files includes the OST object FIDs (see the sketch after this list). That ensures the recreated directory is exactly the same as the original and avoids failures in later replay operations. I think this would be my preferred solution, but it potentially exposes the internal filesystem structure to inconsistency if the client specifies wrong FIDs during replay (though see LU-15250).
  • have the client use the FID of the master directory (not the shard FID) as the parent FID when replaying the file create, and then look up the shard FID from the directory during replay. The client doing the mkdir originally selected the master directory FID and will use it for replay, so it will not change. That works around the problem of the shard FIDs changing during replay, but may still have other problems later if the shard FIDs are also used for other operations (though those might be fixed similarly to use the master FID). I think this adds complexity to the client replay process, but it has the benefit that the client doesn't know as much detail about the shard layout, which might be helpful in the future.
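
A minimal sketch of the first option, under simplified assumptions: before the client resends the mkdir during replay, it copies the stripe FIDs it was originally granted into the LMV EA carried by the request, so the MDS recreates exactly the same shards. struct lmv_replay_md and pack_replay_lmv() are illustrative names, not the actual Lustre wire format or API.

#include <stdint.h>

struct lu_fid {
        uint64_t f_seq;
        uint32_t f_oid;
        uint32_t f_ver;
};

/* client-side cached layout of the striped directory (simplified) */
struct lmv_stripe_md {
        uint32_t      lsm_stripe_count;
        struct lu_fid lsm_stripe_fids[8];
};

/* hypothetical LMV EA sent with the replayed mkdir */
struct lmv_replay_md {
        uint32_t      lrm_stripe_count;
        struct lu_fid lrm_stripe_fids[8];
};

/* Fill the replay EA from the cached layout, the same way regular-file
 * replay carries the OST object FIDs. */
static void pack_replay_lmv(const struct lmv_stripe_md *lsm,
                            struct lmv_replay_md *lrm)
{
        uint32_t i;

        lrm->lrm_stripe_count = lsm->lsm_stripe_count;
        for (i = 0; i < lsm->lsm_stripe_count; i++)
                lrm->lrm_stripe_fids[i] = lsm->lsm_stripe_fids[i];
}

The MDS would then honor these FIDs during replay instead of allocating fresh ones, subject to the validity concerns Lai raised above (sequence not yet allocated, or FID already in use by another object).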

I do not think that storing the update logs on other MDTs is a good solution to this problem, for several reasons:

  • it makes distributed transactions even slower than they currently are, since it involves at least one extra MDT in each mkdir
  • there may not be an additional MDT to store the update log on (e.g. in a two-MDT system)
  • the additional MDT used to store the update log may also fail at the same time (e.g. power loss of a server rack), so it does not really solve the problem

It is better to have a solution that more closely follows the existing recovery mechanisms that Lustre uses (e.g. replay with the LMV xattr) instead of adding a different mechanism.

Comment by Lai Siyao [ 15/Dec/21 ]

After some tests, I found this to be a test script issue. The sync-on-lock-cancel mechanism eliminates the dependency between striped directory creation and sub-file creation:
1. Striped directory creation holds UPDATE locks on all stripes after creation; the 1st stripe's UPDATE lock is a local lock, while the UPDATE locks on the 2nd through last stripes are remote locks.
2. Sub-file creation does a getattr on the parent directory first, which revokes the UPDATE locks held in step 1.
3. Since the 1st stripe's UPDATE lock is a local lock, it is silently dropped; the UPDATE locks on the 2nd through last stripes are remote locks, so cancelling them triggers commit-on-sharing, which guarantees that the creation of the 2nd through last stripes and their update logs are committed to disk. After this, the parent directory can always be recovered from the update logs.
4. If all involved MDTs are rebooted at this moment, the striped directory is recovered from the update logs.
5. If the sub file is located on the same MDT as the parent directory (i.e. the 1st stripe), its creation is replayed after the parent directory replay, which is fine. Otherwise the sub-file creation may be replayed before the parent directory replay, because they are replayed on different MDTs; however, since the 2nd through last stripes have already been committed to disk, even if the parent directory has not yet been recovered from the update logs, the corresponding stripe already exists, and the sub-file creation replay will succeed.

In summary, if two operation replays depend on each other but may be replayed in random order (because they run on different MDTs), this dependency should be eliminated (either by commit-on-sharing or by sync-on-lock-cancel). In this case there is no need to implement client-side replay of striped directory creation: if no subsequent operation depends on it, it can simply be a fresh creation, while if a dependency exists, it is eliminated by sync-on-lock-cancel.
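
A minimal sketch of the local/remote distinction in step 3 above, under stated assumptions: struct update_lock, commit_pending_updates(), and cancel_update_lock() are hypothetical stand-ins for the actual Lustre lock-cancel path, reduced to the one decision that matters here.

#include <stdbool.h>
#include <stdio.h>

/* hypothetical stand-in for an UPDATE lock on one directory stripe */
struct update_lock {
        int  ul_stripe_index;   /* 0 == 1st (master) stripe */
        bool ul_remote;         /* granted by a remote MDT? */
};

/* illustrative commit hook: in real Lustre this would force the pending
 * distributed updates and their update-log records to disk */
static void commit_pending_updates(int stripe_index)
{
        printf("stripe %d: commit-on-sharing, updates now on disk\n",
               stripe_index);
}

/* step 3: a local UPDATE lock is dropped silently, while cancelling a
 * remote one first forces the updates to commit, so a later sub-file
 * creation never depends on uncommitted state */
static void cancel_update_lock(const struct update_lock *lock)
{
        if (!lock->ul_remote)
                return;         /* 1st stripe: local lock, silent drop */

        commit_pending_updates(lock->ul_stripe_index);
}

int main(void)
{
        struct update_lock locks[] = {
                { .ul_stripe_index = 0, .ul_remote = false },
                { .ul_stripe_index = 1, .ul_remote = true  },
        };
        unsigned int i;

        /* the getattr in step 2 revokes every stripe's UPDATE lock */
        for (i = 0; i < sizeof(locks) / sizeof(locks[0]); i++)
                cancel_update_lock(&locks[i]);
        return 0;
}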

BTW, I'm afraid this can't be tested with replay_barrier(), which causes step 2 to fail, so the test failed in the end. This should probably be tested on real machines.

I will add a test to verify that sync-on-lock-cancel is triggered for sub-file creation under a striped directory.

Comment by Gerrit Updater [ 16/Dec/21 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45870
Subject: LU-14470 test: add striped directory creation test
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9d88d35f954ff6cf93d8f53894a54e05fe39efad

Comment by Gerrit Updater [ 18/May/22 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47385
Subject: LU-14470 dne: striped mkdir replay
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f7d4bb673b3cabce4e70029e2ff50f56e94a40ca

Comment by Gerrit Updater [ 28/Sep/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/47385/
Subject: LU-14470 dne: striped mkdir replay by client request
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a2e997f0bed0ae4cfdcf6d73f8a79e3d23d28a2f
