[LU-11446] ldiskfs inodes nlink mismatch with DNE Created: 28/Sep/18 Updated: 04/Aug/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Zarochentsev | Assignee: | Artem Blagodarenko |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | dne2 |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
It is easy to break the ldiskfs format in a DNE system with async updates by adding extra hard links from other MDTs:

1. Start a DNE-enabled fs:

[root@vm1 tests]# REFORMAT=yes MDSCOUNT=4 OSTCOUNT=4 sh llmount.sh
...
quota/lquota options: 'hash_lqs_cur_bits=3'
Formatting mgs, mds, osts
Format mds1: /tmp/lustre-mdt1
Format mds2: /tmp/lustre-mdt2
Format mds3: /tmp/lustre-mdt3
...

2. Create a file on MDT0:

[root@vm1 tests]# touch /mnt/lustre/foo
[root@vm1 tests]#

3. Create a directory on another MDT:

[root@vm1 tests]# lfs mkdir -i 1 /mnt/lustre/mdt1
[root@vm1 tests]#

4. Create 20 hard links to /mnt/lustre/foo:

[root@vm1 tests]# for x in $(seq 1 20); do ln /mnt/lustre/foo /mnt/lustre/mdt1/foo-link-$x; done
[root@vm1 tests]# ls -in /mnt/lustre/foo
144115205322833921 -rw-r--r--. 21 0 0 0 Sep 15 10:06 /mnt/lustre/foo
[root@vm1 tests]#

5. Shut down the fs:

[root@vm1 tests]# MDSCOUNT=4 OSTCOUNT=4 sh llmountcleanup.sh
Stopping clients: vm1.localdomain /mnt/lustre (opts:-f)
Stopping client vm1.localdomain /mnt/lustre opts:-f
Stopping clients: vm1.localdomain /mnt/lustre2 (opts:-f)

6. Run e2fsck on the MDT0 image:

[root@vm1 tests]# e2fsck -fnv /tmp/lustre-mdt1
e2fsck 1.42.13.wc6 (05-Feb-2017)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Inode 168 ref count is 21, should be 2. Fix? no
Pass 5: Checking group summary information
lustre-MDT0000: ********** WARNING: Filesystem still has errors **********
280 inodes used (0.28%, out of 100000)
7 non-contiguous files (2.5%)
0 non-contiguous directories (0.0%)
# of inodes with ind/dind/tind blocks: 1/0/0
29638 blocks used (47.42%, out of 62500)
0 bad blocks
1 large file
153 regular files
118 directories
0 character device files
0 block device files
0 fifos
1 link
0 symbolic links (0 fast symbolic links)
0 sockets
------------
272 files
[root@vm1 tests]#
Inode #168 counts all links in its nlink counter, but only two links are local:

[root@vm1 tests]# debugfs -R "ncheck 168" /tmp/lustre-mdt1
debugfs 1.42.13.wc6 (05-Feb-2017)
Inode Pathname
168 /REMOTE_PARENT_DIR/0x200000404:0x1:0x0
168 /ROOT/foo
Segmentation fault (core dumped)
[root@vm1 tests]#

If we start the fs again:

[root@vm1 tests]# NOFORMAT=yes MDSCOUNT=4 OSTCOUNT=4 sh llmount.sh

all 21 links are visible through the lfs fid2path output:

[root@vm1 tests]# lfs fid2path /mnt/lustre 0x200000404:0x1:0x0
/mnt/lustre/foo
/mnt/lustre/mdt1/foo-link-1
/mnt/lustre/mdt1/foo-link-2
/mnt/lustre/mdt1/foo-link-3
/mnt/lustre/mdt1/foo-link-4
/mnt/lustre/mdt1/foo-link-5
/mnt/lustre/mdt1/foo-link-6
/mnt/lustre/mdt1/foo-link-7
/mnt/lustre/mdt1/foo-link-8
/mnt/lustre/mdt1/foo-link-9
/mnt/lustre/mdt1/foo-link-10
/mnt/lustre/mdt1/foo-link-11
/mnt/lustre/mdt1/foo-link-12
/mnt/lustre/mdt1/foo-link-13
/mnt/lustre/mdt1/foo-link-14
/mnt/lustre/mdt1/foo-link-15
/mnt/lustre/mdt1/foo-link-16
/mnt/lustre/mdt1/foo-link-17
/mnt/lustre/mdt1/foo-link-18
/mnt/lustre/mdt1/foo-link-19
/mnt/lustre/mdt1/foo-link-20
[root@vm1 tests]#
|
| Comments |
| Comment by Andreas Dilger [ 18/Oct/18 ] |
|
I was able to reproduce this issue with 2.12.

There will always be at least one hard link from REMOTE_PARENT_DIR to the local inode to ensure that the inode is not deleted, although this is not totally robust if that remote link is removed too soon. I tested removing both the local (original) file and the remote hard links, and this did not cause the REMOTE_PARENT_DIR link to be lost, so e2fsck would not delete such a file, only change its link count; no data would be lost as a result.

When LU-10329 is implemented, there will be a remote link from every MDT that links to the file. Until then, creating a separate entry in REMOTE_PARENT_DIR for every hard link would quickly cause problems in the MDT filesystem, as the directory size limit would be hit and/or performance would degrade due to the huge directory size.

So the problem is that e2fsck will consider the filesystem to be incorrect and repair the inode link count (and the wrong count will then be reported to clients), but fortunately this is not a data-loss scenario, and it has existed since at least 2.8, when DNE2 remote hard links were introduced. As a workaround, it might make sense to return max(i_nlink, leh_reccount) to the clients for this case; see the sketch below.
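A minimal sketch of that workaround, modeled as self-contained user-space C rather than actual MDT code (the struct and helper names here are assumptions for illustration):

    #include <stdio.h>

    /* Illustrative model, not Lustre source: after e2fsck lowers the
     * local i_nlink on one MDT, it can disagree with leh_reccount,
     * the number of name entries recorded in the trusted.link xattr. */
    struct model_inode {
            unsigned int i_nlink;      /* local ldiskfs link count */
            unsigned int leh_reccount; /* entries in the linkEA */
    };

    /* Proposed workaround: report the larger of the two counts to
     * clients, so all DNE hard links stay visible. */
    static unsigned int visible_nlink(const struct model_inode *ino)
    {
            return ino->i_nlink > ino->leh_reccount ?
                    ino->i_nlink : ino->leh_reccount;
    }

    int main(void)
    {
            /* e2fsck reset i_nlink to 2, but the linkEA still lists 21 */
            struct model_inode foo = { .i_nlink = 2, .leh_reccount = 21 };

            printf("client sees nlink = %u\n", visible_nlink(&foo));
            return 0;
    }
|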
| Comment by Cory Spitz [ 19/Oct/18 ] |
|
adilger, thanks for confirming this issue for L2.12. You said that it is not a data-loss scenario; however, it can become one if e2fsck lowers the nlink count and then enough of the other names are unlinked. Once the nlink count hits 0, the remaining names become unconnected and Lustre can no longer perform fid2path. We expect that LFSCK can repair nlink, but we have seen at least one case where it could not. We have not reproduced that case yet, and it is strange, because LFSCK claimed the repair succeeded: "namespace LFSCK repaired the object [0x300018aa1:0x1:0x0]'s nlink count from 2 to 2: rc = 0". |
| Comment by Andreas Dilger [ 24/Oct/18 ] |
|
Cory, I agree that it might be possible to get into that situation. Strictly speaking, the REMOTE_PARENT_DIR link will not be removed, but the nlink count may hit zero after an e2fsck but before LFSCK is run.

My suggestion to fix this would be to have the mdd layer increase the local nlink count if it detects that the linkEA has more links than are reflected by nlink. That would "correct" the local nlink count (to the number of links the linkEA can hold) and prevent it from becoming zero; a sketch follows below.

Even better (though more complex) would be to store the "actual" link count somewhere in the linkEA or LMA, use that as the authoritative link count to return to clients, and leave the local inode link count unaffected by multiple remote hard links. We still want to keep REMOTE_PARENT_DIR, to avoid there being no local references, but we do not need an entry there for every hard link. That is a much more complex solution, and probably not something to do before 2.12, as it involves changing the on-disk format and LFSCK.
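A sketch of that first suggestion, under the same illustrative user-space model as above (again an assumption-laden model, not the real mdd code):

    #include <stdio.h>

    /* Illustrative model, not Lustre source. */
    struct model_inode {
            unsigned int i_nlink;      /* local ldiskfs link count */
            unsigned int leh_reccount; /* entries in the linkEA */
    };

    /* Proposed mdd-level correction: if the linkEA records more names
     * than the local nlink reflects, raise i_nlink so it cannot reach
     * zero while remote links still exist. */
    static void correct_local_nlink(struct model_inode *ino)
    {
            if (ino->leh_reccount > ino->i_nlink)
                    ino->i_nlink = ino->leh_reccount;
    }

    int main(void)
    {
            struct model_inode foo = { .i_nlink = 2, .leh_reccount = 21 };

            correct_local_nlink(&foo);
            printf("corrected i_nlink = %u\n", foo.i_nlink);
            return 0;
    }
|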
| Comment by Andreas Dilger [ 24/Oct/18 ] |
|
It may be that the better long-term strategy is the second one: the local filesystem keeps nlink equal to only the local link count, rather than also holding the nlink count of all remote references. That would improve several cases.
We would only need to store a "global" link count in the LMA or linkEA if there are cross-MDT hard links to the file. That is typically very unlikely to happen, so we don't want to add the overhead unless it is actually needed. |
| Comment by Gerrit Updater [ 04/Oct/19 ] |
|
Artem Blagodarenko (c17828@cray.com) uploaded a new patch: https://review.whamcloud.com/36371 |
| Comment by Andreas Dilger [ 04/Oct/19 ] |
|
Artem, have you given any thought to how we might handle this in a more transparent manner, separating the local disk nlink count from the distributed nlink count?

Using leh_reccount partially solves this problem, but the linkEA is not guaranteed to store all of the hard links to a file. While leh_reccount is a 32-bit value, it (currently) needs to match the number of entries in the list. That could possibly be fixed with some changes to the code (maybe a new magic?) and to LFSCK, so that leh_reccount always stores the total number of hard links even when the list itself is shorter. We could base the list iteration on the size of the xattr rather than on the link count, add a separate field to the linkEA, or maybe put one in the LMA. Then the MDS would not drop the last local link to a file until leh_reccount became zero, instead of trusting the inode nlink count.

I don't think storing all of the hard links to a file in the linkEA is practical, as that would get very slow: 65000 links x 274 bytes/link = 17MB that needs to be rewritten on each update, which would also break getxattr due to the size. Even using the full 64KiB xattr would allow at most (65536 - 24) / (2 + 16 + 8) = 2519 8-byte filenames, or 1926 16-byte filenames, which is lower than we'd want for the maximum nlink count.
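For reference, the capacity arithmetic above written out as a small program (the 24-byte header and the 2-byte record length plus 16-byte parent FID per entry are taken from the sizes quoted in this comment):

    #include <stdio.h>

    /* linkEA capacity math from the comment above. Sizes are the
     * ones quoted in this ticket: a 24-byte header, then per entry
     * a 2-byte record length, a 16-byte parent FID, and the name. */
    #define XATTR_MAX_SIZE  65536
    #define LEH_SIZE        24
    #define LEE_OVERHEAD    (2 + 16)

    static unsigned int max_linkea_entries(unsigned int name_len)
    {
            return (XATTR_MAX_SIZE - LEH_SIZE) / (LEE_OVERHEAD + name_len);
    }

    int main(void)
    {
            printf("8-byte names:  %u entries\n", max_linkea_entries(8));  /* 2519 */
            printf("16-byte names: %u entries\n", max_linkea_entries(16)); /* 1926 */
            return 0;
    }
|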
| Comment by Gerrit Updater [ 30/Mar/21 ] |
|
Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/43169 |
| Comment by Andreas Dilger [ 01/Apr/21 ] |
|
Instead of changing the existing semantics of leh_reccount to hold the total link count, it probably makes more sense to use the reserved2 field for a new leh_linkcount that stores the total number of links. If this field is zero, then we depend on max(inode->i_links_count, leh_reccount) as the best-guess estimate of the distributed link count, though that cannot be totally accurate given the size limitations of the trusted.link xattr.
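A sketch of how the header might look with that proposal applied; the field names and layout are assumptions based on this ticket's discussion (a 24-byte header with magic, record count, total length, and two reserved words), not the merged on-disk format:

    #include <stdint.h>

    /* Sketch only: the second reserved word becomes leh_linkcount,
     * the total number of links across all MDTs; zero means unset,
     * in which case the server falls back to
     * max(inode->i_links_count, leh_reccount) as described above. */
    struct link_ea_header_sketch {
            uint32_t leh_magic;
            uint32_t leh_reccount;   /* entries actually stored in the xattr */
            uint64_t leh_len;        /* total size of the xattr body */
            uint32_t leh_reserved1;
            uint32_t leh_linkcount;  /* proposed: total link count, 0 = unset */
    };
|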
| Comment by Gerrit Updater [ 07/Apr/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/43169/ |
| Comment by Andreas Dilger [ 07/Apr/21 ] |
|
The e2fsck patch is merged into 1.45.6.wc6, but the improvement to DNE nlink handling still needs to be done, so this ticket should not be closed yet. |
| Comment by Gerrit Updater [ 08/Apr/21 ] |
|
Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/43231 |
| Comment by Gerrit Updater [ 15/Apr/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43324 |
| Comment by Gerrit Updater [ 21/Apr/21 ] |
|
Li Dongyang (dongyangli@ddn.com) merged in patch https://review.whamcloud.com/43231/ |
| Comment by Peter Jones [ 07/Feb/22 ] |
|
Is there any work remaining on this ticket? |
| Comment by Artem Blagodarenko (Inactive) [ 25/Mar/22 ] |
|
>Is there any work remaining on this ticket? |