Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Affects Version/s: Lustre 2.11.0, Lustre 2.12.0
Description
It is easy to break the ldiskfs format in a DNE system with async updates by adding extra hard links from other MDTs:
1. start a DNE-enabled fs
[root@vm1 tests]# REFORMAT=yes MDSCOUNT=4 OSTCOUNT=4 sh llmount.sh
...
quota/lquota options: 'hash_lqs_cur_bits=3'
Formatting mgs, mds, osts
Format mds1: /tmp/lustre-mdt1
Format mds2: /tmp/lustre-mdt2
Format mds3: /tmp/lustre-mdt3
...
2. create a file on MDT0
[root@vm1 tests]# touch /mnt/lustre/foo
[root@vm1 tests]#
3. create a dir on another MDT
[root@vm1 tests]# lfs mkdir -i 1 /mnt/lustre/mdt1
[root@vm1 tests]#
4. create 20 hard links to /mnt/lustre/foo
[root@vm1 tests]# for x in $(seq 1 20); do ln /mnt/lustre/foo /mnt/lustre/mdt1/foo-link-$x; done
[root@vm1 tests]# ls -in /mnt/lustre/foo
144115205322833921 -rw-r--r--. 21 0 0 0 Sep 15 10:06 /mnt/lustre/foo
[root@vm1 tests]#
5. shutdown the fs
[root@vm1 tests]# MDSCOUNT=4 OSTCOUNT=4 sh llmountcleanup.sh
Stopping clients: vm1.localdomain /mnt/lustre (opts:-f)
Stopping client vm1.localdomain /mnt/lustre opts:-f
Stopping clients: vm1.localdomain /mnt/lustre2 (opts:-f)
6. run e2fsck on MDT0 image.
[root@vm1 tests]# e2fsck -fnv /tmp/lustre-mdt1
e2fsck 1.42.13.wc6 (05-Feb-2017)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Inode 168 ref count is 21, should be 2.  Fix? no
Pass 5: Checking group summary information

lustre-MDT0000: ********** WARNING: Filesystem still has errors **********

     280 inodes used (0.28%, out of 100000)
       7 non-contiguous files (2.5%)
       0 non-contiguous directories (0.0%)
         # of inodes with ind/dind/tind blocks: 1/0/0
   29638 blocks used (47.42%, out of 62500)
       0 bad blocks
       1 large file

     153 regular files
     118 directories
       0 character device files
       0 block device files
       0 fifos
       1 link
       0 symbolic links (0 fast symbolic links)
       0 sockets
------------
     272 files
[root@vm1 tests]#
Inode #168 counts all 21 links in its nlink counter, but only two of them are local:
[root@vm1 tests]# debugfs -R "ncheck 168" /tmp/lustre-mdt1
debugfs 1.42.13.wc6 (05-Feb-2017)
Inode	Pathname
168	/REMOTE_PARENT_DIR/0x200000404:0x1:0x0
168	/ROOT/foo
Segmentation fault (core dumped)
[root@vm1 tests]#
If we start the fs again
[root@vm1 tests]# NOFORMAT=yes MDSCOUNT=4 OSTCOUNT=4 sh llmount.sh
all 21 links are visible through lfs fid2path output:
[root@vm1 tests]# lfs fid2path /mnt/lustre 0x200000404:0x1:0x0
/mnt/lustre/foo
/mnt/lustre/mdt1/foo-link-1
/mnt/lustre/mdt1/foo-link-2
/mnt/lustre/mdt1/foo-link-3
/mnt/lustre/mdt1/foo-link-4
/mnt/lustre/mdt1/foo-link-5
/mnt/lustre/mdt1/foo-link-6
/mnt/lustre/mdt1/foo-link-7
/mnt/lustre/mdt1/foo-link-8
/mnt/lustre/mdt1/foo-link-9
/mnt/lustre/mdt1/foo-link-10
/mnt/lustre/mdt1/foo-link-11
/mnt/lustre/mdt1/foo-link-12
/mnt/lustre/mdt1/foo-link-13
/mnt/lustre/mdt1/foo-link-14
/mnt/lustre/mdt1/foo-link-15
/mnt/lustre/mdt1/foo-link-16
/mnt/lustre/mdt1/foo-link-17
/mnt/lustre/mdt1/foo-link-18
/mnt/lustre/mdt1/foo-link-19
/mnt/lustre/mdt1/foo-link-20
[root@vm1 tests]#
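For contrast, the invariant that e2fsck enforces can be demonstrated on any ordinary local filesystem (a generic sketch, not Lustre-specific): an inode's nlink count must equal the number of directory entries pointing at it. With DNE async updates, links created on another MDT raise nlink on MDT0 without adding a local directory entry, which is exactly the mismatch e2fsck flags above.

```shell
# On a single local filesystem, nlink always equals the number of
# local directory entries referencing the inode.
tmp=$(mktemp -d)
touch "$tmp/foo"
for x in 1 2 3; do ln "$tmp/foo" "$tmp/foo-link-$x"; done
stat -c %h "$tmp/foo"    # prints 4: the original entry plus 3 hard links
rm -rf "$tmp"
```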
Issue Links
- is related to:
  - LU-11549 Unattached inodes after 3 min racer run. (Resolved)
  - LU-11545 debugfs: "ncheck -c" does not work correctly (Resolved)
  - LU-14600 sanity-lfsck test_30: f0 is not recovered (Resolved)
  - LU-11706 create a lustre tunable to enable/disable experimental features (Resolved)
  - LU-10329 DNE3: REMOTE_PARENT_DIR scalability (Open)
Artem, have you given any thought to how we might handle this in a more transparent manner, separating the local on-disk nlink count from the distributed nlink count? Using leh_reccount partially solves this problem, but the linkEA is not guaranteed to store all of the hard links to a file.
While leh_reccount is a 32-bit value, it (currently) needs to match the number of entries in the list. That could be fixed with some changes to the code (maybe a new linkEA magic?) and to LFSCK, so that leh_reccount always stores the total number of hard links while the list itself may be shorter. We could base the list iteration on the size of the xattr rather than on the link count, or add a separate field to the linkEA, or maybe to the LMA. Then the MDS would not drop the last local link to a file until leh_reccount reached zero, instead of trusting the inode nlink count.
I don't think storing all the hard links to a file in the linkEA is practical, as that would get very slow: 65000 links x 274 bytes/link = ~17MB that would need to be rewritten on each update, and would break getxattr due to the size. Even using the full 64KiB xattr would allow at most (65536 - 24) / (2 + 16 + 8) = 2519 8-byte filenames, or 1926 16-byte filenames, which is lower than we'd want for the maximum nlink count.
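The capacity numbers above can be reproduced with shell arithmetic, using the sizes as given in the comment (24-byte linkEA header, and a per-entry overhead of a 2-byte record length plus a 16-byte FID before the name):

```shell
# linkEA capacity estimate for a full 64KiB xattr, per the sizes quoted above
xattr_max=65536   # maximum xattr size assumed in the comment
header=24         # linkEA header size
for namelen in 8 16; do
    echo "${namelen}-byte names: $(( (xattr_max - header) / (2 + 16 + namelen) ))"
done
# prints:
#   8-byte names: 2519
#   16-byte names: 1926
```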