Lustre / LU-13832

"lfs migrate -m" leads to inconsistent ldiskfs directories

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Affects Version/s: Lustre 2.14.0
    • Severity: 3

    Description

      I created a test directory with striped DNE directories as follows:

      # export MDSCOUNT=8
      # export DIR=/mnt/testfs/allmdt
      # lfs mkdir -c -1 $DIR
      # for D in $(seq $MDSCOUNT); do
          lfs mkdir -c 2 $DIR/dirstr$D
          rsync -a --exclude "policy.*" /etc/ $DIR/dirstr$D/
      done
      

      This populated the test directories with a variety of files whose contents can later be verified. Next, migrate each directory and verify that the contents have not changed (the rsync dry run should not report any files that need to be updated):

      # for D in $(seq $MDSCOUNT); do
          echo $DIR/dirstr$D
          lfs migrate -m $((RANDOM % MDSCOUNT)) -c2 $DIR/dirstr$D
          rsync -av --exclude "policy.*" --dry-run /etc/ $DIR/dirstr$D/
      done
      

      I ran this a couple of times, then ran e2fsck on the MDTs, and all of them showed the same problem on many remote directories:

      e2fsck 1.45.2.wc1 (27-May-2019)
      Pass 1: Checking inodes, blocks, and sizes
      Pass 2: Checking directory structure
      Directory entry for '.' in ... (25191) is big. Split? yes
      Missing '..' in directory inode 25191. Fix? yes
      Setting filetype for entry '..' in ... (25191) to 2.
      :
      Pass 3: Checking directory connectivity   [[[ WHEN NOT FIXING ]]]
      '..' in /REMOTE_PARENT_DIR/0x200000407:0x6f5:0x0 (26203) is <The NULL inode> (0), should be /REMOTE_PARENT_DIR (25001).
      Fix? no
      [[[ OR ]]]
      Pass 3: Checking directory connectivity  [[[ WHEN FIXING ]]]
      Unconnected directory inode 25191 (/???)
      Connect to /lost+found? yes
      :
      Pass 4: Checking reference counts
      Inode 2 ref count is 0, should be 11.  Fix? yes
      
      Inode 25191 ref count is 3, should be 2.  Fix? yes
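
      For reference, the check above can be reproduced along the following lines with the MDT unmounted; the mount point and device name here are placeholders, not from the original report:

      # umount /mnt/mds1                 ## the MDT must not be mounted during the check
      # e2fsck -fn /dev/mds1_mdt0        ## read-only pass: report problems, fix nothing
      # e2fsck -fy /dev/mds1_mdt0        ## repair pass: apply the fixes shown above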
      

      Looking at the directories under REMOTE_PARENT_DIR it appears that the ".." entry is missing from the directory, so "." is a single 4096-byte entry that consumes the whole block. It may be that this hasn't been noticed in the past because these directories are all small and do not need to be split for HTREE, which would add a ".." as part of struct dx_info.
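
      The on-disk state can also be inspected directly with debugfs, without mounting the MDT (again, the device name is a placeholder; the FID-named path is one of the directories reported by e2fsck above):

      # debugfs -c -R "ls -l /REMOTE_PARENT_DIR/0x200000407:0x6f5:0x0" /dev/mds1_mdt0

      A healthy directory lists both "." and ".." entries here; an affected one shows only ".", whose record consumes the entire 4096-byte block.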

    Activity

            laisiyao Lai Siyao added a comment -

            Commit 3f608461b387df056c9563d4c2879b05fb54a5a5 does remove the optimization for empty-directory migration; this was done to simplify the code, since empty directories should be rare.

            I haven't been able to reproduce this yet. Andreas, are you testing with the master branch?

            panda Andrew Perepechko added a comment (edited) -

            I wonder if laisiyao reproduced this issue with some old code. We were able to reduce the test that led to corruption in our case down to simply:

            lfs setdirstripe -i 0 -c 2 /mnt/lustre/d
            lfs migrate -m 0 /mnt/lustre/d
            

            Apparently, the issue was related to the fact that an empty directory did not receive LMV_HASH_FLAG_MIGRATION as part of migration. As a result, mdt_dir_layout_shrink() was not able to complete the migration and returned -EALREADY.
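
            Whether a directory carries that migration flag can be checked with lfs getdirstripe (a sketch, not from the original report; the path is the reproducer's test directory). On builds that print the LMV hash flags, a directory stuck mid-migration shows a "migrating" flag appended to the hash type:

            lfs getdirstripe /mnt/lustre/d    # dump stripe count, MDT index, and lmv_hash_type (with any flags)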

            This issue was silently fixed by

            commit 3f608461b387df056c9563d4c2879b05fb54a5a5
            Author: Lai Siyao <lai.siyao@whamcloud.com>
            Date:   Sat Feb 15 21:26:36 2020 +0800
            
                LU-11025 dne: refactor dir migration
            
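            To check whether a given tree already contains that refactor, a quick sketch (assuming a checkout of the lustre git repository):

            # check whether HEAD contains the refactor commit
            git merge-base --is-ancestor 3f608461b387df056c9563d4c2879b05fb54a5a5 HEAD &&
                echo "refactor present" || echo "refactor missing"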

            panda Andrew Perepechko added a comment -

            laisiyao, do I understand correctly that your reproducer does not involve failover or parallelism of any sort? The test looks linear with respect to mkdir/migrate.
            spitzcor Cory Spitz added a comment -

            I can't answer the regression question yet either. I'm sure that we'll get an answer as we zero in on the root cause. FWIW, we've seen this condition on a 2.12 LTS filesystem (albeit with some patches and backports from 2.13.5x).

            adilger Andreas Dilger added a comment -

            PS: so far this is not a data-loss scenario, though on-disk consistency is affected. From my brief testing, it appears that e2fsck fixes the issue.

            adilger Andreas Dilger added a comment -

            Cory, I can't say whether this is a recent regression or not. Whether it is a 2.14 blocker depends on whether it was introduced in the 2.13.5x patches or has existed for a long time already.

    People

      Assignee: laisiyao Lai Siyao
      Reporter: adilger Andreas Dilger
      Votes: 0
      Watchers: 6