[LU-2901] Duplicate filename on the same ldiskfs directory on MDS Created: 04/Mar/13  Updated: 20/Aug/13  Resolved: 20/Aug/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.1.3, Lustre 2.1.6, Lustre 2.4.1
Fix Version/s: Lustre 2.1.6, Lustre 2.4.1, Lustre 2.5.0

Type: Bug Priority: Critical
Reporter: Diego Moreno (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None

Attachments: File debug.gz     File duplicate_name.c     File fsck_issue_duplicate    
Issue Links:
Related
is related to LU-3439 User code creating multiple lockfiles... Closed
Severity: 3
Rank (Obsolete): 6988

 Description   

We have encountered, three times on three different Lustre MDS ldiskfs filesystems,
a duplicate filename within a single directory, for example:

2012 Dec 4 10:59:22 bigfoot2 :

cdep4-MDT0000 Starting fsck
cdep4-MDT0000 Pass 1: Checking inodes, blocks, and sizes
cdep4-MDT0000 Pass 2: Checking directory structure
cdep4-MDT0000 Duplicate entry 'BDE_UCD.00000-00001' found.
cdep4-MDT0000 Marking /ROOT/yack/group/USER/test9 (836767265) to be rebuilt.
cdep4-MDT0000
cdep4-MDT0000 Duplicate entry 'BDE_UCD.00000-00001' found.
cdep4-MDT0000 Marking /ROOT/yack/group/USER/test9_pscc (838872002) to be rebuilt.
cdep4-MDT0000
cdep4-MDT0000 Pass 3: Checking directory connectivity
cdep4-MDT0000 Pass 3A: optimizing directories
cdep4-MDT0000 Entry 'BDE_UCD.00000-00001' in /ROOT/yack/group/USER/test9 (836767265) has a non-unique filename.
cdep4-MDT0000 Rename to BDE_UCD.00000-0000~0? yes
cdep4-MDT0000
cdep4-MDT0000 Entry 'BDE_UCD.00000-00001' in /ROOT/yack/group/USER/test9_pscc (838872002) has a non-unique filename.
cdep4-MDT0000 Rename to BDE_UCD.00000-0000~0? yes
cdep4-MDT0000
cdep4-MDT0000 Pass 4: Checking reference counts
cdep4-MDT0000 Pass 5: Checking group summary information
cdep4-MDT0000
cdep4-MDT0000: ***** FILE SYSTEM WAS MODIFIED *****
cdep4-MDT0000: 2872950/878051328 files (0.8% non-contiguous), 111956436/878047232 blocks

In this case, running the ls command shows the issue:

total 0
Fri Nov 30 16:20:22 + 0.00 ###############################################################################
Fri Nov 30 16:20:22 + 0.00 ## Contenu du repertoire cache_dep /cea/cache_dep/yack/group/USER/test9_ ##
Fri Nov 30 16:20:22 + 0.00 ## pscc ##
Fri Nov 30 16:20:22 + 0.00 ###############################################################################
total 331844
-rw-r--r-- 1 USER f7 10240 Nov 30 16:19 BDE_DIVERS
-rw-r----- 1 USER f7 140615680 Nov 30 16:09 BDE_MAILLAGE
-rw-r--r-- 1 USER f7 10240 Nov 30 16:20 BDE_POST1D
-rw-r--r-- 1 USER f7 99563008 Nov 30 16:16 BDE_UCD.00000-00001
-rw-r--r-- 1 USER f7 99563008 Nov 30 16:16 BDE_UCD.00000-00001

Then, running debugfs after the fsck, we can see that the two files
exist on two different inodes:

[root@bigfoot2 ~]# /usr/lib/lustre/debugfs /dev/mapper/da1vg0_mdt
debugfs 1.42.3.wc3 (15-Aug-2012)
debugfs: ls
2 (12) . 2 (12) .. 11 (20) lost+found 589299713 (16) CONFIGS
637534209 (16) OBJECTS 12 (20) lov_objid 13 (16) oi.16
14 (12) fld 15 (16) seq_srv 16 (16) seq_ctl 17 (20) capa_keys
627048449 (16) PENDING 643825665 (12) ROOT 18 (20) last_rcvd
700448769 (20) REM_OBJ_DIR 19 (3852) CATALOGS
debugfs: cd ROOT/yack/group/USER
debugfs: cd test9_pscc
debugfs: ls
838872002 (28) . 643826507 (28) ..
838873362 (28) BDE_UCD.00000-0000~0 838875198 (48) BDE_UCD.00000-00001
838875615 (36) BDE_POST1D 838877439 (40) BDE_MAILLAGE
838877482 (36) BDE_DIVERS 838877492 (3852) BDE_PROT_LAG.00001-00001
debugfs:
debugfs: stat BDE_UCD.00000-0000~0
Inode: 838873362 Type: regular Mode: 0644 Flags: 0x0
Generation: 2475899703 Version: 0x0000002b:1ecd9699
User: 3083 Group: 5214 Size: 0
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x50b8cde4:00000000 – Fri Nov 30 16:16:52 2012
atime: 0x50bf189a:00000000 – Wed Dec 5 10:49:14 2012
mtime: 0x50b8cde4:00000000 – Fri Nov 30 16:16:52 2012
crtime: 0x50b8cc66:aa4a5734 – Fri Nov 30 16:10:30 2012
Size of extra inode fields: 28
Extended attributes stored in inode body:
lma = "00 00 00 00 00 00 00 00 71 be 71 17 02 00 00 00 4b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
0 00 00 00 00 " (64)
lma: fid=[0x21771be71:0x4b:0x0]
lov = "d0 0b d1 0b 01 00 00 00 4b 00 00 00 00 00 00 00 71 be 71 17 02 00 00 00 00 00 40 00 03 00 00 00 99 9f
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 13 00 00 00 18 ad 00 0
0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0d 02 00 00 c9 af 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 53 02 00 00 " (104)
link = "df f1 ea 11 01 00 00 00 3d 00 00 00 00 00 00 00 39 41 6c 65 70 68 6f 4d 00 25 00 00 00 02 17 59 a0 d1
00 00 34 fb 00 00 00 00 42 44 45 5f 55 43 44 2e 30 30 30 30 30 2d 30 30 30
30 31 " (61)
BLOCKS:

debugfs: stat BDE_UCD.00000-00001
Inode: 838875198 Type: regular Mode: 0644 Flags: 0x0
Generation: 2475897856 Version: 0x0000002b:1ec71356
User: 3083 Group: 5214 Size: 0
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x50b8c8fd:00000000 – Fri Nov 30 15:55:57 2012
atime: 0x50bf18a0:00000000 – Wed Dec 5 10:49:20 2012
mtime: 0x50b8c8fd:00000000 – Fri Nov 30 15:55:57 2012
crtime: 0x50b8c6c2:977f5a84 – Fri Nov 30 15:46:26 2012
Size of extra inode fields: 28
Extended attributes stored in inode body:
lma = "00 00 00 00 00 00 00 00 ef 0a 4f 17 02 00 00 00 52 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
0 00 00 00 00 " (64)
lma: fid=[0x2174f0aef:0x52:0x0]
lov = "d0 0b d1 0b 01 00 00 00 52 00 00 00 00 00 00 00 ef 0a 4f 17 02 00 00 00 00 00 40 00 03 00 00 00 59 27
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c2 00 00 00 9b a6 00 0
0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 b9 01 00 00 08 ac 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 db 02 00 00 " (104)
link = "df f1 ea 11 01 00 00 00 3d 00 00 00 00 00 00 00 39 41 6c 65 70 68 6f 4d 00 25 00 00 00 02 17 59 a0 d1
00 00 34 fb 00 00 00 00 42 44 45 5f 55 43 44 2e 30 30 30 30 30 2d 30 30 30
30 31 " (61)
BLOCKS:

Another trace can be found attached to this ticket.

Is this a known issue?



 Comments   
Comment by Peter Jones [ 04/Mar/13 ]

Lai

Could you please comment on this one?

Thanks

Peter

Comment by Lai Siyao [ 13/Mar/13 ]

My understanding of this issue is:

  • fsck found two dir entries with the same name, and renamed the second to a unique name.
  • however `ls <dir>` still lists two files with the same name, and the file with the unique name is not shown. Could you confirm whether this was done on the MDS (with the device mounted as 'ldiskfs') or on a client? BTW could you also run `ls -i ...` to list the inode numbers of all the files?
  • while debugfs on the MDS device shows the correct result (fixed by fsck).

Please correct me if I misunderstood, thanks.

Comment by Lai Siyao [ 14/Mar/13 ]

Could you list the version of e2fsprogs on your system?

Comment by Lai Siyao [ 14/Mar/13 ]

Do you know how the file 'BDE_UCD.00000-00001' is created? What operations have been performed on this file? Was it created multiple times?

Comment by Diego Moreno (Inactive) [ 15/Mar/13 ]

The e2fsprogs version is 1.42.3.wc3. This issue is difficult to reproduce/trace as it only arises from time to time and is only detected when lfsck is run once a month. It comes from an internal I/O library doing lots of I/O.

From my understanding lfsck is fine and the issue is as follows:

  • On a periodic lfsck run we see there is a duplicate filename entry. We cancel the lfsck doing nothing.
  • Then from a Lustre client, doing 'ls', we obviously see the same filename twice (lfsck is right!).
  • We re-run lfsck renaming one of the duplicate filenames.
  • Everything is fine again as there are no more duplicate filenames.

The question is: how could a duplicate filename appear in the first place?

Comment by Diego Moreno (Inactive) [ 15/Mar/13 ]

I forgot to mention that both files have the same name and size but different attributes (ctime, mtime) and fid.

Comment by Lai Siyao [ 15/Mar/13 ]

One possible cause is that on the second create, the lookup by filename fails, so a file with the same name is created twice. But it's strange that this happened with the same name 'BDE_UCD.00000-00001' in two different directories: 'test9' and 'test9_pscc'. I've tested with the same directory structure, but nothing went wrong.

I'll see whether it is possible to make a debug patch to catch such a case and print a message when it happens.

Comment by Lai Siyao [ 03/May/13 ]

I'm afraid this is not a Lustre-specific bug, because even if Lustre code forgets to check for a duplicate directory entry before inserting, ldiskfs (ext4) will complain and refuse the insertion. And I can't think of a way to catch such an error in Lustre code.

BTW in the attached trace file, I see such log messages:

2013/01/18 15:16:56 rbh-sherpa@yack40[8625/11]: SHERPA | GC of dep files disabled: not removing /cea/cache_dep/rep1/rep2/rep3/rep4/HDep-n=Temps_u=s.v=f0000000000000000-v=f3fcc4db02b7a7355
2013/01/18 23:52:58 rbh-sherpa@yack40[8625/9]: SHERPA | GC of dep files disabled: not removing /cea/cache_dep/rep1/rep2/rep3/rep4/HDep-n=Temps_u=s.v=f0000000000000000-v=f3fcc4db02b7a7355

Could you explain the meaning of this message?

Comment by Kit Westneat (Inactive) [ 30/May/13 ]

Hi Lai,

One of our customers recently hit this bug as well. Unfortunately it is a secure site, so getting information is difficult. I thought perhaps it could have been a problem with the journal getting incorrectly replayed, but they have not had any reboots. We thought that perhaps the file had been created with an unprintable character in the name, but that is not the case. Hexdumps of the names are identical.

ls -i shows both files, but with the same inode number; I think this is because it does an lstat by name. We printed out the dirents, and those showed both files with different inode numbers.
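
For reference, a minimal standalone sketch of how the raw dirents can be dumped with their inode numbers (an illustration only, not the exact tool we used):

/* Print each entry's name and d_ino straight from readdir(), so duplicate
 * names show their distinct inode numbers (unlike ls -i, which stats by
 * name and resolves both entries to the same inode). */
#include <stdio.h>
#include <dirent.h>

int main(int argc, char **argv)
{
        DIR *d = opendir(argc > 1 ? argv[1] : ".");
        struct dirent *de;

        if (d == NULL) {
                perror("opendir");
                return 1;
        }
        while ((de = readdir(d)) != NULL)
                printf("%lu\t%s\n", (unsigned long)de->d_ino, de->d_name);
        closedir(d);
        return 0;
}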

The duplicate files still exist. Is there anything else we can look at before we delete the hidden one? I have asked for more details as to how the file is created, or if they can reproduce it, but I'm not sure that it will be possible.

Do you have any suggestions as to how to proceed? We are running Lustre 2.1.2.

Thanks.

Comment by Lai Siyao [ 05/Jun/13 ]

Andreas, could you take a look and give some suggestion?

Comment by Kit Westneat (Inactive) [ 05/Jun/13 ]

I was able to reproduce it on a single client (which was also the MDT) using a modified version of the reproducer in LU-3439. I used the debug daemon to get full debugging information and have attached it to the ticket, along with the reproducer and debugfs stats of the two files with the same filename.

Thanks.

Comment by Kit Westneat (Inactive) [ 05/Jun/13 ]

err I can't attach it, it's 20MB, but here is a link:
https://woscloud.corp.ddn.com/v2/files/L2YwMWU5NWI5ZGViNGRjOThiMmUxMWZmZGMyMzFiYjE0NQ==/content/inline/lu-2901.tgz

Comment by Kit Westneat (Inactive) [ 06/Jun/13 ]

Hi Lai,

I think you meant to respond to this LU, so I am posting here.

Here is what my directory looks like:
---------- 1 root root 5 Jun 5 09:31 test_file
---------- 1 root root 5 Jun 5 09:31 test_file

with -i:
144115205322835997 ---------- 1 root root 5 Jun 5 09:31 test_file
144115205322835997 ---------- 1 root root 5 Jun 5 09:31 test_file

Using debugfs on the MDT I can see that there is one with inode 45 and one with inode 25, the stat for them should both be in the tarball.

To reproduce it I ran:
rm * -f ; for x in `seq 1 8`; do ~/test & done

until I saw two files named test_file (or it ran out of files), and then I ran killall test.

I wasn't sure if the files needed to be kept open to reproduce it. I just now reproduced it after adding close(fd) to the main loop, so it seems as if it's not necessary.

I also modified the main loop to end after a certain number of iterations. I could reproduce it if the loop was run more than 1x, but not if it was only run once.

Comment by Andreas Dilger [ 07/Jun/13 ]

Attached is an updated and improved version of the test, named duplicate_name.c.

It seems that the race only occurs the first time the link is created, and all of the other loops are unnecessary.

I suspect the problem is some kind of locking problem (maybe pdirops?) during link() on the MDT not grabbing the right lock on the "test_file" name/hash properly. The original file can be created because its name is unique; then link() succeeds in linking to "test_file" in a few cases, perhaps because of a racy lookup on the client?
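
A simplified sketch of what the reproducer does (in the spirit of the attached duplicate_name.c; the actual attachment differs in details such as error handling):

/* Each process creates a uniquely named source file, then races to
 * link() it to the shared target name "test_file". Mode 0 and the
 * 5-byte contents match the listings earlier in this ticket. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
        const char *dir = argc > 1 ? argv[1] : ".";
        char src[4096], tgt[4096];
        int i;

        snprintf(tgt, sizeof(tgt), "%s/test_file", dir);

        for (i = 0; i < 10; i++) {
                int fd;

                snprintf(src, sizeof(src), "%s/test_file_XXXXXX", dir);
                fd = mkstemp(src);
                if (fd < 0) {
                        perror("mkstemp");
                        return 1;
                }
                write(fd, "test\n", 5);
                fchmod(fd, 0);
                close(fd);

                if (link(src, tgt) == 0) {
                        /* got the link: unlinking the longer unique name
                         * leaves a free slot in the directory block */
                        unlink(src);
                        return 0;
                }
                unlink(src);
        }
        return 0;
}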

Comment by Andreas Dilger [ 07/Jun/13 ]

Like Kit, I'm running this like:

rm $DIR/test_file*; for X in $(seq 8); do ./duplicate_name $DIR & done; wait; ls $DIR

The updated test exits immediately if it was able to create the link, and the remaining runs also exit after 10 loops. I'm able to reproduce this 100% of the time on master, sometimes getting as many as 4 links with the same name. It also works if the open(), write(), fchmod(), close() calls are removed, leaving only the link() call, and can be reproduced with only 2 threads (though not 100% of the time).

I'm attaching a full debug log from a two-thread run, which has also been patched to change LDLM_DEBUG() to print the lock resources in a format similar to DFID so they are easier to find in the logs, and also to include the name hash.

I suspect the problem to be in pdirops, or the MDT's usage of it in link, since normal ext4 wouldn't ever allow this I think.

Comment by Andreas Dilger [ 07/Jun/13 ]

Full debug log with improved LDLM_DEBUG() messages.

Comment by Lai Siyao [ 07/Jun/13 ]

Liang has confirmed that pdirops was first included in 2.3, and since this can be reproduced on 2.1, pdirops is not the culprit.

Comment by Andreas Dilger [ 07/Jun/13 ]

I think I have understood what is happening here. The link handling in the MDT is fundamentally unsafe, because it never checks whether the target name already exists. mdt_reint_link() is correctly looking up and locking the parent object and name hash, looking up and locking the source object, and calling mdd_link() -> ... -> osd_index_ea_insert() -> ... -> ldiskfs_add_entry(), but in ldiskfs there is no guarantee that the name being inserted is unique.

The ldiskfs code depends on the original caller (normally the VFS) to have done a lookup and returned -EEXIST in case of an existing target. add_dirent_to_buf() does return -EEXIST if it happens to find the name in the leaf block it is traversing to find an empty slot for the dirent, but if the name lies after an empty slot large enough to hold the dirent this will not happen. In the current test case, the original name of the source file is test_file_XXXXXX, and this longer name is unlinked by the first thread after its link("test_file") operation. That leaves an empty slot in the directory leaf block large enough for the second thread to also insert "test_file" into the directory. Locking the directory is not enough, since the name is not revalidated under the lock.
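
To make the failure mode concrete, the scan in add_dirent_to_buf() works roughly like the sketch below (illustration only, not the real ext4/ldiskfs code; name_matches(), free_space_in() and next_dirent() are placeholder helpers):

/* Walk the dirents in the leaf block looking for a slot large enough
 * for the new name; -EEXIST is only reported if the matching name is
 * passed before that slot is found. */
de = (struct ext4_dir_entry_2 *)buf;
while ((char *)de < buf + blocksize) {
        if (de->inode && name_matches(de, name, namelen))
                return -EEXIST;        /* duplicate noticed while scanning */
        if (free_space_in(de) >= needed)
                break;                 /* big enough hole: insert here */
        de = next_dirent(de);
}
/* If unlinking "test_file_XXXXXX" left a hole large enough for the new
 * dirent before the existing "test_file" entry, the scan stops at the
 * hole and a second "test_file" is inserted without ever seeing the
 * first one. */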

Alex, I'm not sure at which level the name to be inserted should be checked? Is this something that should be handled at the MDD layer, or the OSD layer? It looks like ZFS is already checking that the name being inserted into the directory ZAP is unique, and this is also common for e.g. databases to allow checking that the inserted key is unique, so it makes sense that this be handled inside osd-ldiskfs. The add_dirent_to_buf() code is already doing this for some of the cases, but I am reluctant to patch ext4 further in this area. I'm also aware that you don't like extra "unnecessary" checks at the OSD level for operations that should be verified at the higher layers, so please provide some input on where you think this should be checked.

I've pushed two patches for this bug.

http://review.whamcloud.com/6591
A new sanity.sh test_31o case which can reproduce the problem. It is just a shell script, no need for a separate program. It reproduces 100% of the time on ldiskfs with 8 threads, and not once in several hundred runs on ZFS, which bears out the above analysis. It should be used as the basis for any fix, though the test case needs to be excluded when run against older server versions, like below (I didn't include this in the current test case so that it would run while developing the fix):

        if [ $(lustre_version_code $SINGLEMDS) -lt $(version_code 2.1.6) ] ||
           [ $(lustre_version_code $SINGLEMDS) -ge $(version_code 2.2.0) -a \
             $(lustre_version_code $SINGLEMDS) -lt $(version_code 2.4.1) ]; then
                skip "Need MDS version at least 2.1.6 or 2.4.1"
                return 0
        fi

http://review.whamcloud.com/6592
The second patch cleans up the rest of the code to use the new DLDLMRES/PLDLMRES macros when printing out resource names, so that they appear in messages and logs in a consistent format. This also exposed a bug in the ll_md_blocking_ast() callback, which has been around since http://review.whamcloud.com/2271 landed (LU-1198), but is "not a bug yet" because it only affects unused fields in the FID and DLM resource so I didn't split it into a separate commit.

Comment by Lai Siyao [ 08/Jun/13 ]

The VFS actually checks the target for link: sys_linkat -> lookup_create. But the Lustre client ll_lookup_nd() (O_CREATE && !O_OPEN) doesn't really do a lookup on the server; it creates a temporary dentry and returns, letting the real operation (e.g. link/mknod) report errors like -EEXIST later. But unfortunately the MDS doesn't perform this check for link.
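
As an aside, the normal VFS contract is easy to check on a local filesystem: link(2) to a name that already exists fails with EEXIST. A minimal sketch (the filenames are placeholders):

/* Quick userspace check of the usual VFS behaviour: link(2) to an
 * existing target name must fail with EEXIST. */
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        close(open("src_a", O_CREAT | O_WRONLY, 0644));
        close(open("dst_b", O_CREAT | O_WRONLY, 0644));

        if (link("src_a", "dst_b") < 0 && errno == EEXIST)
                printf("link() correctly refused: EEXIST\n");
        else
                printf("unexpected: link() did not return EEXIST\n");
        return 0;
}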

Comment by Andreas Dilger [ 11/Jun/13 ]

Alex, I'm not sure at which level the name to be inserted should be checked? Is this something that should be handled at the MDD layer, or the OSD layer? It looks like ZFS is already checking that the name being inserted into the directory ZAP is unique, and this is also common for e.g. databases to allow checking that the inserted key is unique, so it makes sense that this be handled inside osd-ldiskfs. The add_dirent_to_buf() code is already doing this for some of the cases, but I am reluctant to patch ext4 further in this area. I'm also aware that you don't like extra "unnecessary" checks at the OSD level for operations that should be verified at the higher layers, so please provide some input on where you think this should be checked.

Comment by Alex Zhuravlev [ 11/Jun/13 ]

yes, I agree it's better to be done in OSD, especially given ldiskfs does this partially. we were going to introduce ->update() which would make ->insert() returning -EEXIST well justified?

Comment by Alex Zhuravlev [ 12/Jun/13 ]

hmm, if we just add yet another lookup in ->insert(), then mdd_create() will be doing two lookups and that should affect performance?

Comment by Andreas Dilger [ 13/Jun/13 ]

Yes, I was wondering about this also. At one point I thought it was just a dcache lookup, but it seems the dentry being used is totally fake.

The most efficient way to do this would be to just scan the whole ldiskfs leaf block when doing the insert, since add_dirent_to_buf() is already scanning it looking for free space. It is most likely going to scan most of the used part of the block anyway, so this would add relatively little overhead. It would be good to structure this change so it is only done if being called by Lustre (e.g. if ldiskfs_dentry_param is passed). That gives us some hope of having it accepted upstream.

Comment by Alex Zhuravlev [ 13/Jun/13 ]

probably we'd have to scan more than one block if they share the same hash value? that should happen too rarely to affect performance though.

Comment by Lai Siyao [ 13/Jun/13 ]

IMHO a one-line check before mdo_link() in mdt_reint_link() could fix this: check that the return value of mdt_lookup_version_check() for the target is -ENOENT, and otherwise return -EEXIST (the same as mdt_md_create() does).

Currently this check is only missing for the link operation; if the check should instead only be done in ldiskfs, then other operations should follow that rule too.

Comment by Lai Siyao [ 14/Jun/13 ]

I updated the patch at http://review.whamcloud.com/#change,6591 with the approach I described above.

Comment by Bob Glossman (Inactive) [ 17/Jun/13 ]

back port to b2_1:
http://review.whamcloud.com/#/c/6678

This back port leaves out the first 2 files from the original patch as advised by Oleg.

Comment by Sebastien Buisson (Inactive) [ 19/Jun/13 ]

Hi,

I think we need some clarification regarding the various patches proposed here. If I understand correctly, we have:

Does it mean that the patch that really fixes the issue is missing from b2_1?

Thanks in advance,
Sebastien.

Comment by Bob Glossman (Inactive) [ 19/Jun/13 ]

Sebastien,
The 2nd patch on master, #6592, only changes log and error messages. It uses new macros that were defined in the master version of the first patch but were left out of the b2_1 back port, which makes it hard to back port the 2nd patch.

Only the first patch, #6591, contains the actual functional fix, so that was all that was back ported.

Comment by Sebastien Buisson (Inactive) [ 20/Jun/13 ]

Thanks for the clarification, Bob!

Comment by Bob Glossman (Inactive) [ 13/Aug/13 ]

backports to b2_4:
http://review.whamcloud.com/7315
http://review.whamcloud.com/7316

Comment by Peter Jones [ 20/Aug/13 ]

Seems fixes have landed where needed
