[LU-2901] Duplicate filename on the same ldiskfs directory on MDS Created: 04/Mar/13 Updated: 20/Aug/13 Resolved: 20/Aug/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.1.3, Lustre 2.1.6, Lustre 2.4.1 |
| Fix Version/s: | Lustre 2.1.6, Lustre 2.4.1, Lustre 2.5.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Diego Moreno (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 6988 | ||||||||
| Description |
|
We meet 3 times on 3 differents lustre MDS ldiskfs filesystem 2012 Dec 4 10:59:22 bigfoot2 : cdep4-MDT0000 Starting fsck In this case, running the ls command we can see the issue: total 0 Then running debugfs after the fsck we can see that both 2 files [root@bigfoot2 ~]# /usr/lib/lustre/debugfs /dev/mapper/da1vg0_mdt debugfs: stat BDE_UCD.00000-00001 Another trace can be found attached to this ticket. Is this a known issue? |
| Comments |
| Comment by Peter Jones [ 04/Mar/13 ] |
|
Lai Could you please comment on this one? Thanks Peter |
| Comment by Lai Siyao [ 13/Mar/13 ] |
|
My understanding of this issue is:
Please correct me if I misunderstood, thanks. |
| Comment by Lai Siyao [ 14/Mar/13 ] |
|
Could you list the version of e2fsprogs on your system? |
| Comment by Lai Siyao [ 14/Mar/13 ] |
|
Do you know how file 'BDE_UCD.00000-00001' is created? What operations has been done on this file? Is it created multiple times? |
| Comment by Diego Moreno (Inactive) [ 15/Mar/13 ] |
|
The e2fsprogs version is: 1.42.3.wc3. This issue is difficult to reproduce/trace as it raises up from time to time and it's only detected when lfsck is run once a month. It comes from an internal IO library doing lots of IOs. From my understanding lfsck is fine and the issue is as follows:
The question is: how could a duplicate filename arrive? |
| Comment by Diego Moreno (Inactive) [ 15/Mar/13 ] |
|
I forgot to mention that both files have the same name and size but different attributes (ctime, mtime) and fid. |
| Comment by Lai Siyao [ 15/Mar/13 ] |
|
One possible cause is that on the second create, it fails to lookup by filename, so that file with the same name is created twice. But it's weird that this happens with the same name 'BDE_UCD.00000-00001' on two different directories: 'test9' and 'test9_pscc'. I've tested with the same directory structure, but nothing went wrong. I'll see is it possible to make a debug patch to catch such case, and print message when it happens. |
| Comment by Lai Siyao [ 03/May/13 ] |
|
I'm afraid this is not a lustre specific bug, because even lustre code forgets check duplicate dir entry before inserting, ldiskfs (ext4) will complain and refuse insertion. And I can't think of a way to catch such error in lustre code. BTW in the attached trace file, I see such log messages: 2013/01/18 15:16:56 rbh-sherpa@yack40[8625/11]: SHERPA | GC of dep files disabled: not removing /cea/cache_dep/rep1/rep2/rep3/rep4/HDep-n=Temps_u=s.v=f0000000000000000-v=f3fcc4db02b7a7355 2013/01/18 23:52:58 rbh-sherpa@yack40[8625/9]: SHERPA | GC of dep files disabled: not removing /cea/cache_dep/rep1/rep2/rep3/rep4/HDep-n=Temps_u=s.v=f0000000000000000-v=f3fcc4db02b7a7355 Could you explain the meaning of this message? |
| Comment by Kit Westneat (Inactive) [ 30/May/13 ] |
|
Hi Lai, One of our customers recently hit this bug as well. Unfortunately it is a secure site, so getting information is difficult. I thought perhaps it could have been a problem with the journal getting incorrectly replayed, but they have not had any reboots. We thought that perhaps the file had been created with an unprintable character in the name, but that is not the case. Hexdumps of the names are identical. ls -i shows both files, but has the same inode number. I think this is because it does an lstat by name. We printed out the dirents and those showed both files with different inode numbers. The duplicate files still exist. Is there anything else we can look at before we delete the hidden one? I have asked for more details as to how the file is created, or if they can reproduce it, but I'm not sure that it will be possible. Do you have any suggestions as to how to proceed? We are running Lustre 2.1.2. Thanks. |
| Comment by Lai Siyao [ 05/Jun/13 ] |
|
Andreas, could you take a look and give some suggestion? |
| Comment by Kit Westneat (Inactive) [ 05/Jun/13 ] |
|
I was able to reproduce it on a single client (which was also the MDT) using a modified version of the reproducer in Thanks. |
| Comment by Kit Westneat (Inactive) [ 05/Jun/13 ] |
|
err I can't attach it, it's 20MB, but here is a link: |
| Comment by Kit Westneat (Inactive) [ 06/Jun/13 ] |
|
Hi Lai, I think you meant to respond to this LU, so I am posting here. Here is what my directory looks like: with -i: Using debugfs on the MDT I can see that there is one with inode 45 and one with inode 25, the stat for them should both be in the tarball. To reproduce it I ran: until I saw two files named test_file (or it ran out of files), and then I ran killall test. I wasn't sure if the files needed to be kept open to reproduce it. I just now reproduced it after adding close(fd) to the main loop, so it seems as if it's not necessary. I also modified the main loop to end after a certain number of iterations. I could reproduce it if the loop was run more than 1x, but not if it was only run once. |
| Comment by Andreas Dilger [ 07/Jun/13 ] |
|
Attached is an updated and improved version of the test, named duplicate_name.c. It seems that the race is only in the first time this link tries to be created, and all of the other loops are useless. I suspect the problem is some kind of locking problem (maybe pdirops?) during link() on the MDT not grabbing the right lock on the "test_file" name/hash properly. The original file can be created due to its unique name, then link() succeeds in linking to "test_file" in a few cases if there was a racy lookup on the client? |
| Comment by Andreas Dilger [ 07/Jun/13 ] |
|
Like Kit, I'm running this like: rm $DIR/test_file*; for X in $(seq 8); do ./duplicate_name $DIR & done; wait; ls $DIR The updated test exits immediately if it was able to create the link, and the remaining runs also exit after 10 loops. I'm able to reproduce this 100% of the time on master, sometimes getting as many as 4 links with the same name. It also works if the open(), write(), fchmod(), close() calls are removed, leaving only the link() call, and can be reproduced with only 2 threads (though not 100% of the time). I'm attaching a full debug log from a two-thread run, which has also been patched to change LDLM_DEBUG() to print the lock resources in a format similar to DFID so they are easier to find in the logs and also include the the name hash. I suspect the problem to be in pdirops, or the MDT's usage of it in link, since normal ext4 wouldn't ever allow this I think. |
| Comment by Andreas Dilger [ 07/Jun/13 ] |
|
Full debug log with improved LDLM_DEBUG() messages. |
| Comment by Lai Siyao [ 07/Jun/13 ] |
|
Liang has confirmed that pdirops is firstly included from 2.3, and this is can be reproduced on 2.1, pdirops is not the culprit. |
| Comment by Andreas Dilger [ 07/Jun/13 ] |
|
I think I have understood what is happening here. The link handling in the MDT is fundamentally unsafe, because it is never checking whether the target name already exists or not. mdt_reint_link() is correctly looking up and locking the parent object and name hash, looking up and locking the source object, and calling the mdd_link() The ldiskfs code depends on the original caller (normally the VFS) to have done a lookup and return -EEXIST in case of an existing target. It happens that add_dirent_to_buf() does return -EEXIST if it happens to find the name in the leaf block it is traversing to find an empty slot for the dirent, but if the name is after an empty slot large enough to hold the dirent this will not happen. In the current test case, the original name of the source file is test_file_XXXXXX and this longer name is unlinked by the first thread doing the link("test_name") operation. That leaves an empty slot in the directory leaf block large enough for the second thread to also insert "test_name" into the directory. The locking of the directory is not enough, since the name is not revalidated under the lock. Alex, I'm not sure at which level the name to be inserted should be checked? Is this something that should be handled at the MDD layer, or the OSD layer? It looks like ZFS is already checking that the name being inserted into the directory ZAP is unique, and this is also common for e.g. databases to allow checking that the inserted key is unique, so it makes sense that this be handled inside osd-ldiskfs. The add_dirent_to_buf() code is already doing this for some of the cases, but I am reluctant to patch ext4 further in this area. I'm also aware that you don't like extra "unnecessary" checks at the OSD level for operations that should be verified at the higher layers, so please provide some input on where you think this should be checked. I've pushed two patches for this bug. http://review.whamcloud.com/6591 if [ $(lustre_version_code $SINGLEMDS) -lt $(version_code 2.1.6) ] || [ $(lustre_version_code $SINGLEMDS) -ge $(version_code 2.2.0) -a $(lustre_version_code $SINGLEMDS) -lt $(version_code 2.4.1) ]; then skip "Need MDS version at least 2.1.6 or 2.4.1" return 0 fi http://review.whamcloud.com/6592 |
| Comment by Lai Siyao [ 08/Jun/13 ] |
|
VFS actually checks target for link: sys_linkat -> lookup_create. But lustre client ll_lookup_nd(O_CREATE && !O_OPEN) doesn't really lookup on server, but create a temp dentry and return, and let the real operation, eg. link/mknod later to check error like -EEXIST. But unfortunately MDS doesn't check this for link. |
| Comment by Andreas Dilger [ 11/Jun/13 ] |
|
Alex, I'm not sure at which level the name to be inserted should be checked? Is this something that should be handled at the MDD layer, or the OSD layer? It looks like ZFS is already checking that the name being inserted into the directory ZAP is unique, and this is also common for e.g. databases to allow checking that the inserted key is unique, so it makes sense that this be handled inside osd-ldiskfs. The add_dirent_to_buf() code is already doing this for some of the cases, but I am reluctant to patch ext4 further in this area. I'm also aware that you don't like extra "unnecessary" checks at the OSD level for operations that should be verified at the higher layers, so please provide some input on where you think this should be checked. |
| Comment by Alex Zhuravlev [ 11/Jun/13 ] |
|
yes, I agree it's better to be done in OSD, especially given ldiskfs does this partially. we were going to introduce ->update() which would make ->insert() returning -EEXIST well justified? |
| Comment by Alex Zhuravlev [ 12/Jun/13 ] |
|
hmm, if we just add yet another lookup in ->insert(), then mdd_create() will be doing two lookups and that should affect performance? |
| Comment by Andreas Dilger [ 13/Jun/13 ] |
|
Yes, I was wondering about this also. At one point I thought it was just a dcache lookup, but it seems the dentry being used is totally fake. The most efficient way to do this would be to just scan the whole ldiskfs leaf block when doing the insert, since add_dirent_to_buf() is already scanning it looking for free space. It is most likely going to scan most of the used part of the block anyway, so this would add relatively little overhead. It would be good to structure this change so it is only done if being called by Lustre (e.g. if ldiskfs_dentry_param is passed). That gives us some hope of having it accepted upstream. |
| Comment by Alex Zhuravlev [ 13/Jun/13 ] |
|
probably we'd have to scan more than 1 block if they share same hash value? should happen too rare to affect performance though. |
| Comment by Lai Siyao [ 13/Jun/13 ] |
|
IMHO a one-line check before mdo_link() in mdt_reint_link() could fix this: check return value of mdt_lookup_version_check() for target is -ENOENT, otherwise -EEXIST is returned. (it's the same as mdt_md_create()). Now this check is only missing for link operation, if this check should only be done in ldiskfs, other operations should follow this rule too. |
| Comment by Lai Siyao [ 14/Jun/13 ] |
|
I updated the patch of http://review.whamcloud.com/#change,6591 with the way I described above. |
| Comment by Bob Glossman (Inactive) [ 17/Jun/13 ] |
|
back port to b2_1: This back port leaves out the first 2 files from the original patch as advised by Oleg. |
| Comment by Sebastien Buisson (Inactive) [ 19/Jun/13 ] |
|
Hi, I think we need clarification regarding the various patches proposed in here. If I understand correctly, we have:
Does it mean that the patch that really fixes the issue is missing in b2_1? Thanks in advance, |
| Comment by Bob Glossman (Inactive) [ 19/Jun/13 ] |
|
Sebastian, Only the first patch, #6591, contains actual functional fixes so that was all that was back ported. |
| Comment by Sebastien Buisson (Inactive) [ 20/Jun/13 ] |
|
Thanks for the clarification, Bob! |
| Comment by Bob Glossman (Inactive) [ 13/Aug/13 ] |
|
backports to b2_4: |
| Comment by Peter Jones [ 20/Aug/13 ] |
|
Seems fixes have landed where needed |