[LU-4185] Incorrect permission handling when creating existing directories at ICHEC Created: 29/Oct/13 Updated: 03/Nov/17 Resolved: 06/Dec/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.1.4, Lustre 2.5.0, Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Rajeshwaran Ganesan | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | None | ||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 11322 | ||||||||||||||||
| Description |
|
One of our customers is hitting what appears to be
The bug was originally reported in bz 23459: The patch which appears to be the root cause was added as part of bz 18534: Thanks. |
| Comments |
| Comment by Peter Jones [ 29/Oct/13 ] |
|
Bobijam Could you please assist with this one? Thanks Peter |
| Comment by Zhenyu Xu [ 30/Oct/13 ] |
|
This seems relate to certain kernel version, I tried linux-2.6.32-358.23.2.el6 with current master, which also has the same ll_lookup_nd code as in 2.1.4 lustre, and I've tried both reproducer c code in bz23459, both test passed without the error, all of the mkdir return the expected errno 17 (EEXIST). What kernel are you using? I'll try it with 2.1.4 lustre code. |
| Comment by Kit Westneat (Inactive) [ 30/Oct/13 ] |
|
It looks like it's actually 2.1.5 on the servers, but 2.1.4 on the clients. Here are the kernel versions: client: mds: |
| Comment by Zhenyu Xu [ 30/Oct/13 ] |
|
can you collect -1 logs of the client and the mds? |
| Comment by Zhenyu Xu [ 31/Oct/13 ] |
|
I downloaded 3.0.74 kernel and failed build 2.1.4 lustre client upon it. (We don't support 3.0 kernel on 2.1.x lustre yet). Can you teach me how to build it? Also I'd still like to have -1 logs of the client and the mds to do log analysis. Thank you. |
| Comment by Kit Westneat (Inactive) [ 31/Oct/13 ] |
|
I think it's actually the SLES11 SP2 kernel: We are working on getting the lnet.debug=-1 logs. |
| Comment by Enrico Tagliavini - ICHEC [ 01/Nov/13 ] |
|
Hi I'm the ICHEC Systems Administrator reporting the problem. I want to point out the directory torque tries to create doesn't exist: The lustre filesystem on the client is mounted at /ichec/work and the torque tmpdir is set to /ichec/work/scratch/tmp/ which means torque will try to create /ichec/work/scratch/tmp/$JOB_ID which is unique, thus not existing. The C code snipped failing from the version of torque we are using (4.2.5) should be: part = strtok(path, "/"); if (part == NULL) { rc = -1; goto done; }(part - 1) = '/'; / leading '/' */ while ((part = strtok(NULL, "/")) != NULL) } *(part - 1) = '/'; /* very last component */ if (mkdir(path, mode) == -1) } This code tries to create the whole tree first, in case it doesn't exist. In our case only the very last component is the only one not existing. Note that if I tried too the reproduces from LU #1101 and I failed reproducing the bug with it for now. The only way to be 100% sure the bug will show itself is to reboot the node and submit a torque job. |
| Comment by Enrico Tagliavini - ICHEC [ 01/Nov/13 ] |
|
Update: I found a way to reproduce the issue using the reproducer code from the original bz entry, and I was wrong about the direcotry triggering the problem: it the already existing one. Sorry for the confusion. I guess I run the command the wrong way in my previous attempts. enrico@r3i1n14:~/mkdirbug/reproduce> ./reproducer /ichec/work/scratch/tmp/ Python can reproduce this easily as well: enrico@r3i1n14:~/mkdirbug/reproduce> python |
| Comment by Bob Glossman (Inactive) [ 01/Nov/13 ] |
|
I'm trying to reproduce but can't get it to happen. No matter what I do I only see errno 17, EEXIST. I never get errno 13, EACCESS. I've tried both sles11sp2 clients and centos6.4 clients and see the same behavior. What are the permissions in all the intermediate dirs in your path? I note that all your stats show st_uid=0, st_gid=0. I assume that means you are running and creating all your dirs as root? If you could just be a little more clear of the exact sequence of cmds to reproduce the problem it would be a big help. |
| Comment by Enrico Tagliavini - ICHEC [ 01/Nov/13 ] |
|
Sorry, following are more details. If you need something else ask
enrico@r3i0n0:~> mount | grep 'type lustre' >>> from os import stat This is the simplest way to reproduce the issue. Permission of the tmpdir Using the C program reproducer (updated version, more readable) from the original BZ: enrico@r3i0n1:~/mkdirbug/reproduce> pwd ERROR: inconsistency detected: previous rc: 13 vs current rc: 0 The error doesn't happen if you are executing the program from the tmpdir (obviously): Thank you |
| Comment by Bob Glossman (Inactive) [ 01/Nov/13 ] |
|
Could you please attach the source of your C program reproducer? I might find that easier to work with and a stronger guarantee that I'm trying the exact same thing as you. I note your mount flags show _netdev. I'm not familiar with that option. What is it? Doubt it's related but I want to pin down all the diffs between you and me. |
| Comment by Enrico Tagliavini - ICHEC [ 01/Nov/13 ] |
|
This is the source code for the reproducer |
| Comment by Bob Glossman (Inactive) [ 01/Nov/13 ] |
|
Thank you for your reproducer. Seems to run fine for me, no error: # ./reproducer /mnt/lustre/tmp Iteration: 1 Iteration: 2 doing stat before creating directory Iteration: 3 Iteration: 4 doing stat before creating directory Iteration: 5 Iteration: 6 doing stat before creating directory Iteration: 7 Iteration: 8 doing stat before creating directory Iteration: 9 I am running as root. I am running on a current sles11sp2 client. Will try rolling back to a 2.1 lustre version client to see if the problem is version specific. |
| Comment by Kit Westneat (Inactive) [ 01/Nov/13 ] |
|
Here are the debug=-1 logs that requested previously: Is there any reason to think that it's not the same issue identified by Hongchao in bz 23459? It appears that the potentially incorrect namedata lookup shortcut is still in llite/namei.c. |
| Comment by Zhenyu Xu [ 01/Nov/13 ] |
|
I think that patch can cure this issue. Just want to understand further. |
| Comment by Kit Westneat (Inactive) [ 01/Nov/13 ] |
|
Ok cool, just want to make sure we are all on the same page. I was looking at the client dk and I think that code path is hit here: 00000080:00002000:0.0:1383306592.567518:0:4982:0:(dcache.c:116:ll_dcompare()) found name tmp(ffff880840b64500) flags 0x20e010 refc 1 00000080:00000001:0.0:1383306592.567519:0:4982:0:(dcache.c:123:ll_dcompare()) Process leaving (rc=1 : 1 : 1) 00000080:00000001:0.0:1383306592.567520:0:4982:0:(namei.c:571:ll_lookup_nd()) Process entered 00000080:00000001:0.0:1383306592.567521:0:4982:0:(dcache.c:199:ll_set_dd()) Process entered 00000080:00002000:0.0:1383306592.567521:0:4982:0:(dcache.c:204:ll_set_dd()) ldd on dentry tmp (ffff88083d4152c0) parent ffff88084c088180 inode (null) refc 1 00000080:00000010:0.0:1383306592.567522:0:4982:0:(dcache.c:209:ll_set_dd()) kmalloced 'lld': 104 at ffff88084c721dc0. 00000080:00000001:0.0:1383306592.567523:0:4982:0:(dcache.c:222:ll_set_dd()) Process leaving (rc=0 : 0 : 0) 00000080:00000001:0.0:1383306592.567523:0:4982:0:(namei.c:593:ll_lookup_nd()) Process leaving (rc=0 : 0 : 0) I can't tell if it's returning EPERM because of that though. |
| Comment by Zhenyu Xu [ 04/Nov/13 ] |
|
I think it could relates to the discretionary access control of the fs, if the running task does not has CAP_DAC_OVERRIDE capability (i.e. override DAC access), this issue could happen. int generic_permission(struct inode *inode, int mask, int (*check_acl)(struct inode *inode, int mask)) { int ret; /* * Do the basic POSIX ACL permission checks. */ ret = acl_permission_check(inode, mask, check_acl); if (ret != -EACCES) return ret; /* * Read/write DACs are always overridable. * Executable DACs are overridable if at least one exec bit is set. */ if (!(mask & MAY_EXEC) || execute_ok(inode)) if (capable(CAP_DAC_OVERRIDE)) return 0; /* * Searching includes executable on directories, else just read. */ mask &= MAY_READ | MAY_WRITE | MAY_EXEC; if (mask == MAY_READ || (S_ISDIR(inode->i_mode) && !(mask & MAY_WRITE))) if (capable(CAP_DAC_READ_SEARCH)) return 0; return -EACCES; } |
| Comment by Zhenyu Xu [ 04/Nov/13 ] |
|
with patch in bz 23459, the -EEXIST would be detected before this permission checking. |
| Comment by Zhenyu Xu [ 04/Nov/13 ] |
|
Hi Oleg, Could you take a look, I am still confused what the root cause of the -EACCES (I think its from the checking of the parent of the last dir dentry, but I don't know what the relationship it has with the existing dir, if we are to create non-existing dir, the same permission checking path still will go through). |
| Comment by Kit Westneat (Inactive) [ 04/Nov/13 ] |
|
I think the issue is that the application code (Torque) uses mkdir to check for existence, even if it does not have permission to create the directory. Lustre believes that the directory doesn't exist, even though it does, so it does a permissions check to see if Torque can create the directory, which fails. These applications require that mkdir return EEXIST before checking permissions. Bob, I think you need to run the reproducer as a non-privileged user in order for it to work. The directory should exist, but the user should not have permission to create it. The expected error is EEXIST, but because of the bug you will get EACCESS. Here are the instructions for how to use the reproducer from the bugzilla ticket linked earlier:
|
| Comment by Kit Westneat (Inactive) [ 07/Nov/13 ] |
|
Any updates? Should we try the patch from bz 23459? |
| Comment by Zhenyu Xu [ 07/Nov/13 ] |
|
yes, I think you can try that patch. But I'm not sure to include it in the source tree, since it would kill a little open & create performance, while just to return different error code for creating existing directory scenario. Let's try it first, to see whether it can handle the issue which torque meets. |
| Comment by Kit Westneat (Inactive) [ 07/Nov/13 ] |
|
Ok, sounds good. I wonder what the performance hit is, is there any automated mdtest'ing done? I'll let you know how it goes. |
| Comment by Patrick Farrell (Inactive) [ 12/Nov/13 ] |
|
Just a quick note - We're seeing this as well at Cray, both on 2.4.1 on SLES and master on CentOS. I've got a simpler reproducer as well. As root: If you then do ls -l and try again, you get EEXIST as expected. I'll test the suggested patch against master. |
| Comment by Bob Glossman (Inactive) [ 12/Nov/13 ] |
|
Patrick, |
| Comment by Patrick Farrell (Inactive) [ 12/Nov/13 ] |
|
Bob, Interesting! The version of master I tested against was slightly out of date. I'll rebuild and test again with the latest. For what it's worth, my testing of master was on CentOS. Also, I forgot to specify, but the SLES I used with 2.4.1 was SLES11SP2 (as in your test). |
| Comment by Patrick Farrell (Inactive) [ 12/Nov/13 ] |
|
Bob, I just tested with a fresh pull of master on CentOS and I'm still seeing the issue as described. The patch from bz 23459 does fix my reproducer. Still, as noted about, there's a possible performance impact. |
| Comment by Bob Glossman (Inactive) [ 12/Nov/13 ] |
|
Patrick, You are entirely correct. I must have mistyped something during my master test. Reran the experiment. Now seeing the problem reproduce with both 2.4.1 and master sles11sp2 clients. EPERM reported instead of EEXIST in both cases. |
| Comment by Bob Glossman (Inactive) [ 12/Nov/13 ] |
|
Have also confirmed the same failure can be seen with current master on sles11sp3 clients as well. EPERM instead of EEXIST reported there too. I believe this is good evidence that the problem really is in lustre client code, is still there in current master, and the specific distro or kernel version doesn't matter. |
| Comment by Zhenyu Xu [ 13/Nov/13 ] |
|
The patch is tracking at http://review.whamcloud.com/8257, but I'm not sure to include it in the source tree, since it would kill a little open & create performance, while just to return a different error code for creating existing directory scenario. |
| Comment by Enrico Tagliavini - ICHEC [ 13/Nov/13 ] |
|
Hi Zhenyu, mkdir() of an existing directory return EEXIST on common filesystems, whatever the permission is. I have no idea if there is a standard mandating that or it is just common usage, but if lustre is going to make something different it will break a lot of workflows, for the good of performances. A lot of applications will need to add a stat before the creation useful only when lustre is the underlaying file system, leading to a performance hit anyway. From the performance PoV this is indeed not the ideal solution, but if you are trying to make a directory I think you really want to know first if it exists already, before knowing if you have the right permission to create it or not. Kind regards |
| Comment by Patrick Farrell (Inactive) [ 29/Aug/14 ] |
|
One additional thing about this that bothers me: It would be nice to have this addressed: |
| Comment by Patrick Farrell (Inactive) [ 16/Jul/15 ] |
|
Yuck - I recommend abandoning this change. I just took a closer look at the performance impact, and it's bad. Below are single node mdtest results. The directory creation is what's interesting - As that is create without open, which is the case we're optimizing away (and the patch would undo that optimization). File creates are create+open, and as expected, the performance doesn't change. Without the fix in place: With the fix in place, same node, same system: Like I said, 30-50% drop in performance. Yuck. |
| Comment by David Trudgian [ 15/Feb/16 ] |
|
Hi, We are a DDN customer, who recently hit this issue with (I believe) 2 packages. Firstly, running MATLAB 2014b and later we receive 'Permission Denied' errors when using MATLAB's built-in mkdir function to create a nested directory structure on our lustre fs. Using MATLAB's system command to call out to shell 'mkdir -p' always works without issue. Mathworks investigated for us, but indicated they can't support lustre. MATLAB's in-built mkdir works prior to 2014a, so (after coming across the 2nd example) I think it's related to this The second occurrence thing that led me here was failing output directory creation for the cufflinks bioinformatics tool on the lustre fs. Code is available, and I was able to take their implementation of mkpath (for recursive directory creation) and see that it failed trying to run mkdir on an existing path - receiving EPERM not EEXIST. The relevant code is here: https://github.com/cole-trapnell-lab/cufflinks/blob/master/src/common.cpp Search mkpath, currently line 262. I can make a patch to cufflinks to run a stat before attempting a mkdir, rather than just relying on mkdir returning EEXIST. However, it would be nice to know if there is ongoing work to try and address this. Thanks. |
| Comment by David Trudgian [ 15/Feb/16 ] |
|
Issue ref for the cufflinks package we have the problem with: https://github.com/cole-trapnell-lab/cufflinks/issues/53 |
| Comment by Zhenyu Xu [ 01/Mar/16 ] |
|
updated the http://review.whamcloud.com/8257 to address this issue while not impacting the performance. |
| Comment by Zhenyu Xu [ 01/Mar/16 ] |
|
Though the updated patch can solve this issue, it raises another issue which is that it will ignore ACL permission failure opon directories. |
| Comment by Zhenyu Xu [ 01/Mar/16 ] |
|
Updated the patch which adds a client proc entry to control whether bypass this optimization, by default, this optimization is turned on. And with special applications which only check EEXIST return value of mkdir, we can turn it off by "lctl set_param" command. $ lctl set_param llite.<client_uuid>.create_no_open_optimization=0 |
| Comment by Andreas Dilger [ 02/Jun/16 ] |
|
It would be worthwhile to check if patch http://review.whamcloud.com/20354 " |
| Comment by Oleg Drokin [ 06/Jul/16 ] |
|
I tried to read the comments to get a better handle on this issue. The reproducer from Cray is basically: This seems to be a pretty narrow case, though. |
| Comment by Oleg Drokin [ 06/Jul/16 ] |
|
I guess if somebody confirm that the case is distilled to the apps that basically get a work path in the form of /mnt/lustre/homes/user/somedir/app_folder where user only has write permissions from user onward, but the app does the equivalent of: This way we do not penalize all the important fast cases being: The |
| Comment by Gerrit Updater [ 06/Jul/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/21169 |
| Comment by Patrick Farrell (Inactive) [ 06/Jul/16 ] |
|
Oleg - I think you've captured the problem as I understand end users are reporting it (as well as the Cray reproducer). One question - Which Gerrit link are we using? Bobi Jam has updated to a patch like yours as well. |
| Comment by Oleg Drokin [ 06/Jul/16 ] |
|
Bobi's patch has a test which is better than mine that does not. |
| Comment by Oleg Drokin [ 06/Jul/16 ] |
|
Patrick: It would be great if you can verify the current patch (Either one) is ok in majority of operations like mdtest and such. |
| Comment by Oleg Drokin [ 07/Jul/16 ] |
|
BTW, interesting thing, apparently NFS has the same problem as Lustre and I have not heard any complaints. [green@fedora1 crash]$ mkdir aaa mkdir: cannot create directory 'aaa': Permission denied [green@fedora1 crash]$ mkdir lost+found mkdir: cannot create directory 'lost+found': Permission denied [green@fedora1 crash]$ ls -ld lost+found drwx------ 2 root root 16384 May 25 2013 lost+found [green@fedora1 crash]$ mkdir lost+found mkdir: cannot create directory 'lost+found': File exists |
| Comment by Oleg Drokin [ 09/Jul/16 ] |
|
After some discussion with the kernel guys, it looks like EEXIST is not a requirement, so the applications that depend on it are broken. See http://marc.info/?l=linux-kernel&m=146803277310677&w=2 and http://marc.info/?l=linux-nfs&m=146803381011004&w=2 So I guess it's a good idea to file bugreports with those apps even if we have this also adjusted in Lustre. |
| Comment by Gerrit Updater [ 05/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/8257/ |
| Comment by Joseph Gmitter (Inactive) [ 06/Dec/16 ] |
|
In discussion with Bobijam, this issue is resolved with the landing of https://review.whamcloud.com/#/c/8257/ |