<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 03:07:01 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-14121] EACCES Permission denied for slurm log files when using nodemap with admin=0</title>
                <link>https://jira.whamcloud.com/browse/LU-14121</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Hello,&lt;br/&gt;
We have recently deployed new filesystems running 2.12.5 and our users are hitting a problem with Slurm jobs which write the slurm log files onto the new filesystems.&lt;/p&gt;

&lt;p&gt;A bit of context on our environment:&lt;/p&gt;

&lt;p&gt;Our main HPC cluster is mounting a subdirectory mount of the wider filesystem via a nodemap. The nodemap settings are as follows:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@rds-mds9 ~]# lctl get_param nodemap.csd3.*
nodemap.csd3.admin_nodemap=0
nodemap.csd3.audit_mode=1
nodemap.csd3.deny_unknown=0
nodemap.csd3.exports=
[
...
]
nodemap.csd3.fileset=/tenant/csd3
nodemap.csd3.id=2
nodemap.csd3.idmap=[

]
nodemap.csd3.map_mode=both
nodemap.csd3.ranges=
[
...
]
nodemap.csd3.sepol=

nodemap.csd3.squash_gid=99
nodemap.csd3.squash_uid=99
nodemap.csd3.trusted_nodemap=1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Of special note, we are setting the &apos;trusted&apos; property to &apos;1&apos;, as we do not do any ID mapping on the main cluster, whose nodes are all bound into our central LDAP domain. We are also setting &apos;admin&apos; to 0, which I understand to be equivalent to &apos;root_squash&apos; across all nodes in the nodemap.&lt;/p&gt;

&lt;p&gt;Consequently we aren&apos;t setting the usual root_squash parameters, relying entirely on the nodemap for this feature:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@rds-mds9 ~]# lctl get_param mdt.*.root_squash
mdt.rds-d7-MDT0000.root_squash=0:0
mdt.rds-d7-MDT0002.root_squash=0:0
[root@rds-mds9 ~]# lctl get_param mdt.*.nosquash_nids
mdt.rds-d7-MDT0000.nosquash_nids=NONE
mdt.rds-d7-MDT0002.nosquash_nids=NONE
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The problem we are seeing is that when a user launches a Slurm job that writes its output log file to a directory on this filesystem, creating the log file sometimes fails and the job aborts.&lt;/p&gt;

&lt;p&gt;I have included the full strace capture from the slurmd daemon when it hits this error, along with a simple slurm job file showing the job definition:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;slurm.FAIL.strace&lt;/li&gt;
&lt;/ul&gt;


&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;example job script&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
#!/bin/bash
#SBATCH -A SUPPORT-CPU
#SBATCH --output=/rds-d7/project/mjr208/testdir_groups/slurm.out.5
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=1:00

echo &quot;STARTING JOB&quot;
pwd
exit 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The key portion of the strace is here:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre&gt;
$ grep 72268 slurm.FAIL.strace | grep -v -E &apos;(futex|tgkill|TKILL)&apos;
...
72268 20:44:49.224728 geteuid()         = 0 &amp;lt;0.000005&amp;gt;                                                                                                                                                              
72268 20:44:49.224747 umask(002)        = 022 &amp;lt;0.000005&amp;gt;                                                                                                                                                            
72268 20:44:49.224764 getuid()          = 0 &amp;lt;0.000004&amp;gt;                             
72268 20:44:49.224780 getgid()          = 0 &amp;lt;0.000005&amp;gt;                                                                                                                                                              
72268 20:44:49.224796 getcwd(&quot;/var/spool/slurm/slurmd&quot;, 4096) = 24 &amp;lt;0.000005&amp;gt;
72268 20:44:49.224814 getgroups(0, NULL) = 51 &amp;lt;0.000005&amp;gt;                                                                                                                                                            
72268 20:44:49.224831 getgroups(51, [815, 901, 902, 904, 905, 910, 1099, 1100, 1101, 8053, 8054, 10573, 14500, 14998, 17501, 41000, 41029, 41037, 41357, 42000, 42006, 42042, 42080, 42100, 42112, 42113, 43000, 430
01, 43003, 43005, 43007, 43009, ...]) = 51 &amp;lt;0.000005&amp;gt;                                                                                                                                                               
72268 20:44:49.224870 getuid()          = 0 &amp;lt;0.000005&amp;gt;                           
72268 20:44:49.225163 setresgid(-1, 10573, -1 &amp;lt;unfinished ...&amp;gt;                                                                                                                                                      
72268 20:44:49.225175 &amp;lt;... setresgid resumed&amp;gt;) = 0 &amp;lt;0.000008&amp;gt;                                                                                                                                                       
72268 20:44:49.225574 setgroups(51, [10573, 50045, 80001, 80004, 80007, 80016, 80019, 43003, 43001, 43005, 43009, 43007, 41357, 80017, 43010, 42042, 42006, 41029, 42080, 41037, 17501, 901, 8054, 905, 14998, 904, 
14500, 90022, 815, 902, 1100, 8053, ...] &amp;lt;unfinished ...&amp;gt;                                                                                                                                                           
72268 20:44:49.225608 &amp;lt;... setgroups resumed&amp;gt;) = 0 &amp;lt;0.000008&amp;gt;                                                                                                                                                       
72268 20:44:49.225895 setresuid(-1, 10573, -1 &amp;lt;unfinished ...&amp;gt;                                                                                                                                                      
72268 20:44:49.225907 &amp;lt;... setresuid resumed&amp;gt;) = 0 &amp;lt;0.000008&amp;gt;                                      
72268 20:44:49.225919 open(&quot;/dev/null&quot;, O_RDONLY) = 10 &amp;lt;0.000006&amp;gt;
72268 20:44:49.225938 fcntl(10, F_SETFD, FD_CLOEXEC) = 0 &amp;lt;0.000005&amp;gt;                                                                                                                                                 
72268 20:44:49.225955 open(&quot;/rds-d7/project/mjr208/testdir_groups/slurm.out.5&quot;, O_WRONLY|O_CREAT|O_TRUNC|O_APPEND, 0666) = -1 EACCES (Permission denied) &amp;lt;0.000785&amp;gt;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So this process is the slurmstepd process, running as root, forked by slurmd when the job starts. It sets its effective UID and GID to those of the job&apos;s user, sets its supplementary groups to those of the user, and then attempts to open the Slurm output file, at which point it fails.&lt;/p&gt;

&lt;p&gt;The permissions on this directory are clearly writeable by the user:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[mjr208@cpu-p-129]:~/ $ ls -ld /rds-d7/project/mjr208/testdir_groups
drwxrws--- 2 mjr208 rds-kwtZ8ccHIQg-managers 4.0K Nov  5 20:30 /rds-d7/project/mjr208/testdir_groups
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I have also captured debug traces from the MDS when this happens, which are attached in the file:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;slurm.FAIL.mds.llog&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;I have noticed the following line in this file, which I think might be significant:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000004:08000000:35.0:1604606961.472081:0:150250:0:(mdd_permission.c:315:__mdd_permission_internal()) permission denied, mode 45f8, fsuid 99, uid 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This made me suspect that the root-squash behaviour of the nodemap could be to blame here, and sure enough, if I set the &apos;admin&apos; property to 1 and re-run the Slurm job, it completes without error.&lt;/p&gt;

&lt;p&gt;To test this, I added a debug line to the __mdd_permission_internal() function to print the values of fsuid and uid before it performs the permission check:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;--- a/lustre/mdd/mdd_permission.c
+++ b/lustre/mdd/mdd_permission.c
@@ -276,6 +276,8 @@ int __mdd_permission_internal(const struct lu_env *env, struct mdd_object *obj,
 	LASSERT(la != NULL);
 
 	mode = la-&amp;gt;la_mode;
+	CDEBUG(D_SEC, &quot;Checking access: mode %x, fsuid %u, uid %u\n&quot;,
+	       la-&amp;gt;la_mode, uc-&amp;gt;uc_fsuid, la-&amp;gt;la_uid);
 	if (uc-&amp;gt;uc_fsuid == la-&amp;gt;la_uid) {
 		mode &amp;gt;&amp;gt;= 6;
 	} else {
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and I captured another debug trace from the MDS as before, which is attached in:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;slurm.no-root-squash.SUCCESS.mds.llog&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The debug line I added above shows:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000004:08000000:33.0:1604607230.414011:0:150250:0:(mdd_permission.c:280:__mdd_permission_internal()) Checking access: mode 45f8, fsuid 10573, uid 0
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So it appears that the &apos;admin&apos; property on this nodemap is having an effect here that is causing problems for slurm. Are we using this wrong, or is this possibly a bug?&lt;/p&gt;

&lt;p&gt;To help isolate this further, I&apos;ve written a small reproducer C program that replicates the sequence of system calls slurmstepd makes before attempting to open the file. Also attached are strace output and debug traces from both the client and the MDS for this program, with and without the nodemap &apos;admin&apos; property set:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;rds-d7-eacces.c&lt;/li&gt;
	&lt;li&gt;reproducer.FAIL.client.llog&lt;/li&gt;
	&lt;li&gt;reproducer.FAIL.client.strace&lt;/li&gt;
	&lt;li&gt;reproducer.FAIL.mds.llog&lt;/li&gt;
	&lt;li&gt;reproducer.no-root-squash.SUCCESS.client.llog&lt;/li&gt;
	&lt;li&gt;reproducer.no-root-squash.SUCCESS.mds.llog&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;At the moment we are having to ask our users to redirect their Slurm output files to another filesystem, as we don&apos;t see this behaviour on our (much) older filesystems. We may also simply enable the &apos;admin&apos; property for now; however, I would really like to get to the bottom of why this is failing.&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Matt&lt;/p&gt;</description>
                <environment>Servers&lt;br/&gt;
-----------&lt;br/&gt;
lustre-2.12.5-1.el7.x86_64&lt;br/&gt;
3.10.0-1127.8.2.el7_lustre.x86_64&lt;br/&gt;
MOFED 4.9-0.1.7.0&lt;br/&gt;
&lt;br/&gt;
Clients&lt;br/&gt;
---------&lt;br/&gt;
lustre-client-2.12.5-1.el7.x86_64&lt;br/&gt;
3.10.0-1127.19.1.el7.x86_64&lt;br/&gt;
MOFED 4.9-0.1.7.0</environment>
        <key id="61564">LU-14121</key>
            <summary>EACCES Permission denied for slurm log files when using nodemap with admin=0</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="sebastien">Sebastien Buisson</assignee>
                                    <reporter username="mrb">Matt R&#225;s&#243;-Barnett</reporter>
                        <labels>
                            <label>LTS12</label>
                    </labels>
                <created>Thu, 5 Nov 2020 21:38:34 +0000</created>
                <updated>Wed, 7 Jul 2021 07:42:01 +0000</updated>
                            <resolved>Sun, 13 Dec 2020 16:00:49 +0000</resolved>
                                    <version>Lustre 2.12.5</version>
                                    <fixVersion>Lustre 2.14.0</fixVersion>
                    <fixVersion>Lustre 2.12.7</fixVersion>
                                        <due></due>
                            <votes>0</votes>
                                    <watches>4</watches>
                                                                            <comments>
                            <comment id="284521" author="pjones" created="Fri, 6 Nov 2020 18:49:20 +0000"  >&lt;p&gt;Sebastien&lt;/p&gt;

&lt;p&gt;Could you please advise?&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="284683" author="sebastien" created="Mon, 9 Nov 2020 12:35:13 +0000"  >&lt;p&gt;Hi Matt,&lt;/p&gt;

&lt;p&gt;Thanks for this very well documented ticket. In order to reproduce the issue, I installed Lustre 2.12.5 on my client and server nodes. I started the Lustre file system, and created a directory similar to yours (including the setgid bit), with the &lt;tt&gt;uids&lt;/tt&gt; as shown below:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# id user0
uid=500(user0) gid=500(user0) groups=500(user0),5001(user0g1),5002(user0g2)

# grep user0 /etc/passwd
user00:x:501:501::/home/user00:/bin/bash
user0:x:500:500::/home/user0:/bin/bash
# grep user0 /etc/group
user00:x:501:
user0:x:500:
user0g1:x:5001:user0
user0g2:x:5002:user0

# ls -ld /mnt/lustre/subdir
drwxr-xr-x 5 root root 4096 Nov  9 20:48 /mnt/lustre/subdir

# ls -ld /mnt/lustre/subdir/testdir_groups_1
drwxrws--- 2 user0 user0g1 4096 Nov  9 18:34 /mnt/lustre/subdir/testdir_groups_1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then I unmounted my client, and created the following nodemap definition, similar to what you did:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# lctl get_param nodemap.c0.*
nodemap.c0.admin_nodemap=0
nodemap.c0.audit_mode=1
nodemap.c0.deny_unknown=0
nodemap.c0.exports=[

]
nodemap.c0.fileset=/subdir
nodemap.c0.id=1
nodemap.c0.idmap=[

]
nodemap.c0.map_mode=both
nodemap.c0.ranges=
[
 { id: 1, start_nid: 10.128.11.159@tcp, end_nid:10.128.11.159@tcp }
]
nodemap.c0.sepol=

nodemap.c0.squash_gid=99
nodemap.c0.squash_uid=99
nodemap.c0.trusted_nodemap=1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then I remounted the client, so that the nodemap is taken into account:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# mount -t lustre
10.128.11.155@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)

# ls -l /mnt/seb
drwxrws--- 2 user0  user0g1 4096 Nov  9 21:07 testdir_groups_1
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The listing above shows that the subdirectory mount is enforced.&lt;/p&gt;

&lt;p&gt;So I tried your reproducer:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# id
uid=0(root) gid=0(root) groups=0(root)
# /tmp/rds-d7-eacces user0 /mnt/lustre/testdir_groups_1/slurm.out
[INFO] Real UID: 0 | Effective UID: 0
[INFO] Real GID: 0 | Effective GID: 0
[INFO] Username of ruid: root
[INFO] Number of groups: &apos;1&apos;
     0 - root,
[INFO] Looking up user: user0
[INFO] &apos;user0&apos; uid: 500, gid: 500
[INFO] Setting egid to: 500
[INFO] Real UID: 0 | Effective UID: 0
[INFO] Real GID: 0 | Effective GID: 500
[INFO] Number of groups for user: &apos;3&apos;
   500 - user0,   5001 - user0g1,   5002 - user0g2,
[INFO] Setting groups of calling process to groups of user0
[INFO] Number of groups for calling process: &apos;3&apos;
[INFO] Printing process groups
   500 - user0,   5001 - user0g1,   5002 - user0g2,
[INFO] set euid to : 500
[INFO] Real UID: 0 | Effective UID: 500
[INFO] Real GID: 0 | Effective GID: 500
[INFO] open rds file: /mnt/lustre/testdir_groups_1/slurm.out
[INFO] rds directory fd: 3
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As you can see, no permission denied in my case. I also added the debug trace you inserted in &lt;tt&gt;__mdd_permission_internal&lt;/tt&gt;, it shows:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000004:08000000:5.0:1604915212.229442:0:19824:0:(mdd_permission.c:280:__mdd_permission_internal()) Checking access: mode 45f8, fsuid 500, uid 500
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Looking at the result with &lt;tt&gt;root&lt;/tt&gt;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# ls -ld /mnt/lustre/testdir_groups_1
drwxrws--- 2 user0 user0g1 4096 Nov  9 18:36 /mnt/lustre/testdir_groups_1
# ls -l /mnt/lustre/testdir_groups_1
ls: cannot open directory /mnt/lustre/testdir_groups_1: Permission denied
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;tt&gt;Permission denied&lt;/tt&gt; is expected above, as the &apos;admin&apos; property is set to 0. But when accessing as user &lt;tt&gt;user0&lt;/tt&gt;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;$ id
uid=500(user0) gid=500(user0) groups=500(user0),5001(user0g1),5002(user0g2)
-bash-4.2$ ls -l /mnt/lustre/testdir_groups_1
total 0
-rw-r--r-- 1 user0 user0g1 0 Nov  9 18:36 slurm.out
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Maybe I am not using the reproducer properly?&lt;/p&gt;


&lt;p&gt;What is weird in your case is that the debug line you added says &lt;tt&gt;uid 0&lt;/tt&gt;. This is the &lt;tt&gt;uid&lt;/tt&gt; of the parent directory in which the file is being created, so I am wondering if you are accessing the right directory. Could you please show the access rights of the target directory, from a client node that is part of the trusted nodemap? Also, could you please show the mount command you use on the client side?&lt;/p&gt;

&lt;p&gt;In order to ease debugging, would you mind applying the following patch to add more information, and then collect Lustre debug logs on client and server (MDS) side with just the &lt;tt&gt;sec&lt;/tt&gt; debug mask, when you launch the reproducer program?&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;diff --git a/lustre/mdd/mdd_dir.c b/lustre/mdd/mdd_dir.c
index ddb3062..1b6b677 100644
--- a/lustre/mdd/mdd_dir.c
+++ b/lustre/mdd/mdd_dir.c
@@ -91,6 +91,7 @@ __mdd_lookup(const struct lu_env *env, struct md_object *pobj,
                       mdd2obd_dev(m)-&amp;gt;obd_name, PFID(mdo2fid(mdd_obj)));
        }

+       CDEBUG(D_SEC, &quot;calling mdd_permission_internal_locked for %s\n&quot;, name);
        rc = mdd_permission_internal_locked(env, mdd_obj, pattr, mask,
                                            MOR_TGT_PARENT);
        if (rc)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The exact command used for the reproducer, and its standard output, would be helpful too.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Sebastien.&lt;/p&gt;</comment>
                            <comment id="284690" author="mrb" created="Mon, 9 Nov 2020 14:12:26 +0000"  >&lt;p&gt;Hi Sebastien, thanks for looking into this for me.&lt;/p&gt;

&lt;p&gt;Here is some of the information you requested.&lt;/p&gt;

&lt;p&gt;The directory I am writing to has the following permissions. I also include its parent directory, as this is the first directory in our hierarchy that is not world-readable (755):&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# Parent directory
[mjr208@cpu-p-231]:~/ $ ls -ld /rds-d7/project/mjr208/testdir_groups
drwxrws--- 2 mjr208 rds-kwtZ8ccHIQg-managers 4.0K Nov  9 13:13 /rds-d7/project/mjr208/testdir_groups

# Parent of this directory
[mjr208@cpu-p-231]:~/ $ ls -ld /rds-d7/project/mjr208
drwxrws--- 5 nobody mjr208 4.0K Nov  4 19:13 /rds-d7/project/mjr208
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I wonder whether the &apos;fsuid 99, uid 0&apos; seen in the debug line I added in __mdd_permission_internal() could be from this directory instead?&lt;/p&gt;

&lt;p&gt;I&apos;m just recompiling with the added debug line, will re-run with the extra information as you&apos;ve suggested, and will report back shortly. You are running the reproducer correctly; however, one thing I didn&apos;t mention before is that I have had inconsistent results with it.&lt;/p&gt;

&lt;p&gt;For example I just re-ran the reproducer now on a fresh client in the nodemap, exactly as before, using the same output file, same directory, and the first iteration succeeded (no error), and then re-running the exact same command, the second iteration failed:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;########### ITERATION 1
[root@cpu-p-231 ~]# lctl clear; &amp;gt; client.rds-d7.eacess.llog.daemon; lctl mark &quot;STARTING TEST&quot;; strace -o client.rds-d7.eacess.strace ./rds-d7-eacces mjr208 /rds-d7/project/mjr208/testdir_groups/slurm.out.7
[INFO] Real UID: 0 | Effective UID: 0
[INFO] Real GID: 0 | Effective GID: 0
[INFO] Username of ruid: root
[INFO] Number of groups: &apos;2&apos;
     0 - root,    988 - sfcb, 
[INFO] Looking up user: mjr208
[INFO] &apos;mjr208&apos; uid: 10573, gid: 10573
[INFO] Setting egid to: 10573
[INFO] Real UID: 0 | Effective UID: 0
[INFO] Real GID: 0 | Effective GID: 10573
[INFO] Number of groups for user: &apos;51&apos;
 ... 
[INFO] Setting groups of calling process to groups of mjr208
[INFO] Number of groups for calling process: &apos;51&apos;
[INFO] Printing process groups
 ...
[INFO] set euid to : 10573
[INFO] Real UID: 0 | Effective UID: 10573
[INFO] Real GID: 0 | Effective GID: 10573
[INFO] open rds file: /rds-d7/project/mjr208/testdir_groups/slurm.out.7
[INFO] rds directory fd: 7
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;######## ITERATION 2
[root@cpu-p-231 ~]# lctl clear; &amp;gt; client.rds-d7.eacess.llog.daemon; lctl mark &quot;STARTING TEST&quot;; strace -o client.rds-d7.eacess.strace ./rds-d7-eacces mjr208 /rds-d7/project/mjr208/testdir_groups/slurm.out.7
[INFO] Real UID: 0 | Effective UID: 0
[INFO] Real GID: 0 | Effective GID: 0
[INFO] Username of ruid: root
[INFO] Number of groups: &apos;2&apos;
     0 - root,    988 - sfcb, 
[INFO] Looking up user: mjr208
[INFO] &apos;mjr208&apos; uid: 10573, gid: 10573
[INFO] Setting egid to: 10573
[INFO] Real UID: 0 | Effective UID: 0
[INFO] Real GID: 0 | Effective GID: 10573
[INFO] Number of groups for user: &apos;51&apos;
 ... 
[INFO] Setting groups of calling process to groups of mjr208
[INFO] Number of groups for calling process: &apos;51&apos;
[INFO] Printing process groups
 ... 
[INFO] set euid to : 10573
[INFO] Real UID: 0 | Effective UID: 10573
[INFO] Real GID: 0 | Effective GID: 10573
[INFO] open rds file: /rds-d7/project/mjr208/testdir_groups/slurm.out.7
open error : Permission denied
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Re-running 10 more times after this, they all consistently fail.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;However&lt;/b&gt;, I&apos;ve found that I can reset this behaviour, so that the reproducer succeeds again, simply by &apos;cd&apos;-ing to this directory as my user from another shell on the node, e.g.:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;Last login: Mon Nov  9 13:31:06 GMT 2020 on pts/3
[mjr208@cpu-p-231]:~/ $ cd /rds-d7/project/mjr208                                                        
[mjr208@cpu-p-231]:mjr208/ $ ls -ltr
total 12K
drwxrws--- 2 mjr208 mjr208                   4.0K Nov  4 19:05 testdir
drwxrws--- 2 mjr208 rds-kwtZ8ccHIQg-managers 4.0K Nov  4 19:09 testdir_acls
drwxrws--- 2 mjr208 rds-kwtZ8ccHIQg-managers 4.0K Nov  9 13:13 testdir_groups &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Then I re-run the reproducer:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@cpu-p-231 ~]# lctl clear; &amp;gt; client.rds-d7.eacess.llog.daemon; lctl mark &quot;STARTING TEST&quot;; strace -o client.rds-d7.eacess.strace ./rds-d7-eacces mjr208 /rds-d7/project/mjr208/testdir_groups/slurm.out.7
[INFO] Real UID: 0 | Effective UID: 0
[INFO] Real GID: 0 | Effective GID: 0
[INFO] Username of ruid: root
[INFO] Number of groups: &apos;2&apos;
     0 - root,    988 - sfcb, 
[INFO] Looking up user: mjr208
[INFO] &apos;mjr208&apos; uid: 10573, gid: 10573
[INFO] Setting egid to: 10573
[INFO] Real UID: 0 | Effective UID: 0
[INFO] Real GID: 0 | Effective GID: 10573
[INFO] Number of groups for user: &apos;51&apos;
... 
[INFO] Setting groups of calling process to groups of mjr208
[INFO] Number of groups for calling process: &apos;51&apos;
[INFO] Printing process groups
...
[INFO] set euid to : 10573
[INFO] Real UID: 0 | Effective UID: 10573
[INFO] Real GID: 0 | Effective GID: 10573
[INFO] open rds file: /rds-d7/project/mjr208/testdir_groups/slurm.out.7
[INFO] rds directory fd: 7 &lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Which runs successfully. Re-running it again fails, as before.&lt;/p&gt;

&lt;p&gt;Here are the MDS logs for the above test with &lt;tt&gt;&apos;debug=sec&apos;:&lt;/tt&gt;&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;MDS logs when no error&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# MDS logs when no error is seen.
# First mjr208 user in another shell cd to /rds-d7/project/mjr208 and runs &apos;ls -ltr&apos;
# Then reproducer program is run
[root@rds-mds9 ~]# cat mds.rds-d7.eacess.llog.daemon.20201109-135320
00000001:02000400:2.0F:1604929978.610749:0:183864:0:(debug.c:510:libcfs_debug_mark_buffer()) DEBUG MARKER: STARTING TEST
00000004:08000000:15.0F:1604929984.842158:0:210588:0:(mdd_permission.c:280:__mdd_permission_internal()) Checking access: mode 45f8, fsuid 10573, uid 0
00000004:08000000:15.0:1604929984.842963:0:210588:0:(mdd_permission.c:280:__mdd_permission_internal()) Checking access: mode 45f8, fsuid 10573, uid 0
00000004:08000000:15.0F:1604929990.902950:0:220614:0:(mdd_permission.c:280:__mdd_permission_internal()) Checking access: mode 45f8, fsuid 10573, uid 10573
00000004:08000000:15.0:1604929990.902981:0:220614:0:(mdd_permission.c:280:__mdd_permission_internal()) Checking access: mode 81a4, fsuid 10573, uid 10573
Debug log: 5 lines, 5 kept, 0 dropped, 0 bad.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;MDS logs when error is seen immediately after&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# Re-running the reproducer immediately after the above
[root@rds-mds9 ~]# cat mds.rds-d7.eacess.llog.daemon.20201109-140338
00000001:02000400:2.0F:1604930614.679069:0:194884:0:(debug.c:510:libcfs_debug_mark_buffer()) DEBUG MARKER: STARTING TEST
00000004:08000000:23.0F:1604930616.962154:0:210578:0:(mdd_permission.c:280:__mdd_permission_internal()) Checking access: mode 45f8, fsuid 99, uid 0
00000004:08000000:23.0:1604930616.962160:0:210578:0:(mdd_permission.c:315:__mdd_permission_internal()) permission denied, mode 45f8, fsuid 99, uid 0
Debug log: 3 lines, 3 kept, 0 dropped, 0 bad.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Sorry for the long reply again, I hope what I&apos;m doing here is clear.&lt;br/&gt;
Can you think of any explanation for this behaviour at all?&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Matt&lt;/p&gt;</comment>
                            <comment id="284694" author="mrb" created="Mon, 9 Nov 2020 14:24:53 +0000"  >&lt;p&gt;Just a further comment on the above. I have split the MDS logs from the no-error case into two files:&lt;br/&gt;
1) what is produced when I &apos;cd&apos; to the directory as my user and run &apos;ls -ltr&apos;&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@rds-mds9 ~]# cat mds.rds-d7.eacess.llog.daemon.20201109-141610
00000001:02000400:2.0F:1604931356.286982:0:206961:0:(debug.c:510:libcfs_debug_mark_buffer()) DEBUG MARKER: STARTING TEST
00000004:08000000:49.0F:1604931366.341767:0:210612:0:(mdd_permission.c:280:__mdd_permission_internal()) Checking access: mode 45f8, fsuid 10573, uid 0
00000004:08000000:49.0:1604931366.342936:0:210612:0:(mdd_permission.c:280:__mdd_permission_internal()) Checking access: mode 45f8, fsuid 10573, uid 0
Debug log: 3 lines, 3 kept, 0 dropped, 0 bad.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;2) when I run the reproducer program&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@rds-mds9 ~]# cat mds.rds-d7.eacess.llog.daemon.20201109-141627
00000001:02000400:2.0F:1604931370.254125:0:207029:0:(debug.c:510:libcfs_debug_mark_buffer()) DEBUG MARKER: STARTING TEST
Debug log: 1 lines, 1 kept, 0 dropped, 0 bad.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So it seems the reproducer isn&apos;t actually triggering that code path? Is it possible that there is a cached lookup of some access permissions on the directory in question?&lt;/p&gt;</comment>
                            <comment id="284698" author="mrb" created="Mon, 9 Nov 2020 14:55:55 +0000"  >&lt;p&gt;I&apos;ve re-run the above with the additional CDEBUG line you suggested added to the code. There are three log captures:&lt;/p&gt;

&lt;p&gt; 1) From shell as user mjr208 on the same client, running:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[mjr208@cpu-p-231]:mjr208/ $ ls -ltr
total 12K
drwxrws--- 2 mjr208 mjr208                   4.0K Nov  4 19:05 testdir
drwxrws--- 2 mjr208 rds-kwtZ8ccHIQg-managers 4.0K Nov  4 19:09 testdir_acls
drwxrws--- 2 mjr208 rds-kwtZ8ccHIQg-managers 4.0K Nov  9 13:44 testdir_groups
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;MDS logs&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@rds-mds9 ~]# cat mds.rds-d7.eacess.llog.daemon.20201109-144336
00000001:02000400:3.0F:1604933000.317642:0:5706:0:(debug.c:510:libcfs_debug_mark_buffer()) DEBUG MARKER: STARTING TEST
00000004:08000000:1.0F:1604933011.669748:0:286340:0:(mdd_permission.c:280:__mdd_permission_internal()) Checking access: mode 45f8, fsuid 10573, uid 0
00000004:08000000:1.0:1604933011.670507:0:286340:0:(mdd_dir.c:94:__mdd_lookup()) calling mdd_permission_internal_locked for testdir_groups
00000004:08000000:1.0:1604933011.670508:0:286340:0:(mdd_permission.c:280:__mdd_permission_internal()) Checking access: mode 45f8, fsuid 10573, uid 0
Debug log: 4 lines, 4 kept, 0 dropped, 0 bad.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;2) From shell as user root on the same client, running the reproducer - &lt;b&gt;no error&lt;/b&gt;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@cpu-p-231 ~]# ./rds-d7-eacces mjr208 /rds-d7/project/mjr208/testdir_groups/slurm.out.7
[INFO] Real UID: 0 | Effective UID: 0
[INFO] Real GID: 0 | Effective GID: 0
[INFO] Username of ruid: root
[INFO] Number of groups: &apos;2&apos;
     0 - root,    988 - sfcb, 
[INFO] Looking up user: mjr208
[INFO] &apos;mjr208&apos; uid: 10573, gid: 10573
[INFO] Setting egid to: 10573
[INFO] Real UID: 0 | Effective UID: 0
[INFO] Real GID: 0 | Effective GID: 10573
[INFO] Number of groups for user: &apos;51&apos;
 ... 
[INFO] Setting groups of calling process to groups of mjr208
[INFO] Number of groups for calling process: &apos;51&apos;
[INFO] Printing process groups
 ...
[INFO] set euid to : 10573
[INFO] Real UID: 0 | Effective UID: 10573
[INFO] Real GID: 0 | Effective GID: 10573
[INFO] open rds file: /rds-d7/project/mjr208/testdir_groups/slurm.out.7
[INFO] rds directory fd: 7
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;MDS logs&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@rds-mds9 ~]# cat mds.rds-d7.eacess.llog.daemon.20201109-144344
00000001:02000400:3.0F:1604933016.182371:0:5860:0:(debug.c:510:libcfs_debug_mark_buffer()) DEBUG MARKER: STARTING TEST
00000004:08000000:1.0F:1604933021.006260:0:285135:0:(mdd_dir.c:94:__mdd_lookup()) calling mdd_permission_internal_locked for slurm.out.7
00000004:08000000:1.0:1604933021.006262:0:285135:0:(mdd_permission.c:280:__mdd_permission_internal()) Checking access: mode 45f8, fsuid 10573, uid 10573
00000004:08000000:35.0F:1604933021.006391:0:285135:0:(mdd_permission.c:280:__mdd_permission_internal()) Checking access: mode 81a4, fsuid 10573, uid 10573
Debug log: 4 lines, 4 kept, 0 dropped, 0 bad.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;3) From shell as user root on the same client, re-running the reproducer - &lt;b&gt;error&lt;/b&gt;:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@cpu-p-231 ~]# ./rds-d7-eacces mjr208 /rds-d7/project/mjr208/testdir_groups/slurm.out.7
[INFO] Real UID: 0 | Effective UID: 0
[INFO] Real GID: 0 | Effective GID: 0
[INFO] Username of ruid: root
[INFO] Number of groups: &apos;2&apos;
     0 - root,    988 - sfcb, 
[INFO] Looking up user: mjr208
[INFO] &apos;mjr208&apos; uid: 10573, gid: 10573
[INFO] Setting egid to: 10573
[INFO] Real UID: 0 | Effective UID: 0
[INFO] Real GID: 0 | Effective GID: 10573
[INFO] Number of groups for user: &apos;51&apos;
 ...
[INFO] Setting groups of calling process to groups of mjr208
[INFO] Number of groups for calling process: &apos;51&apos;
[INFO] Printing process groups
 ...
[INFO] set euid to : 10573
[INFO] Real UID: 0 | Effective UID: 10573
[INFO] Real GID: 0 | Effective GID: 10573
[INFO] open rds file: /rds-d7/project/mjr208/testdir_groups/slurm.out.7
open error : Permission denied
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedHeader panelHeader&quot; style=&quot;border-bottom-width: 1px;&quot;&gt;&lt;b&gt;MDS logs&lt;/b&gt;&lt;/div&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[root@rds-mds9 ~]# cat mds.rds-d7.eacess.llog.daemon.20201109-144350
00000001:02000400:3.0F:1604933024.533435:0:5905:0:(debug.c:510:libcfs_debug_mark_buffer()) DEBUG MARKER: STARTING TEST
00000004:08000000:35.0F:1604933026.143219:0:285136:0:(mdd_dir.c:94:__mdd_lookup()) calling mdd_permission_internal_locked for testdir_groups
00000004:08000000:35.0:1604933026.143221:0:285136:0:(mdd_permission.c:280:__mdd_permission_internal()) Checking access: mode 45f8, fsuid 99, uid 0
00000004:08000000:35.0:1604933026.143225:0:285136:0:(mdd_permission.c:315:__mdd_permission_internal()) permission denied, mode 45f8, fsuid 99, uid 0
Debug log: 4 lines, 4 kept, 0 dropped, 0 bad.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="284699" author="sebastien" created="Mon, 9 Nov 2020 14:58:56 +0000"  >&lt;p&gt;Hi Matt,&lt;/p&gt;

&lt;p&gt;Thanks for all this additional information, and for re-running tests on your side with the debug traces.&lt;/p&gt;

&lt;p&gt;I created a tree hierarchy similar to yours, with the same permissions and mode bits, as I was initially missing an intermediate directory in the path.&lt;/p&gt;

&lt;p&gt;&quot;Good news&quot; is that I can now reproduce your issue! The first iteration works fine, but all subsequent ones hit the problem, regardless of whether we specify the same file or a different one to the reproducer, as long as it is in the same directory.&lt;br/&gt;
I also confirm that doing a simple &lt;tt&gt;ls&lt;/tt&gt; in the parent directory of &lt;tt&gt;testdir_groups&lt;/tt&gt; &quot;resets&quot; the behaviour for just one run of the reproducer, after which it is back to failing.&lt;/p&gt;

&lt;p&gt;It definitely points to something that would get cached somewhere... but I have no clue yet. I need to investigate further now that I can reproduce.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Sebastien.&lt;/p&gt;</comment>
                            <comment id="284700" author="mrb" created="Mon, 9 Nov 2020 15:08:05 +0000"  >&lt;p&gt;Ok that&apos;s good news that it&apos;s reproducible - I was starting to get worried that we had something weird going on at our end!&lt;/p&gt;

&lt;p&gt;Indeed, I was just checking the intermediate directory, and noticed that creating the output file at that level didn&apos;t hit the problem at all; it only occurs at the level below.&lt;/p&gt;

&lt;p&gt;Anyway, thanks for your time looking at this,&lt;br/&gt;
Matt&lt;/p&gt;</comment>
                            <comment id="284714" author="sebastien" created="Mon, 9 Nov 2020 17:05:33 +0000"  >&lt;p&gt;In my case, I cannot get the reproducer to pass, even on the first iteration, unless I do &lt;tt&gt;ls testdir_groups/..&lt;/tt&gt; beforehand.&lt;/p&gt;

&lt;p&gt;Perhaps you could use this workaround in Slurm until I am able to provide you with a fix?&lt;/p&gt;</comment>
                            <comment id="284722" author="mrb" created="Mon, 9 Nov 2020 17:49:43 +0000"  >&lt;p&gt;Sure, for the moment we are just asking users to redirect their slurm output files to their home directories as a workaround which is good enough for now. We may also temporarily set the &apos;admin&apos; property to &apos;1&apos; if it becomes a big hassle.&lt;/p&gt;

&lt;p&gt;Cheers,&lt;/p&gt;

&lt;p&gt;Matt&lt;/p&gt;</comment>
                            <comment id="284834" author="sebastien" created="Tue, 10 Nov 2020 15:56:07 +0000"  >&lt;p&gt;Hi,&lt;/p&gt;

&lt;p&gt;After more investigation, it turns out the problem stems from the way the nodemap feature interprets the squash concept. In the current implementation, if the &lt;tt&gt;real uid&lt;/tt&gt; is squashed, then the &lt;tt&gt;fsuid&lt;/tt&gt; is squashed as well, regardless of the value of the &lt;tt&gt;effective uid&lt;/tt&gt;. This squashing may be a little too strict, and should perhaps take the &lt;tt&gt;effective uid&lt;/tt&gt; into account when making a decision. But the behavior needs to be changed with caution...&lt;/p&gt;

&lt;p&gt;In your case, the &lt;tt&gt;fsuid&lt;/tt&gt; is changed from &lt;tt&gt;10573&lt;/tt&gt; to &lt;tt&gt;99&lt;/tt&gt; on MDS side, which prevents access to the &lt;tt&gt;testdir_groups&lt;/tt&gt; directory. When you &lt;tt&gt;ls&lt;/tt&gt; as user &lt;tt&gt;mjr208&lt;/tt&gt;, it does not involve any &lt;tt&gt;effective uid&lt;/tt&gt;, and this user is not squashed, so it &quot;loads&quot; into the client&apos;s cache the information regarding the &lt;tt&gt;testdir_groups&lt;/tt&gt; directory. This is why the subsequent access by the reproducer succeeds.&lt;/p&gt;

&lt;p&gt;I will try to see if it is possible to change the way squashing in nodemap is currently implemented. Meanwhile, what you can do is set the admin property to 1 on your nodemap, and use the legacy &lt;tt&gt;root_squash&lt;/tt&gt; mechanism to prevent &lt;tt&gt;root&lt;/tt&gt; access. I checked on my local test system, and it works as expected: the reproducer runs fine, and &lt;tt&gt;root&lt;/tt&gt; cannot access files and directories that are not world readable/writable.&lt;/p&gt;
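&lt;p&gt;To illustrate the difference, here is a minimal, purely hypothetical sketch in Python of the two squashing policies (a model only, not the actual Lustre implementation):&lt;/p&gt;

```python
# Hypothetical model of the nodemap squashing decision; illustrative
# only, not the actual Lustre code.
SQUASH_UID = 99  # nodemap squash_uid, as in the reported configuration

def fsuid_current(ruid, fsuid):
    # Current behavior: if the real uid is root (with admin=0), the
    # fsuid is forced to squash_uid, even though the process dropped
    # privileges with seteuid() and its fsuid is already unprivileged.
    return SQUASH_UID if ruid == 0 else fsuid

def fsuid_proposed(ruid, fsuid):
    # Proposed behavior: base the decision on the fsuid itself, so a
    # root process running with seteuid(10573) keeps fsuid 10573.
    return SQUASH_UID if fsuid == 0 else fsuid

# Slurm scenario: daemon starts as root, seteuid(10573) before open().
print(fsuid_current(0, 10573))   # 99    (EACCES on the mode-0770 directory)
print(fsuid_proposed(0, 10573))  # 10573 (access granted)
```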

&lt;p&gt;Cheers,&lt;br/&gt;
Sebastien.&lt;/p&gt;</comment>
                            <comment id="284857" author="mrb" created="Tue, 10 Nov 2020 18:11:10 +0000"  >&lt;p&gt;Thanks for the update Sebastien.&lt;/p&gt;

&lt;p&gt;I will do as you suggest and use the traditional root_squash mechanism. However, it would be good if this could be changed to use the effective uid in future: we are moving to a &apos;multi-tenant&apos; model with Lustre, using nodemaps for all sub-clusters. They are a very convenient administrative grouping, so it would be nice to be able to use them for root-squash enablement in the future.&lt;/p&gt;

&lt;p&gt;Thanks again,&lt;br/&gt;
Matt&lt;/p&gt;</comment>
                            <comment id="285128" author="gerrit" created="Fri, 13 Nov 2020 11:13:10 +0000"  >&lt;p&gt;Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/40645&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/40645&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14121&quot; title=&quot;EACESS Permission denied for slurm log files when using nodemap with admin=0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14121&quot;&gt;&lt;del&gt;LU-14121&lt;/del&gt;&lt;/a&gt; nodemap: make squashing rely on fsuid&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 7f59a4d9bb2b2a8c4f3381dc228a08343fa757bc&lt;/p&gt;</comment>
                            <comment id="287422" author="gerrit" created="Sun, 13 Dec 2020 08:23:24 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/40645/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/40645/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14121&quot; title=&quot;EACESS Permission denied for slurm log files when using nodemap with admin=0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14121&quot;&gt;&lt;del&gt;LU-14121&lt;/del&gt;&lt;/a&gt; nodemap: do not force fsuid/fsgid squashing&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: master&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 355787745f21b22bb36210bb1c8e41fb34e7b665&lt;/p&gt;</comment>
                            <comment id="287435" author="pjones" created="Sun, 13 Dec 2020 16:00:49 +0000"  >&lt;p&gt;Landed for 2.14&lt;/p&gt;</comment>
                            <comment id="287484" author="gerrit" created="Mon, 14 Dec 2020 15:53:01 +0000"  >&lt;p&gt;Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/40961&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/40961&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14121&quot; title=&quot;EACESS Permission denied for slurm log files when using nodemap with admin=0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14121&quot;&gt;&lt;del&gt;LU-14121&lt;/del&gt;&lt;/a&gt; nodemap: do not force fsuid/fsgid squashing&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 7f09b0889752af9de2f49e41ccc449d9bb673bab&lt;/p&gt;</comment>
                            <comment id="290206" author="gerrit" created="Sat, 23 Jan 2021 08:18:21 +0000"  >&lt;p&gt;Oleg Drokin (green@whamcloud.com) merged in patch &lt;a href=&quot;https://review.whamcloud.com/40961/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/40961/&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14121&quot; title=&quot;EACESS Permission denied for slurm log files when using nodemap with admin=0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14121&quot;&gt;&lt;del&gt;LU-14121&lt;/del&gt;&lt;/a&gt; nodemap: do not force fsuid/fsgid squashing&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_12&lt;br/&gt;
Current Patch Set: &lt;br/&gt;
Commit: 02b04c64a87aefb4f6e953d88f3cde1861b6844d&lt;/p&gt;</comment>
                            <comment id="306424" author="gerrit" created="Wed, 7 Jul 2021 07:42:01 +0000"  >&lt;p&gt;Diego Moreno (morenod@ethz.ch) uploaded a new patch: &lt;a href=&quot;https://review.whamcloud.com/44164&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/44164&lt;/a&gt;&lt;br/&gt;
Subject: &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-14121&quot; title=&quot;EACESS Permission denied for slurm log files when using nodemap with admin=0&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-14121&quot;&gt;&lt;del&gt;LU-14121&lt;/del&gt;&lt;/a&gt; nodemap: do not force fsuid/fsgid squashing&lt;br/&gt;
Project: fs/lustre-release&lt;br/&gt;
Branch: b2_10&lt;br/&gt;
Current Patch Set: 1&lt;br/&gt;
Commit: 425c44e181cf1f21ff4cb32b661c925ca9654db1&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                                        </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="62324">LU-14327</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="36568" name="rds-d7-eacces.c" size="4044" author="mrb" created="Thu, 5 Nov 2020 21:37:05 +0000"/>
                            <attachment id="36574" name="reproducer.FAIL.client.llog" size="105755" author="mrb" created="Thu, 5 Nov 2020 21:36:51 +0000"/>
                            <attachment id="36573" name="reproducer.FAIL.client.strace" size="142137" author="mrb" created="Thu, 5 Nov 2020 21:36:51 +0000"/>
                            <attachment id="36572" name="reproducer.FAIL.mds.llog" size="14314375" author="mrb" created="Thu, 5 Nov 2020 21:37:41 +0000"/>
                            <attachment id="36570" name="reproducer.no-root-squash.SUCCESS.client.llog" size="1629604" author="mrb" created="Thu, 5 Nov 2020 21:37:05 +0000"/>
                            <attachment id="36571" name="reproducer.no-root-squash.SUCCESS.mds.llog" size="21942938" author="mrb" created="Thu, 5 Nov 2020 21:37:44 +0000"/>
                            <attachment id="36576" name="slurm.FAIL.mds.llog" size="19008485" author="mrb" created="Thu, 5 Nov 2020 21:37:36 +0000"/>
                            <attachment id="36569" name="slurm.FAIL.strace" size="18719479" author="mrb" created="Thu, 5 Nov 2020 21:37:25 +0000"/>
                            <attachment id="36575" name="slurm.no-root-squash.SUCCESS.mds.llog" size="39517179" author="mrb" created="Thu, 5 Nov 2020 21:38:24 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i01ejb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10021"><![CDATA[2]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>