LU-14121: EACCES Permission denied for slurm log files when using nodemap with admin=0

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.7
    • Affects Version/s: Lustre 2.12.5

    Description

      Hello,
      We have recently deployed new filesystems running 2.12.5, and our users are hitting a problem with Slurm jobs that write their Slurm log files onto the new filesystems.

      A bit of context on our environment:

      Our main HPC cluster is mounting a subdirectory mount of the wider filesystem via a nodemap. The nodemap settings are as follows:

      [root@rds-mds9 ~]# lctl get_param nodemap.csd3.*
      nodemap.csd3.admin_nodemap=0
      nodemap.csd3.audit_mode=1
      nodemap.csd3.deny_unknown=0
      nodemap.csd3.exports=
      [
      ...
      ]
      nodemap.csd3.fileset=/tenant/csd3
      nodemap.csd3.id=2
      nodemap.csd3.idmap=[
      
      ]
      nodemap.csd3.map_mode=both
      nodemap.csd3.ranges=
      [
      ...
      ]
      nodemap.csd3.sepol=
      
      nodemap.csd3.squash_gid=99
      nodemap.csd3.squash_uid=99
      nodemap.csd3.trusted_nodemap=1
      

      Of special note, we are setting the 'trusted' property to '1' as we do not do any ID mapping on the main cluster, whose nodes are all bound into our central LDAP domain. We are also setting 'admin' to '0', which I understand to be equivalent to applying root_squash across all nodes in the nodemap.
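
      For reference, properties like these are typically applied on the MGS with commands along the following lines. This is an illustrative sketch only (not necessarily the exact commands we ran); the nodemap name and values are taken from the output above:

      # Illustrative: setting the nodemap properties shown above, run on the MGS
      lctl nodemap_modify --name csd3 --property trusted --value 1
      lctl nodemap_modify --name csd3 --property admin --value 0
      lctl nodemap_modify --name csd3 --property squash_uid --value 99
      lctl nodemap_modify --name csd3 --property squash_gid --value 99
      lctl nodemap_set_fileset --name csd3 --fileset '/tenant/csd3'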

      Consequently we aren't setting the usual MDT parameters to configure root_squash, relying entirely on nodemaps for this feature:

      [root@rds-mds9 ~]# lctl get_param mdt.*.root_squash
      mdt.rds-d7-MDT0000.root_squash=0:0
      mdt.rds-d7-MDT0002.root_squash=0:0
      [root@rds-mds9 ~]# lctl get_param mdt.*.nosquash_nids
      mdt.rds-d7-MDT0000.nosquash_nids=NONE
      mdt.rds-d7-MDT0002.nosquash_nids=NONE
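
      For comparison, if we were using the legacy mechanism instead, it would be configured on the MGS with something along these lines; the values below are purely illustrative (not anything we have actually set):

      # Illustrative only: the legacy, non-nodemap root squash configuration
      lctl conf_param rds-d7.mdt.root_squash=99:99
      lctl conf_param rds-d7.mdt.nosquash_nids="<admin node NIDs>"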
      

      The problem we are seeing is that when a user launches a Slurm job that writes its Slurm output log file into a directory on this filesystem, the job will sometimes fail to create the log file and abort.

      I have included the full strace capture from the slurmd daemon when it hits this error, along with a simple slurm job file showing the job definition:

      • slurm.FAIL.strace
      Example job script:
      #!/bin/bash
      #SBATCH -A SUPPORT-CPU
      #SBATCH --output=/rds-d7/project/mjr208/testdir_groups/slurm.out.5
      #SBATCH --nodes=1
      #SBATCH --ntasks=1
      #SBATCH --time=1:00
      
      echo "STARTING JOB"
      pwd
      exit 0
      

      The key portion of the strace is here:

      $ grep 72268 slurm.FAIL.strace | grep -v -E '(futex|tgkill|TKILL)'
      ...
      72268 20:44:49.224728 geteuid()         = 0 <0.000005>                                                                                                                                                              
      72268 20:44:49.224747 umask(002)        = 022 <0.000005>                                                                                                                                                            
      72268 20:44:49.224764 getuid()          = 0 <0.000004>                             
      72268 20:44:49.224780 getgid()          = 0 <0.000005>                                                                                                                                                              
      72268 20:44:49.224796 getcwd("/var/spool/slurm/slurmd", 4096) = 24 <0.000005>
      72268 20:44:49.224814 getgroups(0, NULL) = 51 <0.000005>                                                                                                                                                            
      72268 20:44:49.224831 getgroups(51, [815, 901, 902, 904, 905, 910, 1099, 1100, 1101, 8053, 8054, 10573, 14500, 14998, 17501, 41000, 41029, 41037, 41357, 42000, 42006, 42042, 42080, 42100, 42112, 42113, 43000, 43001, 43003, 43005, 43007, 43009, ...]) = 51 <0.000005>
      72268 20:44:49.224870 getuid()          = 0 <0.000005>                           
      72268 20:44:49.225163 setresgid(-1, 10573, -1 <unfinished ...>                                                                                                                                                      
      72268 20:44:49.225175 <... setresgid resumed>) = 0 <0.000008>                                                                                                                                                       
      72268 20:44:49.225574 setgroups(51, [10573, 50045, 80001, 80004, 80007, 80016, 80019, 43003, 43001, 43005, 43009, 43007, 41357, 80017, 43010, 42042, 42006, 41029, 42080, 41037, 17501, 901, 8054, 905, 14998, 904, 14500, 90022, 815, 902, 1100, 8053, ...] <unfinished ...>
      72268 20:44:49.225608 <... setgroups resumed>) = 0 <0.000008>                                                                                                                                                       
      72268 20:44:49.225895 setresuid(-1, 10573, -1 <unfinished ...>                                                                                                                                                      
      72268 20:44:49.225907 <... setresuid resumed>) = 0 <0.000008>                                      
      72268 20:44:49.225919 open("/dev/null", O_RDONLY) = 10 <0.000006>
      72268 20:44:49.225938 fcntl(10, F_SETFD, FD_CLOEXEC) = 0 <0.000005>                                                                                                                                                 
      72268 20:44:49.225955 open("/rds-d7/project/mjr208/testdir_groups/slurm.out.5", O_WRONLY|O_CREAT|O_TRUNC|O_APPEND, 0666) = -1 EACCES (Permission denied) <0.000785>
      

      So this is the slurmstepd process forked by slurmd (running as root) when starting the job: it sets its effective UID and GID to those of the job's user, sets its supplementary groups to those of the user, and then attempts to open the Slurm output file, at which point it fails.

      The directory is clearly writable by the user:

      [mjr208@cpu-p-129]:~/ $ ls -ld /rds-d7/project/mjr208/testdir_groups
      drwxrws--- 2 mjr208 rds-kwtZ8ccHIQg-managers 4.0K Nov  5 20:30 /rds-d7/project/mjr208/testdir_groups
      

      I have also captured debug traces from the MDS when this happens, which are attached in the file:

      • slurm.FAIL.mds.llog
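
      For reference, a trace like this can be captured on the MDS roughly as follows. This is an illustrative sketch only; the exact debug flags and commands used for the attached logs may have differed:

      # enable 'sec' (D_SEC) debug messages, clear the kernel debug buffer,
      # reproduce the failure, then dump the debug log to a file
      lctl set_param debug=+sec
      lctl clear
      # ... reproduce the failing Slurm job ...
      lctl dk > /tmp/slurm.FAIL.mds.llog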

      I noticed the following line in this file, which I think might be significant:

      00000004:08000000:35.0:1604606961.472081:0:150250:0:(mdd_permission.c:315:__mdd_permission_internal()) permission denied, mode 45f8, fsuid 99, uid 0
      

      This made me suspect that the root-squash behaviour of the nodemap could be to blame here, and sure enough, if I set the 'admin' property to '1' and re-run the Slurm job, it completes successfully without error.

      To test this, I added a debug print to the __mdd_permission_internal() function to print the values of fsuid and uid before it performs the permission check:

      --- a/lustre/mdd/mdd_permission.c
      +++ b/lustre/mdd/mdd_permission.c
      @@ -276,6 +276,8 @@ int __mdd_permission_internal(const struct lu_env *env, struct mdd_object *obj,
        LASSERT(la != NULL);
       
        mode = la->la_mode;
      + CDEBUG(D_SEC, "Checking access: mode %x, fsuid %u, uid %u\n",
      +        la->la_mode, uc->uc_fsuid, la->la_uid);
        if (uc->uc_fsuid == la->la_uid) {
          mode >>= 6;
      } else {
      

      and I captured another debug trace from the MDS as before, which is attached in:

      • slurm.no-root-squash.SUCCESS.mds.llog

      In this successful run (with the 'admin' property set to '1'), the debug line I added shows:

      00000004:08000000:33.0:1604607230.414011:0:150250:0:(mdd_permission.c:280:__mdd_permission_internal()) Checking access: mode 45f8, fsuid 10573, uid 0
      

      So it appears that setting the 'admin' property to '0' on this nodemap causes the fsuid to be squashed (99 rather than 10573 above), which is what breaks Slurm here. Are we using this wrong, or is this possibly a bug?

      To help isolate this further, I've written a small reproducer C program that copies the sequence of system calls the slurmstepd daemon makes before attempting to open the file. Also attached are strace output and debug traces from both the client and the MDS for this program, with and without the nodemap 'admin' property set:

      • rds-d7-eacces.c
      • reproducer.FAIL.client.llog
      • reproducer.FAIL.client.strace
      • reproducer.FAIL.mds.llog
      • reproducer.no-root-squash.SUCCESS.client.llog
      • reproducer.no-root-squash.SUCCESS.mds.llog

      At the moment we are having to ask our users to redirect their Slurm output files to another filesystem, as we don't see this behaviour on our (much) older filesystems. We may also just enable the 'admin' property for now; however, I would really like to get to the bottom of why this is failing.

      Cheers,
      Matt

      Attachments

        1. rds-d7-eacces.c
          4 kB
        2. reproducer.FAIL.client.llog
          103 kB
        3. reproducer.FAIL.client.strace
          139 kB
        4. reproducer.FAIL.mds.llog
          13.65 MB
        5. reproducer.no-root-squash.SUCCESS.client.llog
          1.55 MB
        6. reproducer.no-root-squash.SUCCESS.mds.llog
          20.93 MB
        7. slurm.FAIL.mds.llog
          18.13 MB
        8. slurm.FAIL.strace
          17.85 MB
        9. slurm.no-root-squash.SUCCESS.mds.llog
          37.69 MB


          Activity

            [LU-14121] EACCES Permission denied for slurm log files when using nodemap with admin=0

            Diego Moreno (morenod@ethz.ch) uploaded a new patch: https://review.whamcloud.com/44164
            Subject: LU-14121 nodemap: do not force fsuid/fsgid squashing
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 425c44e181cf1f21ff4cb32b661c925ca9654db1


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40961/
            Subject: LU-14121 nodemap: do not force fsuid/fsgid squashing
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 02b04c64a87aefb4f6e953d88f3cde1861b6844d


            Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/40961
            Subject: LU-14121 nodemap: do not force fsuid/fsgid squashing
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 7f09b0889752af9de2f49e41ccc449d9bb673bab

            pjones Peter Jones added a comment -

            Landed for 2.14


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40645/
            Subject: LU-14121 nodemap: do not force fsuid/fsgid squashing
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 355787745f21b22bb36210bb1c8e41fb34e7b665


            Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/40645
            Subject: LU-14121 nodemap: make squashing rely on fsuid
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7f59a4d9bb2b2a8c4f3381dc228a08343fa757bc


            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Thanks for the update Sebastien.

            I will do as you suggest and use the traditional root_squash mechanism. However, it would be good if this could be changed to use the effective uid in future: we are moving to a 'multi-tenant' model with Lustre, using nodemaps for all sub-clusters, and they are a very convenient administrative grouping, so it would be nice to be able to use them to enable root squash as well.

            Thanks again,
            Matt

            sebastien Sebastien Buisson added a comment -

            Hi,

            After more investigation, it turns out the problem stems from the way the nodemap feature interprets the squash concept. In the current implementation, if the real uid is squashed, then the fsuid is squashed as well, no matter what the effective uid is. Maybe this squashing is a little too strict and should take the effective uid into account when making the decision, but the behavior needs to be changed with caution...

            In your case, the fsuid is changed from 10573 to 99 on the MDS side, which prevents access to the testdir_groups directory. When you run ls as user mjr208, no effective uid change is involved and this user is not squashed, so it "loads" the information about the testdir_groups directory into the client's cache. This is why the subsequent access by the reproducer succeeds.

            I will try to see if it is possible to change the way squashing is currently implemented in nodemap. Meanwhile, what you can do is set the admin property to 1 on your nodemap and use the legacy root_squash mechanism to prevent root access. I checked on my local test system and it works as expected: the reproducer runs fine, and root cannot access non-world-readable/writable files and directories.

            Cheers,
            Sebastien.
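
            For reference, the workaround described above corresponds roughly to the following commands on the MGS. Treat this as a sketch rather than the exact commands used; the nodemap and parameter names are taken from the description earlier in this ticket:

            # stop the nodemap from squashing root, then rely on the legacy
            # root_squash parameters (see the conf_param example in the description)
            lctl nodemap_modify --name csd3 --property admin --value 1
            lctl get_param nodemap.csd3.admin_nodemap mdt.*.root_squash mdt.*.nosquash_nids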

            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Sure, for the moment we are just asking users to redirect their Slurm output files to their home directories as a workaround, which is good enough for now. We may also temporarily set the 'admin' property to '1' if it becomes a big hassle.

            Cheers,

            Matt

            sebastien Sebastien Buisson added a comment -

            In my case, I cannot get the reproducer to pass, even on the first iteration, unless I run ls testdir_groups/.. beforehand.

            Maybe you could try to use this workaround in Slurm until I am able to provide you with a fix?

            People

              Assignee: Sebastien Buisson
              Reporter: Matt Rásó-Barnett (Inactive)
              Votes: 0
              Watchers: 4
