
ll_setattr_raw does not account for noninitial linux namespaces

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

    Description

      xref https://jira.whamcloud.com/browse/LUDOC-533 (Current intended level of support for Linux (user) namespaces)

      xref https://github.com/moby/moby/issues/48413 ([Lustre filesystem] dockerd rootless: Order of chtimes and chmod forces requirement on CAP_FOWNER)

      See the moby issue for the user-facing trigger of this bug, where rootless docker fails to set file time attributes. The short strace excerpt below shows the failure; the process runs as root in a user namespace with uids 0-65535 mapped:

      [pid 496980] fchownat(AT_FDCWD, "/home/ubuntu", 1000, 1000, AT_SYMLINK_NOFOLLOW ) = 0
      [pid 496980] utimensat(AT_FDCWD, "/home/ubuntu", [{tv_sec=1722513791, tv_nsec=0} /* 2024-08-01T14:03:11+0200 */, {tv_sec=1722513791, tv_nsec=0} /* 2024-08-01T14:03:11+0200 */], 0) = -1 EPERM (Operation not per... 

      A full docker-independent reproducer script will be attached.

      A friend worked out the kernel-side permission logic; I'm documenting it here and filing the issue.

      To find out what utimensat ends up calling, we searched for ATTR_ATIME_SET and similar flags and found matches in llite, which according to https://wiki.lustre.org/Understanding_Lustre_Internals seems to be the abstraction layer between Linux and Lustre.

      Here, we see that ll_setattr_raw checks current_fsuid() against the inode's uid, and if those don't match, it checks for CAP_FOWNER:

              /* POSIX: check before ATTR_*TIME_SET set (from inode_change_ok) */
              if (attr->ia_valid & TIMES_SET_FLAGS) {  
                      if ((!uid_eq(current_fsuid(), inode->i_uid)) &&
                          !capable(CAP_FOWNER))
                              GOTO(clear, rc = -EPERM);
              } 

      (per https://www.man7.org/linux/man-pages/man7/capabilities.7.html

             CAP_FOWNER
                    •  Bypass permission checks on operations that normally
                       require the filesystem UID of the process to match the
                       UID of the file (e.g., chmod(2), utime(2)), excluding
                       those operations covered by CAP_DAC_OVERRIDE and
                       CAP_DAC_READ_SEARCH;
                    •  set inode flags (see FS_IOC_SETFLAGS(2const)) on
                       arbitrary files;
                    •  set Access Control Lists (ACLs) on arbitrary files;
                    •  ignore directory sticky bit on file deletion;
                    •  modify user extended attributes on sticky directory
                       owned by any user;
                    •  specify O_NOATIME for arbitrary files in open(2) and
                       fcntl(2). 

      )

       

      In the kernel source, it can be seen that capable() only checks against the initial namespace:

      bool capable(int cap)
      {
              return ns_capable(&init_user_ns, cap);
      } 

      The solution is probably to use the inode_owner_or_capable function, which is extended to handle namespaces:

      /**
      * inode_owner_or_capable - check current task permissions to inode
      * @mnt_userns: user namespace of the mount the inode was found from
      * @inode: inode being checked
      *
      * Return true if current either has CAP_FOWNER in a namespace with the
      * inode owner uid mapped, or owns the file.
      *
      * If the inode has been found through an idmapped mount the user namespace of
      * the vfsmount must be passed through @mnt_userns. This function will then take
      * care to map the inode according to @mnt_userns before checking permissions.
      * On non-idmapped mounts or if permission checking is to be performed on the
      * raw inode simply passs init_user_ns.
      */
      bool inode_owner_or_capable(struct user_namespace *mnt_userns,
                                  const struct inode *inode)
      {
              kuid_t i_uid;
              struct user_namespace *ns;

              i_uid = i_uid_into_mnt(mnt_userns, inode);
              if (uid_eq(current_fsuid(), i_uid))
                      return true;

              ns = current_user_ns();
              if (kuid_has_mapping(ns, i_uid) && ns_capable(ns, CAP_FOWNER))
                      return true;
              return false;
      } 

      A note for novices: root in a user namespace does get a full set of capabilities, but only relative to that namespace:

      $ unshare -U -r -m capsh --print
      Current: =ep
      Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
      Ambient set =
      Current IAB:
      Securebits: 00/0x0/1'b0 (no-new-privs=0)
       secure-noroot: no (unlocked)
       secure-no-suid-fixup: no (unlocked)
       secure-keep-caps: no (unlocked)
       secure-no-ambient-raise: no (unlocked)
      uid=0(root) euid=0(root)
      gid=0(root)
      groups=65534(nobody),65534(nobody),0(root)
      Guessed mode: UNCERTAIN (0)
      

      Given all this, are other parts of the code likely to have similar problems?

          Activity

            [LU-18200] ll_setattr_raw does not account for noninitial linux namespaces

            adilger Andreas Dilger added a comment -

            One thing to check is if the MDS is allowing CAP_FOWNER from the client, or if this is being rejected on the client?

            You can enable client capabilities on the MGS with a parameter:

            mgs# lctl set_param -P mdt.lfs-*.enable_cap_mask=+cap_fowner

            This was added in LU-13791, but if the second "named capabilities" patch is not in your branch, then you would need to set the capability mask by a hex value (CAP_FOWNER=0x8).
            jikim Jinseok Kim added a comment -

            Hello team,

            Are there any updates?

            A similar issue has been reproduced at Samsung Electronics in Korea.
            If there are any updates, please share them.

            [yr9.choi@agpu1443 ~]$ docker pull repo.samsungds.net/docker.io/ubuntu:20.04
            20.04: Pulling from ubuntu
            d9802f032d67: Extracting [==================================================>]  27.51MB/27.51MB
            failed to register layer: chtimes /var/cache/apt/archives/partial: operation not permitted 

            adilger Andreas Dilger added a comment -

            The build process is relatively simple after downloading the sources:

            # sh autogen.sh
            # ./configure --disable-server --disable-tests
            # make rpms
            # rpm -Fvh lustre-client-*.rpm

            lstrbg Dan Trent added a comment -

            I haven't really figured out how to compile Lustre from source, nor how to set up a local test service. We only did a bit of digging in the source to see if we could diagnose the issue that way.


            adilger Andreas Dilger added a comment -

            lstrbg I've moved this over to the LU project.

            The first question to ask is whether you've tried replacing the two uid_eq()+capable() lines in ll_setattr_raw() with inode_owner_or_capable() yourself, and whether that fixed the issue for you. If it does fix the problem, then submitting a patch to the tree would be relatively straightforward.

            lstrbg Dan Trent added a comment -

            I've attached the reproducer script. It assumes $HOME is a Lustre directory. It can be rewritten to take some other directory.

            lstrbg Dan Trent added a comment -

            I filed this under the wrong component; it's not a DOC issue.


            People

              Assignee: core-lustre-triage Core Lustre Triage
              Reporter: lstrbg Dan Trent