  1. Lustre
  2. LU-18200

ll_setattr_raw does not account for noninitial linux namespaces

Details

    • Bug
    • Resolution: Unresolved
    • Minor

    Description

      xref https://jira.whamcloud.com/browse/LUDOC-533 (Current intended level of support for Linux (user) namespaces)

      xref https://github.com/moby/moby/issues/48413 ([Lustre filesystem] dockerd rootless: Order of chtimes and chmod forces requirement on CAP_FOWNER)

      See the moby issue for the user-facing trigger of this bug: rootless docker fails to set file time attributes. A short strace excerpt follows; the process runs as root inside a user namespace with uids 0-65535 mapped:

      [pid 496980] fchownat(AT_FDCWD, "/home/ubuntu", 1000, 1000, AT_SYMLINK_NOFOLLOW ) = 0
      [pid 496980] utimensat(AT_FDCWD, "/home/ubuntu", [{tv_sec=1722513791, tv_nsec=0} /* 2024-08-01T14:03:11+0200 */, {tv_sec=1722513791, tv_nsec=0} /* 2024-08-01T14:03:11+0200 */], 0) = -1 EPERM (Operation not per... 
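
      The two calls in the trace can be replayed outside docker with a small C helper. This is only a sketch: the function name chown_then_set_times is hypothetical, and the path, ids, and timestamp are whatever the caller passes in. Per the report, the expectation is that on a Lustre mount, as root inside a user namespace, the second call fails and the helper returns -EPERM.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Lustre or docker): replay the
 * fchownat() + utimensat() sequence from the strace on one path.
 * Returns 0 on success or a negative errno from the failing call.
 */
static int chown_then_set_times(const char *path, uid_t uid, gid_t gid,
                                time_t sec)
{
    struct timespec times[2] = {
        { .tv_sec = sec, .tv_nsec = 0 }, /* atime */
        { .tv_sec = sec, .tv_nsec = 0 }, /* mtime */
    };

    /* Step 1: hand the file to another uid/gid, as docker does. */
    if (fchownat(AT_FDCWD, path, uid, gid, AT_SYMLINK_NOFOLLOW) != 0)
        return -errno;

    /* Step 2: set explicit timestamps; this is the call that fails. */
    if (utimensat(AT_FDCWD, path, times, 0) != 0)
        return -errno;

    return 0;
}
```

      Calling chown_then_set_times("/home/ubuntu", 1000, 1000, 1722513791) from a shell started with unshare would mirror the trace above.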

      A full docker-independent reproducer script will be attached.

      The kernel-side permission analysis was worked out by a friend; I'm documenting it here and filing the issue.

      To figure out what utimensat ends up calling, we look for ATTR_ATIME_SET and similar flags. These match code in llite, which, according to https://wiki.lustre.org/Understanding_Lustre_Internals, seems to be the layer between the Linux VFS and Lustre.

      Here, we see that ll_setattr_raw checks current_fsuid() against the inode's uid, and if those don't match, it checks for CAP_FOWNER:

              /* POSIX: check before ATTR_*TIME_SET set (from inode_change_ok) */
              if (attr->ia_valid & TIMES_SET_FLAGS) {  
                      if ((!uid_eq(current_fsuid(), inode->i_uid)) &&
                          !capable(CAP_FOWNER))
                              GOTO(clear, rc = -EPERM);
              } 

      (per https://www.man7.org/linux/man-pages/man7/capabilities.7.html

             CAP_FOWNER
                    •  Bypass permission checks on operations that normally
                       require the filesystem UID of the process to match the
                       UID of the file (e.g., chmod(2), utime(2)), excluding
                       those operations covered by CAP_DAC_OVERRIDE and
                       CAP_DAC_READ_SEARCH;
                    •  set inode flags (see FS_IOC_SETFLAGS(2const)) on
                       arbitrary files;
                    •  set Access Control Lists (ACLs) on arbitrary files;
                    •  ignore directory sticky bit on file deletion;
                    •  modify user extended attributes on sticky directory
                       owned by any user;
                    •  specify O_NOATIME for arbitrary files in open(2) and
                       fcntl(2). 

      )

       

      In the kernel source, it can be seen that capable() only checks against the initial namespace:

      bool capable(int cap)
      {
              return ns_capable(&init_user_ns, cap);
      } 
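
      To make the consequence concrete, here is a toy userspace model of the namespace walk behind ns_capable(). It is a simplified sketch, not kernel code: every toy_* name is hypothetical, a single boolean stands in for the capability set, and the real kernel logic has more cases. It only illustrates why a task that holds CAP_FOWNER inside a child user namespace still fails a check against the initial namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these are not the kernel's types. */
struct toy_user_ns {
    const struct toy_user_ns *parent; /* NULL for the initial namespace */
    unsigned int owner_uid;           /* uid that created the namespace */
};

struct toy_cred {
    const struct toy_user_ns *user_ns; /* namespace the task lives in */
    unsigned int euid;
    bool has_cap_fowner;               /* CAP_FOWNER within user_ns */
};

static const struct toy_user_ns toy_init_user_ns = { NULL, 0 };

/* Does cred hold CAP_FOWNER with respect to the target namespace? */
static bool toy_ns_capable(const struct toy_cred *cred,
                           const struct toy_user_ns *target)
{
    for (const struct toy_user_ns *ns = target; ; ns = ns->parent) {
        if (ns == cred->user_ns)      /* same namespace: use own caps */
            return cred->has_cap_fowner;
        if (ns->parent == NULL)       /* hit the initial ns: no match */
            return false;
        /* The creator of a child namespace has all caps inside it. */
        if (ns->parent == cred->user_ns && ns->owner_uid == cred->euid)
            return true;
    }
}

/* Model of capable(): always checks against the initial namespace. */
static bool toy_capable(const struct toy_cred *cred)
{
    return toy_ns_capable(cred, &toy_init_user_ns);
}

/* Usage: root inside a child namespace created by host uid 1000. */
static void toy_capable_demo(void)
{
    struct toy_user_ns child = { &toy_init_user_ns, 1000 };
    struct toy_cred container_root = { &child, 0, true };

    assert(toy_ns_capable(&container_root, &child)); /* capable in own ns */
    assert(!toy_capable(&container_root));           /* but not in init ns */
}
```

      In this model, toy_capable() corresponds to the check that ll_setattr_raw performs, which is why it is refused for namespace root.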

      The solution is probably to use the inode_owner_or_capable() function, which is user-namespace aware:

      /**
      * inode_owner_or_capable - check current task permissions to inode
      * @mnt_userns: user namespace of the mount the inode was found from
      * @inode: inode being checked
      *
      * Return true if current either has CAP_FOWNER in a namespace with the
      * inode owner uid mapped, or owns the file.
      *
      * If the inode has been found through an idmapped mount the user namespace of
      * the vfsmount must be passed through @mnt_userns. This function will then take
      * care to map the inode according to @mnt_userns before checking permissions.
      * On non-idmapped mounts or if permission checking is to be performed on the
      * raw inode simply passs init_user_ns.
      */
      bool inode_owner_or_capable(struct user_namespace *mnt_userns,
                                  const struct inode *inode)
      {
              kuid_t i_uid;
              struct user_namespace *ns;

              i_uid = i_uid_into_mnt(mnt_userns, inode);
              if (uid_eq(current_fsuid(), i_uid))
                      return true;

              ns = current_user_ns();
              if (kuid_has_mapping(ns, i_uid) && ns_capable(ns, CAP_FOWNER))
                      return true;
              return false;
      } 
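
      The decision made by the function quoted above can be mimicked in userspace as follows. This is a toy sketch under stated assumptions: the toy_* names are hypothetical, uids are plain host-side integers, the uid map is a single contiguous range starting at host uid 100000 (an arbitrary illustrative value), and idmapped mounts are ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these are not the kernel's types. */
struct toy_ns {
    unsigned int host_first; /* first host uid covered by the uid map */
    unsigned int count;      /* number of uids in the map */
};

struct toy_task {
    const struct toy_ns *ns;  /* user namespace of the task */
    unsigned int fsuid;       /* host-side fsuid of the task */
    bool cap_fowner_in_ns;    /* CAP_FOWNER held inside ns */
};

/* Is the host-side uid visible (mapped) inside the namespace? */
static bool toy_kuid_has_mapping(const struct toy_ns *ns, unsigned int kuid)
{
    return kuid >= ns->host_first && kuid - ns->host_first < ns->count;
}

/* Owner match, or CAP_FOWNER in a namespace where the owner is mapped. */
static bool toy_inode_owner_or_capable(const struct toy_task *t,
                                       unsigned int i_uid)
{
    if (t->fsuid == i_uid)
        return true;
    return toy_kuid_has_mapping(t->ns, i_uid) && t->cap_fowner_in_ns;
}

/* Usage: namespace root touching a file owned by namespace uid 1000. */
static void toy_owner_check_demo(void)
{
    struct toy_ns ns = { 100000, 65536 };
    struct toy_task root_in_ns = { &ns, 100000, true };

    /* Owner is mapped (host uid 101000): the operation is allowed. */
    assert(toy_inode_owner_or_capable(&root_in_ns, 101000));
    /* Owner is host root, which is unmapped here: still refused. */
    assert(!toy_inode_owner_or_capable(&root_in_ns, 0));
}
```

      The first case corresponds to the strace scenario: the file has just been chowned to a uid that is mapped in the namespace, so a namespace-aware check would let the timestamp update through while still refusing files owned by unmapped uids.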

      A note for novices: root in a user namespace does get a full set of capabilities, but they only apply within that namespace, not in the initial one:

      $ unshare -U -r -m capsh --print
      Current: =ep
      Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
      Ambient set =
      Current IAB:
      Securebits: 00/0x0/1'b0 (no-new-privs=0)
       secure-noroot: no (unlocked)
       secure-no-suid-fixup: no (unlocked)
       secure-keep-caps: no (unlocked)
       secure-no-ambient-raise: no (unlocked)
      uid=0(root) euid=0(root)
      gid=0(root)
      groups=65534(nobody),65534(nobody),0(root)
      Guessed mode: UNCERTAIN (0)
      

      Given all this, it seems likely that other parts of the code have similar problems.

            People

              core-lustre-triage Core Lustre Triage
              lstrbg Dan Trent