Lustre / LU-18200

ll_setattr_raw does not account for noninitial linux namespaces

Details

    • Bug
    • Resolution: Unresolved
    • Minor

    Description

      xref https://jira.whamcloud.com/browse/LUDOC-533 (Current intended level of support for Linux (user) namespaces)

      xref https://github.com/moby/moby/issues/48413 ([Lustre filesystem] dockerd rootless: Order of chtimes and chmod forces requirement on CAP_FOWNER)

      See the moby issue for the user-facing trigger of this bug, where rootless docker fails to set file time attributes. The short strace excerpt below shows the failure; the process runs as root in a user namespace with uids 0-65535 mapped:

      [pid 496980] fchownat(AT_FDCWD, "/home/ubuntu", 1000, 1000, AT_SYMLINK_NOFOLLOW ) = 0
      [pid 496980] utimensat(AT_FDCWD, "/home/ubuntu", [{tv_sec=1722513791, tv_nsec=0} /* 2024-08-01T14:03:11+0200 */, {tv_sec=1722513791, tv_nsec=0} /* 2024-08-01T14:03:11+0200 */], 0) = -1 EPERM (Operation not per... 

      A full docker-independent reproducer script will be attached.

      A friend worked out the kernel-side permission details; I'm documenting them here and filing the issue.

      To figure out what utimensat ends up calling, we search for ATTR_ATIME_SET and similar flags and find matches in llite, which, according to https://wiki.lustre.org/Understanding_Lustre_Internals, seems to be the abstraction layer between Linux and Lustre.

      Here, we see that ll_setattr_raw checks current_fsuid() against the inode's uid, and if those don't match, it checks for CAP_FOWNER:

              /* POSIX: check before ATTR_*TIME_SET set (from inode_change_ok) */
              if (attr->ia_valid & TIMES_SET_FLAGS) {  
                      if ((!uid_eq(current_fsuid(), inode->i_uid)) &&
                          !capable(CAP_FOWNER))
                              GOTO(clear, rc = -EPERM);
              } 

      (per https://www.man7.org/linux/man-pages/man7/capabilities.7.html

             CAP_FOWNER
                    •  Bypass permission checks on operations that normally
                       require the filesystem UID of the process to match the
                       UID of the file (e.g., chmod(2), utime(2)), excluding
                       those operations covered by CAP_DAC_OVERRIDE and
                       CAP_DAC_READ_SEARCH;
                    •  set inode flags (see FS_IOC_SETFLAGS(2const)) on
                       arbitrary files;
                    •  set Access Control Lists (ACLs) on arbitrary files;
                    •  ignore directory sticky bit on file deletion;
                    •  modify user extended attributes on sticky directory
                       owned by any user;
                    •  specify O_NOATIME for arbitrary files in open(2) and
                       fcntl(2). 

      )

       

      In the kernel source, it can be seen that capable() only checks against the initial namespace:

      bool capable(int cap)
      {
              return ns_capable(&init_user_ns, cap);
      } 

      The solution is probably to use the inode_owner_or_capable function, which is extended to handle namespaces:

      /**
      * inode_owner_or_capable - check current task permissions to inode
      * @mnt_userns: user namespace of the mount the inode was found from
      * @inode: inode being checked
      *
      * Return true if current either has CAP_FOWNER in a namespace with the
      * inode owner uid mapped, or owns the file.
      *
      * If the inode has been found through an idmapped mount the user namespace of
      * the vfsmount must be passed through @mnt_userns. This function will then take
      * care to map the inode according to @mnt_userns before checking permissions.
      * On non-idmapped mounts or if permission checking is to be performed on the
      * raw inode simply passs init_user_ns.
      */
      bool inode_owner_or_capable(struct user_namespace *mnt_userns,
                                  const struct inode *inode)
      {
              kuid_t i_uid;
              struct user_namespace *ns;

              i_uid = i_uid_into_mnt(mnt_userns, inode);
              if (uid_eq(current_fsuid(), i_uid))
                      return true;

              ns = current_user_ns();
              if (kuid_has_mapping(ns, i_uid) && ns_capable(ns, CAP_FOWNER))
                      return true;
              return false;
      } 
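The difference between the two checks can be illustrated with a small userspace model. This is a toy sketch, not kernel or Lustre code: every `toy_` name is hypothetical, a user namespace is reduced to a single contiguous uid range, and capability lookup is simplified to "a capability held in a namespace also applies to its descendants". The host uid base of 100000 used in the usage note is an assumption; the report only states that uids 0-65535 are mapped.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for kernel types; all names are illustrative only. */
struct toy_userns {
        const struct toy_userns *parent; /* NULL for the initial namespace */
        uint32_t host_base;              /* first host-side uid that is mapped */
        uint32_t count;                  /* number of uids mapped */
};

struct toy_task {
        const struct toy_userns *ns;     /* namespace the task lives in */
        uint32_t fsuid;                  /* host-side fsuid of the task */
        bool has_cap_fowner;             /* CAP_FOWNER held in `ns` */
};

/* Does host-side uid `kuid` have a mapping inside `ns`? */
static bool toy_has_mapping(const struct toy_userns *ns, uint32_t kuid)
{
        return kuid >= ns->host_base && kuid - ns->host_base < ns->count;
}

/*
 * Simplified ns_capable(): the task's capability counts for `target` only
 * if `target` is the task's own namespace or one of its descendants.
 */
static bool toy_ns_capable(const struct toy_task *t,
                           const struct toy_userns *target)
{
        const struct toy_userns *ns;

        if (!t->has_cap_fowner)
                return false;
        for (ns = target; ns != NULL; ns = ns->parent)
                if (ns == t->ns)
                        return true;
        return false;
}

/* Model of the current check: capable() always asks about the initial ns. */
bool toy_old_check(const struct toy_task *t,
                   const struct toy_userns *init_ns, uint32_t i_uid)
{
        return t->fsuid == i_uid || toy_ns_capable(t, init_ns);
}

/* Model of inode_owner_or_capable() on a non-idmapped mount. */
bool toy_new_check(const struct toy_task *t, uint32_t i_uid)
{
        if (t->fsuid == i_uid)
                return true;
        return toy_has_mapping(t->ns, i_uid) && toy_ns_capable(t, t->ns);
}
```

With a task that is root in a child namespace (host fsuid 100000, CAP_FOWNER held in that namespace) and a file whose host-side uid is 101000, `toy_old_check` denies the operation while `toy_new_check` allows it. A file owned by an unmapped uid, such as host uid 0, is denied by both.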

      A note for novices: root in a user namespace does indeed get a full set of capabilities:

      $ unshare -U -r -m capsh --print
      Current: =ep
      Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
      Ambient set =
      Current IAB:
      Securebits: 00/0x0/1'b0 (no-new-privs=0)
       secure-noroot: no (unlocked)
       secure-no-suid-fixup: no (unlocked)
       secure-keep-caps: no (unlocked)
       secure-no-ambient-raise: no (unlocked)
      uid=0(root) euid=0(root)
      gid=0(root)
      groups=65534(nobody),65534(nobody),0(root)
      Guessed mode: UNCERTAIN (0)
      

      Given all this, are there likely to be other parts of the code with similar problems?

          Activity

            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-13791 [ LU-13791 ]

            adilger Andreas Dilger added a comment -

            One thing to check is whether the MDS is allowing CAP_FOWNER from the client, or whether the request is being rejected on the client.

            You can enable client capabilities on the MGS with a parameter:

            mgs# lctl set_param -P mdt.lfs-*.enable_cap_mask=+cap_fowner

            This was added in LU-13791, but if the second "named capabilities" patch is not in your branch, you would need to set the capability mask with a hex value (CAP_FOWNER=0x8).
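The hex value quoted for CAP_FOWNER follows from its capability number: CAP_FOWNER is capability 3 in linux/capability.h, and a capability mask has bit N set for capability N. A minimal sketch of that arithmetic, with hypothetical `example_` names and the constant repeated locally so it builds without kernel headers:

```c
#include <stdint.h>

/* CAP_FOWNER is capability number 3 in <linux/capability.h>; the value is
 * repeated here (under a hypothetical name) to avoid a header dependency. */
enum { EXAMPLE_CAP_FOWNER = 3 };

/* Convert a capability number to its single-bit mask (0 if out of range). */
uint64_t example_cap_to_mask(unsigned int cap)
{
        return cap < 64 ? UINT64_C(1) << cap : 0;
}
```

`example_cap_to_mask(EXAMPLE_CAP_FOWNER)` evaluates to 0x8, which matches the value given in the comment above.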
            jikim Jinseok Kim added a comment -

            Hello team,

            Are there any updates?

            Similar issues have been reproduced at Samsung Electronics in Korea.
            If there are any updates, please share them.

            [yr9.choi@agpu1443 ~]$ docker pull repo.samsungds.net/docker.io/ubuntu:20.04
            20.04: Pulling from ubuntu
            d9802f032d67: Extracting [==================================================>]  27.51MB/27.51MB
            failed to register layer: chtimes /var/cache/apt/archives/partial: operation not permitted 

            People

              core-lustre-triage Core Lustre Triage
              lstrbg Dan Trent