Details
- Bug
- Resolution: Unresolved
- Minor
Description
xref https://jira.whamcloud.com/browse/LUDOC-533 (Current intended level of support for Linux (user) namespaces)
xref https://github.com/moby/moby/issues/48413 ([Lustre filesystem] dockerd rootless: Order of chtimes and chmod forces requirement on CAP_FOWNER)
See the moby issue for the user-facing trigger of this bug, where rootless docker fails to set file time attributes. A short strace summary follows; it was taken as root in a user namespace with uids 0-65535 mapped:
[pid 496980] fchownat(AT_FDCWD, "/home/ubuntu", 1000, 1000, AT_SYMLINK_NOFOLLOW ) = 0
[pid 496980] utimensat(AT_FDCWD, "/home/ubuntu", [{tv_sec=1722513791, tv_nsec=0} /* 2024-08-01T14:03:11+0200 */, {tv_sec=1722513791, tv_nsec=0} /* 2024-08-01T14:03:11+0200 */], 0) = -1 EPERM (Operation not per...
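For anyone who wants to poke at this without docker, the two calls from the trace can be replayed from a few lines of C. This is only a minimal sketch, not the reproducer script mentioned below: the helper name chown_then_utimens is made up for illustration, and the path, ids and timestamp are whatever the caller passes in.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: replay the sequence seen in the strace.
 * First change ownership, then set atime and mtime to `when`.
 * Returns 0 on success, or the errno of the first call that fails.
 */
int chown_then_utimens(const char *path, uid_t uid, gid_t gid, time_t when)
{
    assert(path != NULL);

    /* times[0] is atime, times[1] is mtime */
    struct timespec times[2] = {
        { .tv_sec = when, .tv_nsec = 0 },
        { .tv_sec = when, .tv_nsec = 0 },
    };

    if (fchownat(AT_FDCWD, path, uid, gid, AT_SYMLINK_NOFOLLOW) != 0)
        return errno;

    /* Explicit timestamps take the ATTR_*TIME_SET path in the kernel. */
    if (utimensat(AT_FDCWD, path, times, 0) != 0)
        return errno;

    return 0;
}
```

Called as root inside a user namespace on a Lustre mount, with a uid other than the caller's own, the expectation from the trace above is that the helper returns EPERM from the utimensat step. On a local filesystem it should return 0.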
A full docker-independent reproducer script will be attached.
A friend figured out the kernel-side permission logic; I'm documenting it here and filing the issue.
To figure out what utimensat ends up calling, we search for ATTR_ATIME_SET and similar flags and find matches in llite, which according to https://wiki.lustre.org/Understanding_Lustre_Internals seems to be the abstraction layer between Linux and Lustre.
Here, ll_setattr_raw checks current_fsuid() against the inode's uid and, if they don't match, checks for CAP_FOWNER:
/* POSIX: check before ATTR_*TIME_SET set (from inode_change_ok) */
if (attr->ia_valid & TIMES_SET_FLAGS) {
        if ((!uid_eq(current_fsuid(), inode->i_uid)) &&
            !capable(CAP_FOWNER))
                GOTO(clear, rc = -EPERM);
}
(per https://www.man7.org/linux/man-pages/man7/capabilities.7.html
CAP_FOWNER
• Bypass permission checks on operations that normally
require the filesystem UID of the process to match the
UID of the file (e.g., chmod(2), utime(2)), excluding
those operations covered by CAP_DAC_OVERRIDE and
CAP_DAC_READ_SEARCH;
• set inode flags (see FS_IOC_SETFLAGS(2const)) on
arbitrary files;
• set Access Control Lists (ACLs) on arbitrary files;
• ignore directory sticky bit on file deletion;
• modify user extended attributes on sticky directory
owned by any user;
• specify O_NOATIME for arbitrary files in open(2) and
fcntl(2).
)
In the kernel source, capable() checks only against the initial user namespace (init_user_ns):
bool capable(int cap)
{
        return ns_capable(&init_user_ns, cap);
}
The solution is probably to use the inode_owner_or_capable function, which is extended to handle namespaces:
/**
 * inode_owner_or_capable - check current task permissions to inode
 * @mnt_userns: user namespace of the mount the inode was found from
 * @inode: inode being checked
 *
 * Return true if current either has CAP_FOWNER in a namespace with the
 * inode owner uid mapped, or owns the file.
 *
 * If the inode has been found through an idmapped mount the user namespace of
 * the vfsmount must be passed through @mnt_userns. This function will then take
 * care to map the inode according to @mnt_userns before checking permissions.
 * On non-idmapped mounts or if permission checking is to be performed on the
 * raw inode simply passs init_user_ns.
 */
bool inode_owner_or_capable(struct user_namespace *mnt_userns,
                            const struct inode *inode)
{
        kuid_t i_uid;
        struct user_namespace *ns;

        i_uid = i_uid_into_mnt(mnt_userns, inode);
        if (uid_eq(current_fsuid(), i_uid))
                return true;

        ns = current_user_ns();
        if (kuid_has_mapping(ns, i_uid) && ns_capable(ns, CAP_FOWNER))
                return true;
        return false;
}
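To make the difference between the two checks concrete, here is a small user-space model. It is a sketch under heavy assumptions, not kernel code: every toy_* name is invented for illustration, a namespace is reduced to one contiguous range of kernel uids, ancestor namespaces and idmapped mounts are ignored, and the host-side uid range starting at 100000 in the usage note is just an example value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user namespace: one contiguous range of kernel uids mapped into it. */
struct toy_userns {
    unsigned int first_kuid;
    unsigned int count;
};

/* Toy task: its namespace, kernel-side fsuid, and CAP_FOWNER within that ns. */
struct toy_task {
    const struct toy_userns *ns;
    unsigned int fsuid;
    bool cap_fowner;
};

/* Stand-in for init_user_ns: every kernel uid is mapped. */
const struct toy_userns toy_init_ns = { 0, 4294967295u };

/* Simplified ns_capable(): the capability only counts in the task's own ns. */
bool toy_ns_capable(const struct toy_task *t, const struct toy_userns *target)
{
    assert(t != NULL && target != NULL);
    return t->ns == target && t->cap_fowner;
}

/* Simplified kuid_has_mapping(): is the kernel uid inside the ns range? */
bool toy_kuid_has_mapping(const struct toy_userns *ns, unsigned int kuid)
{
    return kuid >= ns->first_kuid && kuid - ns->first_kuid < ns->count;
}

/* Shape of the llite check: owner match, else CAP_FOWNER in the init ns. */
bool toy_llite_times_check(const struct toy_task *t, unsigned int inode_kuid)
{
    return t->fsuid == inode_kuid || toy_ns_capable(t, &toy_init_ns);
}

/* Shape of inode_owner_or_capable(): owner match, else the inode owner is
 * mapped in the task's ns and the task has CAP_FOWNER in that ns. */
bool toy_inode_owner_or_capable(const struct toy_task *t,
                                unsigned int inode_kuid)
{
    if (t->fsuid == inode_kuid)
        return true;
    return toy_kuid_has_mapping(t->ns, inode_kuid) &&
           toy_ns_capable(t, t->ns);
}
```

With a namespace { 100000, 65536 } and a namespace-root task { &ns, 100000, true }, a file owned by kernel uid 101000 (uid 1000 inside the namespace) is rejected by toy_llite_times_check but accepted by toy_inode_owner_or_capable, which matches the behaviour described above. A file owned by an unmapped uid is still rejected by both.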
A note for novices: root in a user namespace does indeed get a full set of capabilities:
$ unshare -U -r -m capsh --print
Current: =ep
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
Ambient set =
Current IAB:
Securebits: 00/0x0/1'b0 (no-new-privs=0)
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
secure-no-ambient-raise: no (unlocked)
uid=0(root) euid=0(root)
gid=0(root)
groups=65534(nobody),65534(nobody),0(root)
Guessed mode: UNCERTAIN (0)
Given all this, it's likely that other parts of the code have similar problems.
Attachments
Issue Links
- is related to: LU-13791 Capabilities are not effective (Resolved)
One thing to check is whether the MDS is allowing CAP_FOWNER from the client, or whether this is being rejected on the client.
You can enable client capabilities on the MGS with a parameter:
This was added in LU-13791, but if the second "named capabilities" patch is not in your branch, you would need to set the capability mask as a hex value (CAP_FOWNER=0x8).
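For reference, the 0x8 comes from CAP_FOWNER being capability number 3 in <linux/capability.h>, so its mask bit is 1 << 3. A minimal sketch of that arithmetic follows; the helper names and the SKETCH_CAP_FOWNER macro are made up here, and how the resulting value is passed to the parameter is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: mirrors the value of CAP_FOWNER in <linux/capability.h>. */
#define SKETCH_CAP_FOWNER 3

/* Convert a capability number into its single-bit mask. */
uint64_t cap_to_mask(int cap)
{
    assert(cap >= 0 && cap < 64);
    return UINT64_C(1) << cap;
}

/* Render the mask as a hex string such as "0x8"; returns snprintf's result. */
int format_cap_mask(int cap, char *buf, size_t len)
{
    return snprintf(buf, len, "0x%llx",
                    (unsigned long long)cap_to_mask(cap));
}
```

Several capabilities can be combined by OR-ing their masks together before formatting.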