Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.10.4
-
None
-
centos 7.5, x86_64, OPA, zfs 0.7.9
-
3
Description
2.10.4 client seems to have introduced a regression from 2.10.3.
we now see this message from clients
Jun 7 06:33:32 john73 kernel: Invalid argument reading file caps for /home/fstars/dwf_prepipe/dwf_prepipe_processccd.py
Jun 7 10:55:40 bryan8 kernel: Invalid argument reading file caps for /bin/date
Jun 7 11:05:29 john75 kernel: Invalid argument reading file caps for /usr/bin/basename
Jun 7 11:51:29 john97 kernel: Invalid argument reading file caps for /usr/bin/id
Jun 7 11:51:29 john97 kernel: Invalid argument reading file caps for /apps/lmod/lmod/lmod/libexec/addto
the upshot of which is that those files then can't be exec'd by the kernel.
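as far as I understand it, the kernel prints this message when its exec-time read of the file's security.capability xattr comes back with EINVAL. a minimal sketch for probing the same xattr from userspace, assuming python 3 on linux. the helper name probe_file_caps is made up for illustration, and it's only an assumption that a userspace read goes down the same path as the kernel's read at exec time:

```python
import errno
import os


def probe_file_caps(path):
    """Try to read security.capability on path; return (status, detail)."""
    try:
        value = os.getxattr(path, "security.capability")
    except OSError as e:
        if e.errno == errno.ENODATA:
            # the normal case for most binaries: no file capabilities set
            return ("none", "no file capabilities set")
        if e.errno == errno.EINVAL:
            # the same errno the kernel message reports
            return ("einval", "Invalid argument reading xattr")
        # anything else (ENOENT, ENOTSUP, EACCES, ...)
        return ("error", os.strerror(e.errno))
    return ("set", "%d bytes" % len(value))


if __name__ == "__main__":
    # paths taken from the log lines in this report
    for p in ("/usr/bin/g++", "/usr/bin/id", "/bin/date"):
        print(p, probe_file_caps(p))
```

running this on an affected client and an unaffected one for the same path might show whether the xattr read itself differs between them.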
all our servers are now centos 7.4 and 2.10.4 + LU-10988 lfsck patch, zfs 0.7.9.
we have 4 lustre filesystems in the cluster and this 'fail caps' issue happens on them all. more on the root filesystem because there are more exe's there.
for some files it seems to happen on all clients and be persistent eg. all the 2.10.4 client nodes see this
[root@john72 ~]# g++
-bash: /usr/bin/g++: Invalid argument
[root@john72 ~]# dmesg | tail -1
[616489.562465] Invalid argument reading file caps for /usr/bin/g++
and for other files it's transient. eg. the exe's on the nodes listed above all work again now
[root@john97 ~]# /usr/bin/id
uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel)
g++ is interesting because it's hard-linked 4 times (to c++, ...), which might be part of why it persists? copying each of c++, g++ etc. to a separate (non-hardlinked) file is a workaround and lets it be exec'd again, but that doesn't explain all the other files that sometimes work and sometimes don't.
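to check whether other persistently failing files are also hard-linked, something like the following could list the link count and the sibling names in the same directory. this is only a sketch, assuming python 3. hardlink_siblings is a hypothetical helper, and it only looks in the file's own directory, so links living elsewhere would be missed:

```python
import os


def hardlink_siblings(path):
    """Return (link count, sorted names in the same directory sharing the inode)."""
    real = os.path.realpath(path)
    st = os.stat(real)
    names = []
    for entry in os.scandir(os.path.dirname(real)):
        try:
            est = entry.stat(follow_symlinks=False)
        except OSError:
            # entry vanished or is unreadable; skip it
            continue
        if (est.st_dev, est.st_ino) == (st.st_dev, st.st_ino):
            names.append(entry.name)
    return st.st_nlink, sorted(names)


if __name__ == "__main__":
    nlink, names = hardlink_siblings("/usr/bin/g++")
    print("links:", nlink, "names:", names)
```

if the link count is larger than the number of names found, the remaining links are in other directories.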
apart from things like g++, the problem is rare, less than once per client per day.
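to put a number on "rare", the syslog lines could be tallied per client per day. a rough sketch, assuming python 3 and that the messages keep the syslog format shown in this report. tally_caps_failures and the regex are my own illustration, not anything from lustre:

```python
import re
from collections import Counter

# matches lines like:
# Jun 7 11:51:29 john97 kernel: Invalid argument reading file caps for /usr/bin/id
PATTERN = re.compile(
    r"^(\w{3})\s+(\d+)\s+[\d:]+\s+(\S+)\s+kernel:\s+"
    r"Invalid argument reading file caps for (.+)$"
)


def tally_caps_failures(lines):
    """Count 'reading file caps' failures keyed by (host, 'Mon D')."""
    counts = Counter()
    for line in lines:
        m = PATTERN.match(line.strip())
        if m:
            month, day, host, _path = m.groups()
            counts[(host, "%s %s" % (month, day))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Jun 7 10:55:40 bryan8 kernel: Invalid argument reading file caps for /bin/date",
        "Jun 7 11:51:29 john97 kernel: Invalid argument reading file caps for /usr/bin/id",
        "Jun 7 11:51:29 john97 kernel: Invalid argument reading file caps for /apps/lmod/lmod/lmod/libexec/addto",
    ]
    for (host, day), n in sorted(tally_caps_failures(sample).items()):
        print(host, day, n)
```

feeding it the collected /var/log/messages from all clients would separate the persistent cases like g++ from the transient ones.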
as a workaround (so we can get all clients onto the more secure centos7.5) we'd like to run 2.10.3 on centos7.5 for a while, but it doesn't seem to work (looks to mount, but then ls says 'not a directory'). I don't suppose there's a patch or two that'll let 2.10.3 be functional on centos7.5? thanks.
cheers,
robin