[LU-11074] Invalid argument reading file caps - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.12.0, Lustre 2.10.5
Affects Version/s: Lustre 2.10.4
Labels: None
Environment: centos 7.5, x86_64, OPA, zfs 0.7.9
Severity: 3
Description

2.10.4 client seems to have introduced a regression from 2.10.3.

we now see this message from clients

Jun  7 06:33:32 john73 kernel: Invalid argument reading file caps for /home/fstars/dwf_prepipe/dwf_prepipe_processccd.py
Jun  7 10:55:40 bryan8 kernel: Invalid argument reading file caps for /bin/date
Jun  7 11:05:29 john75 kernel: Invalid argument reading file caps for /usr/bin/basename
Jun  7 11:51:29 john97 kernel: Invalid argument reading file caps for /usr/bin/id
Jun  7 11:51:29 john97 kernel: Invalid argument reading file caps for /apps/lmod/lmod/lmod/libexec/addto

the upshot of which is that those files then can't be exec'd by the kernel.
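For context, the kernel prints this message when it fails to read the `security.capability` extended attribute while preparing to exec a file. The following is a minimal sketch, assuming Python 3 on Linux, that probes the same xattr from user space. The helper name `probe_file_caps` is hypothetical, and a user-space `getxattr()` may not take exactly the same path through the client as the kernel's exec-time read, so a clean result here does not rule out the bug.

```python
import errno
import os

# Name of the xattr that stores file capabilities on Linux.
XATTR_CAPS = "security.capability"


def probe_file_caps(path):
    """Try to read the file-capability xattr of `path`.

    Returns a (status, detail) tuple:
      ("caps", <length in bytes>)  - the xattr exists and was read
      ("no-caps", None)            - the xattr is absent (the normal case)
      ("unsupported", None)        - the filesystem does not support it
      ("error", <errno name>)      - anything else, e.g. "EINVAL"
    """
    try:
        value = os.getxattr(path, XATTR_CAPS)
    except OSError as exc:
        if exc.errno == errno.ENODATA:
            return ("no-caps", None)
        if exc.errno == errno.ENOTSUP:
            return ("unsupported", None)
        return ("error", errno.errorcode.get(exc.errno, str(exc.errno)))
    return ("caps", len(value))


if __name__ == "__main__":
    import sys

    # Usage: python probe_caps.py /usr/bin/g++ /usr/bin/id ...
    for p in sys.argv[1:]:
        print(p, probe_file_caps(p))
```

Most executables carry no capabilities, so `("no-caps", None)` is the expected answer; an `("error", "EINVAL")` result would correspond to the "Invalid argument" symptom above.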

all our servers are now centos 7.4 and 2.10.4 + LU-10988 lfsck patch, zfs 0.7.9.
we have 4 lustre filesystems in the cluster and this 'fail caps' issue happens on them all. more on the root filesystem because there are more exe's there.

for some files it seems to happen on all clients and be persistent eg. all the 2.10.4 client nodes see this

[root@john72 ~]# g++
-bash: /usr/bin/g++: Invalid argument
[root@john72 ~]# dmesg | tail -1
[616489.562465] Invalid argument reading file caps for /usr/bin/g++

and for other files it's transient. eg. the exe's on the nodes listed above all work again now

[root@john97 ~]# /usr/bin/id
uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel)

g++ is interesting because it's hard-linked 4 times (to c++, ...), which might be part of why it persists? copying each of c++, g++ etc. to a separate (non-hardlinked) file is a workaround and lets it be exec'd again, but that doesn't explain all the other files that sometimes work and sometimes don't.
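The copy workaround described above can be sketched roughly as follows, assuming Python 3 on Linux. The function name `break_hard_link` is hypothetical; it replaces a multiply-linked regular file with a private copy in the same directory. `shutil.copy2` carries over data, mode and timestamps but not ownership, so a real deployment would need to restore the owner separately.

```python
import os
import shutil
import stat
import tempfile


def break_hard_link(path):
    """Replace `path` with a private copy if it has more than one hard link.

    Returns True if the file was replaced, False if nothing was done
    (not a regular file, or already the only link to its inode).
    """
    st = os.lstat(path)
    if not stat.S_ISREG(st.st_mode) or st.st_nlink <= 1:
        return False

    # Create the copy next to the original so os.replace() stays on
    # one filesystem and swaps the directory entry atomically.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        shutil.copy2(path, tmp)  # data, permission bits, timestamps
        os.replace(tmp, path)    # `path` now points at the new inode
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return True
```

The other names that shared the old inode are left untouched, so each of them would need the same treatment to end up as separate files.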

apart from things like g++, the problem is rare, less than once per client per day.

as a workaround (so we can get all clients onto the more secure centos7.5) we'd like to run 2.10.3 on centos7.5 for a while, but it doesn't seem to work (looks to mount, but then ls says 'not a directory'). I don't suppose there's a patch or two that'll let 2.10.3 be functional on centos7.5? thanks.

cheers,
robin

Issue Links

is related to

LU-11123 LustreError in ll_xattr_list() server bug: replied size 236 > 132 (Resolved)
LU-11107 getxattr() returns 0 length values for nonexistent xattrs (with xattr_cache=0) (Resolved)

Activity

Gerrit Updater added a comment - 03/Aug/18 8:07 PM

John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/32901/
Subject: LU-11074 mdc: set correct body eadatasize for getxattr()
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: f99f9345e46b5b19a8dca2aae4d348c99d8e2481

Gerrit Updater added a comment - 30/Jul/18 4:22 PM

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32901
Subject: LU-11074 mdc: set correct body eadatasize for getxattr()
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: c8bf7d0fb95618a06a493228707cd1e830da78f8

SC Admin added a comment - 30/Jul/18 4:07 PM

just to follow up, this and LU-11107 have fixed the issue for us.
thanks!

cheers,
robin

Peter Jones added a comment - 18/Jul/18 12:49 PM

Landed for 2.12

Gerrit Updater added a comment - 18/Jul/18 6:01 AM

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32739/
Subject: LU-11074 mdc: set correct body eadatasize for getxattr()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: dea1cde92014545d97406bf8adba20840abdb1a9

John Hammond added a comment - 02/Jul/18 1:12 PM

Yes, I believe that LU-11107 is the real issue. https://review.whamcloud.com/32739 should just reduce your chances of hitting it.

SC Admin added a comment - 30/Jun/18 9:10 AM

Hi John,

after booting a few nodes into this, I'm still seeing the occasional 'file caps' failure so yeah, you're right - there's more bugs in this area somewhere.

cheers,
robin

SC Admin added a comment - 29/Jun/18 1:59 PM

Hi John,

yeah, that seems to work for g++ with 862.3.3 kernel. thanks. nicely done

I'll roll it out onto a few nodes and keep an eye on them and see if it's also fixed the sporadic 'file caps' failures we were seeing.

cheers,
robin

John Hammond added a comment - 28/Jun/18 6:51 PM

Hi Robin,

OK, thank you for your reproducer. It's reproducing the issue for me as well. There appear to be a few bugs here. I have a fix for one of them at https://review.whamcloud.com/32739. I believe this change will give you a workaround for the file caps issue. I am testing it locally now as well as looking at fixes for the other bugs.

SC Admin added a comment - 28/Jun/18 6:03 AM

Hi,

in case it wasn't clear, there's no overlayfs involved in the above reproducer at all - only Lustre. the node was booted into a server ramdisk image to do the testing.

the reproducer is super-simple, but please let me know if you want me to gather debug logs from eg. 7.4 kernel + 2.10.4 and 7.5 kernel + 2.10.4 anyway. not hard for me to do.

cheers,
robin

People

Assignee: John Hammond

Reporter: SC Admin

Votes: 0

Watchers: 6

Dates

Created: 07/Jun/18 9:34 AM

Updated: 03/Aug/18 8:37 PM

Resolved: 18/Jul/18 12:49 PM