[LU-11074] Invalid argument reading file caps Created: 07/Jun/18 Updated: 03/Aug/18 Resolved: 18/Jul/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.4 |
| Fix Version/s: | Lustre 2.12.0, Lustre 2.10.5 |
| Type: | Bug | Priority: | Minor |
| Reporter: | SC Admin (Inactive) | Assignee: | John Hammond |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | centos 7.5, x86_64, OPA, zfs 0.7.9 |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
2.10.4 client seems to have introduced a regression from 2.10.3. we now see this message from clients:
Jun 7 06:33:32 john73 kernel: Invalid argument reading file caps for /home/fstars/dwf_prepipe/dwf_prepipe_processccd.py
Jun 7 10:55:40 bryan8 kernel: Invalid argument reading file caps for /bin/date
Jun 7 11:05:29 john75 kernel: Invalid argument reading file caps for /usr/bin/basename
Jun 7 11:51:29 john97 kernel: Invalid argument reading file caps for /usr/bin/id
Jun 7 11:51:29 john97 kernel: Invalid argument reading file caps for /apps/lmod/lmod/lmod/libexec/addto
the upshot of which is that those files then can't be exec'd by the kernel.
all our servers are now centos 7.4 and 2.10.4 + LU10988 lfsck patch, zfs 0.7.9.
for some files it seems to happen on all clients and be persistent. eg. all the 2.10.4 client nodes see this:
[root@john72 ~]# g++
-bash: /usr/bin/g++: Invalid argument
[root@john72 ~]# dmesg | tail -1
[616489.562465] Invalid argument reading file caps for /usr/bin/g++
and for other files it's transient. eg. the exe's on the nodes listed above all work again now:
[root@john97 ~]# /usr/bin/id
uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel)
g++ is interesting because it's hard-linked 4 times (to c++, ...), which might be part of why it persists? copying each of c++, g++, etc. to a separate (non-hardlinked) file is a workaround and lets it be exec'd again, but that doesn't explain all the other files that sometimes work and sometimes don't. apart from things like g++, the problem is rare, less than once per client per day.
as a workaround (so we can get all clients onto the more secure centos7.5) we'd like to run 2.10.3 on centos7.5 for a while, but it doesn't seem to work (looks to mount, but then ls says 'not a directory'). I don't suppose there's a patch or two that'll let 2.10.3 be functional on centos7.5?
thanks. cheers, |
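The copy workaround mentioned in the description (giving a hard-linked binary its own inode) could be sketched roughly as below. The helper name `break_hardlink` and the temporary-file suffix are illustrative assumptions, not anything from the ticket. Note that `cp -p` keeps mode, ownership and timestamps but does not copy extended attributes, so any real file capabilities on the original would not carry over.

```shell
#!/bin/bash
# Hypothetical helper: give "$1" its own inode by copying it and renaming the
# copy over the original name. Other hard links keep pointing at the old inode.
break_hardlink() {
    local f=$1 tmp
    tmp="$f.unlink.$$"                  # illustrative temporary name
    cp -p -- "$f" "$tmp" || return 1    # copy data, mode, ownership, timestamps
    mv -f -- "$tmp" "$f"                # rename within the same directory
}

# Only acts when file names are passed on the command line.
for f in "$@"; do
    links=$(stat -c %h -- "$f") || continue
    if [ "$links" -gt 1 ]; then
        echo "$f has $links links; replacing it with a private copy"
        break_hardlink "$f"
    fi
done
```

Invoked with no arguments the script does nothing; passing for example `/usr/bin/g++` would replace only that name and leave the other links untouched.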
| Comments |
| Comment by John Hammond [ 07/Jun/18 ] |
|
Hi Robin, Are you using any Linux Security Modules? Could you enable full debugging, clear the debug log, reproduce this, dump the log and attach? (You may need to increase the debug_mb parameter to get a full capture.) |
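For reference, the capture sequence requested here might look something like the sketch below. It assumes the usual `lctl` debug controls (`set_param debug`, `set_param debug_mb`, `clear`, `dk`); the buffer size of 1024 MB and the function names are assumptions chosen for illustration. By default the commands are only printed, so nothing is changed unless `DRY_RUN=0` is set.

```shell
#!/bin/bash
# Sketch of a debug capture. Commands are printed unless DRY_RUN=0, in which
# case they are executed (needs root on a Lustre client).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

capture_debug() {
    local out=$1; shift                 # remaining args: the reproducer command
    run lctl set_param debug_mb=1024    # assumed size; raise it if the log wraps
    run lctl set_param debug=-1         # enable all debug flags
    run lctl clear                      # empty the kernel debug buffer
    run "$@"                            # reproduce the failure
    run lctl dk "$out"                  # dump the debug buffer to a file
}

# Example: ./capture.sh /tmp/lu-11074.log /usr/bin/g++
if [ "$#" -gt 1 ]; then
    capture_debug "$@"
fi
```

Running it on an otherwise idle node, as the reporter suggests, keeps unrelated traffic out of the dump.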
| Comment by SC Admin (Inactive) [ 07/Jun/18 ] |
|
Hey John, no, not using any LSM. I'll gather the debug for eg. g++ when a node clears of jobs. otherwise there'll be lots of noise. cheers, |
| Comment by John Hammond [ 08/Jun/18 ] |
|
Which 7.5 kernel are you using? |
| Comment by Peter Jones [ 08/Jun/18 ] |
|
Robin Any idea how long it will take to get the debug logs? Peter |
| Comment by SC Admin (Inactive) [ 08/Jun/18 ] |
|
we're using 862.3.2 kernel, the latest AFAIK.
I'm being hesitant about debug logs 'cos I'm not 100% convinced it's a lustre bug. we definitely don't see this issue with rhel7.4 + 2.10.3, but the complication is that we use overlayfs over our root lustre filesystem. overlayfs changed a lot between 7.4 and 7.5 and I've re-patched it etc, but it might still be an overlayfs bug, or an overlayfs interaction with lustre that's now different vs. a pure lustre bug.
the thing that indicates it's maybe a real lustre issue is that we see the 'file caps' problem on all filesystems - /home, /apps, /fred(dagg) - and not just on /images (which is the only one with overlayfs over it). AFAIK the only thing these 4 filesystems share is the root inode, which is on overlayfs. it seems really unlikely that the node is healthy for all accesses via the root inode/dentry, and at the same time sees 'file caps' fail on one of the pure lustre filesystems, but I wanted to try a few things first. eg. patch the rhel 7.5 kernel with a bunch of stable capabilities namespace backports that rhel seem to have omitted... unfortunately that didn't fix it.
the g++ 'file caps' bug (the one that's trivial to reproduce) doesn't happen if I go directly to lustre, so there's definitely something wrong with overlayfs. I was sure I'd tried this before making this bug report, but I guess not.
however, g++ failing via overlayfs and working via lustre doesn't explain the much rarer fails direct to lustre on the other 3 filesystems (+/- that shared root inode). but I can't reproduce those at will - they are rare. so I don't see how I can get you a debug trace for those.
I can't figure out from 'git log v2_10_3..v2_10_4' on b2_10 which patch(es) make the lustre client work with rhel7.5's kernel. if there is one or two that you can point me at then that would help.
cheers, |
| Comment by Andreas Dilger [ 11/Jun/18 ] |
|
If you can't find which patch is the source of the problem, I'd suggest to use git bisect with your "good" reproducer (possibly run multiple times to ensure you don't get a false pass) to isolate the issue to a single patch. That will allow us to identify which patch introduced the problem and possibly see how it is interacting badly with overlayfs. |
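A bisect along these lines needs a test script that does not report a false pass on an intermittent failure. One minimal sketch of such a wrapper is below. The name `run_repeated`, the repeat count, and `reproducer.sh` are hypothetical, and in practice each bisect step would also have to rebuild and reload the Lustre client before the reproducer is run.

```shell
#!/bin/bash
# Run a command up to N times and fail on the first failing run, so that a
# single lucky pass is not mistaken for a good revision.
run_repeated() {
    local n=$1 i; shift
    for ((i = 1; i <= n; i++)); do
        "$@" || { echo "run $i of $n failed" >&2; return 1; }
    done
    return 0
}

# Hypothetical use as a bisect test script (reproducer.sh is an assumed name):
#   git bisect start v2_10_4 v2_10_3    # bad revision first, then good
#   git bisect run ./repeat.sh 20 ./reproducer.sh
if [ "$#" -gt 1 ]; then
    run_repeated "$@"
fi
```

`git bisect run` treats exit status 0 as good and 1 as bad, which is why the wrapper returns exactly those two values.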
| Comment by Peter Jones [ 11/Jun/18 ] |
|
Lai Can you please investigate? Thanks Peter |
| Comment by Peter Jones [ 11/Jun/18 ] |
|
Sorry - Lai, I intended that comment for another ticket |
| Comment by SC Admin (Inactive) [ 11/Jun/18 ] |
|
Hi, thanks for the activity on the bug, it is much appreciated. but unless you have a solid suspicion of what's wrong, then please don't work on this for now. I built 2.10.4 for centos7.4 on the weekend and have been rebooting clients into it since. hopefully I can work out from that if 'file caps' is a lustre 2.10.4 issue or a rhel7.5 kernel + overlayfs issue. sorry, I should have thought of doing that before... cheers, |
| Comment by SC Admin (Inactive) [ 27/Jun/18 ] |
|
Hi, I've finally had some time to look into this again. seems there's a regression with Lustre on the rhel/centos 7.5 kernel. the rhel/centos 7.4 kernel is fine, but the 7.5 kernel breaks Lustre when getting file capabilities from files with lots of hard links. a reproducer is:
# echo blah > a
# getcap a
# for f in {b..f}; do ln a $f; done
# getcap a
Failed to get capabilities of file `a' (Invalid argument)
# cat /sys/fs/lustre/version
2.10.4
# uname -a
Linux john5 3.10.0-862.3.3.el7.x86_64 #1 SMP Fri Jun 15 04:15:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
our 'real world' example is a g++ exe on Lustre with 4 hard links which always fails 'getcap', but the above reproducer (on a different Lustre fs with more MDTs) required more than 4 hard links to see the same problem. I went out to >200 hard links with the same example as above with Lustre 2.10.4 and centos 7.4 kernel, and it was fine. cheers, |
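Since the number of hard links needed to trigger the failure differed between filesystems, a small loop that adds links one at a time can find the threshold. This is only a sketch: the function name and the `GETCAP` override variable are assumptions added for illustration. It treats any stderr output from `getcap` as a failure rather than trusting the exit status, since some `getcap` versions may exit 0 even after printing the "Failed to get capabilities" message.

```shell
#!/bin/bash
# Add hard links to a scratch file one at a time and run getcap after each,
# stopping at the first link count where getcap writes an error to stderr.
# GETCAP is a hypothetical override so another command can stand in for getcap.
find_link_threshold() {
    local dir=$1 max=$2 i err
    local getcap_cmd=${GETCAP:-getcap}
    echo blah > "$dir/a" || return 2
    for ((i = 1; i <= max; i++)); do
        ln "$dir/a" "$dir/link.$i" || return 2
        # Capture stderr only; stdout is discarded.
        err=$("$getcap_cmd" "$dir/a" 2>&1 >/dev/null)
        if [ -n "$err" ]; then
            echo "getcap failed with $((i + 1)) links: $err"
            return 1
        fi
    done
    echo "no failure up to $((max + 1)) links"
    return 0
}

# Example: ./threshold.sh /path/to/empty/dir/on/lustre 200
if [ "$#" -eq 2 ]; then
    find_link_threshold "$1" "$2"
fi
```

The reported count includes the original name, so the six-name case in the reproducer above (`a` plus `b` through `f`) would show up as 6 links.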
| Comment by SC Admin (Inactive) [ 28/Jun/18 ] |
|
Hi, in case it wasn't clear, there's no overlayfs involved in the above reproducer at all - only Lustre. the node was booted into a server ramdisk image to do the testing. the reproducer is super-simple, but please let me know if you want me to gather debug logs from eg. 7.4 kernel + 2.10.4 and 7.5 kernel + 2.10.4 anyway. not hard for me to do. cheers, |
| Comment by John Hammond [ 28/Jun/18 ] |
|
Hi Robin, OK, thank you for your reproducer. It's reproducing the issue for me as well. There appear to be a few bugs here. I have a fix for one of them at https://review.whamcloud.com/32739. I believe this change will give you a workaround for the file caps issue. I am testing it locally now as well as looking at fixes for the other bugs. |
| Comment by SC Admin (Inactive) [ 29/Jun/18 ] |
|
Hi John, yeah, that seems to work for g++ with 862.3.3 kernel. thanks. nicely done. I'll roll it out onto a few nodes and keep an eye on them and see if it's also fixed the sporadic 'file caps' failures we were seeing. cheers, |
| Comment by SC Admin (Inactive) [ 30/Jun/18 ] |
|
Hi John, after booting a few nodes into this, I'm still seeing the occasional 'file caps' failure so yeah, you're right - there's more bugs in this area somewhere. cheers, |
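To keep track of how often the sporadic failures still occur after a patch, the kernel messages can be tallied per path. The sketch below assumes the message text quoted in the description ("Invalid argument reading file caps for ..."); the function name is illustrative.

```shell
#!/bin/bash
# Tally "reading file caps" kernel messages per path from log text on stdin,
# most frequent path first.
tally_file_caps() {
    sed -n 's/.*Invalid argument reading file caps for \(.*\)$/\1/p' |
        sort | uniq -c | sort -rn
}

# With one file argument, read that log (e.g. a saved copy of dmesg output).
if [ "$#" -eq 1 ]; then
    tally_file_caps < "$1"
fi
```

Piping `dmesg` or a syslog file through `tally_file_caps` on each client gives a quick count to compare before and after booting into a patched build.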
| Comment by John Hammond [ 02/Jul/18 ] |
|
Yes, I believe that |
| Comment by Gerrit Updater [ 18/Jul/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32739/ |
| Comment by Peter Jones [ 18/Jul/18 ] |
|
Landed for 2.12 |
| Comment by SC Admin (Inactive) [ 30/Jul/18 ] |
|
just to follow up, this and cheers, |
| Comment by Gerrit Updater [ 30/Jul/18 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32901 |
| Comment by Gerrit Updater [ 03/Aug/18 ] |
|
John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/32901/ |