Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.5
    • Affects Version/s: Lustre 2.10.4
    • Labels: None
    • Environment: centos 7.5, x86_64, OPA, zfs 0.7.9
    • Severity: 3

    Description

      2.10.4 client seems to have introduced a regression from 2.10.3.

      we now see this message from clients

      Jun  7 06:33:32 john73 kernel: Invalid argument reading file caps for /home/fstars/dwf_prepipe/dwf_prepipe_processccd.py
      Jun  7 10:55:40 bryan8 kernel: Invalid argument reading file caps for /bin/date
      Jun  7 11:05:29 john75 kernel: Invalid argument reading file caps for /usr/bin/basename
      Jun  7 11:51:29 john97 kernel: Invalid argument reading file caps for /usr/bin/id
      Jun  7 11:51:29 john97 kernel: Invalid argument reading file caps for /apps/lmod/lmod/lmod/libexec/addto
      

      the upshot of which is that those files then can't be exec'd by the kernel.
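To see which files are hit most often on a node, the kernel messages above can be tallied per path. This is only a rough sketch: the function name `caps_failures` is made up for illustration, and it assumes the message text is exactly "Invalid argument reading file caps for <path>" as in the log lines quoted in this report.

```shell
# caps_failures: read kernel log lines on stdin and print "count path"
# for every file named in a "reading file caps" EINVAL message,
# most frequent first. (Hypothetical helper, not part of Lustre.)
caps_failures() {
    sed -n 's/.*Invalid argument reading file caps for //p' |
        sort |
        uniq -c |
        sed 's/^ *//' |
        sort -k1,1nr -k2
}
```

Typical use would be something like `dmesg | caps_failures`, or feeding it the syslog file from each client.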

      all our servers are now centos 7.4 and 2.10.4 + LU-10988 lfsck patch, zfs 0.7.9.
      we have 4 lustre filesystems in the cluster and this 'fail caps' issue happens on them all. more on the root filesystem because there are more exe's there.

      for some files it seems to happen on all clients and be persistent eg. all the 2.10.4 client nodes see this

      [root@john72 ~]# g++
      -bash: /usr/bin/g++: Invalid argument
      [root@john72 ~]# dmesg | tail -1
      [616489.562465] Invalid argument reading file caps for /usr/bin/g++
      

      and for other files it's transient. eg. the exe's on the nodes listed above all work again now

      [root@john97 ~]# /usr/bin/id
      uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel)
      

      g++ is interesting because it's hard-linked 4 times (to c++, ...), which might be part of why it persists? copying each of c++, g++, etc. to a separate (non-hardlinked) file is a workaround and lets it be exec'd again, but that doesn't explain all the other files that sometimes work and sometimes don't.
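The copy workaround for a hard-linked exe could look roughly like the sketch below. `break_hardlink` is a made-up helper name, and it assumes GNU `cp`, `mv` and `mktemp`; note that `cp -p` keeps mode, ownership and timestamps but does not carry extended attributes across.

```shell
# break_hardlink FILE: replace FILE with a private copy so that it no
# longer shares an inode with its other hard links.
# (Hypothetical helper name; a sketch of the workaround only.)
break_hardlink() {
    local f=$1 tmp
    # Create the temporary copy next to the original so mv stays on
    # the same filesystem.
    tmp=$(mktemp "$f.XXXXXX") || return 1
    if cp -p -- "$f" "$tmp"; then
        mv -f -- "$tmp" "$f"
    else
        rm -f -- "$tmp"
        return 1
    fi
}
```

After `break_hardlink /usr/bin/g++`, `stat -c %h /usr/bin/g++` should report a link count of 1, and the remaining names keep sharing the old inode.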

      apart from things like g++, the problem is rare, less than once per client per day.

      as a workaround (so we can get all clients onto the more secure centos7.5) we'd like to run 2.10.3 on centos7.5 for a while, but it doesn't seem to work (looks to mount, but then ls says 'not a directory'). I don't suppose there's a patch or two that'll let 2.10.3 be functional on centos7.5? thanks.

      cheers,
      robin

          Activity

            [LU-11074] Invalid argument reading file caps
            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32739/
            Subject: LU-11074 mdc: set correct body eadatasize for getxattr()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: dea1cde92014545d97406bf8adba20840abdb1a9

            jhammond John Hammond added a comment -

            Yes, I believe that LU-11107 is the real issue. https://review.whamcloud.com/32739 should just reduce your chances of hitting it.

            scadmin SC Admin added a comment -

            Hi John,

            after booting a few nodes into this, I'm still seeing the occasional 'file caps' failure so yeah, you're right - there's more bugs in this area somewhere.

            cheers,
            robin

            scadmin SC Admin added a comment -

            Hi John,

            yeah, that seems to work for g++ with 862.3.3 kernel. thanks. nicely done

            I'll roll it out onto a few nodes and keep an eye on them and see if it's also fixed the sporadic 'file caps' failures we were seeing.

            cheers,
            robin

            jhammond John Hammond added a comment -

            Hi Robin,

            OK, thank you for your reproducer. It's reproducing the issue for me as well. There appear to be a few bugs here. I have a fix for one of them at https://review.whamcloud.com/32739. I believe this change will give you a workaround for the file caps issue. I am testing it locally now as well as looking at fixes for the other bugs.

            scadmin SC Admin added a comment -

            Hi,

            in case it wasn't clear, there's no overlayfs involved in the above reproducer at all - only Lustre. the node was booted into a server ramdisk image to do the testing.

            the reproducer is super-simple, but please let me know if you want me to gather debug logs from eg. 7.4 kernel + 2.10.4 and 7.5 kernel + 2.10.4 anyway. not hard for me to do.

            cheers,
            robin

            scadmin SC Admin added a comment -

            Hi,

            I've finally had some time to look into this again. seems there's a regression with Lustre on the rhel/centos 7.5 kernel.

            the rhel/centos 7.4 kernel is fine, but the 7.5 kernel breaks Lustre when getting file capabilities from files with lots of hard links.

            a reproducer is:

            # echo blah > a
            # getcap a
            # for f in {b..f}; do ln a $f; done
            # getcap a
            Failed to get capabilities of file `a' (Invalid argument)
            # cat /sys/fs/lustre/version 
            2.10.4
            # uname -a
            Linux john5 3.10.0-862.3.3.el7.x86_64 #1 SMP Fri Jun 15 04:15:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
            

            our 'real world' example is a g++ exe on Lustre with 4 hard links which always fails 'getcap', but the above reproducer (on a different Lustre fs with more MDTs) required more than 4 hard links to see the same problem.

            I went out to >200 hard links with the same example as above with Lustre 2.10.4 and centos 7.4 kernel, and it was fine.
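Since the number of hard links needed to trigger the failure differed between filesystems, a loop that adds links one at a time and stops at the first failure may be handy. The following is a sketch under a few assumptions: the function names and the `PROBE_CMD` variable are invented for illustration, and the failure is detected by matching the "Invalid argument" text that `getcap` prints, rather than by its exit status.

```shell
# probe_caps FILE: succeed unless the capability lookup reports EINVAL.
# PROBE_CMD (hypothetical knob) defaults to getcap; its output text is
# checked because the exit status is not relied on here.
probe_caps() {
    ! "${PROBE_CMD:-getcap}" "$1" 2>&1 | grep -q 'Invalid argument'
}

# first_failing_links DIR MAX: create DIR/a, then probe it with 1, 2, ...
# MAX hard links. Print the link count at which the probe first fails,
# or "none" if it never fails.
first_failing_links() {
    local dir=$1 max=$2 n
    echo blah > "$dir/a" || return 1
    for ((n = 1; n <= max; n++)); do
        # At this point DIR/a has exactly n links.
        if ! probe_caps "$dir/a"; then
            echo "$n"
            return 0
        fi
        ln "$dir/a" "$dir/link$n" || return 1
    done
    echo none
}
```

Run it against an empty scratch directory on the Lustre filesystem under test, e.g. `first_failing_links /path/to/scratch 200`, where the path is a placeholder.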

            cheers,
            robin

            scadmin SC Admin added a comment -

            Hi,

            thanks for the activity on the bug, it is much appreciated. but unless you have a solid suspicion of what's wrong, then please don't work on this for now.

            I built 2.10.4 for centos7.4 on the weekend and have been rebooting clients into it since.

            hopefully I can work out from that if 'file caps' is a lustre 2.10.4 issue or a rhel7.5 kernel + overlayfs issue.

            sorry, I should have thought of doing that before...

            cheers,
            robin

            pjones Peter Jones added a comment -

            Sorry - Lai, I intended that comment for another ticket

            pjones Peter Jones added a comment -

            Lai

            Can you please investigate?

            Thanks

            Peter


            People

              jhammond John Hammond
              scadmin SC Admin