Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.0, Lustre 2.10.5
    • Lustre 2.10.4
    • None
    • centos 7.5, x86_64, OPA, zfs 0.7.9
    • 3

    Description

      2.10.4 client seems to have introduced a regression from 2.10.3.

      we now see this message from clients

      Jun  7 06:33:32 john73 kernel: Invalid argument reading file caps for /home/fstars/dwf_prepipe/dwf_prepipe_processccd.py
      Jun  7 10:55:40 bryan8 kernel: Invalid argument reading file caps for /bin/date
      Jun  7 11:05:29 john75 kernel: Invalid argument reading file caps for /usr/bin/basename
      Jun  7 11:51:29 john97 kernel: Invalid argument reading file caps for /usr/bin/id
      Jun  7 11:51:29 john97 kernel: Invalid argument reading file caps for /apps/lmod/lmod/lmod/libexec/addto
      

      the upshot of which is that those files then can't be exec'd by the kernel.
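
To get a feel for how often this hits, and on which nodes, the syslog lines above can be tallied per host and per path. This is only a sketch: the function name `tally_caps_failures` and the regular expression are illustrative assumptions based on the line format quoted in this report, not part of any Lustre tooling.

```python
import re
from collections import Counter

# Matches syslog lines of the form quoted in this report, e.g.
# "Jun  7 10:55:40 bryan8 kernel: Invalid argument reading file caps for /bin/date"
CAPS_LINE = re.compile(
    r"^\s*\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+kernel:\s+"
    r"Invalid argument reading file caps for (?P<path>.+?)\s*$"
)


def tally_caps_failures(lines):
    """Count 'file caps' failures per client host and per file path."""
    by_host = Counter()
    by_path = Counter()
    for line in lines:
        match = CAPS_LINE.match(line)
        if match:  # lines in any other format are ignored
            by_host[match.group("host")] += 1
            by_path[match.group("path")] += 1
    return by_host, by_path


# Three of the lines from the report, used as sample input.
SAMPLE = [
    "Jun  7 10:55:40 bryan8 kernel: Invalid argument reading file caps for /bin/date",
    "Jun  7 11:51:29 john97 kernel: Invalid argument reading file caps for /usr/bin/id",
    "Jun  7 11:51:29 john97 kernel: Invalid argument reading file caps for /apps/lmod/lmod/lmod/libexec/addto",
]

if __name__ == "__main__":
    hosts, paths = tally_caps_failures(SAMPLE)
    for host, count in hosts.most_common():
        print(host, count)
```

A path that shows up on every client (like g++ below) points at a persistent failure, while paths that appear once on one node look like the transient case.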

      all our servers are now centos 7.4 and 2.10.4 + LU-10988 lfsck patch, zfs 0.7.9.
      we have 4 lustre filesystems in the cluster and this 'fail caps' issue happens on them all. more on the root filesystem because there are more exe's there.

      for some files it seems to happen on all clients and be persistent eg. all the 2.10.4 client nodes see this

      [root@john72 ~]# g++
      -bash: /usr/bin/g++: Invalid argument
      [root@john72 ~]# dmesg | tail -1
      [616489.562465] Invalid argument reading file caps for /usr/bin/g++
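
On Linux, file capabilities live in the `security.capability` extended attribute, so one way to poke at a suspect file without exec'ing it is to read that xattr from userspace and look at the errno. This is a rough sketch only: `probe_file_caps` is a made-up helper name, and it is an assumption that a userspace `getxattr` on an affected client fails the same way the kernel's exec path does.

```python
import errno
import os

# Extended attribute in which Linux stores file capabilities.
CAPS_XATTR = "security.capability"


def probe_file_caps(path):
    """Try to read the file-capabilities xattr and classify the outcome.

    Returns (status, size) where status is one of:
      "ok"           the xattr was read; size is its length in bytes
      "no-caps"      the file simply has no capabilities set (ENODATA)
      "einval"       the read failed with EINVAL, as in the kernel message
      "error:<NAME>" any other errno, e.g. "error:ENOENT"
    """
    try:
        value = os.getxattr(path, CAPS_XATTR)
    except OSError as exc:
        if exc.errno == errno.ENODATA:
            return ("no-caps", 0)
        if exc.errno == errno.EINVAL:
            return ("einval", 0)
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        return ("error:" + name, 0)
    return ("ok", len(value))


if __name__ == "__main__":
    for candidate in ("/usr/bin/g++", "/usr/bin/id"):
        print(candidate, probe_file_caps(candidate))
```

For an ordinary binary with no capabilities set, the expected answer is "no-caps"; anything reporting "einval" would be worth comparing against dmesg on the same node.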
      

      and for other files it's transient. eg. the exe's on the nodes listed above all work again now

      [root@john97 ~]# /usr/bin/id
      uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel)
      

      g++ is interesting because it's hard-linked 4 times (to c++, ...), which might be part of why it persists? copying each of c++, g++ etc. to a separate (non-hardlinked) file is a workaround and lets it be exec'd again, but that doesn't explain all the other files that sometimes work and sometimes don't.
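
The hard-link workaround can be sketched roughly as follows: check the link count, copy the file to a fresh inode, and rename the copy over the original name. The helper names and the `.unlinked-copy` suffix are illustrative assumptions, and `shutil.copy2` carries over mode and timestamps but not ownership, so treat this as a sketch rather than something to run blindly on a root filesystem.

```python
import os
import shutil


def link_count(path):
    """Number of directory entries (hard links) pointing at path's inode."""
    return os.stat(path).st_nlink


def break_hard_link(path):
    """Give `path` its own inode if it currently shares one with other names.

    Returns True if a private copy was made, False if the file was not
    hard-linked in the first place.
    """
    if link_count(path) < 2:
        return False
    tmp = path + ".unlinked-copy"  # hypothetical temporary name
    shutil.copy2(path, tmp)        # new inode: data, mode bits and timestamps
    os.replace(tmp, path)          # rename over the old name, dropping that link
    return True


if __name__ == "__main__":
    target = "/usr/bin/g++"
    print(target, "has", link_count(target), "link(s)")
```

The other names that pointed at the old inode are left untouched, so each of them would need the same treatment to end up as separate files.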

      apart from things like g++, the problem is rare, less than once per client per day.

      as a workaround (so we can get all clients onto the more secure centos7.5) we'd like to run 2.10.3 on centos7.5 for a while, but it doesn't seem to work (looks to mount, but then ls says 'not a directory'). I don't suppose there's a patch or two that'll let 2.10.3 be functional on centos7.5? thanks.

      cheers,
      robin

          Activity

            [LU-11074] Invalid argument reading file caps

            John L. Hammond (jhammond@whamcloud.com) merged in patch https://review.whamcloud.com/32901/
            Subject: LU-11074 mdc: set correct body eadatasize for getxattr()
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: f99f9345e46b5b19a8dca2aae4d348c99d8e2481


            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32901
            Subject: LU-11074 mdc: set correct body eadatasize for getxattr()
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: c8bf7d0fb95618a06a493228707cd1e830da78f8

            scadmin SC Admin added a comment -

            just to follow up, this and LU-11107 have fixed the issue for us.
            thanks!

            cheers,
            robin

            pjones Peter Jones added a comment -

            Landed for 2.12


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32739/
            Subject: LU-11074 mdc: set correct body eadatasize for getxattr()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: dea1cde92014545d97406bf8adba20840abdb1a9

            jhammond John Hammond added a comment -

            Yes, I believe that LU-11107 is the real issue. https://review.whamcloud.com/32739 should just reduce your chances of hitting it.

            scadmin SC Admin added a comment -

            Hi John,

            after booting a few nodes into this, I'm still seeing the occasional 'file caps' failure so yeah, you're right - there's more bugs in this area somewhere.

            cheers,
            robin

            scadmin SC Admin added a comment -

            Hi John,

            yeah, that seems to work for g++ with 862.3.3 kernel. thanks. nicely done

            I'll roll it out onto a few nodes and keep an eye on them and see if it's also fixed the sporadic 'file caps' failures we were seeing.

            cheers,
            robin

            jhammond John Hammond added a comment -

            Hi Robin,

            OK, thank you for your reproducer. It's reproducing the issue for me as well. There appear to be a few bugs here. I have a fix for one of them at https://review.whamcloud.com/32739. I believe this change will give you a workaround for the file caps issue. I am testing it locally now as well as looking at fixes for the other bugs.

            scadmin SC Admin added a comment -

            Hi,

            in case it wasn't clear, there's no overlayfs involved in the above reproducer at all - only Lustre. the node was booted into a server ramdisk image to do the testing.

            the reproducer is super-simple, but please let me know if you want me to gather debug logs from eg. 7.4 kernel + 2.10.4 and 7.5 kernel + 2.10.4 anyway. not hard for me to do.

            cheers,
            robin


            People

              jhammond John Hammond
              scadmin SC Admin