  1. Lustre
  2. LU-11074

Invalid argument reading file caps

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.10.4
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.5
    • Labels:
      None
    • Environment:
      centos 7.5, x86_64, OPA, zfs 0.7.9
    • Severity:
      3

      Description

      2.10.4 client seems to have introduced a regression from 2.10.3.

      we now see this message from clients

      Jun  7 06:33:32 john73 kernel: Invalid argument reading file caps for /home/fstars/dwf_prepipe/dwf_prepipe_processccd.py
      Jun  7 10:55:40 bryan8 kernel: Invalid argument reading file caps for /bin/date
      Jun  7 11:05:29 john75 kernel: Invalid argument reading file caps for /usr/bin/basename
      Jun  7 11:51:29 john97 kernel: Invalid argument reading file caps for /usr/bin/id
      Jun  7 11:51:29 john97 kernel: Invalid argument reading file caps for /apps/lmod/lmod/lmod/libexec/addto
      

      the upshot of which is that those files then can't be exec'd by the kernel.

      all our servers are now centos 7.4 and 2.10.4 + the LU-10988 lfsck patch, zfs 0.7.9.
      we have 4 lustre filesystems in the cluster and this 'fail caps' issue happens on them all. more on the root filesystem because there are more exe's there.

      for some files it seems to happen on all clients and be persistent eg. all the 2.10.4 client nodes see this

      [root@john72 ~]# g++
      -bash: /usr/bin/g++: Invalid argument
      [root@john72 ~]# dmesg | tail -1
      [616489.562465] Invalid argument reading file caps for /usr/bin/g++
      

      and for other files it's transient. eg. the exe's on the nodes listed above all work again now

      [root@john97 ~]# /usr/bin/id
      uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel)
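
      One way to narrow this down from an affected client is to look at the xattr the kernel reads at exec time. The sketch below is an assumption-laden diagnostic, not part of the original report: it assumes the EINVAL comes from the kernel's size/revision check on the `security.capability` xattr, and the helper names `classify_caps` and `check_path` are hypothetical.

```python
import errno
import os
import struct

CAPS_XATTR = "security.capability"
VFS_CAP_REVISION_MASK = 0xFF000000

# Revision -> exact xattr size in bytes the kernel expects for it.
# Revision 3 is only understood by newer kernels; an older kernel
# would reject it as unknown.
EXPECTED_SIZE = {
    0x01000000: 12,  # v1: 4-byte header + one (permitted, inheritable) pair
    0x02000000: 20,  # v2: 4-byte header + two pairs
    0x03000000: 24,  # v3: v2 layout plus a 32-bit root uid
}


def classify_caps(raw: bytes) -> str:
    """Classify a raw security.capability value as 'ok' or 'invalid: ...'."""
    if len(raw) < 4:
        return "invalid: shorter than the 4-byte header"
    (magic_etc,) = struct.unpack_from("<I", raw)  # little-endian u32
    revision = magic_etc & VFS_CAP_REVISION_MASK
    expected = EXPECTED_SIZE.get(revision)
    if expected is None:
        return f"invalid: unknown revision 0x{revision:08x}"
    if len(raw) != expected:
        return f"invalid: expected {expected} bytes, got {len(raw)}"
    return "ok"


def check_path(path: str) -> str:
    """Read the caps xattr of `path` (Linux only) and classify it."""
    try:
        raw = os.getxattr(path, CAPS_XATTR)
    except OSError as exc:
        if exc.errno == errno.ENODATA:
            return "no capabilities set"  # the normal case for most binaries
        return f"getxattr failed: {errno.errorcode.get(exc.errno, exc.errno)}"
    return classify_caps(raw)
```

      Running `check_path("/usr/bin/g++")` on a 2.10.3 client and a 2.10.4 client and comparing the results would show whether the two clients return different xattr data for the same file. For a binary with no file capabilities the expected answer is "no capabilities set".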
      

      g++ is interesting because it's hard-linked 4 times (to c++, ...), which might be part of why it persists? copying each of c++, g++, etc. to a separate (non-hardlinked) file is a workaround and lets it be exec'd again, but that doesn't explain all the other files that sometimes work and sometimes don't.
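
      The copy workaround for hard-linked executables could be scripted roughly as follows. This is a minimal sketch only: `break_hard_link` is a hypothetical helper, it copies file data and permission bits but not ownership or xattrs, and it should be tried on a test file before being pointed at anything under /usr/bin.

```python
import os
import shutil
import tempfile


def break_hard_link(path: str) -> bool:
    """Replace `path` with a private copy if it has more than one hard link.

    Returns True if the file was replaced, False if it was already
    a single-link file and was left alone.
    """
    if os.stat(path).st_nlink < 2:
        return False
    directory = os.path.dirname(os.path.abspath(path))
    # Create the copy in the same directory so the final rename
    # stays within one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        shutil.copyfile(path, tmp)  # file contents only
        shutil.copymode(path, tmp)  # permission bits (keeps the exec bit)
        os.replace(tmp, path)       # swap the new inode in under the old name
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return True
```

      After `break_hard_link("/usr/bin/g++")`, g++ has its own inode and the remaining names (c++ and so on) still share the old one, so each name would need the same treatment to match the manual workaround described above.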

      apart from things like g++, the problem is rare, less than once per client per day.

      as a workaround (so we can get all clients onto the more secure centos7.5) we'd like to run 2.10.3 on centos7.5 for a while, but it doesn't seem to work (looks to mount, but then ls says 'not a directory'). I don't suppose there's a patch or two that'll let 2.10.3 be functional on centos7.5? thanks.

      cheers,
      robin

              People

              • Assignee: jhammond John Hammond (Inactive)
              • Reporter: scadmin SC Admin
              • Votes: 0
              • Watchers: 6
