Lustre / LU-18101

sanityn test_25a: FAIL: checkstat /mnt/lustre2/d25a.sanityn/f1 #2

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.16.0
    • Environment: Ubuntu 24.04 client
      SLES 15 SP6 client
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/dbb37b62-af78-469d-b324-8b599665b4e7

      test_25a failed with the following error:

      == sanityn test 25a: change ACL on one mountpoint be seen on another ============================================================= 15:55:46 (1722527746)
      running as uid/gid/euid/egid 500/500/500/500, groups:
       [checkstat] [-v] [/mnt/lustre2/d25a.sanityn/f1]
      running as uid/gid/euid/egid 500/500/500/500, groups:
       [checkstat] [-v] [/mnt/lustre2/d25a.sanityn/f1]
       sanityn test_25a: @@@@@@ FAIL: checkstat /mnt/lustre2/d25a.sanityn/f1 #2 
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master/4559 - 6.8.0-35-generic
      servers: https://build.whamcloud.com/job/lustre-master/4559 - 5.14.0-427.24.1_lustre.el9.x86_64

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanityn test_25a - checkstat /mnt/lustre2/d25a.sanityn/f1 #2

          Activity

            pjones Peter Jones added a comment -

            Merged for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56552/
            Subject: LU-18101 sec: fix ACL handling on recent kernels again
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 13fd5ebef3a7a1ae3574458674e16ca782b181e7

            gerrit Gerrit Updater added a comment -

            "Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56552
            Subject: LU-18101 sec: fix again ACL handling on recent kernels
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b36c67b32d1ebbf4b85810d080556ab5c799202e

            adilger Andreas Dilger added a comment -

            Note that sanityn test_25b is also failing with the same problem, but it is normally skipped during Ubuntu 24.04 testing because the test configuration has only one MDT.
            adilger Andreas Dilger added a comment - edited

            It looks like this problem has only been hit on Ubuntu 24.04 and SLES 15 SP6, that is, with newer kernels. It appears that Lustre is correctly invalidating the ACL xattr in the client's xattr cache, but the ACL itself is also cached somewhere else in the inode/VFS, and that copy is not being refreshed correctly.
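
The hypothesis in the comment above (the xattr cache is invalidated, but a second cached copy of the ACL is not) can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyClientInode`, `on_perm_lock_cancel`, and `acl_seen_after_change` are hypothetical names invented for illustration and are not Lustre or kernel APIs, and the ACL strings are placeholders in getfacl-like notation.

```python
# Toy model only: every name here is hypothetical, not a Lustre or kernel API.
class ToyClientInode:
    """Client-side view of one file, with two independent caches."""

    def __init__(self, server):
        self.server = server      # dict standing in for the MDS copy of the ACL
        self.xattr_cache = None   # cached raw ACL xattr
        self.acl_cache = None     # separately cached, parsed ACL

    def get_acl(self):
        # The parsed-ACL cache is consulted first; the xattr cache only on a miss.
        if self.acl_cache is None:
            if self.xattr_cache is None:
                self.xattr_cache = self.server["acl"]
            self.acl_cache = self.xattr_cache
        return self.acl_cache

    def on_perm_lock_cancel(self, drop_acl_cache):
        # The xattr cache is always dropped; dropping the parsed ACL is
        # optional here so that both behaviours can be compared.
        self.xattr_cache = None
        if drop_acl_cache:
            self.acl_cache = None


def acl_seen_after_change(drop_acl_cache):
    server = {"acl": "user:500:rwx"}
    client2 = ToyClientInode(server)
    client2.get_acl()                    # second mountpoint caches the old ACL
    server["acl"] = "user:500:---"       # ACL changed via the first mountpoint
    client2.on_perm_lock_cancel(drop_acl_cache)
    return client2.get_acl()


if __name__ == "__main__":
    print("xattr cache only:", acl_seen_after_change(drop_acl_cache=False))
    print("both caches:     ", acl_seen_after_change(drop_acl_cache=True))
```

In this model, dropping only the xattr cache leaves the second mountpoint returning the old ACL, which is the kind of staleness a checkstat failure on /mnt/lustre2 would be consistent with. Whether the real client behaves this way is exactly what the comment leaves open.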

            adilger Andreas Dilger added a comment -

            The test changes ACLs on file $DIR2/$tdir/f1 with multiple values, but does not clear the cache between changes.

            Sebastien, which cache are you referring to? The xattr cache, or is there a separate level of ACL cache?

            The xattr cache for security xattrs should be cleared when any security xattr/ACL/permission is changed on the MDS, otherwise a client can cache incorrect file access permissions for a long time (granting or denying access incorrectly).

            If this is not happening automatically by the MDS acquiring the MDS_INODELOCK_PERM DLM lock bit during the xattr/ACL update, and this triggering client DLM lock (and related ACL/xattr) cancellation and refresh, then I would consider this a bug in the code, and not in the test. Covering this failure up by having the test cancel the DLM locks, or otherwise flush the xattrs from the client cache, would just hide the problem until it is eventually hit by a customer.
            pjones Peter Jones added a comment -

            This seems to be a low-frequency failure, so it can be deferred to a future release.

            sebastien Sebastien Buisson added a comment -

            I think the problem might be in the test itself. The test changes ACLs on file $DIR2/$tdir/f1 with multiple values, but does not clear the cache between changes.
            cfaber Colin Faber added a comment -

            Hi sebastien can you triage this one as well?

            People

              Assignee: sebastien Sebastien Buisson
              Reporter: maloo Maloo