
[LU-6528] permission denied over NFS (intermittent)

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.5.3
    • Labels: None
    • Severity: 3

    Description

      (This might be related to LU-5730 but as the symptoms are different, I'm opening a new ticket.)

      Our users are

      Attachments

        1. dk-dump-client.bz2
          191 kB
        2. dk-dump-client-2015-05-07.bz2
          1.40 MB
        3. dk-dump-client-2015-05-11.bz2
          1.46 MB
        4. dk-dump-mds.bz2
          2.44 MB
        5. dk-dump-mds-2015-05-07.bz2
          1.27 MB
        6. dk-dump-mds-2015-05-11.bz2
          1.62 MB
        7. LU-6528_debug.diff
          0.4 kB
        8. LU-6528.diff
          4 kB
        9. LU-6528-reproducer1.sh
          0.6 kB
        10. messages-client.bz2
          964 kB

        Issue Links

          Activity

            [LU-6528] permission denied over NFS (intermittent)

            ferner Frederik Ferner (Inactive) added a comment -

            Yes, 37058 is the owner of top_hT52pM and is the user id that I'm running the tests as:

            [bnh65367@ws104 frederik-test]$ ls -lnd top_hT52pM
            drwx--S---+ 8597 37058 111733 266240 May 11 12:01 top_hT52pM
            [bnh65367@ws104 frederik-test]$ id
            uid=37058(bnh65367) gid=37058(bnh65367) groups=37058(bnh65367),500(dcs),37900(dls_staff),37907(dls_sysadmin),37910(dls-softinst),37911(dls-i15),109216(nt11000_2) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
            [bnh65367@ws104 frederik-test]$ 
            

            uid "1200328" is a random user where I don't recognise the name. Re-running the tests again shows the same mismatch but a different fsuid.

            NFS root_squash is enabled; the root user (uid=0) is mapped to nfsnobody (uid=65534).

            Could you confirm which way round it is: is fsuid the user (apparently) trying to access the file system, or the apparent owner of the file?

            laisiyao Lai Siyao added a comment -
            LustreError: 2723:0:(mdd_permission.c:310:__mdd_permission_internal()) permission denied, mode 45c0, fsuid 1200328, uid 37058
            

            shows the uid doesn't match, which is why permission is denied.

            Could you help verify that '37058' is the owner of "top_hT52pM"? It should also be the current test user. Besides, could you find out which user has uid "1200328"?

            BTW, what's the status of NFS root_squash? And if it's enabled, what uid does root_squash map to?
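
            (For context, NFS root squashing is configured per export on the NFS server. A hypothetical /etc/exports entry matching the behaviour described above is shown below; the path and option set are illustrative only, not taken from this ticket.)

            cat /etc/exports
            /dls/p45/data   *(rw,sync,root_squash,anonuid=65534,anongid=65534)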


            ferner Frederik Ferner (Inactive) added a comment -

            Latest dumps, collected with these debug flags on both MDS and client/NFS server:

            trace super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck
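
            (For reference, a minimal sketch of how such a dump can be collected with lctl; the file names below are illustrative, not the exact commands used here:)

            lctl set_param debug="trace super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck"
            lctl clear                   # empty the kernel debug buffer before the test
            # ... reproduce the failure ...
            lctl dk /tmp/dk-dump-client  # dump the kernel debug buffer to a file
            bzip2 /tmp/dk-dump-client    # compress before attaching to the ticket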


            ferner Frederik Ferner (Inactive) added a comment - edited

            After applying the patch we failed to reproduce the problem until today (possibly related to restarting the MDS?).

            I have just reproduced it again, though /var/log/messages on the MDS server doesn't quite contain what I was expecting:

            May 11 11:23:59 cs04r-sc-mds02-03 kernel: format at ldlm_pool.c:628:ldlm_pool_recalc doesn't end in newline
            May 11 11:43:42 cs04r-sc-mds02-03 kernel: format at ldlm_pool.c:628:ldlm_pool_recalc doesn't end in newline
            May 11 11:45:49 cs04r-sc-mds02-03 kernel: format at ldlm_pool.c:628:ldlm_pool_recalc doesn't end in newline
            May 11 11:51:10 cs04r-sc-mds02-03 kernel: Lustre: MGS: non-config logname received: params
            May 11 11:51:23 cs04r-sc-mds02-03 kernel: Lustre: MGS: non-config logname received: params
            May 11 11:54:11 cs04r-sc-mds02-03 kernel: Lustre: MGS: non-config logname received: params
            May 11 11:56:43 cs04r-sc-mds02-03 kernel: Lustre: MGS: non-config logname received: params
            May 11 12:01:35 cs04r-sc-mds02-03 kernel: format at mdd_permission.c:310:__mdd_permission_internal doesn't end in newline
            

            Ah, dmesg contained this:

            LustreError: 2723:0:(mdd_permission.c:310:__mdd_permission_internal()) permission denied, mode 45c0, fsuid 1200328, uid 37058
            

            I'll upload the dumps as well, but in case it matters, here is the output:

            Traceback (most recent call last):
              File "/home/bnh65367/bin/LU-6528-reproducer1.sh", line 16, in <module>
                fd, pth = tempfile.mkstemp(prefix='script', dir=sub_dir)
              File "/usr/lib64/python2.6/tempfile.py", line 293, in mkstemp
                return _mkstemp_inner(dir, prefix, suffix, flags)
              File "/usr/lib64/python2.6/tempfile.py", line 228, in _mkstemp_inner
                fd = _os.open(file, flags, 0600)
            OSError: [Errno 13] Permission denied: '/dls/p45/data/2015/cm12185-2/tmp/frederik-test/top_hT52pM/sub_GYZ7QL/script1IzmRI'
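
            (Judging by the tempfile calls in the traceback, LU-6528-reproducer1.sh appears to be a small Python script that keeps creating sub-directories and temp files under a top-level test directory until access is denied. A rough shell analogue of the same access pattern, with the mount point below only a placeholder, would be:)

            nfs_dir=/path/to/nfs-mounted/frederik-test                 # placeholder for the NFS-mounted test directory
            top_dir=$(mktemp -d -p "$nfs_dir" top_XXXXXX)              # ~ tempfile.mkdtemp(prefix='top_')
            while sub_dir=$(mktemp -d -p "$top_dir" sub_XXXXXX); do    # ~ tempfile.mkdtemp(prefix='sub_')
                mktemp -p "$sub_dir" scriptXXXXXX >/dev/null || break  # ~ tempfile.mkstemp(prefix='script')
            done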
            

            And this seems to be the relevant bit of the syslog entries on the NFS server:

            May 11 12:01:34 p45-storage kernel: nfsd: MKDIR(3)    40: 97010001 000038a5 0000bb99 00000002 00000674 00000000 sub_GYZ7QL
            May 11 12:01:34 p45-storage kernel: nfsd: fh_verify(40: 97010001 000038a5 0000bb99 00000002 00000674 00000000)
            May 11 12:01:34 p45-storage kernel: nfsd: fh_compose(exp 32c:b8e6/4555387 top_hT52pM/sub_GYZ7QL, ino=0)
            May 11 12:01:34 p45-storage kernel: nfsd_dispatch: vers 3 proc 2
            May 11 12:01:34 p45-storage kernel: nfsd: fh_verify(40: 97010001 000038a5 0000bb99 00000002 00004999 00000000)
            May 11 12:01:35 p45-storage kernel: nfsd_dispatch: vers 3 proc 4
            May 11 12:01:35 p45-storage kernel: nfsd: ACCESS(3)   40: 97010001 000038a5 0000bb99 00000002 00004999 00000000 0x1f
            May 11 12:01:35 p45-storage kernel: nfsd: fh_verify(40: 97010001 000038a5 0000bb99 00000002 00004999 00000000)
            

            As always, if you need more information, please ask.

            laisiyao Lai Siyao added a comment -

            The debug log lacks the needed information. I've just uploaded a debug patch, LU-6528_debug.diff; could you reproduce it again?

            laisiyao Lai Siyao added a comment -

            This is a debug patch which will print mode and id information when permission is denied; could you apply it on the MDS and reproduce? BTW, it looks like only "trace" is enabled in the debug log; could you run `lctl set_param debug="super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck trace"` while reproducing?


            ferner Frederik Ferner (Inactive) added a comment -

            could you verify that the test user has EXEC access right to all directories in "/dls/p45/data/2015/cm12185-2/tmp/frederik-test/top_D19Rnh"?

            Yes, I have checked and this is the case.

            Thanks,
            Frederik

            laisiyao Lai Siyao added a comment -

            The log shows the Lustre client's .lookup_nd failed with -13, and this quite possibly happens inside the kernel function reconnect_path() when it finds a disconnected dentry:
            1. ->get_parent() found the parent.
            2. ->get_name() found the child dentry name.
            3. ->lookup_one_len() on the parent for this name failed with -EACCES; the MDS log shows the caller doesn't have EXEC right on the parent.

            I'm not clear on the cause yet, but first could you verify that the test user has EXEC access right to all directories in "/dls/p45/data/2015/cm12185-2/tmp/frederik-test/top_D19Rnh"?
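
            (One quick way to check that, run as the test user on a client, is to walk every component of the path and test search permission on each:)

            p=/dls/p45/data/2015/cm12185-2/tmp/frederik-test/top_D19Rnh
            while [ "$p" != "/" ]; do
                [ -x "$p" ] || echo "no EXEC (search) permission on $p"
                p=$(dirname "$p")
            done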


            ferner Frederik Ferner (Inactive) added a comment -

            Also, we are receiving an increasing number of reports from users who try to access files and directories written days ago and get intermittent permission denied errors (for example when running ls). This is obviously much harder to debug as we are not (yet) able to reproduce it at will...


            ferner Frederik Ferner (Inactive) added a comment -

            This is on our test file system, which I believe should be fairly quiet most of the time, so I don't think there are other clients creating files, but I'm not absolutely sure.

            I've noticed (only now) that on the Lustre client lnet/debug_mb was quite small (only 8) during the previous test; I'm not sure whether this meant that some of the interesting information had been overwritten, so I've increased it (to 32) and repeated the test. I've verified that the string "setattr" now exists in the client dump file as well.
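
            (For reference, the debug buffer size can be inspected and raised with lctl; the value is in MB:)

            lctl get_param debug_mb      # current size of the kernel debug buffer (MB)
            lctl set_param debug_mb=32   # enlarge it so entries are not overwritten as quickly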

            I'll upload the new dump files and /var/log/messages from the Lustre client/NFS server after reproducing it again. The error was:

            Traceback (most recent call last):
              File "/home/bnh65367/bin/LU-6528-reproducer1.sh", line 15, in <module>
                sub_dir = tempfile.mkdtemp(prefix='sub_', dir=top_dir)
              File "/usr/lib64/python2.6/tempfile.py", line 318, in mkdtemp
                _os.mkdir(file, 0700)
            OSError: [Errno 13] Permission denied: '/dls/p45/data/2015/cm12185-2/tmp/frederik-test/top_D19Rnh/sub_4GFmt6'
            
            laisiyao Lai Siyao added a comment -

            I don't see any failure in either log, but I do see create/setattr operations in the MDS log that are not found in the client log. In fact, the client log shows it is idle (only pings are sent to the MDS).

            Are there other Lustre clients creating files simultaneously? And when does the client log start? If it does contain all the logs of this creation, I'm afraid _os.mkdir() failed before it was really done on the Lustre client, that is, the NFS server returned -EACCES before calling into Lustre code. If this is true, could you `echo 32767 > /proc/sys/sunrpc/nfsd_debug`, try to reproduce, and then upload the messages file which contains the NFSD logs?
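
            (A minimal sketch of turning NFSD debugging on and off around one reproduction; rpcdebug is the userspace front end to the same sysctl:)

            echo 32767 > /proc/sys/sunrpc/nfsd_debug   # enable all NFSD debug flags
            # ... reproduce the failure; NFSD traces land in the kernel log / /var/log/messages ...
            echo 0 > /proc/sys/sunrpc/nfsd_debug       # switch NFSD debugging off again
            # equivalent: rpcdebug -m nfsd -s all   /   rpcdebug -m nfsd -c all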


            People

              Assignee: laisiyao Lai Siyao
              Reporter: ferner Frederik Ferner (Inactive)
              Votes: 0
              Watchers: 9
