Lustre / LU-6528

permission denied over NFS (intermittent)

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.5.3
    • Labels: None

    Description

      (This might be related to LU-5730 but as the symptoms are different, I'm opening a new ticket.)

      Our users are

      Attachments

        1. dk-dump-client.bz2
          191 kB
        2. dk-dump-client-2015-05-07.bz2
          1.40 MB
        3. dk-dump-client-2015-05-11.bz2
          1.46 MB
        4. dk-dump-mds.bz2
          2.44 MB
        5. dk-dump-mds-2015-05-07.bz2
          1.27 MB
        6. dk-dump-mds-2015-05-11.bz2
          1.62 MB
        7. LU-6528_debug.diff
          0.4 kB
        8. LU-6528.diff
          4 kB
        9. LU-6528-reproducer1.sh
          0.6 kB
        10. messages-client.bz2
          964 kB

        Issue Links

          Activity

            [LU-6528] permission denied over NFS (intermittent)

            Also, we are receiving an increasing number of reports from users who are trying to access files and directories written days ago and are receiving intermittent permission denied (for example when running ls). This is obviously much harder to debug as we are not (yet) able to reproduce at will...

            ferner Frederik Ferner (Inactive) added a comment

            This is on our test file system; I believe it should be fairly quiet at most times, so I don't think there are other clients creating files, but I'm not absolutely sure.

            I've noticed (now) that on the Lustre client the lnet/debug_mb setting was quite small (only 8) during the previous test. I'm not sure whether this meant that some of the interesting information had been overwritten, so I've increased it (to 32) and repeated the test. I've verified that the string "setattr" now exists in the client dump file as well.

            I'll upload the new dump files and /var/log/messages from the Lustre client/NFS server after reproducing it again. The error was:

            Traceback (most recent call last):
              File "/home/bnh65367/bin/LU-6528-reproducer1.sh", line 15, in <module>
                sub_dir = tempfile.mkdtemp(prefix='sub_', dir=top_dir)
              File "/usr/lib64/python2.6/tempfile.py", line 318, in mkdtemp
                _os.mkdir(file, 0700)
            OSError: [Errno 13] Permission denied: '/dls/p45/data/2015/cm12185-2/tmp/frederik-test/top_D19Rnh/sub_4GFmt6'
            
            ferner Frederik Ferner (Inactive) added a comment
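            [Editor's note] The buffer increase and log capture described above can be sketched with commands along these lines (a sketch only; the 32 MB value is from the comment, the dump filename is illustrative):

```shell
# Increase the Lustre kernel debug buffer from 8 MB to 32 MB so that
# older entries are less likely to be overwritten before the dump.
lctl set_param debug_mb=32

# ... reproduce the permission-denied error, then immediately dump
# the kernel debug log to a file (dk = debug_kernel):
lctl dk /tmp/dk-dump-client

# Confirm the interesting records made it into the dump:
grep -c setattr /tmp/dk-dump-client
```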
            laisiyao Lai Siyao added a comment -

            I don't see any failure in either log, but I do see create/setattr entries in the MDS logs that are not found in the client logs. In fact, the client log shows the client is idle (only pings are sent to the MDS).

            Are there other Lustre clients creating files simultaneously? And when does the client log start? If it does contain all the logs of this creation, I'm afraid _os.mkdir() failed before the request ever reached the Lustre client, that is, the NFS server returned -EACCES before calling into the Lustre code. If this is true, could you `echo 32767 > /proc/sys/sunrpc/nfsd_debug`, try to reproduce, and then upload the messages file containing the NFSD logs?

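            [Editor's note] The capture Lai suggests could look like the following on the NFS server (a sketch; the debug value is from the comment above, the output filename is illustrative):

```shell
# Enable verbose NFSD debug logging (32767 sets all debug flags);
# output goes to the kernel log, typically /var/log/messages.
echo 32767 > /proc/sys/sunrpc/nfsd_debug

# ... re-run the reproducer until mkdir fails with EACCES ...

# Disable debugging again and compress the log for upload.
echo 0 > /proc/sys/sunrpc/nfsd_debug
bzip2 -c /var/log/messages > /tmp/messages-client.bz2
```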

            I have just managed to reproduce it, triggering 'lctl dk' on both the Lustre client (aka NFS server) and the MDS server directly after the permission denied message. Both files are attached, so I'm hoping there might be some information in the dumps.

            This was the error:

            Traceback (most recent call last):
              File "/home/bnh65367/bin/LU-6528-reproducer1.sh", line 15, in <module>
                sub_dir = tempfile.mkdtemp(prefix='sub_', dir=top_dir)
              File "/usr/lib64/python2.6/tempfile.py", line 318, in mkdtemp
                _os.mkdir(file, 0700)
            OSError: [Errno 13] Permission denied: '/dls/p45/data/2015/cm12185-2/tmp/frederik-test/top_39JlTW/sub_ju6qhK'

            ferner Frederik Ferner (Inactive) added a comment

            Lai,

            I'm struggling slightly to gather the logs.

            Initially I was going to use one of our (more powerful) test clients as NFS server, but failed to reproduce the issue with that machine. On the much less powerful machine I used earlier to reproduce the issue on our test system, I failed to trigger it after enabling the debug daemon. I'm currently trying without debug daemon again, hoping to catch it in time to use lctl dk to gather debug logs which contain the interesting information.

            One difference between the NFS servers where I was able to reproduce the issue and the one where I couldn't is that the latter uses o2ib for LNet while the others are Ethernet only. I don't see how this would matter, but thought I'd mention it.

            Cheers,
            Frederik

            ferner Frederik Ferner (Inactive) added a comment

            Malcolm,

            yes, we are using LDAP. SSSD is not installed on our MDS servers (or the NFS servers); it might be installed on the clients, but as you mention the secondary groups upcall, I assume the MDS is what you were after.

            Cheers,
            Frederik

            ferner Frederik Ferner (Inactive) added a comment

            This is an odd question but are you using LDAP?
            If so are you using SSSD?
            If so do you have enumerate enabled?
            We ran into this exact issue when enumerate was not enabled as it breaks the secondary groups upcall.

            malhaak Malcolm Haak (Inactive) added a comment
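            [Editor's note] Enumeration is a per-domain setting in /etc/sssd/sssd.conf. A minimal fragment might look like this (the domain name and provider are examples, not taken from this ticket):

```ini
# /etc/sssd/sssd.conf (fragment; domain name is illustrative)
[domain/example.com]
id_provider = ldap
# enumerate defaults to false, in which case SSSD will not list all
# users/groups; per the comment above this can break the MDS
# secondary-groups upcall.
enumerate = true
```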
            laisiyao Lai Siyao added a comment -

            I tested locally with NFSv3 and NFSv4, but couldn't reproduce. Once you have debug logs, please put them somewhere and I'll look into them.

            pjones Peter Jones added a comment -

            Lai

            Could you please assist with this issue?

            Thanks

            Peter


            Apologies, this ticket got created early; I'll put my information in this comment as I can't edit the original description...

            Our users are reporting intermittent permission denied problems when accessing directories on Lustre over NFS. Trying to change into a directory returns permission denied, and when retrying a few minutes later it is permitted. As far as I can tell nothing modifies the directory in the meantime. There are no problems of this sort when using native Lustre clients.

            For one user this seems to happen when creating directories over NFS with mode 0700, but not (or much less frequently) when using a more permissive mode like 0755. In this case we have a reproducer, which I'll attach. The reproducer creates directories below the current directory and attempts to write one file per directory. It will eventually report permission denied when we run it; depending on the file system we are trying it on, we need to run the reproducer with a different directory count to reproduce the problem.

            This is a sample output for a failing case:

            [bnh65367@ws104 frederik-test]$ ~/bin/LU-6528-reproducer1.sh 10000
            Traceback (most recent call last):
              File "/home/bnh65367/bin/LU-6528-reproducer1.sh", line 15, in <module>
                sub_dir = tempfile.mkdtemp(prefix='sub_', dir=top_dir)
              File "/usr/lib64/python2.6/tempfile.py", line 318, in mkdtemp
                _os.mkdir(file, 0700)
            OSError: [Errno 13] Permission denied: '/dls/p45/data/2015/cm12185-2/tmp/frederik-test/top_erQIXk/sub_2aRBS4'
            [bnh65367@ws104 frederik-test]$ 
            

            Reports from other users are less clear, so there might be other failing cases.

            I will try to run the reproducer with debug enabled on the NFS server/lustre client and the MDS server on our test file system....

            ferner Frederik Ferner (Inactive) added a comment
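            [Editor's note] The attached LU-6528-reproducer1.sh is not shown here; based on the description above, a minimal reproducer sketch might look like the following (names and the default count are illustrative, and the attachment's own traceback shows the real script is Python):

```shell
#!/bin/sh
# Create a scratch top-level directory, then N subdirectories with
# mode 0700, writing one file into each. Per the report above, on a
# Lustre file system exported over NFS this eventually fails with
# "Permission denied".
set -e
count=${1:-10}
top_dir=$(mktemp -d "${TMPDIR:-/tmp}/top_XXXXXX")
i=0
while [ "$i" -lt "$count" ]; do
    sub_dir=$(mktemp -d "$top_dir/sub_XXXXXX")
    chmod 0700 "$sub_dir"
    # A failure here corresponds to the OSError [Errno 13] above.
    echo test > "$sub_dir/file"
    i=$((i + 1))
done
echo "created $count directories under $top_dir"
```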

            People

              Assignee: laisiyao Lai Siyao
              Reporter: ferner Frederik Ferner (Inactive)
              Votes: 0
              Watchers: 9
