
ACL's applied over NFS are not consistent when looping file operations

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.7.0
    • None
    • 3
    • 9223372036854775807

    Description

      We are seeing an issue on one of our three production Lustre file systems: the problem does not appear on a native Lustre client, but it does appear when the file system is exported over NFS. We cannot reproduce it on any other system, but it is causing production problems on our oldest Lustre instance.

      Running the attached script over NFS, we hit the following error after a few iterations:

      [joe59240@vws250 joe59240]$ /dls/tmp/joe59240/stresstest
      ....mkdir: cannot create directory `5': Permission denied

      Each "." is one iteration of the loop, as you will see in the script.

      The failure persists for up to about five seconds, after which files can be written to the folder again and the script runs.

      So far the script has never run to completion on this file system, whereas on our other Lustre file systems we can run it to completion hundreds of times.
      The file system has many different NFS exporters, each exporting a different folder in the root of the file system, as is common practice on all other systems at Diamond. We can reproduce this on all exporters attached to this particular file system.

      After a number of weeks looking at the issue, we think the cause is not the exporter, since the problem occurs across all servers that export Lustre, but rather the interaction between the file system and NFS.

      We have put a few sleeps into the script to check whether there is a buffering issue, where we modify or delete before a flush to disk, but this has not improved the symptoms.

      Could you advise on further debugging?

      Attachments

        1. lustre-dump.tar.gz
          9.33 MB
        2. lustre-logs.tar.gz
          13.58 MB
        3. stresstest
          0.2 kB


          Activity

            [LU-7842] ACL's applied over NFS are not consistent when looping file operations
            laisiyao Lai Siyao added a comment -

            I can't find any clue in the debug logs, so the -13 (EACCES) may be generated in the NFS code (though the root cause may still be in the Lustre code, for example a wrong attribute fetched from the MDS by the NFS server).

            I'll see whether I can make a patch to add some debug messages.


            davebond-diamond Dave Bond (Inactive) added a comment -

            From the time stamp it has been 6 days since the last update, when I uploaded the latest logs. Any chance of an update, even to say you are still looking? We have shared this ticket number with the developers for whom this is causing pain, and I would like to provide them with an update.

            davebond-diamond Dave Bond (Inactive) added a comment -

            Just uploaded a new dump file.

            This was collected by

            sudo lctl debug_daemon start /tmp/lustre-dump-14-05-16

            and stopped after the error had shown up on the NFS client.
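
            For reference, the capture described above could be wrapped roughly as follows. This is only a sketch: the function names and the LCTL variable are hypothetical, and the conversion step with `lctl debug_file` is an assumption about how the binary dump would then be turned into text, not something stated in this comment.

```shell
# Sketch only: helper names and the LCTL override are hypothetical.
LCTL="${LCTL:-lctl}"

# start_capture <binary-dump>: stream the Lustre debug buffer to a file.
start_capture() {
    "$LCTL" debug_daemon start "$1"
}

# stop_capture <binary-dump> <text-log>: stop the daemon, then convert the
# binary dump into plain text (assumed follow-up step).
stop_capture() {
    "$LCTL" debug_daemon stop &&
    "$LCTL" debug_file "$1" "$2"
}
```

            A run would then look like `start_capture /tmp/lustre-dump-14-05-16`, reproducing the error on the NFS client, and `stop_capture /tmp/lustre-dump-14-05-16 /tmp/lustre-dump-14-05-16.txt` (the `.txt` name is illustrative), all with root privileges on the host being traced.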
            laisiyao Lai Siyao added a comment -

            I do see the -13 error code in the NFS logs, but I'm afraid the Lustre debug log was not dumped in time, so the related entries were discarded (Lustre keeps only a limited amount of debug log in memory).

            Could you modify your test script a bit to check errors of each command, and dump logs just upon error?
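
            A minimal sketch of what such a modification might look like is given below. The attached stresstest script is not reproduced in this ticket, so the loop body (mkdir, touch, rm) is a placeholder inferred from the error messages, and `run_checked`, `dump_logs`, `stress_loop` and `DUMP_CMD` are hypothetical names. The default `DUMP_CMD` assumes the script runs on the host that holds the Lustre debug buffer; when it runs on an NFS client, `DUMP_CMD` would have to be pointed at something that triggers the dump on the NFS server instead.

```shell
# Sketch only: placeholder operations, hypothetical names and dump path.
DUMP_CMD="${DUMP_CMD:-lctl dk /tmp/lustre-debug-on-error.log}"

# Dump the Lustre debug buffer. Word splitting of DUMP_CMD is intentional
# so that it can carry arguments.
dump_logs() {
    $DUMP_CMD
}

# run_checked <command...>: run one command; on failure, report it on
# stderr, dump the logs immediately, and return the command's status.
run_checked() {
    "$@"
    _rc=$?
    if [ "$_rc" -ne 0 ]; then
        echo "FAILED (rc=$_rc): $*" >&2
        dump_logs
    fi
    return "$_rc"
}

# stress_loop <iterations>: placeholder loop that stops at the first error.
stress_loop() {
    _i=1
    while [ "$_i" -le "$1" ]; do
        run_checked mkdir "$_i" || return 1
        run_checked touch "$_i/somefile" || return 1
        run_checked rm -r "$_i" || return 1
        printf '.'
        _i=$((_i + 1))
    done
    echo
}
```

            Stopping at the first failure keeps the dump close to the failing operation, which matters because the in-memory debug buffer is limited.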


            davebond-diamond Dave Bond (Inactive) added a comment -

            I have just attached the logs you requested. Let me know if there are any more details I can give you.
            laisiyao Lai Siyao added a comment -

            LU-6528 and LU-7630 are the known permission-denied issues; your test failure looks to be a new one. We need more information to triage: could you collect the Lustre debug log on the NFS server and the MDS server? Can you also collect NFS client and server logs (enabled by `echo 2047 > /proc/sys/sunrpc/nfs_debug` on the NFS client, and `echo 2047 > /proc/sys/sunrpc/nfsd_debug` on the NFS server)?
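
            The two settings above can be wrapped in small helpers so that debugging is switched off again after the test. This is a sketch: the function names and the `SUNRPC_DIR` override are hypothetical (the override exists only so the helpers can be tried against a scratch directory), and writing 0 to clear the mask is the usual way to disable these flags. The resulting messages go to the kernel log.

```shell
# Sketch only: helper names and the SUNRPC_DIR override are hypothetical.
SUNRPC_DIR="${SUNRPC_DIR:-/proc/sys/sunrpc}"

# set_rpc_debug <flag-file> <mask>: write a debug mask to a sunrpc flag.
set_rpc_debug() {
    printf '%s\n' "$2" > "$SUNRPC_DIR/$1"
}

# Run these on the NFS client.
nfs_client_debug_on()  { set_rpc_debug nfs_debug 2047; }
nfs_client_debug_off() { set_rpc_debug nfs_debug 0; }

# Run these on the NFS server.
nfs_server_debug_on()  { set_rpc_debug nfsd_debug 2047; }
nfs_server_debug_off() { set_rpc_debug nfsd_debug 0; }
```

            Both pairs need root privileges on the respective host.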


            davebond-diamond Dave Bond (Inactive) added a comment -

            We are approaching the end of our maintenance period. Would it be possible to get an update on this?

            davebond-diamond Dave Bond (Inactive) added a comment -

            Hello,

            It would appear that the issue got a lot better but never went away. This is with the latest client version we are running:

            lustre: 2.7.2
            kernel: patchless_client
            build: v2_7_1_DLS_20160330-gf4709ff-CHANGED-2.6.32-573.22.1.el6.x86_64

            From an NFS client mounting this area:

            [joe59240@vws250 mx-scratch]$ ~/dls-science-user-area/benchmarking/stresstest
            ......touch: cannot touch `5/somefile': Permission denied
            [joe59240@vws250 mx-scratch]$

            Would you have expected this build to include the fix?
            yujian Jian Yu added a comment -

            Hi Dave,
            The back-ported patch for LU-7630 in http://review.whamcloud.com/18828 is now ready to land.


            davebond-diamond Dave Bond (Inactive) added a comment -

            Hi All,

            Could we possibly have an update on the progress of the patch testing? We would like to get the latest 2.7 release including this patch to test in our production environment ASAP.
            pjones Peter Jones added a comment -

            Dave/Frederik

            The relevant patches have been ported and are going through testing and reviews atm

            Peter


            People

              laisiyao Lai Siyao
              davebond-diamond Dave Bond (Inactive)