
ACL's applied over NFS are not consistent when looping file operations

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 2.7.0
    • None
    • 3

    Description

      We are seeing an issue on one of our three production Lustre file systems. It does not occur on a native Lustre client, but it does occur when the file system is exported over NFS. We cannot reproduce it on any other system, but it is causing production problems on our oldest Lustre instance.

      When we run the attached script over NFS, it fails after a few iterations with the following error:

      [joe59240@vws250 joe59240]$ /dls/tmp/joe59240/stresstest
      ....mkdir: cannot create directory `5': Permission denied

      Each "." is one iteration of the loop, as you will see in the script.

      The error persists for up to about five seconds, after which files can be written to the folder again and the script continues.
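      To put a number on how long the "Permission denied" state lasts, a loop along the following lines could be run against a directory on the NFS export. This is only a sketch in Python and not the attached stresstest script, whose contents are not shown in this ticket; the function name `time_denial_window`, its parameters and the file name `data` are hypothetical, and it performs plain mkdir/write/remove operations without setting any ACLs.

```python
import os
import shutil
import sys
import tempfile
import time


def time_denial_window(base, iterations=20, timeout=30.0, poll=0.1):
    """Create, populate and remove numbered directories under `base`.

    Returns a list of (iteration, seconds) tuples, one for each iteration
    in which mkdir first failed with PermissionError (EACCES or EPERM);
    `seconds` is the time that passed before mkdir succeeded.
    """
    windows = []
    for i in range(iterations):
        target = os.path.join(base, str(i))
        start = time.monotonic()
        denied = False
        while True:
            try:
                os.mkdir(target)
                break
            except PermissionError:
                denied = True
                # Give up if the denial lasts longer than `timeout` seconds.
                if time.monotonic() - start > timeout:
                    raise
                time.sleep(poll)
        if denied:
            windows.append((i, time.monotonic() - start))
        # Write one small file, then remove the directory again.
        with open(os.path.join(target, "data"), "w") as fh:
            fh.write("x")
        shutil.rmtree(target)
    return windows


if __name__ == "__main__":
    # Pass a directory on the NFS export as the first argument; without
    # one, a local temporary directory is used just to exercise the loop.
    base = sys.argv[1] if len(sys.argv) > 1 else tempfile.mkdtemp()
    for iteration, seconds in time_denial_window(base):
        print(f"iteration {iteration}: denied for {seconds:.2f}s")
```

      Recording the iteration and duration of each denial should make it easier to line the failures up with timestamps in the client and MDS debug logs.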

      So far the script has never run to completion on this file system, whereas on our other Lustre file systems we can run it to completion hundreds of times.
      The file system has many different NFS exporters, each exporting a different folder in the root of the file system, as is common practice on all other systems at Diamond. We can reproduce this on all exporters attached to this particular file system.

      After a number of weeks looking at the issue, we think that it is not caused by any individual exporter, as it occurs on all servers that export this Lustre file system, but is down to the interaction between the file system and NFS.

      We have added a few sleeps to the script to check whether there is a buffering issue, where we modify or delete files before they have been flushed to disk, but this has not improved the symptoms.
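      As a possible alternative to sleeps, the test could force an explicit flush before each modify or delete step, which would test the buffering theory more directly. The sketch below is in Python and is an assumption about how that could be done, not part of the attached script; `write_and_flush` is a hypothetical helper. Calling fsync on a directory descriptor works on Linux, but some file systems may reject it.

```python
import os


def write_and_flush(directory, name, payload=b"x"):
    """Write `payload` to directory/name, then fsync the file and its parent.

    Returns the path of the file that was written.
    """
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        # Ask the kernel to push the file data to stable storage
        # (over NFS this means committing it to the server).
        os.fsync(fd)
    finally:
        os.close(fd)
    # Also flush the directory so that the new entry itself is committed.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return path
```

      If the "Permission denied" errors still appear when every write is flushed this way, that would be further evidence against a buffering explanation.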

      Would it be possible to advise further debugging?

      Attachments

        1. lustre-dump.tar.gz
          9.33 MB
          Dave Bond
        2. lustre-logs.tar.gz
          13.58 MB
          Dave Bond
        3. stresstest
          0.2 kB
          Dave Bond

        Issue Links

          Activity

            [LU-7842] ACL's applied over NFS are not consistent when looping file operations
            pjones Peter Jones added a comment -

            ok so either this is no longer happening or you are no longer concerned about it. Either way, I'll close out the ticket

            pjones Peter Jones added a comment -

            Frederik

            Any news?

            Peter


            ferner Frederik Ferner (Inactive) added a comment -

            Peter,

            the patch last week was a bit late for that maintenance window, so we had to wait until this week.

            We applied the patch on the MDS this morning and so far we have not been able to reproduce the issue, though if I remember right, this was sometimes also the case immediately after rebooting the NFS server. And we did have to reboot the NFS server, as it suffered an LBUG after finishing recovery. We're looking into this, and if we can't find anything in Jira, we'll open another ticket for that.

            Thanks,
            Frederik
            pjones Peter Jones added a comment -

            Dave/Frederik

            Have you applied the supplied diagnostic patch?

            Peter

            laisiyao Lai Siyao added a comment -

            the patch for b2_7_fe is on: http://review.whamcloud.com/#/c/20992/

            laisiyao Lai Siyao added a comment -

            okay, I'll do it now.

            pjones Peter Jones added a comment - - edited

            Lai

            While LU-8305 may prevent this change completing testing on master, it should have no relevance to Diamond running on the 2.7 FE branch so could you please port the patch there for them to try?

            Thanks

            Peter

            laisiyao Lai Siyao added a comment -

            The autotest failure looks to be caused by LU-8305. I'll watch the progress of that ticket.


            ferner Frederik Ferner (Inactive) added a comment -

            Lai,

            I've looked at the patch. It has received a '-1' from maloo, but I can't work out whether this is a failure that is also seen elsewhere. I think it might be, but I would like to double check before applying this patch for a test. Since we are unfortunately only seeing this on a production file system, and MDS changes require a full file system outage, we will need to schedule this, and I currently can't promise when it will happen. Hopefully early next week if everything else looks good.

            Thanks,
            Frederik
            laisiyao Lai Siyao added a comment -

            Hi Dave, I just pushed a patch that changes MDS code only. Could you apply it and test again?


            People

              laisiyao Lai Siyao
              davebond-diamond Dave Bond (Inactive)
              Votes: 0
              Watchers: 7
