[LU-1101] Incorrect permission handling when creating existing directories

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.0
    • Labels: None
    • Environment: Lustre 2.1 on clients and servers, Scientific Linux 5
    • Severity: 3
    • Bugzilla ID: 23459
    • Rank: 4020

    Description

      Lustre appears to handle permissions incorrectly in some cases when mkdir is called on a directory that already exists. This makes it hard (or impossible) to use the Torque scheduler directly on top of a Lustre filesystem. This is a copy of Bugzilla bug #23459, which we reported some time ago against the 1.8 branch; the bug appears to still be present in 2.1. All the symptoms described in Bugzilla are identical, and the reproducer code provided by Lukasz Flis still triggers the issue.

          Activity

            adilger Andreas Dilger added a comment -

            Closing as a duplicate of LU-4185.
            lflis Lukasz Flis added a comment -

            Hello,

            2.2.0 clients are not usable for us yet (one unreported LBUG).

            Is there any plan to include a fix for this issue in the upcoming 2.1.2?

            lflis Lukasz Flis added a comment -

            Hi,

            Just to update:

            We have tested, and this does not appear to be a problem with 2.2.0 clients.
            However, 2.1.1 clients with 2.2 servers are still affected by the issue.


            kitwestneat Kit Westneat (Inactive) added a comment -

            To answer the last question in the Bugzilla report, the code that causes this bug was added here as an MDS optimization:
            https://bugzilla.lustre.org/show_bug.cgi?id=18534
            lflis Lukasz Flis added a comment -

            Hi,

            One of our users running the Quantum Espresso application hit the bug today.
            The user had set the outdir variable to her directory on the Lustre filesystem.

            %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
            task # 39
            from parallel_mkdir : error # 1
            /mnt/lustre/scratch/people/xuser/ non existent or non writable
            %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

            %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
            task # 14
            from parallel_mkdir : error # 1
            /mnt/lustre/scratch/people/xuser/ non existent or non writable
            %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
            task # 44
            from parallel_mkdir : error # 1
            /mnt/lustre/scratch/people/xuser/ non existent or non writable
            %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

            The strace dump showed that the mkdir result was:
            wrapper.6736.3477:mkdir("/mnt/lustre/scratch/people/xuser/", 0777) = -1 EACCES (Permission denied)

            After running stat on the directory before invoking the application, the problem disappeared:
            wrapper.23170.3574:mkdir("/mnt/lustre/scratch/people/xuser/", 0777) = -1 EEXIST (File exists)

            Cheers,
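The stat-before-mkdir behaviour described above can be sketched as a small helper. This is only an illustration, not code from Torque or Quantum Espresso: the function name `mkdir_with_stat` is hypothetical, and the thread does not explain why the preceding stat helps, so the comment about refreshing client-side state is an assumption.

```python
import os


def mkdir_with_stat(path, mode=0o777):
    """Stat `path`, then try to create it.

    Returns True if the directory was created and False if the path
    already existed. Any other error (including EACCES) is re-raised.
    """
    try:
        # Assumption: the stat is issued only for its side effect on the
        # client; its result is not used to decide anything.
        os.stat(path)
    except OSError:
        pass  # a missing path is fine, mkdir below will create it

    try:
        os.mkdir(path, mode)
        return True
    except FileExistsError:
        # EEXIST is the expected answer for a directory that is already there.
        return False
```

With this helper, a caller that only needs the directory to exist can ignore the return value, while an unexpected EACCES still surfaces as a PermissionError.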

            m.magrys Marek Magrys added a comment - - edited

            To clarify:
            The problem occurs when Torque (pbs_mom) has the $tmpdir variable in its config (/var/torque/mom_priv/config) set to a directory on a Lustre filesystem (in our case, $tmpdir /mnt/lustre/scratch/jobs). We occasionally get errors like:

            Feb 14 13:56:23 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18555647.batch.grid.cyf-kr.edu.pl
            Feb 14 14:37:35 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18557701.batch.grid.cyf-kr.edu.pl
            Feb 14 14:38:17 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18557716.batch.grid.cyf-kr.edu.pl
            Feb 14 14:50:46 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18559037.batch.grid.cyf-kr.edu.pl
            Feb 14 15:01:44 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18559949.batch.grid.cyf-kr.edu.pl

            An example output of the reproducer:

            [b14flis@n6-4-16 repro]$ ./a.out /mnt/lustre/scratch/jobs/
            Iteration: 1
            Creating directory: /mnt/lustre/scratch/jobs/1804289383
            mkdir(/mnt,mode) errno: 17
            mkdir(/mnt/lustre,mode) errno: 17
            mkdir(/mnt/lustre/scratch,mode) errno: 17
            mkdir(/mnt/lustre/scratch/jobs,mode) errno: 13
            mkdirtree: failed: rc=13
            sleeping for 2 seconds

            Iteration: 2
            doing stat before creating directory
            Creating directory: /mnt/lustre/scratch/jobs/846930886
            mkdir(/mnt,mode) errno: 17
            mkdir(/mnt/lustre,mode) errno: 17
            mkdir(/mnt/lustre/scratch,mode) errno: 17
            mkdir(/mnt/lustre/scratch/jobs,mode) errno: 17
            mkdirtree: successful: rc=0

            ERROR: inconsistency detected: previous rc: 13 vs current rc: 0
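The reproducer source is not included in this ticket. As a rough sketch of what its output suggests, the loop below creates each path component in turn, prints the errno for every failed mkdir, tolerates EEXIST (17) and stops at the first other error such as EACCES (13). The name `mkdirtree` is taken from the output above; everything else, including the use of Python rather than the original compiled program, is an assumption.

```python
import errno
import os


def mkdirtree(path, mode=0o777):
    """Create every component of `path`, similar to `mkdir -p`.

    Returns 0 on success, or the errno of the first mkdir that fails
    with anything other than EEXIST.
    """
    current = "/" if os.path.isabs(path) else ""
    for part in path.split("/"):
        if not part:
            continue  # skip empty pieces from leading, trailing or doubled slashes
        current = os.path.join(current, part)
        try:
            os.mkdir(current, mode)
        except OSError as exc:
            print(f"mkdir({current},mode) errno: {exc.errno}")
            if exc.errno != errno.EEXIST:
                return exc.errno  # e.g. 13 (EACCES) in the failing iteration
    return 0
```

On a filesystem that behaves correctly, calling `mkdirtree` twice on the same path should return 0 both times; the inconsistency reported above is that the result changes from 13 to 0 depending on whether a stat was issued first.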


            People

              Assignee: wc-triage WC Triage
              Reporter: m.magrys Marek Magrys