[LU-1101] Incorrect permission handling when creating existing directories Created: 14/Feb/12  Updated: 06/Nov/13  Resolved: 06/Nov/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Marek Magrys Assignee: WC Triage
Resolution: Duplicate Votes: 1
Labels: None
Environment:

Lustre 2.1 on clients and servers, Scientific Linux 5


Issue Links:
Duplicate
duplicates LU-4185 Incorrect permission handling when cr... Resolved
Severity: 3
Bugzilla ID: 23459
Rank (Obsolete): 4020

 Description   

Lustre seems to handle permissions on mkdir incorrectly in some cases. This issue makes it hard (or impossible) to use the Torque scheduler directly on top of a Lustre filesystem. This is in fact a copy of Bugzilla bug #23459, which we reported some time ago for the 1.8 branch; however, it looks like the bug is still present in 2.1. All the symptoms described in Bugzilla are identical, and the reproducer code provided by Lukasz Flis still triggers the issue.



 Comments   
Comment by Marek Magrys [ 14/Feb/12 ]

To clarify:
The problem occurs when Torque (pbs_mom) has the $tmpdir config variable (in /var/torque/mom_priv/config) set to a path on a Lustre filesystem (in our case, $tmpdir /mnt/lustre/scratch/jobs). We occasionally get errors like:

Feb 14 13:56:23 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18555647.batch.grid.cyf-kr.edu.pl
Feb 14 14:37:35 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18557701.batch.grid.cyf-kr.edu.pl
Feb 14 14:38:17 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18557716.batch.grid.cyf-kr.edu.pl
Feb 14 14:50:46 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18559037.batch.grid.cyf-kr.edu.pl
Feb 14 15:01:44 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18559949.batch.grid.cyf-kr.edu.pl

An example output of the reproducer:

[b14flis@n6-4-16 repro]$ ./a.out /mnt/lustre/scratch/jobs/
Iteration: 1
Creating directory: /mnt/lustre/scratch/jobs/1804289383
mkdir(/mnt,mode) errno: 17
mkdir(/mnt/lustre,mode) errno: 17
mkdir(/mnt/lustre/scratch,mode) errno: 17
mkdir(/mnt/lustre/scratch/jobs,mode) errno: 13
mkdirtree: failed: rc=13
sleeping for 2 seconds

Iteration: 2
doing stat before creating directory
Creating directory: /mnt/lustre/scratch/jobs/846930886
mkdir(/mnt,mode) errno: 17
mkdir(/mnt/lustre,mode) errno: 17
mkdir(/mnt/lustre/scratch,mode) errno: 17
mkdir(/mnt/lustre/scratch/jobs,mode) errno: 17
mkdirtree: successful: rc=0

ERROR: inconsistency detected: previous rc: 13 vs current rc: 0
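
The original reproducer is attached to the Bugzilla report and is not included in this ticket. The following is only a minimal sketch of what such a reproducer might look like, reconstructed from the output above. The name mkdirtree is taken from that output; run_iteration, the 0700 mode, and the 4096-byte path buffers are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Sketch only: create every component of `path`, similar to `mkdir -p`.
 * EEXIST is tolerated; any other errno stops the walk and is returned. */
static int mkdirtree(const char *path, mode_t mode)
{
    char buf[4096];                      /* assumed upper bound on path length */
    size_t len, i;

    assert(path != NULL);
    len = strlen(path);
    if (len == 0)
        return ENOENT;
    if (len >= sizeof(buf))
        return ENAMETOOLONG;
    memcpy(buf, path, len + 1);

    for (i = 1; i <= len; i++) {
        char saved;
        int rc;

        if (buf[i] != '/' && buf[i] != '\0')
            continue;
        if (buf[i - 1] == '/')
            continue;                    /* skip "//" and a trailing '/' */

        saved = buf[i];
        buf[i] = '\0';                   /* temporarily cut the path here */
        rc = (mkdir(buf, mode) == 0) ? 0 : errno;
        if (rc != 0)
            printf("mkdir(%s,mode) errno: %d\n", buf, rc);
        buf[i] = saved;

        if (rc != 0 && rc != EEXIST)
            return rc;                   /* e.g. 13 (EACCES) in the failing case */
    }
    return 0;
}

/* Hypothetical driver for one iteration: optionally stat() the parent
 * first, then create parent/name. Returns 0 or an errno value. */
static int run_iteration(const char *parent, const char *name, int stat_first)
{
    char target[4096];
    struct stat st;
    int n, rc;

    if (stat_first) {
        puts("doing stat before creating directory");
        (void)stat(parent, &st);         /* result is deliberately ignored */
    }
    n = snprintf(target, sizeof(target), "%s/%s", parent, name);
    if (n < 0 || (size_t)n >= sizeof(target))
        return ENAMETOOLONG;

    printf("Creating directory: %s\n", target);
    rc = mkdirtree(target, 0700);
    printf("mkdirtree: %s: rc=%d\n", rc == 0 ? "successful" : "failed", rc);
    return rc;
}
```

Calling run_iteration() twice against the same parent directory, once with stat_first set to 0 and once with it set to 1, and comparing the two return values corresponds to the "inconsistency detected" check in the output above. On an unaffected filesystem both calls are expected to return 0.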

Comment by Lukasz Flis [ 27/Feb/12 ]

Hi,

One of our users, running the Quantum Espresso application, hit the bug today.
The user had set the outdir variable to her directory on the Lustre filesystem.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 39
from parallel_mkdir : error # 1
/mnt/lustre/scratch/people/xuser/ non existent or non writable
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 14
from parallel_mkdir : error # 1
/mnt/lustre/scratch/people/xuser/ non existent or non writable
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
task # 44
from parallel_mkdir : error # 1
/mnt/lustre/scratch/people/xuser/ non existent or non writable
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The strace dump showed that the mkdir result was:
wrapper.6736.3477:mkdir("/mnt/lustre/scratch/people/xuser/", 0777) = -1 EACCES (Permission denied)

After doing a stat on the directory before invoking the application, the problem disappeared:
wrapper.23170.3574:mkdir("/mnt/lustre/scratch/people/xuser/", 0777) = -1 EEXIST (File exists)

Cheers,

Comment by Kit Westneat (Inactive) [ 27/Feb/12 ]

To answer the last question in the bugzilla report, the code that causes this bug was added here as an MDS optimization:
https://bugzilla.lustre.org/show_bug.cgi?id=18534

Comment by Lukasz Flis [ 11/Apr/12 ]

Hi,

Just to update:

We have tested this, and it appears not to be a problem with 2.2.0 clients.
However, 2.1.1 clients with 2.2 servers are still affected by the issue.

Comment by Lukasz Flis [ 18/Jun/12 ]

Hello,

2.2.0 clients are not usable for us yet (one unreported LBUG).

Is there any plan to include a fix for the issue in the upcoming 2.1.2?

Comment by Andreas Dilger [ 06/Nov/13 ]

Closing as a duplicate of LU-4185.

Generated at Sat Feb 10 01:13:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.