[LU-9735] Sles12Sp2 and 2.9 getcwd() sometimes fails Created: 05/Jul/17  Updated: 24/Jul/19  Resolved: 24/Jul/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: James A Simmons
Resolution: Fixed Votes: 1
Labels: ORNL

Attachments: File getcwdHack.c     File miranda.debug.1499341246.gz     File miranda.dis     File r481i7n17.dump1.log.gz     Text File unoptimize-atomic_open-of-negative-dentry.patch    
Issue Links:
Duplicate
duplicates LU-9208 getcwd() sometimes fails Resolved
Related
is related to LU-9868 dcache/namei fixes for lustre Open
is related to LU-10164 kernel update [SLES12 SP3 4.4.92-6.18] Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

This is a duplicate of LU-9208. Opening this case for tracking for NASA. We started to see this once we updated the clients to SLES12 SP2 and Lustre 2.9.

Using the test code provided in LU-9208 (miranda), I was able to reproduce the bug on a single node.

 

Iteration =    868, Run Time =     0.9614 sec., Transfer Rate =   120.7790 10e+06 Bytes/sec/proc
Iteration =    869, Run Time =     1.5308 sec., Transfer Rate =    75.8561 10e+06 Bytes/sec/proc
forrtl: severe (121): Cannot access current working directory for unit 10012, file "Unknown"
Image              PC                Routine            Line        Source             
miranda            0000000000409F29  Unknown               Unknown  Unknown
miranda            00000000004169D2  Unknown               Unknown  Unknown
miranda            0000000000404045  Unknown               Unknown  Unknown
miranda            0000000000402FDE  Unknown               Unknown  Unknown
libc.so.6          00002AAAAB5B96E5  Unknown               Unknown  Unknown
miranda            0000000000402EE9  Unknown               Unknown  Unknown
MPT ERROR: MPI_COMM_WORLD rank 12 has terminated without calling MPI_Finalize()
	aborting job

I was able to capture some debug logs, which I have attached to the case. I was unable to reproduce it using "+trace", but will continue to try.



 Comments   
Comment by Peter Jones [ 05/Jul/17 ]

Bobijam

Could you please assist with this issue?

Thanks

Peter

Comment by Zhenyu Xu [ 05/Jul/17 ]

Is NFS involved? I haven't found a -2 (ENOENT) error in the log.

Comment by Mahmoud Hanafi [ 05/Jul/17 ]

NFS is not involved. I'll try to get additional debugging to catch the ENOENT.

 

Comment by Mahmoud Hanafi [ 06/Jul/17 ]

I am unable to reproduce with -1 debugging. What is the minimum debugging level needed to diagnose the issue?

Comment by Zhenyu Xu [ 06/Jul/17 ]

How about starting with "trace inode info other dentry rpctrace vfstrace console"? And if possible, add "dlmtrace" as well.

Comment by Mahmoud Hanafi [ 06/Jul/17 ]

Uploaded miranda.debug.1499341246.gz, but I still didn't see any ENOENT error. This was the error on the client:

 

Iteration =   9649, Run Time =     2.5712 sec., Transfer Rate =    45.1620 10e+06 Bytes/sec/proc
forrtl: severe (121): Cannot access current working directory for unit 10013, file "Unknown"
Image              PC                Routine            Line        Source             
miranda            0000000000409F29  Unknown               Unknown  Unknown
miranda            00000000004169D2  Unknown               Unknown  Unknown
miranda            0000000000404045  Unknown               Unknown  Unknown
miranda            0000000000402FDE  Unknown               Unknown  Unknown
libc.so.6          00002AAAAB5B96E5  Unknown               Unknown  Unknown
miranda            0000000000402EE9  Unknown               Unknown  Unknown
MPT ERROR: MPI_COMM_WORLD rank 13 has terminated without calling MPI_Finalize()
        aborting job

Comment by Zhenyu Xu [ 06/Jul/17 ]

Is it possible to locate where the "miranda 0000000000409F29 Unknown Unknown Unknown" frame calls into miranda_io.F?

Comment by Zhenyu Xu [ 06/Jul/17 ]

And what files does miranda_io.F create/write/read? I don't know Fortran well.

Comment by Mahmoud Hanafi [ 06/Jul/17 ]

Attached an object dump of miranda.

 

Comment by Zhenyu Xu [ 06/Jul/17 ]

The calling path should be like this

402fd9 callq 402ff0 <MAIN__>
404040 callq 415ea0 <for_open>
416a27 callq 44f410 <for__reopen_file>
44f48b callq  44f550 <for__compute_filename>
4509cd callq  402cb0 <getcwd@plt>

It seems that getcwd is called from the open() for write, and the open() is trying to get the full path name of the file.

Comment by Mahmoud Hanafi [ 06/Jul/17 ]

That is correct. Fortran open always calls getcwd to get the full path before doing the actual open.

I uploaded a new debug log to the ftp site: /uploads/LU-9735/miranda.debug.1499351104.gz

What about a debug patch to help diagnose the issue?

 

 The files written are all called 'miranda_io.out.100xx' where xx is the rank number.

Comment by Zhenyu Xu [ 07/Jul/17 ]

In what path did you start the test program? Is it possible that during the test run, your working path got changed temporarily?

Comment by Zhenyu Xu [ 07/Jul/17 ]

The log shows that miranda_io.out.10013 has gone through the read phase while no write happened on it, and the other miranda_io.out.xxxx files had their write phase happen.

The thing is, getcwd only determines the current working directory. Is it under the Lustre file system? If not, the Lustre log cannot show the reason why it fails.

Comment by Mahmoud Hanafi [ 07/Jul/17 ]

Yes, the working directory is on the Lustre file system.

 

This is a similar issue to LU-645. Our LD_PRELOAD workaround works in this case as well.
I wrote a SystemTap script to trace
kernel.function("SyS_getcwd@../fs/dcache.c:3254")

Here you can see that at the failure it returns 0xfffffffffffffffe:

return=return=0x1c
call=buf=0x7fffffff5a67 size=0x1000
return=return=0x1c
call=buf=0x7fffffff5a67 size=0x1000
return=return=0x1c
call=buf=0x7fffffff5a67 size=0x1000
return=return=0xfffffffffffffffe
call=buf=0x7fffffff5a67 size=0x1000
return=return=0x1c
call=buf=0x7fffffff5a67 size=0x1000
return=return=0x1c
call=buf=0x7fffffff5a67 size=0x1000
return=return=0x1c
call=buf=0x7fffffff5a67 size=0x1000
return=return=0x1c
call=buf=0x8bc040 size=0x1000
return=return=0xb


Comment by Jay Lan (Inactive) [ 07/Jul/17 ]

error of 0x1c is ENOSPC, error of 0xb is EAGAIN
#define ENOSPC 28 /* No space left on device */
#define EAGAIN 11 /* Try again */

Comment by Mahmoud Hanafi [ 07/Jul/17 ]

return=0xfffffffffffffffe is -2

SyS_getcwd      0 miranda.debug(26848): -> buf=0x7fffffff5a67 size=0x1000
 0x2aaaab676d92 : getcwd+0x72/0x6d0 [/lib64/libc-2.22.so]
 0x452762 : for__compute_filename+0x1482/0x16d0 [/home7/mhanafi/LUSTRE/BUGS/GETPWD/miranda.debug]
 0x453a89 : for__open_proc+0xc9/0x34b0 [/home7/mhanafi/LUSTRE/BUGS/GETPWD/miranda.debug]
Returning from:  0xffffffff81225fa0 : sys_getcwd+0x0/0x190 [kernel]
Returning to  :  0xffffffff81618e1e : tracesys_phase2+0x84/0x89 [kernel]
 0xffffffff8101ae0c : try_stack_unwind+0x17c/0x190 [kernel]
 0xffffffff81019ce4 : dump_trace+0x64/0x380 [kernel]
 0xffffffffa1ae1664 [stap_971bda5852e0877040783dde29bd60d_26789+0xe664/0x0]
 0xffffffffa1ae2315 [stap_971bda5852e0877040783dde29bd60d_26789+0xf315/0x0]
 0xffffffffa1ae24d0 [stap_971bda5852e0877040783dde29bd60d_26789+0xf4d0/0x0]
 0xffffffff8105c18b : trampoline_handler+0x11b/0x1d0 [kernel]
 0xffffffff8105bcae : kretprobe_trampoline+0x25/0x57 [kernel]
 0xffffffff81618e1e : tracesys_phase2+0x84/0x89 [kernel] (inexact)
SyS_getcwd    135 miranda.debug(26848): <- return=-2 


Comment by Mahmoud Hanafi [ 08/Jul/17 ]

I think this is what is happening.

SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
{
    int error;
    struct path pwd, root;
    char *page = __getname();

    if (!page)
        return -ENOMEM;

    rcu_read_lock();
    get_fs_root_and_pwd_rcu(current->fs, &root, &pwd);

    error = -ENOENT;
    if (!d_unlinked(pwd.dentry)) {

In the getcwd syscall between

    get_fs_root_and_pwd_rcu(current->fs, &root, &pwd);

and

if (!d_unlinked(pwd.dentry)) {

the dentry is getting dropped. From the systemtap trace you can see here that the dentry with .d_iname="1stripe" is dropped and getcwd errors out.

 

__d_drop 363  dentry={.d_flags=2621646, .d_seq={.sequence=2787}, .d_hash={.next=0x0, .pprev=0xffffc90002553280}, .d_parent=0xffff88105c8bc918, .d_name={<union>={<class>={.hash=1222078897, .len=7}, .hash_len=31286849969}, .name="1stripe"}, .d_inode=0xffff880dd6b2c550, .d_iname="1stripe", .d_lockref={<union>={.lock_count=120259084289, <class>={.lock={<union>={.rlock={.raw_lock={.val={.counter=1}}}}}, .count=28}}}, .d_op=0xffffffffa0ef8240, .d_sb=0xffff880816ab6800, .d_time=0, .d_fsdata=0xffff880e403b3740, .d_lru={.next=0x

__d_drop 363  dentry={.d_flags=142, .d_seq={.sequence=1}, .d_hash={.next=0x0, .pprev=0x0}, .d_parent=0xffff88105c8bc918, .d_name={<union>={<class>={.hash=1222078897, .len=7}, .hash_len=31286849969}, .name="1stripe"}, .d_inode=0x0, .d_iname="1stripe", .d_lockref={<union>={.lock_count=4294967297, <class>={.lock={<union>={.rlock={.raw_lock={.val={.counter=1}}}}}, .count=1}}}, .d_op=0xffffffffa0ef8240, .d_sb=0xffff880816ab6800, .d_time=0, .d_fsdata=0x0, .d_lru={.next=0xffff88105c85ebd8, .prev=0xffff88105c85ebd8}, .d_child={

__d_drop 363  dentry={.d_flags=142, .d_seq={.sequence=2}, .d_hash={.next=0x0, .pprev=0x0}, .d_parent=0xffff88105c8bc918, .d_name={<union>={<class>={.hash=1222078897, .len=7}, .hash_len=31286849969}, .name="1stripe"}, .d_inode=0x0, .d_iname="1stripe", .d_lockref={<union>={.lock_count=18446743523953737729, <class>={.lock={<union>={.rlock={.raw_lock={.val={.counter=1}}}}}, .count=-128}}}, .d_op=0xffffffffa0ef8240, .d_sb=0xffff880816ab6800, .d_time=0, .d_fsdata=0x0, .d_lru={.next=0xffff88105c85ebd8, .prev=0xffff88105c85ebd8

SyS_getcwd      0 miranda.debug(354): <- return=-2 /nobackupp8/mhanafi/1stripe
 0x2aaaab676d92 [/lib64/libc-2.22.so+0xddd92/0x39f000]
Returning from:  0xffffffff81225fa0 : sys_getcwd+0x0/0x190 [kernel]
Returning to  :  0xffffffff81618e1e : tracesys_phase2+0x84/0x89 [kernel]
 0xffffffff8101ae0c : try_stack_unwind+0x17c/0x190 [kernel]
 0xffffffff81019ce4 : dump_trace+0x64/0x380 [kernel]
 0xffffffffa1155b09 [stap_257b7a801b68bce2a60161ccda5d2586_4_302+0x12b09/0x0]
 0xffffffffa1156865 [stap_257b7a801b68bce2a60161ccda5d2586_4_302+0x13865/0x0]
 0xffffffffa1156a20 [stap_257b7a801b68bce2a60161ccda5d2586_4_302+0x13a20/0x0]
 0xffffffff8105c18b : trampoline_handler+0x11b/0x1d0 [kernel]
 0xffffffff8105bcae : kretprobe_trampoline+0x25/0x57 [kernel]
 0xffffffff81618e1e : tracesys_phase2+0x84/0x89 [kernel] (inexact)
Pass 5: run completed in 0usr/180sys/11238real ms.

Comment by Zhenyu Xu [ 10/Jul/17 ]

Would you mind trying this patch?

Comment by Mahmoud Hanafi [ 10/Jul/17 ]

No patch was attached.

Comment by Zhenyu Xu [ 10/Jul/17 ]

Can you see unoptimize-atomic_open-of-negative-dentry.patch?

Comment by Mahmoud Hanafi [ 10/Jul/17 ]

I can see the attached patch now. But is it not in Gerrit?

Comment by Mahmoud Hanafi [ 10/Jul/17 ]

We tried the patch and the code failed the same way.

 Could this be related to https://jira.hpdd.intel.com/browse/LU-8891?

 

Comment by Oleg Drokin [ 12/Jul/17 ]

So it looks like the sequence of events unfolds like this:
Process 1: open_create("1stripe/somefilename") -> this in turn goes to cancel the update lock we have for "1stripe". But the lock is shared with the lookup lock, so cancelling it means the dentry is dropped.
Process 2: does getcwd - it revalidates "1stripe" first and then checks validity.

In the bigger log it's all happening at 1499341246.216525
Process 1 is 34167 and process 2 is 34165:

00000080:00200000:12.0:1499341246.216524:0:34165:0:(dcache.c:357:ll_revalidate_nd()) VFS Op:name=1stripe, flags=17
00000080:00002000:14.0:1499341246.216525:0:34167:0:(llite_internal.h:1388:d_lustre_invalidate()) invalidate dentry 1stripe (ffff8808fcbfdd98) parent ffff8808fcbfdf18 inode ffff8810414180d0 refc 265
00000080:00000001:12.0:1499341246.216525:0:34165:0:(dcache.c:360:ll_revalidate_nd()) Process leaving (rc=1 : 1 : 1)
00000080:00000001:12.0:1499341246.216525:0:34165:0:(file.c:3695:ll_inode_permission()) Process entered

This sounds like kind of a normal race - I don't see why the dentry could not be invalidated between the lookup and the check and then revalidated later (this also explains why NFS would trip over this at times).
I see no easy way to quash this without a kernel patch, and even that likely would not be too easy; we'd need to get Al Viro behind it, I imagine.

Comment by Mahmoud Hanafi [ 17/Jul/17 ]

Any updates?

Comment by Zhenyu Xu [ 19/Jul/17 ]

We're still investigating it.

Comment by Mahmoud Hanafi [ 19/Jul/17 ]

Btw, I did get it to fail with SLES12 SP2 and Lustre 2.10.

Comment by Jinshan Xiong (Inactive) [ 20/Jul/17 ]

The problem is clear, as Oleg mentioned. The getcwd() syscall does not revalidate the dentry, so we need to patch the kernel. I can't think of any workaround solution at this moment.

Comment by Nathan Dauchy (Inactive) [ 07/Aug/17 ]

For reference, I have uploaded the code for the getcwd() retry hack we are using with LD_PRELOAD to work around this problem. We have not noticed any significant performance impact with this approach. Perhaps similar code could be added to Lustre itself.

Built with something like:

gcc -Wall -c -fPIC -DPIC getcwdHack.c -o getcwdHack.lo
gcc -Wall -shared getcwdHack.lo -ldl -Wl,-soname -Wl,libgetcwdHack.so -o libgetcwdHack.so

Run with something like:

export LD_PRELOAD=`pwd`/libgetcwdHack.so
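The attached getcwdHack.c is not shown inline; as a rough illustration only, an LD_PRELOAD retry wrapper of this kind might look like the sketch below (retry count and back-off delay are arbitrary, and the real attachment may differ):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Interpose getcwd(): on a spurious ENOENT, retry a few times before
 * giving up.  Retry count and delay are illustrative only. */
char *getcwd(char *buf, size_t size)
{
    static char *(*real_getcwd)(char *, size_t);
    char *ret;
    int i;

    if (!real_getcwd)
        real_getcwd = (char *(*)(char *, size_t))dlsym(RTLD_NEXT, "getcwd");

    for (i = 0; i < 10; i++) {
        ret = real_getcwd(buf, size);
        if (ret != NULL || errno != ENOENT)
            break;
        usleep(1000);   /* brief back-off before retrying */
    }
    return ret;
}

Built and preloaded as shown above, every getcwd() call made by the process then transparently retries across the transient ENOENT window.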

Comment by Sebastian Parschauer [ 30/Aug/17 ]

The retry mechanism has to be applied to Intel/SGI MPI code (libmpifort.so). The string "forrtl" is included in it and it is the only runtime library which calls getcwd(). So only MPI Fortran code is affected.

openmpi and mvapich2 are not affected (nothing calls getcwd() there).

Comment by Sebastian Parschauer [ 30/Aug/17 ]

The kernel maintains a dentry lookup hash list. If getcwd() finds a dentry which is not in that list, then -ENOENT is returned. This is the case at initialization time of struct hlist_bl_node and when it has been deleted from that list. This works pretty well with local file systems.

Can someone explain how Lustre handles the local dentry lookup hash list?
Is it rebuilt from time to time?
Is it possible that getcwd() can be called before the dentry is added to the list?

Comment by Sebastian Parschauer [ 30/Aug/17 ]

Since kernel 3.12, getcwd() performance has been improved a lot by the way. Spin locks have been replaced with RCU.

Comment by Oleg Drokin [ 31/Aug/17 ]

Sebastian:
The way Lustre handles the local dentry lookup hash is the same as any other local filesystem. It is not possible for getcwd to be called before the entry was added to the list, because the lookup places it there.

The problem at hand is that the entry could be removed from the list between the lookup and getcwd. On normal filesystems that's possible if the entry was unlinked meanwhile. On Lustre (and to a degree on NFS) an additional way for the entry to get dropped from the list is that the entry became stale (i.e. somebody attempted a conflicting operation that revoked the lock on that entry, so it was invalidated). Since getcwd does not revalidate entries, it might return false errors in those cases.

Comment by Sebastian Parschauer [ 07/Sep/17 ]

In the meantime I've talked to our NFS developer Neil Brown. getcwd() shouldn't need to revalidate the dentry. NFS code is not affected - confirmed by our customer. Upstream Lustre code also doesn't seem to be affected - Neil checked the source.
If you can provide us with your affected Lustre source, he offers to have a look at it.

Comment by Sebastian Parschauer [ 07/Sep/17 ]

I think I've found the Lustre 2.9.0 bug for SLES12-SP2. ll_set_fs_pwd() only supports the fs->lock spin lock or the fs->lock write lock but not the RCU locking done by getcwd() in kernel 4.4.
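As an illustration of that observation (not the actual Lustre patch), a pwd update that also bumps fs->seq, mirroring the mainline kernel's own set_fs_pwd(), would let the seqcount-protected readers in getcwd()/get_fs_root_and_pwd_rcu() retry instead of racing; the signature shown is the one ll_set_fs_pwd() had in older Lustre clients and may differ:

#include <linux/fs_struct.h>
#include <linux/path.h>

/* Sketch only: update fs->pwd under both fs->lock and fs->seq so that
 * lockless getcwd() readers notice the change via the seqcount and retry. */
static void ll_set_fs_pwd(struct fs_struct *fs, struct vfsmount *mnt,
                          struct dentry *dentry)
{
    struct path path = { .mnt = mnt, .dentry = dentry };
    struct path old_pwd;

    path_get(&path);
    spin_lock(&fs->lock);
    write_seqcount_begin(&fs->seq);     /* readers will see the seq change */
    old_pwd = fs->pwd;
    fs->pwd = path;
    write_seqcount_end(&fs->seq);
    spin_unlock(&fs->lock);

    if (old_pwd.dentry)
        path_put(&old_pwd);
}

The patch uploaded the next day (28907, "compat: heed the fs_struct::seq") targets this same fs_struct::seq field, as the following comments show.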

Comment by Gerrit Updater [ 08/Sep/17 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/28907
Subject: LU-9735 compat: heed the fs_struct::seq
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 898c57e49743e5bf57b5fb689484c71fc909bb61

Comment by Zhenyu Xu [ 08/Sep/17 ]

Would you try this patch?

Comment by Andreas Dilger [ 09/Sep/17 ]

sparschauer, I also saw the following email on the linux-fsdevel mailing list from Neil Brown:

Subject: Re: [PATCH] d_move() vs d_unhashed() race: retry under d_lock

On Fri, Sep 08 2017, Goldwyn Rodrigues wrote:
>
> This is a follow-up of Alexey's patch at https://patchwork.kernel.org/patch/9455345/
> with suggestions proposed by Al Viro.
>
> d_move() and d_unhashed() may race because there is a small window
> where the dentry is unhashed. This may result in ENOENT (for getcwd).
> This must be checked under d_lock. However, in order to keep the fast
> path, perform the d_unhashed without d_lock first, and in the unlikely
> event that it succeeds, perform the check again under d_lock.

For your consideration, here is an alternate patch which - I believe -
achieves the same end. I think this approach is a little more robust,
but there isn't a lot in it - Goldwyn's is arguably simpler so might be
better for that reason.

NeilBrown

The actual patch is at https://patchwork.kernel.org/patch/9945159/

From dfaa166e2afaed051c388dc9f43d1468020b5e22 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.com>
Date: Fri, 8 Sep 2017 16:03:42 +1000
Subject: [PATCH] VFS: close race between getcwd() and d_move()

d_move() will call __d_drop() and then __d_rehash()
on the dentry being moved.  This creates a small window
when the dentry appears to be unhashed.  Many tests
of d_unhashed() are made under ->d_lock and so are safe
from racing with this window, but some aren't.
In particular, getcwd() calls d_unlinked() (which calls
d_unhashed()) without d_lock protection, so it can race.

This race has been seen in practice with lustre, which uses d_move() as
part of name lookup.  See:
   https://jira.hpdd.intel.com/browse/LU-9735
It could race with a regular rename(), and result in ENOENT instead
of either the 'before' or 'after' name.

We could fix this race by taking d_lock and rechecking when
d_unhashed() reports true.  Alternately we can remove the window,
which is the approach this patch takes.

When __d_drop and __d_rehash are used to move a dentry, an extra
flag is passed which causes d_hash.pprev to not be cleared, and
to not be tested.

Signed-off-by: NeilBrown <neilb@suse.com>
Comment by Bob Glossman (Inactive) [ 26/Oct/17 ]

A fix for this problem is now shipped in the latest kernel version for sles12sp3. The description of the fix is as follows:

From: NeilBrown <neilb@suse.com>
Subject: getcwd: Close race with d_move called by lustre
Patch-mainline: Not yet, under development
References: bsc#1052593

When lustre invalidates a dentry (e.g. due to a recalled lock) and then
revalidates it, ll_splice_alias() will call d_move() to move the old alias
to the name of a new one.
This will d_drop then d_rehash the old dentry, creating a small window
when the dentry is unhashed.
If getcwd is run at this time, it might incorrectly think that
the dentry really is unhashed, and so return ENOENT.

This is a bug in lustre (it shouldn't call d_move()) but we can work
around it in getcwd by taking the d_lock to avoid the race.
First we test without the lock as the common case does not involve
any race.  If we find that the dentry appears to be unhashed, we take
the lock and check again.

Signed-off-by: Neil Brown <neilb@suse.com>
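As a rough sketch of the "check again under d_lock" idea described above (the helper name is hypothetical; this is not the actual SUSE or upstream patch):

#include <linux/dcache.h>
#include <linux/spinlock.h>

/* Lockless fast path first; only if the dentry looks unhashed do we take
 * d_lock and recheck, which closes the transient __d_drop()/__d_rehash()
 * window inside d_move(). */
static bool cwd_really_unlinked(struct dentry *dentry)
{
    bool unlinked = d_unlinked(dentry);

    if (unlikely(unlinked)) {
        spin_lock(&dentry->d_lock);
        unlinked = d_unlinked(dentry);
        spin_unlock(&dentry->d_lock);
    }
    return unlinked;
}

getcwd() would call such a helper in place of the bare d_unlinked(pwd.dentry) test before returning -ENOENT.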
Comment by Peter Jones [ 26/Oct/17 ]

So can we safely close this as "not a bug" from a Lustre perspective, safe in the knowledge that this will be fixed in current versions of SLES?

Comment by Bob Glossman (Inactive) [ 26/Oct/17 ]

I would vote for 'yes', but it's only fixed for new versions of SLES12 SP3, not for anything older. Do other commenters in this ticket have opinions?

Has there been any feedback on the other proposed fix https://review.whamcloud.com/28907 ?

Comment by Sebastian Parschauer [ 26/Oct/17 ]

The patch has been submitted from SLE11-SP3-LTSS to SLE15. New kernels containing the fix will be released soon.

Comment by Mahmoud Hanafi [ 26/Oct/17 ]

What about the above comment 

"This is a bug in lustre (it shouldn't call d_move())"

Comment by Jay Lan (Inactive) [ 26/Oct/17 ]

We carry #28907 in our nas-2.10.x and it fixed our problem. It would probably be 4-6 months before we can upgrade to SLES12 SP3.

Would #28907 conflict with SUSE's workaround in sles12sp3?

Neil Brown, while proposing that workaround, thought it was a bug in Lustre. I think we should have a proper fix.

Comment by Gerrit Updater [ 31/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28907/
Subject: LU-9735 compat: heed the fs_struct::seq
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fff1163fdb41190b59adb8d90919e0adf37f68fb

Comment by Minh Diep [ 31/Jan/18 ]

Landed for 2.11

Comment by Gerrit Updater [ 31/Jan/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31106
Subject: LU-9735 compat: heed the fs_struct::seq
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 6c21f76ae7ef6cb8004dd60db87583c101b50aa6

Comment by Gerrit Updater [ 09/Feb/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31106/
Subject: LU-9735 compat: heed the fs_struct::seq
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 030b15004d3acf6b98c198263fcca232129568cc

Comment by Stanford Research Computing Center [ 08/May/18 ]

Hi!

As an additional datapoint, we'd like to report that we've been seeing this exact same behavior with the latest Maintenance Release (2.10.3) and the latest available CentOS 7.4 kernel

# uname -a
Linux sh-104-39.int 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

# cat /sys/fs/lustre/version
2.10.3 

 

Symptoms are the same stack as initially reported, and happened while running VASP jobs:

"forrtl: severe (121): Cannot access current working directory for unit 7, file "Unknown"
Image              PC                Routine            Line        Source
vasp_gam           000000000140B496  Unknown               Unknown  Unknown
vasp_gam           000000000142511E  Unknown               Unknown  Unknown
vasp_gam           000000000091665F  Unknown               Unknown  Unknown
vasp_gam           0000000000CFE655  Unknown               Unknown  Unknown
vasp_gam           00000000012AF330  Unknown               Unknown  Unknown
vasp_gam           0000000000408D1E  Unknown               Unknown  Unknown
libc-2.17.so       00007F839E16FC05  __libc_start_main     Unknown  Unknown
vasp_gam           0000000000408C29  Unknown               Unknown  Unknown

 

The "try-again" workaround provided by @Nathan works great and we're recommending that our users use it for now. With the libgetcwdHack.so library LD_PRELOADed, the application generates this kind of log:

NF: getcwd: mpi rank -1, host sh-104-39.int: [190701]: failed: size 4096,
buf 0x7fffd6c0959b, ret (nil): No such file or directory
NF: getcwd: mpi rank -1, host sh-104-39.int: [190701]: succeeded at try 2
of 10: size 4096, buf 0x7fffd6c0959b, ret 0x7fffd6c0959b, path
/scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: failed: size 4096,
buf 0x7ffe51196e9b, ret (nil): No such file or directory
NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: succeeded at try 2
of 10: size 4096, buf 0x7ffe51196e9b, ret 0x7ffe51196e9b, path
/scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
NF: getcwd: mpi rank -1, host sh-104-39.int: [190699]: failed: size 4096,
buf 0x7ffdb3f1f31b, ret (nil): No such file or directory
NF: getcwd: mpi rank -1, host sh-104-39.int: [190699]: succeeded at try 2
of 10: size 4096, buf 0x7ffdb3f1f31b, ret 0x7ffdb3f1f31b, path
/scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
NF: getcwd: mpi rank -1, host sh-104-39.int: [190697]: failed: size 4096,
buf 0x7ffe9713df9b, ret (nil): No such file or directory
NF: getcwd: mpi rank -1, host sh-104-39.int: [190697]: succeeded at try 2
of 10: size 4096, buf 0x7ffe9713df9b, ret 0x7ffe9713df9b, path
/scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: failed: size 4096,
buf 0x7ffe51196e9b, ret (nil): No such file or directory
NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: succeeded at try 2
of 10: size 4096, buf 0x7ffe51196e9b, ret 0x7ffe51196e9b, path
/scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3

which seems very indicative of the same error.

We're looking forward to the fix in 2.10.4.

Cheers,
-- 
Kilian

Comment by Sebastien Piechurski [ 19/Jul/18 ]

Hi All,

 

I did not see any reaction to Mahmoud Hanafi's comment from 26 October 2017, citing Neil Brown:

"This is a bug in lustre (it shouldn't call d_move())"

 

We have installed 2.10.4 at a customer's site which encountered this problem (with RHEL 7 clients), and even though this considerably decreased the number of occurrences, we still see the "small window when the dentry is unhashed" making the job fail.

Is there something that can be done in ll_splice_alias() to close this race?

I understand this is fixed by SuSE in their latest kernels, but this is not the case for earlier kernels, nor for RHEL kernels.

Or should we push Red Hat to apply the same kind of patch SuSE did (which does not seem really fair to me)?

Comment by Marek Magrys [ 27/Sep/18 ]

It looks like we hit the same issue with a Lustre 2.10.5 client and a CentOS 7.5 kernel. Should the fix come from Lustre or from the kernel? I'm confused by the previous discussion.

Comment by James A Simmons [ 27/Sep/18 ]

I wonder if the fixes from LU-9868 would fix this. Note that the posted patch has a bug in it; I have a fix but haven't pushed it.

Comment by James A Simmons [ 27/Sep/18 ]

Can you give https://review.whamcloud.com/#/c/28486 a try?

Comment by James A Simmons [ 01/Oct/18 ]

As reported, the earlier patch for this bug didn't completely solve the problem. The work from LU-9868, which is now linked to this ticket, has been reported as solving it.

Comment by James A Simmons [ 04/Oct/18 ]

So it appears that the patch for LU-9868, while fixing this bug, has exposed another potential bug in Lustre. If you run sanity test 233 you see:

[37212.956888] VFS: Lookup of '[0x200000007:0x1:0x0]' in lustre lustre would have caused loop

[37217.817624] Lustre: DEBUG MARKER: sanity test_233a: @@@@@@ FAIL: cannot access /lustre/lustre using its FID '[0x200000007:0x1:0x0]'

[37236.855424] Lustre: DEBUG MARKER: == sanity test 233b: checking that OBF of the FS .lustre succeeds ==================================== 03:34:33 (1538379273)

[37238.362201] VFS: Lookup of '[0x200000002:0x1:0x0]' in lustre lustre would have caused loop

[37243.442480] Lustre: DEBUG MARKER: sanity test_233b: @@@@@@ FAIL: cannot access /lustre/lustre/.lustre using its FID '[0x200000002:0x1:0x0]'

Somehow the parent/child relationship got inverted. Will investigate.

Comment by Mahmoud Hanafi [ 24/Jul/19 ]

We can close this case.

Comment by Peter Jones [ 24/Jul/19 ]

OK, so given that the initial fix seems to satisfy NASA (the original reporter), we can close the ticket. simmonsja, can you track any remaining work under a new ticket?

Comment by James A Simmons [ 24/Jul/19 ]

I already have another ticket for this
