[LU-9735] Sles12Sp2 and 2.9 getcwd() sometimes fails Created: 05/Jul/17 Updated: 24/Jul/19 Resolved: 24/Jul/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.11.0, Lustre 2.10.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | ORNL |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This is a duplicate of an existing ticket. Using the test code provided:
Iteration = 868, Run Time = 0.9614 sec., Transfer Rate = 120.7790 10e+06 Bytes/sec/proc
Iteration = 869, Run Time = 1.5308 sec., Transfer Rate = 75.8561 10e+06 Bytes/sec/proc
forrtl: severe (121): Cannot access current working directory for unit 10012, file "Unknown"
Image              PC                Routine            Line        Source
miranda            0000000000409F29  Unknown            Unknown     Unknown
miranda            00000000004169D2  Unknown            Unknown     Unknown
miranda            0000000000404045  Unknown            Unknown     Unknown
miranda            0000000000402FDE  Unknown            Unknown     Unknown
libc.so.6          00002AAAAB5B96E5  Unknown            Unknown     Unknown
miranda            0000000000402EE9  Unknown            Unknown     Unknown
MPT ERROR: MPI_COMM_WORLD rank 12 has terminated without calling MPI_Finalize()
aborting job

I was able to capture some debug logs, which I have attached to the case. I was unable to reproduce it using "+trace", but will continue to try. |
| Comments |
| Comment by Peter Jones [ 05/Jul/17 ] |
|
Bobijam Could you please assist with this issue? Thanks Peter |
| Comment by Zhenyu Xu [ 05/Jul/17 ] |
|
Is NFS involved? I haven't found a -2 (ENOENT) error in the log. |
| Comment by Mahmoud Hanafi [ 05/Jul/17 ] |
|
NFS is not involved. I'll try to get additional debugging to catch the ENOENT.
|
| Comment by Mahmoud Hanafi [ 06/Jul/17 ] |
|
I am unable to reproduce it with -1 debugging. What is the minimum debugging level needed to diagnose the issue? |
| Comment by Zhenyu Xu [ 06/Jul/17 ] |
|
How about starting with "trace inode info other dentry rpctrace vfstrace console"? And if possible, add "dlmtrace" as well. |
| Comment by Mahmoud Hanafi [ 06/Jul/17 ] |
|
Uploaded miranda.debug.1499341246.gz, but I still didn't see any ENOENT error. This was the error on the client:
Iteration = 9649, Run Time = 2.5712 sec., Transfer Rate = 45.1620 10e+06 Bytes/sec/proc
forrtl: severe (121): Cannot access current working directory for unit 10013, file "Unknown"
Image              PC                Routine            Line        Source
miranda            0000000000409F29  Unknown            Unknown     Unknown
miranda            00000000004169D2  Unknown            Unknown     Unknown
miranda            0000000000404045  Unknown            Unknown     Unknown
miranda            0000000000402FDE  Unknown            Unknown     Unknown
libc.so.6          00002AAAAB5B96E5  Unknown            Unknown     Unknown
miranda            0000000000402EE9  Unknown            Unknown     Unknown
MPT ERROR: MPI_COMM_WORLD rank 13 has terminated without calling MPI_Finalize()
aborting job |
| Comment by Zhenyu Xu [ 06/Jul/17 ] |
|
Is it possible to locate which call in miranda_io.F corresponds to "miranda 0000000000409F29 Unknown Unknown Unknown"? |
| Comment by Zhenyu Xu [ 06/Jul/17 ] |
|
And what files does miranda_io.F create/write/read? I don't know Fortran well. |
| Comment by Mahmoud Hanafi [ 06/Jul/17 ] |
|
Attached an object dump of miranda.
|
| Comment by Zhenyu Xu [ 06/Jul/17 ] |
|
The calling path should be like this:

402fd9 callq 402ff0 <MAIN__>
404040 callq 415ea0 <for_open>
416a27 callq 44f410 <for__reopen_file>
44f48b callq 44f550 <for__compute_filename>
4509cd callq 402cb0 <getcwd@plt>

It seems that getcwd is called from the open() for write, and the open() is trying to get the full path name of the file. |
| Comment by Mahmoud Hanafi [ 06/Jul/17 ] |
|
That is correct. Fortran open always calls getcwd to get the full path before doing the actual open. I uploaded a new debug log to the ftp site: /uploads/ What about a debug patch to help diagnose the issue?
The files written are all called 'miranda_io.out.100xx' where xx is the rank number. |
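(For illustration, a minimal C analogue of that sequence - resolve the working directory, then build the full file name for the subsequent open. The file name and error handling are assumptions, not taken from miranda_io.F.)

#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char cwd[PATH_MAX];
	char path[PATH_MAX + 32];

	/* The Fortran runtime resolves the CWD first ... */
	if (getcwd(cwd, sizeof(cwd)) == NULL) {
		perror("getcwd");	/* the call that intermittently fails with ENOENT here */
		return 1;
	}
	/* ... then builds the full file name used by the actual open(). */
	snprintf(path, sizeof(path), "%s/%s", cwd, "miranda_io.out.10013");
	printf("would open %s\n", path);
	return 0;
}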
| Comment by Zhenyu Xu [ 07/Jul/17 ] |
|
In what path did you start the test program? Is it possible that during the test run your working path got changed temporarily? |
| Comment by Zhenyu Xu [ 07/Jul/17 ] |
|
The log shows that miranda_io.out.10013 has been going through the read phase while no write happens on it, and the other miranda_io.out.xxxx files have had the write phase happen. The thing is, getcwd only determines the current working directory; is it under the Lustre file system? If not, the Lustre log cannot show the reason why it fails. |
| Comment by Mahmoud Hanafi [ 07/Jul/17 ] |
|
Yes, the working directory is on the Lustre file system.
This is a similar issue. Here you can see that at the failure it returns:

return=return=0x1c call=buf=0x7fffffff5a67 size=0x1000
return=return=0x1c call=buf=0x7fffffff5a67 size=0x1000
return=return=0x1c call=buf=0x7fffffff5a67 size=0x1000
return=return=0xfffffffffffffffe call=buf=0x7fffffff5a67 size=0x1000
return=return=0x1c call=buf=0x7fffffff5a67 size=0x1000
return=return=0x1c call=buf=0x7fffffff5a67 size=0x1000
return=return=0x1c call=buf=0x7fffffff5a67 size=0x1000
return=return=0x1c call=buf=0x8bc040 size=0x1000
return=return=0xb |
| Comment by Jay Lan (Inactive) [ 07/Jul/17 ] |
|
error of 0x1c is ENOSPC, error of 0xb is EAGAIN |
| Comment by Mahmoud Hanafi [ 07/Jul/17 ] |
|
return=0xfffffffffffffffe is -2:

SyS_getcwd 0 miranda.debug(26848): -> buf=0x7fffffff5a67 size=0x1000
0x2aaaab676d92 : getcwd+0x72/0x6d0 [/lib64/libc-2.22.so]
0x452762 : for__compute_filename+0x1482/0x16d0 [/home7/mhanafi/LUSTRE/BUGS/GETPWD/miranda.debug]
0x453a89 : for__open_proc+0xc9/0x34b0 [/home7/mhanafi/LUSTRE/BUGS/GETPWD/miranda.debug]
Returning from: 0xffffffff81225fa0 : sys_getcwd+0x0/0x190 [kernel]
Returning to : 0xffffffff81618e1e : tracesys_phase2+0x84/0x89 [kernel]
0xffffffff8101ae0c : try_stack_unwind+0x17c/0x190 [kernel]
0xffffffff81019ce4 : dump_trace+0x64/0x380 [kernel]
0xffffffffa1ae1664 [stap_971bda5852e0877040783dde29bd60d_26789+0xe664/0x0]
0xffffffffa1ae2315 [stap_971bda5852e0877040783dde29bd60d_26789+0xf315/0x0]
0xffffffffa1ae24d0 [stap_971bda5852e0877040783dde29bd60d_26789+0xf4d0/0x0]
0xffffffff8105c18b : trampoline_handler+0x11b/0x1d0 [kernel]
0xffffffff8105bcae : kretprobe_trampoline+0x25/0x57 [kernel]
0xffffffff81618e1e : tracesys_phase2+0x84/0x89 [kernel] (inexact)
SyS_getcwd 135 miranda.debug(26848): <- return=-2
|
| Comment by Mahmoud Hanafi [ 08/Jul/17 ] |
|
I think this is what is happening:

SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
{
	int error;
	struct path pwd, root;
	char *page = __getname();

	if (!page)
		return -ENOMEM;

	rcu_read_lock();
	get_fs_root_and_pwd_rcu(current->fs, &root, &pwd);

	error = -ENOENT;
	if (!d_unlinked(pwd.dentry)) {

In the getcwd syscall, between get_fs_root_and_pwd_rcu(current->fs, &root, &pwd); and if (!d_unlinked(pwd.dentry)) {
the dentry is getting dropped. From the systemtap trace below you can see that the dentry with .d_iname="1stripe" is dropped and getcwd errors out.
__d_drop 363 dentry={.d_flags=2621646, .d_seq={.sequence=2787}, .d_hash={.next=0x0, .pprev=0xffffc90002553280}, .d_parent=0xffff88105c8bc918, .d_name={<union>={<class>={.hash=1222078897, .len=7}, .hash_len=31286849969}, .name="1stripe"}, .d_inode=0xffff880dd6b2c550, .d_iname="1stripe", .d_lockref={<union>={.lock_count=120259084289, <class>={.lock={<union>={.rlock={.raw_lock={.val={.counter=1}}}}}, .count=28}}}, .d_op=0xffffffffa0ef8240, .d_sb=0xffff880816ab6800, .d_time=0, .d_fsdata=0xffff880e403b3740, .d_lru={.next=0x
__d_drop 363 dentry={.d_flags=142, .d_seq={.sequence=1}, .d_hash={.next=0x0, .pprev=0x0}, .d_parent=0xffff88105c8bc918, .d_name={<union>={<class>={.hash=1222078897, .len=7}, .hash_len=31286849969}, .name="1stripe"}, .d_inode=0x0, .d_iname="1stripe", .d_lockref={<union>={.lock_count=4294967297, <class>={.lock={<union>={.rlock={.raw_lock={.val={.counter=1}}}}}, .count=1}}}, .d_op=0xffffffffa0ef8240, .d_sb=0xffff880816ab6800, .d_time=0, .d_fsdata=0x0, .d_lru={.next=0xffff88105c85ebd8, .prev=0xffff88105c85ebd8}, .d_child={
__d_drop 363 dentry={.d_flags=142, .d_seq={.sequence=2}, .d_hash={.next=0x0, .pprev=0x0}, .d_parent=0xffff88105c8bc918, .d_name={<union>={<class>={.hash=1222078897, .len=7}, .hash_len=31286849969}, .name="1stripe"}, .d_inode=0x0, .d_iname="1stripe", .d_lockref={<union>={.lock_count=18446743523953737729, <class>={.lock={<union>={.rlock={.raw_lock={.val={.counter=1}}}}}, .count=-128}}}, .d_op=0xffffffffa0ef8240, .d_sb=0xffff880816ab6800, .d_time=0, .d_fsdata=0x0, .d_lru={.next=0xffff88105c85ebd8, .prev=0xffff88105c85ebd8
SyS_getcwd 0 miranda.debug(354): <- return=-2 /nobackupp8/mhanafi/1stripe
0x2aaaab676d92 [/lib64/libc-2.22.so+0xddd92/0x39f000]
Returning from: 0xffffffff81225fa0 : sys_getcwd+0x0/0x190 [kernel]
Returning to : 0xffffffff81618e1e : tracesys_phase2+0x84/0x89 [kernel]
0xffffffff8101ae0c : try_stack_unwind+0x17c/0x190 [kernel]
0xffffffff81019ce4 : dump_trace+0x64/0x380 [kernel]
0xffffffffa1155b09 [stap_257b7a801b68bce2a60161ccda5d2586_4_302+0x12b09/0x0]
0xffffffffa1156865 [stap_257b7a801b68bce2a60161ccda5d2586_4_302+0x13865/0x0]
0xffffffffa1156a20 [stap_257b7a801b68bce2a60161ccda5d2586_4_302+0x13a20/0x0]
0xffffffff8105c18b : trampoline_handler+0x11b/0x1d0 [kernel]
0xffffffff8105bcae : kretprobe_trampoline+0x25/0x57 [kernel]
0xffffffff81618e1e : tracesys_phase2+0x84/0x89 [kernel] (inexact)
Pass 5: run completed in 0usr/180sys/11238real ms.
|
| Comment by Zhenyu Xu [ 10/Jul/17 ] |
|
Would you mind trying this patch? |
| Comment by Mahmoud Hanafi [ 10/Jul/17 ] |
|
No patch was attached. |
| Comment by Zhenyu Xu [ 10/Jul/17 ] |
|
Can you see unoptimize-atomic_open-of-negative-dentry.patch? |
| Comment by Mahmoud Hanafi [ 10/Jul/17 ] |
|
I can see the attached patch now. But is it not in Gerrit? |
| Comment by Mahmoud Hanafi [ 10/Jul/17 ] |
|
We tried the patch and the code failed the same way. Could this be related to https://jira.hpdd.intel.com/browse/LU-8891?
|
| Comment by Oleg Drokin [ 12/Jul/17 ] |
|
So it looks like the sequence of events unfolds like this. In the bigger log it's all happening at 1499341246.216525:

00000080:00200000:12.0:1499341246.216524:0:34165:0:(dcache.c:357:ll_revalidate_nd()) VFS Op:name=1stripe, flags=17
00000080:00002000:14.0:1499341246.216525:0:34167:0:(llite_internal.h:1388:d_lustre_invalidate()) invalidate dentry 1stripe (ffff8808fcbfdd98) parent ffff8808fcbfdf18 inode ffff8810414180d0 refc 265
00000080:00000001:12.0:1499341246.216525:0:34165:0:(dcache.c:360:ll_revalidate_nd()) Process leaving (rc=1 : 1 : 1)
00000080:00000001:12.0:1499341246.216525:0:34165:0:(file.c:3695:ll_inode_permission()) Process entered

This sounds like kind of a normal race - I don't see why the dentry could not be invalidated between the lookup and the check, and then revalidated later (which also explains why NFS would trip this at times). |
| Comment by Mahmoud Hanafi [ 17/Jul/17 ] |
|
Any updates? |
| Comment by Zhenyu Xu [ 19/Jul/17 ] |
|
We're still investigating it. |
| Comment by Mahmoud Hanafi [ 19/Jul/17 ] |
|
By the way, I did get it to fail with SLES12 SP2 and Lustre 2.10. |
| Comment by Jinshan Xiong (Inactive) [ 20/Jul/17 ] |
|
The problem is clear, as Oleg mentioned: the getcwd() syscall does not revalidate the dentry, so we need to patch the kernel. I can't think of any workaround at the moment. |
| Comment by Nathan Dauchy (Inactive) [ 07/Aug/17 ] |
|
For reference, I uploaded the code for the getcwd() retry hack we are using with LD_PRELOAD to work around this problem. We have not noticed any significant performance impact with this approach. Perhaps similar code can be added to Lustre itself.

Built with something like:

gcc -Wall -c -fPIC -DPIC getcwdHack.c -o getcwdHack.lo

Run with something like:

export LD_PRELOAD=`pwd`/libgetcwdHack.so |
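(The uploaded getcwdHack.c is not reproduced in this ticket; the following is a minimal sketch of what such an LD_PRELOAD retry wrapper can look like. The file name, retry count, and sleep interval are assumptions, not the attachment's actual contents.)

#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical retry wrapper (not the actual getcwdHack.c): intercept
 * getcwd() and retry a few times when it fails with ENOENT, since the
 * failure described in this ticket is transient. */
char *getcwd(char *buf, size_t size)
{
	static char *(*real_getcwd)(char *, size_t);
	char *ret = NULL;
	int i;

	if (!real_getcwd)
		real_getcwd = (char *(*)(char *, size_t))dlsym(RTLD_NEXT, "getcwd");

	for (i = 0; i < 10; i++) {	/* retry count is an assumption */
		ret = real_getcwd(buf, size);
		if (ret || errno != ENOENT)
			break;
		usleep(1000);		/* brief pause before retrying */
	}
	return ret;
}

Built into a preloadable shared object with something like gcc -shared -fPIC getcwdRetry.c -o libgetcwdRetry.so -ldl (file and library names here are hypothetical).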
| Comment by Sebastian Parschauer [ 30/Aug/17 ] |
|
The retry mechanism has to be applied to Intel/SGI MPI code (libmpifort.so). The string "forrtl" is included in it and it is the only runtime library which calls getcwd(). So only MPI Fortran code is affected. openmpi and mvapich2 are not affected (nothing calls getcwd() there). |
| Comment by Sebastian Parschauer [ 30/Aug/17 ] |
|
The kernel maintains a dentry lookup hash list. If getcwd() finds that a dentry is not in that list, then -ENOENT is returned. This is the case at initialization time of struct hlist_bl_node and when it has been deleted from that list. This works pretty well with local file systems. Can someone explain how Lustre handles the local dentry lookup hash list? |
| Comment by Sebastian Parschauer [ 30/Aug/17 ] |
|
Since kernel 3.12, getcwd() performance has been improved a lot by the way. Spin locks have been replaced with RCU. |
| Comment by Oleg Drokin [ 31/Aug/17 ] |
|
Sebastian: The problem at hand is that the entry could be removed from the list between the lookup and getcwd. On normal filesystems that's possible if the entry was unlinked in the meantime. On Lustre (and to a degree on NFS) an additional way for the entry to get dropped from the list is that it became stale (i.e. somebody attempted a conflicting operation that revoked the lock on that entry, so it was invalidated). Since getcwd does not revalidate entries, it might return false errors in those cases. |
| Comment by Sebastian Parschauer [ 07/Sep/17 ] |
|
I've talked to our NFS developer Neil Brown meanwhile. getcwd() shouldn't need to revalidate the dentry. NFS code is not affected - confirmed by our customer. Also upstream Lustre code doesn't seem to be affected - Neil checked the source. |
| Comment by Sebastian Parschauer [ 07/Sep/17 ] |
|
I think I've found the Lustre 2.9.0 bug for SLES12-SP2. ll_set_fs_pwd() only supports the fs->lock spin lock or the fs->lock write lock but not the RCU locking done by getcwd() in kernel 4.4. |
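(For reference, a sketch of the corresponding mainline writer pattern, paraphrased from fs/fs_struct.c rather than taken from the Lustre source; on a 4.4 kernel ll_set_fs_pwd() would need the equivalent write_seqcount_begin()/write_seqcount_end() pairing so that the lockless reader used by getcwd(), get_fs_root_and_pwd_rcu(), can detect a concurrent pwd update.)

/* Sketch of the mainline writer pattern (paraphrased, not quoted from the
 * Lustre code): the pwd update is bracketed by the fs->seq write seqcount
 * under fs->lock, so the RCU reader in getcwd() retries if it races. */
void set_fs_pwd(struct fs_struct *fs, const struct path *path)
{
	struct path old_pwd;

	path_get(path);
	spin_lock(&fs->lock);
	write_seqcount_begin(&fs->seq);
	old_pwd = fs->pwd;
	fs->pwd = *path;
	write_seqcount_end(&fs->seq);
	spin_unlock(&fs->lock);

	if (old_pwd.dentry)
		path_put(&old_pwd);
}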
| Comment by Gerrit Updater [ 08/Sep/17 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/28907 |
| Comment by Zhenyu Xu [ 08/Sep/17 ] |
|
Would you try this patch? |
| Comment by Andreas Dilger [ 09/Sep/17 ] |
|
sparschauer, I also saw the following email on the linux-fsdevel mailing list from Neil Brown:
The actual patch is at https://patchwork.kernel.org/patch/9945159/

From dfaa166e2afaed051c388dc9f43d1468020b5e22 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.com>
Date: Fri, 8 Sep 2017 16:03:42 +1000
Subject: [PATCH] VFS: close race between getcwd() and d_move()

d_move() will call __d_drop() and then __d_rehash() on the dentry being moved. This creates a small window when the dentry appears to be unhashed. Many tests of d_unhashed() are made under ->d_lock and so are safe from racing with this window, but some aren't. In particular, getcwd() calls d_unlinked() (which calls d_unhashed()) without d_lock protection, so it can race.

This race has been seen in practice with lustre, which uses d_move() as part of name lookup. See: https://jira.hpdd.intel.com/browse/LU-9735
It could race with a regular rename(), and result in ENOENT instead of either the 'before' or 'after' name.

We could fix this race by taking d_lock and rechecking when d_unhashed() reports true. Alternately we can remove the window, which is the approach this patch takes.

When __d_drop and __d_rehash are used to move a dentry, an extra flag is passed which causes d_hash.pprev to not be cleared, and to not be tested.

Signed-off-by: NeilBrown <neilb@suse.com> |
| Comment by Bob Glossman (Inactive) [ 26/Oct/17 ] |
|
A fix for this problem is now shipped in the latest kernel version for sles12sp3. The description of the fix is as follows:

From: NeilBrown <neilb@suse.com>
Subject: getcwd: Close race with d_move called by lustre
Patch-mainline: Not yet, under development
References: bsc#1052593

When lustre invalidates a dentry (e.g. due to a recalled lock) and then revalidates it, ll_splice_alias() will call d_move() to move the old alias to the name of a new one. This will d_drop then d_rehash the old dentry, creating a small window when the dentry is unhashed. If getcwd is run at this time, it might incorrectly think that the dentry really is unhashed, and so return ENOENT.

This is a bug in lustre (it shouldn't call d_move()) but we can work around it in getcwd by taking the d_lock to avoid the race. First we test without the lock as the common case does not involve any race. If we find that the dentry appears to be unhashed, we take the lock and check again.

Signed-off-by: Neil Brown <neilb@suse.com> |
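(For illustration: a sketch of the "test without the lock, then recheck under d_lock" approach described in that patch header - an approximation of the idea, not the actual SUSE patch code.)

/* Sketch only: this would sit in sys_getcwd() after
 * get_fs_root_and_pwd_rcu() has filled in 'pwd'. */
bool unlinked;

error = -ENOENT;
unlinked = d_unlinked(pwd.dentry);
if (unlinked) {
	/* The lockless check can race with d_move(), which drops and
	 * rehashes the dentry while holding d_lock; recheck under d_lock. */
	spin_lock(&pwd.dentry->d_lock);
	unlinked = d_unlinked(pwd.dentry);
	spin_unlock(&pwd.dentry->d_lock);
}
if (!unlinked) {
	/* dentry is hashed after all: generate the path as before */
}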
| Comment by Peter Jones [ 26/Oct/17 ] |
|
So can we safely close this as "not a bug" from a Lustre perspective, in the knowledge that this will be fixed in the current version of SLES? |
| Comment by Bob Glossman (Inactive) [ 26/Oct/17 ] |
|
I would vote for 'yes', but it's only fixed for new versions of sles12sp3, not for anything older. Do other commenters in this ticket have opinions? Has there been any feedback on the other proposed fix, https://review.whamcloud.com/28907? |
| Comment by Sebastian Parschauer [ 26/Oct/17 ] |
|
The patch has been submitted from SLE11-SP3-LTSS to SLE15. New kernels containing the fix will be released soon. |
| Comment by Mahmoud Hanafi [ 26/Oct/17 ] |
|
What about the above comment, "This is a bug in lustre (it shouldn't call d_move())"? |
| Comment by Jay Lan (Inactive) [ 26/Oct/17 ] |
|
We carry #28907 in our nas-2.10.x and it fixed our problem. It would probably be 4-6 months before we can upgrade to sles12sp3. Would #28907 conflict with SUSE's workaround in sles12sp3? Neil Brown, while proposing that workaround, thought it was a bug in Lustre. I think we should have a valid fix. |
| Comment by Gerrit Updater [ 31/Jan/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28907/ |
| Comment by Minh Diep [ 31/Jan/18 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 31/Jan/18 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31106 |
| Comment by Gerrit Updater [ 09/Feb/18 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31106/ |
| Comment by Stanford Research Computing Center [ 08/May/18 ] |
|
Hi! As an additional datapoint, we'd like to report that we've been seeing this exact same behavior with the latest Maintenance Release (2.10.3) and the latest available CentOS 7.4 kernel:

# uname -a
Linux sh-104-39.int 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
# cat /sys/fs/lustre/version
2.10.3
Symptoms are the same stack as initially reported, and happened while running VASP jobs:

forrtl: severe (121): Cannot access current working directory for unit 7, file "Unknown"
Image PC Routine Line Source
vasp_gam 000000000140B496 Unknown Unknown Unknown
vasp_gam 000000000142511E Unknown Unknown Unknown
vasp_gam 000000000091665F Unknown Unknown Unknown
vasp_gam 0000000000CFE655 Unknown Unknown Unknown
vasp_gam 00000000012AF330 Unknown Unknown Unknown
vasp_gam 0000000000408D1E Unknown Unknown Unknown
libc-2.17.so 00007F839E16FC05 __libc_start_main Unknown Unknown
vasp_gam 0000000000408C29 Unknown Unknown Unknown
The "try-again" workaround provided by @Nathan works great and we're recommending our users to use it for now. With the libgetcwdHack.so library LD_PRELOADed, the application generates this kind of log: NF: getcwd: mpi rank -1, host sh-104-39.int: [190701]: failed: size 4096, buf 0x7fffd6c0959b, ret (nil): No such file or directory NF: getcwd: mpi rank -1, host sh-104-39.int: [190701]: succeeded at try 2 of 10: size 4096, buf 0x7fffd6c0959b, ret 0x7fffd6c0959b, path /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3 NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: failed: size 4096, buf 0x7ffe51196e9b, ret (nil): No such file or directory NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: succeeded at try 2 of 10: size 4096, buf 0x7ffe51196e9b, ret 0x7ffe51196e9b, path /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3 NF: getcwd: mpi rank -1, host sh-104-39.int: [190699]: failed: size 4096, buf 0x7ffdb3f1f31b, ret (nil): No such file or directory NF: getcwd: mpi rank -1, host sh-104-39.int: [190699]: succeeded at try 2 of 10: size 4096, buf 0x7ffdb3f1f31b, ret 0x7ffdb3f1f31b, path /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3 NF: getcwd: mpi rank -1, host sh-104-39.int: [190697]: failed: size 4096, buf 0x7ffe9713df9b, ret (nil): No such file or directory NF: getcwd: mpi rank -1, host sh-104-39.int: [190697]: succeeded at try 2 of 10: size 4096, buf 0x7ffe9713df9b, ret 0x7ffe9713df9b, path /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3 NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: failed: size 4096, buf 0x7ffe51196e9b, ret (nil): No such file or directory NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: succeeded at try 2 of 10: size 4096, buf 0x7ffe51196e9b, ret 0x7ffe51196e9b, path /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3 which seem very indicative of the same error. We're looking forward to the fix in 2.10.4. Cheers, |
| Comment by Sebastien Piechurski [ 19/Jul/18 ] |
|
Hi All,
I did not see any reaction to Mahmoud Hanafi's comment from the 26th of October 2017, citing Neil Brown: "This is a bug in lustre (it shouldn't call d_move())".
We have installed 2.10.4 at a customer's site which encountered this problem (with RHEL 7 clients), and even though this considerably decreased the number of occurrences, we still see the "small window when the dentry is unhashed", making the job fail. Is there something that can be done in ll_splice_alias() to close this race? I understand this is closed by SuSE in their latest kernels, but this is not the case for earlier kernels, nor for RHEL kernels. Or should we push Red Hat to apply the same kind of patch SuSE did (does not seem really fair to me)? |
| Comment by Marek Magrys [ 27/Sep/18 ] |
|
It looks like we hit the same issue with a Lustre 2.10.5 client and a CentOS 7.5 kernel. Should the fix come from Lustre or from the kernel? I'm confused by the previous discussion. |
| Comment by James A Simmons [ 27/Sep/18 ] |
|
I wonder if the fixes from LU-9868 would fix this. Note that the posted patch has a bug in it; I have a fix but haven't pushed it yet. |
| Comment by James A Simmons [ 27/Sep/18 ] |
|
Can you give https://review.whamcloud.com/#/c/28486 a try? |
| Comment by James A Simmons [ 01/Oct/18 ] |
|
As reported, the earlier patch for this bug didn't completely solve the problem. The work from LU-9868, which is now linked to this ticket, has been reported as solving it. |
| Comment by James A Simmons [ 04/Oct/18 ] |
|
So it appears that the patch for LU-9868, while fixing this bug, has exposed another potential bug in Lustre. If you run sanity test 233 you see:

[37212.956888] VFS: Lookup of '[0x200000007:0x1:0x0]' in lustre lustre would have caused loop
[37217.817624] Lustre: DEBUG MARKER: sanity test_233a: @@@@@@ FAIL: cannot access /lustre/lustre using its FID '[0x200000007:0x1:0x0]'
[37236.855424] Lustre: DEBUG MARKER: == sanity test 233b: checking that OBF of the FS .lustre succeeds ==================================== 03:34:33 (1538379273)
[37238.362201] VFS: Lookup of '[0x200000002:0x1:0x0]' in lustre lustre would have caused loop
[37243.442480] Lustre: DEBUG MARKER: sanity test_233b: @@@@@@ FAIL: cannot access /lustre/lustre/.lustre using its FID '[0x200000002:0x1:0x0]'

Somehow the parent-child relationship got inverted. Will investigate. |
| Comment by Mahmoud Hanafi [ 24/Jul/19 ] |
|
We can close this case. |
| Comment by Peter Jones [ 24/Jul/19 ] |
|
OK, so given that the initial fix seems to satisfy NASA (the original reporter), we can close the ticket. simmonsja, can you track any remaining work under a new ticket? |
| Comment by James A Simmons [ 24/Jul/19 ] |
|
I already have another ticket for this. |