|
We are seeing getcwd() fail intermittently with ENOENT.
Fortran codes hit this most often because Intel's Fortran runtime calls getcwd() before
every open() call (and also doesn't check that getcwd() succeeded, but that's another story).
I wrote an LD_PRELOAD library for getcwd() that logs each call and retries on failure. In the
examples below, the first try fails with ENOENT and the second try
succeeds.
Oct 18 21:18:52 v1195 NF: getcwd: mpi rank 5, host v1195: [26037]: failed: size 4096, buf 0x7fffffff64d0, ret (nil): No such file or directory
Oct 18 21:18:52 v1195 NF: getcwd: mpi rank 5, host v1195: [26037]: succeeded at try 2 of 10: size 4096, buf 0x7fffffff64d0, ret 0x7fffffff64d0, path /short/<project>/<username>/BRAN/BRAN2.0R4/OFAM/workdir
Oct 18 23:59:02 v1258 NF: getcwd: mpi rank 6, host v1258: [21909]: failed: size 4096, buf 0x7fffffff3c50, ret (nil): No such file or directory
Oct 18 23:59:02 v1258 NF: getcwd: mpi rank 6, host v1258: [21909]: succeeded at try 2 of 10: size 4096, buf 0x7fffffff3c50, ret 0x7fffffff3c50, path /short/<project>/<username>/BRAN/BRAN2.0R4/OFAM/workdir
Oct 19 04:54:15 v1167 NF: getcwd: mpi rank 4, host v1167: [24760]: failed: size 4096, buf 0x7fffffff3c50, ret (nil): No such file or directory
Oct 19 04:54:15 v1167 NF: getcwd: mpi rank 4, host v1167: [24760]: succeeded at try 2 of 10: size 4096, buf 0x7fffffff3c50, ret 0x7fffffff3c50, path /short/<project>/<username>/BRAN/BRAN2.0R4/OFAM/workdir
Oct 19 04:54:15 v1193 NF: getcwd: mpi rank 39, host v1193: [7384]: failed: size 4096, buf 0x7fffffff4690, ret (nil): No such file or directory
Oct 19 04:54:15 v1193 NF: getcwd: mpi rank 39, host v1193: [7384]: succeeded at try 2 of 10: size 4096, buf 0x7fffffff4690, ret 0x7fffffff4690, path /short/<project>/<username>/BRAN/BRAN2.0R4/OFAM/workdir
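For readers who want to try something similar, here is a minimal C sketch of the retry-and-log behaviour visible in the log lines above. It is not the LD_PRELOAD library that produced them. The names getcwd_retry, getcwd_fn and flaky_getcwd are hypothetical, and the log format only loosely follows the one shown. The assumption is that only ENOENT is worth retrying, since that is the only error reported in this ticket.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Signature shared by getcwd() and any stand-in used for testing. */
typedef char *(*getcwd_fn)(char *buf, size_t size);

/*
 * Call fn up to max_tries times, retrying only when it fails with ENOENT.
 * Each failed attempt is logged to stderr. Returns buf on success, or NULL
 * with errno set by the last attempt. If tries_out is non-NULL it receives
 * the number of attempts made.
 */
char *getcwd_retry(getcwd_fn fn, char *buf, size_t size,
                   int max_tries, int *tries_out)
{
    char *ret = NULL;
    int saved_errno = 0;
    int attempt = 0;

    assert(fn != NULL && max_tries > 0);
    while (attempt < max_tries) {
        attempt++;
        errno = 0;
        ret = fn(buf, size);
        saved_errno = errno;            /* fprintf() below may clobber errno */
        if (ret != NULL || saved_errno != ENOENT)
            break;
        fprintf(stderr, "getcwd: failed at try %d of %d: size %zu, buf %p: %s\n",
                attempt, max_tries, size, (void *)buf, strerror(saved_errno));
    }
    if (tries_out != NULL)
        *tries_out = attempt;
    if (ret == NULL)
        errno = saved_errno;
    return ret;
}

/* Hypothetical stand-in: fails once with ENOENT, then calls the real getcwd(). */
static int flaky_calls = 0;

char *flaky_getcwd(char *buf, size_t size)
{
    if (flaky_calls++ == 0) {
        errno = ENOENT;
        return NULL;
    }
    return getcwd(buf, size);
}
```

Calling getcwd_retry(flaky_getcwd, buf, sizeof buf, 10, &tries) mimics the "failed, then succeeded at try 2 of 10" pattern in the logs. In an actual preload library the wrapper would itself be named getcwd and would presumably obtain the real function through dlsym(RTLD_NEXT, "getcwd"); that part is left out here.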
We have tried but can't find a simple reproducer for the problem, hence we resorted to an
LD_PRELOAD so that user codes could detect it for us. We think 16- and 32-node (128- and 256-process)
parallel jobs see it much more often than serial jobs. The directories failing getcwd() are not
usually recently created (the one failing above is a month old). getcwd() succeeds
for all processes in most jobs, but sometimes fails for one or perhaps two processes in a
job.
The problem seems to have surfaced relatively recently, possibly with Lustre 1.8.3 clients, but we
aren't sure about that.
Client kernels are vanilla 2.6.27.54 with Lustre 1.8.3 plus some patches from 1.8.4 (bz 22309
attach 30455, bz 22610 attach 29931, bz 22786 attach 29866, bz 22889 attach 30111).
Server kernels are 2.6.18-164.11.1.el5 with Lustre 1.8.2 plus some patches from 1.8.3 (bz 17197
attach 28672, bz 22177 attach 28798, 29030).
All machines are CentOS 5.5 x86_64, o2ib.
|
|
Bobijam
Could you please have a look at this one?
Thanks
Peter
|
|
Would you mind providing dmesg output and debug logs covering the time span when the getcwd() error happens?
|
|
Hi Zhenyu,
There are debug logs from three events (as attachments) in Oracle bugzilla 23978.
I will ask my customers to try to recreate the problem and get dmesg output for you.
Thanks,
Jay
|
|
Here is a reproducer (a Fortran program):
      program getcwd_bug
      ! reproducer for the getcwd bug
      ! run with lots (preferably > 400) processes to see the bug
      use mpi
      character (len=20) :: filename
      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, myrank, ierr)
      do j = 1,100
        do i = 1,200
          write(filename,'(a,i3.3)') 'out.',i
          call mpi_barrier(mpi_comm_world, ierr)
          open(120+i,file=trim(filename),form='unformatted')
          if (myrank .eq. i-1) write(120+i) i+j
        enddo
        call mpi_barrier(mpi_comm_world, ierr)
        ! i is 201 after the loop above, so close each unit explicitly
        do i = 1,200
          close(120+i)
        enddo
      enddo
      if (myrank .eq. 0) print *, 'All files opened successfully'
      call dummy
      call mpi_finalize(ierr)
      end
      subroutine dummy
      end
Just compile it with:
ifort -o getcwd_bug getcwd_bug.f -lmpi
and run with:
mpiexec -n 400 ./getcwd_bug
You will probably need to change the mpiexec command to use the
hostfile containing the nodes that you will want to run on.
You need to have many processes trying to open the same file
to see the problem. With 16 CPUs, you'll probably need to
run many times or for a longer time before it fails with
the getcwd bug.
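
For sites without an MPI Fortran toolchain at hand, the same access pattern (many processes calling getcwd() and then open() on the same file names) can be approximated in C with fork(). This is only a sketch under that assumption: stress_getcwd and child_loop are hypothetical names, it runs on a single node, and it has not been verified to trigger the bug, which so far has only been reported from multi-node jobs.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child body: getcwd() before every open(), on shared file names. */
static void child_loop(int iters, int nfiles)
{
    char cwd[4096], name[32];
    int i, j, fd;

    for (j = 0; j < iters; j++) {
        for (i = 1; i <= nfiles; i++) {
            if (getcwd(cwd, sizeof cwd) == NULL) {
                fprintf(stderr, "[%d] getcwd failed: %s\n",
                        (int)getpid(), strerror(errno));
                _exit(1);               /* report failure via exit status */
            }
            snprintf(name, sizeof name, "out.%03d", i);
            fd = open(name, O_RDWR | O_CREAT, 0644);
            if (fd >= 0)
                close(fd);
        }
    }
    _exit(0);
}

/*
 * Fork nprocs children running child_loop() in the current directory.
 * Returns the number of children that saw getcwd() fail (or died
 * abnormally), or -1 if not all children could be started.
 */
int stress_getcwd(int nprocs, int iters, int nfiles)
{
    int started = 0, failed = 0, p;

    assert(nprocs > 0 && iters > 0 && nfiles > 0);
    for (p = 0; p < nprocs; p++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                      /* still reap the children we started */
        if (pid == 0)
            child_loop(iters, nfiles);  /* never returns */
        started++;
    }
    for (p = 0; p < started; p++) {
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed++;
    }
    return started == nprocs ? failed : -1;
}
```

On a healthy file system stress_getcwd(16, 100, 200) should return 0. Any non-zero result, run from a Lustre working directory, would be worth comparing against the log lines in the description.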
Thanks,
Jay
|
|
master patch tracking at http://review.whamcloud.com/1434
b1_8 patch tracking at http://review.whamcloud.com/1435
|
|
Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)
Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :
- lustre/llite/statahead.c
- lustre/llite/dcache.c
- lustre/llite/namei.c
|
|
Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)
Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :
- lustre/llite/statahead.c
- lustre/llite/dcache.c
- lustre/llite/namei.c
|
|
Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)
Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :
- lustre/llite/namei.c
- lustre/llite/statahead.c
- lustre/llite/dcache.c
|
|
Integrated in lustre-b1_8 » x86_64,client,el5,ofa #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)
Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :
- lustre/llite/namei.c
- lustre/llite/statahead.c
- lustre/llite/dcache.c
|
|
Integrated in lustre-b1_8 » x86_64,server,el5,ofa #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)
Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :
- lustre/llite/statahead.c
- lustre/llite/namei.c
- lustre/llite/dcache.c
|
|
Integrated in lustre-b1_8 » i686,server,el5,inkernel #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)
Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :
- lustre/llite/statahead.c
- lustre/llite/dcache.c
- lustre/llite/namei.c
|
|
Integrated in lustre-b1_8 » i686,client,el5,ofa #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)
Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :
- lustre/llite/statahead.c
- lustre/llite/namei.c
- lustre/llite/dcache.c
|
|
Integrated in lustre-b1_8 » i686,client,el6,inkernel #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)
Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :
- lustre/llite/namei.c
- lustre/llite/dcache.c
- lustre/llite/statahead.c
|
|
Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)
Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :
- lustre/llite/dcache.c
- lustre/llite/namei.c
- lustre/llite/statahead.c
|
|
Integrated in lustre-b1_8 » i686,client,el5,inkernel #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)
Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :
- lustre/llite/dcache.c
- lustre/llite/statahead.c
- lustre/llite/namei.c
|
|
Integrated in lustre-b1_8 » i686,server,el5,ofa #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)
Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :
- lustre/llite/dcache.c
- lustre/llite/namei.c
- lustre/llite/statahead.c
|
|
Integrated in lustre-master » x86_64,client,el6,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)
Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :
- lustre/llite/llite_internal.h
|
|
Integrated in lustre-master » i686,client,el6,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)
Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :
- lustre/llite/llite_internal.h
|
|
Integrated in lustre-master » x86_64,client,sles11,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)
Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :
- lustre/llite/llite_internal.h
|
|
Landed for 2.2
|
|
Integrated in lustre-master » x86_64,client,el5,ofa #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)
Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :
- lustre/llite/llite_internal.h
|
|
Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)
Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :
- lustre/llite/llite_internal.h
|
|
Integrated in lustre-master » x86_64,server,el6,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)
Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :
- lustre/llite/llite_internal.h
|
|
Integrated in lustre-master » x86_64,server,el5,ofa #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)
Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :
- lustre/llite/llite_internal.h
|
|
Integrated in lustre-master » i686,server,el6,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)
Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :
- lustre/llite/llite_internal.h
|
|
Integrated in lustre-master » i686,client,el5,ofa #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)
Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :
- lustre/llite/llite_internal.h
|
|
Integrated in lustre-master » i686,client,el5,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)
Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :
- lustre/llite/llite_internal.h
|
|
Integrated in lustre-master » x86_64,server,el5,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)
Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :
- lustre/llite/llite_internal.h
|
|
Integrated in lustre-master » x86_64,client,el5,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)
Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :
- lustre/llite/llite_internal.h
|
|
Integrated in lustre-master » i686,server,el5,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)
Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :
- lustre/llite/llite_internal.h
|
|
Integrated in lustre-master » i686,server,el5,ofa #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)
Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :
- lustre/llite/llite_internal.h
|
|
Could you please port the fix to the b2_1 branch as well? It should be a simple cherry-pick from b1_8 without conflicts. Thanks!
|
|
b2_1 patch tracking at http://review.whamcloud.com/3206
|
|
I have signed off on my review of the b2_1 patch. Do you need another reviewer before you can land the patch?
|