[LU-645] getcwd fails
Created: 29/Aug/11  Updated: 20/Feb/13  Resolved: 12/Dec/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: Lustre 2.2.0, Lustre 1.8.8

Type: Bug
Priority: Minor
Reporter: James Karellas
Assignee: Zhenyu Xu
Resolution: Fixed
Votes: 0
Labels: None
Environment: lustre 1.8.4

Severity: 3
Bugzilla ID: 23978
Rank (Obsolete): 4819

 Description   

We are seeing getcwd() fail intermittently.

Fortran codes see this more often than others because Intel's Fortran runtime calls getcwd() before
every open() call (and doesn't check that getcwd() succeeded, but that's another story).

I wrote an LD_PRELOAD wrapper for getcwd() that logs and also retries the call. You can see
from the few examples below that the first try returns ENOENT and the second try
works.

Oct 18 21:18:52 v1195 NF: getcwd: mpi rank 5, host v1195: [26037]: failed: size 4096, buf
0x7fffffff64d0, ret (nil): No such file or directory
Oct 18 21:18:52 v1195 NF: getcwd: mpi rank 5, host v1195: [26037]: succeeded at try 2 of 10: size
4096, buf 0x7fffffff64d0, ret 0x7fffffff64d0, path
/short/<project>/<username>/BRAN/BRAN2.0R4/OFAM/workdir
Oct 18 23:59:02 v1258 NF: getcwd: mpi rank 6, host v1258: [21909]: failed: size 4096, buf
0x7fffffff3c50, ret (nil): No such file or directory
Oct 18 23:59:02 v1258 NF: getcwd: mpi rank 6, host v1258: [21909]: succeeded at try 2 of 10: size
4096, buf 0x7fffffff3c50, ret 0x7fffffff3c50, path
/short/<project>/<username>/BRAN/BRAN2.0R4/OFAM/workdir
Oct 19 04:54:15 v1167 NF: getcwd: mpi rank 4, host v1167: [24760]: failed: size 4096, buf
0x7fffffff3c50, ret (nil): No such file or directory
Oct 19 04:54:15 v1167 NF: getcwd: mpi rank 4, host v1167: [24760]: succeeded at try 2 of 10: size
4096, buf 0x7fffffff3c50, ret 0x7fffffff3c50, path
/short/<project>/<username>/BRAN/BRAN2.0R4/OFAM/workdir
Oct 19 04:54:15 v1193 NF: getcwd: mpi rank 39, host v1193: [7384]: failed: size 4096, buf
0x7fffffff4690, ret (nil): No such file or directory
Oct 19 04:54:15 v1193 NF: getcwd: mpi rank 39, host v1193: [7384]: succeeded at try 2 of 10: size
4096, buf 0x7fffffff4690, ret 0x7fffffff4690, path
/short/<project>/<username>/BRAN/BRAN2.0R4/OFAM/workdir
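
The preload wrapper itself is not attached here. A minimal sketch of the retry-and-log logic described above might look like the following C. The names getcwd_fn, getcwd_with_retry and fail_once_getcwd are hypothetical, logging goes to stderr rather than wherever the real wrapper logs, and retrying only on ENOENT is an assumption based on the log excerpts.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Signature shared by getcwd(3) and any stand-in used for testing. */
typedef char *(*getcwd_fn)(char *buf, size_t size);

/*
 * Call `real` up to max_tries times, retrying only when it fails with
 * ENOENT. Each failure, and any success after the first try, is logged
 * to stderr. Returns buf on success, or NULL with errno set to the last
 * error on failure.
 */
char *getcwd_with_retry(getcwd_fn real, char *buf, size_t size, int max_tries)
{
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        char *ret = real(buf, size);
        if (ret != NULL) {
            if (attempt > 1)
                fprintf(stderr, "getcwd: succeeded at try %d of %d: path %s\n",
                        attempt, max_tries, ret);
            return ret;
        }
        int saved = errno;      /* fprintf/strerror may clobber errno */
        fprintf(stderr, "getcwd: failed: size %zu, buf %p, ret (nil): %s\n",
                size, (void *)buf, strerror(saved));
        errno = saved;
        if (saved != ENOENT)
            break;              /* assumption: only ENOENT is transient */
    }
    return NULL;
}

/* Hypothetical stand-in: fails once with ENOENT, then succeeds. */
static int stub_calls = 0;

char *fail_once_getcwd(char *buf, size_t size)
{
    if (stub_calls++ == 0) {
        errno = ENOENT;
        return NULL;
    }
    snprintf(buf, size, "%s", "/stub/workdir");
    return buf;
}
```

In an actual preload library, the exported function would be named getcwd, and the underlying libc function would typically be looked up with dlsym(RTLD_NEXT, "getcwd") and passed in as `real`. The stand-in is only there to exercise the retry path without a Lustre mount.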

We have tried but can't find a simple reproducer for the problem, hence we resorted to an
LD_PRELOAD so that user codes could detect it for us. We think 16- and 32-node (128- and 256-process)
parallel jobs see it much more often than serial jobs. The directories failing getcwd() are
usually not recently created (the one failing above is a month old). getcwd() succeeds
for all processes in most jobs, but sometimes fails for one or perhaps two processes in a
job.

The problem seems to have surfaced relatively recently, possibly with Lustre 1.8.3 clients, but we
aren't sure about that.

Client kernels are vanilla 2.6.27.54 with Lustre 1.8.3 plus some patches from 1.8.4 (bz 22309
attach 30455, bz 22610 attach 29931, bz 22786 attach 29866, bz 22889 attach 30111).

Server kernels are 2.6.18-164.11.1.el5 with Lustre 1.8.2 plus some patches from 1.8.3 (bz 17197
attach 28672, bz 22177 attach 28798, 29030).

All machines are CentOS 5.5 x86_64 o2ib.



 Comments   
Comment by Peter Jones [ 29/Aug/11 ]

Bobijam

Could you please have a look at this one?

Thanks

Peter

Comment by Zhenyu Xu [ 30/Aug/11 ]

Would you mind providing dmesg output and debug logs that cover the time span when the getcwd() error happens?

Comment by Jay Lan (Inactive) [ 23/Sep/11 ]

Hi Zhenyu,

There are debug logs from three events (as attachments) in Oracle Bugzilla 23978.

I will ask my customers to try to recreate the problem and get dmesg for you.

Thanks,
Jay

Comment by Jay Lan (Inactive) [ 23/Sep/11 ]

Here is a reproducer (a Fortran program):

      program getcwd_bug
! reproducer for the getcwd bug
! run with lots (preferably > 400) processes to see the bug
      use mpi
      character (len=20) :: filename

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, myrank, ierr)

      do j = 1,100
         do i = 1,200
            write(filename,'(a,i3.3)') 'out.',i
            call mpi_barrier(mpi_comm_world, ierr)
            open(120+i,file=trim(filename),form='unformatted')
            if (myrank .eq. i-1) write(120+i) i+j
         enddo
         call mpi_barrier(mpi_comm_world, ierr)
         close(120+i)
      enddo

      if (myrank .eq. 0) print *, 'All files opened successfully'

      call dummy

      call mpi_finalize(ierr)
      end

      subroutine dummy
      end

Just compile it with:

ifort -o getcwd_bug getcwd_bug.f -lmpi

and run with:

mpiexec -n 400 ./getcwd_bug

You will probably need to change the mpiexec command to use the
hostfile containing the nodes that you will want to run on.

You need to have many processes trying to open the same file
to see the problem. With 16 CPUs, you'll probably need to
run many times or for a longer time before it fails with
the getcwd bug.

Thanks,
Jay

Comment by Zhenyu Xu [ 28/Sep/11 ]

master patch tracking at http://review.whamcloud.com/1434
b1_8 patch tracking at http://review.whamcloud.com/1435

Comment by Build Master (Inactive) [ 16/Nov/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)

Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :

  • lustre/llite/statahead.c
  • lustre/llite/dcache.c
  • lustre/llite/namei.c

Comment by Build Master (Inactive) [ 16/Nov/11 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)

Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :

  • lustre/llite/statahead.c
  • lustre/llite/dcache.c
  • lustre/llite/namei.c

Comment by Build Master (Inactive) [ 16/Nov/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)

Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :

  • lustre/llite/namei.c
  • lustre/llite/statahead.c
  • lustre/llite/dcache.c

Comment by Build Master (Inactive) [ 16/Nov/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)

Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :

  • lustre/llite/namei.c
  • lustre/llite/statahead.c
  • lustre/llite/dcache.c

Comment by Build Master (Inactive) [ 16/Nov/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)

Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :

  • lustre/llite/statahead.c
  • lustre/llite/namei.c
  • lustre/llite/dcache.c

Comment by Build Master (Inactive) [ 16/Nov/11 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)

Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :

  • lustre/llite/statahead.c
  • lustre/llite/dcache.c
  • lustre/llite/namei.c

Comment by Build Master (Inactive) [ 16/Nov/11 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)

Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :

  • lustre/llite/statahead.c
  • lustre/llite/namei.c
  • lustre/llite/dcache.c

Comment by Build Master (Inactive) [ 16/Nov/11 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)

Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :

  • lustre/llite/namei.c
  • lustre/llite/dcache.c
  • lustre/llite/statahead.c

Comment by Build Master (Inactive) [ 16/Nov/11 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)

Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :

  • lustre/llite/dcache.c
  • lustre/llite/namei.c
  • lustre/llite/statahead.c

Comment by Build Master (Inactive) [ 16/Nov/11 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)

Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :

  • lustre/llite/dcache.c
  • lustre/llite/statahead.c
  • lustre/llite/namei.c

Comment by Build Master (Inactive) [ 16/Nov/11 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #156
LU-645 Avoid unnecessary dentry rehashing (Revision fa111106572c8206fb8b6477617bea4fc483a37d)

Result = SUCCESS
Johann Lombardi : fa111106572c8206fb8b6477617bea4fc483a37d
Files :

  • lustre/llite/dcache.c
  • lustre/llite/namei.c
  • lustre/llite/statahead.c

Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)

Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :

  • lustre/llite/llite_internal.h

Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)

Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :

  • lustre/llite/llite_internal.h

Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-master » x86_64,client,sles11,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)

Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :

  • lustre/llite/llite_internal.h

Comment by Peter Jones [ 12/Dec/11 ]

Landed for 2.2

Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)

Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :

  • lustre/llite/llite_internal.h

Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)

Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :

  • lustre/llite/llite_internal.h

Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)

Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :

  • lustre/llite/llite_internal.h

Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)

Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :

  • lustre/llite/llite_internal.h

Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)

Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :

  • lustre/llite/llite_internal.h

Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-master » i686,client,el5,ofa #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)

Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :

  • lustre/llite/llite_internal.h

Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)

Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :

  • lustre/llite/llite_internal.h

Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)

Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :

  • lustre/llite/llite_internal.h

Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)

Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :

  • lustre/llite/llite_internal.h

Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)

Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :

  • lustre/llite/llite_internal.h

Comment by Build Master (Inactive) [ 12/Dec/11 ]

Integrated in lustre-master » i686,server,el5,ofa #376
LU-645 llite: Avoid unnecessary dentry rehashing (Revision b2d0facce07e734e4aa14653d0ef637dc553cb4a)

Result = SUCCESS
Oleg Drokin : b2d0facce07e734e4aa14653d0ef637dc553cb4a
Files :

  • lustre/llite/llite_internal.h

Comment by Jay Lan (Inactive) [ 27/Jun/12 ]

Could you please port the fix to the b2_1 branch as well? It should be a simple cherry-pick from b1_8 without conflicts. Thanks!

Comment by Zhenyu Xu [ 28/Jun/12 ]

b2_1 patch tracking at http://review.whamcloud.com/3206

Comment by Jay Lan (Inactive) [ 09/Jul/12 ]

I have signed off my review for b2_1 patch. Do you need another reviewer before you can land the patch?
