[LU-3177] mpi process crash on accessing file/directory - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.1.5
Affects Version/s: Lustre 2.1.4
Labels:
None
Environment:
CentOS 6.3, Intel Fortran compiler 13.0.1, Intel MPI, WRF

Severity:
3
Rank (Obsolete):
7744

Description

We're running WRF (3.4.1) using 300 to 500 processes. Some runs get stuck. We found that in those cases, one of the processes crashes at the very beginning. WRF's rsl.error.* file contains an error message like this:

forrtl: severe (121): Cannot access current working directory for unit 27, file "Unknown"

About 5% of the runs encounter this problem. After searching the Lustre bug reports, we suspect it is related to this one:

https://projectlava.xyratex.com/show_bug.cgi?id=23978

We then followed the suggested workaround of using a getcwd wrapper (without changing the WRF source code), and the problem did not recur in the subsequent 300 runs. In the log messages generated by the wrapper, we found the following:

rsl.out.0095: getcwdfixwrap: host/pid: node23/87536 time: 1366042583 problem null, retryctr: 0 errno: 2 errstr: "No such file or directory"
rsl.out.0095: getcwdfixwrap: host/pid: node23/87536 problem buf non null, value:

It seems that the workaround gets past the getcwd failure.
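For readers unfamiliar with this kind of workaround, the following is a minimal sketch of a retrying getcwd call. It is not the wrapper used in the runs above. The names `getcwd_retry`, `getcwd_fn`, `flaky_getcwd` and `fake_failures_left`, the retry limit, and the delay are all hypothetical, and retrying only on ENOENT (errno 2, the value shown in the log) is an assumption. The stub `flaky_getcwd` exists only to simulate the transient failure.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical tuning values, not taken from the original wrapper. */
#define GETCWD_MAX_RETRIES 10
#define GETCWD_RETRY_DELAY_NS 1000000L /* 1 ms between attempts */

/* Signature of getcwd(3), so the underlying call can be swapped out. */
typedef char *(*getcwd_fn)(char *buf, size_t size);

/*
 * Call `real` and retry while it fails with ENOENT, up to
 * GETCWD_MAX_RETRIES extra attempts. Any other error is returned at once.
 * On return, *retries_used (if non-NULL) holds the number of retries made
 * and errno holds the value left by the last call to `real`.
 */
char *getcwd_retry(getcwd_fn real, char *buf, size_t size, int *retries_used)
{
    const struct timespec delay = {0, GETCWD_RETRY_DELAY_NS};
    char *res = NULL;
    int saved_errno = 0;
    int retryctr = 0;

    assert(real != NULL);
    for (;;) {
        errno = 0;
        res = real(buf, size);
        saved_errno = errno;
        if (res != NULL || saved_errno != ENOENT ||
            retryctr >= GETCWD_MAX_RETRIES)
            break;
        fprintf(stderr, "getcwd_retry: retryctr: %d errno: %d errstr: \"%s\"\n",
                retryctr, saved_errno, strerror(saved_errno));
        nanosleep(&delay, NULL); /* brief pause before the next attempt */
        retryctr++;
    }
    if (retries_used != NULL)
        *retries_used = retryctr;
    errno = saved_errno;
    return res;
}

/* Test stub: fails with ENOENT `fake_failures_left` times, then defers
 * to the real getcwd(3). Only here to simulate the transient failure. */
int fake_failures_left = 0;

char *flaky_getcwd(char *buf, size_t size)
{
    if (fake_failures_left > 0) {
        fake_failures_left--;
        errno = ENOENT;
        return NULL;
    }
    return getcwd(buf, size);
}
```

Calling `getcwd_retry(getcwd, buf, sizeof buf, NULL)` would wrap the real call. To apply such a retry without touching the application source, one common approach (assumed here, not confirmed by the report) is to build a shared library that defines `getcwd`, looks up the original with `dlsym(RTLD_NEXT, "getcwd")`, and is loaded through `LD_PRELOAD`.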

There seems to be a fix in the 1.8.x versions. We wonder whether that fix will be applied to version 2.1.4.

People

Assignee: Zhenyu Xu

Reporter: Joe K.W. Chong (Inactive)

Votes: 0

Watchers: 5

Dates

Created: 16/Apr/13 6:10 AM

Updated: 11/Mar/14 1:26 AM

Resolved: 11/Mar/14 1:26 AM