[LU-3177] mpi process crash on accessing file/directory Created: 16/Apr/13  Updated: 11/Mar/14  Resolved: 11/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.4
Fix Version/s: Lustre 2.1.5

Type: Bug Priority: Critical
Reporter: Joe K.W. Chong (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 6.3, Intel Fortran compiler 13.0.1, Intel MPI, WRF


Severity: 3
Rank (Obsolete): 7744

 Description   

We're running WRF (3.4.1) using 300 to 500 processes. Some runs get stuck. In those cases, we found that one of the processes crashes at the very beginning. WRF's rsl.error.* file contains the following error message:

forrtl: severe (121): Cannot access current working directory for unit 27, file "Unknown"

About 5% of the runs encounter this problem. After searching the Lustre bug reports, we suspect it is related to this one:

https://projectlava.xyratex.com/show_bug.cgi?id=23978

We then applied the suggested workaround, a getcwd wrapper (with no changes to the WRF source code), and the problem did not recur in the subsequent 300 runs. The log messages generated by the wrapper include the following:

rsl.out.0095: getcwdfixwrap: host/pid: node23/87536 time: 1366042583 problem null, retryctr: 0 errno: 2 errstr: "No such file or directory"
rsl.out.0095: getcwdfixwrap: host/pid: node23/87536 problem buf non null, value:

It seems that the wrapper works around the getcwd failure.
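
The ticket does not include the wrapper's source, so the following is only a minimal Python sketch of the retry idea behind such a workaround: call getcwd again when it fails transiently with errno 2 (ENOENT), as seen in the log lines above. The function name `getcwd_with_retry` and its parameters (`retries`, `delay`, `log`) are hypothetical and are not taken from the actual wrapper, which presumably interposes on the C library call rather than running in Python.

```python
import errno
import os
import time


def getcwd_with_retry(getcwd=os.getcwd, retries=5, delay=0.01, log=None):
    """Call getcwd(), retrying when it fails with ENOENT.

    Hypothetical helper: it illustrates the retry pattern only and is
    not the wrapper referenced in this ticket.
    """
    for attempt in range(retries + 1):
        try:
            return getcwd()
        except OSError as exc:
            # Only ENOENT (errno 2) matches the failure in the log above;
            # any other error, or exhausting the retries, is re-raised.
            if exc.errno != errno.ENOENT or attempt == retries:
                raise
            if log is not None:
                log(f"getcwd failed, retryctr: {attempt} errno: {exc.errno}")
            # Short pause before asking again (assumed value).
            time.sleep(delay)
```

Calling `getcwd_with_retry()` with no arguments behaves like `os.getcwd()` when no error occurs. The `getcwd` parameter exists only so that a failing stand-in can be injected for testing.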

There seems to be a fix in the 1.8.x versions. Will that fix also be applied to 2.1.4?



 Comments   
Comment by Peter Jones [ 18/Apr/13 ]

Bobijam

Could you please confirm whether this issue will be resolved by LU-645 and thus covered by this landing - http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=94509cda52b49a0153fae4b7a1f0772077aa9809?

Thanks

Peter

Comment by Zhenyu Xu [ 18/Apr/13 ]

Patch http://review.whamcloud.com/3206 can be applied on top of 2.1.4 and should fix the issue.
Even better is to apply http://review.whamcloud.com/2400, which also handles the issue while staying aligned with the 2.1.5 code base.

Comment by Peter Jones [ 18/Apr/13 ]

Bobijam

Am I correct in understanding that simply upgrading to 2.1.5 itself would also address this issue?

Thanks

Peter

Comment by Zhenyu Xu [ 19/Apr/13 ]

Peter,

Yes, it would. 2.1.5 contains http://review.whamcloud.com/2400

Comment by John Fuchs-Chesney (Inactive) [ 08/Mar/14 ]

Joe,
Is there any further action required on this ticket?
If not, can I go ahead and mark it as resolved?
Thanks,
~ jfc.

Comment by Joe K.W. Chong (Inactive) [ 10/Mar/14 ]

Dear John,

Yes, please mark it as resolved.

regards,

Joe

Comment by John Fuchs-Chesney (Inactive) [ 11/Mar/14 ]

Customer says OK to resolve, and the patch has landed in 2.1.5.

Generated at Sat Feb 10 01:31:38 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.