Details
- Bug
- Resolution: Duplicate
- Minor
- None
- Lustre 2.5.3, Lustre 2.8.0
- None
- 3
- 9223372036854775807
Description
We have an I/O test called Miranda, written in Fortran (attached). It intermittently fails when running on Lustre because getcwd() returns NULL with errno set to ENOENT. This behavior is similar to what was reported in LU-645. We've reproduced the problem on a single node in a Lustre 2.8 client/server environment as well as on Lustre 2.5. The problem typically occurs within an hour or two when Miranda runs continuously in a loop. A typical invocation looks something like:
cd /p/lquake/some_lustre_dir
srun -N 1 -n 36 -pplustre28 /path/to/miranda_io 100
The failing run prints out something like
forrtl: severe (121): Cannot access current working directory for unit 10739, file "Unknown"
Image              PC                Routine            Line        Source
miranda_io         000000000040FEE9  Unknown            Unknown     Unknown
miranda_io         000000000041C992  Unknown            Unknown     Unknown
miranda_io         000000000040A4F1  Unknown            Unknown     Unknown
miranda_io         0000000000408F9E  Unknown            Unknown     Unknown
libc.so.6          00002AAAABBAEB35  Unknown            Unknown     Unknown
miranda_io         0000000000408EA9  Unknown            Unknown     Unknown
srun: error: opal93: task 735: Exited with exit code 121
srun: First task exited 30s ago
srun: tasks 0-734,736-1259: running
srun: task 735: exited abnormally
srun: Terminating job step 1166343.0
slurmd[opal38]: *** STEP 1166343.0 KILLED AT 2017-03-02T08:52:13 WITH SIGNAL 9 ***
slurmd[opal40]: *** STEP 1166343.0 KILLED AT 2017-03-02T08:52:13 WITH SIGNAL 9 ***
slurmd[opal35]: *** STEP 1166343.0 KILLED AT 2017-03-02T08:52:13 WITH SIGNAL 9 ***
slurmd[opal36]: *** STEP 1166343.0 KILLED AT 2017-03-02T08:52:13 WITH SIGNAL 9 ***
slurmd[opal41]: *** STEP 1166343.0 KILLED AT 2017-03-02T08:52:13 WITH SIGNAL 9 ***
...
I haven't collected debug logs yet but the bug shouldn't be hard to reproduce in a test environment.
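To check for the failure independently of the Fortran runtime, a small probe along these lines might help. It is only a sketch, not part of Miranda: the function name `count_getcwd_enoent` and its parameters are made up for illustration. It relies on Python's `os.getcwd()` raising `OSError` with the errno reported by the underlying call. It does not generate any I/O load itself, so it would presumably need to run from the Lustre directory while the test is running on the same node.

```python
import errno
import os
import time


def count_getcwd_enoent(iterations, pause_s=0.0):
    """Call os.getcwd() repeatedly; return how many calls failed with ENOENT.

    Hypothetical helper for illustration only.
    """
    failures = 0
    for _ in range(iterations):
        try:
            os.getcwd()
        except OSError as exc:
            # Count only the failure mode described in this report;
            # any other error is unexpected and is re-raised.
            if exc.errno != errno.ENOENT:
                raise
            failures += 1
        if pause_s:
            time.sleep(pause_s)
    return failures


if __name__ == "__main__":
    n = 100_000  # arbitrary iteration count, chosen for illustration
    failed = count_getcwd_enoent(n)
    print(f"{failed} of {n} getcwd() calls failed with ENOENT")
```

A nonzero count from a directory that still exists would match the symptom described above. A count of zero proves little, since the failure is intermittent.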
Attachments
Issue Links
- is duplicated by: LU-9735 Sles12Sp2 and 2.9 getcwd() sometimes fails (Resolved)