[LU-9208] getcwd() sometimes fails Created: 13/Mar/17 Updated: 05/Apr/18 Resolved: 05/Apr/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3, Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Ned Bass | Assignee: | Zhenyu Xu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We have an IO test called Miranda which is written in Fortran (attached). It intermittently fails when running in Lustre because getcwd() returns NULL with errno set to ENOENT. This behavior is similar to what was reported in

cd /p/lquake/some_lustre_dir
srun -N 1 -n 36 -pplustre28 /path/to/miranda_io 100

The failing run prints out something like

forrtl: severe (121): Cannot access current working directory for unit 10739, file "Unknown"
Image PC Routine Line Source
miranda_io 000000000040FEE9 Unknown Unknown Unknown
miranda_io 000000000041C992 Unknown Unknown Unknown
miranda_io 000000000040A4F1 Unknown Unknown Unknown
miranda_io 0000000000408F9E Unknown Unknown Unknown
libc.so.6 00002AAAABBAEB35 Unknown Unknown Unknown
miranda_io 0000000000408EA9 Unknown Unknown Unknown
srun: error: opal93: task 735: Exited with exit code 121
srun: First task exited 30s ago
srun: tasks 0-734,736-1259: running
srun: task 735: exited abnormally
srun: Terminating job step 1166343.0
slurmd[opal38]: *** STEP 1166343.0 KILLED AT 2017-03-02T08:52:13 WITH SIGNAL 9 ***
slurmd[opal40]: *** STEP 1166343.0 KILLED AT 2017-03-02T08:52:13 WITH SIGNAL 9 ***
slurmd[opal35]: *** STEP 1166343.0 KILLED AT 2017-03-02T08:52:13 WITH SIGNAL 9 ***
slurmd[opal36]: *** STEP 1166343.0 KILLED AT 2017-03-02T08:52:13 WITH SIGNAL 9 ***
slurmd[opal41]: *** STEP 1166343.0 KILLED AT 2017-03-02T08:52:13 WITH SIGNAL 9 ***
...

I haven't collected debug logs yet but the bug shouldn't be hard to reproduce in a test environment. |
| Comments |
| Comment by Zhenyu Xu [ 15/Mar/17 ] |
|
Would you please grab some debug logs and dmesg output when you hit it again? |
| Comment by Ned Bass [ 15/Mar/17 ] |
|
I attached a -1 debug log from the client when it hit this bug. |
| Comment by Ned Bass [ 15/Mar/17 ] |
|
No Lustre messages appear in dmesg at the time. |
| Comment by Zhenyu Xu [ 21/Mar/17 ] |
|
Which system call fails in this case, and with what error? In the log I see a file ioctl command (0x5401, TCGETS) failing on miranda.log. |
| Comment by Ned Bass [ 21/Mar/17 ] |
|
The getcwd() system call returns ENOENT. |
| Comment by Ned Bass [ 21/Mar/17 ] |
|
I was appending miranda_io output to miranda.log. I suspect the ioctl failure is unrelated to the getcwd() problem. |
| Comment by Zhenyu Xu [ 24/Mar/17 ] |
|
I didn't see miranda.log in the attachment in this ticket. |
| Comment by Ned Bass [ 24/Mar/17 ] |
|
There's nothing useful in miranda.log other than the error message that I included in the description. |
| Comment by Ned Bass [ 24/Mar/17 ] |
|
Our testing team just reported that they may have reproduced this bug on an NFS filesystem. So this may not be a Lustre-specific bug. I'll close the ticket for now, and will re-open it if we get any further evidence that it's a Lustre problem. Thanks for your time on this issue. |
| Comment by Ned Bass [ 12/Jun/17 ] |
|
Reopening to allow attaching a file. |
| Comment by Ned Bass [ 12/Jun/17 ] |
|
I attached a test case Red Hat provided for a similar issue seen on NFS. They are interested to know if we can reproduce this on Lustre and NFS. LLNL doesn't have cycles to spend on this right now so I am uploading here in case others want to take a stab at it. I'd be happy to feed any results back to Red Hat. |
| Comment by Mahmoud Hanafi [ 30/Jun/17 ] |
|
We have recently upgraded to SLES 12 SP2 and Lustre 2.9 clients and have started to see this issue. |
| Comment by Andreas Dilger [ 31/Aug/17 ] |
|
Ned, are you also running Intel MPI when this problem is hit? According to the comment in |
| Comment by Ned Bass [ 31/Aug/17 ] |
No. I can reproduce it with a single-node, non-MPI program. I wrote a simple C program to simulate what Miranda is doing and just ran a bunch of copies of it in the background. Eventually getcwd() fails with errno=ENOENT. I'll dig up the program and attach it here. |
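For readers who want to try this before the attachment is available, here is a minimal sketch of the kind of stress loop described above. It is not the program attached to this ticket: the function name `count_getcwd_enoent`, the `GETCWD_STRESS_MAIN` build macro, and the iteration count are all illustrative assumptions, and the loop only calls getcwd() repeatedly rather than imitating Miranda's actual I/O.

```c
/* Hypothetical getcwd() stress loop. NOT the reproducer attached to this
 * ticket; names and iteration counts are illustrative assumptions. */
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#ifndef PATH_MAX
#define PATH_MAX 4096
#endif

/* Call getcwd() `iterations` times and return how many calls failed
 * with errno == ENOENT. Other errors are reported but not counted. */
long count_getcwd_enoent(long iterations)
{
    char buf[PATH_MAX];
    long failures = 0;

    assert(iterations >= 0);
    for (long i = 0; i < iterations; i++) {
        errno = 0;
        if (getcwd(buf, sizeof(buf)) != NULL)
            continue;
        if (errno == ENOENT)
            failures++;
        fprintf(stderr, "iteration %ld: getcwd failed: %s\n",
                i, strerror(errno));
    }
    return failures;
}

#ifdef GETCWD_STRESS_MAIN
/* Optional driver, built only when GETCWD_STRESS_MAIN is defined. */
int main(void)
{
    long failures = count_getcwd_enoent(1000000L);

    printf("%ld ENOENT failures\n", failures);
    return failures ? 1 : 0;
}
#endif
```

To approximate the setup described in the comment, one could build it with `-DGETCWD_STRESS_MAIN`, cd into a directory on the filesystem under test, and start several copies in the background. Whether a bare getcwd() loop is enough to trigger the failure is an assumption; the real reproducer may also need concurrent file activity in the same directory.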
| Comment by Andreas Dilger [ 05/Apr/18 ] |
|
It appears that the |