Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3, Lustre 2.8.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      We have an I/O test called Miranda, written in Fortran (attached). It fails intermittently when run on Lustre because getcwd() returns NULL with errno set to ENOENT. This behavior is similar to what was reported in LU-645. We've reproduced the problem on a single node in a Lustre 2.8 client/server environment as well as on Lustre 2.5. The failure typically occurs within an hour or two when Miranda runs continuously in a loop. A typical invocation looks something like:

      cd /p/lquake/some_lustre_dir
      srun -N 1 -n 36 -pplustre28 /path/to/miranda_io 100
      

      The failing run prints out something like

      forrtl: severe (121): Cannot access current working directory for unit 10739, file "Unknown"
      Image              PC                Routine            Line        Source             
      miranda_io         000000000040FEE9  Unknown               Unknown  Unknown
      miranda_io         000000000041C992  Unknown               Unknown  Unknown
      miranda_io         000000000040A4F1  Unknown               Unknown  Unknown
      miranda_io         0000000000408F9E  Unknown               Unknown  Unknown
      libc.so.6          00002AAAABBAEB35  Unknown               Unknown  Unknown
      miranda_io         0000000000408EA9  Unknown               Unknown  Unknown
      srun: error: opal93: task 735: Exited with exit code 121
      srun: First task exited 30s ago
      srun: tasks 0-734,736-1259: running
      srun: task 735: exited abnormally
      srun: Terminating job step 1166343.0
      slurmd[opal38]: *** STEP 1166343.0 KILLED AT 2017-03-02T08:52:13 WITH SIGNAL 9 ***
      slurmd[opal40]: *** STEP 1166343.0 KILLED AT 2017-03-02T08:52:13 WITH SIGNAL 9 ***
      slurmd[opal35]: *** STEP 1166343.0 KILLED AT 2017-03-02T08:52:13 WITH SIGNAL 9 ***
      slurmd[opal36]: *** STEP 1166343.0 KILLED AT 2017-03-02T08:52:13 WITH SIGNAL 9 ***
      slurmd[opal41]: *** STEP 1166343.0 KILLED AT 2017-03-02T08:52:13 WITH SIGNAL 9 ***
      ...
      

      I haven't collected debug logs yet but the bug shouldn't be hard to reproduce in a test environment.
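
The failure described above can also be watched for outside the Fortran runtime. The following is a minimal sketch, not the attached Miranda test: a hypothetical helper, `probe_getcwd`, that calls Python's `os.getcwd()` in a loop and records the errno of any failure. It assumes that an ENOENT from the underlying getcwd() call surfaces in Python as an `OSError` with `errno == ENOENT`. Whether this loop triggers the intermittent failure on Lustre has not been verified.

```python
import errno
import os
import time


def probe_getcwd(iterations, delay=0.0):
    """Call os.getcwd() repeatedly and collect failures.

    Returns a list of (iteration, errno) tuples, one per failed call.
    The function name and parameters are illustrative only.
    """
    failures = []
    for i in range(iterations):
        try:
            os.getcwd()
        except OSError as exc:
            # A failing getcwd() is reported as OSError; keep its errno.
            failures.append((i, exc.errno))
        if delay:
            time.sleep(delay)
    return failures


def describe(failures):
    """Render failures as readable lines, e.g. 'iteration 3: ENOENT'."""
    return [f"iteration {i}: {errno.errorcode.get(err, err)}" for i, err in failures]
```

One way to use it would be to cd into a directory on the filesystem under test, call `probe_getcwd` with a large iteration count while the I/O load is running, and print `describe(...)` on the result. An empty list means no call failed during that run.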

          Activity

            [LU-9208] getcwd() sometimes fails

            nedbass Ned Bass (Inactive) added a comment - I attached a test case Red Hat provided for a similar issue seen on NFS. They are interested to know if we can reproduce this on Lustre and NFS. LLNL doesn't have cycles to spend on this right now so I am uploading here in case others want to take a stab at it. I'd be happy to feed any results back to Red Hat.

            nedbass Ned Bass (Inactive) added a comment - Reopening to allow attaching a file.

            nedbass Ned Bass (Inactive) added a comment - Our testing team just reported that they may have reproduced this bug on an NFS filesystem. So this may not be a Lustre-specific bug. I'll close the ticket for now, and will re-open it if we get any further evidence that it's a Lustre problem. Thanks for your time on this issue.

            nedbass Ned Bass (Inactive) added a comment - There's nothing useful in miranda.log other than the error message that I included in the description.

            bobijam Zhenyu Xu added a comment - I didn't see miranda.log in the attachment in this ticket.

            nedbass Ned Bass (Inactive) added a comment - I was appending miranda_io output to miranda.log. I suspect the ioctl failure is unrelated to the getcwd() problem.
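
One possible reason the ioctl failure is unrelated: on Linux, TCGETS (0x5401) is the request behind tcgetattr() and isatty()-style checks, and it fails with ENOTTY on any regular file. Why miranda_io's runtime issued it is an assumption here, since the debug log is not reproduced in this ticket text. The sketch below uses a hypothetical helper name to show the expected errno for a terminal-attribute query on a regular file.

```python
import errno
import tempfile
import termios


def tcgets_errno_on_regular_file():
    """Return the errno from tcgetattr() on a regular (non-terminal) file.

    On Linux, tcgetattr() is implemented with the TCGETS ioctl.
    Returns None if the call unexpectedly succeeds.
    """
    with tempfile.TemporaryFile() as f:
        try:
            termios.tcgetattr(f.fileno())
        except termios.error as exc:
            # termios.error carries (errno, message) in its args.
            return exc.args[0]
    return None


def is_expected_tty_probe_failure(err):
    """True if err is the ordinary 'not a terminal' result (ENOTTY)."""
    return err == errno.ENOTTY
```

If the TCGETS failure in the debug log carries ENOTTY, it would be consistent with output being redirected to miranda.log rather than with a filesystem fault.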

            nedbass Ned Bass (Inactive) added a comment - The getcwd() system call returns ENOENT.

            bobijam Zhenyu Xu added a comment - Which system call returns which error in this case? In the log I see a file ioctl cmd (0x5401 TCGETS) failing on miranda.log.

            nedbass Ned Bass (Inactive) added a comment - No Lustre messages appear in dmesg at the time.

            nedbass Ned Bass (Inactive) added a comment - I attached a -1 debug log from the client when it hit this bug.

            bobijam Zhenyu Xu added a comment - Would you please grab some debug logs and dmesg output when you hit it again?

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: nedbass Ned Bass (Inactive)
              Votes: 0
              Watchers: 7
