Lustre / LU-645

getcwd fails

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.2.0, Lustre 1.8.8
    • Lustre 1.8.6
    • None
    • lustre 1.8.4
    • 3
    • 23,978
    • 4819

    Description

      We are seeing getcwd() fail sometimes.

      Fortran codes hit this disproportionately because Intel's Fortran runtime calls getcwd() before
      every open() call (and doesn't check that getcwd() succeeded, but that's another story).

      I wrote an LD_PRELOAD library for getcwd() that logs and also retries the getcwd() call. You can
      see from the few examples below that the first try fails with ENOENT and the second try
      succeeds.

      Oct 18 21:18:52 v1195 NF: getcwd: mpi rank 5, host v1195: [26037]: failed: size 4096, buf
      0x7fffffff64d0, ret (nil): No such file or directory
      Oct 18 21:18:52 v1195 NF: getcwd: mpi rank 5, host v1195: [26037]: succeeded at try 2 of 10: size
      4096, buf 0x7fffffff64d0, ret 0x7fffffff64d0, path
      /short/<project>/<username>/BRAN/BRAN2.0R4/OFAM/workdir
      Oct 18 23:59:02 v1258 NF: getcwd: mpi rank 6, host v1258: [21909]: failed: size 4096, buf
      0x7fffffff3c50, ret (nil): No such file or directory
      Oct 18 23:59:02 v1258 NF: getcwd: mpi rank 6, host v1258: [21909]: succeeded at try 2 of 10: size
      4096, buf 0x7fffffff3c50, ret 0x7fffffff3c50, path
      /short/<project>/<username>/BRAN/BRAN2.0R4/OFAM/workdir
      Oct 19 04:54:15 v1167 NF: getcwd: mpi rank 4, host v1167: [24760]: failed: size 4096, buf
      0x7fffffff3c50, ret (nil): No such file or directory
      Oct 19 04:54:15 v1167 NF: getcwd: mpi rank 4, host v1167: [24760]: succeeded at try 2 of 10: size
      4096, buf 0x7fffffff3c50, ret 0x7fffffff3c50, path
      /short/<project>/<username>/BRAN/BRAN2.0R4/OFAM/workdir
      Oct 19 04:54:15 v1193 NF: getcwd: mpi rank 39, host v1193: [7384]: failed: size 4096, buf
      0x7fffffff4690, ret (nil): No such file or directory
      Oct 19 04:54:15 v1193 NF: getcwd: mpi rank 39, host v1193: [7384]: succeeded at try 2 of 10: size
      4096, buf 0x7fffffff4690, ret 0x7fffffff4690, path
      /short/<project>/<username>/BRAN/BRAN2.0R4/OFAM/workdir
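
      The retry-and-log behaviour described above might look roughly like the sketch below. This is
      not the reporter's actual preload code, which is not attached here. The names getcwd_retry,
      getcwd_fn and flaky_getcwd are hypothetical, the MPI rank and host fields from the real log
      lines are omitted, and flaky_getcwd exists only to simulate one ENOENT failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Signature shared by getcwd(3) and any stand-in used for testing. */
typedef char *(*getcwd_fn)(char *buf, size_t size);

/*
 * Call `real` up to max_tries times, logging each failure to stderr.
 * Only ENOENT is retried, since that is the transient error reported
 * in this ticket; any other errno is handed back to the caller at once.
 * If tries_out is non-NULL it receives the number of attempts made.
 */
static char *getcwd_retry(getcwd_fn real, char *buf, size_t size,
                          int max_tries, int *tries_out)
{
    char *ret = NULL;
    int attempt = 0;
    int saved = 0;

    assert(real != NULL);
    if (max_tries < 1)
        max_tries = 1;

    while (attempt < max_tries) {
        attempt++;
        errno = 0;
        ret = real(buf, size);
        if (ret != NULL) {
            if (attempt > 1)
                fprintf(stderr, "getcwd: [%d]: succeeded at try %d of %d: path %s\n",
                        (int)getpid(), attempt, max_tries, ret);
            break;
        }
        saved = errno;  /* fprintf may clobber errno, so keep a copy */
        fprintf(stderr, "getcwd: [%d]: failed: size %zu, buf %p, ret %p: %s\n",
                (int)getpid(), size, (void *)buf, (void *)ret, strerror(saved));
        if (saved != ENOENT)
            break;
    }

    if (ret == NULL)
        errno = saved;  /* report the last failure to the caller */
    if (tries_out != NULL)
        *tries_out = attempt;
    return ret;
}

/* Hypothetical stand-in: fails once with ENOENT, then defers to getcwd(). */
static int flaky_calls = 0;

static char *flaky_getcwd(char *buf, size_t size)
{
    if (flaky_calls++ == 0) {
        errno = ENOENT;
        return NULL;
    }
    return getcwd(buf, size);
}
```

      In a real preload library, the exported getcwd() would presumably look up the libc
      implementation with dlsym(RTLD_NEXT, "getcwd") and pass it in as `real`. The library would be
      built as a shared object and named in LD_PRELOAD when launching the job.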

      We have tried but can't find a simple reproducer for the problem, hence we resorted to an
      LD_PRELOAD so that user codes could detect it for us. We think 16- and 32-node (128- and
      256-process) parallel jobs see it much more often than serial jobs. The directories failing
      getcwd() are not usually recently created (the one failing above is a month old). getcwd()
      succeeds for all processes in most jobs, but sometimes fails for one or perhaps two processes
      in a job.

      The problem seems to have surfaced relatively recently, possibly with Lustre 1.8.3 clients, but
      we aren't sure about that.

      Client kernels are vanilla 2.6.27.54 with Lustre 1.8.3 plus some patches from 1.8.4 (bz 22309
      attach 30455, bz 22610 attach 29931, bz 22786 attach 29866, bz 22889 attach 30111).

      Server kernels are 2.6.18-164.11.1.el5 with Lustre 1.8.2 plus some patches from 1.8.3 (bz 17197
      attach 28672, bz 22177 attach 28798,29030).

      All machines are CentOS 5.5, x86_64, o2ib.

            People

              bobijam Zhenyu Xu
              qm137 James Karellas (Inactive)