[LU-9735] Sles12Sp2 and 2.9 getcwd() sometimes fails

    Description

      This is a duplicate of LU-9208, opened for tracking on behalf of NASA. We started seeing this failure once we updated the clients to SLES12 SP2 and Lustre 2.9.

      Using the test code provided in LU-9208 (miranda), I was able to reproduce the bug on a single node.

       

      Iteration =    868, Run Time =     0.9614 sec., Transfer Rate =   120.7790 10e+06 Bytes/sec/proc
      Iteration =    869, Run Time =     1.5308 sec., Transfer Rate =    75.8561 10e+06 Bytes/sec/proc
      forrtl: severe (121): Cannot access current working directory for unit 10012, file "Unknown"
      Image              PC                Routine            Line        Source             
      miranda            0000000000409F29  Unknown               Unknown  Unknown
      miranda            00000000004169D2  Unknown               Unknown  Unknown
      miranda            0000000000404045  Unknown               Unknown  Unknown
      miranda            0000000000402FDE  Unknown               Unknown  Unknown
      libc.so.6          00002AAAAB5B96E5  Unknown               Unknown  Unknown
      miranda            0000000000402EE9  Unknown               Unknown  Unknown
      MPT ERROR: MPI_COMM_WORLD rank 12 has terminated without calling MPI_Finalize()
      	aborting job
      
      

      I captured some debug logs and attached them to the case. I have not been able to reproduce the failure with "+trace" enabled, but will continue to try.
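
      The trace above comes from the Fortran runtime failing to obtain the current working directory. As an illustration only (this is not the attached getcwdHack.c and not a fix for the underlying bug), here is a minimal sketch of an application-side retry. It assumes the failure is transient and reported as ENOENT; the helper name `getcwd_with_retry`, the retry count, and the delay are hypothetical.

```python
import errno
import os
import time


def getcwd_with_retry(retries=5, delay=0.01, getcwd=os.getcwd):
    """Return the current working directory, retrying on ENOENT.

    Assumption (not confirmed by the ticket): the intermittent failure is
    transient and surfaces as ENOENT. `getcwd` is injectable so the retry
    logic can be exercised without a Lustre mount.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return getcwd()
        except OSError as exc:
            if exc.errno != errno.ENOENT:
                raise  # unrelated failure: do not mask it
            last_error = exc
            if attempt < retries - 1:
                time.sleep(delay)  # brief pause before the next attempt
    raise last_error  # every attempt failed with ENOENT


if __name__ == "__main__":
    print(getcwd_with_retry())
```

      A retry like this only hides the symptom from the application; the client-side fix is what the ticket tracks.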

      Attachments

        1. r481i7n17.dump1.log.gz
          13.86 MB
        2. miranda.debug.1499341246.gz
          84.13 MB
        3. miranda.dis
          9.19 MB
        4. unoptimize-atomic_open-of-negative-dentry.patch
          2 kB
        5. getcwdHack.c
          6 kB

          Activity

            simmonsja James A Simmons added a comment -

            I already have another ticket for this.
            pjones Peter Jones made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            OK, so given that the initial fix seems to satisfy NASA (the original reporter), we can close the ticket. simmonsja, can you track any remaining work under a new ticket?

            mhanafi Mahmoud Hanafi added a comment -

            We can close this case.
            simmonsja James A Simmons made changes -
            Labels New: ORNL
            simmonsja James A Simmons made changes -
            Status Original: In Progress [ 3 ] New: Open [ 1 ]
            simmonsja James A Simmons made changes -
            Status Original: Reopened [ 4 ] New: In Progress [ 3 ]

            simmonsja James A Simmons added a comment -

            It appears that the patch for LU-9868, while fixing this bug, has exposed another potential bug in Lustre. If you run sanity test 233 you see:

            [37212.956888] VFS: Lookup of '[0x200000007:0x1:0x0]' in lustre lustre would have caused loop

            [37217.817624] Lustre: DEBUG MARKER: sanity test_233a: @@@@@@ FAIL: cannot access /lustre/lustre using its FID '[0x200000007:0x1:0x0]'

            [37236.855424] Lustre: DEBUG MARKER: == sanity test 233b: checking that OBF of the FS .lustre succeeds ==================================== 03:34:33 (1538379273)

            [37238.362201] VFS: Lookup of '[0x200000002:0x1:0x0]' in lustre lustre would have caused loop

            [37243.442480] Lustre: DEBUG MARKER: sanity test_233b: @@@@@@ FAIL: cannot access /lustre/lustre/.lustre using its FID '[0x200000002:0x1:0x0]'

            Somehow the parent-child relationship got inverted. Will investigate.
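
            The FIDs in these messages have the form [sequence:object ID:version]. Below is a small sketch of splitting one into its fields and building the kind of open-by-FID path that test 233 exercises. It assumes the conventional `.lustre/fid/` directory under the mount point; the helper names `parse_fid` and `fid_path` are hypothetical, and the mount point is taken from the log lines above.

```python
import re
from typing import NamedTuple


class Fid(NamedTuple):
    seq: int  # sequence number
    oid: int  # object ID within the sequence
    ver: int  # version


_FID_RE = re.compile(r"\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]")


def parse_fid(text):
    """Extract the first '[0xSEQ:0xOID:0xVER]' FID found in `text`."""
    match = _FID_RE.search(text)
    if match is None:
        raise ValueError(f"no FID found in {text!r}")
    return Fid(*(int(group, 16) for group in match.groups()))


def fid_path(mount, fid):
    """Build an open-by-FID path (assumes the usual .lustre/fid layout)."""
    return f"{mount}/.lustre/fid/[{fid.seq:#x}:{fid.oid:#x}:{fid.ver:#x}]"


if __name__ == "__main__":
    line = ("VFS: Lookup of '[0x200000007:0x1:0x0]' in lustre lustre "
            "would have caused loop")
    print(fid_path("/lustre/lustre", parse_fid(line)))
```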
            simmonsja James A Simmons made changes -
            Assignee Original: Zhenyu Xu [ bobijam ] New: James A Simmons [ simmonsja ]
            Resolution Original: Fixed [ 1 ]
            Status Original: Resolved [ 5 ] New: Reopened [ 4 ]

            simmonsja James A Simmons added a comment -

            As reported, the earlier patch for this bug didn't completely solve the problem. The work from LU-9868, which is now linked to this ticket, has been reported as solving it.

            People

              simmonsja James A Simmons
              mhanafi Mahmoud Hanafi
