
LU-9735: Sles12Sp2 and 2.9 getcwd() sometimes fails


    Description

      This is a duplicate of LU-9208. Opening this case to track it for NASA. We started to see this once we updated the clients to SLES12 SP2 and Lustre 2.9.

      Using the test code provided in LU-9208 (miranda), I was able to reproduce the bug on a single node.

       

      Iteration =    868, Run Time =     0.9614 sec., Transfer Rate =   120.7790 10e+06 Bytes/sec/proc
      Iteration =    869, Run Time =     1.5308 sec., Transfer Rate =    75.8561 10e+06 Bytes/sec/proc
      forrtl: severe (121): Cannot access current working directory for unit 10012, file "Unknown"
      Image              PC                Routine            Line        Source             
      miranda            0000000000409F29  Unknown               Unknown  Unknown
      miranda            00000000004169D2  Unknown               Unknown  Unknown
      miranda            0000000000404045  Unknown               Unknown  Unknown
      miranda            0000000000402FDE  Unknown               Unknown  Unknown
      libc.so.6          00002AAAAB5B96E5  Unknown               Unknown  Unknown
      miranda            0000000000402EE9  Unknown               Unknown  Unknown
      MPT ERROR: MPI_COMM_WORLD rank 12 has terminated without calling MPI_Finalize()
      	aborting job
      
      

      I was able to capture some debug logs, which I have attached to the case. I was unable to reproduce it with "+trace" enabled, but will continue to try.
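
      (Editor's note: the failing call is getcwd(2) itself, which transiently returns NULL with errno set to ENOENT, as the workaround logs further down show. For anyone without the miranda source, which is Fortran, a minimal hypothetical reproducer along the same lines would simply hammer getcwd() from a working directory on Lustre. The race likely needs concurrent metadata activity on the directory, which the miranda run provides, so this alone may not trigger it.)

      /* Hypothetical minimal reproducer (not the miranda test code): call
       * getcwd() in a tight loop from a working directory on a Lustre mount
       * and report any transient failures. */
      #include <errno.h>
      #include <limits.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      int main(void)
      {
          char buf[PATH_MAX];
          unsigned long i, failures = 0;

          for (i = 0; i < 100000000UL; i++) {
              if (getcwd(buf, sizeof(buf)) == NULL) {
                  failures++;
                  fprintf(stderr, "iter %lu: getcwd failed: %s\n",
                          i, strerror(errno));
              }
          }
          fprintf(stderr, "done: %lu transient failures\n", failures);
          return failures != 0;
      }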

      Attachments

        1. getcwdHack.c
          6 kB
        2. miranda.debug.1499341246.gz
          84.13 MB
        3. miranda.dis
          9.19 MB
        4. r481i7n17.dump1.log.gz
          13.86 MB
        5. unoptimize-atomic_open-of-negative-dentry.patch
          2 kB


          Activity


            simmonsja James A Simmons added a comment -

            I already have another ticket for this.
            pjones Peter Jones added a comment -

            OK, so given that the initial fix seems to satisfy NASA (the original reporter), we can close the ticket. simmonsja, can you track any remaining work under a new ticket?


            mhanafi Mahmoud Hanafi added a comment -

            We can close this case.

            simmonsja James A Simmons added a comment -

            So it appears that the patch for LU-9868, while fixing this bug, has exposed another potential bug in Lustre. If you run sanity test 233 you see:

            [37212.956888] VFS: Lookup of '[0x200000007:0x1:0x0]' in lustre lustre would have caused loop

            [37217.817624] Lustre: DEBUG MARKER: sanity test_233a: @@@@@@ FAIL: cannot access /lustre/lustre using its FID '[0x200000007:0x1:0x0]'

            [37236.855424] Lustre: DEBUG MARKER: == sanity test 233b: checking that OBF of the FS .lustre succeeds ==================================== 03:34:33 (1538379273)

            [37238.362201] VFS: Lookup of '[0x200000002:0x1:0x0]' in lustre lustre would have caused loop

            [37243.442480] Lustre: DEBUG MARKER: sanity test_233b: @@@@@@ FAIL: cannot access /lustre/lustre/.lustre using its FID '[0x200000002:0x1:0x0]'

            Somehow the parent-child relationship got inverted. Will investigate.
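
            (Editor's note: sanity test 233 exercises Lustre's open-by-FID interface, where objects are reachable through the special .lustre/fid directory on the client mount. A minimal sketch of the access pattern the test checks, assuming a hypothetical mount point /mnt/lustre; a real FID can be obtained with "lfs path2fid".)

            /* Sketch of the open-by-FID access pattern that sanity test 233
             * exercises. The mount point and FID are hypothetical examples. */
            #include <stdio.h>
            #include <sys/stat.h>

            int main(void)
            {
                /* The filesystem root's FID, reached through the .lustre/fid
                 * virtual directory on the client mount. */
                const char *fid_path = "/mnt/lustre/.lustre/fid/[0x200000007:0x1:0x0]";
                struct stat st;

                if (stat(fid_path, &st) != 0) {
                    perror("stat by FID"); /* test 233 fails when this is unreachable */
                    return 1;
                }
                printf("FID lookup OK: inode %llu\n", (unsigned long long)st.st_ino);
                return 0;
            }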


            simmonsja James A Simmons added a comment -

            As reported, the earlier patch for this bug didn't completely solve the problem. The work from LU-9868, which is now linked to this ticket, has been reported as solving it.
            simmonsja James A Simmons added a comment -

            Can you give https://review.whamcloud.com/#/c/28486 a try?

            simmonsja James A Simmons added a comment -

            I wonder if the fixes from LU-9868 would fix this? Note the posted patch has a bug in it. I have a fix but haven't pushed it yet.
            m.magrys Marek Magrys added a comment -

            It looks like we hit the same issue with a Lustre 2.10.5 client and a CentOS 7.5 kernel. Should the fix come from Lustre or from the kernel? I'm confused by the previous discussion.


            spiechurski Sebastien Piechurski added a comment -

            Hi All,

            I did not see any reaction to Mahmoud Hanafi's comment from 26th October 2017, citing Neil Brown:

            "This is a bug in lustre (it shouldn't call d_move())"

            We have installed 2.10.4 at a customer's site which encountered this problem (with RHEL 7 clients), and even though this considerably decreased the number of occurrences, we still see the "small window when dentry is unhashed", making the job fail.

            Is there something that can be done in ll_splice_alias() to close this race?

            I understand this is fixed by SuSE in their latest kernels, but this is not the case for earlier kernels, nor for RHEL kernels.

            Or should we push Red Hat to apply the same kind of patch SuSE did (which does not seem really fair to me)?


            srcc Stanford Research Computing Center added a comment -

            Hi!

            As an additional datapoint, we'd like to report that we've been seeing this exact same behavior with the latest Maintenance Release (2.10.3) and the latest available CentOS 7.4 kernel:

            # uname -a
            Linux sh-104-39.int 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
            
            # cat /sys/fs/lustre/version
            2.10.3 

             

            The symptoms show the same stack as initially reported, and occurred while running VASP jobs:

            "forrtl: severe (121): Cannot access current working directory for unit 7, file "Unknown"
            Image              PC                Routine            Line        Source
            vasp_gam           000000000140B496  Unknown               Unknown  Unknown
            vasp_gam           000000000142511E  Unknown               Unknown  Unknown
            vasp_gam           000000000091665F  Unknown               Unknown  Unknown
            vasp_gam           0000000000CFE655  Unknown               Unknown  Unknown
            vasp_gam           00000000012AF330  Unknown               Unknown  Unknown
            vasp_gam           0000000000408D1E  Unknown               Unknown  Unknown
            libc-2.17.so       00007F839E16FC05  __libc_start_main     Unknown  Unknown
            vasp_gam           0000000000408C29  Unknown               Unknown  Unknown

             

            The "try-again" workaround provided by @Nathan works great and we're recommending our users to use it for now. With the libgetcwdHack.so library LD_PRELOADed, the application generates this kind of log:

            NF: getcwd: mpi rank -1, host sh-104-39.int: [190701]: failed: size 4096,
            buf 0x7fffd6c0959b, ret (nil): No such file or directory
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190701]: succeeded at try 2
            of 10: size 4096, buf 0x7fffd6c0959b, ret 0x7fffd6c0959b, path
            /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: failed: size 4096,
            buf 0x7ffe51196e9b, ret (nil): No such file or directory
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: succeeded at try 2
            of 10: size 4096, buf 0x7ffe51196e9b, ret 0x7ffe51196e9b, path
            /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190699]: failed: size 4096,
            buf 0x7ffdb3f1f31b, ret (nil): No such file or directory
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190699]: succeeded at try 2
            of 10: size 4096, buf 0x7ffdb3f1f31b, ret 0x7ffdb3f1f31b, path
            /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190697]: failed: size 4096,
            buf 0x7ffe9713df9b, ret (nil): No such file or directory
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190697]: succeeded at try 2
            of 10: size 4096, buf 0x7ffe9713df9b, ret 0x7ffe9713df9b, path
            /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: failed: size 4096,
            buf 0x7ffe51196e9b, ret (nil): No such file or directory
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: succeeded at try 2
            of 10: size 4096, buf 0x7ffe51196e9b, ret 0x7ffe51196e9b, path
            /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
            

            which seems very indicative of the same error.
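
            (Editor's note: for anyone curious how the workaround operates, below is a minimal sketch of such a getcwd() retry interposer, in the spirit of the attached getcwdHack.c; the actual attachment may differ. Build with "gcc -shared -fPIC -o libgetcwdHack.so getcwdhack.c -ldl" and set LD_PRELOAD to point at it.)

            /* Sketch of a getcwd() retry interposer (hypothetical; the real
             * getcwdHack.c attachment may differ).
             * Use:  LD_PRELOAD=./libgetcwdHack.so ./app */
            #define _GNU_SOURCE
            #include <dlfcn.h>
            #include <errno.h>
            #include <unistd.h>

            char *getcwd(char *buf, size_t size)
            {
                static char *(*real_getcwd)(char *, size_t);
                char *ret = NULL;
                int attempt;

                if (!real_getcwd)
                    real_getcwd = (char *(*)(char *, size_t))dlsym(RTLD_NEXT, "getcwd");

                for (attempt = 1; attempt <= 10; attempt++) {
                    ret = real_getcwd(buf, size);
                    if (ret != NULL || errno != ENOENT)
                        break;        /* only retry the transient ENOENT */
                    usleep(1000);     /* brief pause before retrying */
                }
                return ret;
            }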

            We're looking forward to the fix in 2.10.4.

            Cheers,
            -- 
            Kilian


            People

              Assignee:
              simmonsja James A Simmons
              Reporter:
              mhanafi Mahmoud Hanafi
              Votes:
              1
              Watchers:
              24
