Lustre / LU-9735

Sles12Sp2 and 2.9 getcwd() sometimes fails

Details


    Description

      This is a duplicate of LU-9208. Opening this case for tracking for NASA. We started to see this once we updated the clients to SLES12 SP2 and Lustre 2.9.

      Using the test code provided in LU-9208 (miranda), I was able to reproduce the bug on a single node.

       

      Iteration =    868, Run Time =     0.9614 sec., Transfer Rate =   120.7790 10e+06 Bytes/sec/proc
      Iteration =    869, Run Time =     1.5308 sec., Transfer Rate =    75.8561 10e+06 Bytes/sec/proc
      forrtl: severe (121): Cannot access current working directory for unit 10012, file "Unknown"
      Image              PC                Routine            Line        Source             
      miranda            0000000000409F29  Unknown               Unknown  Unknown
      miranda            00000000004169D2  Unknown               Unknown  Unknown
      miranda            0000000000404045  Unknown               Unknown  Unknown
      miranda            0000000000402FDE  Unknown               Unknown  Unknown
      libc.so.6          00002AAAAB5B96E5  Unknown               Unknown  Unknown
      miranda            0000000000402EE9  Unknown               Unknown  Unknown
      MPT ERROR: MPI_COMM_WORLD rank 12 has terminated without calling MPI_Finalize()
      	aborting job
      
      

       I was able to capture some debug logs, which I have attached to the case. I was unable to reproduce it with "+trace" enabled, but will continue to try.

      Attachments

        1. getcwdHack.c
          6 kB
        2. miranda.debug.1499341246.gz
          84.13 MB
        3. miranda.dis
          9.19 MB
        4. r481i7n17.dump1.log.gz
          13.86 MB
        5. unoptimize-atomic_open-of-negative-dentry.patch
          2 kB

        Issue Links

          Activity

            [LU-9735] Sles12Sp2 and 2.9 getcwd() sometimes fails

            Hi!

            As an additional datapoint, we'd like to report that we've been seeing this exact same behavior with the latest Maintenance Release (2.10.3) and the latest available CentOS 7.4 kernel.

            # uname -a
            Linux sh-104-39.int 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
            
            # cat /sys/fs/lustre/version
            2.10.3 

             

            The symptoms show the same stack trace as initially reported, and occurred while running VASP jobs:

            forrtl: severe (121): Cannot access current working directory for unit 7, file "Unknown"
            Image              PC                Routine            Line        Source
            vasp_gam           000000000140B496  Unknown               Unknown  Unknown
            vasp_gam           000000000142511E  Unknown               Unknown  Unknown
            vasp_gam           000000000091665F  Unknown               Unknown  Unknown
            vasp_gam           0000000000CFE655  Unknown               Unknown  Unknown
            vasp_gam           00000000012AF330  Unknown               Unknown  Unknown
            vasp_gam           0000000000408D1E  Unknown               Unknown  Unknown
            libc-2.17.so       00007F839E16FC05  __libc_start_main     Unknown  Unknown
            vasp_gam           0000000000408C29  Unknown               Unknown  Unknown

             

            The "try-again" workaround provided by @Nathan works great and we're recommending it to our users for now. With the libgetcwdHack.so library LD_PRELOADed, the application generates this kind of log:

            NF: getcwd: mpi rank -1, host sh-104-39.int: [190701]: failed: size 4096,
            buf 0x7fffd6c0959b, ret (nil): No such file or directory
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190701]: succeeded at try 2
            of 10: size 4096, buf 0x7fffd6c0959b, ret 0x7fffd6c0959b, path
            /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: failed: size 4096,
            buf 0x7ffe51196e9b, ret (nil): No such file or directory
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: succeeded at try 2
            of 10: size 4096, buf 0x7ffe51196e9b, ret 0x7ffe51196e9b, path
            /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190699]: failed: size 4096,
            buf 0x7ffdb3f1f31b, ret (nil): No such file or directory
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190699]: succeeded at try 2
            of 10: size 4096, buf 0x7ffdb3f1f31b, ret 0x7ffdb3f1f31b, path
            /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190697]: failed: size 4096,
            buf 0x7ffe9713df9b, ret (nil): No such file or directory
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190697]: succeeded at try 2
            of 10: size 4096, buf 0x7ffe9713df9b, ret 0x7ffe9713df9b, path
            /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: failed: size 4096,
            buf 0x7ffe51196e9b, ret (nil): No such file or directory
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: succeeded at try 2
            of 10: size 4096, buf 0x7ffe51196e9b, ret 0x7ffe51196e9b, path
            /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
            

            which seems very indicative of the same error.

            We're looking forward to the fix in 2.10.4.

            Cheers,
            -- 
            Kilian

            srcc Stanford Research Computing Center added a comment

            John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31106/
            Subject: LU-9735 compat: heed the fs_struct::seq
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: 030b15004d3acf6b98c198263fcca232129568cc

            gerrit Gerrit Updater added a comment

            Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31106
            Subject: LU-9735 compat: heed the fs_struct::seq
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 6c21f76ae7ef6cb8004dd60db87583c101b50aa6

            gerrit Gerrit Updater added a comment
            mdiep Minh Diep added a comment -

            Landed for 2.11


            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28907/
            Subject: LU-9735 compat: heed the fs_struct::seq
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fff1163fdb41190b59adb8d90919e0adf37f68fb

            gerrit Gerrit Updater added a comment

            We carry #28907 in our nas-2.10.x branch and it fixed our problem. It will probably be 4-6 months before we can upgrade to SLES12 SP3.

            Would #28907 conflict with SUSE's workaround in sles12sp3?

            Neil Brown, while proposing that workaround, thought it was a bug in Lustre. I think we should have a proper fix.

            jaylan Jay Lan (Inactive) added a comment

            What about the above comment 

            "This is a bug in lustre (it shouldn't call d_move())"

            mhanafi Mahmoud Hanafi added a comment

            The patch has been submitted from SLE11-SP3-LTSS to SLE15. New kernels containing the fix will be released soon.

            sparschauer Sebastian Parschauer (Inactive) added a comment
            bogl Bob Glossman (Inactive) added a comment - edited

            I would vote for 'yes', but it's only fixed in new versions of SLES12 SP3; nothing older is fixed. Do other commenters in this ticket have opinions?

            Has there been any feedback on the other proposed fix https://review.whamcloud.com/28907 ?

            pjones Peter Jones added a comment -

            So can we close this as "not a bug" from a Lustre perspective, safe in the knowledge that it will be fixed in current versions of SLES?


            A fix for this problem is now shipped in the latest kernel version for sles12sp3. The description of the fix is as follows:

            From: NeilBrown <neilb@suse.com>
            Subject: getcwd: Close race with d_move called by lustre
            Patch-mainline: Not yet, under development
            References: bsc#1052593
            
            When lustre invalidates a dentry (e.g. due to a recalled lock) and then
            revalidates it, ll_splice_alias() will call d_move() to move the old alias
            to the name of a new one.
            This will d_drop then d_rehash the old dentry, creating a small window
            when the dentry is unhashed.
            If getcwd is run at this time, it might incorrectly think that
            the dentry really is unhashed, and so return ENOENT.
            
            This is a bug in lustre (it shouldn't call d_move()) but we can work
            around it in getcwd by taking the d_lock to avoid the race.
            First we test without the lock as the common case does not involve
            any race.  If we find that the dentry appears to be unhashed, we take
            the lock and check again.
            
            Signed-off-by: Neil Brown <neilb@suse.com>
            
            bogl Bob Glossman (Inactive) added a comment

            People

              simmonsja James A Simmons
              mhanafi Mahmoud Hanafi
              Votes: 1
              Watchers: 24

              Dates

                Created:
                Updated:
                Resolved: