Lustre / LU-9735

Sles12Sp2 and 2.9 getcwd() sometimes fails

Details


    Description

      This is a duplicate of LU-9208. Opening this case for tracking for NASA. We started to see this once we updated the clients to SLES12 SP2 and Lustre 2.9.

      Using the test code provided in LU-9208 (miranda), I was able to reproduce the bug on a single node.

       

      Iteration =    868, Run Time =     0.9614 sec., Transfer Rate =   120.7790 10e+06 Bytes/sec/proc
      Iteration =    869, Run Time =     1.5308 sec., Transfer Rate =    75.8561 10e+06 Bytes/sec/proc
      forrtl: severe (121): Cannot access current working directory for unit 10012, file "Unknown"
      Image              PC                Routine            Line        Source             
      miranda            0000000000409F29  Unknown               Unknown  Unknown
      miranda            00000000004169D2  Unknown               Unknown  Unknown
      miranda            0000000000404045  Unknown               Unknown  Unknown
      miranda            0000000000402FDE  Unknown               Unknown  Unknown
      libc.so.6          00002AAAAB5B96E5  Unknown               Unknown  Unknown
      miranda            0000000000402EE9  Unknown               Unknown  Unknown
      MPT ERROR: MPI_COMM_WORLD rank 12 has terminated without calling MPI_Finalize()
      	aborting job
      
      

       I was able to capture some debug logs, which I have attached to the case. I was unable to reproduce it with "+trace" enabled, but will continue to try.

      Attachments

        1. getcwdHack.c
          6 kB
        2. miranda.debug.1499341246.gz
          84.13 MB
        3. miranda.dis
          9.19 MB
        4. r481i7n17.dump1.log.gz
          13.86 MB
        5. unoptimize-atomic_open-of-negative-dentry.patch
          2 kB

        Issue Links

          Activity

            [LU-9735] Sles12Sp2 and 2.9 getcwd() sometimes fails

            Hi!

            As an additional datapoint, we'd like to report that we've been seeing this exact same behavior with the latest Maintenance Release (2.10.3) and the latest available CentOS 7.4 kernel.

            # uname -a
            Linux sh-104-39.int 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
            
            # cat /sys/fs/lustre/version
            2.10.3 

             

            The symptoms show the same stack trace as initially reported, and occurred while running VASP jobs:

            forrtl: severe (121): Cannot access current working directory for unit 7, file "Unknown"
            Image              PC                Routine            Line        Source
            vasp_gam           000000000140B496  Unknown               Unknown  Unknown
            vasp_gam           000000000142511E  Unknown               Unknown  Unknown
            vasp_gam           000000000091665F  Unknown               Unknown  Unknown
            vasp_gam           0000000000CFE655  Unknown               Unknown  Unknown
            vasp_gam           00000000012AF330  Unknown               Unknown  Unknown
            vasp_gam           0000000000408D1E  Unknown               Unknown  Unknown
            libc-2.17.so       00007F839E16FC05  __libc_start_main     Unknown  Unknown
            vasp_gam           0000000000408C29  Unknown               Unknown  Unknown

             

            The "try-again" workaround provided by @Nathan works great and we're recommending it to our users for now. With the libgetcwdHack.so library LD_PRELOADed, the application generates this kind of log:

            NF: getcwd: mpi rank -1, host sh-104-39.int: [190701]: failed: size 4096,
            buf 0x7fffd6c0959b, ret (nil): No such file or directory
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190701]: succeeded at try 2
            of 10: size 4096, buf 0x7fffd6c0959b, ret 0x7fffd6c0959b, path
            /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: failed: size 4096,
            buf 0x7ffe51196e9b, ret (nil): No such file or directory
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: succeeded at try 2
            of 10: size 4096, buf 0x7ffe51196e9b, ret 0x7ffe51196e9b, path
            /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190699]: failed: size 4096,
            buf 0x7ffdb3f1f31b, ret (nil): No such file or directory
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190699]: succeeded at try 2
            of 10: size 4096, buf 0x7ffdb3f1f31b, ret 0x7ffdb3f1f31b, path
            /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190697]: failed: size 4096,
            buf 0x7ffe9713df9b, ret (nil): No such file or directory
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190697]: succeeded at try 2
            of 10: size 4096, buf 0x7ffe9713df9b, ret 0x7ffe9713df9b, path
            /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: failed: size 4096,
            buf 0x7ffe51196e9b, ret (nil): No such file or directory
            NF: getcwd: mpi rank -1, host sh-104-39.int: [190695]: succeeded at try 2
            of 10: size 4096, buf 0x7ffe51196e9b, ret 0x7ffe51196e9b, path
            /scratch/users/freitas/chemical_reactions/vasp_simulations/C_fixed_V/07_restart_3
            

            which seems very indicative of the same error.

            We're looking forward to the fix in 2.10.4.

            Cheers,
            -- 
            Kilian

            srcc Stanford Research Computing Center added a comment

            John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31106/
            Subject: LU-9735 compat: heed the fs_struct::seq
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: 030b15004d3acf6b98c198263fcca232129568cc

            gerrit Gerrit Updater added a comment

            Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31106
            Subject: LU-9735 compat: heed the fs_struct::seq
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 6c21f76ae7ef6cb8004dd60db87583c101b50aa6

            gerrit Gerrit Updater added a comment
            mdiep Minh Diep added a comment -

            Landed for 2.11


            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28907/
            Subject: LU-9735 compat: heed the fs_struct::seq
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fff1163fdb41190b59adb8d90919e0adf37f68fb

            gerrit Gerrit Updater added a comment

            We carry #28907 in our nas-2.10.x branch and it fixed our problem. It will probably be 4-6 months before we can upgrade to SLES12 SP3.

            Would #28907 conflict with SUSE's workaround in sles12sp3?

            Neil Brown, while proposing that workaround, thought it was a bug in Lustre. I think we should have a proper fix.

            jaylan Jay Lan (Inactive) added a comment

            What about the above comment 

            "This is a bug in lustre (it shouldn't call d_move())"

            mhanafi Mahmoud Hanafi added a comment

            The patch has been submitted from SLE11-SP3-LTSS to SLE15. New kernels containing the fix will be released soon.

            sparschauer Sebastian Parschauer (Inactive) added a comment
            bogl Bob Glossman (Inactive) added a comment - edited

            I would vote for 'yes', but it's only fixed in new versions of SLES12 SP3; nothing older is fixed. Do other commenters in this ticket have opinions?

            Has there been any feedback on the other proposed fix https://review.whamcloud.com/28907 ?

            pjones Peter Jones added a comment -

            So can we close this as "not a bug" from a Lustre perspective, safe in the knowledge that it will be fixed in current versions of SLES?


            A fix for this problem is now shipped in the latest kernel version for sles12sp3. The description of the fix is as follows:

            From: NeilBrown <neilb@suse.com>
            Subject: getcwd: Close race with d_move called by lustre
            Patch-mainline: Not yet, under development
            References: bsc#1052593
            
            When lustre invalidates a dentry (e.g. due to a recalled lock) and then
            revalidates it, ll_splice_alias() will call d_move() to move the old alias
            to the name of a new one.
            This will d_drop then d_rehash the old dentry, creating a small window
            when the dentry is unhashed.
            If getcwd is run at this time, it might incorrectly think that
            the dentry really is unhashed, and so return ENOENT.
            
            This is a bug in lustre (it shouldn't call d_move()) but we can work
            around it in getcwd by taking the d_lock to avoid the race.
            First we test without the lock as the common case does not involve
            any race.  If we find that the dentry appears to be unhashed, we take
            the lock and check again.
            
            Signed-off-by: Neil Brown <neilb@suse.com>
            
            bogl Bob Glossman (Inactive) added a comment

            People

              simmonsja James A Simmons
              mhanafi Mahmoud Hanafi
              Votes: 1
              Watchers: 24

              Dates

                Created:
                Updated:
                Resolved: