[LU-17301] Client mount hangs for several minutes Created: 20/Nov/23  Updated: 13/Dec/23  Resolved: 13/Dec/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Tim Day Assignee: Shaun Tancheff
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16901 Provide l_getidentity_nss identity pr... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After patch https://review.whamcloud.com/c/fs/lustre-release/+/51329, client mount hangs for several minutes when using `llmount.sh`. When this patch is reverted, the mount completes in a normal amount of time.

 

Setup:

Kernel - 5.15.94
openZFS - 2.1.99-1735_gdc72c60ec
Distro - CentOS Linux release 8.5.2111

 

Kernel messages:

[  480.629670] Lustre: mdt_rdpg01_001: service thread pid 52665 was inactive for 40.946 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[  480.632948] Pid: 52665, comm: mdt_rdpg01_001 5.15.94 #1 SMP Fri Feb 17 01:05:48 UTC 2023
[  480.634352] Call Trace TBD:
[  500.084741] LustreError: 52665:0:(upcall_cache.c:251:upcall_cache_get_entry()) acquire for key 0: error -110
[  500.086740] Lustre: Mounted lustre-client

The mount hangs for several minutes even after the 'Mounted' message gets printed.



 Comments   
Comment by Peter Jones [ 20/Nov/23 ]

Probably the immediate step should be to revert this patch from master until a fix can be worked out...

Comment by Arshad Hussain [ 21/Nov/23 ]
$ uname -r
3.10.0-1160.15.2.el7.x86_64
$ cat /etc/redhat-release 
CentOS Linux release 7.9.2009 (Core)

Also seen on 4.X series kernel. (el8).

Nov 20 23:25:51 centos79 kernel: Lustre: mdt_rdpg00_001: service thread pid 8735 was inactive for 40.059 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 20 23:25:51 centos79 kernel: Pid: 8735, comm: mdt_rdpg00_001 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
Nov 20 23:25:51 centos79 kernel: Call Trace:
Nov 20 23:25:51 centos79 kernel: [<0>] upcall_cache_get_entry+0x201/0x9d0 [obdclass]
Nov 20 23:25:51 centos79 kernel: [<0>] mdt_identity_get+0x17/0x50 [mdt]
Nov 20 23:25:51 centos79 kernel: [<0>] old_init_ucred_common+0xed/0x2c0 [mdt]
Nov 20 23:25:51 centos79 kernel: [<0>] mdt_init_ucred+0x19c/0x320 [mdt]
Nov 20 23:25:51 centos79 kernel: [<0>] mdt_getattr+0x156/0x640 [mdt]
Nov 20 23:25:51 centos79 kernel: [<0>] tgt_request_handle+0x743/0x19a0 [ptlrpc]
Nov 20 23:25:51 centos79 kernel: [<0>] ptlrpc_server_handle_request+0x261/0xcd0 [ptlrpc]
Nov 20 23:25:51 centos79 kernel: [<0>] ptlrpc_main+0xc5c/0x16c0 [ptlrpc]
Nov 20 23:25:51 centos79 kernel: [<0>] kthread+0xd1/0xe0
Nov 20 23:25:51 centos79 kernel: [<0>] ret_from_fork_nospec_begin+0xe/0x21
Nov 20 23:25:51 centos79 kernel: [<0>] 0xfffffffffffffffe
Nov 20 23:26:11 centos79 kernel: LustreError: 8735:0:(upcall_cache.c:244:upcall_cache_get_entry()) acquire for key 0: error -110
Nov 20 23:26:11 centos79 kernel: Lustre: Mounted lustre-client
Nov 20 23:26:52 centos79 kernel: Lustre: mdt00_002: service thread pid 8733 was inactive for 40.169 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 20 23:26:52 centos79 kernel: Pid: 8733, comm: mdt00_002 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
Nov 20 23:26:52 centos79 kernel: Call Trace:
Nov 20 23:26:52 centos79 kernel: [<0>] upcall_cache_get_entry+0x201/0x9d0 [obdclass]
Nov 20 23:26:52 centos79 kernel: [<0>] mdt_identity_get+0x17/0x50 [mdt]
Nov 20 23:26:52 centos79 kernel: [<0>] old_init_ucred_common+0xed/0x2c0 [mdt]
Nov 20 23:26:52 centos79 kernel: [<0>] mdt_init_ucred+0x19c/0x320 [mdt]
Nov 20 23:26:52 centos79 kernel: [<0>] mdt_intent_getattr+0xc3/0x4b0 [mdt]
Nov 20 23:26:52 centos79 kernel: [<0>] mdt_intent_opc+0x1c5/0xc40 [mdt]
Nov 20 23:26:52 centos79 kernel: [<0>] mdt_intent_policy+0xfa/0x460 [mdt]
Nov 20 23:26:52 centos79 kernel: [<0>] ldlm_lock_enqueue+0x3a4/0xb80 [ptlrpc]
Nov 20 23:26:52 centos79 kernel: [<0>] ldlm_handle_enqueue+0x34d/0x17a0 [ptlrpc]
Nov 20 23:26:52 centos79 kernel: [<0>] tgt_enqueue+0x68/0x240 [ptlrpc]
Nov 20 23:26:52 centos79 kernel: [<0>] tgt_request_handle+0x743/0x19a0 [ptlrpc]
Nov 20 23:26:52 centos79 kernel: [<0>] ptlrpc_server_handle_request+0x261/0xcd0 [ptlrpc]
Nov 20 23:26:52 centos79 kernel: [<0>] ptlrpc_main+0xc5c/0x16c0 [ptlrpc]
Nov 20 23:26:52 centos79 kernel: [<0>] kthread+0xd1/0xe0
Nov 20 23:26:52 centos79 kernel: [<0>] ret_from_fork_nospec_begin+0xe/0x21
Nov 20 23:26:52 centos79 kernel: [<0>] 0xfffffffffffffffe
Nov 20 23:27:12 centos79 kernel: LustreError: 8733:0:(upcall_cache.c:244:upcall_cache_get_entry()) acquire for key 0: error -110
Comment by Alexander Zarochentsev [ 21/Nov/23 ]

Hello Arshad,

can you please test the following patch:

diff --git a/lustre/utils/l_getidentity.c b/lustre/utils/l_getidentity.c
index 34c70d8d71..f99bdafc23 100644
--- a/lustre/utils/l_getidentity.c
+++ b/lustre/utils/l_getidentity.c
@@ -67,7 +67,7 @@ static int nss_pw_buf_len;
 static void *nss_pw_buf;
 static int nss_grent_buf_len;
 static void *nss_grent_buf;
-static int g_n_nss_modules;
+static int g_n_nss_modules = 0;
 static int grent_mod_no = -1;
 
 struct nss_module {

unfortunately I cannot reproduce the problem quickly.

Thanks,
Zam.

Comment by Gerrit Updater [ 21/Nov/23 ]

"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53191
Subject: LU-17301 utils: l_getidentity nss init fix
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1794102c4b4b17f364715e3a9611191945bd5d77

Comment by Arshad Hussain [ 21/Nov/23 ]

Hi Zam,

Sorry, it did not work.  I am getting this by just doing llmount.sh

Nov 21 07:15:56 centos79 kernel: Lustre: mdt_rdpg00_001: service thread pid 10702 was inactive for 40.000 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 21 07:15:56 centos79 kernel: Pid: 10702, comm: mdt_rdpg00_001 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
Nov 21 07:15:56 centos79 kernel: Call Trace:
Nov 21 07:15:56 centos79 kernel: [<0>] upcall_cache_get_entry+0x201/0xa00 [obdclass]
Nov 21 07:15:56 centos79 kernel: [<0>] mdt_identity_get+0x17/0x50 [mdt]
Nov 21 07:15:56 centos79 kernel: [<0>] old_init_ucred_common+0xed/0x2c0 [mdt]
Nov 21 07:15:56 centos79 kernel: [<0>] mdt_init_ucred+0x19c/0x320 [mdt]
Nov 21 07:15:56 centos79 kernel: [<0>] mdt_getattr+0x156/0x640 [mdt]
Nov 21 07:15:56 centos79 kernel: [<0>] tgt_request_handle+0x743/0x19a0 [ptlrpc]
Nov 21 07:15:56 centos79 kernel: [<0>] ptlrpc_server_handle_request+0x261/0xcd0 [ptlrpc]
Nov 21 07:15:56 centos79 kernel: [<0>] ptlrpc_main+0xc5c/0x16c0 [ptlrpc]
Nov 21 07:15:56 centos79 kernel: [<0>] kthread+0xd1/0xe0
Nov 21 07:15:56 centos79 kernel: [<0>] ret_from_fork_nospec_begin+0xe/0x21
Nov 21 07:15:56 centos79 kernel: [<0>] 0xfffffffffffffffe
Nov 21 07:16:16 centos79 kernel: LustreError: 10702:0:(upcall_cache.c:251:upcall_cache_get_entry()) acquire for key 0: error -110
Comment by Alexander Zarochentsev [ 21/Nov/23 ]

Looks like the following hunk (except -ldl) in the original patch is not needed:

diff --git a/lustre/utils/Makefile.am b/lustre/utils/Makefile.am
index 56f67afd1d..bee65a1178 100644
--- a/lustre/utils/Makefile.am
+++ b/lustre/utils/Makefile.am
@@ -245,8 +245,9 @@ l_foreign_symlink_LDADD := $(top_builddir)/libcfs/libcfs/libcfs.la
 l_foreign_symlink_DEPENDENCIES := $(top_builddir)/libcfs/libcfs/libcfs.la
 
 l_getidentity_SOURCES = l_getidentity.c
-l_getidentity_LDADD := $(top_builddir)/libcfs/libcfs/libcfs.la
-l_getidentity_DEPENDENCIES := $(top_builddir)/libcfs/libcfs/libcfs.la
+l_getidentity_LDADD := $(top_builddir)/libcfs/libcfs/libcfs.la liblustreapi.la $(LIBPTLCTL)
+l_getidentity_LDFLAGS = -ldl
+l_getidentity_DEPENDENCIES := $(top_builddir)/libcfs/libcfs/libcfs.la liblustreapi.la $(LIBPTLCTL)
 
 lhsmtool_posix_SOURCES = lhsmtool_posix.c pid_file.c pid_file.h
 lhsmtool_posix_LDADD := liblustreapi.la $(PTHREAD_LIBS) \

Moreover, it causes the shell wrapper lustre/utils/l_getidentity to fail silently when started by a kernel thread. The tool is not a binary file but a shell wrapper when I start Lustre from the source dir by llmount.sh:

root@rocky lustre-release]# file lustre/utils/l_getidentity
lustre/utils/l_getidentity: POSIX shell script, ASCII text executable, with very long lines
[root@rocky lustre-release]# 

The failures cause lustre client mount delays.

Comment by James A Simmons [ 22/Nov/23 ]

The reason for the shell wrapper is that allows running in the sandbox that people love to do. This way l_getindentity can find the proper libraries independent if its in source tree or installed into an image. If this application is important enough maybe we strip its need of any external libraries. If I remember correctly it only needs liblustreapi.so for mdt/%s/identity_info. Perhaps we can migrate to a udev rule that spins up l_getidentity instead. Another option is we use globs directly instead of cfs_get_param_paths(). BY the way why -ldl directly. Seems to want to work around the whole libtool handlng.

Comment by Arshad Hussain [ 22/Nov/23 ]

Zam, your latest patch does work. Both on 7.9 and 8.4.

>The tool is not a binary file but a shell wrapper when I start Lustre from the source dir by llmount.sh:...

The binaries are normally kept under lustre/utils/.libs/.

 

# file lustre/utils/.libs/l_getidentity 
lustre/utils/.libs/l_getidentity: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=b3cb5b448ee43365845f5d1fcb0a349bf91877c6, with debug_info, not stripped

The targets are normally names lustre/utils/.libs/lt-* (example lfs and lctl) along with with real name

 

# ls lustre/utils/.libs/lt-*
lustre/utils/.libs/lt-lctl  lustre/utils/.libs/lt-lfs
 

l_getidentity does not have lt-<target>. I am not sure if this is causing the issue.

 

 

Comment by Alexander Zarochentsev [ 22/Nov/23 ]

I intercepted the error stream output of l_getidentity being called by a Lustre server, and got something like:

[root@rocky ~]# cat /tmp/identity.log 
PATH=/sbin:/usr/sbin
0
upcall start lustre-MDT0000 0
/mnt/nfs/git/lustre-release/lustre/utils/l_getidentity: line 150: ls: command not found
/mnt/nfs/git/lustre-release/lustre/utils/l_getidentity: line 197: rm: command not found
/mnt/nfs/git/lustre-release/lustre/utils/l_getidentity: line 211: rm: command not found
/mnt/nfs/git/lustre-release/lustre/utils/l_getidentity: line 212: mv: command not found
/mnt/nfs/git/lustre-release/lustre/utils/l_getidentity: line 213: rm: command not found
upcall end lustre-MDT0000 0
0
[root@rocky ~]# 

I.e. the shell wrapper uses core linux utilities from /bin but PATH contains only "/sbin" and "/usr/sbin" , it is the reason for the libtool shell wrapper to function incorrectly. If I add "/bin:/usr/bin" to the PATH variable, the upcall (in its shell variant) starts to work.

Thanks,
Zam.

Comment by Arshad Hussain [ 22/Nov/23 ]

I looked into an old vm and looks like lustre/utils/l_getidentity was always a binary and not a shell wrapper or launcher script.

Comment by Alexander Zarochentsev [ 22/Nov/23 ]

>I looked into an old vm and looks like lustre/utils/l_getidentity was always a binary and not a shell wrapper or launcher script.

yes, the patch converts it to a binary again.

Comment by Gerrit Updater [ 13/Dec/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53191/
Subject: LU-17301 utils: l_getidentity build fix
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8fce6b40f245c438691800f6e239539a66e2395c

Comment by Peter Jones [ 13/Dec/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:34:18 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.