[LU-17301] Client mount hangs for several minutes Created: 20/Nov/23 Updated: 13/Dec/23 Resolved: 13/Dec/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Tim Day | Assignee: | Shaun Tancheff |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
After patch https://review.whamcloud.com/c/fs/lustre-release/+/51329, client mount hangs for several minutes when using `llmount.sh`. When this patch is reverted, the mount completes in a normal amount of time.
Setup: Kernel - 5.15.94 openZFS - 2.1.99-1735_gdc72c60ec Distro - CentOS Linux release 8.5.2111
Kernel messages: [ 480.629670] Lustre: mdt_rdpg01_001: service thread pid 52665 was inactive for 40.946 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [ 480.632948] Pid: 52665, comm: mdt_rdpg01_001 5.15.94 #1 SMP Fri Feb 17 01:05:48 UTC 2023 [ 480.634352] Call Trace TBD: [ 500.084741] LustreError: 52665:0:(upcall_cache.c:251:upcall_cache_get_entry()) acquire for key 0: error -110 [ 500.086740] Lustre: Mounted lustre-client The mount hangs for several minutes even after the 'Mounted' message gets printed. |
| Comments |
| Comment by Peter Jones [ 20/Nov/23 ] |
|
Probably the immediate step should be to revert this patch from master until a fix can be worked out... |
| Comment by Arshad Hussain [ 21/Nov/23 ] |
$ uname -r 3.10.0-1160.15.2.el7.x86_64 $ cat /etc/redhat-release CentOS Linux release 7.9.2009 (Core) Also seen on 4.X series kernel. (el8). Nov 20 23:25:51 centos79 kernel: Lustre: mdt_rdpg00_001: service thread pid 8735 was inactive for 40.059 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Nov 20 23:25:51 centos79 kernel: Pid: 8735, comm: mdt_rdpg00_001 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 Nov 20 23:25:51 centos79 kernel: Call Trace: Nov 20 23:25:51 centos79 kernel: [<0>] upcall_cache_get_entry+0x201/0x9d0 [obdclass] Nov 20 23:25:51 centos79 kernel: [<0>] mdt_identity_get+0x17/0x50 [mdt] Nov 20 23:25:51 centos79 kernel: [<0>] old_init_ucred_common+0xed/0x2c0 [mdt] Nov 20 23:25:51 centos79 kernel: [<0>] mdt_init_ucred+0x19c/0x320 [mdt] Nov 20 23:25:51 centos79 kernel: [<0>] mdt_getattr+0x156/0x640 [mdt] Nov 20 23:25:51 centos79 kernel: [<0>] tgt_request_handle+0x743/0x19a0 [ptlrpc] Nov 20 23:25:51 centos79 kernel: [<0>] ptlrpc_server_handle_request+0x261/0xcd0 [ptlrpc] Nov 20 23:25:51 centos79 kernel: [<0>] ptlrpc_main+0xc5c/0x16c0 [ptlrpc] Nov 20 23:25:51 centos79 kernel: [<0>] kthread+0xd1/0xe0 Nov 20 23:25:51 centos79 kernel: [<0>] ret_from_fork_nospec_begin+0xe/0x21 Nov 20 23:25:51 centos79 kernel: [<0>] 0xfffffffffffffffe Nov 20 23:26:11 centos79 kernel: LustreError: 8735:0:(upcall_cache.c:244:upcall_cache_get_entry()) acquire for key 0: error -110 Nov 20 23:26:11 centos79 kernel: Lustre: Mounted lustre-client Nov 20 23:26:52 centos79 kernel: Lustre: mdt00_002: service thread pid 8733 was inactive for 40.169 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Nov 20 23:26:52 centos79 kernel: Pid: 8733, comm: mdt00_002 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 Nov 20 23:26:52 centos79 kernel: Call Trace: Nov 20 23:26:52 centos79 kernel: [<0>] upcall_cache_get_entry+0x201/0x9d0 [obdclass] Nov 20 23:26:52 centos79 kernel: [<0>] mdt_identity_get+0x17/0x50 [mdt] Nov 20 23:26:52 centos79 kernel: [<0>] old_init_ucred_common+0xed/0x2c0 [mdt] Nov 20 23:26:52 centos79 kernel: [<0>] mdt_init_ucred+0x19c/0x320 [mdt] Nov 20 23:26:52 centos79 kernel: [<0>] mdt_intent_getattr+0xc3/0x4b0 [mdt] Nov 20 23:26:52 centos79 kernel: [<0>] mdt_intent_opc+0x1c5/0xc40 [mdt] Nov 20 23:26:52 centos79 kernel: [<0>] mdt_intent_policy+0xfa/0x460 [mdt] Nov 20 23:26:52 centos79 kernel: [<0>] ldlm_lock_enqueue+0x3a4/0xb80 [ptlrpc] Nov 20 23:26:52 centos79 kernel: [<0>] ldlm_handle_enqueue+0x34d/0x17a0 [ptlrpc] Nov 20 23:26:52 centos79 kernel: [<0>] tgt_enqueue+0x68/0x240 [ptlrpc] Nov 20 23:26:52 centos79 kernel: [<0>] tgt_request_handle+0x743/0x19a0 [ptlrpc] Nov 20 23:26:52 centos79 kernel: [<0>] ptlrpc_server_handle_request+0x261/0xcd0 [ptlrpc] Nov 20 23:26:52 centos79 kernel: [<0>] ptlrpc_main+0xc5c/0x16c0 [ptlrpc] Nov 20 23:26:52 centos79 kernel: [<0>] kthread+0xd1/0xe0 Nov 20 23:26:52 centos79 kernel: [<0>] ret_from_fork_nospec_begin+0xe/0x21 Nov 20 23:26:52 centos79 kernel: [<0>] 0xfffffffffffffffe Nov 20 23:27:12 centos79 kernel: LustreError: 8733:0:(upcall_cache.c:244:upcall_cache_get_entry()) acquire for key 0: error -110 |
| Comment by Alexander Zarochentsev [ 21/Nov/23 ] |
|
Hello Arshad, can you please test the following patch: diff --git a/lustre/utils/l_getidentity.c b/lustre/utils/l_getidentity.c index 34c70d8d71..f99bdafc23 100644 --- a/lustre/utils/l_getidentity.c +++ b/lustre/utils/l_getidentity.c @@ -67,7 +67,7 @@ static int nss_pw_buf_len; static void *nss_pw_buf; static int nss_grent_buf_len; static void *nss_grent_buf; -static int g_n_nss_modules; +static int g_n_nss_modules = 0; static int grent_mod_no = -1; struct nss_module { unfortunately I cannot reproduce the problem quickly. Thanks, |
| Comment by Gerrit Updater [ 21/Nov/23 ] |
|
"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53191 |
| Comment by Arshad Hussain [ 21/Nov/23 ] |
|
Hi Zam, Sorry, it did not work. I am getting this by just doing llmount.sh Nov 21 07:15:56 centos79 kernel: Lustre: mdt_rdpg00_001: service thread pid 10702 was inactive for 40.000 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Nov 21 07:15:56 centos79 kernel: Pid: 10702, comm: mdt_rdpg00_001 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 Nov 21 07:15:56 centos79 kernel: Call Trace: Nov 21 07:15:56 centos79 kernel: [<0>] upcall_cache_get_entry+0x201/0xa00 [obdclass] Nov 21 07:15:56 centos79 kernel: [<0>] mdt_identity_get+0x17/0x50 [mdt] Nov 21 07:15:56 centos79 kernel: [<0>] old_init_ucred_common+0xed/0x2c0 [mdt] Nov 21 07:15:56 centos79 kernel: [<0>] mdt_init_ucred+0x19c/0x320 [mdt] Nov 21 07:15:56 centos79 kernel: [<0>] mdt_getattr+0x156/0x640 [mdt] Nov 21 07:15:56 centos79 kernel: [<0>] tgt_request_handle+0x743/0x19a0 [ptlrpc] Nov 21 07:15:56 centos79 kernel: [<0>] ptlrpc_server_handle_request+0x261/0xcd0 [ptlrpc] Nov 21 07:15:56 centos79 kernel: [<0>] ptlrpc_main+0xc5c/0x16c0 [ptlrpc] Nov 21 07:15:56 centos79 kernel: [<0>] kthread+0xd1/0xe0 Nov 21 07:15:56 centos79 kernel: [<0>] ret_from_fork_nospec_begin+0xe/0x21 Nov 21 07:15:56 centos79 kernel: [<0>] 0xfffffffffffffffe Nov 21 07:16:16 centos79 kernel: LustreError: 10702:0:(upcall_cache.c:251:upcall_cache_get_entry()) acquire for key 0: error -110 |
| Comment by Alexander Zarochentsev [ 21/Nov/23 ] |
|
Looks like the following hunk (except -ldl) in the original patch is not needed: diff --git a/lustre/utils/Makefile.am b/lustre/utils/Makefile.am index 56f67afd1d..bee65a1178 100644 --- a/lustre/utils/Makefile.am +++ b/lustre/utils/Makefile.am @@ -245,8 +245,9 @@ l_foreign_symlink_LDADD := $(top_builddir)/libcfs/libcfs/libcfs.la l_foreign_symlink_DEPENDENCIES := $(top_builddir)/libcfs/libcfs/libcfs.la l_getidentity_SOURCES = l_getidentity.c -l_getidentity_LDADD := $(top_builddir)/libcfs/libcfs/libcfs.la -l_getidentity_DEPENDENCIES := $(top_builddir)/libcfs/libcfs/libcfs.la +l_getidentity_LDADD := $(top_builddir)/libcfs/libcfs/libcfs.la liblustreapi.la $(LIBPTLCTL) +l_getidentity_LDFLAGS = -ldl +l_getidentity_DEPENDENCIES := $(top_builddir)/libcfs/libcfs/libcfs.la liblustreapi.la $(LIBPTLCTL) lhsmtool_posix_SOURCES = lhsmtool_posix.c pid_file.c pid_file.h lhsmtool_posix_LDADD := liblustreapi.la $(PTHREAD_LIBS) \ Moreover, it causes the shell wrapper lustre/utils/l_getidentity to fail silently when started by a kernel thread. The tool is not a binary file but a shell wrapper when I start Lustre from the source dir by llmount.sh: root@rocky lustre-release]# file lustre/utils/l_getidentity lustre/utils/l_getidentity: POSIX shell script, ASCII text executable, with very long lines [root@rocky lustre-release]# The failures cause lustre client mount delays. |
| Comment by James A Simmons [ 22/Nov/23 ] |
|
The reason for the shell wrapper is that allows running in the sandbox that people love to do. This way l_getindentity can find the proper libraries independent if its in source tree or installed into an image. If this application is important enough maybe we strip its need of any external libraries. If I remember correctly it only needs liblustreapi.so for mdt/%s/identity_info. Perhaps we can migrate to a udev rule that spins up l_getidentity instead. Another option is we use globs directly instead of cfs_get_param_paths(). BY the way why -ldl directly. Seems to want to work around the whole libtool handlng. |
| Comment by Arshad Hussain [ 22/Nov/23 ] |
|
Zam, your latest patch does work. Both on 7.9 and 8.4. >The tool is not a binary file but a shell wrapper when I start Lustre from the source dir by llmount.sh:... The binaries are normally kept under lustre/utils/.libs/.
# file lustre/utils/.libs/l_getidentity lustre/utils/.libs/l_getidentity: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=b3cb5b448ee43365845f5d1fcb0a349bf91877c6, with debug_info, not stripped The targets are normally names lustre/utils/.libs/lt-* (example lfs and lctl) along with with real name
# ls lustre/utils/.libs/lt-* lustre/utils/.libs/lt-lctl lustre/utils/.libs/lt-lfs l_getidentity does not have lt-<target>. I am not sure if this is causing the issue.
|
| Comment by Alexander Zarochentsev [ 22/Nov/23 ] |
|
I intercepted the error stream output of l_getidentity being called by a Lustre server, and got something like: [root@rocky ~]# cat /tmp/identity.log PATH=/sbin:/usr/sbin 0 upcall start lustre-MDT0000 0 /mnt/nfs/git/lustre-release/lustre/utils/l_getidentity: line 150: ls: command not found /mnt/nfs/git/lustre-release/lustre/utils/l_getidentity: line 197: rm: command not found /mnt/nfs/git/lustre-release/lustre/utils/l_getidentity: line 211: rm: command not found /mnt/nfs/git/lustre-release/lustre/utils/l_getidentity: line 212: mv: command not found /mnt/nfs/git/lustre-release/lustre/utils/l_getidentity: line 213: rm: command not found upcall end lustre-MDT0000 0 0 [root@rocky ~]# I.e. the shell wrapper uses core linux utilities from /bin but PATH contains only "/sbin" and "/usr/sbin" , it is the reason for the libtool shell wrapper to function incorrectly. If I add "/bin:/usr/bin" to the PATH variable, the upcall (in its shell variant) starts to work. Thanks, |
| Comment by Arshad Hussain [ 22/Nov/23 ] |
|
I looked into an old vm and looks like lustre/utils/l_getidentity was always a binary and not a shell wrapper or launcher script. |
| Comment by Alexander Zarochentsev [ 22/Nov/23 ] |
|
>I looked into an old vm and looks like lustre/utils/l_getidentity was always a binary and not a shell wrapper or launcher script. yes, the patch converts it to a binary again. |
| Comment by Gerrit Updater [ 13/Dec/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53191/ |
| Comment by Peter Jones [ 13/Dec/23 ] |
|
Landed for 2.16 |