[LU-10514] all metadata operations take 1+ minutes thanks to libtool's l_getidentity Created: 15/Jan/18  Updated: 28/Dec/19  Resolved: 25/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Blocker
Reporter: John Hammond Assignee: John Hammond
Resolution: Fixed Votes: 1
Labels: None

Issue Links:
Duplicate
is duplicated by LU-10545 upcall_cache_get_entry() acquire for ... Resolved
Related
is related to LU-5541 improve liblustreapi build Resolved
is related to LU-10627 tests/utils: lhsmtool_posix now runs ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
n:lustre-release# bash $LUSTRE/tests/llmount.sh
...
n:lustre-release# time touch /mnt/lustre/f0

real    1m30.060s
user    0m0.001s
sys    0m0.004s
n:lustre-release# time touch /mnt/lustre/f1

real	2m0.058s
user	0m0.000s
sys	0m0.004s

libtoolizing lustre/utils creates a wrapper script for l_getidentity that doesn't work.

When invoked the wrapped script prints the following to stderr

/root/lustre-release/lustre/utils/l_getidentity: line 151: ls: command not found
/root/lustre-release/lustre/utils/l_getidentity: line 198: rm: command not found
/root/lustre-release/lustre/utils/l_getidentity: line 212: rm: command not found
/root/lustre-release/lustre/utils/l_getidentity: line 213: mv: command not found
/root/lustre-release/lustre/utils/l_getidentity: line 214: rm: command not found
/root/lustre-release/lustre/utils/l_getidentity: error: `/root/lustre-release/lustre/utils/.libs/lt-l_getidentity' does not exist
This script is just a wrapper for lt-l_getidentity.
See the libtool documentation for more information.

But this doesn't go anywhere because stderr is to connected to anything when l_getidentity is run.

l_getidentity should not depend on liblustreapi. We should factor out whatever it needs in separate .c files and add them to l_getidentity dependencies.

Also why do it take 2 minutes to for the operation to complete. It seems like we're not handling failure from the identity downcall very well. It sets stuck at:

n:~# stack1 mdt
8856 mdt00_003
[<ffffffffc0921e80>] upcall_cache_get_entry+0x1d0/0x8f0 [obdclass]
[<ffffffffc10120c7>] mdt_identity_get+0x17/0x50 [mdt]
[<ffffffffc0ff36eb>] old_init_ucred_common+0xdb/0x290 [mdt]
[<ffffffffc0ff39c7>] old_init_ucred+0x127/0x240 [mdt]
[<ffffffffc0ff5405>] mdt_init_ucred_intent_getattr+0x85/0xa0 [mdt]
[<ffffffffc0ff04f5>] mdt_intent_getattr+0xc5/0x470 [mdt]
[<ffffffffc0fe60b2>] mdt_intent_opc+0x442/0xad0 [mdt]
[<ffffffffc0fedc73>] mdt_intent_policy+0x1a3/0x360 [mdt]
[<ffffffffc0d042fa>] ldlm_lock_enqueue+0x38a/0x970 [ptlrpc]
[<ffffffffc0d2da33>] ldlm_handle_enqueue0+0x8f3/0x1400 [ptlrpc]
[<ffffffffc0db3752>] tgt_enqueue+0x62/0x210 [ptlrpc]
[<ffffffffc0dbb965>] tgt_request_handle+0x925/0x13b0 [ptlrpc]
[<ffffffffc0d5fc7e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[<ffffffffc0d63422>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[<ffffffff810b252f>] kthread+0xcf/0xe0
[<ffffffff816b8798>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

It's sleeping at left = schedule_timeout(expiry) in upcall_cache_get_entry().

BTW, libtool is terrible.



 Comments   
Comment by John Hammond [ 15/Jan/18 ]

Sorry, forgot to mention that the libtoolizing of lustre/utils/ was done by https://review.whamcloud.com/30562 LU-5541 build: move libcfs and liblustreapi over to libtool.

Comment by James A Simmons [ 15/Jan/18 ]

None of the tests are showing this regression. This looks like a corner case. What are the exact steps you did to reproduce this problem?

/usr/src/lustre-2.10.56/lustre/utils$ ./l_getidentity

usage: l_getidentity {-d|mdtname} {uid}
Normally invoked as an upcall from Lustre, set via:
lctl set_param mdt.${mdtname}.identity_upcall={path to upcall}
-d: debug, print values to stdout instead of Lustre

/usr/src/lustre-2.10.56/lustre/utils$ ldd ./l_getidentity
not a dynamic executable

Well the other option besides libtool is cmake

I wouldn't consider this a blocker since under normal conditions things do work.

Comment by John Hammond [ 15/Jan/18 ]

James,

Anything that breaks when ./lustre/tests/llmount.sh is run from the lustre-release directory should be a blocker. Many developers depend on this working to reduce turn around time for testing locally.

Comment by Gerrit Updater [ 16/Jan/18 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30872
Subject: LU-10514 utils: statically link l_getidentity with libcfs.a
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e53f0067d9f8b4ab2d7bf8df75b5fb02aae03727

Comment by James A Simmons [ 16/Jan/18 ]

You could also do l_getidentity_LDFLAGS := -static. We should look at making l_getidentity independent of libcfs.a as well. Currently we have to have libcfs.a due to need for libcfs_nid2str() used by /etc/perm.conf support. Do we need /etc/perm.conf anymore since we have nodemap which is way better. Second libcfs.a use is cfs_get_param() but we could just test for /proc and /sys indentity_info file and write to it directly.

Comment by Andreas Dilger [ 22/Jan/18 ]

James, getting rid of /etc/perm.conf would be great. As you write, we have nodemap already. That was part of the goal of the LU-7533 cleanup.

Comment by Gerrit Updater [ 25/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30872/
Subject: LU-10514 utils: statically link l_getidentity with libcfs.a
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 14e8e8e8a6b1bfe370e3e1f9fef6e37ccfc23290

Comment by Peter Jones [ 25/Jan/18 ]

Landed for 2.11

Generated at Sat Feb 10 02:35:46 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.