[LU-9193] Multiple hangs observed with many open/getattr Created: 08/Dec/16 Updated: 14/Dec/21 Resolved: 01/Apr/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0, Lustre 2.5.3, Lustre 2.8.0, Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.7 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jean-Baptiste Riaux (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | CentOS 7.2 |
| Attachments: |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Tested (reproduced) on 2.5, 2.7, 2.8 and 2.9. MPI job on 300 nodes: 2/3 open and 1/3 stat on the same file => hang (the MDS server threads are idle and the load is close to 0; threads are waiting for a lock but no thread holds an active lock. After a long time, 15-30 min, the threads become responsive again and resume operations normally). Some of the logs were similar to
So far the only thing that prevents the issue is to disable SELinux.
Reproducer (127 clients: vm3 to vm130, vm0 is the MDS, vm1 and vm2 are OSS): tested disabling statahead, no impact. Traces of stuck processes on the MDS all look the same (could be related to DDN-366). I do not have the client traces yet, but the last call on the clients is mdc_enqueue. Crash dump analysis started (in progress). So far nothing obvious about SELinux; the first analysis leads to ldlm_handle_enqueue0() and ldlm_lock_enqueue(). We see some processes idle for 280s; the system is not entirely stuck but very, very slow (we had to crash the VM to get a proper dump because it was very hard to use crash while the threads are not 100% stuck). |
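For reference, here is a minimal, hypothetical sketch of the kind of reproducer and of the workarounds mentioned above (host names, paths and the pdsh driver are illustrative; the actual CEA reproducer is not attached to this ticket):
# Hypothetical sketch: concurrent open/create and stat on the same shared
# directory from many clients at once (the real reproducer mixes ~2/3 open
# and ~1/3 stat per the description above).
pdsh -w vm[3-130] 'touch /lustre/testfs/testuser/testdir/f.$(hostname) & stat /lustre/testfs/testuser/testdir & wait'

# Workaround that avoids the hang so far: disable SELinux on the clients.
setenforce 0          # or SELINUX=disabled in /etc/selinux/config + reboot

# Statahead was also disabled during testing (it had no impact on the hang):
lctl set_param llite.*.statahead_max=0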
| Comments |
| Comment by Bruno Faccini (Inactive) [ 09/Dec/16 ] |
|
I have assigned this to myself since I have started working on it with JB on-site. |
| Comment by Jean-Baptiste Riaux (Inactive) [ 13/Dec/16 ] |
|
I could reproduce the problem yesterday on site with Lustre 2.9 (containing the suggested |
| Comment by Bruno Faccini (Inactive) [ 05/Jan/17 ] |
|
During a joint on-site analysis with JB (where we crashed all client/server nodes of our test VM cluster at the same time, trying to get a frozen picture of the blocked situation), we found that, due to SELinux, each inode/dentry looked up during path discovery/namei needs to fetch and check the security.selinux xattr, and this seems to interfere strongly when concurrent accesses occur from the same and from different nodes, as with CEA's identified reproducer (a mix of concurrent stat/create/setattr from >100 nodes, as far as I can understand from the reproducer description in this ticket). We have not finished the analysis yet, but it seems that for a particular MDT object there are a lot of mdt threads waiting to grant clients read/write locks, while some of the currently granted read-lock owners (there are multiple granted locks) are stuck on the client side trying to getxattr the security.selinux xattr, and have been for hundreds of seconds. On the MDT side we discovered that these getxattr requests have been received/queued but are not being handled. This is where we are at the moment, but there is more to come. And JB, feel free to add/correct me. |
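As a quick way to observe this extra security.selinux traffic from a client, something like the following could be used (a rough sketch; the exact stats parameter names can vary between Lustre releases):
# Count getxattr operations seen at the llite layer before and after a
# directory walk, to see how many are triggered by SELinux during lookups.
lctl get_param llite.*.stats | grep getxattr
ls -l /lustre/testfs/testuser/testdir > /dev/null
lctl get_param llite.*.stats | grep getxattr

# The security context itself can also be fetched explicitly:
getfattr -n security.selinux /lustre/testfs/testuser/testdir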
| Comment by John Hammond [ 12/Jan/17 ] |
|
JB, could you attach the dmesg and the output of 'foreach bt' from the MDS crash image? |
| Comment by Bruno Faccini (Inactive) [ 13/Jan/17 ] |
|
Here is more detailed information, based on the data we have been allowed to extract and use in this ticket. We have continued the analysis of the pseudo-hang on-site, and we have found that the getxattr() request for the security.selinux EA could not be handled by the MDS/MDT because 149 of the 150 threads serving the "mdt" service are stuck with either of the following 2 stacks:
PID: 11489 TASK: ffff880714dc0000 CPU: 2 COMMAND: "mdt00_066"
#0 [ffff88070a433680] __schedule at ffffffff8168b175
#1 [ffff88070a4336e8] schedule at ffffffff8168b7c9
#2 [ffff88070a4336f8] schedule_timeout at ffffffff81689155
#3 [ffff88070a4337a0] ldlm_completion_ast at ffffffffa09b3791 [ptlrpc]
#4 [ffff88070a433840] ldlm_cli_enqueue_local at ffffffffa09b44e0 [ptlrpc]
#5 [ffff88070a4338d8] mdt_object_local_lock at ffffffffa0e717d2 [mdt]
#6 [ffff88070a433988] mdt_object_lock_internal at ffffffffa0e71ebe [mdt]
#7 [ffff88070a4339d8] mdt_object_lock at ffffffffa0e72174 [mdt]
#8 [ffff88070a4339e8] mdt_reint_open at ffffffffa0e93a11 [mdt]
#9 [ffff88070a433ad8] mdt_reint_rec at ffffffffa0e8a490 [mdt]
#10 [ffff88070a433b00] mdt_reint_internal at ffffffffa0e6cdb2 [mdt]
#11 [ffff88070a433b38] mdt_intent_reint at ffffffffa0e6d312 [mdt]
#12 [ffff88070a433b78] mdt_intent_policy at ffffffffa0e77b8c [mdt]
#13 [ffff88070a433bd0] ldlm_lock_enqueue at ffffffffa09991e7 [ptlrpc]
#14 [ffff88070a433c28] ldlm_handle_enqueue0 at ffffffffa09c23a3 [ptlrpc]
#15 [ffff88070a433cb8] tgt_enqueue at ffffffffa0a42e12 [ptlrpc]
#16 [ffff88070a433cd8] tgt_request_handle at ffffffffa0a47275 [ptlrpc]
#17 [ffff88070a433d20] ptlrpc_server_handle_request at ffffffffa09f31fb [ptlrpc]
#18 [ffff88070a433de8] ptlrpc_main at ffffffffa09f72b0 [ptlrpc]
#19 [ffff88070a433ec8] kthread at ffffffff810b052f
#20 [ffff88070a433f50] ret_from_fork at ffffffff81696658
or
PID: 11490 TASK: ffff880714dc1f60 CPU: 4 COMMAND: "mdt00_067"
#0 [ffff88070a5ef750] __schedule at ffffffff8168b175
#1 [ffff88070a5ef7b8] schedule at ffffffff8168b7c9
#2 [ffff88070a5ef7c8] schedule_timeout at ffffffff81689155
#3 [ffff88070a5ef878] ldlm_completion_ast at ffffffffa09b3791 [ptlrpc]
#4 [ffff88070a5ef918] ldlm_cli_enqueue_local at ffffffffa09b44e0 [ptlrpc]
#5 [ffff88070a5ef9b0] mdt_object_local_lock at ffffffffa0e7188a [mdt]
#6 [ffff88070a5efa60] mdt_object_lock_internal at ffffffffa0e71ebe [mdt]
#7 [ffff88070a5efab0] mdt_getattr_name_lock at ffffffffa0e72c64 [mdt]
#8 [ffff88070a5efb38] mdt_intent_getattr at ffffffffa0e73f70 [mdt]
#9 [ffff88070a5efb78] mdt_intent_policy at ffffffffa0e77b8c [mdt]
#10 [ffff88070a5efbd0] ldlm_lock_enqueue at ffffffffa09991e7 [ptlrpc]
#11 [ffff88070a5efc28] ldlm_handle_enqueue0 at ffffffffa09c23a3 [ptlrpc]
#12 [ffff88070a5efcb8] tgt_enqueue at ffffffffa0a42e12 [ptlrpc]
#13 [ffff88070a5efcd8] tgt_request_handle at ffffffffa0a47275 [ptlrpc]
#14 [ffff88070a5efd20] ptlrpc_server_handle_request at ffffffffa09f31fb [ptlrpc]
#15 [ffff88070a5efde8] ptlrpc_main at ffffffffa09f72b0 [ptlrpc]
#16 [ffff88070a5efec8] kthread at ffffffff810b052f
#17 [ffff88070a5eff50] ret_from_fork at ffffffff81696658
while the last/150th thread seems busy handling obd_ping traffic, as found in the debug trace extracted from the crash-dump.
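For context, this is how the MDT service thread usage can be checked and its ceiling raised at runtime (a hedged sketch; the tunables are the usual mds.MDS.mdt.* ones and may differ slightly on older releases):
# On the MDS: number of "mdt" service threads started vs. the allowed maximum.
lctl get_param mds.MDS.mdt.threads_started mds.MDS.mdt.threads_max

# Raising the ceiling lets queued getxattr requests still find a free thread;
# this only works around the dead-lock, it does not fix it (see below).
lctl set_param mds.MDS.mdt.threads_max=1024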
BTW and as expected, during a new run + pseudo-hang of the reproducer, raising the maximum number of "mdt" service threads to 1024 allowed the MDS to quickly recover from its pseudo-hung situation and finally handle all requests, leading to the reproducer's completion in a very short time! The contended resource appears to be the main directory where the reproducer's files are located, "/lustre/testfs/testuser/testdir/". The client threads that currently have the lock on the directory resource granted, and are waiting for their getxattr() request to be serviced, have the following stack:
PID: 8825 TASK: ffff88072dee6dd0 CPU: 5 COMMAND: "touch"
#0 [ffff88017d4ef5a0] __schedule at ffffffff8168b175
#1 [ffff88017d4ef608] schedule at ffffffff8168b7c9
#2 [ffff88017d4ef618] schedule_timeout at ffffffff81689155
#3 [ffff88017d4ef6c8] ptlrpc_set_wait at ffffffffa09c8cd0 [ptlrpc]
#4 [ffff88017d4ef770] ptlrpc_queue_wait at ffffffffa09c918d [ptlrpc]
#5 [ffff88017d4ef790] mdc_xattr_common at ffffffffa0bd93aa [mdc]
#6 [ffff88017d4ef810] mdc_getxattr at ffffffffa0bd962f [mdc]
#7 [ffff88017d4ef850] lmv_getxattr at ffffffffa0c1660e [lmv]
#8 [ffff88017d4ef8c8] ll_getxattr_common at ffffffffa0c942c4 [lustre]
#9 [ffff88017d4ef948] ll_getxattr at ffffffffa0c94763 [lustre]
#10 [ffff88017d4ef980] inode_doinit_with_dentry at ffffffff812b281f
#11 [ffff88017d4ef9d8] selinux_d_instantiate at ffffffff812b38ec
#12 [ffff88017d4ef9e8] security_d_instantiate at ffffffff812a819b
#13 [ffff88017d4ef9f8] d_instantiate at ffffffff812162b0
#14 [ffff88017d4efa20] ll_splice_alias at ffffffffa0c8939a [lustre]
#15 [ffff88017d4efa60] ll_lookup_it_finish at ffffffffa0c89633 [lustre]
#16 [ffff88017d4efb28] ll_lookup_it at ffffffffa0c8ae6e [lustre]
#17 [ffff88017d4efbe0] ll_lookup_nd at ffffffffa0c8c6dd [lustre]
#18 [ffff88017d4efc78] lookup_real at ffffffff812083dd
#19 [ffff88017d4efc98] __lookup_hash at ffffffff81208d52
#20 [ffff88017d4efcc8] lookup_slow at ffffffff816833fb
#21 [ffff88017d4efd00] link_path_walk at ffffffff8120b96f
#22 [ffff88017d4efdb0] path_openat at ffffffff8120dc53
#23 [ffff88017d4efe48] do_filp_open at ffffffff8120fdeb
#24 [ffff88017d4eff18] do_sys_open at ffffffff811fd2f3
#25 [ffff88017d4eff70] sys_open at ffffffff811fd40e
#26 [ffff88017d4eff80] system_call_fastpath at ffffffff81696709
RIP: 00007fbf1b1312b0 RSP: 00007ffe4c0891e8 RFLAGS: 00010287
RAX: 0000000000000002 RBX: ffffffff81696709 RCX: 0000000000000014
RDX: 00000000000001b6 RSI: 0000000000000941 RDI: 00007ffe4c08ae2d
RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000
R10: 00007ffe4c0890b0 R11: 0000000000000246 R12: ffffffff811fd40e
R13: ffff88017d4eff78 R14: 0000000000000000 R15: 0000000000000001
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
Other reproducer threads on the same clients have the following stacks, depending on whether they run a touch (3 per reproducer step) or a stat (4 per reproducer step) command in the reproducer:
PID: 8834 TASK: ffff880034592f10 CPU: 3 COMMAND: "touch"
#0 [ffff88072e2ebbe0] __schedule at ffffffff8168b175
#1 [ffff88072e2ebc48] schedule_preempt_disabled at ffffffff8168c8b9
#2 [ffff88072e2ebc58] __mutex_lock_slowpath at ffffffff8168a515
#3 [ffff88072e2ebcb0] mutex_lock at ffffffff8168997f
#4 [ffff88072e2ebcc8] lookup_slow at ffffffff816833ec
#5 [ffff88072e2ebd00] link_path_walk at ffffffff8120b96f
#6 [ffff88072e2ebdb0] path_openat at ffffffff8120dc53
#7 [ffff88072e2ebe48] do_filp_open at ffffffff8120fdeb
#8 [ffff88072e2ebf18] do_sys_open at ffffffff811fd2f3
#9 [ffff88072e2ebf70] sys_open at ffffffff811fd40e
#10 [ffff88072e2ebf80] system_call_fastpath at ffffffff81696709
RIP: 00007fd7a10b22b0 RSP: 00007ffea7e57428 RFLAGS: 00010287
RAX: 0000000000000002 RBX: ffffffff81696709 RCX: 0000000000000014
RDX: 00000000000001b6 RSI: 0000000000000941 RDI: 00007ffea7e57e2b
RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000
R10: 00007ffea7e572f0 R11: 0000000000000246 R12: ffffffff811fd40e
R13: ffff88072e2ebf78 R14: 0000000000000000 R15: 0000000000000001
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
PID: 8832 TASK: ffff88073c370000 CPU: 0 COMMAND: "stat"
#0 [ffff88072e2d7bc0] __schedule at ffffffff8168b175
#1 [ffff88072e2d7c28] schedule_preempt_disabled at ffffffff8168c8b9
#2 [ffff88072e2d7c38] __mutex_lock_slowpath at ffffffff8168a515
#3 [ffff88072e2d7c98] mutex_lock at ffffffff8168997f
#4 [ffff88072e2d7cb0] lookup_slow at ffffffff816833ec
#5 [ffff88072e2d7ce8] path_lookupat at ffffffff8120c273
#6 [ffff88072e2d7d80] filename_lookup at ffffffff8120c2cb
#7 [ffff88072e2d7db8] user_path_at_empty at ffffffff8120fce7
#8 [ffff88072e2d7e88] user_path_at at ffffffff8120fd51
#9 [ffff88072e2d7e98] vfs_fstatat at ffffffff812031c3
#10 [ffff88072e2d7ee8] SYSC_newlstat at ffffffff81203791
#11 [ffff88072e2d7f70] sys_newlstat at ffffffff81203a1e
#12 [ffff88072e2d7f80] system_call_fastpath at ffffffff81696709
RIP: 00007fdea1e6eeb5 RSP: 00007ffea2b024e8 RFLAGS: 00000246
RAX: 0000000000000006 RBX: ffffffff81696709 RCX: ffffffffffffffff
RDX: 00007ffea2b02590 RSI: 00007ffea2b02590 RDI: 00007ffea2b02e34
RBP: 0000000002049460 R8: 0000000000000001 R9: 00007ffea2b02e34
R10: 00007ffea2b022e0 R11: 0000000000000246 R12: ffffffff81203a1e
R13: ffff88072e2d7f78 R14: 0000000000000002 R15: 00000000e406a798
ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b
I also need to detail the state of the reproducer processes on the other client nodes, which are likely responsible for the requests being handled by the hung MDS threads described above. I will also try to better understand whether the first touch/file creation (which requires a write lock on the directory to add the entry) is required to reach the pseudo-hung situation. Based on all of this, and after a finer on-site crash-dump analysis of the current locking requests from all clients, I will try to summarize in my next post our understanding of the exact sequence of requests/events leading to the pseudo-hang when running the reproducer. But I already have the feeling that we may be facing a timing window where some client(s) can get a lock on the directory granted and then try to access the security.selinux xattr, while in the meantime other clients have submitted their requests/locks on the directory resource and have monopolized all the available "mdt" service threads. |
| Comment by John Hammond [ 13/Jan/17 ] |
|
What are the kernel and lustre versions running on the MDS and on the clients? |
| Comment by Jean-Baptiste Riaux (Inactive) [ 23/Jan/17 ] |
|
Clients and the MDS are both running:
|
| Comment by Jean-Baptiste Riaux (Inactive) [ 03/Feb/17 ] |
|
MDS max started threads: 149
Locks waiting:
Blocked threads in mdt_intent_getattr:
Blocked threads in mdt_reint_open:
126 + 21 = 147, which matches the number of blocked threads.
Looking at the compatibility matrix: PR and CW are not compatible. So we have processes (stat and touch) doing Concurrent Write on ibits 2:
-------------------
GRANTED locks:
Following the locks leads again to the following stacks (touch and stat).
touch command: here the client tries to take a new lock (MDS_INODELOCK_XATTR), but the MDS cannot accept it as all its threads are busy:
#8 [ffff88017d4ef8c8] ll_getxattr_common at ffffffffa0c942c4 [lustre]
Here the client had taken a lock on the test directory (common to all clients), which is the granted lock we find on the MDS. But the client cannot release that lock, because higher up the stack it is trying to take an MDS_INODELOCK_XATTR lock (which cannot be handled by the MDS since all its threads are busy). So we are stuck:
#17 [ffff88017d4efbe0] ll_lookup_nd at ffffffffa0c8c6dd [lustre]
PID: 8826 TASK: ffff88072b7d0000 CPU: 6 COMMAND: "stat"
We also see that invoking d_instantiate (selinux_d_instantiate) adds an entry to the dentry. |
| Comment by Bruno Faccini (Inactive) [ 08/Feb/17 ] |
|
John, J-B, would it be possible to test this on-site with your reproducer + test platform? The patch should look like this:
bfaccini-mac02:lustre-release bfaccini$ git status
On branch CEA-2+LU-5560
bfaccini-mac02:lustre-release bfaccini$ git show
commit 779f62ee53449049c92ceb12987487597ed302c7
Author: Bruno Faccini <bruno.faccini@intel.com>
Date: Tue Feb 7 22:41:30 2017 +0100
LU-5560 llite: put security.selinux in xattr cache
Do not filter out security.selinux from xattr cache because if not,
it will need to be requested from MDT during each path component
lookup and this with current directory lock granted, finally causing
a blocked situation, if there are a lot of similar access from
multiple clients, which may lead to all mdt service threads to be
busy waiting for these locks reclaim and thus unable to serve the
getxattr for these lock owners.
Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
Change-Id: I3e48c5cd4cdb0a9259241853784aca67305f4127
diff --git a/lustre/llite/xattr.c b/lustre/llite/xattr.c
index ca0e5c9..5293833 100644
--- a/lustre/llite/xattr.c
+++ b/lustre/llite/xattr.c
@@ -364,10 +364,7 @@ int ll_getxattr_common(struct inode *inode, const char *name,
#endif
do_getxattr:
- if (sbi->ll_xattr_cache_enabled &&
- xattr_type != XATTR_ACL_ACCESS_T &&
- (xattr_type != XATTR_SECURITY_T ||
- strcmp(name, "security.selinux") != 0)) {
+ if (sbi->ll_xattr_cache_enabled && xattr_type != XATTR_ACL_ACCESS_T) {
rc = ll_xattr_cache_get(inode, name, buffer, size, valid);
if (rc == -EAGAIN)
goto getxattr_nocache;
diff --git a/lustre/llite/xattr_cache.c b/lustre/llite/xattr_cache.c
index 18a289b..0f50426 100644
--- a/lustre/llite/xattr_cache.c
+++ b/lustre/llite/xattr_cache.c
@@ -446,10 +446,6 @@ static int ll_xattr_cache_refill(struct inode *inode)
CDEBUG(D_CACHE, "not caching %s\n",
XATTR_NAME_ACL_ACCESS);
rc = 0;
- } else if (!strcmp(xdata, "security.selinux")) {
- /* Filter out security.selinux, it is cached in slab */
- CDEBUG(D_CACHE, "not caching security.selinux\n");
- rc = 0;
} else {
rc = ll_xattr_cache_add(&lli->lli_xattrs, xdata, xval,
*xsizes);
bfaccini-mac02:lustre-release bfaccini$
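If this patch is applied, a quick sanity check on a client could look like the following (a hedged sketch: llite.*.xattr_cache is the usual tunable for the client xattr cache; the caching behavior would still need to be confirmed on-site):
# Confirm the client-side xattr cache is enabled (1 = enabled).
lctl get_param llite.*.xattr_cache

# Read the security context twice; with the patch the second read should be
# served from the local xattr cache instead of a new getxattr RPC to the MDT.
getfattr -n security.selinux /lustre/testfs/testuser/testdir
getfattr -n security.selinux /lustre/testfs/testuser/testdir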
|
| Comment by Bruno Faccini (Inactive) [ 13/Feb/17 ] |
|
Talking with S. Buisson about this, he also pointed to his patch (https://review.whamcloud.com/24426) for
J-B will give both possible solutions a try, expose them to the known reproducer, and let us know ASAP about his results. |
| Comment by Jean-Baptiste Riaux (Inactive) [ 14/Feb/17 ] |
|
Tested both patches yesterday; the problem remains. However, with Bruno's patch the stack is now different than before, and Sebastien's patch did improve the test speed a bit. The test evicted 2 or 3 clients at each run, but every run ended after 5 minutes (Lustre unmounted and caches dropped between runs), so it is a bit better. Before this patch we could wait 2 hours and Lustre was not recovering properly. If I don't drop caches, the test ends in less than 3 seconds after a first run. Also, for the 2 or 3 clients that failed I see this output: stat: failed to get security context of '-removed-/testdir': Input/output error. I merged both patches today. New stacks on the clients look like this:
PID: 8790 TASK: ffff88072eb4de20 CPU: 0 COMMAND: "stat"
#0 [ffff880725a77270] __schedule at ffffffff8168b175
#1 [ffff880725a772d8] schedule at ffffffff8168b7c9
#2 [ffff880725a772e8] schedule_timeout at ffffffff81689155
#3 [ffff880725a77390] ptlrpc_set_wait at ffffffffa09ce190 [ptlrpc]
#4 [ffff880725a77438] ptlrpc_queue_wait at ffffffffa09ce64d [ptlrpc]
#5 [ffff880725a77458] ldlm_cli_enqueue at ffffffffa09aa856 [ptlrpc]
#6 [ffff880725a77500] mdc_enqueue_base at ffffffffa0bf98ef [mdc]
#7 [ffff880725a775f8] mdc_intent_lock at ffffffffa0bfc57b [mdc]
#8 [ffff880725a776c0] lmv_intent_lock at ffffffffa0c2d0af [lmv]
#9 [ffff880725a77778] ll_xattr_cache_refill at ffffffffa0c9cf9b [lustre]
#10 [ffff880725a77858] ll_xattr_cache_get at ffffffffa0c9e33b [lustre]
#11 [ffff880725a778b0] ll_getxattr_common at ffffffffa0c9aaf8 [lustre]
#12 [ffff880725a77930] ll_getxattr at ffffffffa0c9b4c3 [lustre]
#13 [ffff880725a77968] inode_doinit_with_dentry at ffffffff812b281f
#14 [ffff880725a779c0] selinux_d_instantiate at ffffffff812b38ec
#15 [ffff880725a779d0] security_d_instantiate at ffffffff812a819b
#16 [ffff880725a779e0] d_instantiate at ffffffff812162b0
#17 [ffff880725a77a08] ll_splice_alias at ffffffffa0c9004a [lustre]
#18 [ffff880725a77a48] ll_lookup_it_finish at ffffffffa0c902e3 [lustre]
#19 [ffff880725a77b10] ll_lookup_it at ffffffffa0c91b1e [lustre]
#20 [ffff880725a77bc8] ll_lookup_nd at ffffffffa0c9342e [lustre]
#21 [ffff880725a77c60] lookup_real at ffffffff812083dd
#22 [ffff880725a77c80] __lookup_hash at ffffffff81208d52
#23 [ffff880725a77cb0] lookup_slow at ffffffff816833fb
#24 [ffff880725a77ce8] path_lookupat at ffffffff8120c273
#25 [ffff880725a77d80] filename_lookup at ffffffff8120c2cb
#26 [ffff880725a77db8] user_path_at_empty at ffffffff8120fce7
#27 [ffff880725a77e88] user_path_at at ffffffff8120fd51
#28 [ffff880725a77e98] vfs_fstatat at ffffffff812031c3
#29 [ffff880725a77ee8] SYSC_newlstat at ffffffff81203791
#30 [ffff880725a77f70] sys_newlstat at ffffffff81203a1e
#31 [ffff880725a77f80] system_call_fastpath at ffffffff81696709
PID: 8773 TASK: ffff880734636dd0 CPU: 5 COMMAND: "touch"
#0 [ffff88072e66f280] __schedule at ffffffff8168b175
#1 [ffff88072e66f2e8] schedule at ffffffff8168b7c9
#2 [ffff88072e66f2f8] schedule_timeout at ffffffff81689155
#3 [ffff88072e66f3a8] ptlrpc_set_wait at ffffffffa09c4190 [ptlrpc]
#4 [ffff88072e66f450] ptlrpc_queue_wait at ffffffffa09c464d [ptlrpc]
#5 [ffff88072e66f470] ldlm_cli_enqueue at ffffffffa09a0856 [ptlrpc]
#6 [ffff88072e66f518] mdc_enqueue_base at ffffffffa0bef8ef [mdc]
#7 [ffff88072e66f610] mdc_intent_lock at ffffffffa0bf257b [mdc]
#8 [ffff88072e66f6d8] lmv_intent_lock at ffffffffa0c230af [lmv]
#9 [ffff88072e66f790] ll_xattr_cache_refill at ffffffffa0c92f9b [lustre]
#10 [ffff88072e66f870] ll_xattr_cache_get at ffffffffa0c9433b [lustre]
#11 [ffff88072e66f8c8] ll_getxattr_common at ffffffffa0c90af8 [lustre]
#12 [ffff88072e66f948] ll_getxattr at ffffffffa0c914c3 [lustre]
#13 [ffff88072e66f980] inode_doinit_with_dentry at ffffffff812b281f
#14 [ffff88072e66f9d8] selinux_d_instantiate at ffffffff812b38ec
#15 [ffff88072e66f9e8] security_d_instantiate at ffffffff812a819b
#16 [ffff88072e66f9f8] d_instantiate at ffffffff812162b0
#17 [ffff88072e66fa20] ll_splice_alias at ffffffffa0c8604a [lustre]
#18 [ffff88072e66fa60] ll_lookup_it_finish at ffffffffa0c862e3 [lustre]
#19 [ffff88072e66fb28] ll_lookup_it at ffffffffa0c87b1e [lustre]
#20 [ffff88072e66fbe0] ll_lookup_nd at ffffffffa0c893fd [lustre]
#21 [ffff88072e66fc78] lookup_real at ffffffff812083dd
#22 [ffff88072e66fc98] __lookup_hash at ffffffff81208d52
#23 [ffff88072e66fcc8] lookup_slow at ffffffff816833fb
#24 [ffff88072e66fd00] link_path_walk at ffffffff8120b96f
#25 [ffff88072e66fdb0] path_openat at ffffffff8120dc53
#26 [ffff88072e66fe48] do_filp_open at ffffffff8120fdeb
#27 [ffff88072e66ff18] do_sys_open at ffffffff811fd2f3
#28 [ffff88072e66ff70] sys_open at ffffffff811fd40e
#29 [ffff88072e66ff80] system_call_fastpath at ffffffff81696709
On the MDS, no surprises (all stacks are the same):
PID: 10468 TASK: ffff88070d7a4e70 CPU: 3 COMMAND: "mdt00_083"
#0 [ffff880705657750] __schedule at ffffffff8168b175
#1 [ffff8807056577b8] schedule at ffffffff8168b7c9
#2 [ffff8807056577c8] schedule_timeout at ffffffff81689155
#3 [ffff880705657878] ldlm_completion_ast at ffffffffa0a4cc71 [ptlrpc]
#4 [ffff880705657918] ldlm_cli_enqueue_local at ffffffffa0a4d9c0 [ptlrpc]
#5 [ffff8807056579b0] mdt_object_local_lock at ffffffffa0fe578a [mdt]
#6 [ffff880705657a60] mdt_object_lock_internal at ffffffffa0fe5dbe [mdt]
#7 [ffff880705657ab0] mdt_getattr_name_lock at ffffffffa0fe6b64 [mdt]
#8 [ffff880705657b38] mdt_intent_getattr at ffffffffa0fe7e70 [mdt]
#9 [ffff880705657b78] mdt_intent_policy at ffffffffa0feba8c [mdt]
#10 [ffff880705657bd0] ldlm_lock_enqueue at ffffffffa0a32107 [ptlrpc]
#11 [ffff880705657c28] ldlm_handle_enqueue0 at ffffffffa0a5b883 [ptlrpc]
#12 [ffff880705657cb8] tgt_enqueue at ffffffffa0adc502 [ptlrpc]
#13 [ffff880705657cd8] tgt_request_handle at ffffffffa0ae0875 [ptlrpc]
#14 [ffff880705657d20] ptlrpc_server_handle_request at ffffffffa0a8c73b [ptlrpc]
#15 [ffff880705657de8] ptlrpc_main at ffffffffa0a907f0 [ptlrpc]
#16 [ffff880705657ec8] kthread at ffffffff810b052f
#17 [ffff880705657f50] ret_from_fork at ffffffff81696658
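For reference, a hedged sketch of the between-run cleanup mentioned above (unmount plus cache drop on every client); the mount point and MGS nid are illustrative:
# Run on each client between reproducer runs, so every run starts with cold
# dentry/inode/xattr state.
umount /lustre/testfs
echo 3 > /proc/sys/vm/drop_caches
mount -t lustre vm0@tcp:/testfs /lustre/testfs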
|
| Comment by Bruno Faccini (Inactive) [ 23/Mar/17 ] |
|
We should be able to test a possible solution (embed the security context/xattr in the lookup reply to avoid the extra getxattr() from the security hooks, and thus fix the dead-lock) on-site next week. |
| Comment by Sebastien Piechurski [ 18/Apr/17 ] |
|
Hi Bruno, JB, any news from the test mentioned in your last comment? |
| Comment by Bruno Faccini (Inactive) [ 18/Apr/17 ] |
|
Hello Seb, |
| Comment by Gerrit Updater [ 26/Apr/17 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/26831 |
| Comment by Jean-Baptiste Riaux (Inactive) [ 03/Jul/17 ] |
|
Ticket status:
|
| Comment by Jean-Baptiste Riaux (Inactive) [ 05/Jul/17 ] |
|
New logs: removed unnecessary OSS logs from the archive (file size too big). logs_050717.tar.gz |
| Comment by Jean-Baptiste Riaux (Inactive) [ 24/Aug/17 ] |
|
New test today with the RMF_F_NO_SIZE_CHECK flag used for the file_secctx field:
struct req_msg_field RMF_FILE_SECCTX =
DEFINE_MSGF("file_secctx", RMF_F_NO_SIZE_CHECK, -1, NULL, NULL);
EXPORT_SYMBOL(RMF_FILE_SECCTX);
The error messages from the previous comment are gone, but all MDS threads are consumed and the load is at 0 (as before). |
| Comment by Jean-Baptiste Riaux (Inactive) [ 18/Sep/17 ] |
|
Patchset-11 tested. Problem remains (MDS load at 0 and all threads used) and this time I got eviction (see vm19 logs). |
| Comment by Bruno Faccini (Inactive) [ 19/Sep/17 ] |
|
Hum, having a look at all the hung client task stacks, it no longer looks like we are facing a dead-lock scenario due to unexpected getxattr() of the security xattr/context. On the other hand, since the full MDS thread-stacks log is missing from the tarball, I can't analyze further. I will need to go on-site and analyze live with J-B, but it looks like the original dead-lock issue is over and that we are facing a new issue. |
| Comment by Bruno Faccini (Inactive) [ 03/Oct/17 ] |
|
Well, as expected after my previous comment, this new pseudo-hang/dead-lock situation can be reproduced by running the latest Lustre master on RHEL7.2 with SELinux disabled... A lot of the reproducer/client stat+touch threads are stuck waiting for the lock on the directory to be granted, with stacks like the following:
PID: 12292 TASK: ffff8806b1ee4e70 CPU: 4 COMMAND: "stat"
#0 [ffff8806c251f820] __schedule at ffffffff8168b175
#1 [ffff8806c251f888] schedule at ffffffff8168b7c9
#2 [ffff8806c251f898] schedule_timeout at ffffffff81689155
#3 [ffff8806c251f940] ptlrpc_set_wait at ffffffffa09f8030 [ptlrpc]
#4 [ffff8806c251f9e8] ptlrpc_queue_wait at ffffffffa09f84ed [ptlrpc]
#5 [ffff8806c251fa08] ldlm_cli_enqueue at ffffffffa09d46c6 [ptlrpc]
#6 [ffff8806c251fab0] mdc_enqueue_base at ffffffffa0c3fa0f [mdc]
#7 [ffff8806c251fba8] mdc_intent_lock at ffffffffa0c4186b [mdc]
#8 [ffff8806c251fc70] lmv_intent_lock at ffffffffa0c75f1f [lmv]
#9 [ffff8806c251fd28] __ll_inode_revalidate at ffffffffa0cb28db [lustre]
#10 [ffff8806c251fdd0] ll_getattr at ffffffffa0cb3793 [lustre]
#11 [ffff8806c251fe68] vfs_getattr at ffffffff812030a6
#12 [ffff8806c251fe98] vfs_fstatat at ffffffff812031d5
#13 [ffff8806c251fee8] SYSC_newlstat at ffffffff81203791
#14 [ffff8806c251ff70] sys_newlstat at ffffffff81203a1e
#15 [ffff8806c251ff80] system_call_fastpath at ffffffff81696709
RIP: 00007f4ae45d0eb5 RSP: 00007ffe8a35c2e8 RFLAGS: 00000246
RAX: 0000000000000006 RBX: ffffffff81696709 RCX: ffffffffffffffff
RDX: 00007ffe8a35c390 RSI: 00007ffe8a35c390 RDI: 00007ffe8a35ce35
RBP: 0000000001169460 R8: 0000000000000001 R9: 00007ffe8a35ce35
R10: 00007ffe8a35c0e0 R11: 0000000000000246 R12: ffffffff81203a1e
R13: ffff8806c251ff78 R14: 0000000000000002 R15: 00000000298134ed
ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b
PID: 12290 TASK: ffff8806c32a0fb0 CPU: 2 COMMAND: "touch"
#0 [ffff8806c27a75a0] __schedule at ffffffff8168b175
#1 [ffff8806c27a7608] schedule at ffffffff8168b7c9
#2 [ffff8806c27a7618] schedule_timeout at ffffffff81689155
#3 [ffff8806c27a76c0] ptlrpc_set_wait at ffffffffa09f8030 [ptlrpc]
#4 [ffff8806c27a7768] ptlrpc_queue_wait at ffffffffa09f84ed [ptlrpc]
#5 [ffff8806c27a7788] ldlm_cli_enqueue at ffffffffa09d46c6 [ptlrpc]
#6 [ffff8806c27a7830] mdc_enqueue_base at ffffffffa0c3fa0f [mdc]
#7 [ffff8806c27a7928] mdc_intent_lock at ffffffffa0c4186b [mdc]
#8 [ffff8806c27a79f0] lmv_intent_lock at ffffffffa0c7655d [lmv]
#9 [ffff8806c27a7aa8] ll_lookup_it at ffffffffa0cd9fa1 [lustre]
#10 [ffff8806c27a7c60] ll_atomic_open at ffffffffa0cdad9d [lustre]
#11 [ffff8806c27a7d00] do_last at ffffffff8120d321
#12 [ffff8806c27a7db0] path_openat at ffffffff8120dc72
#13 [ffff8806c27a7e48] do_filp_open at ffffffff8120fdeb
#14 [ffff8806c27a7f18] do_sys_open at ffffffff811fd2f3
#15 [ffff8806c27a7f70] sys_open at ffffffff811fd40e
#16 [ffff8806c27a7f80] system_call_fastpath at ffffffff81696709
RIP: 00007f7f2c5fc2b0 RSP: 00007ffc0d5338c8 RFLAGS: 00010287
RAX: 0000000000000002 RBX: ffffffff81696709 RCX: 0000000000000014
RDX: 00000000000001b6 RSI: 0000000000000941 RDI: 00007ffc0d534e2d
RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000
R10: 00007ffc0d533790 R11: 0000000000000246 R12: ffffffff811fd40e
R13: ffff8806c27a7f78 R14: 0000000000000000 R15: 0000000000000001
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
while all MDT service threads are busy/blocked with stacks similar to the following:
PID: 11386 TASK: ffff880688ed0000 CPU: 5 COMMAND: "mdt00_121"
#0 [ffff880688ecb660] __schedule at ffffffff8163b6cd
#1 [ffff880688ecb6c8] schedule at ffffffff8163bd69
#2 [ffff880688ecb6d8] schedule_timeout at ffffffff816399c5
#3 [ffff880688ecb788] ldlm_completion_ast at ffffffffa0a43fa1 [ptlrpc]
#4 [ffff880688ecb828] ldlm_cli_enqueue_local at ffffffffa0a45be0 [ptlrpc]
#5 [ffff880688ecb8c0] mdt_object_local_lock at ffffffffa10120a2 [mdt]
#6 [ffff880688ecb968] mdt_object_lock_internal at ffffffffa10124bb [mdt]
#7 [ffff880688ecb9a8] mdt_object_find_lock at ffffffffa10149e2 [mdt]
#8 [ffff880688ecb9d8] mdt_reint_open at ffffffffa1036c1b [mdt]
#9 [ffff880688ecbac8] mdt_reint_rec at ffffffffa102bb50 [mdt]
#10 [ffff880688ecbaf0] mdt_reint_internal at ffffffffa100d7f1 [mdt]
#11 [ffff880688ecbb28] mdt_intent_reint at ffffffffa100db32 [mdt]
#12 [ffff880688ecbb68] mdt_intent_policy at ffffffffa101825c [mdt]
#13 [ffff880688ecbbd0] ldlm_lock_enqueue at ffffffffa0a2b1f7 [ptlrpc]
#14 [ffff880688ecbc28] ldlm_handle_enqueue0 at ffffffffa0a50f72 [ptlrpc]
#15 [ffff880688ecbcb8] tgt_enqueue at ffffffffa0ad58f2 [ptlrpc]
#16 [ffff880688ecbcd8] tgt_request_handle at ffffffffa0ada2f5 [ptlrpc]
#17 [ffff880688ecbd20] ptlrpc_server_handle_request at ffffffffa0a84ccb [ptlrpc]
#18 [ffff880688ecbde8] ptlrpc_main at ffffffffa0a885f0 [ptlrpc]
#19 [ffff880688ecbec8] kthread at ffffffff810a5b8f
#20 [ffff880688ecbf50] ret_from_fork at ffffffff81646c98
PID: 11387 TASK: ffff880688ed2280 CPU: 6 COMMAND: "mdt00_122"
#0 [ffff88068942f760] __schedule at ffffffff8163b6cd
#1 [ffff88068942f7c8] schedule at ffffffff8163bd69
#2 [ffff88068942f7d8] schedule_timeout at ffffffff816399c5
#3 [ffff88068942f880] ldlm_completion_ast at ffffffffa0a43fa1 [ptlrpc]
#4 [ffff88068942f920] ldlm_cli_enqueue_local at ffffffffa0a45be0 [ptlrpc]
#5 [ffff88068942f9b8] mdt_object_local_lock at ffffffffa1011b92 [mdt]
#6 [ffff88068942fa60] mdt_object_lock_internal at ffffffffa10124bb [mdt]
#7 [ffff88068942faa0] mdt_getattr_name_lock at ffffffffa10132c6 [mdt]
#8 [ffff88068942fb28] mdt_intent_getattr at ffffffffa10147c0 [mdt]
#9 [ffff88068942fb68] mdt_intent_policy at ffffffffa101825c [mdt]
#10 [ffff88068942fbd0] ldlm_lock_enqueue at ffffffffa0a2b1f7 [ptlrpc]
#11 [ffff88068942fc28] ldlm_handle_enqueue0 at ffffffffa0a50f72 [ptlrpc]
#12 [ffff88068942fcb8] tgt_enqueue at ffffffffa0ad58f2 [ptlrpc]
#13 [ffff88068942fcd8] tgt_request_handle at ffffffffa0ada2f5 [ptlrpc]
#14 [ffff88068942fd20] ptlrpc_server_handle_request at ffffffffa0a84ccb [ptlrpc]
#15 [ffff88068942fde8] ptlrpc_main at ffffffffa0a885f0 [ptlrpc]
#16 [ffff88068942fec8] kthread at ffffffff810a5b8f
#17 [ffff88068942ff50] ret_from_fork at ffffffff81646c98
So my current feeling is that the original problem/dead-lock for this ticket should now be fixed by my current patch, but we may be facing a regression in the latest Lustre master. We will try to better qualify this newly suspected problem with more on-site debugging, using new reproducer runs without SELinux this time. We will also try to expose the same platform to other CEA metadata regression tests. |
| Comment by Bruno Faccini (Inactive) [ 19/Nov/17 ] |
|
By the way, it seems to me that this last pseudo-hang/dead-lock scenario looks very similar to |
| Comment by Dominique Martinet (Inactive) [ 20/Nov/17 ] |
|
It is close, yes. It looks like mdt_reint_open() also takes an MDS_INODELOCK_UPDATE lock on the parent object before doing a lookup (and a create if the lookup fails). In |
| Comment by Bruno Faccini (Inactive) [ 15/Jan/18 ] |
|
The last dead-lock situation encountered in this ticket has nothing to do with the original issue that this ticket was created for, but seems to be more generic and is being addressed in both |
| Comment by Andreas Dilger [ 26/Jan/18 ] |
|
Bruno, unless you think your current patch is fundamentally moving in the wrong direction (I haven't looked at it yet), I think it makes sense to land the current patch that solves the first problem, then make a second patch to address this next problem. At least we are moving in the right direction to fix the first problem, which is what the application was originally hitting, and possibly the second problem you are able to reproduce is not actually being seen by applications, so we have more time to work on a solution. |
| Comment by Gerrit Updater [ 01/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/26831/ |
| Comment by Peter Jones [ 01/Apr/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 02/Apr/19 ] |
|
Sebastien Piechurski (sebastien.piechurski@atos.net) uploaded a new patch: https://review.whamcloud.com/34573 |
| Comment by Gerrit Updater [ 21/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34573/ |
| Comment by Mikhail Pershin [ 21/Apr/19 ] |
|
This patch somehow causes frequent client reconnects:
I am seeing many such messages during a dbench run, and git bisect shows this patch is the source of the problems. |
| Comment by Mikhail Pershin [ 21/Apr/19 ] |
|
Created |
| Comment by Gerrit Updater [ 02/Feb/21 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41387 |
| Comment by Gerrit Updater [ 04/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41387/ |