[LU-9193] Multiple hangs observed with many open/getattr Created: 08/Dec/16  Updated: 14/Dec/21  Resolved: 01/Apr/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.5.3, Lustre 2.8.0, Lustre 2.9.0
Fix Version/s: Lustre 2.13.0, Lustre 2.12.7

Type: Bug Priority: Critical
Reporter: Jean-Baptiste Riaux (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Centos 7.2
Centos 6.[7-8]
SELinux enforcing


Attachments: File LU-9193-patchset8.tar.gz     File LU-9193.tar.gz     File logs-9193-patchset11-tests.tar.gz     File logs_050717.tar.gz     File lustre-LU9193-240817.tar.gz     File lustre-logs-210617.tar.gz     File lustre-logs.tar.gz     File vm0.tar.gz     File vm105.tar.gz     File vm62.tar.gz    
Issue Links:
Related
is related to LU-12212 Often requests timeouts during dbench... Resolved
is related to LU-10262 Lock contention when doing creates fo... Resolved
is related to LU-5560 SELinux support on the client side Resolved
is related to LU-10235 mkdir should check for directory exis... Resolved
is related to LU-6784 Defects in SELinux support Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Tested (reproduced) on 2.5, 2.7, 2.8 and 2.9

MPI job on 300 nodes: 2/3 open and 1/3 stat on the same file => hang. (The MDS server threads are idle and the load is close to 0; threads are waiting for a lock, but no thread holds an active lock. After a long time, 15-30 minutes, the threads become responsive again and resume operations normally.)
Same job with only stat => no problem
Same job with only open => no problem

Some of the logs were similar to LU-5497 and LU-4579 but patches did not fix the issue.
If all of the job's clients are evicted manually, Lustre recovers and resumes normal operation.
Lustre 2.7.2 with the LU-5781 patch (solve a race for LRU lock cancel) was also tested.
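
For reference, manual eviction can be done from the MDS via the evict_client interface; a minimal sketch using standard lctl parameters, with the MDT target wildcarded and the client UUID as a placeholder:

  # On the MDS: list client export UUIDs, then evict one (repeat per client).
  lctl get_param mdt.*.exports.*.uuid
  lctl set_param mdt.*.evict_client=<client_uuid>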

So far, the only thing that prevents the issue is disabling SELinux labeling for Lustre:

  $ cat policy-noxattr-lustre.patch
  --- serefpolicy-3.13.1/policy/modules/kernel/filesystem.te.orig 2016-08-02 19:56:29.997519918 +0000
  +++ serefpolicy-3.13.1/policy/modules/kernel/filesystem.te 2016-08-02 19:57:10.124519918 +0000
  @@ -32,7 +32,8 @@ fs_use_xattr gfs2 gen_context(system_u:o
   fs_use_xattr gpfs gen_context(system_u:object_r:fs_t,s0);
   fs_use_xattr jffs2 gen_context(system_u:object_r:fs_t,s0);
   fs_use_xattr jfs gen_context(system_u:object_r:fs_t,s0);
  -fs_use_xattr lustre gen_context(system_u:object_r:fs_t,s0);
  +# Lustre is not supported Selinux correctly
  +#fs_use_xattr lustre gen_context(system_u:object_r:fs_t,s0);
   fs_use_xattr ocfs2 gen_context(system_u:object_r:fs_t,s0);
   fs_use_xattr overlay gen_context(system_u:object_r:fs_t,s0);
   fs_use_xattr xfs gen_context(system_u:object_r:fs_t,s0);

Reproducer (127 clients: vm3 to vm130, vm0 is MDS, vm1 and 2 are OSS):
mkdir /lustre/testfs/testuser/testdir; sleep 4; clush -bw vm[3-130] 'seq 0 1000 | xargs -P 7 -I{} sh -c "(({}%3==0)) && touch /lustre/testfs/testuser/testdir/foo$(hostname -s | tr -d vm) || stat /lustre/testfs/testuser/testdir > /dev/null"'
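
For readability, the clush one-liner above is roughly equivalent to running the following on every client in parallel (a sketch for clarity, not a verbatim copy of the job):

  # 7 workers per client over 1001 iterations; every iteration divisible by 3
  # touches a per-host file in the shared test directory, the others stat the
  # directory itself, mixing create and getattr traffic on the same parent.
  host=$(hostname -s | tr -d vm)
  seq 0 1000 | xargs -P 7 -I{} sh -c \
      "(({}%3==0)) && touch /lustre/testfs/testuser/testdir/foo$host \
                   || stat /lustre/testfs/testuser/testdir > /dev/null"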

Disabling statahead was also tested; it had no impact.
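
For reference, statahead is disabled per client mount through the llite tunable; a minimal sketch, assuming the standard parameter name:

  # On a client: 0 disables statahead; note the previous value to restore it.
  lctl get_param llite.*.statahead_max
  lctl set_param llite.*.statahead_max=0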

Traces of stuck processes on the MDS all look the same (this could be related to DDN-366):
8631 TASK: ffff880732202280 CPU:
[ffff88071587f760] __schedule at ffffffff8163b6cd
[ffff88071587f7c8] schedule at ffffffff8163bd69
[ffff88071587f7d8] schedule_timeout at ffffffff816399c5
[ffff88071587f880] ldlm_completion_ast at ffffffffa08b7fe1
[ffff88071587f920] ldlm_cli_enqueue_local at ffffffffa08b9c20
[ffff88071587f9b8] mdt_object_local_lock at ffffffffa0f3c6d2
[ffff88071587fa60] mdt_object_lock_internal at ffffffffa0f3cffb
[ffff88071587faa0] mdt_getattr_name_lock at ffffffffa0f3ddf6
[ffff88071587fb28] mdt_intent_getattr at ffffffffa0f3f3f0
[ffff88071587fb68] mdt_intent_policy at ffffffffa0f42e7c
[ffff88071587fbd0] ldlm_lock_enqueue at ffffffffa089f1f7
[ffff88071587fc28] ldlm_handle_enqueue0 at ffffffffa08c4fb2
[ffff88071587fcb8] tgt_enqueue at ffffffffa09493f2
[ffff88071587fcd8] tgt_request_handle at ffffffffa094ddf5
[ffff88071587fd20] ptlrpc_server_handle_request at ffffffffa08f87cb
[ffff88071587fde8] ptlrpc_main at ffffffffa08fc0f0
[ffff88071587fec8] kthread at ffffffff810a5b8f
[ffff88071587ff50] ret_from_fork at ffffffff81646c98

I do not yet have the client traces, but the last call on the clients is mdc_enqueue.

Crash dump analysis has started (in progress). So far nothing obvious points to SELinux; the first analysis leads to ldlm_handle_enqueue0() and ldlm_lock_enqueue(). We see some processes idle for 280s; the system is not entirely stuck, but very, very slow (we had to crash the VM to get a proper dump, because it was very hard to use crash while the threads are not 100% stuck).
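
For reference, the dump analysis is done with the crash utility against the MDS vmcore; a minimal sketch of the commands typically used to collect this kind of data (paths and file names are illustrative):

  crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux vmcore
  crash> log > dmesg.out              # kernel ring buffer from the dump
  crash> foreach bt > foreach_bt.out  # stack traces of all tasks
  crash> bt <pid>                     # stack of a single stuck thread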



 Comments   
Comment by Bruno Faccini (Inactive) [ 09/Dec/16 ]

I have assigned this to myself since I have started working on it with JB on-site.

Comment by Jean-Baptiste Riaux (Inactive) [ 13/Dec/16 ]

I could reproduce the problem yesterday on-site with Lustre 2.9 (containing the suggested LU-5560 patch).
I am preparing a custom build with more debug information on locks in some ldlm functions.
I will test it tomorrow on-site; combined with the crash debugging done with Bruno last week, I should have a precise idea of what is going on by tomorrow evening.

Comment by Bruno Faccini (Inactive) [ 05/Jan/17 ]

During a joint on-site analysis with JB (where we crashed all client/server nodes of our VM-based test cluster at the same time, trying to get a frozen picture of the blocked situation), we found that, due to SELinux, each inode/dentry looked up during path walking (namei) needs to fetch and check the security.selinux xattr, and this seems to interfere strongly when concurrent accesses occur from the same and different nodes, as with CEA's identified reproducer (a mix of concurrent stat/create/setattr from >100 nodes, as far as I can understand from the reproducer described in this ticket).

We have still not finished the analysis, but it seems that for a particular MDT object a lot of mdt threads are waiting to grant clients read/write locks, while some of the currently granted read-lock owners (there are multiple granted locks) have been stuck on the client side for hundreds of seconds trying to getxattr the security.selinux xattr. On the MDT side we have discovered that these getxattr requests have been received/queued but are not being handled.

This is where we are at the moment, but there is more to come. JB, feel free to add to or correct me.

Comment by John Hammond [ 12/Jan/17 ]

JB, could you attach the dmesg and the output of 'foreach bt' from the MDS crash image?

Comment by Bruno Faccini (Inactive) [ 13/Jan/17 ]

Here is more detailed information, based on the data that we have been allowed to extract and use in this ticket.

We have continued the analysis of the pseudo-hang on-site, and we have found that the getxattr() request for the security.selinux EA could not be handled by the MDS/MDT because 149 of the 150 threads serving the "mdt" service are stuck with one of the following two stacks:

PID: 11489  TASK: ffff880714dc0000  CPU: 2   COMMAND: "mdt00_066"
 #0 [ffff88070a433680] __schedule at ffffffff8168b175
 #1 [ffff88070a4336e8] schedule at ffffffff8168b7c9
 #2 [ffff88070a4336f8] schedule_timeout at ffffffff81689155
 #3 [ffff88070a4337a0] ldlm_completion_ast at ffffffffa09b3791 [ptlrpc]
 #4 [ffff88070a433840] ldlm_cli_enqueue_local at ffffffffa09b44e0 [ptlrpc]
 #5 [ffff88070a4338d8] mdt_object_local_lock at ffffffffa0e717d2 [mdt]
 #6 [ffff88070a433988] mdt_object_lock_internal at ffffffffa0e71ebe [mdt]
 #7 [ffff88070a4339d8] mdt_object_lock at ffffffffa0e72174 [mdt]
 #8 [ffff88070a4339e8] mdt_reint_open at ffffffffa0e93a11 [mdt]
 #9 [ffff88070a433ad8] mdt_reint_rec at ffffffffa0e8a490 [mdt]
#10 [ffff88070a433b00] mdt_reint_internal at ffffffffa0e6cdb2 [mdt]
#11 [ffff88070a433b38] mdt_intent_reint at ffffffffa0e6d312 [mdt]
#12 [ffff88070a433b78] mdt_intent_policy at ffffffffa0e77b8c [mdt]
#13 [ffff88070a433bd0] ldlm_lock_enqueue at ffffffffa09991e7 [ptlrpc]
#14 [ffff88070a433c28] ldlm_handle_enqueue0 at ffffffffa09c23a3 [ptlrpc]
#15 [ffff88070a433cb8] tgt_enqueue at ffffffffa0a42e12 [ptlrpc]
#16 [ffff88070a433cd8] tgt_request_handle at ffffffffa0a47275 [ptlrpc]
#17 [ffff88070a433d20] ptlrpc_server_handle_request at ffffffffa09f31fb [ptlrpc]
#18 [ffff88070a433de8] ptlrpc_main at ffffffffa09f72b0 [ptlrpc]
#19 [ffff88070a433ec8] kthread at ffffffff810b052f
#20 [ffff88070a433f50] ret_from_fork at ffffffff81696658

or

PID: 11490  TASK: ffff880714dc1f60  CPU: 4   COMMAND: "mdt00_067"
 #0 [ffff88070a5ef750] __schedule at ffffffff8168b175
 #1 [ffff88070a5ef7b8] schedule at ffffffff8168b7c9
 #2 [ffff88070a5ef7c8] schedule_timeout at ffffffff81689155
 #3 [ffff88070a5ef878] ldlm_completion_ast at ffffffffa09b3791 [ptlrpc]
 #4 [ffff88070a5ef918] ldlm_cli_enqueue_local at ffffffffa09b44e0 [ptlrpc]
 #5 [ffff88070a5ef9b0] mdt_object_local_lock at ffffffffa0e7188a [mdt]
 #6 [ffff88070a5efa60] mdt_object_lock_internal at ffffffffa0e71ebe [mdt]
 #7 [ffff88070a5efab0] mdt_getattr_name_lock at ffffffffa0e72c64 [mdt]
 #8 [ffff88070a5efb38] mdt_intent_getattr at ffffffffa0e73f70 [mdt]
 #9 [ffff88070a5efb78] mdt_intent_policy at ffffffffa0e77b8c [mdt]
#10 [ffff88070a5efbd0] ldlm_lock_enqueue at ffffffffa09991e7 [ptlrpc]
#11 [ffff88070a5efc28] ldlm_handle_enqueue0 at ffffffffa09c23a3 [ptlrpc]
#12 [ffff88070a5efcb8] tgt_enqueue at ffffffffa0a42e12 [ptlrpc]
#13 [ffff88070a5efcd8] tgt_request_handle at ffffffffa0a47275 [ptlrpc]
#14 [ffff88070a5efd20] ptlrpc_server_handle_request at ffffffffa09f31fb [ptlrpc]
#15 [ffff88070a5efde8] ptlrpc_main at ffffffffa09f72b0 [ptlrpc]
#16 [ffff88070a5efec8] kthread at ffffffff810b052f
#17 [ffff88070a5eff50] ret_from_fork at ffffffff81696658

while the last/150th thread seems busy handling obd_ping traffic, as found in the debug trace extracted from the crash dump.

By the way, and as expected, during a new run/pseudo-hang of the reproducer, raising the maximum number of "mdt" service threads to 1024 allowed the MDS to quickly recover from its pseudo-hung state and finally handle all requests, so the reproducer completed in a very short time!
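
For reference, the mdt service thread cap can be raised at runtime with lctl; a minimal sketch using the standard tunable names, not a verbatim copy of the on-site commands:

  # On the MDS: check how many mdt service threads are started, then raise the cap.
  lctl get_param mds.MDS.mdt.threads_started
  lctl set_param mds.MDS.mdt.threads_max=1024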

The contended resource appears to be the main directory where the reproducer's files are located: /lustre/testfs/testuser/testdir/.

The client threads that currently hold a granted lock on the directory resource, and are now waiting for their getxattr() request to be serviced, have the following stack:

PID: 8825   TASK: ffff88072dee6dd0  CPU: 5   COMMAND: "touch"
 #0 [ffff88017d4ef5a0] __schedule at ffffffff8168b175
 #1 [ffff88017d4ef608] schedule at ffffffff8168b7c9
 #2 [ffff88017d4ef618] schedule_timeout at ffffffff81689155
 #3 [ffff88017d4ef6c8] ptlrpc_set_wait at ffffffffa09c8cd0 [ptlrpc]
 #4 [ffff88017d4ef770] ptlrpc_queue_wait at ffffffffa09c918d [ptlrpc]
 #5 [ffff88017d4ef790] mdc_xattr_common at ffffffffa0bd93aa [mdc]
 #6 [ffff88017d4ef810] mdc_getxattr at ffffffffa0bd962f [mdc]
 #7 [ffff88017d4ef850] lmv_getxattr at ffffffffa0c1660e [lmv]
 #8 [ffff88017d4ef8c8] ll_getxattr_common at ffffffffa0c942c4 [lustre]
 #9 [ffff88017d4ef948] ll_getxattr at ffffffffa0c94763 [lustre]
#10 [ffff88017d4ef980] inode_doinit_with_dentry at ffffffff812b281f
#11 [ffff88017d4ef9d8] selinux_d_instantiate at ffffffff812b38ec
#12 [ffff88017d4ef9e8] security_d_instantiate at ffffffff812a819b
#13 [ffff88017d4ef9f8] d_instantiate at ffffffff812162b0
#14 [ffff88017d4efa20] ll_splice_alias at ffffffffa0c8939a [lustre]
#15 [ffff88017d4efa60] ll_lookup_it_finish at ffffffffa0c89633 [lustre]
#16 [ffff88017d4efb28] ll_lookup_it at ffffffffa0c8ae6e [lustre]
#17 [ffff88017d4efbe0] ll_lookup_nd at ffffffffa0c8c6dd [lustre]
#18 [ffff88017d4efc78] lookup_real at ffffffff812083dd
#19 [ffff88017d4efc98] __lookup_hash at ffffffff81208d52
#20 [ffff88017d4efcc8] lookup_slow at ffffffff816833fb
#21 [ffff88017d4efd00] link_path_walk at ffffffff8120b96f
#22 [ffff88017d4efdb0] path_openat at ffffffff8120dc53
#23 [ffff88017d4efe48] do_filp_open at ffffffff8120fdeb
#24 [ffff88017d4eff18] do_sys_open at ffffffff811fd2f3
#25 [ffff88017d4eff70] sys_open at ffffffff811fd40e
#26 [ffff88017d4eff80] system_call_fastpath at ffffffff81696709
    RIP: 00007fbf1b1312b0  RSP: 00007ffe4c0891e8  RFLAGS: 00010287
    RAX: 0000000000000002  RBX: ffffffff81696709  RCX: 0000000000000014
    RDX: 00000000000001b6  RSI: 0000000000000941  RDI: 00007ffe4c08ae2d
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 00007ffe4c0890b0  R11: 0000000000000246  R12: ffffffff811fd40e
    R13: ffff88017d4eff78  R14: 0000000000000000  R15: 0000000000000001
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b

The other reproducer threads on the same clients have the following stacks, depending on whether they run a touch (3 per reproducer step) or a stat (4 per reproducer step) command:

PID: 8834   TASK: ffff880034592f10  CPU: 3   COMMAND: "touch"
 #0 [ffff88072e2ebbe0] __schedule at ffffffff8168b175
 #1 [ffff88072e2ebc48] schedule_preempt_disabled at ffffffff8168c8b9
 #2 [ffff88072e2ebc58] __mutex_lock_slowpath at ffffffff8168a515
 #3 [ffff88072e2ebcb0] mutex_lock at ffffffff8168997f
 #4 [ffff88072e2ebcc8] lookup_slow at ffffffff816833ec
 #5 [ffff88072e2ebd00] link_path_walk at ffffffff8120b96f
 #6 [ffff88072e2ebdb0] path_openat at ffffffff8120dc53
 #7 [ffff88072e2ebe48] do_filp_open at ffffffff8120fdeb
 #8 [ffff88072e2ebf18] do_sys_open at ffffffff811fd2f3
 #9 [ffff88072e2ebf70] sys_open at ffffffff811fd40e
#10 [ffff88072e2ebf80] system_call_fastpath at ffffffff81696709
    RIP: 00007fd7a10b22b0  RSP: 00007ffea7e57428  RFLAGS: 00010287
    RAX: 0000000000000002  RBX: ffffffff81696709  RCX: 0000000000000014
    RDX: 00000000000001b6  RSI: 0000000000000941  RDI: 00007ffea7e57e2b
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 00007ffea7e572f0  R11: 0000000000000246  R12: ffffffff811fd40e
    R13: ffff88072e2ebf78  R14: 0000000000000000  R15: 0000000000000001
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b

PID: 8832   TASK: ffff88073c370000  CPU: 0   COMMAND: "stat"
 #0 [ffff88072e2d7bc0] __schedule at ffffffff8168b175
 #1 [ffff88072e2d7c28] schedule_preempt_disabled at ffffffff8168c8b9
 #2 [ffff88072e2d7c38] __mutex_lock_slowpath at ffffffff8168a515
 #3 [ffff88072e2d7c98] mutex_lock at ffffffff8168997f
 #4 [ffff88072e2d7cb0] lookup_slow at ffffffff816833ec
 #5 [ffff88072e2d7ce8] path_lookupat at ffffffff8120c273
 #6 [ffff88072e2d7d80] filename_lookup at ffffffff8120c2cb
 #7 [ffff88072e2d7db8] user_path_at_empty at ffffffff8120fce7
 #8 [ffff88072e2d7e88] user_path_at at ffffffff8120fd51
 #9 [ffff88072e2d7e98] vfs_fstatat at ffffffff812031c3
#10 [ffff88072e2d7ee8] SYSC_newlstat at ffffffff81203791
#11 [ffff88072e2d7f70] sys_newlstat at ffffffff81203a1e
#12 [ffff88072e2d7f80] system_call_fastpath at ffffffff81696709
    RIP: 00007fdea1e6eeb5  RSP: 00007ffea2b024e8  RFLAGS: 00000246
    RAX: 0000000000000006  RBX: ffffffff81696709  RCX: ffffffffffffffff
    RDX: 00007ffea2b02590  RSI: 00007ffea2b02590  RDI: 00007ffea2b02e34
    RBP: 0000000002049460   R8: 0000000000000001   R9: 00007ffea2b02e34
    R10: 00007ffea2b022e0  R11: 0000000000000246  R12: ffffffff81203a1e
    R13: ffff88072e2d7f78  R14: 0000000000000002  R15: 00000000e406a798
    ORIG_RAX: 0000000000000006  CS: 0033  SS: 002b

I also need to detail the state of the reproducer's processes on the other clients/nodes, which are likely responsible for the requests being handled by the hung MDS threads described above.

I will also try to better understand whether the first touch/file creation (which requires a write lock on the directory to add the entry) is required to reach the pseudo-hung situation.

Based on all of this, and after a finer on-site crash-dump analysis of the current lock-request state from all clients, I will try to summarize in my next post our understanding of the exact sequence of requests/events leading to the pseudo-hung situation when running the reproducer.

But I already have the feeling that we may be facing a problem due to a timing window where some client(s) get a lock on the directory granted and then try to access the security.selinux xattr, while in the meantime other clients have submitted their requests/locks on the directory resource and have monopolized all the available "mdt" service threads.

Comment by John Hammond [ 13/Jan/17 ]

What are the kernel and lustre versions running on the MDS and on the clients?

Comment by Jean-Baptiste Riaux (Inactive) [ 23/Jan/17 ]

Clients and MDS are both running:

  • 3.10.0-514.el7_lustre.x86_64
  • lustre: 2.9.0 (lustre-2.9.0-1)
Comment by Jean-Baptiste Riaux (Inactive) [ 03/Feb/17 ]

MDS max started threads: 149
Blocked threads on the MDS: 147
The remaining non-blocked threads are high-priority threads for OBD_PING, MDS_CONNECT and OST_CONNECT operations.

Locks waiting:
[riauxj@test174 ]$ cat locks_waiting | wc -l
147

Blocked threads in mdt_intent_getattr:
[riauxj@test174 tmp]$ grep mdt_intent_getattr /tmp/bt_all_mdt_service_threads.out |wc -l
126
Locks are LCK_PR:
[riauxj@test174 ]$ grep LCK_PR locks_waiting.out | wc -l
126
[riauxj@test174 ]$ cat locks_waiting.out | grep "l_req_mode = LCK_PR" -A 28 | grep -i "bits = " | grep -v l_inodebits | uniq
      bits = 19 # 0x13 in hex
[riauxj@test174 ]$ cat locks_waiting.out | grep "l_req_mode = LCK_PR" -A 28 | grep -i "bits = " | grep -v l_inodebits | wc -l
126

Blocked threads in mdt_reint_open:
[riauxj@test174 tmp]$ grep mdt_reint_open /tmp/bt_all_mdt_service_threads.out |wc -l
21
Locks are LCK_CW, ibits 2.
[riauxj@test174 ]$ grep LCK_CW locks_waiting.out | wc -l
21
[riauxj@test174 ]$ cat locks_waiting.out | grep "l_req_mode = LCK_CW" -A 28 | grep -i "bits = " | grep -v l_inodebits | uniq
      bits = 2
[riauxj@test174 ]$ cat locks_waiting.out | grep "l_req_mode = LCK_CW" -A 28 | grep -i "bits = " | grep -v l_inodebits | wc -l
21

126 + 21 = 147, which matches the number of blocked threads.

Looking at the lock compatibility matrix:

       NL  CR  CW  PR  PW  EX
  NL    1   1   1   1   1   1
  CR    1   1   1   1   1   0
  CW    1   1   1   0   0   0
  PR    1   1   0   1   0   0
  PW    1   1   0   0   0   0
  EX    1   0   0   0   0   0

PR and CW are not compatible.

So we have processes (stat and touch) requesting Concurrent Write (CW) locks with ibits 0x2:
MDS_INODELOCK_UPDATE
and Protected Read (PR) locks with ibits 0x13:
MDS_INODELOCK_LOOKUP
MDS_INODELOCK_UPDATE
MDS_INODELOCK_PERM
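
For reference, these ibits decode against the standard MDS_INODELOCK_* bit values (0x01 LOOKUP, 0x02 UPDATE, 0x10 PERM); a quick check:

  printf '0x%x\n' 19   # -> 0x13 == LOOKUP(0x01) | UPDATE(0x02) | PERM(0x10)
  printf '0x%x\n' 2    # -> 0x2  == UPDATE only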

-------------------

GRANTED locks:
We have identified why the granted locks are not released.
The granted locks are LCK_PR, ibits 0x13:
[riauxj@test174 ]$ cat locks_granted.out | grep "l_req_mode = LCK_PR" -A 28 | grep -i "bits = " | grep -v l_inodebits | uniq
bits = 19
[riauxj@test174 ]$ cat locks_granted.out | grep "l_req_mode = LCK_PR" -A 28 | grep -i "bits = " | grep -v l_inodebits | wc -l
33

Following the granted locks leads again to the following stacks (touch and stat):

touch command:
PID: 8825   TASK: ffff88072dee6dd0  CPU: 5   COMMAND: "touch"
 #0 [ffff88017d4ef5a0] __schedule at ffffffff8168b175
 #1 [ffff88017d4ef608] schedule at ffffffff8168b7c9
 #2 [ffff88017d4ef618] schedule_timeout at ffffffff81689155
 #3 [ffff88017d4ef6c8] ptlrpc_set_wait at ffffffffa09c8cd0 [ptlrpc]
 #4 [ffff88017d4ef770] ptlrpc_queue_wait at ffffffffa09c918d [ptlrpc]
 #5 [ffff88017d4ef790] mdc_xattr_common at ffffffffa0bd93aa [mdc]
 #6 [ffff88017d4ef810] mdc_getxattr at ffffffffa0bd962f [mdc]      

Here the client tries to get a new lock (MDS_INODELOCK_XATTR), but the MDS cannot accept it because all of its threads are busy.
#7 [ffff88017d4ef850] lmv_getxattr at ffffffffa0c1660e [lmv]

#8 [ffff88017d4ef8c8] ll_getxattr_common at ffffffffa0c942c4 [lustre]        
 #9 [ffff88017d4ef948] ll_getxattr at ffffffffa0c94763 [lustre]
#10 [ffff88017d4ef980] inode_doinit_with_dentry at ffffffff812b281f
#11 [ffff88017d4ef9d8] selinux_d_instantiate at ffffffff812b38ec
#12 [ffff88017d4ef9e8] security_d_instantiate at ffffffff812a819b
#13 [ffff88017d4ef9f8] d_instantiate at ffffffff812162b0
#14 [ffff88017d4efa20] ll_splice_alias at ffffffffa0c8939a [lustre]
#15 [ffff88017d4efa60] ll_lookup_it_finish at ffffffffa0c89633 [lustre]

Here the client acquired a lock on the test directory (common to all clients), which is the granted lock we find on the MDS. But the client cannot release that lock, because it is trying to get an MDS_INODELOCK_XATTR lock higher up in the stack (which cannot be handled by the MDS because all threads are busy). So we are stuck.
#16 [ffff88017d4efb28] ll_lookup_it at ffffffffa0c8ae6e [lustre]

#17 [ffff88017d4efbe0] ll_lookup_nd at ffffffffa0c8c6dd [lustre]
#18 [ffff88017d4efc78] lookup_real at ffffffff812083dd
#19 [ffff88017d4efc98] __lookup_hash at ffffffff81208d52
#20 [ffff88017d4efcc8] lookup_slow at ffffffff816833fb
#21 [ffff88017d4efd00] link_path_walk at ffffffff8120b96f    
#22 [ffff88017d4efdb0] path_openat at ffffffff8120dc53        
#23 [ffff88017d4efe48] do_filp_open at ffffffff8120fdeb        
#24 [ffff88017d4eff18] do_sys_open at ffffffff811fd2f3                                  
#25 [ffff88017d4eff70] sys_open at ffffffff811fd40e
#26 [ffff88017d4eff80] system_call_fastpath at ffffffff81696709

PID: 8826   TASK: ffff88072b7d0000  CPU: 6   COMMAND: "stat"
 #0 [ffff880729abbbc0] __schedule at ffffffff8168b175
 #1 [ffff880729abbc28] schedule_preempt_disabled at ffffffff8168c8b9
 #2 [ffff880729abbc38] __mutex_lock_slowpath at ffffffff8168a515
 #3 [ffff880729abbc98] mutex_lock at ffffffff8168997f
 #4 [ffff880729abbcb0] lookup_slow at ffffffff816833ec            
 #5 [ffff880729abbce8] path_lookupat at ffffffff8120c273        
 #6 [ffff880729abbd80] filename_lookup at ffffffff8120c2cb
 #7 [ffff880729abbdb8] user_path_at_empty at ffffffff8120fce7
 #8 [ffff880729abbe88] user_path_at at ffffffff8120fd51
 #9 [ffff880729abbe98] vfs_fstatat at ffffffff812031c3
#10 [ffff880729abbee8] SYSC_newlstat at ffffffff81203791
#11 [ffff880729abbf70] sys_newlstat at ffffffff81203a1e
#12 [ffff880729abbf80] system_call_fastpath at ffffffff81696709

We also see that invoking d_instantiate (selinux_d_instantiate) adds an entry to the dentry cache.
So sometimes, with luck, some processes finish (before we kill the job), and when we rerun the test it works because the cache was populated (at least a bit).

Comment by Bruno Faccini (Inactive) [ 08/Feb/17 ]

John,
After an in-depth look into the related source code and the patches for LU-5560, I wonder if the solution to this problem could simply be to restore xattr cache usage for security.selinux as well (caching of that xattr is currently removed in the merged version of "LU-5560 llite: basic support of SELinux in CLIO", i.e. master commit 8a11cb6282cfbdc8617b809344e6a11223e86a38). What do you think?

J-B, would it be possible to test this on-site with your reproducer and test platform?

The patch should look like this:

bfaccini-mac02:lustre-release bfaccini$ git status
On branch CEA-2+LU-5560
bfaccini-mac02:lustre-release bfaccini$ git show
commit 779f62ee53449049c92ceb12987487597ed302c7
Author: Bruno Faccini <bruno.faccini@intel.com>
Date:   Tue Feb 7 22:41:30 2017 +0100

    LU-5560 llite: put security.selinux in xattr cache
    
    Do not filter out security.selinux from xattr cache because if not,
    it will need to be requested from MDT during each path component
    lookup and this with current directory lock granted, finally causing
    a blocked situation, if there are a lot of similar access from
    multiple clients, which may lead to all mdt service threads to be
    busy waiting for these locks reclaim and thus unable to serve the
    getxattr for these lock owners.
    
    Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
    Change-Id: I3e48c5cd4cdb0a9259241853784aca67305f4127

diff --git a/lustre/llite/xattr.c b/lustre/llite/xattr.c
index ca0e5c9..5293833 100644
--- a/lustre/llite/xattr.c
+++ b/lustre/llite/xattr.c
@@ -364,10 +364,7 @@ int ll_getxattr_common(struct inode *inode, const char *name,
 #endif
 
 do_getxattr:
-       if (sbi->ll_xattr_cache_enabled &&
-           xattr_type != XATTR_ACL_ACCESS_T &&
-           (xattr_type != XATTR_SECURITY_T ||
-               strcmp(name, "security.selinux") != 0)) {
+       if (sbi->ll_xattr_cache_enabled && xattr_type != XATTR_ACL_ACCESS_T) {
                rc = ll_xattr_cache_get(inode, name, buffer, size, valid);
                if (rc == -EAGAIN)
                        goto getxattr_nocache;
diff --git a/lustre/llite/xattr_cache.c b/lustre/llite/xattr_cache.c
index 18a289b..0f50426 100644
--- a/lustre/llite/xattr_cache.c
+++ b/lustre/llite/xattr_cache.c
@@ -446,10 +446,6 @@ static int ll_xattr_cache_refill(struct inode *inode)
                        CDEBUG(D_CACHE, "not caching %s\n",
                               XATTR_NAME_ACL_ACCESS);
                        rc = 0;
-               } else if (!strcmp(xdata, "security.selinux")) {
-                       /* Filter out security.selinux, it is cached in slab */
-                       CDEBUG(D_CACHE, "not caching security.selinux\n");
-                       rc = 0;
                } else {
                        rc = ll_xattr_cache_add(&lli->lli_xattrs, xdata, xval,
                                                *xsizes);
bfaccini-mac02:lustre-release bfaccini$

Comment by Bruno Faccini (Inactive) [ 13/Feb/17 ]

Talking with S. Buisson about this, he also pointed to his patch (https://review.whamcloud.com/24426) for LU-8956, which should also avoid the extraneous getxattr() requests used to fetch security-related EAs.

J-B will try both possible solutions against the known reproducer and let us know about his results as soon as possible.

Comment by Jean-Baptiste Riaux (Inactive) [ 14/Feb/17 ]

I tested both patches yesterday; the problem remains. However, with Bruno's patch the stack is now different than before, and Sebastien's patch improved the test speed a bit. The test evicted 2 or 3 clients at each run, but every run ended after 5 minutes (Lustre was unmounted and caches were dropped between runs), so it is a bit better. Before these patches we could wait 2 hours and Lustre would not recover properly. If I don't drop caches, the test ends in less than 3 seconds after a first run.

Also, for the 2 or 3 clients that failed, I see this output:

stat: failed to get security context of ‘-removed-/testdir’: Input/output error

I merged both patches today. The new stacks on the clients look like this:

PID: 8790   TASK: ffff88072eb4de20  CPU: 0   COMMAND: "stat"
 #0 [ffff880725a77270] __schedule at ffffffff8168b175
 #1 [ffff880725a772d8] schedule at ffffffff8168b7c9
 #2 [ffff880725a772e8] schedule_timeout at ffffffff81689155
 #3 [ffff880725a77390] ptlrpc_set_wait at ffffffffa09ce190 [ptlrpc]
 #4 [ffff880725a77438] ptlrpc_queue_wait at ffffffffa09ce64d [ptlrpc]
 #5 [ffff880725a77458] ldlm_cli_enqueue at ffffffffa09aa856 [ptlrpc]
 #6 [ffff880725a77500] mdc_enqueue_base at ffffffffa0bf98ef [mdc]
 #7 [ffff880725a775f8] mdc_intent_lock at ffffffffa0bfc57b [mdc]
 #8 [ffff880725a776c0] lmv_intent_lock at ffffffffa0c2d0af [lmv]
 #9 [ffff880725a77778] ll_xattr_cache_refill at ffffffffa0c9cf9b [lustre]
#10 [ffff880725a77858] ll_xattr_cache_get at ffffffffa0c9e33b [lustre]
#11 [ffff880725a778b0] ll_getxattr_common at ffffffffa0c9aaf8 [lustre]
#12 [ffff880725a77930] ll_getxattr at ffffffffa0c9b4c3 [lustre]
#13 [ffff880725a77968] inode_doinit_with_dentry at ffffffff812b281f
#14 [ffff880725a779c0] selinux_d_instantiate at ffffffff812b38ec
#15 [ffff880725a779d0] security_d_instantiate at ffffffff812a819b
#16 [ffff880725a779e0] d_instantiate at ffffffff812162b0
#17 [ffff880725a77a08] ll_splice_alias at ffffffffa0c9004a [lustre]
#18 [ffff880725a77a48] ll_lookup_it_finish at ffffffffa0c902e3 [lustre]
#19 [ffff880725a77b10] ll_lookup_it at ffffffffa0c91b1e [lustre]
#20 [ffff880725a77bc8] ll_lookup_nd at ffffffffa0c9342e [lustre]
#21 [ffff880725a77c60] lookup_real at ffffffff812083dd
#22 [ffff880725a77c80] __lookup_hash at ffffffff81208d52
#23 [ffff880725a77cb0] lookup_slow at ffffffff816833fb
#24 [ffff880725a77ce8] path_lookupat at ffffffff8120c273
#25 [ffff880725a77d80] filename_lookup at ffffffff8120c2cb
#26 [ffff880725a77db8] user_path_at_empty at ffffffff8120fce7
#27 [ffff880725a77e88] user_path_at at ffffffff8120fd51
#28 [ffff880725a77e98] vfs_fstatat at ffffffff812031c3
#29 [ffff880725a77ee8] SYSC_newlstat at ffffffff81203791
#30 [ffff880725a77f70] sys_newlstat at ffffffff81203a1e
#31 [ffff880725a77f80] system_call_fastpath at ffffffff81696709

PID: 8773   TASK: ffff880734636dd0  CPU: 5   COMMAND: "touch"
 #0 [ffff88072e66f280] __schedule at ffffffff8168b175
 #1 [ffff88072e66f2e8] schedule at ffffffff8168b7c9
 #2 [ffff88072e66f2f8] schedule_timeout at ffffffff81689155
 #3 [ffff88072e66f3a8] ptlrpc_set_wait at ffffffffa09c4190 [ptlrpc]
 #4 [ffff88072e66f450] ptlrpc_queue_wait at ffffffffa09c464d [ptlrpc]
 #5 [ffff88072e66f470] ldlm_cli_enqueue at ffffffffa09a0856 [ptlrpc]
 #6 [ffff88072e66f518] mdc_enqueue_base at ffffffffa0bef8ef [mdc]
 #7 [ffff88072e66f610] mdc_intent_lock at ffffffffa0bf257b [mdc]
 #8 [ffff88072e66f6d8] lmv_intent_lock at ffffffffa0c230af [lmv]
 #9 [ffff88072e66f790] ll_xattr_cache_refill at ffffffffa0c92f9b [lustre]
#10 [ffff88072e66f870] ll_xattr_cache_get at ffffffffa0c9433b [lustre]
#11 [ffff88072e66f8c8] ll_getxattr_common at ffffffffa0c90af8 [lustre]
#12 [ffff88072e66f948] ll_getxattr at ffffffffa0c914c3 [lustre]
#13 [ffff88072e66f980] inode_doinit_with_dentry at ffffffff812b281f
#14 [ffff88072e66f9d8] selinux_d_instantiate at ffffffff812b38ec
#15 [ffff88072e66f9e8] security_d_instantiate at ffffffff812a819b
#16 [ffff88072e66f9f8] d_instantiate at ffffffff812162b0
#17 [ffff88072e66fa20] ll_splice_alias at ffffffffa0c8604a [lustre]
#18 [ffff88072e66fa60] ll_lookup_it_finish at ffffffffa0c862e3 [lustre]
#19 [ffff88072e66fb28] ll_lookup_it at ffffffffa0c87b1e [lustre]
#20 [ffff88072e66fbe0] ll_lookup_nd at ffffffffa0c893fd [lustre]
#21 [ffff88072e66fc78] lookup_real at ffffffff812083dd
#22 [ffff88072e66fc98] __lookup_hash at ffffffff81208d52
#23 [ffff88072e66fcc8] lookup_slow at ffffffff816833fb
#24 [ffff88072e66fd00] link_path_walk at ffffffff8120b96f
#25 [ffff88072e66fdb0] path_openat at ffffffff8120dc53
#26 [ffff88072e66fe48] do_filp_open at ffffffff8120fdeb
#27 [ffff88072e66ff18] do_sys_open at ffffffff811fd2f3
#28 [ffff88072e66ff70] sys_open at ffffffff811fd40e
#29 [ffff88072e66ff80] system_call_fastpath at ffffffff81696709

On the MDS, no surprises (all stacks are the same):

PID: 10468  TASK: ffff88070d7a4e70  CPU: 3   COMMAND: "mdt00_083"
 #0 [ffff880705657750] __schedule at ffffffff8168b175
 #1 [ffff8807056577b8] schedule at ffffffff8168b7c9
 #2 [ffff8807056577c8] schedule_timeout at ffffffff81689155
 #3 [ffff880705657878] ldlm_completion_ast at ffffffffa0a4cc71 [ptlrpc]
 #4 [ffff880705657918] ldlm_cli_enqueue_local at ffffffffa0a4d9c0 [ptlrpc]
 #5 [ffff8807056579b0] mdt_object_local_lock at ffffffffa0fe578a [mdt]
 #6 [ffff880705657a60] mdt_object_lock_internal at ffffffffa0fe5dbe [mdt]
 #7 [ffff880705657ab0] mdt_getattr_name_lock at ffffffffa0fe6b64 [mdt]
 #8 [ffff880705657b38] mdt_intent_getattr at ffffffffa0fe7e70 [mdt]
 #9 [ffff880705657b78] mdt_intent_policy at ffffffffa0feba8c [mdt]
#10 [ffff880705657bd0] ldlm_lock_enqueue at ffffffffa0a32107 [ptlrpc]
#11 [ffff880705657c28] ldlm_handle_enqueue0 at ffffffffa0a5b883 [ptlrpc]
#12 [ffff880705657cb8] tgt_enqueue at ffffffffa0adc502 [ptlrpc]
#13 [ffff880705657cd8] tgt_request_handle at ffffffffa0ae0875 [ptlrpc]
#14 [ffff880705657d20] ptlrpc_server_handle_request at ffffffffa0a8c73b [ptlrpc]
#15 [ffff880705657de8] ptlrpc_main at ffffffffa0a907f0 [ptlrpc]
#16 [ffff880705657ec8] kthread at ffffffff810b052f
#17 [ffff880705657f50] ret_from_fork at ffffffff81696658

Comment by Bruno Faccini (Inactive) [ 23/Mar/17 ]

We should be able to test a possible solution (embedding the security context/xattr in the lookup reply, to avoid the extra getxattr() issued from the security hooks and thus fix the deadlock) on-site next week.

Comment by Sebastien Piechurski [ 18/Apr/17 ]

Hi Bruno, JB,

Any news from the test mentioned in your last comment?

Comment by Bruno Faccini (Inactive) [ 18/Apr/17 ]

Hello Seb,
Tests of my patch by JB are still ongoing (I had a few issues to fix in between), and I should hopefully be able to push it soon.

Comment by Gerrit Updater [ 26/Apr/17 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/26831
Subject: LU-9193 security: return security context for lookup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7dde7a39a58fbdf580238970608b6a74e976659f

Comment by Jean-Baptiste Riaux (Inactive) [ 03/Jul/17 ]

Ticket status:

  • Gerrit patch set 8 tested: the SELinux attribute returned by the MDC (for the new atomic operation) had a wrong size, so the client still had to do a getxattr and the bug could be reproduced.
  • Gerrit patch set 9 tested: fixes the attribute size; the problem could still be reproduced, but another run is needed to get full logs and investigate.
Comment by Jean-Baptiste Riaux (Inactive) [ 05/Jul/17 ]

New logs.
The size problem on the security.selinux attribute remains:
00000100:00020000:5.0:1499254632.192081:0:8826:0:(pack_generic.c:425:lustre_msg_buf_v2()) msg ffff8806812ca0c0 buffer[6] size 37 too small (required 128, opc=101)
00000100:00020000:5.0:1499254632.193816:0:8826:0:(layout.c:2087:__req_capsule_get()) @@@ Wrong buffer for field `file_secctx' (6 of 7) in format `LDLM_INTENT_GETATTR': 37 vs. 128 (server)
req@ffff8806746f8000 x1572082208998624/t0(0) o101->testfs1-MDT0000-mdc-ffff88066da70800@10.252.0.1@tcp:12/10 lens 728/584 e 0 to 0 dl 1499254639 ref 1 fl Complete:R/0/0 rc 0/0
00000080:08000000:5.0:1499254632.196775:0:8826:0:(namei.c:513:ll_lookup_it_finish()) Server returned 'security.selinux' security context/xattr for [0x200000401:0x1:0x0] (37 bytes long): (null)
00000080:00000400:5.0:1499254632.196777:0:8826:0:(namei.c:521:ll_lookup_it_finish()) Error when trying to set security context for [0x200000401:0x1:0x0] rc = -13

Removed unnecessary OSS logs from the archive (file size too big): logs_050717.tar.gz

Comment by Jean-Baptiste Riaux (Inactive) [ 24/Aug/17 ]

New test today using the RMF_F_NO_SIZE_CHECK flag for the file_secctx field:

struct req_msg_field RMF_FILE_SECCTX =
        DEFINE_MSGF("file_secctx", RMF_F_NO_SIZE_CHECK, -1, NULL, NULL);
EXPORT_SYMBOL(RMF_FILE_SECCTX);

The error messages from the previous comment are gone, but the MDS threads are all consumed and the load is at 0 (as before).
A new set of logs is attached (one mistake: the MDS backtrace, the vm0.bt.log file, is in the lustre directory instead of the crash directory).
lustre-LU9193-240817.tar.gz

Comment by Jean-Baptiste Riaux (Inactive) [ 18/Sep/17 ]

Patch set 11 tested. The problem remains (MDS load at 0 and all threads used), and this time I got evictions (see the vm19 logs).
Logs attached. logs-9193-patchset11-tests.tar.gz

Comment by Bruno Faccini (Inactive) [ 19/Sep/17 ]

Hmm, having a look at all the hung client task stacks, it no longer looks like we are facing a deadlock scenario due to unexpected getxattr() of the security xattr/context. On the other hand, since the full MDS thread-stack log is missing from the tarball, I can't analyze further. I will need to go on-site and analyze live with J-B, but it looks like the original deadlock issue is gone and we are facing a new issue.

Comment by Bruno Faccini (Inactive) [ 03/Oct/17 ]

Well, as expected after my previous comment, this new pseudo-hang/deadlock situation can be reproduced by running the latest Lustre master on RHEL 7.2 with SELinux disabled...

A lot of the reproducer's stat+touch threads on the clients are stuck waiting for a lock on the directory to be granted, with stacks like the following:

PID: 12292  TASK: ffff8806b1ee4e70  CPU: 4   COMMAND: "stat"
 #0 [ffff8806c251f820] __schedule at ffffffff8168b175
 #1 [ffff8806c251f888] schedule at ffffffff8168b7c9
 #2 [ffff8806c251f898] schedule_timeout at ffffffff81689155
 #3 [ffff8806c251f940] ptlrpc_set_wait at ffffffffa09f8030 [ptlrpc]
 #4 [ffff8806c251f9e8] ptlrpc_queue_wait at ffffffffa09f84ed [ptlrpc]
 #5 [ffff8806c251fa08] ldlm_cli_enqueue at ffffffffa09d46c6 [ptlrpc]
 #6 [ffff8806c251fab0] mdc_enqueue_base at ffffffffa0c3fa0f [mdc]
 #7 [ffff8806c251fba8] mdc_intent_lock at ffffffffa0c4186b [mdc]
 #8 [ffff8806c251fc70] lmv_intent_lock at ffffffffa0c75f1f [lmv]
 #9 [ffff8806c251fd28] __ll_inode_revalidate at ffffffffa0cb28db [lustre]
#10 [ffff8806c251fdd0] ll_getattr at ffffffffa0cb3793 [lustre]
#11 [ffff8806c251fe68] vfs_getattr at ffffffff812030a6
#12 [ffff8806c251fe98] vfs_fstatat at ffffffff812031d5
#13 [ffff8806c251fee8] SYSC_newlstat at ffffffff81203791
#14 [ffff8806c251ff70] sys_newlstat at ffffffff81203a1e
#15 [ffff8806c251ff80] system_call_fastpath at ffffffff81696709
    RIP: 00007f4ae45d0eb5  RSP: 00007ffe8a35c2e8  RFLAGS: 00000246
    RAX: 0000000000000006  RBX: ffffffff81696709  RCX: ffffffffffffffff
    RDX: 00007ffe8a35c390  RSI: 00007ffe8a35c390  RDI: 00007ffe8a35ce35
    RBP: 0000000001169460   R8: 0000000000000001   R9: 00007ffe8a35ce35
    R10: 00007ffe8a35c0e0  R11: 0000000000000246  R12: ffffffff81203a1e
    R13: ffff8806c251ff78  R14: 0000000000000002  R15: 00000000298134ed
    ORIG_RAX: 0000000000000006  CS: 0033  SS: 002b

PID: 12290  TASK: ffff8806c32a0fb0  CPU: 2   COMMAND: "touch"
 #0 [ffff8806c27a75a0] __schedule at ffffffff8168b175
 #1 [ffff8806c27a7608] schedule at ffffffff8168b7c9
 #2 [ffff8806c27a7618] schedule_timeout at ffffffff81689155
 #3 [ffff8806c27a76c0] ptlrpc_set_wait at ffffffffa09f8030 [ptlrpc]
 #4 [ffff8806c27a7768] ptlrpc_queue_wait at ffffffffa09f84ed [ptlrpc]
 #5 [ffff8806c27a7788] ldlm_cli_enqueue at ffffffffa09d46c6 [ptlrpc]
 #6 [ffff8806c27a7830] mdc_enqueue_base at ffffffffa0c3fa0f [mdc]
 #7 [ffff8806c27a7928] mdc_intent_lock at ffffffffa0c4186b [mdc]
 #8 [ffff8806c27a79f0] lmv_intent_lock at ffffffffa0c7655d [lmv]
 #9 [ffff8806c27a7aa8] ll_lookup_it at ffffffffa0cd9fa1 [lustre]
#10 [ffff8806c27a7c60] ll_atomic_open at ffffffffa0cdad9d [lustre]
#11 [ffff8806c27a7d00] do_last at ffffffff8120d321
#12 [ffff8806c27a7db0] path_openat at ffffffff8120dc72
#13 [ffff8806c27a7e48] do_filp_open at ffffffff8120fdeb
#14 [ffff8806c27a7f18] do_sys_open at ffffffff811fd2f3
#15 [ffff8806c27a7f70] sys_open at ffffffff811fd40e
#16 [ffff8806c27a7f80] system_call_fastpath at ffffffff81696709
    RIP: 00007f7f2c5fc2b0  RSP: 00007ffc0d5338c8  RFLAGS: 00010287
    RAX: 0000000000000002  RBX: ffffffff81696709  RCX: 0000000000000014
    RDX: 00000000000001b6  RSI: 0000000000000941  RDI: 00007ffc0d534e2d
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 00007ffc0d533790  R11: 0000000000000246  R12: ffffffff811fd40e
    R13: ffff8806c27a7f78  R14: 0000000000000000  R15: 0000000000000001
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b

while all MDT service threads become busy/blocked with stacks similar to the following:

PID: 11386  TASK: ffff880688ed0000  CPU: 5   COMMAND: "mdt00_121"
 #0 [ffff880688ecb660] __schedule at ffffffff8163b6cd
 #1 [ffff880688ecb6c8] schedule at ffffffff8163bd69
 #2 [ffff880688ecb6d8] schedule_timeout at ffffffff816399c5
 #3 [ffff880688ecb788] ldlm_completion_ast at ffffffffa0a43fa1 [ptlrpc]
 #4 [ffff880688ecb828] ldlm_cli_enqueue_local at ffffffffa0a45be0 [ptlrpc]
 #5 [ffff880688ecb8c0] mdt_object_local_lock at ffffffffa10120a2 [mdt]
 #6 [ffff880688ecb968] mdt_object_lock_internal at ffffffffa10124bb [mdt]
 #7 [ffff880688ecb9a8] mdt_object_find_lock at ffffffffa10149e2 [mdt]
 #8 [ffff880688ecb9d8] mdt_reint_open at ffffffffa1036c1b [mdt]
 #9 [ffff880688ecbac8] mdt_reint_rec at ffffffffa102bb50 [mdt]
#10 [ffff880688ecbaf0] mdt_reint_internal at ffffffffa100d7f1 [mdt]
#11 [ffff880688ecbb28] mdt_intent_reint at ffffffffa100db32 [mdt]
#12 [ffff880688ecbb68] mdt_intent_policy at ffffffffa101825c [mdt]
#13 [ffff880688ecbbd0] ldlm_lock_enqueue at ffffffffa0a2b1f7 [ptlrpc]
#14 [ffff880688ecbc28] ldlm_handle_enqueue0 at ffffffffa0a50f72 [ptlrpc]
#15 [ffff880688ecbcb8] tgt_enqueue at ffffffffa0ad58f2 [ptlrpc]
#16 [ffff880688ecbcd8] tgt_request_handle at ffffffffa0ada2f5 [ptlrpc]
#17 [ffff880688ecbd20] ptlrpc_server_handle_request at ffffffffa0a84ccb [ptlrpc]
#18 [ffff880688ecbde8] ptlrpc_main at ffffffffa0a885f0 [ptlrpc]
#19 [ffff880688ecbec8] kthread at ffffffff810a5b8f
#20 [ffff880688ecbf50] ret_from_fork at ffffffff81646c98

PID: 11387  TASK: ffff880688ed2280  CPU: 6   COMMAND: "mdt00_122"
 #0 [ffff88068942f760] __schedule at ffffffff8163b6cd
 #1 [ffff88068942f7c8] schedule at ffffffff8163bd69
 #2 [ffff88068942f7d8] schedule_timeout at ffffffff816399c5
 #3 [ffff88068942f880] ldlm_completion_ast at ffffffffa0a43fa1 [ptlrpc]
 #4 [ffff88068942f920] ldlm_cli_enqueue_local at ffffffffa0a45be0 [ptlrpc]
 #5 [ffff88068942f9b8] mdt_object_local_lock at ffffffffa1011b92 [mdt]
 #6 [ffff88068942fa60] mdt_object_lock_internal at ffffffffa10124bb [mdt]
 #7 [ffff88068942faa0] mdt_getattr_name_lock at ffffffffa10132c6 [mdt]
 #8 [ffff88068942fb28] mdt_intent_getattr at ffffffffa10147c0 [mdt]
 #9 [ffff88068942fb68] mdt_intent_policy at ffffffffa101825c [mdt]
#10 [ffff88068942fbd0] ldlm_lock_enqueue at ffffffffa0a2b1f7 [ptlrpc]
#11 [ffff88068942fc28] ldlm_handle_enqueue0 at ffffffffa0a50f72 [ptlrpc]
#12 [ffff88068942fcb8] tgt_enqueue at ffffffffa0ad58f2 [ptlrpc]
#13 [ffff88068942fcd8] tgt_request_handle at ffffffffa0ada2f5 [ptlrpc]
#14 [ffff88068942fd20] ptlrpc_server_handle_request at ffffffffa0a84ccb [ptlrpc]
#15 [ffff88068942fde8] ptlrpc_main at ffffffffa0a885f0 [ptlrpc]
#16 [ffff88068942fec8] kthread at ffffffff810a5b8f
#17 [ffff88068942ff50] ret_from_fork at ffffffff81646c98

So my current feeling is that the original problem/deadlock for this ticket should now be fixed by my current patch, but we may now be facing a regression in the latest Lustre master.

We will try to better qualify this newly suspected problem with more on-site debugging, using new reproducer runs without SELinux. We will also try to expose the same platform to CEA's other metadata regression tests.

Comment by Bruno Faccini (Inactive) [ 19/Nov/17 ]

By the way, it seems to me that this last pseudo-hang/deadlock scenario looks very similar to LU-10235!

Comment by Dominique Martinet (Inactive) [ 20/Nov/17 ]

It is close, yes. It looks like mdt_reint_open() also takes an MDS_INODELOCK_UPDATE lock on the parent object before doing a lookup, followed by a create if the lookup fails.

In LU-10235 I was only looking at mdt_create(), but a similar solution (doing a first lookup with MDS_INODELOCK_LOOKUP, then dropping and re-taking the lock for update, followed by another lookup) would likely help in this precise case (since you're hammering a file that already exists). However, if all the files are different, e.g. hundreds of clients creating thousands of different files in a single directory, then it wouldn't help at all. I'm not sure what we could do for that scenario; it's practically a denial of service at this point...

Comment by Bruno Faccini (Inactive) [ 15/Jan/18 ]

The last deadlock situation encountered in this ticket is not related to the original issue this ticket was created for; it seems to be more generic and is now being addressed in the LU-10235/LU-10262 tickets.

Comment by Andreas Dilger [ 26/Jan/18 ]

Bruno, unless you think your current patch is fundamentally moving in the wrong direction (I haven't looked at it yet), I think it makes sense to land the current patch that solves the first problem, then make a second patch to address this next problem.

At least we are moving in the right direction to fix the first problem, which is what the application was originally hitting, and possibly the second problem you are able to reproduce is not actually being seen by applications, so we have more time to work on a solution.

Comment by Gerrit Updater [ 01/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/26831/
Subject: LU-9193 security: return security context for metadata ops
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fca35f74f9ec5c5ed77e774f3e3209d9df057a01

Comment by Peter Jones [ 01/Apr/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 02/Apr/19 ]

Sebastien Piechurski (sebastien.piechurski@atos.net) uploaded a new patch: https://review.whamcloud.com/34573
Subject: LU-9193 security: return security context for metadata ops
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 24b11428909ea77c7f4a101b4e2b7f8b2c490f06

Comment by Gerrit Updater [ 21/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34573/
Subject: LU-9193 security: return security context for metadata ops
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 7aa5ae2673f70ef851fb903b280a0fc9a47c476b

Comment by Mikhail Pershin [ 21/Apr/19 ]

This patch somehow causes frequent client reconnects:

Apr 21 03:09:50 nodez kernel: Lustre: 4239:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555830583/real 1555830583] req@ffff8800ad5fc740 x1631406542374112/t0(0) o101->lustre-MDT0000-mdc-ffff8800af2fb800@0@lo:12/10 lens 616/4752 e 0 to 1 dl 1555830590 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
Apr 21 03:09:50 nodez kernel: Lustre: lustre-MDT0000-mdc-ffff8800af2fb800: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Apr 21 03:09:50 nodez kernel: Lustre: lustre-MDT0000: Client 2f22eecb-5055-b4d6-16b5-a958700fcbda (at 0@lo) reconnecting
Apr 21 03:09:50 nodez kernel: Lustre: lustre-MDT0000: Connection restored to 4e8c3a45-3b8a-cd6a-047f-22363e8171e6 (at 0@lo)
Apr 21 03:09:50 nodez kernel: Lustre: Skipped 1 previous similar message

I am seeing many such messages during a dbench run, and git bisect shows this patch is the source of the problem.

Comment by Mikhail Pershin [ 21/Apr/19 ]

Created LU-12212 for the reconnect issue.

Comment by Gerrit Updater [ 02/Feb/21 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41387
Subject: LU-9193 security: return security context for metadata ops
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 68554f5ebf1479748b9bc9f6f86d4eb4c5dc9873

Comment by Gerrit Updater [ 04/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41387/
Subject: LU-9193 security: return security context for metadata ops
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: cd17e1a2e6367a7c3f07753e71fa569c28000c81
