[LU-4300] ptlrpcd threads deadlocked in cl_lock_mutex_get Created: 22/Nov/13 Updated: 06/Sep/19 Resolved: 07/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.2 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Sergey Cheremencev | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 11787 |
| Description |
|
The TraveltimeLaunc thread holds a cl_lock, which leads to a call to cl_lock_cancel(). The lock cancel triggers an OST_WRITE RPC, but that OST_WRITE cannot be sent because all ptlrpcd threads are waiting for the cl_lock. So the OST_WRITE RPC is stuck in the new-requests queue and can never be sent, because ptlrpcd_19 is blocked in the osc lock upcall. From the dump of the ptlrpcd threads (each thread has a recovery set plus two "production" sets, new and processed), thread 19 has the request sitting in its new queue.

All 32 ptlrpcd threads are waiting with this stack trace:

> 2013-10-15T15:14:18.766544-05:00 c0-0c0s5n3 ptlrpcd_0 D 0000000000000000 0 4183 2 0x00000000
> 2013-10-15T15:14:18.791825-05:00 c0-0c0s5n3 ffff881fe7085aa0 0000000000000046 ffff881fe7084010 ffff881fea497040
> 2013-10-15T15:14:18.792017-05:00 c0-0c0s5n3 0000000000010b00 ffff881fe7085fd8 ffff881fe7085fd8 0000000000010b00
> 2013-10-15T15:14:18.817016-05:00 c0-0c0s5n3 0000000000000013 ffff881fea497040 0000000000000000 ffff881fe9800040
> 2013-10-15T15:14:18.817082-05:00 c0-0c0s5n3 Call Trace:
> 2013-10-15T15:14:18.817244-05:00 c0-0c0s5n3 [<ffffffff8138311f>] schedule+0x3f/0x60
> 2013-10-15T15:14:18.817388-05:00 c0-0c0s5n3 [<ffffffff81383f92>] __mutex_lock_slowpath+0x102/0x180
> 2013-10-15T15:14:18.842380-05:00 c0-0c0s5n3 [<ffffffff81383993>] mutex_lock+0x23/0x40
> 2013-10-15T15:14:18.842582-05:00 c0-0c0s5n3 [<ffffffffa0375806>] cl_lock_mutex_get+0x76/0xd0 [obdclass]
> 2013-10-15T15:14:18.867656-05:00 c0-0c0s5n3 [<ffffffffa07bbc08>] lovsub_parent_lock+0x48/0x120 [lov]
> 2013-10-15T15:14:18.867709-05:00 c0-0c0s5n3 [<ffffffffa07bc368>] lovsub_lock_state+0x68/0x1a0 [lov]
> 2013-10-15T15:14:18.867880-05:00 c0-0c0s5n3 [<ffffffffa0373558>] cl_lock_state_signal+0x68/0x160 [obdclass]
> 2013-10-15T15:14:18.893131-05:00 c0-0c0s5n3 [<ffffffffa0373e7a>] cl_lock_signal+0x5a/0x130 [obdclass]
> 2013-10-15T15:14:18.893285-05:00 c0-0c0s5n3 [<ffffffffa072279d>] osc_lock_upcall+0x25d/0x610 [osc]
> 2013-10-15T15:14:18.893419-05:00 c0-0c0s5n3 [<ffffffffa07041bd>] osc_enqueue_fini+0x9d/0x240 [osc]
> 2013-10-15T15:14:18.918520-05:00 c0-0c0s5n3 [<ffffffffa0706eb7>] osc_enqueue_interpret+0xe7/0x1d0 [osc]
> 2013-10-15T15:14:18.918702-05:00 c0-0c0s5n3 [<ffffffffa049733c>] ptlrpc_check_set+0x52c/0x1e20 [ptlrpc]
> 2013-10-15T15:14:18.943647-05:00 c0-0c0s5n3 [<ffffffffa04c4acb>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
> 2013-10-15T15:14:18.943885-05:00 c0-0c0s5n3 [<ffffffffa04c504b>] ptlrpcd+0x29b/0x3b0 [ptlrpc]
> 2013-10-15T15:14:18.943979-05:00 c0-0c0s5n3 [<ffffffff8138de94>] kernel_thread_helper+0x4/0x10

The following application process is waiting for an RPC to be sent:

> 2013-10-15T15:14:31.200904-05:00 c0-0c0s5n3 TraveltimeLaunc S 00000001050e94cc 0 16793 16777 0x00000000
> 2013-10-15T15:14:31.200927-05:00 c0-0c0s5n3 ffff880fe8979308 0000000000000082 ffff880fe8978010 ffff880faca8e040
> 2013-10-15T15:14:31.200967-05:00 c0-0c0s5n3 0000000000010b00 ffff880fe8979fd8 ffff880fe8979fd8 0000000000010b00
> 2013-10-15T15:14:31.226179-05:00 c0-0c0s5n3 0000000000000000 ffff880faca8e040 0000000000000000 ffffffff81667020
> 2013-10-15T15:14:31.226195-05:00 c0-0c0s5n3 Call Trace:
> 2013-10-15T15:14:31.226219-05:00 c0-0c0s5n3 [<ffffffff8138311f>] schedule+0x3f/0x60
> 2013-10-15T15:14:31.251476-05:00 c0-0c0s5n3 [<ffffffffa021d74e>] cfs_waitq_wait+0xe/0x10 [libcfs]
> 2013-10-15T15:14:31.251533-05:00 c0-0c0s5n3 [<ffffffffa0731e56>] osc_extent_wait+0x576/0x630 [osc]
> 2013-10-15T15:14:31.251563-05:00 c0-0c0s5n3 [<ffffffffa0732447>] osc_cache_wait_range+0x537/0x820 [osc]
> 2013-10-15T15:14:31.276843-05:00 c0-0c0s5n3 [<ffffffffa0733199>] osc_cache_writeback_range+0xa69/0x1030 [osc]
> 2013-10-15T15:14:31.276897-05:00 c0-0c0s5n3 [<ffffffffa071febe>] osc_lock_flush+0x7e/0x260 [osc]
> 2013-10-15T15:14:31.302146-05:00 c0-0c0s5n3 [<ffffffffa0720181>] osc_lock_cancel+0xe1/0x1c0 [osc]
> 2013-10-15T15:14:31.302187-05:00 c0-0c0s5n3 [<ffffffffa03730d5>] cl_lock_cancel0+0x75/0x160 [obdclass]
> 2013-10-15T15:14:31.302229-05:00 c0-0c0s5n3 [<ffffffffa0373e1b>] cl_lock_cancel+0x13b/0x140 [obdclass]
> 2013-10-15T15:14:31.327498-05:00 c0-0c0s5n3 [<ffffffffa072162c>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
> 2013-10-15T15:14:31.327566-05:00 c0-0c0s5n3 [<ffffffffa046baab>] ldlm_cancel_callback+0x6b/0x190 [ptlrpc]
> 2013-10-15T15:14:31.352863-05:00 c0-0c0s5n3 [<ffffffffa0479d1a>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
> 2013-10-15T15:14:31.352917-05:00 c0-0c0s5n3 [<ffffffffa047cfec>] ldlm_cli_cancel_list_local+0xec/0x290 [ptlrpc]
> 2013-10-15T15:14:31.378183-05:00 c0-0c0s5n3 [<ffffffffa047dee5>] ldlm_cancel_lru_local+0x35/0x40 [ptlrpc]
> 2013-10-15T15:14:31.378216-05:00 c0-0c0s5n3 [<ffffffffa047f36f>] ldlm_prep_elc_req+0x3df/0x490 [ptlrpc]
> 2013-10-15T15:14:31.403528-05:00 c0-0c0s5n3 [<ffffffffa047f448>] ldlm_prep_enqueue_req+0x28/0x30 [ptlrpc]
> 2013-10-15T15:14:31.403581-05:00 c0-0c0s5n3 [<ffffffffa0706963>] osc_enqueue_base+0x103/0x570 [osc]
> 2013-10-15T15:14:31.403617-05:00 c0-0c0s5n3 [<ffffffffa0720e36>] osc_lock_enqueue+0x506/0x900 [osc]
> 2013-10-15T15:14:31.428880-05:00 c0-0c0s5n3 [<ffffffffa03777bb>] cl_enqueue_try+0xfb/0x320 [obdclass]
> 2013-10-15T15:14:31.428980-05:00 c0-0c0s5n3 [<ffffffffa07b4a54>] lov_lock_enqueue+0x1f4/0x890 [lov]
> 2013-10-15T15:14:31.429041-05:00 c0-0c0s5n3 [<ffffffffa03777bb>] cl_enqueue_try+0xfb/0x320 [obdclass]
> 2013-10-15T15:14:31.454182-05:00 c0-0c0s5n3 [<ffffffffa037869f>] cl_enqueue_locked+0x7f/0x1f0 [obdclass]
> 2013-10-15T15:14:31.454296-05:00 c0-0c0s5n3 [<ffffffffa03792ce>] cl_lock_request+0x7e/0x270 [obdclass]
> 2013-10-15T15:14:31.454354-05:00 c0-0c0s5n3 [<ffffffffa0883e9f>] cl_glimpse_lock+0x17f/0x490 [lustre]
> 2013-10-15T15:14:31.479540-05:00 c0-0c0s5n3 [<ffffffffa088434d>] cl_glimpse_size0+0x19d/0x1c0 [lustre]
> 2013-10-15T15:14:31.504845-05:00 c0-0c0s5n3 [<ffffffffa0837650>] ll_inode_revalidate_it+0x1b0/0x1d0 [lustre]
> 2013-10-15T15:14:31.504912-05:00 c0-0c0s5n3 [<ffffffffa08376b6>] ll_getattr_it+0x46/0x180 [lustre]
> 2013-10-15T15:14:31.504990-05:00 c0-0c0s5n3 [<ffffffffa083782c>] ll_getattr+0x3c/0x40 [lustre]
> 2013-10-15T15:14:31.505021-05:00 c0-0c0s5n3 [<ffffffff81147355>] vfs_getattr+0x25/0x50
> 2013-10-15T15:14:31.530127-05:00 c0-0c0s5n3 [<ffffffff81147920>] vfs_fstatat+0x80/0x90
> 2013-10-15T15:14:31.530209-05:00 c0-0c0s5n3 [<ffffffff81147a5b>] vfs_stat+0x1b/0x20
> 2013-10-15T15:14:31.530264-05:00 c0-0c0s5n3 [<ffffffff81147a84>] sys_newstat+0x24/0x50
> 2013-10-15T15:14:31.530321-05:00 c0-0c0s5n3 [<ffffffff8138cdab>] system_call_fastpath+0x16/0x1b
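The dependency cycle in short: the application thread holds the cl_lock mutex and, in the osc_lock_cancel()/osc_lock_flush() path, waits for dirty pages to be written back; that write-back can only be completed by a ptlrpcd thread, but every ptlrpcd thread is itself blocked in cl_lock_mutex_get() on the same mutex while handling an enqueue reply. A minimal userspace pthread analogy of this shape (not Lustre code; all names below are illustrative stand-ins) is:

```
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t cl_lock_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t state_mutex   = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  write_done    = PTHREAD_COND_INITIALIZER;
static int writeback_complete;

/* Stand-in for a ptlrpcd thread processing an enqueue reply. */
static void *ptlrpcd_thread(void *arg)
{
        (void)arg;
        /* Like osc_lock_upcall() -> cl_lock_mutex_get(): blocks forever,
         * because the application thread already holds the mutex. */
        pthread_mutex_lock(&cl_lock_mutex);

        /* Only after taking the mutex could the "write-back" complete. */
        pthread_mutex_lock(&state_mutex);
        writeback_complete = 1;
        pthread_cond_signal(&write_done);
        pthread_mutex_unlock(&state_mutex);

        pthread_mutex_unlock(&cl_lock_mutex);
        return NULL;
}

int main(void)
{
        pthread_t tid;

        /* "Application" thread: holds the lock mutex (the cancel path)
         * and then waits for a write-back only "ptlrpcd" can complete. */
        pthread_mutex_lock(&cl_lock_mutex);
        pthread_create(&tid, NULL, ptlrpcd_thread, NULL);

        pthread_mutex_lock(&state_mutex);
        while (!writeback_complete)             /* like osc_extent_wait() */
                pthread_cond_wait(&write_done, &state_mutex);
        pthread_mutex_unlock(&state_mutex);

        pthread_mutex_unlock(&cl_lock_mutex);   /* never reached */
        pthread_join(tid, NULL);
        printf("done\n");
        return 0;
}
```

Built with -lpthread and run as-is, the program hangs the same way the traces above do: the "ptlrpcd" thread never gets the mutex, so the write-back completion the "application" thread is waiting on never arrives. |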
| Comments |
| Comment by Alexey Lyashkov [ 22/Nov/13 ] |
|
The problem is simple: the GLIMPSE lock is in the NEW state, and to switch into the ENQUEUE state it needs to send a request for each stripe assigned to the file. Xyratex has several ideas about how this should be fixed; we want to discuss them with Intel before we prepare a patch for this issue. |
| Comment by Andreas Dilger [ 27/Nov/13 ] |
|
I think Jinshan and Bobijam are already working on simplifying cl_lock as part of the CLIO cleanup project. Have added them to the CC list of this bug to discuss possible solutions. |
| Comment by Alexey Lyashkov [ 28/Nov/13 ] |
|
From my point of view, we need to implement something like options 2+3. Regarding 2: the change is just to avoid any recursive dependency in the preparation phase of an operation. Another solution in that area would be to separate the lock namespaces completely: FLOCKs from IBITS, extents from glimpse, and so on. I think we need to invite Alex and Tappro to this discussion, but I am not able to add new watchers to this ticket. |
| Comment by Jinshan Xiong (Inactive) [ 02/Dec/13 ] |
|
It seems that the 2nd option is a good bet. The glimpse requests should be sent out ASAP, so it makes no sense to write some data back before doing that. A rule of thumb is never to block ptlrpc threads, and we have taken this into account in the cl_lock rebuild project. |
| Comment by Alexey Lyashkov [ 03/Dec/13 ] |
|
Jay, what about separating the declare and execute phases for any CLIO (and maybe MD) operations? |
| Comment by Vitaly Fertman [ 04/Dec/13 ] |
|
What occurred with the glimpse RPC may also happen with a regular enqueue RPC, so let's separate these problems. Regarding 2: we may want to drop LRU cancellation on enqueue RPCs, or only on glimpse enqueue RPCs, or even leave it as is. The original idea was that the shrinker needs to cancel aged locks from the LRU, and instead of creating a new RPC from time to time, we may distribute this job among existing enqueue RPCs. |
| Comment by Alexey Lyashkov [ 05/Dec/13 ] |
|
Vitaly, I think we need to adjust the LRU policy to avoid cancelling locks with dirty data. That would solve both issues: it makes glimpse faster and avoids any waiting in lock enqueue. What do you think? |
| Comment by Jinshan Xiong (Inactive) [ 05/Dec/13 ] |
|
That sounds perfect in my opinion, or we could cancel only PR locks. Please take a look at osc_ldlm_weigh_ast(), which can be revised to figure out whether there are any dirty pages covered by a specific lock.
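A minimal userspace sketch of that policy, with hypothetical structure and function names (this is not the real ldlm LRU code): when picking LRU locks to cancel from an enqueue path, skip anything that is not a PR lock or that still covers dirty pages, so a cancellation can never end up waiting on write-back.

```
#include <stddef.h>

enum lock_mode { LOCK_PR, LOCK_PW };

struct lru_lock {
        struct lru_lock *next;
        enum lock_mode   mode;
        unsigned long    dirty_pages;   /* what a weigh callback would report */
};

/* Walk the LRU (oldest first) and pick up to 'max' locks that are safe to
 * cancel without blocking: read locks that cover no dirty pages. */
static size_t lru_select_for_cancel(struct lru_lock *lru,
                                    struct lru_lock **out, size_t max)
{
        size_t n = 0;
        struct lru_lock *lck;

        for (lck = lru; lck != NULL && n < max; lck = lck->next) {
                if (lck->mode != LOCK_PR)   /* skip write locks entirely */
                        continue;
                if (lck->dirty_pages != 0)  /* skip anything needing a flush */
                        continue;
                out[n++] = lck;
        }
        return n;
}
```

In the real client the dirty-page information would come from something like the osc_ldlm_weigh_ast() check mentioned above rather than from a cached field. |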
| Comment by Cheng Shao (Inactive) [ 10/Dec/13 ] |
|
Hi Jinshan, just to follow up with you: will you be working on it, or is it OK for Xyratex to provide a patch for you to review? Thanks. |
| Comment by Jinshan Xiong (Inactive) [ 10/Dec/13 ] |
|
I'm overwhelmed, so I would appreciate it if you guys could create a patch for this. |
| Comment by Cheng Shao (Inactive) [ 11/Dec/13 ] |
|
OK, thanks Jinshan. |
| Comment by Alexey Lyashkov [ 11/Jan/14 ] |
|
This ticket affects 2.6 as well. |
| Comment by Jinshan Xiong (Inactive) [ 29/Jan/14 ] |
|
Hi, do you guys have a patch for this issue yet? |
| Comment by James A Simmons [ 04/Feb/14 ] |
|
This problem is also impacting ORNL. Does a reproducer exist? |
| Comment by James A Simmons [ 04/Feb/14 ] |
|
This is a critical bug for us. Alexey, if you have a patch, can you push it so it can be tested and inspected? |
| Comment by Patrick Farrell (Inactive) [ 04/Feb/14 ] |
|
James, Jinshan, Cray's been working with Xyratex on this. Unfortunately, they haven't been able to come up with a patch for this issue yet. If someone from Intel is able to look at it, we can ask them to speak to you. |
| Comment by Alexey Lyashkov [ 04/Feb/14 ] |
|
Ann thinks a simple fix should be enough as a workaround until we have the complete fix. I will push a patch in the next two days. |
| Comment by Shuichi Ihara (Inactive) [ 04/Feb/14 ] |
|
James, agreed with you. This is critical, and we also hit random client crashes due to this issue. As a quick workaround, "echo 0 > /proc/fs/lustre/ldlm/namespaces/*/early_lock_cancel" on the clients helped avoid this issue in our case. Hope this helps in your case too. |
| Comment by Bruno Faccini (Inactive) [ 06/Feb/14 ] |
|
Yes, it would be interesting to know if disabling ELC, the work-around we found for |
| Comment by Alexey Lyashkov [ 06/Feb/14 ] |
|
Please try the attached fix and report whether it helps. |
| Comment by James A Simmons [ 06/Feb/14 ] |
|
Pushed the patch to gerrit so it can run through maloo. |
| Comment by James A Simmons [ 06/Feb/14 ] |
|
We also disabled ELC and our problems did stop. |
| Comment by Jinshan Xiong (Inactive) [ 07/Feb/14 ] |
|
That patch is a good workaround, but it will affect performance. I'm going to work out another solution that does not do early cancel for glimpse lock enqueue.
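The general shape of that idea, as an illustrative sketch only (not the actual patch; the names here are hypothetical): when preparing the list of early cancels that gets piggy-backed on an enqueue, contribute nothing if the enqueue is a glimpse, so the glimpse request can always be sent immediately.

```
#include <stdbool.h>

/*
 * Illustrative sketch only -- not the real ldlm request-preparation logic.
 * 'is_glimpse' and 'aged_lru_locks' are hypothetical inputs.
 */
static int elc_count_for_enqueue(bool is_glimpse, int aged_lru_locks)
{
        /* A glimpse must never wait on write-back triggered by cancelling
         * LRU locks, so it piggy-backs no early cancels at all. */
        if (is_glimpse)
                return 0;

        /* Regular enqueues keep distributing LRU cleanup as before. */
        return aged_lru_locks;
}
```

Regular enqueue RPCs keep carrying LRU cancels, so the cleanup work Vitaly described is not lost; only the glimpse path gives it up. |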
| Comment by Jinshan Xiong (Inactive) [ 07/Feb/14 ] |
|
I pushed a new patch at http://review.whamcloud.com/9175. I haven't run any tests, so please run the sanity test and see if it fixes the problem as well. I will run tests tomorrow. |
| Comment by James A Simmons [ 07/Feb/14 ] |
|
I'm seeing jobs failing to run with patch 9156. |
| Comment by Jinshan Xiong (Inactive) [ 07/Feb/14 ] |
|
Did they fail for exactly the same reason, or a new one? |
| Comment by James A Simmons [ 17/Feb/14 ] |
|
Looks like a new one. I never saw job failures before the Xyratex patch at small scale; only at large scale did we see problems. Either way, I will try the Intel patch next. |
| Comment by Jodi Levi (Inactive) [ 07/Mar/14 ] |
|
Patch has landed to Master. |
| Comment by James Nunez (Inactive) [ 24/Apr/14 ] |
|
b2_5 patch at http://review.whamcloud.com/#/c/9451/ |
| Comment by Bruno Faccini (Inactive) [ 24/Apr/14 ] |
|
James and all, I wonder why patch sets #2/#3 of the b2_5 patch mentioned above (http://review.whamcloud.com/#/c/9451/) only touch the osc/osc_request.c file, versus much more in patch set #1 and in the master version (http://review.whamcloud.com/9175)? |
| Comment by James A Simmons [ 29/Apr/14 ] |
|
The reason is that Jinshan pointed out that the other part of the fix in master is related to the CLIO cleanup work. For b2_5 we only need the simpler fix. |
| Comment by Alexey Lyashkov [ 23/Oct/14 ] |
|
For the record: Jay's changes in osc_cancel_for_recovery() re-introduced the deadlock during recovery. |