[LU-4300] ptlrpcd threads deadlocked in cl_lock_mutex_get Created: 22/Nov/13  Updated: 06/Sep/19  Resolved: 07/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.2

Type: Bug Priority: Blocker
Reporter: Sergey Cheremencev Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: MB

Attachments: File ptlrpcd-rpc.info     File quick-fix.diff     Text File rpcs.log    
Issue Links:
Duplicate
duplicates LU-4552 osc_cache.c:899:osc_extent_wait() tim... Resolved
is duplicated by LU-465 sanity.sh test_99e failed with timeout Resolved
Related
is related to LU-5727 MDS OOMs with 2.5.3 clients and lru_s... Resolved
is related to LU-11518 lock_count is exceeding lru_size Resolved
Severity: 3
Rank (Obsolete): 11787

 Description   

The TraveltimeLaunc application thread holds a cl_lock and triggers a call of cl_lock_cancel. Cancelling the lock requires an OST_WRITE RPC, but the OST_WRITE cannot be sent because all ptlrpcd threads are waiting on the cl_lock.

As a result, the OST_WRITE RPC is stuck in the new-requests queue and cannot be sent, because ptlrpcd_19 is blocked in the osc lock upcall.

Dump from the ptlrpcd threads: each thread has a recovery set and two "production" sets, new and in-processing. Each RPC is dumped in the format

$rpc_ptr : $opc : $rq_status

Queue of thread 19 (ptlrpcd_19 - 0xffff881fe70fae70); opcode 101 is LDLM_ENQUEUE and opcode 4 is OST_WRITE:

new requests
rpc 0xffff880faf27e000 : 101 - 0
rpc 0xffff881feff5d800 : 4 - 0
requests in processing

All 32 ptlrpcd threads are waiting with this stack trace:

> 2013-10-15T15:14:18.766544-05:00 c0-0c0s5n3 ptlrpcd_0       D 0000000000000000     0  4183      2 0x00000000
> 2013-10-15T15:14:18.791825-05:00 c0-0c0s5n3 ffff881fe7085aa0 0000000000000046 ffff881fe7084010 ffff881fea497040
> 2013-10-15T15:14:18.792017-05:00 c0-0c0s5n3 0000000000010b00 ffff881fe7085fd8 ffff881fe7085fd8 0000000000010b00
> 2013-10-15T15:14:18.817016-05:00 c0-0c0s5n3 0000000000000013 ffff881fea497040 0000000000000000 ffff881fe9800040
> 2013-10-15T15:14:18.817082-05:00 c0-0c0s5n3 Call Trace:
> 2013-10-15T15:14:18.817244-05:00 c0-0c0s5n3 [<ffffffff8138311f>] schedule+0x3f/0x60
> 2013-10-15T15:14:18.817388-05:00 c0-0c0s5n3 [<ffffffff81383f92>] __mutex_lock_slowpath+0x102/0x180
> 2013-10-15T15:14:18.842380-05:00 c0-0c0s5n3 [<ffffffff81383993>] mutex_lock+0x23/0x40
> 2013-10-15T15:14:18.842582-05:00 c0-0c0s5n3 [<ffffffffa0375806>] cl_lock_mutex_get+0x76/0xd0 [obdclass]
> 2013-10-15T15:14:18.867656-05:00 c0-0c0s5n3 [<ffffffffa07bbc08>] lovsub_parent_lock+0x48/0x120 [lov]
> 2013-10-15T15:14:18.867709-05:00 c0-0c0s5n3 [<ffffffffa07bc368>] lovsub_lock_state+0x68/0x1a0 [lov]
> 2013-10-15T15:14:18.867880-05:00 c0-0c0s5n3 [<ffffffffa0373558>] cl_lock_state_signal+0x68/0x160 [obdclass]
> 2013-10-15T15:14:18.893131-05:00 c0-0c0s5n3 [<ffffffffa0373e7a>] cl_lock_signal+0x5a/0x130 [obdclass]
> 2013-10-15T15:14:18.893285-05:00 c0-0c0s5n3 [<ffffffffa072279d>] osc_lock_upcall+0x25d/0x610 [osc]
> 2013-10-15T15:14:18.893419-05:00 c0-0c0s5n3 [<ffffffffa07041bd>] osc_enqueue_fini+0x9d/0x240 [osc]
> 2013-10-15T15:14:18.918520-05:00 c0-0c0s5n3 [<ffffffffa0706eb7>] osc_enqueue_interpret+0xe7/0x1d0 [osc]
> 2013-10-15T15:14:18.918702-05:00 c0-0c0s5n3 [<ffffffffa049733c>] ptlrpc_check_set+0x52c/0x1e20 [ptlrpc]
> 2013-10-15T15:14:18.943647-05:00 c0-0c0s5n3 [<ffffffffa04c4acb>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
> 2013-10-15T15:14:18.943885-05:00 c0-0c0s5n3 [<ffffffffa04c504b>] ptlrpcd+0x29b/0x3b0 [ptlrpc]
> 2013-10-15T15:14:18.943979-05:00 c0-0c0s5n3 [<ffffffff8138de94>] kernel_thread_helper+0x4/0x10

The following application process is waiting for an RPC to be sent:

> 2013-10-15T15:14:31.200904-05:00 c0-0c0s5n3 TraveltimeLaunc S 00000001050e94cc     0 16793  16777 0x00000000
> 2013-10-15T15:14:31.200927-05:00 c0-0c0s5n3 ffff880fe8979308 0000000000000082 ffff880fe8978010 ffff880faca8e040
> 2013-10-15T15:14:31.200967-05:00 c0-0c0s5n3 0000000000010b00 ffff880fe8979fd8 ffff880fe8979fd8 0000000000010b00
> 2013-10-15T15:14:31.226179-05:00 c0-0c0s5n3 0000000000000000 ffff880faca8e040 0000000000000000 ffffffff81667020
> 2013-10-15T15:14:31.226195-05:00 c0-0c0s5n3 Call Trace:
> 2013-10-15T15:14:31.226219-05:00 c0-0c0s5n3 [<ffffffff8138311f>] schedule+0x3f/0x60
> 2013-10-15T15:14:31.251476-05:00 c0-0c0s5n3 [<ffffffffa021d74e>] cfs_waitq_wait+0xe/0x10 [libcfs]
> 2013-10-15T15:14:31.251533-05:00 c0-0c0s5n3 [<ffffffffa0731e56>] osc_extent_wait+0x576/0x630 [osc]
> 2013-10-15T15:14:31.251563-05:00 c0-0c0s5n3 [<ffffffffa0732447>] osc_cache_wait_range+0x537/0x820 [osc]
> 2013-10-15T15:14:31.276843-05:00 c0-0c0s5n3 [<ffffffffa0733199>] osc_cache_writeback_range+0xa69/0x1030 [osc]
> 2013-10-15T15:14:31.276897-05:00 c0-0c0s5n3 [<ffffffffa071febe>] osc_lock_flush+0x7e/0x260 [osc]
> 2013-10-15T15:14:31.302146-05:00 c0-0c0s5n3 [<ffffffffa0720181>] osc_lock_cancel+0xe1/0x1c0 [osc]
> 2013-10-15T15:14:31.302187-05:00 c0-0c0s5n3 [<ffffffffa03730d5>] cl_lock_cancel0+0x75/0x160 [obdclass]
> 2013-10-15T15:14:31.302229-05:00 c0-0c0s5n3 [<ffffffffa0373e1b>] cl_lock_cancel+0x13b/0x140 [obdclass]
> 2013-10-15T15:14:31.327498-05:00 c0-0c0s5n3 [<ffffffffa072162c>] osc_ldlm_blocking_ast+0x20c/0x330 [osc]
> 2013-10-15T15:14:31.327566-05:00 c0-0c0s5n3 [<ffffffffa046baab>] ldlm_cancel_callback+0x6b/0x190 [ptlrpc]
> 2013-10-15T15:14:31.352863-05:00 c0-0c0s5n3 [<ffffffffa0479d1a>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
> 2013-10-15T15:14:31.352917-05:00 c0-0c0s5n3 [<ffffffffa047cfec>] ldlm_cli_cancel_list_local+0xec/0x290 [ptlrpc]
> 2013-10-15T15:14:31.378183-05:00 c0-0c0s5n3 [<ffffffffa047dee5>] ldlm_cancel_lru_local+0x35/0x40 [ptlrpc]
> 2013-10-15T15:14:31.378216-05:00 c0-0c0s5n3 [<ffffffffa047f36f>] ldlm_prep_elc_req+0x3df/0x490 [ptlrpc]
> 2013-10-15T15:14:31.403528-05:00 c0-0c0s5n3 [<ffffffffa047f448>] ldlm_prep_enqueue_req+0x28/0x30 [ptlrpc]
> 2013-10-15T15:14:31.403581-05:00 c0-0c0s5n3 [<ffffffffa0706963>] osc_enqueue_base+0x103/0x570 [osc]
> 2013-10-15T15:14:31.403617-05:00 c0-0c0s5n3 [<ffffffffa0720e36>] osc_lock_enqueue+0x506/0x900 [osc]
> 2013-10-15T15:14:31.428880-05:00 c0-0c0s5n3 [<ffffffffa03777bb>] cl_enqueue_try+0xfb/0x320 [obdclass]
> 2013-10-15T15:14:31.428980-05:00 c0-0c0s5n3 [<ffffffffa07b4a54>] lov_lock_enqueue+0x1f4/0x890 [lov]
> 2013-10-15T15:14:31.429041-05:00 c0-0c0s5n3 [<ffffffffa03777bb>] cl_enqueue_try+0xfb/0x320 [obdclass]
> 2013-10-15T15:14:31.454182-05:00 c0-0c0s5n3 [<ffffffffa037869f>] cl_enqueue_locked+0x7f/0x1f0 [obdclass]
> 2013-10-15T15:14:31.454296-05:00 c0-0c0s5n3 [<ffffffffa03792ce>] cl_lock_request+0x7e/0x270 [obdclass]
> 2013-10-15T15:14:31.454354-05:00 c0-0c0s5n3 [<ffffffffa0883e9f>] cl_glimpse_lock+0x17f/0x490 [lustre]
> 2013-10-15T15:14:31.479540-05:00 c0-0c0s5n3 [<ffffffffa088434d>] cl_glimpse_size0+0x19d/0x1c0 [lustre]
> 2013-10-15T15:14:31.504845-05:00 c0-0c0s5n3 [<ffffffffa0837650>] ll_inode_revalidate_it+0x1b0/0x1d0 [lustre]
> 2013-10-15T15:14:31.504912-05:00 c0-0c0s5n3 [<ffffffffa08376b6>] ll_getattr_it+0x46/0x180 [lustre]
> 2013-10-15T15:14:31.504990-05:00 c0-0c0s5n3 [<ffffffffa083782c>] ll_getattr+0x3c/0x40 [lustre]
> 2013-10-15T15:14:31.505021-05:00 c0-0c0s5n3 [<ffffffff81147355>] vfs_getattr+0x25/0x50
> 2013-10-15T15:14:31.530127-05:00 c0-0c0s5n3 [<ffffffff81147920>] vfs_fstatat+0x80/0x90
> 2013-10-15T15:14:31.530209-05:00 c0-0c0s5n3 [<ffffffff81147a5b>] vfs_stat+0x1b/0x20
> 2013-10-15T15:14:31.530264-05:00 c0-0c0s5n3 [<ffffffff81147a84>] sys_newstat+0x24/0x50
> 2013-10-15T15:14:31.530321-05:00 c0-0c0s5n3 [<ffffffff8138cdab>] system_call_fastpath+0x16/0x1b


 Comments   
Comment by Alexey Lyashkov [ 22/Nov/13 ]

The problem is simple.

The GLIMPSE lock is in the NEW state; to switch into the ENQUEUE state it has to send a request for each stripe assigned to the file. Fine so far. But whenever we send a lock enqueue request we also cancel unused, aged locks in the same namespace, so we may have to cancel an extent lock covering dirty data. Before 2.4 (or, better said, before the ptlrpcd "optimisation"), a lock enqueue was sent from a normal thread while the lock cancel sent its BRW request via a dedicated brw thread; there were no dependencies between these requests, so the BRW finished and was processed in its own thread. But now:
the glimpse reply handler tries to take the top-level cl_lock, which is held by a thread waiting for a BRW reply. So far the same as before, with one difference: the BRW request now sits in set_new_requests of the very ptlrpcd thread that is trying to take the top-level mutex, so it never gets a chance to go out over the network. Deadlocked.
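
To make the cycle concrete, here is a minimal, stand-alone pthreads program showing the same circular wait. This is an analogy only, not Lustre code: "worker" stands in for the one relevant ptlrpcd thread (it must finish completing the in-flight reply before it ever looks at its new-request queue), "top_lock" for the top-level cl_lock, and the queued flag for the OST_WRITE sitting in set_new_requests.

    /* Analogy only: none of these names are Lustre APIs. */
    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t top_lock = PTHREAD_MUTEX_INITIALIZER;  /* "cl_lock" */
    static pthread_mutex_t q_lock   = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_cond   = PTHREAD_COND_INITIALIZER;
    static bool write_queued = false;   /* "OST_WRITE" in the new-request queue */
    static bool write_done   = false;

    /* The lone worker: it must finish the in-flight "glimpse reply" (which
     * needs top_lock) before it ever services its new-request queue. */
    static void *worker(void *arg)
    {
        (void)arg;

        pthread_mutex_lock(&top_lock);        /* blocks forever: main holds it */
        pthread_mutex_unlock(&top_lock);

        pthread_mutex_lock(&q_lock);          /* never reached */
        if (write_queued)
            write_done = true;
        pthread_cond_broadcast(&q_cond);
        pthread_mutex_unlock(&q_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;

        /* "Application": enters the cancel path holding the top-level lock. */
        pthread_mutex_lock(&top_lock);
        pthread_create(&tid, NULL, worker, NULL);

        /* Queue a "write" for the worker and wait for its completion while
         * still holding the top-level lock -> circular wait, deadlock. */
        pthread_mutex_lock(&q_lock);
        write_queued = true;
        while (!write_done)
            pthread_cond_wait(&q_cond, &q_lock);
        pthread_mutex_unlock(&q_lock);

        pthread_mutex_unlock(&top_lock);
        pthread_join(tid, NULL);
        return 0;
    }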

Xyratex has several ideas about how this could be fixed:
1) The simplest: restore the _brw threads, so that the situation is resolved as originally designed in CLIO. But ptlrpcd threads would still block, probably for a long time, which may be bad if one of the OSTs is down.
2) Change the ELC/LRU policy to avoid cancelling extent locks while taking a glimpse lock, or add some tuning so that only locks without dirty data, which can be cancelled locally without sending RPCs, are eligible for cancellation.
3) Change CLIO to be closer to the older LOV design: prepare the full set of requests before sending any request over the wire. In that case a glimpse reply can never be received while we hold a top-level lock, and flushing dirty data is separated from processing the glimpse reply, so the deadlock is not possible.

We want to discuss this with Intel before we prepare a patch for this issue.

Comment by Andreas Dilger [ 27/Nov/13 ]

I think Jinshan and Bobijam are already working on simplifying cl_lock as part of the CLIO cleanup project. Have added them to the CC list of this bug to discuss possible solutions.

Comment by Alexey Lyashkov [ 28/Nov/13 ]

From my point of view we need to implement something like a combination of 2 and 3.
It doesn't require large changes to the cl_lock design, but it does require dividing every CLIO operation into two parts: preparing and executing.
That would be close to the transactions on the server side, where we also have declare and execute phases.
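
For illustration, a minimal sketch of such a declare/execute split, assuming an operation can first record every sub-request it will need and only then send them. The names op_declare/op_execute and struct batch are hypothetical, not CLIO interfaces.

    /* Hypothetical two-phase operation: declare everything, then execute. */
    #include <stdio.h>
    #include <stdlib.h>

    struct action {
        const char    *what;        /* e.g. "enqueue stripe N" */
        struct action *next;
    };

    struct batch {                  /* everything the operation will do */
        struct action *head;
    };

    /* Phase 1: only record what will be sent; nothing goes over the wire
     * and nothing can block on a reply here. */
    static void op_declare(struct batch *b, const char *what)
    {
        struct action *a = malloc(sizeof(*a));

        if (a == NULL)
            abort();
        a->what = what;
        a->next = b->head;
        b->head = a;
    }

    /* Phase 2: with the full request set known, send it all; replies can
     * now be processed without any caller-held lock in the way. */
    static void op_execute(struct batch *b)
    {
        struct action *a = b->head;

        while (a != NULL) {
            struct action *next = a->next;

            printf("send: %s\n", a->what);
            free(a);
            a = next;
        }
        b->head = NULL;
    }

    int main(void)
    {
        struct batch b = { .head = NULL };

        op_declare(&b, "glimpse enqueue, stripe 0");
        op_declare(&b, "glimpse enqueue, stripe 1");
        op_execute(&b);             /* no request is sent before declare is done */
        return 0;
    }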

As for 2, those changes are just about avoiding any recursive dependency in the preparing phase. Another solution in that area would be to separate the lock namespaces completely: FLOCKs from IBITS, extents from glimpse...

I think we need to invite Alex and Tappro to this discussion, but I am not able to add new watchers to this ticket.

Comment by Jinshan Xiong (Inactive) [ 02/Dec/13 ]

It seems that the 2nd option is a good bet. The glimpse requests should be sent out ASAP, so it makes no sense to write data back before doing that.

A rule of thumb is to never block ptlrpc threads, and we have taken this into account in the cl_lock rebuild project.

Comment by Alexey Lyashkov [ 03/Dec/13 ]

Jay,

What about separating the declare and execute phases for all CLIO (and maybe MD) operations?

Comment by Vitaly Fertman [ 04/Dec/13 ]

What occurred here with the glimpse RPC may also happen with a regular enqueue RPC, so let's separate these problems:
1 - the current issue;
2 - glimpse vs. the lru_resize optimisation.

Regarding 2: we may want to drop LRU cancellation on enqueue RPCs, or only on glimpse enqueue RPCs, or even leave it as is. The original idea was that the shrinker needs to cancel aged locks from the LRU anyway, and instead of creating a new RPC for that from time to time, we may distribute this job among existing enqueue RPCs.
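
A minimal sketch of that idea, i.e. piggybacking the cancellation of aged, unused cached locks onto an enqueue that is going out anyway. The structures and pick_elc() below are hypothetical illustrations, not the ldlm_prep_elc_req() interface.

    #include <stdio.h>
    #include <time.h>

    #define MAX_ELC 4

    struct cached_lock {
        int    handle;
        int    in_use;          /* has active readers/writers */
        time_t last_used;
    };

    struct enqueue_req {
        int cancel_handles[MAX_ELC];   /* cancels carried by this enqueue */
        int ncancels;
    };

    /* Walk the LRU and move up to MAX_ELC aged, unused locks into the
     * outgoing enqueue request instead of sending a separate cancel RPC. */
    static void pick_elc(struct enqueue_req *req, const struct cached_lock *lru,
                         int nlocks, time_t now, time_t max_age)
    {
        req->ncancels = 0;
        for (int i = 0; i < nlocks && req->ncancels < MAX_ELC; i++) {
            if (!lru[i].in_use && now - lru[i].last_used > max_age)
                req->cancel_handles[req->ncancels++] = lru[i].handle;
        }
    }

    int main(void)
    {
        time_t now = time(NULL);
        struct cached_lock lru[] = {
            { .handle = 1, .in_use = 0, .last_used = now - 600 },  /* aged   */
            { .handle = 2, .in_use = 1, .last_used = now - 600 },  /* in use */
            { .handle = 3, .in_use = 0, .last_used = now - 5   },  /* fresh  */
        };
        struct enqueue_req req;

        pick_elc(&req, lru, 3, now, 300);
        printf("enqueue carries %d cancel(s)\n", req.ncancels);   /* -> 1 */
        return 0;
    }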

Comment by Alexey Lyashkov [ 05/Dec/13 ]

Vitaly,

I think we need to adjust the LRU policy to avoid cancelling locks with dirty data. That would solve both issues: it makes glimpse faster and avoids any waiting in lock enqueue. What do you think?

Comment by Jinshan Xiong (Inactive) [ 05/Dec/13 ]

That sounds perfect in my opinion; alternatively, we could cancel only PR locks. Please take a look at osc_ldlm_weigh_ast(), which can be revised to figure out whether there are any dirty pages covered by a specific lock.
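
A rough sketch of that policy, assuming a simple per-lock check: a cached lock is eligible for ELC/LRU cancellation only if its extent covers no dirty pages, so cancelling it can never require a writeback from this context. The structures and covers_dirty_page() below are hypothetical, not the osc_ldlm_weigh_ast() interface.

    #include <stdbool.h>
    #include <stdio.h>

    struct extent { unsigned long start_page, end_page; };   /* inclusive */

    /* True if any page index in [start_page, end_page] is dirty. */
    static bool covers_dirty_page(const struct extent *ext,
                                  const unsigned long *dirty, int ndirty)
    {
        for (int i = 0; i < ndirty; i++)
            if (dirty[i] >= ext->start_page && dirty[i] <= ext->end_page)
                return true;
        return false;
    }

    /* Only locks with no dirty data under them can be dropped without an
     * OST_WRITE, so only those are safe to cancel from the enqueue path. */
    static bool elc_may_cancel(const struct extent *ext,
                               const unsigned long *dirty, int ndirty)
    {
        return !covers_dirty_page(ext, dirty, ndirty);
    }

    int main(void)
    {
        unsigned long dirty[] = { 10, 11, 12 };
        struct extent clean = { .start_page = 100, .end_page = 200 };
        struct extent busy  = { .start_page = 0,   .end_page = 64  };

        printf("clean lock: %s\n", elc_may_cancel(&clean, dirty, 3) ? "cancel" : "skip");
        printf("dirty lock: %s\n", elc_may_cancel(&busy,  dirty, 3) ? "cancel" : "skip");
        return 0;
    }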

Comment by Cheng Shao (Inactive) [ 10/Dec/13 ]

Hi Jinshan, just to follow up with you: will you be working on this, or is it OK for Xyratex to provide a patch for you to review? Thanks.

Comment by Jinshan Xiong (Inactive) [ 10/Dec/13 ]

I'm overwhelmed, so I would appreciate it if you guys could create a patch for this.

Comment by Cheng Shao (Inactive) [ 11/Dec/13 ]

OK, thanks Jinshan.

Comment by Alexey Lyashkov [ 11/Jan/14 ]

This ticket affects 2.6 as well.

Comment by Jinshan Xiong (Inactive) [ 29/Jan/14 ]

Hi, do you guys have a patch for this issue yet?

Comment by James A Simmons [ 04/Feb/14 ]

This problem is also impacting ORNL. Does a reproducer exist?

Comment by James A Simmons [ 04/Feb/14 ]

This is a critical bug for us. Alexey, if you have a patch, can you push it so it can be tested and inspected?

Comment by Patrick Farrell (Inactive) [ 04/Feb/14 ]

James, Jinshan,

Cray's been working with Xyratex on this. Unfortunately, they haven't been able to come up with a patch for this issue yet. If someone from Intel is able to look at it, we can ask them to speak to you.

Comment by Alexey Lyashkov [ 04/Feb/14 ]

I think a simple fix should be enough as a workaround until we have the complete fix. I will push a patch in the next two days.

Comment by Shuichi Ihara (Inactive) [ 04/Feb/14 ]

James, agreed with you. This is critical, and we also hit random client crashes due to this issue. As a quick workaround, "echo 0 > /proc/fs/lustre/ldlm/namespaces/*/early_lock_cancel" on the clients helped to avoid this issue in our case. Hope this helps in your case too.

Comment by Bruno Faccini (Inactive) [ 06/Feb/14 ]

Yes, it would be interesting to know whether disabling ELC, the workaround we found for LU-4552, also works here. That is highly likely, since the Lustre version, stack traces, and timeout situation are very similar; from the very beginning, reading both tickets, I quickly had a strong suspicion that they could be duplicates.

Comment by Alexey Lyashkov [ 06/Feb/14 ]

Please try the attached fix and report whether it helps.

Comment by James A Simmons [ 06/Feb/14 ]

Pushed the patch to gerrit so it can run through maloo.

http://review.whamcloud.com/#/c/9156

Comment by James A Simmons [ 06/Feb/14 ]

We also disabled ELC and our problems did stop.

Comment by Jinshan Xiong (Inactive) [ 07/Feb/14 ]

That patch is a good workaround, but it will affect performance. I'm going to work out another solution that does not do early cancel for glimpse lock enqueues.

Comment by Jinshan Xiong (Inactive) [ 07/Feb/14 ]

I pushed a new patch at: http://review.whamcloud.com/9175

I haven't run any tests, so please run the sanity test and then see whether it fixes the problem as well.

I will run test tomorrow.

Comment by James A Simmons [ 07/Feb/14 ]

I'm seeing jobs failing to run with patch 9156.

Comment by Jinshan Xiong (Inactive) [ 07/Feb/14 ]

Did it fail due to exactly the same reason, or a new one?

Comment by James A Simmons [ 17/Feb/14 ]

Looks like a new one. I never saw job failures at small scale before the Xyratex patch; we only saw problems at large scale. Either way, I will try the Intel patch next.

Comment by Jodi Levi (Inactive) [ 07/Mar/14 ]

Patch has landed to Master.

Comment by James Nunez (Inactive) [ 24/Apr/14 ]

b2_5 patch at http://review.whamcloud.com/#/c/9451/

Comment by Bruno Faccini (Inactive) [ 24/Apr/14 ]

James and all, I wonder why patch sets #2/#3 of the b2_5 patch mentioned (http://review.whamcloud.com/#/c/9451/) only affect the osc/osc_request.c file, versus much more in its patch set #1, as in the master version (http://review.whamcloud.com/9175)?

Comment by James A Simmons [ 29/Apr/14 ]

The reason is that Jinshan pointed out that the other part of the fix in master is related to the CLIO cleanup work. For b2_5 we only need the simpler fix.

Comment by Alexey Lyashkov [ 23/Oct/14 ]

For the record: Jay's changes in osc_cancel_for_recovery() re-introduced the deadlock during recovery.
