[LU-5781] endless loop in osc_lock_weight() Created: 21/Oct/14 Updated: 01/Jul/16 Resolved: 20/Jul/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.5.1, Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Zarochentsev | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | p4b, patch | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 16220 |
| Description |
|
In our testing a Lustre client hangs forever after OST failover. The client had 64 GB of RAM and was running a parallel file test application over a set of 1 GB files. Right before the complete hang, the ptlrpcd_rcv thread was consuming 100% CPU. A crashdump was taken; it shows the following stack trace for ptlrpcd_rcv:
crash> foreach ptlrpcd_rcv bt
PID: 17113 TASK: ffff8806323f8ae0 CPU: 22 COMMAND: "ptlrpcd_rcv"
#0 [ffff880655547e90] crash_nmi_callback at ffffffff8102fee6
#1 [ffff880655547ea0] notifier_call_chain at ffffffff8152a965
#2 [ffff880655547ee0] atomic_notifier_call_chain at ffffffff8152a9ca
#3 [ffff880655547ef0] notify_die at ffffffff810a12be
#4 [ffff880655547f20] do_nmi at ffffffff8152862b
#5 [ffff880655547f50] nmi at ffffffff81527ef0
[exception RIP: cl_page_put+426]
RIP: ffffffffa05f428a RSP: ffff8805d85d9990 RFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff88047c921800 RCX: ffff8805d85d9ab8
RDX: 0000000000000000 RSI: ffff88047c921800 RDI: ffff880bc515ee60
RBP: ffff8805d85d99d0 R8: 0000000000000040 R9: ffff8805d85d99a0
R10: 0000000000009500 R11: ffff880303aac8f0 R12: ffff880bc515ee60
R13: 000000000000004a R14: ffff880bc51d2e90 R15: ffff880bc515ee60
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff8805d85d9990] cl_page_put at ffffffffa05f428a [obdclass]
#7 [ffff8805d85d99d8] osc_page_gang_lookup at ffffffffa0ce3fac [osc]
#8 [ffff8805d85d9a78] osc_ldlm_weigh_ast at ffffffffa0cdb1b7 [osc]
#9 [ffff8805d85d9af8] osc_cancel_weight at ffffffffa0cc1a3b [osc]
#10 [ffff8805d85d9b08] ldlm_cancel_no_wait_policy at ffffffffa07ecec1 [ptlrpc]
#11 [ffff8805d85d9b28] ldlm_prepare_lru_list at ffffffffa07f0f0b [ptlrpc]
#12 [ffff8805d85d9ba8] ldlm_cancel_lru_local at ffffffffa07f1324 [ptlrpc]
#13 [ffff8805d85d9bc8] ldlm_replay_locks at ffffffffa07f14ac [ptlrpc]
#14 [ffff8805d85d9c48] ptlrpc_import_recovery_state_machine at ffffffffa083cb67 [ptlrpc]
#15 [ffff8805d85d9ca8] ptlrpc_replay_interpret at ffffffffa0811a84 [ptlrpc]
#16 [ffff8805d85d9cd8] ptlrpc_check_set at ffffffffa08130ec [ptlrpc]
#17 [ffff8805d85d9d78] ptlrpcd_check at ffffffffa08404db [ptlrpc]
#18 [ffff8805d85d9dd8] ptlrpcd at ffffffffa0840b2b [ptlrpc]
#19 [ffff8805d85d9ee8] kthread at ffffffff8109ac66
#20 [ffff8805d85d9f48] kernel_thread at ffffffff8100c20a
crash>
The hang is caused by an endless loop in osc_lock_weight(). This is confirmed by setting ldlm.cancel_unused_locks_before_replay=0 on the client before OST failover, which cures the hang. Moreover, I have a patch that makes osc_lock_weight() restart its page scan at the last processed page index + 1, so an endless loop is impossible; that patch helps as well. |
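For illustration, a minimal sketch of the "restart at last index + 1" idea, assuming a batched page lookup similar in spirit to osc_page_gang_lookup(); page_batch_lookup(), struct page_desc and SCAN_BATCH are hypothetical stand-ins, not the real Lustre interfaces or the actual patch:

```c
/*
 * Sketch only: restart each lookup strictly after the last processed page
 * so the scan always makes forward progress.
 */
#define SCAN_BATCH 16

struct page_desc {
	unsigned long pd_index;		/* page index within the object */
};

/* hypothetical helper: fill 'batch' with up to SCAN_BATCH pages whose
 * index is >= start, returning the number of pages found */
int page_batch_lookup(void *obj, unsigned long start,
		      struct page_desc **batch);

unsigned long scan_lock_pages(void *obj, unsigned long start,
			      unsigned long end)
{
	struct page_desc *batch[SCAN_BATCH];
	unsigned long idx = start;
	unsigned long weight = 0;
	int i, found;

	while (idx <= end) {
		found = page_batch_lookup(obj, idx, batch);
		if (found == 0)
			break;

		for (i = 0; i < found; i++)
			weight++;	/* "weigh" each page found */

		/*
		 * Resume strictly after the last processed page, so the loop
		 * cannot revisit the same index forever even if pages are
		 * inserted or removed concurrently.
		 */
		idx = batch[found - 1]->pd_index + 1;
	}

	return weight;
}
```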
| Comments |
| Comment by Alexander Zarochentsev [ 21/Oct/14 ] |
| Comment by Andreas Dilger [ 21/Oct/14 ] |
|
It looks like this code was introduced in http://review.whamcloud.com/7894 (commit 154fb1f7ce22). |
| Comment by Alexey Lyashkov [ 22/Oct/14 ] |
|
This is not the only problem in that area.

commit 30d65b1f34d77742298e92247f71e7f38c292676
Author: Jinshan Xiong <jinshan.xiong@intel.com>
Date: Thu Feb 6 23:03:25 2014 -0800
    LU-4300 ldlm: ELC picks locks in a safer policy

With that commit, osc may try to cancel a WRITE lock during replay, but we are free to dirty any page under the lock between checking it and cancelling it, because the check releases the per-resource lock. This introduces the deadlock from bz=16774: when we try to flush a lock with dirty data, the flush deadlocks because the RPC cannot be sent while the import state is not FULL. As an additional note, that patch does not solve the original problem from another bug in this area, also introduced by Jay in:

commit 9fe0aef3c7a87071d69d082bd24824bdcdf3f18c
Author: Jinshan Xiong <jinshan.xiong@intel.com>
Date: Tue Nov 5 20:42:40 2013 -0800
    LU-3321 osc: add weight function for DLM lock

Changing the simple atomic check to a gang lookup introduces a large performance problem when we have a read lock and many pages in client memory. These problems affect ELC cancelling as well. |
| Comment by Jinshan Xiong (Inactive) [ 22/Oct/14 ] |
|
Indeed, there is a race window in which this lock can be picked up for I/O. We can solve the problem by marking the lock as CBPENDING once osc_cancel_weight() decides to cancel it. This problem has existed for a long time, even before my patch landed; the atomic counter was simple, but that approach was wrong. The problem can be solved by tagging the radix tree for active I/O, or we can extend osc write extents into I/O extents, based on the fact that READ and WRITE extents can never overlap. I'm working with Vitaly on a solution. |
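A minimal sketch of the CBPENDING idea mentioned above: lock_res_and_lock(), unlock_res_and_lock() and LDLM_FL_CBPENDING are real Lustre names, but the helper name and the exact placement are assumptions about the suggestion, not the fix that landed.

```c
#include <lustre_dlm.h>		/* assumed header for the DLM declarations */

/*
 * Sketch only: once the cancel-weight callback has decided that 'lock'
 * is unused and worth cancelling, mark it CBPENDING under the resource
 * lock so that new I/O can no longer match and start using it.
 */
static void mark_lock_for_cancel(struct ldlm_lock *lock)
{
	lock_res_and_lock(lock);
	/* ldlm_lock_match() callers will now skip this lock */
	lock->l_flags |= LDLM_FL_CBPENDING;
	unlock_res_and_lock(lock);
}
```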
| Comment by Jinshan Xiong (Inactive) [ 22/Oct/14 ] |
|
Another good way to avoid the race is to check l_last_used and make sure it still matches the value recorded before pf() was called. |
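A rough sketch of how that check could look inside the LRU scan (the helper name lru_policy_checked() is hypothetical and the placement is an assumption, not the patch that eventually landed): snapshot l_last_used before invoking the policy function, and discard a CANCEL verdict if the value has changed, since that means the lock was used in the meantime.

```c
/*
 * Sketch only: field and type names follow the 2.x-era Lustre code, but
 * the surrounding LRU loop is simplified away.
 */
static ldlm_policy_res_t lru_policy_checked(ldlm_cancel_lru_policy_t pf,
					    struct ldlm_namespace *ns,
					    struct ldlm_lock *lock,
					    int unused, int added, int count)
{
	cfs_time_t last_used = lock->l_last_used;
	ldlm_policy_res_t result;

	/* pf() may scan pages and sleep; no spinlocks are held here */
	result = pf(ns, lock, unused, added, count);

	lock_res_and_lock(lock);
	if (result == LDLM_POLICY_CANCEL_LOCK &&
	    lock->l_last_used != last_used)
		/* the lock was used for I/O while pf() ran - keep it */
		result = LDLM_POLICY_KEEP_LOCK;
	unlock_res_and_lock(lock);

	return result;
}
```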
| Comment by Alexey Lyashkov [ 23/Oct/14 ] |
|
Absolutely. That race did not exist before your patch, because we did not try to check write locks:

```
-static int osc_cancel_for_recovery(struct ldlm_lock *lock)
+static int osc_cancel_weight(struct ldlm_lock *lock)
 {
 	/*
-	 * Cancel all unused extent lock in granted mode LCK_PR or LCK_CR.
-	 *
-	 * XXX as a future improvement, we can also cancel unused write lock
-	 * if it doesn't have dirty data and active mmaps.
+	 * Cancel all unused and granted extent lock.
 	 */
 	if (lock->l_resource->lr_type == LDLM_EXTENT &&
-	    (lock->l_granted_mode == LCK_PR || lock->l_granted_mode == LCK_CR)&&
+	    lock->l_granted_mode == lock->l_req_mode &&
 	    osc_ldlm_weigh_ast(lock) == 0)
 		RETURN(1);
```

So before that change we did not need to know anything about dirty pages, and PW locks were skipped from cancellation during recovery.

The same goes for the performance problem: I don't think it is a good idea to scan all pages in memory on each "aged" policy step.

```
@@ -1544,12 +1548,17 @@
 static ldlm_policy_res_t ldlm_cancel_aged_policy(struct ldlm_namespace *ns,
                                                  struct ldlm_lock *lock,
                                                  int unused, int added,
                                                  int count)
 {
-	/* Stop LRU processing if young lock is found and we reach past count */
-	return ((added >= count) &&
-		cfs_time_before(cfs_time_current(),
-				cfs_time_add(lock->l_last_used,
-					     ns->ns_max_age))) ?
-		LDLM_POLICY_KEEP_LOCK : LDLM_POLICY_CANCEL_LOCK;
+	if (added >= count)
+		return LDLM_POLICY_KEEP_LOCK;
+
+	if (cfs_time_before(cfs_time_current(),
+			    cfs_time_add(lock->l_last_used, ns->ns_max_age)))
+		return LDLM_POLICY_KEEP_LOCK;
+
+	if (ns->ns_cancel != NULL && ns->ns_cancel(lock) == 0)
+		return LDLM_POLICY_KEEP_LOCK;
+
+	return LDLM_POLICY_CANCEL_LOCK;
```

ns->ns_cancel() now scans pages in memory as it loops over the aged locks, and I don't think the client has enough CPU power to do that from time to time. A simple example: a large lock with some pages at the end and a large I/O queue. Another example: a large file opened for read with only a few pages changed. If that is below the system dirty-page limit, pdflush does not start flushing pages to disk, and since it is also below the OSC dirty-page limit the OSC does not force a flush either, yet you will scan it on every LRU loop. |
| Comment by Jinshan Xiong (Inactive) [ 23/Oct/14 ] |
Indeed, you're right about this. |
| Comment by Jinshan Xiong (Inactive) [ 06/Nov/14 ] |
|
I just pushed patch http://review.whamcloud.com/12603 to solve the race condition in the LRU policy check. |
| Comment by Vitaly Fertman [ 06/Nov/14 ] |
|
I think the proper solution for this ticket would be the following:
1. I was told that the CLIO simplification is going to land within several weeks, and that this new code changes readahead so that it takes the lock out of the LRU. As a result there is no need to check read-only pages under the lock: the lock will not be in the LRU until readahead completes.
2. The check for write pages should be replaced by an osc_extent rbtree search (a rough sketch follows below).
3. Once that is done, the check will be fast and the policy call can be placed under the spinlock. I have patches for (2) and (3), but they depend on (1); once it lands I can submit them.
4. The original patch from Zam will not be needed for lock_weight, however we may run |
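A minimal sketch of what item (2) might look like, assuming a per-object tree of write extents keyed by page index; struct write_extent, first_extent_after() and the field names are hypothetical illustrations, not the actual osc_extent interface.

```c
#include <linux/rbtree.h>
#include <linux/types.h>

/* Hypothetical per-object write extent, kept in an rbtree ordered by
 * starting page index; illustration only, not the osc_extent code. */
struct write_extent {
	struct rb_node	we_node;
	unsigned long	we_start;	/* first page index covered */
	unsigned long	we_end;		/* last page index covered */
};

/* hypothetical helper: leftmost extent whose we_end >= index, or NULL */
struct write_extent *first_extent_after(struct rb_root *root,
					unsigned long index);

/*
 * True if any write extent overlaps [start, end].  Such a lookup is
 * O(log n) in the number of extents, instead of a gang lookup over every
 * cached page under the lock.
 */
static bool range_has_write_extent(struct rb_root *root,
				   unsigned long start, unsigned long end)
{
	struct write_extent *ext = first_extent_after(root, start);

	return ext != NULL && ext->we_start <= end;
}
```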
| Comment by Jinshan Xiong (Inactive) [ 07/Nov/14 ] |
This is patch http://review.whamcloud.com/10859
We may still need patches 12603 and 12362 for 2.7 - these two patches are short and straightforward, so they are easier to land. You can do your work based on patches 10859, 10362, and 12603. Does this make sense? |
| Comment by Antoine Percher [ 07/Nov/14 ] |
|
Do you have a plan to add these patches to Lustre 2.5.4? |
| Comment by Vitaly Fertman [ 07/Nov/14 ] |
Yes, I can do it on top of 10859, but in that case 10362 will be completely removed and only (4) will be needed, which does not exist in that patch currently. 12603 is a slightly separate question, but that patch is still racy and I would prefer to put the pf() call under the spinlock - the lock state check and the removal from the LRU should be done atomically to avoid the race, but pf() must be fast for that. |
| Comment by Sebastien Buisson (Inactive) [ 20/Nov/14 ] |
|
Hi, the solution for b2_5 is unclear. Should we backport patch 12603, or backport patch 10859? TIA,
| Comment by Gerrit Updater [ 21/Nov/14 ] |
|
Vitaly Fertman (vitaly_fertman@xyratex.com) uploaded a new patch: http://review.whamcloud.com/12819 |
| Comment by Gerrit Updater [ 21/Nov/14 ] |
|
Vitaly Fertman (vitaly_fertman@xyratex.com) uploaded a new patch: http://review.whamcloud.com/12820 |
| Comment by Vitaly Fertman [ 21/Nov/14 ] |
I am not sure that backporting 10859 would be a good solution, as this is a new "feature". Thus, we decided just to revert " If there are no objections, I can submit patches for 2.5 as well, probably after the first round of inspections. |
| Comment by Peter Jones [ 22/Nov/14 ] |
|
Sebastien, it is probably less confusing to return your support needs on b2_5 to the original ticket, and so not conflate those with the approach needed on master.
Vitaly, thanks for the suggestion, but b2_5 does not include the
Peter
| Comment by Vitaly Fertman [ 24/Nov/14 ] |
|
Peter, it is better not to mix patches (neither joining nor splitting them) when landing on other branches, but to port them as they landed originally, preferably under the same ticket numbers. Otherwise it is confusing later which changes have landed or are missing on which branches, as here. Thus, LU-3321 landed to b2_5 as 0a6c6fc
| Comment by Peter Jones [ 24/Nov/14 ] |
|
Vitaly, I agree that such practices are confusing and it would be much simpler to have 1:1 tracking for each ticket and each code change. However, I do not think that we can completely eliminate such confusion - sometimes it is not apparent at the time of the initial development that a subset of the work will be necessary on an older branch. I think that in such cases we rely on clear commit messages to explain the situation. While I am inclined to agree with you that it may have been clearer to have used the
Peter
| Comment by Gerrit Updater [ 26/Nov/14 ] |
|
Bobi Jam (bobijam@gmail.com) uploaded a new patch: http://review.whamcloud.com/12859 |
| Comment by Gerrit Updater [ 04/Dec/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12859/ |
| Comment by Gerrit Updater [ 04/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12362/ |
| Comment by Gerrit Updater [ 11/Jun/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12603/ |
| Comment by Cory Spitz [ 08/Jul/15 ] |
|
With the landing of http://review.whamcloud.com/12603, it seems that this ticket should be resolved. |
| Comment by Peter Jones [ 08/Jul/15 ] |
|
Cory, there are still a couple of patches tracked under this ticket - http://review.whamcloud.com/#/c/12819/ and http://review.whamcloud.com/#/c/12820/1. Can these two patches be abandoned?
Peter |
| Comment by Cory Spitz [ 15/Jul/15 ] |
|
Peter, we think so, but we should ask Vitaly to be sure he agrees. |
| Comment by Vitaly Fertman [ 20/Jul/15 ] |
|
Those two patches turned out not to be enough, as even with the new readahead, pages could still be under a lock sitting in the LRU. Another patch would be needed in addition to these two - an rbtree for read extents. Such a patch would be over 1k LOC, and its benefits over the current fix are not clear. Thus, this ticket can be closed. Thanks! |
| Comment by Peter Jones [ 20/Jul/15 ] |
|
Thanks, Vitaly. I have closed this ticket; any further work can be tracked under a new ticket.
| Comment by Jay Lan (Inactive) [ 29/Jul/15 ] |
|
Is there a b2_5 port of http://review.whamcloud.com/12603 ? |
| Comment by Gerrit Updater [ 30/Jul/15 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/15800 |