[LU-5781] endless loop in osc_lock_weight() Created: 21/Oct/14  Updated: 01/Jul/16  Resolved: 20/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.5.1, Lustre 2.7.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Alexander Zarochentsev Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: p4b, patch

Issue Links:
Related
is related to LU-5787 ptlrpcd_rcv loop in osc_ldlm_weigh_ast Reopened
Severity: 3
Rank (Obsolete): 16220

 Description   

In our testing a Lustre client hangs forever after OST failover. The client had 64 GB of RAM and was running a parallel file test application over a set of 1 GB files. Right before the complete hang, the ptlrpcd_rcv thread was eating 100% CPU.

A crashdump was taken. It shows the following stack trace for ptlrpcd_rcv:

crash> foreach ptlrpcd_rcv bt
PID: 17113  TASK: ffff8806323f8ae0  CPU: 22  COMMAND: "ptlrpcd_rcv"
 #0 [ffff880655547e90] crash_nmi_callback at ffffffff8102fee6
 #1 [ffff880655547ea0] notifier_call_chain at ffffffff8152a965
 #2 [ffff880655547ee0] atomic_notifier_call_chain at ffffffff8152a9ca
 #3 [ffff880655547ef0] notify_die at ffffffff810a12be
 #4 [ffff880655547f20] do_nmi at ffffffff8152862b
 #5 [ffff880655547f50] nmi at ffffffff81527ef0
    [exception RIP: cl_page_put+426]
    RIP: ffffffffa05f428a  RSP: ffff8805d85d9990  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffff88047c921800  RCX: ffff8805d85d9ab8
    RDX: 0000000000000000  RSI: ffff88047c921800  RDI: ffff880bc515ee60
    RBP: ffff8805d85d99d0   R8: 0000000000000040   R9: ffff8805d85d99a0
    R10: 0000000000009500  R11: ffff880303aac8f0  R12: ffff880bc515ee60
    R13: 000000000000004a  R14: ffff880bc51d2e90  R15: ffff880bc515ee60
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff8805d85d9990] cl_page_put at ffffffffa05f428a [obdclass]
 #7 [ffff8805d85d99d8] osc_page_gang_lookup at ffffffffa0ce3fac [osc]
 #8 [ffff8805d85d9a78] osc_ldlm_weigh_ast at ffffffffa0cdb1b7 [osc]
 #9 [ffff8805d85d9af8] osc_cancel_weight at ffffffffa0cc1a3b [osc]
#10 [ffff8805d85d9b08] ldlm_cancel_no_wait_policy at ffffffffa07ecec1 [ptlrpc]
#11 [ffff8805d85d9b28] ldlm_prepare_lru_list at ffffffffa07f0f0b [ptlrpc]
#12 [ffff8805d85d9ba8] ldlm_cancel_lru_local at ffffffffa07f1324 [ptlrpc]
#13 [ffff8805d85d9bc8] ldlm_replay_locks at ffffffffa07f14ac [ptlrpc]
#14 [ffff8805d85d9c48] ptlrpc_import_recovery_state_machine at ffffffffa083cb67 [ptlrpc]
#15 [ffff8805d85d9ca8] ptlrpc_replay_interpret at ffffffffa0811a84 [ptlrpc]
#16 [ffff8805d85d9cd8] ptlrpc_check_set at ffffffffa08130ec [ptlrpc]
#17 [ffff8805d85d9d78] ptlrpcd_check at ffffffffa08404db [ptlrpc]
#18 [ffff8805d85d9dd8] ptlrpcd at ffffffffa0840b2b [ptlrpc]
#19 [ffff8805d85d9ee8] kthread at ffffffff8109ac66
#20 [ffff8805d85d9f48] kernel_thread at ffffffff8100c20a
crash> 

The hang is caused by an endless loop in osc_lock_weight():
when CLP_GANG_RESCHED is returned from osc_page_gang_lookup(), the page scan is restarted from the very beginning. On the next round CLP_GANG_RESCHED is likely to be returned again if the number of scanned pages is large enough.

This is confirmed by setting ldlm.cancel_unused_locks_before_replay=0 on the client before the OST failover, which cures the hang.

Moreover, I have a patch that makes osc_lock_weight() restart from the last processed page index + 1, so an endless loop is impossible. That patch helps too.
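A minimal user-space sketch of that restart strategy follows (an illustration only, not the actual patch; gang_lookup_pages(), GANG_RESCHED and BATCH are hypothetical stand-ins for osc_page_gang_lookup(), CLP_GANG_RESCHED and the real batch size):

#include <stdio.h>

enum gang_rc { GANG_OKAY, GANG_RESCHED };

#define BATCH 16

/* Hypothetical stand-in for osc_page_gang_lookup(): "processes" up to
 * BATCH pages starting at *start, advances *start past the last page it
 * handled, and asks the caller to reschedule after each batch. */
static enum gang_rc gang_lookup_pages(unsigned long *start, unsigned long end)
{
        unsigned long idx = *start;
        unsigned long n = 0;

        while (idx <= end && n < BATCH) {
                idx++;          /* pretend to weigh and put one page */
                n++;
        }
        *start = idx;
        return idx <= end ? GANG_RESCHED : GANG_OKAY;
}

static void scan_lock_pages(unsigned long start, unsigned long end)
{
        unsigned long idx = start;

        /* Fixed loop shape: on RESCHED, resume from the updated idx
         * (last processed page + 1) instead of resetting to the original
         * start, so the scan always makes forward progress.  The buggy
         * code effectively reset idx to start here, which can loop
         * forever when the page set is large. */
        while (gang_lookup_pages(&idx, end) == GANG_RESCHED)
                ;       /* a kernel version would cond_resched() here */
}

int main(void)
{
        scan_lock_pages(0, 1000);
        printf("scan finished\n");
        return 0;
}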



 Comments   
Comment by Alexander Zarochentsev [ 21/Oct/14 ]

patch http://review.whamcloud.com/12362

Comment by Andreas Dilger [ 21/Oct/14 ]

It looks like this code was introduced in http://review.whamcloud.com/7894 (commit 154fb1f7ce22).

Comment by Alexey Lyashkov [ 22/Oct/14 ]

It's not the only problem in that area.
Per this commit:

commit 30d65b1f34d77742298e92247f71e7f38c292676
Author: Jinshan Xiong <jinshan.xiong@intel.com>
Date:   Thu Feb 6 23:03:25 2014 -0800

    LU-4300 ldlm: ELC picks locks in a safer policy

The OSC is able to try to cancel a WRITE lock during replay, but we are free to dirty any page under the lock between the check and the cancel, because the check releases the per-resource lock. This introduces the deadlock from bz=16774: when we try to flush a lock with dirty data, it deadlocks because the RPC cannot be sent while the IMPORT state isn't FULL.

As an additional note, that patch doesn't solve the original problem from LU-4300, as a dirty page may be introduced at any time.

An additional bug in that area was also introduced by Jay in this commit:

commit 9fe0aef3c7a87071d69d082bd24824bdcdf3f18c
Author: Jinshan Xiong <jinshan.xiong@intel.com>
Date:   Tue Nov 5 20:42:40 2013 -0800

    LU-3321 osc: add weight function for DLM lock

Changing the simple atomic check to a gang lookup introduces large performance problems if we have a read lock and many pages in client memory. These problems affect ELC cancelling as well.

Comment by Jinshan Xiong (Inactive) [ 22/Oct/14 ]

Indeed, there is a race window in which this lock can be picked up for IO. We can solve the problem by marking the lock as CBPENDING once osc_cancel_weight() decides to cancel it. This problem has existed for a long time, even before my patch landed.

The atomic counter was simple, but that approach was wrong. This problem can be solved by tagging the radix tree with active IO, or we can extend OSC write extents to IO extents, based on the fact that READ and WRITE extents can never overlap. I'm working with Vitaly on a solution.
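As an illustration of the decide-and-mark idea, here is a user-space model (an assumption about the shape of such a fix, not Lustre code): a pthread mutex stands in for the resource lock and a plain flag stands in for LDLM_FL_CBPENDING; the point is that the weight check and the marking happen under the same lock the IO path takes.

#include <pthread.h>
#include <stdbool.h>

struct fake_lock {
        pthread_mutex_t res_lock;   /* stand-in for the resource spinlock */
        bool            cbpending;  /* stand-in for LDLM_FL_CBPENDING */
        unsigned long   in_use;     /* IOs/pages currently under the lock */
};

/* LRU policy side: decide and mark atomically.  Because cbpending is set
 * before res_lock is dropped, no IO can match the lock between the weight
 * check and the actual cancel. */
static bool cancel_weight_and_mark(struct fake_lock *lk)
{
        bool cancel;

        pthread_mutex_lock(&lk->res_lock);
        cancel = (lk->in_use == 0);     /* "unused" lock, cheap to drop */
        if (cancel)
                lk->cbpending = true;
        pthread_mutex_unlock(&lk->res_lock);

        return cancel;
}

/* IO side: must check the flag under the same lock before matching. */
static bool try_match_for_io(struct fake_lock *lk)
{
        bool ok;

        pthread_mutex_lock(&lk->res_lock);
        ok = !lk->cbpending;
        if (ok)
                lk->in_use++;
        pthread_mutex_unlock(&lk->res_lock);

        return ok;
}

With this shape, try_match_for_io() either sees the flag and skips the lock, or bumps in_use before the policy can declare the lock unused, which closes the window described above.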

Comment by Jinshan Xiong (Inactive) [ 22/Oct/14 ]

Another good way to avoid the race is to check l_last_used after pf() returns and make sure it still matches the value it had before calling pf().
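A hypothetical sketch of that check (only the l_last_used field name is borrowed from the real lock structure): sample the last-used time, run the expensive weight callback unlocked, and only cancel if the value is unchanged afterwards.

#include <stdbool.h>
#include <time.h>

struct demo_lock {
        time_t l_last_used;     /* bumped whenever the lock is matched */
};

typedef bool (*weigh_fn)(struct demo_lock *lock);

static bool cancel_if_still_unused(struct demo_lock *lock, weigh_fn pf)
{
        time_t seen = lock->l_last_used;

        if (!pf(lock))                  /* expensive page scan, unlocked */
                return false;           /* pages found: keep the lock */

        if (lock->l_last_used != seen)  /* matched while pf() was running */
                return false;

        return true;                    /* still unused: safe to cancel */
}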

Comment by Alexey Lyashkov [ 23/Oct/14 ]

Absolutely. That race didn't exist before your patch, because we didn't try to check write locks.
Let me remind you of your change:

-static int osc_cancel_for_recovery(struct ldlm_lock *lock)
+static int osc_cancel_weight(struct ldlm_lock *lock)
 {
        /*
-        * Cancel all unused extent lock in granted mode LCK_PR or LCK_CR.
-        *
-        * XXX as a future improvement, we can also cancel unused write lock
-        * if it doesn't have dirty data and active mmaps.
+        * Cancel all unused and granted extent lock.
         */
        if (lock->l_resource->lr_type == LDLM_EXTENT &&
-           (lock->l_granted_mode == LCK_PR || lock->l_granted_mode == LCK_CR)&&
+           lock->l_granted_mode == lock->l_req_mode &&
            osc_ldlm_weigh_ast(lock) == 0)
                RETURN(1);

So before that change we didn't need to know about any dirty pages, and PW locks were skipped from cancellation during recovery.
But that change introduced a bug.

The same goes for the performance problem: I don't think it's a good idea to scan all pages in memory on each "aged" policy step.

@@ -1544,12 +1548,17 @@ static ldlm_policy_res_t ldlm_cancel_aged_policy(struct ldlm_namespace *ns,
                                                 int unused, int added,
                                                 int count)
 {
-       /* Stop LRU processing if young lock is found and we reach past count */
-        return ((added >= count) &&
-                cfs_time_before(cfs_time_current(),
-                                cfs_time_add(lock->l_last_used,
-                                             ns->ns_max_age))) ?
-                LDLM_POLICY_KEEP_LOCK : LDLM_POLICY_CANCEL_LOCK;
+        if (added >= count)
+               return LDLM_POLICY_KEEP_LOCK;
+
+       if (cfs_time_before(cfs_time_current(),
+                           cfs_time_add(lock->l_last_used, ns->ns_max_age)))
+                return LDLM_POLICY_KEEP_LOCK;
+
+        if (ns->ns_cancel != NULL && ns->ns_cancel(lock) == 0)
+                return LDLM_POLICY_KEEP_LOCK;
+
+       return LDLM_POLICY_CANCEL_LOCK;

ns->ns_cancel() now scans pages in memory as it loops over aged locks. I don't think the client has enough CPU power to do that over and over. A simple example: a large lock with a few pages at the end and a large IO queue. Another example: a large file opened for read with only a few pages changed; if that amount is below the system's dirty-page limit, pdflush does not start flushing pages to disk, and since it is also below the OSC dirty limit it is not forced to flush either, yet you will scan it on every LRU loop.

Comment by Jinshan Xiong (Inactive) [ 23/Oct/14 ]

So before that change we didn't need to know about any dirty pages, and PW locks were skipped from cancellation during recovery.
But that change introduced a bug.

Indeed, you're right about this.

Comment by Jinshan Xiong (Inactive) [ 06/Nov/14 ]

I just pushed patch http://review.whamcloud.com/12603 to solve the race condition in the LRU policy check.

Comment by Vitaly Fertman [ 06/Nov/14 ]

I think the proper solution for the ticket would be the following:

1. I was told that the CLIO simplification is going to land within several weeks, and that this new code changes readahead (RA) so that it takes the lock out of the LRU. As a result there is no need to check read-only pages under the lock: the lock will not be in the LRU until RA completion.

2. The check for write pages should be replaced by an osc_extent rbtree search (see the sketch after this list).

3. Once done, the check will be fast and the policy call can be placed under the spinlock.

I have patches for 2 and 3, but they depend on 1; once it lands, I can submit them.

4. The original patch from Zam will not be needed for lock_weight; however, we may run
into the same issue in other gang_lookup calls. For instance, in check_and_discard_cb() we may have
largely overlapping locks, and many pages may be transferred from one lock to the other, so the fix should be applied to other places as well.
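To illustrate point 2: because write extents under one object never overlap, they can be kept sorted by start offset, and a logarithmic search decides whether a lock range covers any dirty extent without walking every cached page. The toy model below uses a sorted array instead of the osc_extent rbtree to stay self-contained; all names are hypothetical.

#include <stdbool.h>
#include <stddef.h>

struct w_extent {
        unsigned long start;
        unsigned long end;      /* inclusive */
};

/* extents[] is sorted by start and pairwise non-overlapping */
static bool lock_covers_dirty_extent(const struct w_extent *extents, size_t n,
                                     unsigned long lock_start,
                                     unsigned long lock_end)
{
        size_t lo = 0, hi = n;

        /* binary search: count extents whose start is <= lock_end */
        while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;

                if (extents[mid].start <= lock_end)
                        lo = mid + 1;
                else
                        hi = mid;
        }
        if (lo == 0)
                return false;   /* every extent starts after the lock */

        /* the candidate overlaps iff it reaches back into the lock range */
        return extents[lo - 1].end >= lock_start;
}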

Comment by Jinshan Xiong (Inactive) [ 07/Nov/14 ]

1. I was told that the CLIO simplification is going to land within several weeks, and that this new code changes readahead (RA) so that it takes the lock out of the LRU. As a result there is no need to check read-only pages under the lock: the lock will not be in the LRU until RA completion.

This is patch http://review.whamcloud.com/10859

4. The original patch from Zam will not be needed for lock_weight; however, we may run
into the same issue in other gang_lookup calls. For instance, in check_and_discard_cb() we may have
largely overlapping locks, and many pages may be transferred from one lock to the other, so the fix should be applied to other places as well.

We may still need patches 12603 and 12362 for 2.7 - these two patches are short and straightforward, so they are easier to land. You can do your work based on patches 10859, 10362, and 12603. Does this make sense?

Comment by Antoine Percher [ 07/Nov/14 ]

Do you have a plan to add these patches to Lustre 2.5.4?

Comment by Vitaly Fertman [ 07/Nov/14 ]

We may still need patches 12603 and 12362 for 2.7 - these two patches are short and straightforward, so they are easier to land. You can do your work based on patches 10859, 10362, and 12603. Does this make sense?

Yes, I can do it on top of 10859, but in that case 10362 will be completely removed and only (4) will be needed, which does not exist in that patch currently. 12603 is a slightly separate question, but that patch is still racy and I would prefer to put the pf() call under the spinlock - the lock state check and the removal from the LRU should be done atomically to avoid the race, but pf() must be fast for that.

Comment by Sebastien Buisson (Inactive) [ 20/Nov/14 ]

Hi,

The solution for b2_5 is unclear. Should we backport patch 12603, or backport patch 10859?

TIA,
Sebastien.

Comment by Gerrit Updater [ 21/Nov/14 ]

Vitaly Fertman (vitaly_fertman@xyratex.com) uploaded a new patch: http://review.whamcloud.com/12819
Subject: LU-5781 osc: check for extents under lock for lru cancel
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: aea95ff5ff3cb7727e5e19848b57ee80276e29d6

Comment by Gerrit Updater [ 21/Nov/14 ]

Vitaly Fertman (vitaly_fertman@xyratex.com) uploaded a new patch: http://review.whamcloud.com/12820
Subject: LU-5781 ldlm: call LRU policy and set PENDING flag atomically
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e6cc01871276a92e13b905cf0f1a67bdb460848e

Comment by Vitaly Fertman [ 21/Nov/14 ]

The solution for b2_5 is unclear. Should we backport patch 12603, or backport patch 10859?

I am not sure that backporting 10859 would be a good solution, as this is a new "feature". Thus, we decided simply to revert "LU-3321 osc: add weight function for DLM lock" for our branches, i.e. get RA page refcounting back there, and add the two patches I have submitted on top of that revert. It looks faster and more reliable - enough for the old maintenance branches.

If there are no objections, I can submit patches for 2.5 as well, probably after the first round of inspections.

Comment by Peter Jones [ 22/Nov/14 ]

Sebastien

It is probably less confusing to take your support needs on b2_5 back to the original ticket, so as not to conflate those with the approach needed on master.

Vitaly

Thanks for the suggestion but b2_5 does not include the LU-3321 patches so the approach may be different. I agree that it makes sense to wait for some reviews on your latest master patch before back porting it to any maintenance branches.

Peter

Comment by Vitaly Fertman [ 24/Nov/14 ]

Peter, it is better not to mix patches (neither joining nor splitting them) when landing to other branches, but to port them as they originally landed, preferably under the same ticket numbers. Otherwise it becomes confusing later to tell what has landed or is missing on which branches, as happened here. Thus, LU-3321 landed to b2_5 as 0a6c6fc "LU-4786 osc: to not pick busy pages for ELC".

Comment by Peter Jones [ 24/Nov/14 ]

Vitaly

I agree that such practices are confusing and it would be much simpler to have 1:1 tracking for each ticket and each code change. However, I do not think that we can completely eliminate such confusion - sometimes it is not apparent at the time of the initial development that a subset of the work will be necessary on an older branch. I think that then we rely on clear commit messages to explain the situation. While I am inclined to agree with you that it may have been clearer to have used the LU-3321 ticket number to land the change in LU-4786, I suspect that this would have caused a different kind of confusion, as people may have misinterpreted this fragment of the work landing as the complete work landing.

Peter

Comment by Gerrit Updater [ 26/Nov/14 ]

Bobi Jam (bobijam@gmail.com) uploaded a new patch: http://review.whamcloud.com/12859
Subject: LU-5781 osc: osc_lock_weight endless loop fix
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 12f4a0229593cab7183e4dbbe4cf39882a8ba5af

Comment by Gerrit Updater [ 04/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12859/
Subject: LU-5781 osc: osc_lock_weight endless loop fix
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: ef1758a0d3cb9ac3abbe0e60ac689cf3b2aa3a6e

Comment by Gerrit Updater [ 04/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12362/
Subject: LU-5781 osc: osc_lock_weight endless loop fix
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 74dfa6c3f6111750c773e2484b65302026af6a53

Comment by Gerrit Updater [ 11/Jun/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12603/
Subject: LU-5781 ldlm: Solve a race for LRU lock cancel
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8ed3105b261fcc0816b064d6308356f645c9e12b

Comment by Cory Spitz [ 08/Jul/15 ]

With the landing of http://review.whamcloud.com/12603, it seems that this ticket should be resolved.

Comment by Peter Jones [ 08/Jul/15 ]

Cory

There are still a couple of patches tracked under this ticket - http://review.whamcloud.com/#/c/12819/ and http://review.whamcloud.com/#/c/12820/1. Can these two patches be abandoned?

Peter

Comment by Cory Spitz [ 15/Jul/15 ]

Peter, we think so, but we should ask Vitaly to be sure he agrees.

Comment by Vitaly Fertman [ 20/Jul/15 ]

Those two patches turned out not to be enough, as even with the new RA, pages could still be under a lock sitting in the LRU. Another patch would be needed in addition to these two - an rb tree for read extents. Such a patch would be over 1k LOC and its benefits over the current fix are not clear. Thus, this ticket can be closed. Thanks!

Comment by Peter Jones [ 20/Jul/15 ]

Thanks Vitaly. I have closed this ticket and any further work can be tracked under a new ticket.

Comment by Jay Lan (Inactive) [ 29/Jul/15 ]

Is there a b2_5 port of http://review.whamcloud.com/12603?
I get a conflict in lustre/ldlm/ldlm_request.c.

Comment by Gerrit Updater [ 30/Jul/15 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/15800
Subject: LU-5781 ldlm: Solve a race for LRU lock cancel
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: fa78e0ef004d4569fb5e5bc08ecbe3580b42f1e1
