[LU-6624] LBUG in osc_lru_reclaim Created: 21/May/15  Updated: 05/Jun/15  Resolved: 05/Jun/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Hiroya Nozaki Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Judging from the existing code, cl_client_cache->ccc_lru must only be accessed with ccc_lru_lock held, but the code below appears to violate that rule.

osc_lru_reclaim
long osc_lru_reclaim(struct client_obd *cli)
{
        struct cl_env_nest nest;
        struct lu_env *env;
        struct cl_client_cache *cache = cli->cl_cache;
        long rc = 0;
        int max_scans;
        ENTRY;

        LASSERT(cache != NULL);
        LASSERT(!list_empty(&cache->ccc_lru)); <--- HERE

        .....

        spin_lock(&cache->ccc_lru_lock);
                                  <---- The LASSERT should be here, shouldn't it?
        cache->ccc_lru_shrinkers++;

        ....

In practice, I occasionally hit an LBUG in osc_lru_reclaim when running multiple WRITEs at the same time. I am therefore convinced that this LASSERT should be moved into the locked section; otherwise it can read ccc_lru while another thread is performing a linked-list operation on it.
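The ordering proposed above can be sketched in userspace C. This is only an illustration, not the Lustre patch: struct lru_cache, lru_cache_init(), lru_cache_add() and lru_reclaim_begin() are hypothetical names, a pthread mutex stands in for the kernel spinlock ccc_lru_lock, and the function returns false where the kernel code would trip the LASSERT.

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical model of the fields of struct cl_client_cache used here. */
struct lru_cache {
	pthread_mutex_t  lru_lock;      /* models ccc_lru_lock */
	struct list_head lru;           /* models ccc_lru */
	int              lru_shrinkers; /* models ccc_lru_shrinkers */
};

static void lru_cache_init(struct lru_cache *c)
{
	pthread_mutex_init(&c->lru_lock, NULL);
	c->lru.next = &c->lru;
	c->lru.prev = &c->lru;
	c->lru_shrinkers = 0;
}

/* Append an entry to the LRU list with the lock held. */
static void lru_cache_add(struct lru_cache *c, struct list_head *e)
{
	pthread_mutex_lock(&c->lru_lock);
	e->prev = c->lru.prev;
	e->next = &c->lru;
	c->lru.prev->next = e;
	c->lru.prev = e;
	pthread_mutex_unlock(&c->lru_lock);
}

/*
 * Begin a reclaim pass. The emptiness check runs only while lru_lock is
 * held, so it cannot observe the list in the middle of another thread's
 * update. Returns false (instead of asserting) when the list is empty.
 */
static bool lru_reclaim_begin(struct lru_cache *c)
{
	bool populated;

	pthread_mutex_lock(&c->lru_lock);
	populated = (c->lru.next != &c->lru);
	if (populated)
		c->lru_shrinkers++;
	pthread_mutex_unlock(&c->lru_lock);
	return populated;
}
```

Because every writer in this sketch also takes lru_lock, a reader that checks the list under the same lock sees it either before or after an update, never in between.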



 Comments   
Comment by Hiroya Nozaki [ 21/May/15 ]

I'll upload a trivial patch soon.

Comment by Gerrit Updater [ 21/May/15 ]

Hiroya Nozaki (nozaki.hiroya@jp.fujitsu.com) uploaded a new patch: http://review.whamcloud.com/14901
Subject: LU-6624 osc: LBUG in osc_lru_reclaim
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3782cd123a74302c22735db5f9c5cafda625280e

Comment by Jinshan Xiong (Inactive) [ 21/May/15 ]

cache->ccc_lru is the LRU list of all OSCs. If osc_lru_reclaim() is being called, at least one OSC exists, so this list shouldn't be empty.

Can you post the backtrace to this ticket when you see it next time?

Comment by Hiroya Nozaki [ 22/May/15 ]

OK, I'll post the backtrace when this case is reproduced.
Btw, IMHO, if a cl_lru_osc entry is in the middle of a list_move_tail() operation, ccc_lru can be temporarily empty, can't it?

Comment by Jinshan Xiong (Inactive) [ 22/May/15 ]

Good point. If there is only one OSC, the list could be empty temporarily.
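The transient state discussed in the two comments above can be reproduced with a userspace model of the list primitives. The Linux list_move_tail() unlinks the entry and then re-adds it at the tail; the sketch below performs the same two steps and samples the list head between them. All names here (init_list_head, list_is_empty, list_unlink, list_append, move_tail_observed_empty) are hypothetical stand-ins, not kernel or Lustre functions.

```c
#include <stdbool.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Step 1 of a move: take the entry out of whatever list it is on. */
static void list_unlink(struct list_head *e)
{
	e->next->prev = e->prev;
	e->prev->next = e->next;
}

/* Step 2 of a move: link the entry in just before the head (the tail). */
static void list_append(struct list_head *e, struct list_head *head)
{
	e->prev = head->prev;
	e->next = head;
	head->prev->next = e;
	head->prev = e;
}

/*
 * Move 'e' to the tail of 'head' in two steps and report whether 'head'
 * looked empty between them. An unlocked reader running at that moment
 * would see the same thing.
 */
static bool move_tail_observed_empty(struct list_head *e,
				     struct list_head *head)
{
	bool empty_midway;

	list_unlink(e);
	empty_midway = list_is_empty(head);
	list_append(e, head);
	return empty_midway;
}
```

With a single entry on the list the intermediate state is an empty list, which is the window in which an unlocked emptiness assertion could fire. With two or more entries the list never appears empty during the move.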

Comment by Gerrit Updater [ 05/Jun/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14901/
Subject: LU-6624 osc: LBUG in osc_lru_reclaim
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1fee634f1ebfeccb1770951ca7b576f8b6e733a0

Comment by Peter Jones [ 05/Jun/15 ]

Landed for 2.8
