[LU-2654] ll_imp_inval cpu lockup Created: 19/Jan/13  Updated: 03/Feb/13  Resolved: 03/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: LB

Severity: 3
Rank (Obsolete): 6196

 Description   

Just had a CPU lockup in the import invalidation thread.

PID: 31247  TASK: ffff880080938300  CPU: 1   COMMAND: "ll_imp_inval"
 #0 [ffff8800456edcf0] schedule at ffffffff814f7c98
 #1 [ffff8800456edcf8] cfs_hash_for_each_relax at ffffffffa0db07af [libcfs]
 #2 [ffff8800456edd88] cfs_hash_for_each_nolock at ffffffffa0db217f [libcfs]
 #3 [ffff8800456eddb8] ldlm_namespace_cleanup at ffffffffa10a8870 [ptlrpc]
 #4 [ffff8800456edde8] osc_import_event at ffffffffa046ce96 [osc]
 #5 [ffff8800456ede78] ptlrpc_invalidate_import at ffffffffa110b93f [ptlrpc]
 #6 [ffff8800456edf28] ptlrpc_invalidate_import_thread at ffffffffa110bfaf [ptlrpc]
 #7 [ffff8800456edf48] kernel_thread at ffffffff8100c14a

Crashdump is in /exports/crasdumps/t/imp-lockup.dmp; modules are in /exports/crashdumps/192.168.10.210-2013-01-18-21:37:33/modules



 Comments   
Comment by Alexey Lyashkov [ 19/Jan/13 ]

That may not be a lockup; it may just be trying to cancel too many locks...

Comment by Oleg Drokin [ 21/Jan/13 ]

It has been in this state for over 12 hours, so I think it's a lockup. I don't think I have that many locks.

Comment by Oleg Drokin [ 01/Feb/13 ]

OK, this bug really frustrates my testing, so I dug into it for more data. There's certainly some sort of corrupted list involved.
What I get before the lockup looks like this:

Feb  1 18:52:12 centos6-12 kernel: [80511.281770] LustreError: 4466:0:(ldlm_resource.c:1408:ldlm_resource_dump()) ### ### ns: ?? lock: ffff88008f76fdb0/0xffff88008f76fdb0 lrc: 0/0,0 mode: --/PW res: ?? rrc=?? type: ??? flags: 0x96f400000000 nid: local remote: 0xc6d634d7459f6287 expref: -99 pid: 4450 timeout: 0 lvb_type: 1
Feb  1 18:52:12 centos6-12 kernel: [80511.282764] LustreError: 4466:0:(ldlm_resource.c:1408:ldlm_resource_dump()) Skipped 9092257 previous similar messages

So digging a bit more I found that we are stuck within this piece of code in ldlm_resource_dump:

                cfs_list_for_each_entry_reverse(lock, &res->lr_granted,
                                                l_res_link) {
                        LDLM_DEBUG_LIMIT(level, lock, "###");
                        if (!(level & D_CANTMASK) &&
                            ++granted > ldlm_dump_granted_max) {
                                CDEBUG(level, "only dump %d granted locks to "
                                       "avoid DDOS.\n", granted);
                                break;
                        }
                }

Sure enough, printing current lock's res_link, we see:

(gdb) p lock->l_res_link
$53 = {next = 0xffff88008f76fe38, prev = 0xffff88008f76fe38}
(gdb) p lock
$54 = (struct ldlm_lock *) 0xffff88008f76fdb0
(gdb) p &lock->l_res_link
$55 = (cfs_list_t *) 0xffff88008f76fe38

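Both next and prev point back at &lock->l_res_link itself, so the lock was removed from the list while we were printing its contents, and the loop keeps revisiting this self-linked entry.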

So step one is to convert to safe list traversal to avoid that.
Step two, for somebody: make sure the lock cannot disappear in the middle of printing.
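
To illustrate why a self-linked entry traps the iterator, here is a minimal userspace sketch. It is not the Lustre code: struct list_node, build3() and the bounded walkers are hypothetical stand-ins, and it assumes the lock was unlinked with something equivalent to list_del_init(), which is consistent with the gdb output above. The visit callback unlinks the node being visited to mimic removal during printing.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for a kernel-style circular doubly linked list node. */
struct list_node {
    struct list_node *next, *prev;
};

typedef void (*visit_fn)(struct list_node *);

static void list_init(struct list_node *n)
{
    n->next = n;
    n->prev = n;
}

static void list_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink n and point it back at itself (the state seen in gdb). */
static void list_del_init(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);
}

/* Hypothetical helper: (re)build head -> n[0] -> n[1] -> n[2]. */
static void build3(struct list_node *head, struct list_node n[3])
{
    list_init(head);
    for (int i = 0; i < 3; i++)
        list_add_tail(&n[i], head);
}

/* Plain reverse walk. Returns nodes visited, or -1 once max_steps is
 * exceeded (where the real loop would spin forever). */
static int walk_reverse_unsafe(struct list_node *head, visit_fn visit,
                               int max_steps)
{
    int steps = 0;

    assert(max_steps > 0);
    for (struct list_node *cur = head->prev; cur != head; cur = cur->prev) {
        if (steps++ == max_steps)
            return -1;
        visit(cur);
    }
    return steps;
}

/* "Safe" reverse walk: the predecessor is saved before visiting, so
 * unlinking the current node does not derail the iteration. */
static int walk_reverse_safe(struct list_node *head, visit_fn visit,
                             int max_steps)
{
    int steps = 0;
    struct list_node *cur, *saved;

    assert(max_steps > 0);
    for (cur = head->prev, saved = cur->prev; cur != head;
         cur = saved, saved = cur->prev) {
        if (steps++ == max_steps)
            return -1;
        visit(cur);
    }
    return steps;
}
```

Passing list_del_init as the visit callback makes the unsafe walk hit its step bound on the first node, while the safe walk visits all three nodes and leaves the list empty. Note that the safe variant only tolerates removal of the node currently being visited; it does nothing if the saved predecessor is removed by another thread.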

Comment by Oleg Drokin [ 01/Feb/13 ]

Patch is in http://review.whamcloud.com/5254

Comment by Oleg Drokin [ 01/Feb/13 ]

On second thought, I guess the proper way here is to ensure the resource is locked while iterating, as just using the safe iterator is still racy.
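
A rough userspace sketch of that idea, not the landed patch: struct resource, its guard mutex and the resource_* helpers are all hypothetical, and a pthread mutex stands in for whatever lock protects the real resource. The point is only that the dump loop and the unlink path take the same lock, so the list cannot change under the iterator.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct list_node {
    struct list_node *next, *prev;
};

/* Hypothetical stand-in for a resource and its granted-lock list. */
struct resource {
    pthread_mutex_t guard;      /* protects 'granted' */
    struct list_node granted;   /* circular list head */
};

static void resource_init(struct resource *res)
{
    pthread_mutex_init(&res->guard, NULL);
    res->granted.next = &res->granted;
    res->granted.prev = &res->granted;
}

/* Add an entry at the tail of the granted list, under the guard. */
static void resource_grant(struct resource *res, struct list_node *n)
{
    pthread_mutex_lock(&res->guard);
    n->prev = res->granted.prev;
    n->next = &res->granted;
    res->granted.prev->next = n;
    res->granted.prev = n;
    pthread_mutex_unlock(&res->guard);
}

/* Unlink an entry and self-link it, under the same guard as the dump. */
static void resource_unlink(struct resource *res, struct list_node *n)
{
    pthread_mutex_lock(&res->guard);
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n;
    n->prev = n;
    pthread_mutex_unlock(&res->guard);
}

/* Walk the granted list in reverse while holding the guard and return
 * how many entries were "dumped", stopping after max_dump entries. */
static int resource_dump(struct resource *res, int max_dump)
{
    int dumped = 0;

    assert(max_dump >= 0);
    pthread_mutex_lock(&res->guard);
    for (struct list_node *cur = res->granted.prev;
         cur != &res->granted; cur = cur->prev) {
        if (dumped == max_dump)
            break;
        dumped++;               /* real code would print the entry here */
    }
    pthread_mutex_unlock(&res->guard);
    return dumped;
}
```

With this discipline a plain (non-safe) iterator is enough, since no entry can be unlinked between two steps of the walk. The trade-off is that printing happens with the lock held, so the dump limit still matters.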

Comment by Oleg Drokin [ 03/Feb/13 ]

Patch landed
