[LU-2654] ll_imp_inval cpu lockup Created: 19/Jan/13 Updated: 03/Feb/13 Resolved: 03/Feb/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LB | ||
| Severity: | 3 |
| Rank (Obsolete): | 6196 |
| Description |
|
Just had a cpu lockup in import invalidation thread.
PID: 31247 TASK: ffff880080938300 CPU: 1 COMMAND: "ll_imp_inval"
 #0 [ffff8800456edcf0] schedule at ffffffff814f7c98
 #1 [ffff8800456edcf8] cfs_hash_for_each_relax at ffffffffa0db07af [libcfs]
 #2 [ffff8800456edd88] cfs_hash_for_each_nolock at ffffffffa0db217f [libcfs]
 #3 [ffff8800456eddb8] ldlm_namespace_cleanup at ffffffffa10a8870 [ptlrpc]
 #4 [ffff8800456edde8] osc_import_event at ffffffffa046ce96 [osc]
 #5 [ffff8800456ede78] ptlrpc_invalidate_import at ffffffffa110b93f [ptlrpc]
 #6 [ffff8800456edf28] ptlrpc_invalidate_import_thread at ffffffffa110bfaf [ptlrpc]
 #7 [ffff8800456edf48] kernel_thread at ffffffff8100c14a
Crashdump in /exports/crasdumps/t/imp-lockup.dmp
modules in /exports/crashdumps/192.168.10.210-2013-01-18-21:37:33/modules |
| Comments |
| Comment by Alexey Lyashkov [ 19/Jan/13 ] |
|
That may not be a lockup; it may just be trying to cancel too many locks... |
| Comment by Oleg Drokin [ 21/Jan/13 ] |
|
It has been in this state for over 12 hours, so I think it's a lockup; I don't think I have that many locks. |
| Comment by Oleg Drokin [ 01/Feb/13 ] |
|
Ok, this bug really frustrates my testing, so I dug into it for more data. There's certainly some sort of a corrupted list happening.
Feb 1 18:52:12 centos6-12 kernel: [80511.281770] LustreError: 4466:0:(ldlm_resource.c:1408:ldlm_resource_dump()) ### ### ns: ?? lock: ffff88008f76fdb0/0xffff88008f76fdb0 lrc: 0/0,0 mode: --/PW res: ?? rrc=?? type: ??? flags: 0x96f400000000 nid: local remote: 0xc6d634d7459f6287 expref: -99 pid: 4450 timeout: 0 lvb_type: 1
Feb 1 18:52:12 centos6-12 kernel: [80511.282764] LustreError: 4466:0:(ldlm_resource.c:1408:ldlm_resource_dump()) Skipped 9092257 previous similar messages
So digging a bit more, I found that we are stuck within this piece of code in ldlm_resource_dump:
        cfs_list_for_each_entry_reverse(lock, &res->lr_granted,
                                        l_res_link) {
                LDLM_DEBUG_LIMIT(level, lock, "###");
                if (!(level & D_CANTMASK) &&
                    ++granted > ldlm_dump_granted_max) {
                        CDEBUG(level, "only dump %d granted locks to "
                               "avoid DDOS.\n", granted);
                        break;
                }
        }
Sure enough, printing current lock's res_link, we see:
(gdb) p lock->l_res_link
$53 = {next = 0xffff88008f76fe38, prev = 0xffff88008f76fe38}
(gdb) p lock
$54 = (struct ldlm_lock *) 0xffff88008f76fdb0
(gdb) p &lock->l_res_link
$55 = (cfs_list_t *) 0xffff88008f76fe38
Both next and prev point back at &lock->l_res_link itself, so the entry is no longer linked into the resource's list and the reverse walk can never get back to the list head. So the lock was removed while we were printing its content. Step one is to convert to safe list traversal to avoid that. |
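To illustrate the failure mode described in the comment above, here is a minimal user-space sketch, not Lustre code. The `struct node` type and the `node_*` and `walk_reverse_with_removal` helpers are hypothetical stand-ins for `cfs_list_t` and the kernel-style list macros. It assumes the entry is removed with a "delete and re-initialise" operation, which is what leaves it pointing at itself as in the gdb output.

```c
#include <stddef.h>

/* Hypothetical stand-in for a circular doubly-linked list entry. */
struct node {
    struct node *next, *prev;
};

static void node_init(struct node *n)
{
    n->next = n;
    n->prev = n;
}

static void node_add_tail(struct node *n, struct node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink the entry and re-initialise it, so it ends up linked to itself. */
static void node_del_init(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    node_init(n);
}

/*
 * Walk the list backwards from head, as a plain (non-safe) reverse iterator
 * would. When the walk reaches `victim`, the entry is unlinked in the middle
 * of the loop body, simulating a removal by another thread.
 * Returns the number of entries visited, or -1 if the walk failed to get
 * back to head within max_steps (the stand-in for "spins forever").
 */
static int walk_reverse_with_removal(struct node *head, struct node *victim,
                                     int max_steps)
{
    int steps = 0;
    struct node *pos;

    for (pos = head->prev; pos != head; pos = pos->prev) {
        if (++steps > max_steps)
            return -1;
        if (pos == victim)
            node_del_init(pos); /* pos->prev now points at pos itself */
    }
    return steps;
}
```

With three entries on the list, an undisturbed walk visits all three and stops. If the middle entry is removed while it is the current one, the next step follows `pos->prev` back to the same entry and the loop only ends because of the `max_steps` guard, which the real loop does not have.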
| Comment by Oleg Drokin [ 01/Feb/13 ] |
|
Patch is in http://review.whamcloud.com/5254 |
| Comment by Oleg Drokin [ 01/Feb/13 ] |
|
On second thought, I guess the proper way here is to ensure the resource is locked when iterating, as just using a safe iterator is racy. |
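The race mentioned above can be sketched in a single thread, under the assumption that a "safe" iterator simply caches the successor before running the loop body. All names here (`struct node`, `walk_safe`, `remove_self`, `remove_successor`) are hypothetical and only illustrate the idea; the callback stands in for whatever another thread could do to the list while the body runs.

```c
#include <stddef.h>

/* Hypothetical stand-in for a circular doubly-linked list entry. */
struct node {
    struct node *next, *prev;
};

static void node_init(struct node *n)
{
    n->next = n;
    n->prev = n;
}

static void node_add_tail(struct node *n, struct node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink the entry and re-initialise it, so it ends up linked to itself. */
static void node_del_init(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    node_init(n);
}

/* Callback run from the loop body; simulates interference by another thread. */
typedef void (*interfere_fn)(struct node *head, struct node *cur);

/*
 * "Safe" forward walk: the successor (tmp) is cached before the body runs,
 * so removing the current entry is tolerated.
 * Returns the number of entries visited, or -1 if the walk failed to reach
 * head within max_steps.
 */
static int walk_safe(struct node *head, interfere_fn interfere, int max_steps)
{
    int steps = 0;
    struct node *pos, *tmp;

    for (pos = head->next, tmp = pos->next; pos != head;
         pos = tmp, tmp = pos->next) {
        if (++steps > max_steps)
            return -1;
        if (interfere)
            interfere(head, pos);
    }
    return steps;
}

/* Removing the current entry: the case a safe iterator is designed for. */
static void remove_self(struct node *head, struct node *cur)
{
    (void)head;
    node_del_init(cur);
}

/* Removing the cached successor: the case a safe iterator cannot handle. */
static void remove_successor(struct node *head, struct node *cur)
{
    if (cur->next != head)
        node_del_init(cur->next);
}
```

In this sketch `walk_safe` copes with `remove_self`, but with `remove_successor` it steps onto an entry that is already unlinked and self-linked, and never gets back to the head. Nothing in the iterator prevents a concurrent thread from doing the latter, which is why excluding list changes for the whole walk (holding the resource's lock) is the more robust fix.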
| Comment by Oleg Drokin [ 03/Feb/13 ] |
|
Patch landed |