[LU-2272] BUG: spinlock cpu recursion on CPU#2, ll_sa_30972/30992 Created: 03/Nov/12 Updated: 11/Dec/15 Resolved: 22/Sep/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | Lai Siyao |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 5431 | ||||
| Description |
|
Running racer with a debugging kernel on master I hit this condition:
[ 8529.226634] BUG: spinlock cpu recursion on CPU#2, ll_sa_30972/30992 (Not tainted)
[ 8529.227125] lock: ffff88009cb38eb8, .magic: dead4ead, .owner: ptlrpcd_2/7457, .owner_cpu: 2
[ 8529.227595] Pid: 30992, comm: ll_sa_30972 Not tainted 2.6.32-debug #6
[ 8529.227874] Call Trace:
[ 8529.228084] [<ffffffff8128098a>] ? spin_bug+0xaa/0x100
[ 8529.228342] [<ffffffff81280ba1>] ? _raw_spin_lock+0x121/0x180
[ 8529.228610] [<ffffffff814fafde>] ? _spin_lock+0xe/0x10
[ 8529.228892] [<ffffffffa0d3192c>] ? do_statahead_interpret+0x4c/0xdd0 [lustre]
[ 8529.229426] [<ffffffff8109011c>] ? remove_wait_queue+0x3c/0x50
[ 8529.229718] [<ffffffffa0d3690a>] ? ll_statahead_thread+0xcda/0xf40 [lustre]
[ 8529.230008] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
[ 8529.230293] [<ffffffffa0d35c30>] ? ll_statahead_thread+0x0/0xf40 [lustre]
[ 8529.230577] [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[ 8529.230844] [<ffffffffa0d35c30>] ? ll_statahead_thread+0x0/0xf40 [lustre]
[ 8529.231142] [<ffffffffa0d35c30>] ? ll_statahead_thread+0x0/0xf40 [lustre]
[ 8529.231429] [<ffffffff8100c140>] ? child_rip+0x0/0x20 |
| Comments |
| Comment by Oleg Drokin [ 03/Nov/12 ] |
|
Seems to be related to statahead, so FanYong, can you please take a look? |
| Comment by nasf (Inactive) [ 04/Nov/12 ] |
|
OK, I will try to reproduce it myself and dump the log. |
| Comment by Oleg Drokin [ 11/Dec/12 ] |
|
I just had this hit again in my racer testing on very fresh master:
Dec 11 15:15:42 centos6-8 kernel: [226594.953228] BUG: spinlock cpu recursion on CPU#0, ll_sa_13669/13959 (Not tainted)
Dec 11 15:15:42 centos6-8 kernel: [226594.953702] lock: ffff880018c19eb8, .magic: dead4ead, .owner: ptlrpcd_2/26580, .owner_cpu: 0
Dec 11 15:15:42 centos6-8 kernel: [226594.954248] Pid: 13959, comm: ll_sa_13669 Not tainted 2.6.32-debug #6
Dec 11 15:15:42 centos6-8 kernel: [226594.954815] Call Trace:
Dec 11 15:15:42 centos6-8 kernel: [226594.955034] [<ffffffff8128098a>] ? spin_bug+0xaa/0x100
Dec 11 15:15:42 centos6-8 kernel: [226594.955285] [<ffffffff81280ba1>] ? _raw_spin_lock+0x121/0x180
Dec 11 15:15:42 centos6-8 kernel: [226594.955548] [<ffffffff814fafde>] ? _spin_lock+0xe/0x10
Dec 11 15:15:42 centos6-8 kernel: [226594.955851] [<ffffffffa0e368dc>] ? do_statahead_interpret+0x4c/0xdd0 [lustre]
Dec 11 15:15:42 centos6-8 kernel: [226594.957300] [<ffffffff8109011c>] ? remove_wait_queue+0x3c/0x50
Dec 11 15:15:42 centos6-8 kernel: [226594.957612] [<ffffffffa0e3b90a>] ? ll_statahead_thread+0xcda/0xf40 [lustre]
Dec 11 15:15:42 centos6-8 kernel: [226594.957977] [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
Dec 11 15:15:42 centos6-8 kernel: [226594.958317] [<ffffffffa0e3ac30>] ? ll_statahead_thread+0x0/0xf40 [lustre]
Dec 11 15:15:42 centos6-8 kernel: [226594.959502] [<ffffffff8100c140>] ? child_rip+0x0/0x20 |
| Comment by Oleg Drokin [ 21/May/13 ] |
|
Ok, I finally got to the root of it.
crash> bt 21485
PID: 21485 TASK: ffff8800b2906540 CPU: 7 COMMAND: "ptlrpcd_4"
#0 [ffff8800b098f930] schedule at ffffffff814fae3a
#1 [ffff8800b098f9f8] __cond_resched at ffffffff810644ea
#2 [ffff8800b098fa18] _cond_resched at ffffffff814fb840
#3 [ffff8800b098fa28] __kmalloc at ffffffff811686d0
#4 [ffff8800b098fa78] cfs_alloc at ffffffffa0b0cb90 [libcfs]
#5 [ffff8800b098faa8] ldlm_bl_to_thread at ffffffffa1268cc1 [ptlrpc]
#6 [ffff8800b098fbb8] ldlm_bl_to_thread_lock at ffffffffa1269219 [ptlrpc]
#7 [ffff8800b098fbc8] ldlm_lock_decref_internal at ffffffffa1246ccd [ptlrpc]
#8 [ffff8800b098fc28] ldlm_lock_decref at ffffffffa1247d69 [ptlrpc]
#9 [ffff8800b098fc58] ll_intent_drop_lock at ffffffffa07dac8d [lustre]
#10 [ffff8800b098fc88] ll_statahead_interpret at ffffffffa0838e66 [lustre]
#11 [ffff8800b098fce8] mdc_intent_getattr_async_interpret at ffffffffa0dea4a2 [mdc]
#12 [ffff8800b098fd68] ptlrpc_check_set at ffffffffa12804e2 [ptlrpc]
#13 [ffff8800b098fe08] ptlrpcd_check at ffffffffa12adc5b [ptlrpc]
#14 [ffff8800b098fe68] ptlrpcd at ffffffffa12ae1a3 [ptlrpc]
#15 [ffff8800b098ff48] kernel_thread at ffffffff8100c10a
While I mostly hit the warnings like in this ticket, right now I got a deadlock, and that's how I finally found this issue.
entry->se_minfo = minfo;
entry->se_req = ptlrpc_request_addref(req);
/* Release the async ibits lock ASAP to avoid deadlock
* when statahead thread tries to enqueue lock on parent
* for readpage and other tries to enqueue lock on child
* with parent's lock held, for example: unlink. */
entry->se_handle = it->d.lustre.it_lock_handle;
can sleep!!! => ll_intent_drop_lock(it);
wakeup = sa_received_empty(sai);
cfs_list_add_tail(&entry->se_list,
&sai->sai_entries_received);
}
sai->sai_replied++;
spin_unlock(&lli->lli_sa_lock);
|
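The pattern flagged above (a call that may sleep, made between taking and releasing lli_sa_lock) can be illustrated with a small single-threaded userspace sketch. This is only an illustration under stated assumptions: every demo_* name is hypothetical, the "spinlock" is a plain flag rather than a kernel spinlock, and the deferred variant shows one common way to restructure such code, not necessarily what the eventual Lustre patch does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names exist in the Lustre tree. */
struct demo_entry {
	unsigned long long handle;   /* plays the role of se_handle */
};

static bool demo_lock_held;      /* plays the role of lli_sa_lock being held */
static int demo_violations;      /* sleeping calls made while the lock was held */

static void demo_spin_lock(void)
{
	assert(!demo_lock_held);     /* taking it twice would spin forever */
	demo_lock_held = true;
}

static void demo_spin_unlock(void)
{
	assert(demo_lock_held);
	demo_lock_held = false;
}

/* Stand-in for a call that may allocate memory and therefore sleep. */
static void demo_drop_lock_may_sleep(unsigned long long handle)
{
	(void)handle;
	if (demo_lock_held)
		demo_violations++;   /* a real kernel could schedule away here */
}

/* Shape of the code quoted above: the sleeping call runs under the lock. */
static void demo_interpret_buggy(struct demo_entry *e, unsigned long long h)
{
	demo_spin_lock();
	e->handle = h;
	demo_drop_lock_may_sleep(h);
	demo_spin_unlock();
}

/* Alternative shape: record the handle under the lock, release the
 * lock, and only then make the call that may sleep. */
static void demo_interpret_deferred(struct demo_entry *e, unsigned long long h)
{
	demo_spin_lock();
	e->handle = h;
	demo_spin_unlock();
	demo_drop_lock_may_sleep(h);
}
```

Calling demo_interpret_buggy() increments demo_violations, while demo_interpret_deferred() leaves it unchanged. In the kernel, the equivalent of a violation is a thread that holds the spinlock being scheduled away, which is consistent with the traces above where the ll_sa thread finds the lock owned by a ptlrpcd thread on the same CPU.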
| Comment by Peter Jones [ 21/May/13 ] |
|
Lai, are you able to advise on this one? Thanks, Peter |
| Comment by Lai Siyao [ 22/May/13 ] |
|
The fix will modify the same code as the patch for |
| Comment by Lai Siyao [ 10/Sep/13 ] |
|
The above patch is abandoned, and the fix is included in http://review.whamcloud.com/#/c/6392/. |
| Comment by Peter Jones [ 22/Sep/13 ] |
|
Lai So can we duplicate this ticket into Peter |
| Comment by Lai Siyao [ 22/Sep/13 ] |
|
Yes, Peter. |
| Comment by Peter Jones [ 22/Sep/13 ] |
|
ok thanks! |
| Comment by Gerrit Updater [ 24/Feb/15 ] |
|
Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/13846 |