[LU-4197] deadlock in recovery Created: 01/Nov/13 Updated: 15/Sep/15 Resolved: 15/Sep/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Alexey Lyashkov | Assignee: | Dmitry Eremin (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | patch | ||
| Environment: | RHEL6 |
| Severity: | 3 |
| Rank (Obsolete): | 11384 |
| Description |
|
Bug originally hit during Xyratex testing.

BUG: spinlock lockup on CPU#3, tgt_recov/8159, ffff880099c1ca90 (Tainted: G W ---------------- )
Pid: 8159, comm: tgt_recov Tainted: G W ---------------- 2.6.32-131.17.1-lustre #0
Call Trace:
[<ffffffff8128c2da>] ? _raw_spin_lock+0x16a/0x180
[<ffffffff81500ff6>] ? _spin_lock+0x56/0x70
[<ffffffffa056668a>] ? class_export_recovery_cleanup+0x3a/0x230 [obdclass]
[<ffffffffa03ea572>] ? cfs_hash_del+0xa2/0x1d0 [libcfs]
[<ffffffffa056668a>] ? class_export_recovery_cleanup+0x3a/0x230 [obdclass]
[<ffffffffa056875d>] ? class_disconnect+0x15d/0x3d0 [obdclass]
[<ffffffffa06bfd17>] ? server_disconnect_export+0x37/0x1a0 [ptlrpc]
[<ffffffffa0c9630f>] ? filter_disconnect+0xbf/0x380 [obdfilter]
[<ffffffffa056db97>] ? class_disconnect_export_list+0x347/0x680 [obdclass]
[<ffffffffa056e027>] ? class_disconnect_stale_exports+0x157/0x380 [obdclass]
[<ffffffffa06bc180>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
[<ffffffffa06bc490>] ? check_for_clients+0x0/0x80 [ptlrpc]
[<ffffffffa06bf04b>] ? target_recovery_overseer+0x15b/0x2d0 [ptlrpc]
[<ffffffffa06bc180>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
[<ffffffff81091a80>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa06c4b90>] ? target_recovery_thread+0x460/0x15d0 [ptlrpc]
[<ffffffff810563bd>] ? finish_task_switch+0x7d/0x110
[<ffffffffa06c4730>] ? target_recovery_thread+0x0/0x15d0 [ptlrpc]
[<ffffffff8100c2ca>] ? child_rip+0xa/0x20
[<ffffffff81500d50>] ? _spin_unlock_irq+0x30/0x40
[<ffffffff8100bc10>] ? restore_args+0x0/0x30
[<ffffffffa06c4730>] ? target_recovery_thread+0x0/0x15d0 [ptlrpc]
[<ffffffff8100c2c0>] ? child_rip+0x0/0x20

Investigating the bug turned up a backported commit that contains the following code:

cfs_spin_lock(&target->obd_recovery_task_lock);
if (target->obd_recovering && !export->exp_in_recovery &&
    !export->exp_disconnected) {
        cfs_spin_lock(&export->exp_lock);
        /* possible race with class_disconnect_stale_exports,
         * export may be already in the eviction process */
        if (export->exp_failed) {
                cfs_spin_unlock(&export->exp_lock);
                GOTO(out, rc = -ENODEV);
        }
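So if we race with class_disconnect_stale_exports, the GOTO(out, ...) path exits with obd_recovery_task_lock still held. The next thread that tries to take that lock spins forever, which kills recovery and takes the whole node down. |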
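The early-exit problem described above can be illustrated outside the kernel. The following is a minimal user-space sketch, not the patch from this ticket: it uses pthread mutexes in place of cfs_spin_lock, and every name prefixed with sketch_ is hypothetical. Setting in_recovery on the success path is also an assumption, since the quoted kernel snippet is truncated before that point. The idea shown is a single exit label that always releases the outer lock, so the -ENODEV path cannot leave it held.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the Lustre fields named in the report. */
struct sketch_target {
    pthread_mutex_t recovery_task_lock; /* plays the role of obd_recovery_task_lock */
    bool recovering;                    /* plays the role of obd_recovering */
};

struct sketch_export {
    pthread_mutex_t lock;  /* plays the role of exp_lock */
    bool in_recovery;      /* plays the role of exp_in_recovery */
    bool disconnected;     /* plays the role of exp_disconnected */
    bool failed;           /* plays the role of exp_failed */
};

/*
 * Returns 0 on success, or -ENODEV if the export is already being evicted.
 * Every path, including the error path, releases both locks.
 */
int sketch_join_recovery(struct sketch_target *t, struct sketch_export *e)
{
    int rc = 0;

    assert(t != NULL && e != NULL);

    pthread_mutex_lock(&t->recovery_task_lock);
    if (t->recovering && !e->in_recovery && !e->disconnected) {
        pthread_mutex_lock(&e->lock);
        if (e->failed) {
            /* Drop the inner lock, then leave through the common
             * exit so the outer lock is dropped as well. */
            pthread_mutex_unlock(&e->lock);
            rc = -ENODEV;
            goto out_unlock;
        }
        e->in_recovery = true; /* assumption: not shown in the ticket */
        pthread_mutex_unlock(&e->lock);
    }

out_unlock:
    pthread_mutex_unlock(&t->recovery_task_lock);
    return rc;
}

/* Returns 1 if the mutex is currently free, 0 if something still holds it. */
int sketch_lock_is_free(pthread_mutex_t *m)
{
    if (pthread_mutex_trylock(m) != 0)
        return 0;
    pthread_mutex_unlock(m);
    return 1;
}
```

With the jump going straight to a label placed after the unlock instead, sketch_lock_is_free(&t->recovery_task_lock) would report 0 after a failed call, which is the user-space analogue of the spinlock lockup in the trace.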
| Comments |
| Comment by Sergey Cheremencev [ 05/Nov/13 ] |
|
Patch to solve this issue: http://review.whamcloud.com/#/c/8178/ |
| Comment by Oleg Drokin [ 03/Sep/15 ] |
|
Is this patch needed anywhere other than the 2.1.x codebase? 2.1.x has long been unused, so if the issue is unique to 2.1.6, let's just close this ticket as WONTFIX. |
| Comment by Sergey Cheremencev [ 04/Sep/15 ] |
|
It seems the issue is unique to 2.1. I think we can close it. |
| Comment by Andreas Dilger [ 15/Sep/15 ] |
|
No longer being seen beyond 2.1. |