[LU-4197] deadlock in recovery Created: 01/Nov/13  Updated: 15/Sep/15  Resolved: 15/Sep/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Alexey Lyashkov Assignee: Dmitry Eremin (Inactive)
Resolution: Won't Fix Votes: 0
Labels: patch
Environment: RHEL6

Severity: 3
Rank (Obsolete): 11384

 Description   

The bug was originally hit during Xyratex testing.

BUG: spinlock lockup on CPU#3, tgt_recov/8159, ffff880099c1ca90 (Tainted: G        W  ----------------  )
Pid: 8159, comm: tgt_recov Tainted: G        W  ----------------   2.6.32-131.17.1-lustre #0
Call Trace:
 [<ffffffff8128c2da>] ? _raw_spin_lock+0x16a/0x180
 [<ffffffff81500ff6>] ? _spin_lock+0x56/0x70
 [<ffffffffa056668a>] ? class_export_recovery_cleanup+0x3a/0x230 [obdclass]
 [<ffffffffa03ea572>] ? cfs_hash_del+0xa2/0x1d0 [libcfs]
 [<ffffffffa056668a>] ? class_export_recovery_cleanup+0x3a/0x230 [obdclass]
 [<ffffffffa056875d>] ? class_disconnect+0x15d/0x3d0 [obdclass]
 [<ffffffffa06bfd17>] ? server_disconnect_export+0x37/0x1a0 [ptlrpc]
 [<ffffffffa0c9630f>] ? filter_disconnect+0xbf/0x380 [obdfilter]
 [<ffffffffa056db97>] ? class_disconnect_export_list+0x347/0x680 [obdclass]
 [<ffffffffa056e027>] ? class_disconnect_stale_exports+0x157/0x380 [obdclass]
 [<ffffffffa06bc180>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
 [<ffffffffa06bc490>] ? check_for_clients+0x0/0x80 [ptlrpc]
 [<ffffffffa06bf04b>] ? target_recovery_overseer+0x15b/0x2d0 [ptlrpc]
 [<ffffffffa06bc180>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
 [<ffffffff81091a80>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa06c4b90>] ? target_recovery_thread+0x460/0x15d0 [ptlrpc]
 [<ffffffff810563bd>] ? finish_task_switch+0x7d/0x110
 [<ffffffffa06c4730>] ? target_recovery_thread+0x0/0x15d0 [ptlrpc]
 [<ffffffff8100c2ca>] ? child_rip+0xa/0x20
 [<ffffffff81500d50>] ? _spin_unlock_irq+0x30/0x40
 [<ffffffff8100bc10>] ? restore_args+0x0/0x30
 [<ffffffffa06c4730>] ? target_recovery_thread+0x0/0x15d0 [ptlrpc]
 [<ffffffff8100c2c0>] ? child_rip+0x0/0x20

Investigating the bug led to a commit that backported LU-1522.
A quick look suggests that the bug also exists in the target_handle_connect() function on b2_1:

        cfs_spin_lock(&target->obd_recovery_task_lock);
        if (target->obd_recovering && !export->exp_in_recovery &&
            !export->exp_disconnected) {
                cfs_spin_lock(&export->exp_lock);
                /* possible race with class_disconnect_stale_exports,
                 * export may be already in the eviction process */
                if (export->exp_failed) {
                        cfs_spin_unlock(&export->exp_lock);
                        GOTO(out, rc = -ENODEV);
                }

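So if we race with class_disconnect_stale_exports(), we exit with obd_recovery_task_lock still held. The next attempt to take that lock then spins forever (as appears to happen in class_export_recovery_cleanup() in the trace above), which kills recovery and the whole node.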
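To illustrate the lock-ordering problem described above, here is a minimal user-space sketch in which every exit path releases both locks. It is not the patch referenced in this ticket and not Lustre code: struct target_model, struct export_model, connect_during_recovery() and lock_is_free() are hypothetical names, pthread mutexes stand in for the kernel spinlocks, and the step that marks the export as in recovery is an assumption about what follows the quoted snippet.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the Lustre structures; only the fields
 * referenced in the quoted snippet are modeled. */
struct target_model {
        pthread_mutex_t obd_recovery_task_lock;
        bool            obd_recovering;
};

struct export_model {
        pthread_mutex_t exp_lock;
        bool            exp_in_recovery;
        bool            exp_disconnected;
        bool            exp_failed;
};

/* Returns 0 on success or -ENODEV if the export is already being evicted.
 * Both locks are released on every path, including the early exit. */
int connect_during_recovery(struct target_model *target,
                            struct export_model *export)
{
        int rc = 0;

        pthread_mutex_lock(&target->obd_recovery_task_lock);
        if (target->obd_recovering && !export->exp_in_recovery &&
            !export->exp_disconnected) {
                pthread_mutex_lock(&export->exp_lock);
                if (export->exp_failed) {
                        /* Drop the inner lock, then fall through to the
                         * label that drops the outer lock as well. */
                        pthread_mutex_unlock(&export->exp_lock);
                        rc = -ENODEV;
                        goto out_unlock;
                }
                /* Assumption: the real code marks the export here. */
                export->exp_in_recovery = true;
                pthread_mutex_unlock(&export->exp_lock);
        }
out_unlock:
        pthread_mutex_unlock(&target->obd_recovery_task_lock);
        return rc;
}

/* Helper for checking the result: true if nobody holds the lock. */
bool lock_is_free(pthread_mutex_t *lock)
{
        if (pthread_mutex_trylock(lock) != 0)
                return false;
        pthread_mutex_unlock(lock);
        return true;
}
```

Calling connect_during_recovery() with exp_failed set should return -ENODEV and leave both mutexes free, whereas the snippet quoted above would leave the outer lock held on that path.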



 Comments   
Comment by Sergey Cheremencev [ 05/Nov/13 ]

Patch to solve this issue: http://review.whamcloud.com/#/c/8178/

Comment by Oleg Drokin [ 03/Sep/15 ]

Is this patch needed anywhere other than the 2.1.x codebase? 2.1.x is long unused, so if the issue is unique to 2.1.6, let's just close this ticket as WONTFIX.

Comment by Sergey Cheremencev [ 04/Sep/15 ]

It seems the issue is unique to 2.1. I think we can close it.

Comment by Andreas Dilger [ 15/Sep/15 ]

No longer being seen beyond 2.1.
