Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Affects Version: Lustre 2.1.1
- Fix Version: None
- Environment:
  Server: rhel6.2, lustre-2.1.1, ofed-1.5.3.1
  Client: sles11sp1, lustre-2.1.1, ofed-1.5.3.1
  Git repo at https://github.com/jlan/lustre-nas/commits/nas-2.1.1
- 3
- 4514
Description
Our Lustre server crashed multiple times a day. This is one of the failures:
<3>LustreError: 7606:0:(ldlm_lib.c:1259:abort_lock_replay_queue()) @@@ aborted: req@ffff8806f8a9a000 x1404476495656183/t0(0) o-1->da67355c-78b9-3337-cb94-359b564bc4aa@NET_0x500000a972885_UUID:0/0 lens 296/0 e 26 to 0 dl 1339654050 ref 1 fl Complete:/ffffffff/ffffffff rc 0/-1
<3>LustreError: 7606:0:(ldlm_lib.c:1259:abort_lock_replay_queue()) @@@ aborted: req@ffff8806f3996000 x1404476930188631/t0(0) o-1->736da151-8a99-44ed-0646-bb0e3daa974e@NET_0x500000a970f63_UUID:0/0 lens 296/0 e 26 to 0 dl 1339654056 ref 1 fl Complete:/ffffffff/ffffffff rc 0/-1
<3>LustreError: 7606:0:(ldlm_lib.c:1259:abort_lock_replay_queue()) Skipped 147 previous similar messages
<4>Lustre: 7606:0:(ldlm_lib.c:1562:target_recovery_overseer()) recovery is aborted, evict exports in recovery
<0>LustreError: 7606:0:(ldlm_lib.c:1612:target_next_replay_req()) ASSERTION(cfs_atomic_read(&obd->obd_req_replay_clients) == 0) failed
<0>LustreError: 7606:0:(ldlm_lib.c:1612:target_next_replay_req()) LBUG
<4>Pid: 7606, comm: tgt_recov
<4>
<4>Call Trace:
<4> [<ffffffffa0578855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0578e95>] lbug_with_loc+0x75/0xe0 [libcfs]
<4> [<ffffffffa0583da6>] libcfs_assertion_failed+0x66/0x70 [libcfs]
<4> [<ffffffffa0732d53>] target_recovery_thread+0xed3/0xf50 [ptlrpc]
<4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
<4> [<ffffffff8100c14a>] child_rip+0xa/0x20
<4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
<4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
<4> [<ffffffff8100c140>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 7606, comm: tgt_recov Not tainted 2.6.32-220.4.1.el6.20120130.x86_64.lustre211 #1
<4>Call Trace:
<4> [<ffffffff81520c76>] ? panic+0x78/0x164
<4> [<ffffffffa0578eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
<4> [<ffffffffa0583da6>] ? libcfs_assertion_failed+0x66/0x70 [libcfs]
<4> [<ffffffffa0732d53>] ? target_recovery_thread+0xed3/0xf50 [ptlrpc]
<4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
<4> [<ffffffff8100c14a>] ? child_rip+0xa/0x20
<4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
<4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
<4> [<ffffffff8100c140>] ? child_rip+0x0/0x20
[22]kdb>
Here is the line that LBUG'ed,
    LASSERT(cfs_atomic_read(&obd->obd_req_replay_clients) == 0);
in target_next_replay_req():
static struct ptlrpc_request *target_next_replay_req(struct obd_device *obd)
{
        struct ptlrpc_request *req = NULL;
        ENTRY;

        CDEBUG(D_HA, "Waiting for transno "LPD64"\n",
               obd->obd_next_recovery_transno);
        if (target_recovery_overseer(obd, check_for_next_transno,
                                     exp_req_replay_healthy)) {
                abort_req_replay_queue(obd);
                abort_lock_replay_queue(obd);
        }

        cfs_spin_lock(&obd->obd_recovery_task_lock);
        if (!cfs_list_empty(&obd->obd_req_replay_queue)) {
                /* dequeue the next request to replay (dequeue code omitted) */
                cfs_spin_unlock(&obd->obd_recovery_task_lock);
        } else {
                cfs_spin_unlock(&obd->obd_recovery_task_lock);
                LASSERT(cfs_list_empty(&obd->obd_req_replay_queue));
                LASSERT(cfs_atomic_read(&obd->obd_req_replay_clients) == 0); /* <======= */
                /** evict exports failed VBR */
                class_disconnect_stale_exports(obd, exp_vbr_healthy);
        }
        RETURN(req);
}
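For readers unfamiliar with the recovery code, the check that fired encodes a simple invariant: once the request-replay queue is empty, no client may still be counted in obd_req_replay_clients. Below is a minimal userspace sketch of that invariant only, not Lustre code; struct recovery_state, replay_one_req(), abort_replay_queue() and next_replay_req() are invented names for illustration. If a path empties the queue without also retiring the clients that owned those requests, the assert fires, which is what the LBUG above reports.

/* Minimal sketch of the "empty replay queue => no replay clients" invariant.
 * Hypothetical names; only the relationship mirrors target_next_replay_req(). */
#include <assert.h>
#include <stdio.h>

struct recovery_state {
        int queued_reqs;        /* stand-in for the obd_req_replay_queue length */
        int replay_clients;     /* stand-in for obd_req_replay_clients */
};

/* Normal path: each queued request here belongs to a distinct client,
 * so handling it also retires that client from the replay count. */
static void replay_one_req(struct recovery_state *rs)
{
        if (rs->queued_reqs > 0) {
                rs->queued_reqs--;
                rs->replay_clients--;
        }
}

/* Abort path: drop every queued request.  If the clients that owned them
 * are not retired as well (fixup_clients == 0), the invariant is broken. */
static void abort_replay_queue(struct recovery_state *rs, int fixup_clients)
{
        rs->queued_reqs = 0;
        if (fixup_clients)
                rs->replay_clients = 0;
}

/* Same shape as the failing check: empty queue must imply zero clients. */
static void next_replay_req(struct recovery_state *rs)
{
        if (rs->queued_reqs == 0) {
                assert(rs->replay_clients == 0);
                printf("replay queue drained, no clients left to replay\n");
        }
}

int main(void)
{
        struct recovery_state rs = { .queued_reqs = 2, .replay_clients = 2 };

        replay_one_req(&rs);            /* one client replays normally */
        abort_replay_queue(&rs, 1);     /* pass 0 here to see the assert trip */
        next_replay_req(&rs);
        return 0;
}

Built with a plain C compiler (e.g. gcc -std=c99 sketch.c), the program prints the drained message; changing the second argument of abort_replay_queue() to 0 models an abort that leaves a client counted and trips the assert, analogous to the state the server's assertion reports during aborted recovery.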
Attachments
Issue Links
- is related to: LU-1166 recovery never finished (Resolved)
Trackbacks
- Changelog 2.1: Changes from version 2.1.2 to version 2.1.3. Server support for kernels: 2.6.18-308.13.1.el5 (RHEL5), 2.6.32-279.2.1.el6 (RHEL6). Client support for unpatched kernels: 2.6.18-308.13.1.el5 (RHEL5), 2.6.32-279.2.1....