
[LU-1522] ASSERTION(cfs_atomic_read(&obd->obd_req_replay_clients) == 0) failed


    Description

Our Lustre server crashed multiple times a day. This is one of the failures:

      <3>LustreError: 7606:0:(ldlm_lib.c:1259:abort_lock_replay_queue()) @@@ aborted: req@ffff8806f8a9a000 x1404476495656183/t0(0) o-1->da67355c-78b9-3337-cb94-359b564bc4aa@NET_0x500000a972885_UUID:0/0 lens 296/0 e 26 to 0 dl 1339654050 ref 1 fl Complete:/ffffffff/ffffffff rc 0/-1
      <3>LustreError: 7606:0:(ldlm_lib.c:1259:abort_lock_replay_queue()) @@@ aborted: req@ffff8806f3996000 x1404476930188631/t0(0) o-1->736da151-8a99-44ed-0646-bb0e3daa974e@NET_0x500000a970f63_UUID:0/0 lens 296/0 e 26 to 0 dl 1339654056 ref 1 fl Complete:/ffffffff/ffffffff rc 0/-1
      <3>LustreError: 7606:0:(ldlm_lib.c:1259:abort_lock_replay_queue()) Skipped 147 previous similar messages
      <4>Lustre: 7606:0:(ldlm_lib.c:1562:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      <0>LustreError: 7606:0:(ldlm_lib.c:1612:target_next_replay_req()) ASSERTION(cfs_atomic_read(&obd->obd_req_replay_clients) == 0) failed
      <0>LustreError: 7606:0:(ldlm_lib.c:1612:target_next_replay_req()) LBUG
      <4>Pid: 7606, comm: tgt_recov
      <4>
      <4>Call Trace:
      <4> [<ffffffffa0578855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa0578e95>] lbug_with_loc+0x75/0xe0 [libcfs]
      <4> [<ffffffffa0583da6>] libcfs_assertion_failed+0x66/0x70 [libcfs]
      <4> [<ffffffffa0732d53>] target_recovery_thread+0xed3/0xf50 [ptlrpc]
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffff8100c14a>] child_rip+0xa/0x20
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffff8100c140>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 7606, comm: tgt_recov Not tainted 2.6.32-220.4.1.el6.20120130.x86_64.lustre211 #1
      <4>Call Trace:
      <4> [<ffffffff81520c76>] ? panic+0x78/0x164
      <4> [<ffffffffa0578eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
      <4> [<ffffffffa0583da6>] ? libcfs_assertion_failed+0x66/0x70 [libcfs]
      <4> [<ffffffffa0732d53>] ? target_recovery_thread+0xed3/0xf50 [ptlrpc]
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffff8100c14a>] ? child_rip+0xa/0x20
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
      <4> [<ffffffff8100c140>] ? child_rip+0x0/0x20
      [22]kdb>

Here is the line that LBUG'ed, in target_next_replay_req():

LASSERT(cfs_atomic_read(&obd->obd_req_replay_clients) == 0);

static struct ptlrpc_request *target_next_replay_req(struct obd_device *obd)
{
        struct ptlrpc_request *req = NULL;
        ENTRY;

        CDEBUG(D_HA, "Waiting for transno "LPD64"\n",
               obd->obd_next_recovery_transno);

        if (target_recovery_overseer(obd, check_for_next_transno,
                                     exp_req_replay_healthy)) {
                abort_req_replay_queue(obd);
                abort_lock_replay_queue(obd);
        }

        cfs_spin_lock(&obd->obd_recovery_task_lock);
        if (!cfs_list_empty(&obd->obd_req_replay_queue)) {
                req = cfs_list_entry(obd->obd_req_replay_queue.next,
                                     struct ptlrpc_request, rq_list);
                cfs_list_del_init(&req->rq_list);
                obd->obd_requests_queued_for_recovery--;
                cfs_spin_unlock(&obd->obd_recovery_task_lock);
        } else {
                cfs_spin_unlock(&obd->obd_recovery_task_lock);
                LASSERT(cfs_list_empty(&obd->obd_req_replay_queue));
                LASSERT(cfs_atomic_read(&obd->obd_req_replay_clients) == 0); <=======
                /** evict exports failed VBR */
                class_disconnect_stale_exports(obd, exp_vbr_healthy);
        }

        RETURN(req);
}
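For readers unfamiliar with the recovery counters, here is a minimal, self-contained model of the invariant the failed LASSERT encodes (hypothetical names and plain C11 atomics; not the actual Lustre code). obd_req_replay_clients counts clients still expected to replay requests, and the assertion expects it to have reached zero by the time the replay queue is empty. If an aborted recovery drains the queues without the per-client counter ever being decremented for clients that never replayed, which is one plausible reading of this crash, the assertion fires:

/*
 * Minimal model of the failed invariant (hypothetical names, C11 atomics;
 * not the actual Lustre code). Compile with: cc -std=c11 model.c
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int req_replay_clients;   /* clients still expected to replay */
static int        req_replay_queue_len; /* queued replay requests           */

/* Normal recovery: a client replays its requests, the queue drains, and
 * the per-client counter is decremented on that same path.                */
static void client_completes_replay(void)
{
        req_replay_queue_len = 0;
        atomic_fetch_sub(&req_replay_clients, 1);
}

/* Aborted recovery, as in the crash: the replay queues are drained
 * (abort_req_replay_queue()/abort_lock_replay_queue() in the real code),
 * but unless the stale exports are also cleaned up on this path, nothing
 * ever decrements the per-client counter.                                 */
static void abort_recovery(bool cleanup_exports)
{
        req_replay_queue_len = 0;
        if (cleanup_exports)
                atomic_fetch_sub(&req_replay_clients, 1);
}

int main(void)
{
        atomic_store(&req_replay_clients, 1); /* one client never replays */
        req_replay_queue_len = 1;

        abort_recovery(false);                /* the problematic ordering */

        /* Mirrors LASSERT(cfs_atomic_read(&obd->obd_req_replay_clients) == 0)
         * in target_next_replay_req(): the queue is empty but the counter
         * is still 1, so this fires.                                       */
        assert(req_replay_queue_len == 0);
        assert(atomic_load(&req_replay_clients) == 0); /* fails */
        return 0;
}

Run as written, the final assert aborts, which is the user-space analogue of the LBUG above; calling abort_recovery(true) models an abort path that also cleans up the exports, and the assertion then holds.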

Activity

            pjones Peter Jones added a comment -

            Landed for 2.1.3 and 2.3


jaylan Jay Lan (Inactive) added a comment -

Patch set 2 of review #3145 was landed on b2_1, but not master.
The LU-1432 patch was landed on master, but not b2_1.

We had an MDS crash after applying review #3122, which is essentially the same as patch set 1 of #3145. After the crash, I cherry-picked the LU-1432 patch onto our b2_1, and it has been running in our production systems without a crash for several weeks now.

So, please comment on whether I should have both LU-1432 and patch set 2 of #3145. Thanks!


jaylan Jay Lan (Inactive) added a comment -

We installed the 2.1.1-2.1nasS build on service160 (an MDS). It crashed on boot. Since it is a production machine, the control room put the 2.1.1-2nasS version back in and brought service160 up again.

The difference between 2nasS and 2.1nasS was that I replaced Di Wang's #3115 with #3122.


jaylan Jay Lan (Inactive) added a comment -

No, I do not remember seeing that. Not on ASSERTION(cfs_list_empty(&top->loh_lru)).


tappro Mikhail Pershin added a comment -

Jay, that LBUG doesn't look related. Do you see it every time?


tappro Mikhail Pershin added a comment -

Bob, you are right; that lock doesn't exist in master, and I missed it for b2_1. I will update the patch.


jaylan Jay Lan (Inactive) added a comment -

I compared my patch (adjusted from review #3122) with #3145. They are essentially identical, except that my patch also moves class_export_recovery_cleanup() to its new location, as #3122 does.


jaylan Jay Lan (Inactive) added a comment -

After applying http://review.whamcloud.com/3122, the MDS LBUG'ed:

LustreError: 10878:0:(mdt_handler.c:5529:mdt_iocontrol()) Aborting recovery for device nbp2-MDT0000
LustreError: 11533:0:(lu_object.c:113:lu_object_put()) ASSERTION(cfs_list_empty(&top->loh_lru)) failed
LustreError: 11533:0:(lu_object.c:113:lu_object_put()) LBUG
Pid: 11533, comm: mdt_rdpg_07

Call Trace:
[<ffffffffa056b855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa056be95>] lbug_with_loc+0x75/0xe0 [libcfs]
[<ffffffffa0576da6>] libcfs_assertion_failed+0x66/0x70 [libcfs]


bogl Bob Glossman (Inactive) added a comment -

Mikhail,
Maybe I'm wrong, but it looks to me like your mod to ldlm_lib.c in http://review.whamcloud.com/3145 now allows an error exit from the routine that leaves &target->obd_recovery_task_lock still held. Did you mean to do that?

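To illustrate the pattern Bob is describing, here is a minimal sketch (hypothetical names and plain pthreads; not the patch itself): once a path acquires the recovery task lock, every exit, including a newly added error exit, must drop it, or the lock is leaked and the next taker deadlocks.

/*
 * Sketch of a lock-leaking error exit (hypothetical names, pthreads;
 * not the actual ldlm_lib.c code). Compile with: cc leak.c -pthread
 */
#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t recovery_task_lock;

static int next_recovery_item(int queue_empty)
{
        pthread_spin_lock(&recovery_task_lock);

        if (queue_empty) {
                /* Buggy variant: a bare "return -1;" here would leave
                 * recovery_task_lock held forever, which is the leak Bob
                 * suspects in the modified routine. Unlock before bailing: */
                pthread_spin_unlock(&recovery_task_lock);
                return -1;
        }

        /* ... dequeue the next item under the lock ... */
        pthread_spin_unlock(&recovery_task_lock);
        return 0;
}

int main(void)
{
        pthread_spin_init(&recovery_task_lock, PTHREAD_PROCESS_PRIVATE);
        printf("empty queue -> %d\n", next_recovery_item(1));
        printf("work queued -> %d\n", next_recovery_item(0));
        pthread_spin_destroy(&recovery_task_lock);
        return 0;
}

The usual fix for such a leak is to unlock on every exit path, or to funnel all exits through a single unlock site.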
tappro Mikhail Pershin added a comment -

Jay, check this one: http://review.whamcloud.com/3145

People

  tappro Mikhail Pershin
  jaylan Jay Lan (Inactive)