[LU-1522] ASSERTION(cfs_atomic_read(&obd->obd_req_replay_clients) == 0) failed Created: 14/Jun/12  Updated: 25/Aug/12  Resolved: 25/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.1
Fix Version/s: Lustre 2.3.0, Lustre 2.1.3

Type: Bug Priority: Minor
Reporter: Jay Lan (Inactive) Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None
Environment:

Server: rhel6.2, lustre-2.1.1, ofed-1.5.3.1
Client: sles11sp1, lustre-2.1.1, ofed-1.5.3.1
Git repo at https://github.com/jlan/lustre-nas/commits/nas-2.1.1


Issue Links:
Related
is related to LU-1166 recovery never finished Resolved
Severity: 3
Rank (Obsolete): 4514

 Description   

Our Lustre server crashed multiple times a day. This is one of the failures:

<3>LustreError: 7606:0:(ldlm_lib.c:1259:abort_lock_replay_queue()) @@@ aborted: req@ffff8806f8a9a000 x1404476495656183/t0(0) o-1->da67355c-78b9-3337-cb94-359b564bc4aa@NET_0x500000a972885_UUID:0/0 lens 296/0 e 26 to 0 dl 1339654050 ref 1 fl Complete:/ffffffff/ffffffff rc 0/-1
<3>LustreError: 7606:0:(ldlm_lib.c:1259:abort_lock_replay_queue()) @@@ aborted: req@ffff8806f3996000 x1404476930188631/t0(0) o-1->736da151-8a99-44ed-0646-bb0e3daa974e@NET_0x500000a970f63_UUID:0/0 lens 296/0 e 26 to 0 dl 1339654056 ref 1 fl Complete:/ffffffff/ffffffff rc 0/-1
<3>LustreError: 7606:0:(ldlm_lib.c:1259:abort_lock_replay_queue()) Skipped 147 previous similar messages
<4>Lustre: 7606:0:(ldlm_lib.c:1562:target_recovery_overseer()) recovery is aborted, evict exports in recovery
<0>LustreError: 7606:0:(ldlm_lib.c:1612:target_next_replay_req()) ASSERTION(cfs_atomic_read(&obd->obd_req_replay_clients) == 0) failed
<0>LustreError: 7606:0:(ldlm_lib.c:1612:target_next_replay_req()) LBUG
<4>Pid: 7606, comm: tgt_recov
<4>
<4>Call Trace:
<4> [<ffffffffa0578855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0578e95>] lbug_with_loc+0x75/0xe0 [libcfs]
<4> [<ffffffffa0583da6>] libcfs_assertion_failed+0x66/0x70 [libcfs]
<4> [<ffffffffa0732d53>] target_recovery_thread+0xed3/0xf50 [ptlrpc]
<4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
<4> [<ffffffff8100c14a>] child_rip+0xa/0x20
<4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
<4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
<4> [<ffffffff8100c140>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 7606, comm: tgt_recov Not tainted 2.6.32-220.4.1.el6.20120130.x86_64.lustre211 #1
<4>Call Trace:
<4> [<ffffffff81520c76>] ? panic+0x78/0x164
<4> [<ffffffffa0578eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
<4> [<ffffffffa0583da6>] ? libcfs_assertion_failed+0x66/0x70 [libcfs]
<4> [<ffffffffa0732d53>] ? target_recovery_thread+0xed3/0xf50 [ptlrpc]
<4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
<4> [<ffffffff8100c14a>] ? child_rip+0xa/0x20
<4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
<4> [<ffffffffa0731e80>] ? target_recovery_thread+0x0/0xf50 [ptlrpc]
<4> [<ffffffff8100c140>] ? child_rip+0x0/0x20
[22]kdb>

Here is the line that LBUG'ed:
LASSERT(cfs_atomic_read(&obd->obd_req_replay_clients) == 0);
in target_next_replay_req():

static struct ptlrpc_request *target_next_replay_req(struct obd_device *obd)
{
        struct ptlrpc_request *req = NULL;
        ENTRY;

        CDEBUG(D_HA, "Waiting for transno "LPD64"\n",
               obd->obd_next_recovery_transno);

        if (target_recovery_overseer(obd, check_for_next_transno,
                                     exp_req_replay_healthy)) {
                abort_req_replay_queue(obd);
                abort_lock_replay_queue(obd);
        }

        cfs_spin_lock(&obd->obd_recovery_task_lock);
        if (!cfs_list_empty(&obd->obd_req_replay_queue)) {
                req = cfs_list_entry(obd->obd_req_replay_queue.next,
                                     struct ptlrpc_request, rq_list);
                cfs_list_del_init(&req->rq_list);
                obd->obd_requests_queued_for_recovery--;
                cfs_spin_unlock(&obd->obd_recovery_task_lock);
        } else {
                cfs_spin_unlock(&obd->obd_recovery_task_lock);
                LASSERT(cfs_list_empty(&obd->obd_req_replay_queue));
                LASSERT(cfs_atomic_read(&obd->obd_req_replay_clients) == 0); /* <====== LBUG here */
                /** evict exports failed VBR */
                class_disconnect_stale_exports(obd, exp_vbr_healthy);
        }

        RETURN(req);
}



 Comments   
Comment by Peter Jones [ 14/Jun/12 ]

Niu

Could you please comment on this one?

Thanks

Peter

Comment by Bob Glossman (Inactive) [ 14/Jun/12 ]

Niu,
I think this may be a dup of LU-1166. If I'm correct, it may already be fixed by commits 042980026c596ff08c97764bbcf7a1e710fd4f5a and abdd09fe58961fe071612b6884faeca2379ba341 to b2_1. The commits were made after 2.1.1, so they should be present in 2.1.2.

Comment by Jinshan Xiong (Inactive) [ 14/Jun/12 ]

Di talked about this problem several days ago, but I don't know if he has made any progress.

Comment by Di Wang [ 14/Jun/12 ]

Ah, yes. The problem was indeed introduced by this patch: http://review.whamcloud.com/#change,2255 (LU-1166). The reason is that obd_req_replay_clients and obd_lock_replay_clients are not decremented when recovery is aborted.
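For illustration, here is a minimal sketch of the accounting being described, assuming the exp_req_replay_needed and exp_lock_replay_needed bitfields of struct obd_export from this code base; this is not the landed patch. When an export that was counted as a replay client is torn down during an aborted recovery, the per-device counters have to be dropped, otherwise the LASSERT in target_next_replay_req() fires:

static void replay_client_cleanup_sketch(struct obd_export *exp)
{
        struct obd_device *obd = exp->exp_obd;

        /* drop the request-replay accounting for this client */
        if (exp->exp_req_replay_needed) {
                LASSERT(cfs_atomic_read(&obd->obd_req_replay_clients) > 0);
                cfs_atomic_dec(&obd->obd_req_replay_clients);
        }
        /* drop the lock-replay accounting for this client */
        if (exp->exp_lock_replay_needed) {
                LASSERT(cfs_atomic_read(&obd->obd_lock_replay_clients) > 0);
                cfs_atomic_dec(&obd->obd_lock_replay_clients);
        }
}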

Comment by Di Wang [ 14/Jun/12 ]

Here is a workaround fix.

commit 427bcf9eff0a931f64c0986c062d2fea7f87f983
Author: Wang Di <di.wang@whamcloud.com>
Date: Tue May 29 15:25:30 2012 -0700

LU-1166 ptlrpc: Do replay export cleanup during class_disconnect

Since the exports might be held for some reason, do the
replay export cleanup during class_disconnect instead of at
the final export put.

Change-Id: I048b66b9c645fa772c34096791a02b6c210cfc23
Signed-off-by: Wang Di <di.wang@whamcloud.com>

diff --git a/lustre/obdclass/genops.c b/lustre/obdclass/genops.c
index a759eed..d1944a2 100644
--- a/lustre/obdclass/genops.c
+++ b/lustre/obdclass/genops.c
@@ -839,7 +839,7 @@ void class_export_put(struct obd_export *exp)

/* release nid stat refererence */
lprocfs_exp_cleanup(exp);

- class_export_recovery_cleanup(exp);
+ //class_export_recovery_cleanup(exp);

obd_zombie_export_add(exp);
}
@@ -1200,6 +1200,7 @@ int class_disconnect(struct obd_export *export)
&export->exp_nid_hash);

class_unlink_export(export);
+ class_export_recovery_cleanup(export);
no_disconn:
class_export_put(export);
RETURN(0);

Mike said he will have a new patch.

Comment by Jay Lan (Inactive) [ 15/Jun/12 ]

Is this WA safe to pick up? I need to rebuild the lustre server for production to deal with a large number of LBUG crashes and freezes on our production systems.

Comment by Jay Lan (Inactive) [ 15/Jun/12 ]

Well, let me change my question a bit. It is unfair to ask you to say "it is safe" without going through sanity testing. I'd like to know whether you believe this is the right fix and would avoid some of the LBUGs or freezes.

Comment by Di Wang [ 15/Jun/12 ]

Well, I actually think it is the right fix, and stable enough, at least in my local sanity testing. Hmm, maybe I should submit it to Maloo for review and testing there.

Comment by Niu Yawei (Inactive) [ 15/Jun/12 ]

Mike, are you working on a new patch? Any comments on Jay's question? Thanks.

Comment by Di Wang [ 15/Jun/12 ]

http://review.whamcloud.com/#change,3115

Comment by Mikhail Pershin [ 16/Jun/12 ]

This fix brings us back to LU-1166; it is effectively just a revert of the LU-1166 patch. Otherwise we will see both LU-1166 and LU-1522 marked as "fixed" while the LU-1166 problem simply comes back.

I am not working on a new patch now, but will think about a proper fix.

Comment by Mikhail Pershin [ 18/Jun/12 ]

Here is an LU-1166 fix that should be safe from the LU-1522 problem:
http://review.whamcloud.com/3122

Comment by Mikhail Pershin [ 18/Jun/12 ]

Caused by the LU-1166 fix.

Comment by Jay Lan (Inactive) [ 18/Jun/12 ]

Is http://review.whamcloud.com/3122 supposed to be applied on top of the two fixes committed in LU-1166 (to the b2_1 branch)?

Comment by Jay Lan (Inactive) [ 18/Jun/12 ]

It seems to be a replacement for one of the LU-1166 commits, 0429800?

Comment by Jay Lan (Inactive) [ 18/Jun/12 ]

Ah, OK, it was indeed supposed to be applied on top of the two commits in LU-1166! I resolved the conflicts.

Comment by Mikhail Pershin [ 18/Jun/12 ]

It should be applied just on top of the previous patches.

Comment by Jay Lan (Inactive) [ 18/Jun/12 ]

The patch was against the master branch. Compilation failed on the b2_1 branch with incompatible pointer type errors:

/usr/src/redhat/BUILD/lustre-2.1.1/lustre/obdclass/genops.c: In function 'class_export_recovery_cleanup':
/usr/src/redhat/BUILD/lustre-2.1.1/lustre/obdclass/genops.c:1092: error: passing argument 1 of 'atomic_read' from incompatible pointer type
/usr/src/kernels/2.6.32-220.4.1.el6.20120130.x86_64.lustre211/arch/x86/include/asm/atomic_64.h:21: note: expected 'const struct atomic_t *' but argument is of type 'int *'
/usr/src/redhat/BUILD/lustre-2.1.1/lustre/obdclass/genops.c:1092: error: passing argument 1 of 'atomic_read' from incompatible pointer type
/usr/src/kernels/2.6.32-220.4.1.el6.20120130.x86_64.lustre211/arch/x86/include/asm/atomic_64.h:21: note: expected 'const struct atomic_t *' but argument is of type 'int *'
/usr/src/redhat/BUILD/lustre-2.1.1/lustre/obdclass/genops.c:1093: error: passing argument 1 of 'atomic_dec' from incompatible pointer type
/usr/src/kernels/2.6.32-220.4.1.el6.20120130.x86_64.lustre211/arch/x86/include/asm/atomic_64.h:104: note: expected 'struct atomic_t *' but argument is of type 'int *'
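
For reference, a hedged illustration of the failure mode rather than the actual Lustre code: the master patch manipulates the counter at genops.c:1092-1093 through the atomic API, while the corresponding b2_1 field is a plain int, so atomic_read()/atomic_dec() receive an 'int *' where they expect an 'atomic_t *':

#include <asm/atomic.h>

struct mismatch_sketch {
        int      plain_count;   /* b2_1-style field */
        atomic_t atomic_count;  /* master-style field */
};

static void mismatch_demo(struct mismatch_sketch *s)
{
        atomic_dec(&s->atomic_count);   /* fine: takes an atomic_t * */
        /* atomic_dec(&s->plain_count); would reproduce the errors above */
        s->plain_count--;               /* b2_1 code must use plain int ops */
}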

Comment by Jay Lan (Inactive) [ 18/Jun/12 ]

OK, you simply moved the routine to a different location, so I can do the same in the b2_1 code.

Comment by Mikhail Pershin [ 19/Jun/12 ]

Jay, it is not just a moved routine; the major part is also the exp_failed setting/checking. By the way, you can just keep class_export_recovery_cleanup() where it is and keep the rest of the code. I can prepare a patch for b2_1 a bit later.
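
A rough sketch of the exp_failed setting/checking pattern referred to here, assuming the exp_failed and exp_lock fields of struct obd_export from this code base; this is not the actual patch. The export is marked failed under its lock so the recovery cleanup runs exactly once:

static void export_fail_sketch(struct obd_export *exp)
{
        cfs_spin_lock(&exp->exp_lock);
        if (exp->exp_failed) {          /* already failed elsewhere */
                cfs_spin_unlock(&exp->exp_lock);
                return;
        }
        exp->exp_failed = 1;            /* claim the failure ourselves */
        cfs_spin_unlock(&exp->exp_lock);

        /* now safe to run the recovery cleanup exactly once */
        class_export_recovery_cleanup(exp);
}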

Comment by Niu Yawei (Inactive) [ 19/Jun/12 ]

Reassign to Mike.

Comment by Mikhail Pershin [ 19/Jun/12 ]

Jay, check this one: http://review.whamcloud.com/3145

Comment by Bob Glossman (Inactive) [ 19/Jun/12 ]

Mikhail,
Maybe I'm wrong, but it looks to me like your mod to ldlm_lib.c in http://review.whamcloud.com/3145 now allows an error exit from the routine that leaves &target->obd_recovery_task_lock still locked. Did you mean to do that?
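
The pattern being questioned, as a minimal sketch rather than the actual ldlm_lib.c code: any early error return taken while obd_recovery_task_lock is held must drop the lock first, or the spinlock leaks on that path.

static int recovery_step_sketch(struct obd_device *target, int error)
{
        cfs_spin_lock(&target->obd_recovery_task_lock);
        if (error) {
                /* the exit in question: must unlock before bailing out */
                cfs_spin_unlock(&target->obd_recovery_task_lock);
                return error;
        }
        /* ... normal processing under the lock ... */
        cfs_spin_unlock(&target->obd_recovery_task_lock);
        return 0;
}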

Comment by Jay Lan (Inactive) [ 19/Jun/12 ]

After applying http://review.whamcloud.com/3122, the MDS LBUG'ed:

LustreError: 10878:0:(mdt_handler.c:5529:mdt_iocontrol()) Aborting recovery for device nbp2-MDT0000
LustreError: 11533:0:(lu_object.c:113:lu_object_put()) ASSERTION(cfs_list_empty(&top->loh_lru)) failed
LustreError: 11533:0:(lu_object.c:113:lu_object_put()) LBUG
Pid: 11533, comm: mdt_rdpg_07

Call Trace:
[<ffffffffa056b855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa056be95>] lbug_with_loc+0x75/0xe0 [libcfs]
[<ffffffffa0576da6>] libcfs_assertion_failed+0x66/0x70 [libcfs]

Comment by Jay Lan (Inactive) [ 19/Jun/12 ]

I compared my patch, adapted from review #3122, with #3145; they are essentially identical, except my patch also moved class_export_recovery_cleanup() to its new location, as #3122 does.

Comment by Mikhail Pershin [ 19/Jun/12 ]

Bob, you are right; that lock doesn't exist in master and I missed it for b2_1. I will update the patch.

Comment by Mikhail Pershin [ 19/Jun/12 ]

Jay, that LBUG doesn't look related; do you see it every time?

Comment by Jay Lan (Inactive) [ 19/Jun/12 ]

No, I do not remember seeing that before, at least not the ASSERTION(cfs_list_empty(&top->loh_lru)) one.

Comment by Jay Lan (Inactive) [ 19/Jun/12 ]

We installed the 2.1.1-2.1nasS build on service160. It crashed on boot. Since it is a production machine, the control room put the 2.1.1-2nasS version in and booted service160 (an MDS) back up.

The difference between 2nasS and 2.1nasS was that I replaced Di Wang's #3115 with #3122.

Comment by Jay Lan (Inactive) [ 23/Jul/12 ]

Patch set 2 of review #3145 was landed to b2_1, but not master.
The LU-1432 patch was landed to master, but not b2_1.

We had an MDS crash after applying review #3122, which is essentially the same as patch set 1 of #3145. After the crash, I cherry-picked the LU-1432 patch onto our b2_1, and it has been running on our production systems without a crash for several weeks now.

So, please comment on whether I should have both LU-1432 and patch set 2 of #3145. Thanks!

Comment by Peter Jones [ 25/Aug/12 ]

Landed for 2.1.3 and 2.3
