[LU-1522] ASSERTION(cfs_atomic_read(&obd->obd_req_replay_clients) == 0) failed Created: 14/Jun/12 Updated: 25/Aug/12 Resolved: 25/Aug/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.1 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.1.3 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jay Lan (Inactive) | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | Server: rhel6.2, lustre-2.1.1, ofed-1.5.3.1 |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 4514 |
| Description |
|
Our lustre server crashed multiple times a day. This is one of the failures:

<3>LustreError: 7606:0:(ldlm_lib.c:1259:abort_lock_replay_queue()) @@@ aborted: req@ffff8806f8a9a000 x1404476495656183/t0(0) o-1->da67355c-78b9-3337-cb94-359b564bc4aa@NET_0x500000a972885_UUID:0/0 lens 296/0 e 26 to 0 dl 1339654050 ref 1 fl Complete:/ffffffff/ffffffff rc 0/-1

Here is the function that LBUG'ed (elided lines shown as "..."):

static struct ptlrpc_request *target_next_replay_req(struct obd_device *obd)
{
        ...
        CDEBUG(D_HA, "Waiting for transno "LPD64"\n", ...);
        if (target_recovery_overseer(obd, check_for_next_transno, ...))
                ...
        cfs_spin_lock(&obd->obd_recovery_task_lock);
        ...
        else {
                cfs_spin_unlock(&obd->obd_recovery_task_lock);
                LASSERT(cfs_list_empty(&obd->obd_req_replay_queue));
                LASSERT(cfs_atomic_read(&obd->obd_req_replay_clients) == 0); <======= LBUG here
                /** evict exports failed VBR */
                class_disconnect_stale_exports(obd, exp_vbr_healthy);
        }
        RETURN(req);
} |
| Comments |
| Comment by Peter Jones [ 14/Jun/12 ] |
|
Niu, could you please comment on this one? Thanks, Peter |
| Comment by Bob Glossman (Inactive) [ 14/Jun/12 ] |
|
Niu, |
| Comment by Jinshan Xiong (Inactive) [ 14/Jun/12 ] |
|
Di talked about this problem several days ago, but I don't know if he has made any progress. |
| Comment by Di Wang [ 14/Jun/12 ] |
|
Ah, yes. The problem was indeed introduced by this patch http://review.whamcloud.com/#change,2255 ( |
| Comment by Di Wang [ 14/Jun/12 ] |
|
Here is a workaround fix:

commit 427bcf9eff0a931f64c0986c062d2fea7f87f983

    Since the exports might be held for some reason, so do ...

    Change-Id: I048b66b9c645fa772c34096791a02b6c210cfc23

diff --git a/lustre/obdclass/genops.c b/lustre/obdclass/genops.c
...
        /* release nid stat refererence */
        ...
        obd_zombie_export_add(exp);
        ...
        class_unlink_export(export);
...

Mike said he will have a new patch. |
| Comment by Jay Lan (Inactive) [ 15/Jun/12 ] |
|
Is this WA safe to pick up? I need to rebuild the lustre server for production to deal with a large number of LBUG crashes and freezes on our production systems. |
| Comment by Jay Lan (Inactive) [ 15/Jun/12 ] |
|
Well, let me change my question a bit. It is unfair to ask you to say "it is safe" without going through sanity testing. I would like to know whether you believe the fix is the right fix and would avoid the LBUGs or freezes? |
| Comment by Di Wang [ 15/Jun/12 ] |
|
Well, I actually think it is the right fix, and stable enough, at least in my local sanity test. Hmm, maybe I should submit it to Maloo, and review and test it there. |
| Comment by Niu Yawei (Inactive) [ 15/Jun/12 ] |
|
Mike, are you working on a new patch? Any comments on Jay's question? Thanks. |
| Comment by Di Wang [ 15/Jun/12 ] |
| Comment by Mikhail Pershin [ 16/Jun/12 ] |
|
This fix brings us back to

I am not working on a new patch now, but will think about a proper fix. |
| Comment by Mikhail Pershin [ 18/Jun/12 ] |
|
|
| Comment by Mikhail Pershin [ 18/Jun/12 ] |
|
caused by |
| Comment by Jay Lan (Inactive) [ 18/Jun/12 ] |
|
Is http://review.whamcloud.com/3122 supposed to be on top of the two fixes committed in |
| Comment by Jay Lan (Inactive) [ 18/Jun/12 ] |
|
It seems to be a replacement of one of the |
| Comment by Jay Lan (Inactive) [ 18/Jun/12 ] |
|
Ah, OK, it was indeed supposed to be applied on top of the two commits in |
| Comment by Mikhail Pershin [ 18/Jun/12 ] |
|
Should be just on top of previous patches |
| Comment by Jay Lan (Inactive) [ 18/Jun/12 ] |
|
The patch was against the master branch. Compilation failed on the b2_1 branch with an incompatible pointer type:

/usr/src/redhat/BUILD/lustre-2.1.1/lustre/obdclass/genops.c: In function 'class_export_recovery_cleanup': |
| Comment by Jay Lan (Inactive) [ 18/Jun/12 ] |
|
OK, you simply moved the routine to a different location, so I can do the same to the b2_1 code. |
| Comment by Mikhail Pershin [ 19/Jun/12 ] |
|
Jay, it is not just a moved routine; the major part is also the exp_failed setting/checking. By the way, you can just keep class_export_recovery_cleanup() where it is and keep the other code. I can prepare a patch for b2_1 a bit later. |
| Comment by Niu Yawei (Inactive) [ 19/Jun/12 ] |
|
Reassign to Mike. |
| Comment by Mikhail Pershin [ 19/Jun/12 ] |
|
Jay, check this one: http://review.whamcloud.com/3145 |
| Comment by Bob Glossman (Inactive) [ 19/Jun/12 ] |
|
Mikhail, |
| Comment by Jay Lan (Inactive) [ 19/Jun/12 ] |
|
After applying http://review.whamcloud.com/3122:

LustreError: 10878:0:(mdt_handler.c:5529:mdt_iocontrol()) Aborting recovery for device nbp2-MDT0000 |
| Comment by Jay Lan (Inactive) [ 19/Jun/12 ] |
|
I compared my patch, adapted from review #3122, with #3145; they are essentially identical, except that my patch also moved class_export_recovery_cleanup() to the new location, as #3122 would do. |
| Comment by Mikhail Pershin [ 19/Jun/12 ] |
|
Bob, you are right, that lock doesn't exist in master and I missed it for b2_1. I will update the patch. |
| Comment by Mikhail Pershin [ 19/Jun/12 ] |
|
Jay, that LBUG doesn't look related. Do you see it every time? |
| Comment by Jay Lan (Inactive) [ 19/Jun/12 ] |
|
No, I do not remember seeing that. Not on ASSERTION(cfs_list_empty(&top->loh_lru)). |
| Comment by Jay Lan (Inactive) [ 19/Jun/12 ] |
|
We installed the 2.1.1-2.1nasS build on service160. It crashed on boot-up. Since it is a production machine, the control room put the 2.1.1-2nasS version back in and booted service160 (an MDS) back up. The difference between 2nasS and 2.1nasS was that I replaced Di Wang's #3115 with #3122. |
| Comment by Jay Lan (Inactive) [ 23/Jul/12 ] |
|
Patch set 2 of review #3145 landed on b2_1, but not on master. We had an MDS crash after applying review #3122, which is essentially the same as patch set 1 of #3145. After the crash, I cherry-picked the

So, please comment on whether I should have both |
| Comment by Peter Jones [ 25/Aug/12 ] |
|
Landed for 2.1.3 and 2.3 |