[LU-1592] ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1 Created: 02/Jul/12 Updated: 22/Dec/12 Resolved: 31/Aug/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.1.2 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jay Lan (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
2.1.2 servers and clients. |
||
| Attachments: |
|
| Severity: | 2 |
| Rank (Obsolete): | 4468 |
| Description |
|
Three of our MDS/MGS nodes crashed within an hour. Unfortunately, we were not able to get a vmcore. Kdump crashed. Service110: We were not able to produce a vmcore. Kdump crashed. Service150 (and also service170): |
| Comments |
| Comment by Jay Lan (Inactive) [ 02/Jul/12 ] |
|
S110 runs 2.1.2 but s150 and s170 run 2.1.1. We actually had a vmcore on s150. Let me know what crash command output you want me to provide. |
| Comment by Jay Lan (Inactive) [ 02/Jul/12 ] |
|
Here is the stack trace of the running process (on service150): PID: 3694 TASK: ffff880c1ecda0c0 CPU: 0 COMMAND: "ll_mgs_01" |
| Comment by Jay Lan (Inactive) [ 02/Jul/12 ] |
|
Service170 crashed two more times. One is like the trace above in class_import_put(). The other is another variant. So far we have seen three different variants. But all triggered by and eventually would crash. Here is the third variant: Call Trace: Kernel panic - not syncing: LBUG We had a vmcore of this third variant. The git source for service170 and service150 is at: Service170's tag 2.1.1-2nasS is on this commit (June 18, 2012) The git source for service160 is at: |
| Comment by Jinshan Xiong (Inactive) [ 03/Jul/12 ] |
|
I took a look at this issue. From the backtrace, it seems one extra refcount of exp_imp_reverse was dropped, and all of the variants you have seen are related to this. I guess this issue is due to a defect in reconnect handling; I need to investigate more. |
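To illustrate the hypothesis above, here is a minimal user-space sketch of how a single unmatched put leaves a reference count at -1, the value reported in the assertion. It is not Lustre code: `toy_import`, `toy_import_get()` and `toy_import_put()` are hypothetical stand-ins, and C11 atomics are assumed in place of the kernel's `cfs_atomic_*` helpers.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for an import; only the refcount matters here. */
struct toy_import {
    atomic_int refcount;
};

void toy_import_init(struct toy_import *imp)
{
    atomic_init(&imp->refcount, 1);   /* the creator holds the first reference */
}

/* Take a reference; returns the new count. */
int toy_import_get(struct toy_import *imp)
{
    return atomic_fetch_add(&imp->refcount, 1) + 1;
}

/* Drop a reference without any sanity check; returns the new count. */
int toy_import_put(struct toy_import *imp)
{
    return atomic_fetch_sub(&imp->refcount, 1) - 1;
}

/* One get, two matching puts, then one stray put. Returns the final count. */
int toy_unbalanced_sequence(void)
{
    struct toy_import imp;

    toy_import_init(&imp);          /* count: 1 */
    toy_import_get(&imp);           /* count: 2 */
    toy_import_put(&imp);           /* count: 1 */
    toy_import_put(&imp);           /* count: 0, destroy would be scheduled here */
    return toy_import_put(&imp);    /* stray put with no matching get */
}
```

In this sketch, a destroy routine that checks for a count of exactly 0 after the stray put would see -1 instead, which is the shape of the failure reported here.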
| Comment by Peter Jones [ 03/Jul/12 ] |
|
Bobijam is looking at this one |
| Comment by Zhenyu Xu [ 03/Jul/12 ] |
|
this looks like obd cleanup and target_handle_connect race, like |
| Comment by Jay Lan (Inactive) [ 03/Jul/12 ] |
|
Of the five MGS crashes yesterday, 3 were on S170, 1 on S110, and 1 on S150. Of these, S110 runs 2.1.2, but S170 and S150 run two slightly different versions of 2.1.1. Of the three filesystems, nobackupp1 (i.e. S17*) is the most heavily used, which probably explains why it crashed 3 times yesterday. The other very heavily used filesystem is nobackupp2, which survived yesterday. nobackupp2 runs 2.1.2 (same version as S110). We have a planned upgrade of nobackupp1 to 2.1.2 today and we will proceed as planned. I want to document that here so that we all know that S170 will be upgraded to 2.1.2 today and will no longer run the code that crashed 3 times yesterday. However, since S110 also crashed yesterday, I believe the problem, whatever it is, still exists in 2.1.2. |
| Comment by Jay Lan (Inactive) [ 05/Jul/12 ] |
|
I cherry-picked the patch from |
| Comment by Diego Moreno (Inactive) [ 06/Jul/12 ] |
|
At Bull we also hit this issue. We're going to wait for patch approval before installing it. |
| Comment by Ian Colle (Inactive) [ 31/Jul/12 ] |
|
Also observed at LLNL on Orion |
| Comment by Peter Jones [ 13/Aug/12 ] |
|
Closing as a duplicate of |
| Comment by Jay Lan (Inactive) [ 13/Aug/12 ] |
|
Please reopen this ticket. Last Friday two MDSes in our production systems crashed on this bug. Both systems ran 2.1.2-2nasS, which contains patch of The console showed the messages: LustreError: 3669:0:(genops.c:930:class_import_destroy()) ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1 |
| Comment by Zhenyu Xu [ 13/Aug/12 ] |
|
From the description, somewhere a class_import_put() call does not match its get operation. Also, the two threads calling class_import_put() must be in a race condition, since class_import_put() begins with two assertions that make sure no additional put is called after the last refcount reaches 0: LASSERT(cfs_list_empty(&imp->imp_zombie_chain)); Still investigating... |
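As a hedged illustration of catching an unmatched put at the moment it happens, the sketch below refuses to decrement a count that is already 0. This is not what `class_import_put()` does (it relies on `LASSERT`); it swaps in a C11 compare-and-swap loop, and `guarded_ref`, `guarded_ref_put()` and `guarded_ref_demo()` are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference counter that never goes below zero. */
struct guarded_ref {
    atomic_int count;
};

void guarded_ref_init(struct guarded_ref *r, int initial)
{
    atomic_init(&r->count, initial);
}

/*
 * Drop one reference only if at least one is held.
 * Returns true for a legitimate put, false for an over-put.
 */
bool guarded_ref_put(struct guarded_ref *r)
{
    int cur = atomic_load(&r->count);

    while (cur > 0) {
        /* On failure, cur is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(&r->count, &cur, cur - 1))
            return true;
    }
    return false;   /* count was already 0: this put has no matching get */
}

/* Returns how many puts were rejected in a one-reference, two-put sequence. */
int guarded_ref_demo(void)
{
    struct guarded_ref r;
    int rejected = 0;

    guarded_ref_init(&r, 1);
    if (!guarded_ref_put(&r))   /* legitimate: 1 -> 0 */
        rejected++;
    if (!guarded_ref_put(&r))   /* over-put: rejected, count stays 0 */
        rejected++;
    assert(atomic_load(&r.count) == 0);
    return rejected;
}
```

Because the check and the decrement happen in one atomic step, two racing callers cannot both drop the last reference in this sketch; the loser is reported as the over-put.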
| Comment by Zhenyu Xu [ 15/Aug/12 ] |
|
Jay Lan, can you grab and upload all thread stacks when this happens, so that we can see what is racing with the import destroy? |
| Comment by Jay Lan (Inactive) [ 15/Aug/12 ] |
|
The output of "bt -a" command from crash. |
| Comment by Zhenyu Xu [ 16/Aug/12 ] |
|
Patch tracking at http://review.whamcloud.com/3684
Patch description:
LU-1592 ldlm: protect obd_export:exp_imp_reverse's change
* Protect obd_export::exp_imp_reverse from reconnect and destroy race.
* Add an assertion in class_import_put() to catch race in the first place. |
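The patch itself is not reproduced in this ticket. As a rough sketch of the general idea in its description, the example below swaps a reverse-import pointer under a lock so that the reconnect and destroy paths can never both drop the same reference, and asserts in the put path to catch an extra put immediately. All names (`rev_import`, `rev_export`, `rev_reconnect()`, `rev_destroy()`) and the use of a pthread mutex are assumptions made for illustration, not the actual Lustre change.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct rev_import {
    atomic_int refcount;
};

struct rev_export {
    pthread_mutex_t lock;            /* hypothetical lock guarding imp_reverse */
    struct rev_import *imp_reverse;  /* stand-in for exp_imp_reverse */
};

void rev_import_init(struct rev_import *imp)
{
    atomic_init(&imp->refcount, 1);
}

void rev_export_init(struct rev_export *exp, struct rev_import *imp)
{
    pthread_mutex_init(&exp->lock, NULL);
    exp->imp_reverse = imp;
}

void rev_import_put(struct rev_import *imp)
{
    int old = atomic_fetch_sub(&imp->refcount, 1);

    assert(old > 0);   /* catch an extra put where it happens */
}

/* Replace the pointer under the lock; the caller owns the returned old value. */
struct rev_import *rev_swap(struct rev_export *exp, struct rev_import *new_imp)
{
    pthread_mutex_lock(&exp->lock);
    struct rev_import *old = exp->imp_reverse;
    exp->imp_reverse = new_imp;
    pthread_mutex_unlock(&exp->lock);
    return old;
}

/* Reconnect path: install a fresh import and drop the previous one. */
void rev_reconnect(struct rev_export *exp, struct rev_import *fresh)
{
    struct rev_import *old = rev_swap(exp, fresh);

    if (old != NULL)
        rev_import_put(old);
}

/* Destroy path: detach whatever import is installed and drop it once. */
void rev_destroy(struct rev_export *exp)
{
    struct rev_import *old = rev_swap(exp, NULL);

    if (old != NULL)
        rev_import_put(old);
}

/* Returns 1 if every import ends with exactly zero references. */
int rev_demo(void)
{
    struct rev_import a, b;
    struct rev_export exp;

    rev_import_init(&a);
    rev_import_init(&b);
    rev_export_init(&exp, &a);
    rev_reconnect(&exp, &b);   /* drops a */
    rev_destroy(&exp);         /* drops b */
    rev_destroy(&exp);         /* finds NULL, drops nothing */
    return atomic_load(&a.refcount) == 0 &&
           atomic_load(&b.refcount) == 0 &&
           exp.imp_reverse == NULL;
}
```

The design point is that each pointer value is handed to exactly one caller by the swap, so the put happens once per import regardless of how the two paths interleave.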
| Comment by Peter Jones [ 31/Aug/12 ] |
|
Landed for 2.3 and 2.4 |
| Comment by Jay Lan (Inactive) [ 04/Sep/12 ] |
|
Could you please land this patch to b2_1 branch? Thanks! |
| Comment by Zhenyu Xu [ 04/Sep/12 ] |
|
b2_1 patch port tracking at http://review.whamcloud.com/3869 |