Lustre / LU-1592

ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.3.0, Lustre 2.1.2
    • Labels: None
    • Environment: 2.1.2 servers and clients.
    • Severity: 2
    • Rank (Obsolete): 4468

    Description

      We had three MDS/MGS servers crash within an hour.
      Service110 crashed in class_import_destroy(), while service150 and service170 crashed in class_import_put(). class_import_destroy() failed an assertion because the import refcount was -1. The other two cases look like ORI-710.

      Unfortunately, we were not able to get a vmcore. Kdump crashed.
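
      For context, a minimal compilable sketch of the refcounting pattern involved (a hypothetical simplification, not the actual Lustre code): one unbalanced put leaves the counter at -1 by the time the zombie thread runs the destroy, which is exactly the value in the assertion.

      #include <assert.h>

      struct obd_import {
              int imp_refcount;       /* cfs_atomic_t in the real code */
      };

      static void import_get(struct obd_import *imp)
      {
              imp->imp_refcount++;    /* cfs_atomic_inc() in the real code */
      }

      /* Returns nonzero when the last reference is dropped and the import
       * should be handed to the obd_zombid thread for destruction. */
      static int import_put(struct obd_import *imp)
      {
              return --imp->imp_refcount == 0;
      }

      static void import_destroy(struct obd_import *imp)
      {
              /* class_import_destroy() insists the count is exactly 0; an
               * extra import_put() makes it -1, producing the reported LBUG. */
              assert(imp->imp_refcount == 0);
      }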

      Service110:
      Lustre: nbp4-MDT0000: Export ffff8806560aa400 already connecting from 10.151.5.8@o2ib
      Lustre: nbp4-MDT0000: denying duplicate export for 81811d25-ee59-5ea0-fbaf-31ee49f5aeb7, -114
      Lustre: Skipped 1 previous similar message
      LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 10.151.46.203@o2ib rejected: o2iblnd fatal error
      LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 7 previous similar messages
      LustreError: 4621:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114
      Lustre: nbp4-MDT0000: denying duplicate export for 000566b2-2e24-9e9d-b38c-24016bb34ecd, -114
      Lustre: nbp3-MDT0000: Export ffff880756260c00 already connecting from 10.151.4.136@o2ib
      Lustre: Skipped 3 previous similar messages
      LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 10.151.17.141@o2ib rejected: o2iblnd fatal error
      LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 6 previous similar messages
      LustreError: 4650:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114
      LustreError: 3541:0:(genops.c:930:class_import_destroy()) ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1
      Lustre: nbp4-MDT0000: denying duplicate export for b7f2dcde-c1be-3701-2f14-fec0e7c1b513, -114
      LustreError: 3541:0:(genops.c:930:class_import_destroy()) LBUG
      Pid: 3541, comm: obd_zombid

      We were not able to produce a vmcore; kdump crashed.

      Service150 (and also service170):
      (The crashes on both of these MDSes look like ORI-710.)
      Lustre: 3437:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1404442415660130 sent from nbp5-OST002c-osc-MDT0000 to NID 10.151.25.241@o2ib has timed out for sent delay: [sent 1341262190] [real_sent 0] [current 1341262295] [deadline 105s] [delay 0s] req@ffff880549250800 x1404442415660130/t0(0) o-1->nbp5-OST002c_UUID@10.151.25.241@o2ib:28/4 lens 368/512 e 0 to 1 dl 1341262295 ref 2 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
      Lustre: 3437:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 599 previous similar messages
      Lustre: nbp5-MDT0000: haven't heard from client b04675d7-083f-bc2b-0fa5-2863afc271db (at 10.151.41.133@o2ib) in 279 seconds. I think it's dead, and I am evicting it. exp ffff880bfaff3800, cur 1341262407 expire 1341262257 last 1341262128
      Lustre: Skipped 43832 previous similar messages
      LustreError: 1812:0:(o2iblnd_cb.c:2613:kiblnd_rejected()) 10.151.13.180@o2ib rejected: o2iblnd fatal error
      LustreError: 1812:0:(o2iblnd_cb.c:2613:kiblnd_rejected()) Skipped 3013 previous similar messages
      LustreError: 3694:0:(genops.c:934:class_import_put()) ASSERTION(__v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a)) failed: value: 0
      LustreError: 12396:0:(ldlm_lib.c:965:target_handle_connect()) ee0eaddd-4f30-488b-720c-5ffbddbd6ae9: 10.151.25.237@o2ib already connected at higher conn_cnt: 8 > 6
      LustreError: 12389:0:(ldlm_lib.c:965:target_handle_connect()) ee0eaddd-4f30-488b-720c-5ffbddbd6ae9: 10.151.25.237@o2ib already connected at higher conn_cnt: 8 > 7
      LustreError: 12396:0:(mgs_handler.c:783:mgs_handle()) MGS handle cmd=250 rc=-114
      LustreError: 12396:0:(mgs_handler.c:783:mgs_handle()) Skipped 1 previous similar message
      LustreError: 3694:0:(genops.c:934:class_import_put()) LBUG
      Pid: 3694, comm: ll_mgs_01

      Attachments

        Activity

          [LU-1592] ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1
          bobijam Zhenyu Xu added a comment -

          Patch tracking at http://review.whamcloud.com/3684

          Patch description:
          LU-1592 ldlm: protect obd_export:exp_imp_reverse's change
          
          * Protect obd_export::exp_imp_reverse from reconnect and destroy race.
          * Add an assertion in class_import_put() to catch race in the first
            place.
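
          The gist of the change, as a rough sketch (assumptions: the export's exp_lock spinlock is the guard and the swapped-out import gets exactly one put; see the patch itself for the real code):

          static void export_set_reverse_import(struct obd_export *exp,
                                                struct obd_import *new_imp)
          {
                  struct obd_import *old_imp;

                  /* Serialize replacement of exp_imp_reverse so a reconnect
                   * cannot swap in a new reverse import while destruction is
                   * putting the old one. */
                  spin_lock(&exp->exp_lock);
                  old_imp = exp->exp_imp_reverse;
                  exp->exp_imp_reverse = new_imp;
                  spin_unlock(&exp->exp_lock);

                  /* Exactly one put for the import that was swapped out. */
                  if (old_imp != NULL)
                          class_import_put(old_imp);
          }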
          

          jaylan Jay Lan (Inactive) added a comment -

          The output of the "bt -a" command from crash.
          bobijam Zhenyu Xu added a comment -

          Jay Lan,

          Could you grab and upload all thread stacks when this happens, so that we can see what is racing with the import destroy?

          bobijam Zhenyu Xu added a comment -

          From the descriptions, there must be somewhere that a class_import_put() does not match its get operation. Also, the two threads calling class_import_put() must have been in a race condition, since at the beginning of class_import_put() there are two assertions that make sure no additional put is issued after the last refcount reaches 0:

          LASSERT(cfs_list_empty(&imp->imp_zombie_chain));
          LASSERT_ATOMIC_GT_LT(&imp->imp_refcount, 0, LI_POISON);
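
          For reference, LASSERT_ATOMIC_GT_LT checks that the counter lies strictly between its bounds. Roughly, reconstructed from the assertion text in the service150 log (a sketch, not copied verbatim from the source):

          #define LASSERT_ATOMIC_GT_LT(a, gt, lt)                         \
          do {                                                            \
                  int __v = cfs_atomic_read(a);                           \
                  LASSERTF(__v > (gt) && __v < (lt), "value: %d\n", __v); \
          } while (0)

          A put on an import whose refcount has already reached 0 therefore fires it immediately (the "value: 0" in the log above), as would the 0x5a5a5a5a poison pattern left in freed memory.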

          still investigating...


          jaylan Jay Lan (Inactive) added a comment -

          Please reopen this ticket. Last Friday two MDSes in our production systems crashed on this bug. Both systems ran 2.1.2-2nasS, which contains the patch from LU-1432:
          https://github.com/jlan/lustre-nas/commits/nas-2.1.2

          The console showed the following messages:

          LustreError: 3669:0:(genops.c:930:class_import_destroy()) ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1
          LustreError: 3669:0:(genops.c:930:class_import_destroy()) LBUG
          Pid: 3669, comm: obd_zombid

          Call Trace:
          [<ffffffffa0598855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          [<ffffffffa0598e95>] lbug_with_loc+0x75/0xe0 [libcfs]
          7 out of 8 cpus in kdb, waiting for the rest, timeout in 10 second(s)
          [<ffffffffa0663a06>] class_import_destroy+0x3a6/0x3b0 [obdclass]
          [<ffffffffa06678ba>] obd_zombie_impexp_cull+0xda/0x5a0 [obdclass]
          [<ffffffff810903ac>] ? remove_wait_queue+0x3c/0x50
          [<ffffffffa0667e85>] obd_zombie_impexp_thread+0x105/0x270 [obdclass]
          [<ffffffff8105fff0>] ? default_wake_function+0x0/0x20
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]
          [<ffffffff8100c14a>] child_rip+0xa/0x20
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]
          [<ffffffff8100c140>] ? child_rip+0x0/0x20

          Kernel panic - not syncing: LBUG
          Pid: 3669, comm: obd_zombid Tainted: G I---------------- 2.6.32-220.4.1.el6.20120607.x86_64.lustre212 #1
          Call Trace:
          [<ffffffff81520c56>] ? panic+0x78/0x164
          [<ffffffffa0598eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
          [<ffffffffa0663a06>] ? class_import_destroy+0x3a6/0x3b0 [obdclass]
          [<ffffffffa06678ba>] ? obd_zombie_impexp_cull+0xda/0x5a0 [obdclass]
          [<ffffffff810903ac>] ? remove_wait_queue+0x3c/0x50
          [<ffffffffa0667e85>] ? obd_zombie_impexp_thread+0x105/0x270 [obdclass]
          [<ffffffff8105fff0>] ? default_wake_function+0x0/0x20
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]
          [<ffffffff8100c14a>] ? child_rip+0xa/0x20
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]
          [<ffffffff8100c140>] ? child_rip+0x0/0x20

          pjones Peter Jones added a comment -

          Closing as a duplicate of LU-1432, which landed for 2.3 on July 19th. Please reopen if this issue is encountered with that code in place.


          ian Ian Colle (Inactive) added a comment -

          Also observed at LLNL on Orion.


          dmoreno Diego Moreno (Inactive) added a comment -

          At Bull we also hit this issue.

          We're going to wait for patch approval before installing it.


          jaylan Jay Lan (Inactive) added a comment -

          I cherry-picked the patches from LU-1432 and LU-1428 and built a new set of images. The new images will take effect the next time our Lustre servers LBUG or have to reboot for whatever reason.


          jaylan Jay Lan (Inactive) added a comment -

          Of the five MGS crashes yesterday, three were on S170, one on S110, and one on S150. S110 runs 2.1.2, while S170 and S150 run two slightly different versions of 2.1.1. Of the three filesystems, nobackupp1 (i.e. S17*) is the most heavily used, which probably explains why it crashed three times yesterday.

          The other very heavily used filesystem is nobackupp2, which survived yesterday; nobackupp2 runs 2.1.2, the same version as S110. We have a planned upgrade of nobackupp1 to 2.1.2 today, and we will proceed as planned.

          I want to document that here so that we all know S170 will be upgraded to 2.1.2 today and will no longer run the code that crashed three times yesterday. However, since S110 also crashed yesterday, I believe the problem still exists in 2.1.2.

          bobijam Zhenyu Xu added a comment -

          This looks like a race between obd cleanup and target_handle_connect(), like LU-1432; http://review.whamcloud.com/#change,3244 should avoid this kind of race.
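
          A sketch of the suspected racy pattern, for illustration only (an assumption, not traced from a vmcore; new_reverse_import() is a hypothetical helper):

          static void racy_reconnect(struct obd_export *exp)
          {
                  /* Unlocked read: a concurrent cleanup may be about to
                   * put this same import. */
                  struct obd_import *old_imp = exp->exp_imp_reverse;

                  exp->exp_imp_reverse = new_reverse_import(exp);

                  /* If cleanup already put old_imp, this second put drops
                   * imp_refcount from 0 to -1, and obd_zombid later trips
                   * ASSERTION(imp_refcount == 0) in class_import_destroy(). */
                  class_import_put(old_imp);
          }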


          People

            Assignee: bobijam Zhenyu Xu
            Reporter: jaylan Jay Lan (Inactive)
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: