[LU-1592] ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1 Created: 02/Jul/12  Updated: 22/Dec/12  Resolved: 31/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.1.2
Fix Version/s: Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.4

Type: Bug Priority: Minor
Reporter: Jay Lan (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

2.1.2 servers and clients.


Attachments: File bt-a.out    
Severity: 2
Rank (Obsolete): 4468

 Description   

We had three MDS/MGS servers crash within an hour.
Service110 crashed in class_import_destroy(), while service150 and service170 crashed in class_import_put(). class_import_destroy() failed on an ASSERT because the refcount was -1. The other two cases look like ORI-710.

Unfortunately, we were not able to get a vmcore. Kdump crashed.

Service110:
Lustre: nbp4-MDT0000: Export ffff8806560aa400 already connecting from 10.151.5.8@o2ib^M
Lustre: nbp4-MDT0000: denying duplicate export for 81811d25-ee59-5ea0-fbaf-31ee49f5aeb7, -114^M
Lustre: Skipped 1 previous similar message^M
LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 10.151.46.203@o2ib rejected: o2iblnd fatal error^M
LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 7 previous similar messages^M
LustreError: 4621:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114^M
Lustre: nbp4-MDT0000: denying duplicate export for 000566b2-2e24-9e9d-b38c-24016bb34ecd, -114^M
Lustre: nbp3-MDT0000: Export ffff880756260c00 already connecting from 10.151.4.136@o2ib^M
Lustre: Skipped 3 previous similar messages^M
LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 10.151.17.141@o2ib rejected: o2iblnd fatal error^M
LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 6 previous similar messages^M
LustreError: 4650:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114^M
LustreError: 3541:0:(genops.c:930:class_import_destroy()) ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1^M
Lustre: nbp4-MDT0000: denying duplicate export for b7f2dcde-c1be-3701-2f14-fec0e7c1b513, -114^M
LustreError: 3541:0:(genops.c:930:class_import_destroy()) LBUG^M
Pid: 3541, comm: obd_zombid^M

We were not able to produce a vmcore. Kdump crashed.

Service150 (and also service170):
(The crash on both MDS looks like ORI-710)
Lustre: 3437:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1404442415660130 sent from nbp5-OST002c-osc-MDT0000 to NID 10.151.25.241@o2ib has timed out for sent delay: [sent 1341262190] [real_sent 0] [current 1341262295] [deadline 105s] [delay 0s] req@ffff880549250800 x1404442415660130/t0(0) o-1->nbp5-OST002c_UUID@10.151.25.241@o2ib:28/4 lens 368/512 e 0 to 1 dl 1341262295 ref 2 fl Rpc:XN/ffffffff/ffffffff rc 0/-1^M
Lustre: 3437:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 599 previous similar messages^M
Lustre: nbp5-MDT0000: haven't heard from client b04675d7-083f-bc2b-0fa5-2863afc271db (at 10.151.41.133@o2ib) in 279 seconds. I think it's dead, and I am evicting it. exp ffff880bfaff3800, cur 1341262407 expire 1341262257 last 1341262128^M
Lustre: Skipped 43832 previous similar messages^M
LustreError: 1812:0:(o2iblnd_cb.c:2613:kiblnd_rejected()) 10.151.13.180@o2ib rejected: o2iblnd fatal error^M
LustreError: 1812:0:(o2iblnd_cb.c:2613:kiblnd_rejected()) Skipped 3013 previous similar messages^M
LustreError: 3694:0:(genops.c:934:class_import_put()) ASSERTION(__v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a)) failed: value: 0^M
LustreError: 12396:0:(ldlm_lib.c:965:target_handle_connect()) ee0eaddd-4f30-488b-720c-5ffbddbd6ae9: 10.151.25.237@o2ib already connected at higher conn_cnt: 8 > 6^M
LustreError: 12389:0:(ldlm_lib.c:965:target_handle_connect()) ee0eaddd-4f30-488b-720c-5ffbddbd6ae9: 10.151.25.237@o2ib already connected at higher conn_cnt: 8 > 7^M
LustreError: 12396:0:(mgs_handler.c:783:mgs_handle()) MGS handle cmd=250 rc=-114^M
LustreError: 12396:0:(mgs_handler.c:783:mgs_handle()) Skipped 1 previous similar message^M
LustreError: 3694:0:(genops.c:934:class_import_put()) LBUG^M
Pid: 3694, comm: ll_mgs_01^M



 Comments   
Comment by Jay Lan (Inactive) [ 02/Jul/12 ]

S110 runs 2.1.2 but s150 and s170 run 2.1.1.

We actually had a vmcore on s150. Let me know what crash command output you want me to provide.

Comment by Jay Lan (Inactive) [ 02/Jul/12 ]

Here is the stack trace of the running process (on service150):

PID: 3694 TASK: ffff880c1ecda0c0 CPU: 0 COMMAND: "ll_mgs_01"
#0 [ffff880c1e7836b8] machine_kexec at ffffffff8103204b
#1 [ffff880c1e783718] crash_kexec at ffffffff810b8472
#2 [ffff880c1e7837e8] kdb_kdump_check at ffffffff8127a53f
#3 [ffff880c1e7837f8] kdb_main_loop at ffffffff8127d757
#4 [ffff880c1e783908] kdb_save_running at ffffffff81277aae
#5 [ffff880c1e783918] kdba_main_loop at ffffffff8144a538
#6 [ffff880c1e783958] kdb at ffffffff8127aa57
#7 [ffff880c1e7839c8] panic at ffffffff81520c97
#8 [ffff880c1e783a48] lbug_with_loc at ffffffffa05e3eeb
#9 [ffff880c1e783a98] class_import_put at ffffffffa06afb48
#10 [ffff880c1e783ae8] client_destroy_import at ffffffffa079e92e
#11 [ffff880c1e783b08] target_handle_connect at ffffffffa07a058e
#12 [ffff880c1e783cc8] mgs_handle at ffffffffa0b94cbd
#13 [ffff880c1e783d88] ptlrpc_main at ffffffffa07e842e
#14 [ffff880c1e783f48] kernel_thread at ffffffff8100c14a

Comment by Jay Lan (Inactive) [ 02/Jul/12 ]

Service170 crashed two more times. One crash is like the trace above in class_import_put(); the other is yet another variant.

So far we have seen three different variants, but all are triggered by
Lustre: MGS: denying duplicate export for 63509613-c870-1438-aa35-858e9606c728, -114
Lustre: MGS: Export ffff88079183c400 already connecting from 10.151.26.31@o2ib

and eventually the server crashes.

Here is the third variant:
Lustre: MGS: denying duplicate export for 63509613-c870-1438-aa35-858e9606c728, -114
Lustre: MGS: Export ffff88079183c400 already connecting from 10.151.26.31@o2ib
Lustre: Skipped 4 previous similar messages
LustreError: 4756:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114
LustreError: 4761:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114
Lustre: MGS: Client 63509613-c870-1438-aa35-858e9606c728 (at 10.151.26.31@o2ib) reconnecting
LustreError: 4490:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114
LustreError: 4761:0:(obd_class.h:501:obd_set_info_async()) obd_set_info_async: dev 0 no operation
LustreError: 4761:0:(obd_class.h:501:obd_set_info_async()) Skipped 5 previous similar messages
LustreError: 4761:0:(genops.c:1586:obd_zombie_import_add()) ASSERTION(imp->imp_sec == NULL) failed
LustreError: 4761:0:(genops.c:1586:obd_zombie_import_add()) LBUG
Pid: 4761, comm: ll_mgs_07

Call Trace:
[<ffffffffa05e0855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa05e0e95>] lbug_with_loc+0x75/0xe0 [libcfs]
Lustre: MGS: Export ffff88079183c400 already connecting from 10.151.26.31@o2ib
[<ffffffffa05ebda6>] libcfs_assertion_failed+0x66/0x70 [libcfs]
[<ffffffffa06b560b>] class_import_put+0x2cb/0x300 [obdclass]
[<ffffffffa07a6787>] target_handle_connect+0x10a7/0x3070 [ptlrpc]
[<ffffffffa0b9fdcd>] mgs_handle+0x3fd/0x19b0 [mgs]
[<ffffffffa06e9f0f>] ? keys_fill+0x6f/0x180 [obdclass]
[<ffffffffa07ddce4>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
[<ffffffffa07ee7be>] ptlrpc_main+0xb7e/0x18f0 [ptlrpc]
[<ffffffffa07edc40>] ? ptlrpc_main+0x0/0x18f0 [ptlrpc]
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffffa07edc40>] ? ptlrpc_main+0x0/0x18f0 [ptlrpc]
[<ffffffffa07edc40>] ? ptlrpc_main+0x0/0x18f0 [ptlrpc]
[<ffffffff8100c140>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG
Pid: 4761, comm: ll_mgs_07 Not tainted 2.6.32-220.4.1.el6.20120130.x86_64.lustre211 #1
...

We had a vmcore of this third variant.

The git source for service170 and service150 is at:
https://github.com/jlan/lustre-nas/commits/nas-2.1.1

Service170's tag 2.1.1-2nasS is on this commit (June 18, 2012)
"LU-685 obdclass: lu_object reclamation is inefficient"

The git source for service160 is at:
https://github.com/jlan/lustre-nas/commits/nas-2.1.2
Service160's tag 2.1.2-1nasS is on this commit (June 19, 2012)
"NAS: Provide meanful lustre version information to procfs"

Comment by Jinshan Xiong (Inactive) [ 03/Jul/12 ]

I took a look at this issue. From the backtrace, it seems an extra refcount on exp_imp_reverse was dropped, and all of the variants you have seen are related to this.

I guess this issue is due to a defect in reconnect handling; it needs more investigation.
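
As a rough illustration of that hypothesis, here is a small stand-alone model (plain C with pthreads, not Lustre code; all names are illustrative only): two reconnect handlers race on the same export, both read the old exp_imp_reverse pointer before either replaces it, and both put it, so the single reference the export held is dropped twice.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct import { atomic_int refcount; };

static struct import *exp_imp_reverse;        /* models obd_export::exp_imp_reverse */
static struct import  old_import = { 1 };     /* the one reference held by the export */

static void *reconnect(void *arg)
{
        /* Unprotected read: with unlucky timing both threads see old_import. */
        struct import *old = exp_imp_reverse;
        struct import *new_imp = calloc(1, sizeof(*new_imp));

        atomic_store(&new_imp->refcount, 1);
        exp_imp_reverse = new_imp;               /* unprotected replace */
        atomic_fetch_sub(&old->refcount, 1);     /* models client_destroy_import(old) */
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        exp_imp_reverse = &old_import;
        pthread_create(&t1, NULL, reconnect, NULL);
        pthread_create(&t2, NULL, reconnect, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* If both threads saw the same old pointer, this prints -1 instead of 0. */
        printf("old import refcount = %d\n", atomic_load(&old_import.refcount));
        return 0;
}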

Comment by Peter Jones [ 03/Jul/12 ]

Bobijam is looking at this one

Comment by Zhenyu Xu [ 03/Jul/12 ]

This looks like a race between obd cleanup and target_handle_connect, like LU-1432; http://review.whamcloud.com/#change,3244 should avoid this kind of race.

Comment by Jay Lan (Inactive) [ 03/Jul/12 ]

Of the five MGS crashes yesterday, 3 were on S170, 1 on S110, and 1 on S150. S110 runs 2.1.2, but S170 and S150 run two slightly different versions of 2.1.1. Of the three filesystems, nobackupp1 (i.e. S17*) is the most heavily used, which probably explains why it crashed 3 times yesterday.

The other very heavily used filesystem is nobackupp2, which survived yesterday; nobackupp2 runs 2.1.2 (the same version as S110). We have a planned upgrade of nobackupp1 to 2.1.2 today and will proceed as planned.

I want to document that here so that we all know S170 will be upgraded to 2.1.2 today and will no longer be running the same code that crashed 3 times yesterday. However, since S110 also crashed yesterday, I believe the problem still exists in 2.1.2.

Comment by Jay Lan (Inactive) [ 05/Jul/12 ]

I cherry-picked the patches from LU-1432 and LU-1428 and built a new set of images. The new images will take effect the next time our Lustre servers LBUG or have to be rebooted for whatever reason.

Comment by Diego Moreno (Inactive) [ 06/Jul/12 ]

At Bull we also hit this issue.

We're going to wait for patch approval before installing it.

Comment by Ian Colle (Inactive) [ 31/Jul/12 ]

Also observed at LLNL on Orion

Comment by Peter Jones [ 13/Aug/12 ]

Closing as a duplicate of LU-1432 which landed for 2.3 on July 19th. Please reopen if this issue is encountered with that code in place

Comment by Jay Lan (Inactive) [ 13/Aug/12 ]

Please reopen this ticket. Last Friday two MDSes in our production systems crashed on this bug. Both systems ran 2.1.2-2nasS, which contains the patch from LU-1432.
https://github.com/jlan/lustre-nas/commits/nas-2.1.2

The console showed the messages:

LustreError: 3669:0:(genops.c:930:class_import_destroy()) ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1^M
LustreError: 3669:0:(genops.c:930:class_import_destroy()) LBUG^M
Pid: 3669, comm: obd_zombid^M
^M
Call Trace:^M
[<ffffffffa0598855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M
[<ffffffffa0598e95>] lbug_with_loc+0x75/0xe0 [libcfs]^M
 7 out of 8 cpus in kdb, waiting for the rest, timeout in 10 second(s)^M
[<ffffffffa0663a06>] class_import_destroy+0x3a6/0x3b0 [obdclass]^M
[<ffffffffa06678ba>] obd_zombie_impexp_cull+0xda/0x5a0 [obdclass]^M
[<ffffffff810903ac>] ? remove_wait_queue+0x3c/0x50^M
[<ffffffffa0667e85>] obd_zombie_impexp_thread+0x105/0x270 [obdclass]^M
[<ffffffff8105fff0>] ? default_wake_function+0x0/0x20^M
[<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]^M
[<ffffffff8100c14a>] child_rip+0xa/0x20^M
[<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]^M
[<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]^M
[<ffffffff8100c140>] ? child_rip+0x0/0x20^M
^M
Kernel panic - not syncing: LBUG^M
Pid: 3669, comm: obd_zombid Tainted: G I---------------- 2.6.32-220.4.1.el6.20120607.x86_64.lustre212 #1^M
Call Trace:^M
[<ffffffff81520c56>] ? panic+0x78/0x164^M
[<ffffffffa0598eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]^M
[<ffffffffa0663a06>] ? class_import_destroy+0x3a6/0x3b0 [obdclass]^M
[<ffffffffa06678ba>] ? obd_zombie_impexp_cull+0xda/0x5a0 [obdclass]^M
[<ffffffff810903ac>] ? remove_wait_queue+0x3c/0x50^M
[<ffffffffa0667e85>] ? obd_zombie_impexp_thread+0x105/0x270 [obdclass]^M
[<ffffffff8105fff0>] ? default_wake_function+0x0/0x20^M
[<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]^M
[<ffffffff8100c14a>] ? child_rip+0xa/0x20^M
[<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]^M
[<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]^M
[<ffffffff8100c140>] ? child_rip+0x0/0x20^M

Comment by Zhenyu Xu [ 13/Aug/12 ]

From the descriptions, there must be a place where class_import_put() does not have a matching get operation, and the two threads calling class_import_put() are likely in a race condition, since at the beginning of class_import_put() there are two assertions to make sure no additional put is called after the last refcount reaches 0:

LASSERT(cfs_list_empty(&imp->imp_zombie_chain));
LASSERT_ATOMIC_GT_LT(&imp->imp_refcount, 0, LI_POISON);

still investigating...
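
For what it's worth, here is a minimal stand-alone model (plain C with pthreads, not Lustre code) of the check-then-decrement pattern those assertions implement. It suggests why the two different crash signatures could come from the same race: if the second racer checks after the first has already decremented, it sees 0 and the class_import_put() assertion fires (the service150/170 crashes); if both racers check before either decrements, both pass, the counter goes 1 -> 0 -> -1, and the obd_zombid thread later trips the class_import_destroy() assertion with value -1 (the service110 crash).

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int imp_refcount = 1;   /* single remaining reference on the import */

static void *import_put(void *arg)
{
        int seen = atomic_load(&imp_refcount);

        /* Models LASSERT_ATOMIC_GT_LT(): fires here if the other put already ran. */
        assert(seen > 0);
        atomic_fetch_sub(&imp_refcount, 1);   /* models cfs_atomic_dec_and_test() */
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, import_put, NULL);
        pthread_create(&b, NULL, import_put, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        /* With the unbalanced second put this ends at -1 (the class_import_destroy()
         * signature) unless the assert above fired first (the class_import_put()
         * signature); a balanced get/put pairing would leave it at 0. */
        printf("imp_refcount = %d\n", atomic_load(&imp_refcount));
        return 0;
}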

Comment by Zhenyu Xu [ 15/Aug/12 ]

Jay Lan,

Can you grab and upload all thread stacks when this happens, so that we can see what is racing with the import destroy?

Comment by Jay Lan (Inactive) [ 15/Aug/12 ]

Attached is the output of the "bt -a" command from crash.

Comment by Zhenyu Xu [ 16/Aug/12 ]

patch tracking at http://review.whamcloud.com/3684

patch description
LU-1592 ldlm: protect obd_export:exp_imp_reverse's change

* Protect obd_export::exp_imp_reverse from reconnect and destroy race.
* Add an assertion in class_import_put() to catch race in the first
  place.
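
As a rough sketch of what that description implies (this is not the actual patch; the lock and helper names below are assumptions for illustration), the reverse import pointer is read and replaced under the export's lock, so only one thread ever obtains, and puts, the old import:

#include <pthread.h>
#include <stddef.h>

struct import { int refcount; };

struct export {
        pthread_mutex_t  exp_lock;
        struct import   *exp_imp_reverse;
};

/*
 * Swap in a new reverse import and hand the old one back to the caller.
 * Because the pointer is read and replaced under exp_lock, a concurrent
 * reconnect or destroy cannot obtain the same old pointer, so the old
 * import's reference is dropped exactly once.
 */
static struct import *swap_reverse_import(struct export *exp, struct import *new_imp)
{
        struct import *old;

        pthread_mutex_lock(&exp->exp_lock);
        old = exp->exp_imp_reverse;
        exp->exp_imp_reverse = new_imp;
        pthread_mutex_unlock(&exp->exp_lock);

        return old;   /* caller puts this once, outside the lock */
}

int main(void)
{
        struct import old = { 1 }, new_imp = { 1 };
        struct export exp = { PTHREAD_MUTEX_INITIALIZER, &old };

        struct import *victim = swap_reverse_import(&exp, &new_imp);
        victim->refcount--;   /* the single, well-defined put of the old import */
        return 0;
}
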
Comment by Peter Jones [ 31/Aug/12 ]

Landed for 2.3 and 2.4

Comment by Jay Lan (Inactive) [ 04/Sep/12 ]

Could you please land this patch on the b2_1 branch? Thanks!

Comment by Zhenyu Xu [ 04/Sep/12 ]

b2_1 patch port tracking at http://review.whamcloud.com/3869
