Details
Bug
Resolution: Fixed
Minor
Lustre 2.3.0, Lustre 2.1.2
None
2.1.2 servers and clients.
2
4468
Description
Three MDS/MGS servers crashed within one hour.
Service110 crashed in class_import_destroy(), while service150 and service170 crashed in class_import_put(). The class_import_destroy() crash was a failed assertion: the import refcount was -1 where 0 was expected. The other two crashes look like ORI-710.
Unfortunately, we were not able to get a vmcore. Kdump crashed.
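As an illustration only, the two assertion values in the logs below (0 in class_import_put() and -1 in class_import_destroy()) are both consistent with one unbalanced put on an import. The following is a toy Python model of that idea, not Lustre code: the class, the function names, the split into a check step and a decrement step, and the interleaving are all assumptions made for the sketch, and the actual root cause is not established by this report.

```python
# Toy model (NOT Lustre source): how one extra "put" could produce the two
# assertion values quoted in the console logs. All names are hypothetical.

POISON = 0x5A5A5A5A  # stand-in for the upper bound used in the put-side check


class Import:
    """Hypothetical import object holding only a reference count."""

    def __init__(self):
        self.refcount = 1    # the creator holds the first reference
        self.zombie = False  # set once queued for deferred destruction


def put_check(imp):
    """Put-side sanity check: the count must be > 0 and below the poison value."""
    v = imp.refcount
    if not (0 < v < POISON):
        raise AssertionError(f"put check failed: value: {v}")


def put_decrement(imp):
    """Drop one reference; queue the import for destruction when it hits 0."""
    imp.refcount -= 1
    if imp.refcount == 0:
        imp.zombie = True


def import_put(imp):
    put_check(imp)
    put_decrement(imp)


def import_destroy(imp):
    """Destroy-side check: nobody may hold or have over-dropped a reference."""
    if imp.refcount != 0:
        raise AssertionError(f"destroy check failed: value: {imp.refcount}")


def extra_put_after_zero():
    """An unbalanced put arrives after the count already reached 0."""
    imp = Import()
    import_put(imp)          # legitimate final put: 1 -> 0
    try:
        import_put(imp)      # extra put sees 0 and trips the put-side check
    except AssertionError as err:
        return str(err)
    return None


def racing_puts_then_destroy():
    """Two puts both pass the check before either one decrements."""
    imp = Import()
    put_check(imp)           # thread A sees 1, check passes
    put_check(imp)           # thread B sees 1, check passes
    put_decrement(imp)       # A: 1 -> 0, import queued for destruction
    put_decrement(imp)       # B: 0 -> -1
    try:
        import_destroy(imp)  # deferred destroy now sees -1
    except AssertionError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(extra_put_after_zero())
    print(racing_puts_then_destroy())
```

In this model the same bug shows up as either symptom depending only on timing, which would fit one server failing in the destroy path and the other two in the put path.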
Service110:
Lustre: nbp4-MDT0000: Export ffff8806560aa400 already connecting from 10.151.5.8@o2ib
Lustre: nbp4-MDT0000: denying duplicate export for 81811d25-ee59-5ea0-fbaf-31ee49f5aeb7, -114
Lustre: Skipped 1 previous similar message
LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 10.151.46.203@o2ib rejected: o2iblnd fatal error
LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 7 previous similar messages
LustreError: 4621:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114
Lustre: nbp4-MDT0000: denying duplicate export for 000566b2-2e24-9e9d-b38c-24016bb34ecd, -114
Lustre: nbp3-MDT0000: Export ffff880756260c00 already connecting from 10.151.4.136@o2ib
Lustre: Skipped 3 previous similar messages
LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 10.151.17.141@o2ib rejected: o2iblnd fatal error
LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 6 previous similar messages
LustreError: 4650:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114
LustreError: 3541:0:(genops.c:930:class_import_destroy()) ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1
Lustre: nbp4-MDT0000: denying duplicate export for b7f2dcde-c1be-3701-2f14-fec0e7c1b513, -114
LustreError: 3541:0:(genops.c:930:class_import_destroy()) LBUG
Pid: 3541, comm: obd_zombid
We were not able to produce a vmcore. Kdump crashed.
Service150 (and also service170):
(The crashes on both MDS nodes look like ORI-710.)
Lustre: 3437:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1404442415660130 sent from nbp5-OST002c-osc-MDT0000 to NID 10.151.25.241@o2ib has timed out for sent delay: [sent 1341262190] [real_sent 0] [current 1341262295] [deadline 105s] [delay 0s] req@ffff880549250800 x1404442415660130/t0(0) o-1->nbp5-OST002c_UUID@10.151.25.241@o2ib:28/4 lens 368/512 e 0 to 1 dl 1341262295 ref 2 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Lustre: 3437:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 599 previous similar messages
Lustre: nbp5-MDT0000: haven't heard from client b04675d7-083f-bc2b-0fa5-2863afc271db (at 10.151.41.133@o2ib) in 279 seconds. I think it's dead, and I am evicting it. exp ffff880bfaff3800, cur 1341262407 expire 1341262257 last 1341262128
Lustre: Skipped 43832 previous similar messages
LustreError: 1812:0:(o2iblnd_cb.c:2613:kiblnd_rejected()) 10.151.13.180@o2ib rejected: o2iblnd fatal error
LustreError: 1812:0:(o2iblnd_cb.c:2613:kiblnd_rejected()) Skipped 3013 previous similar messages
LustreError: 3694:0:(genops.c:934:class_import_put()) ASSERTION(__v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a)) failed: value: 0
LustreError: 12396:0:(ldlm_lib.c:965:target_handle_connect()) ee0eaddd-4f30-488b-720c-5ffbddbd6ae9: 10.151.25.237@o2ib already connected at higher conn_cnt: 8 > 6
LustreError: 12389:0:(ldlm_lib.c:965:target_handle_connect()) ee0eaddd-4f30-488b-720c-5ffbddbd6ae9: 10.151.25.237@o2ib already connected at higher conn_cnt: 8 > 7
LustreError: 12396:0:(mgs_handler.c:783:mgs_handle()) MGS handle cmd=250 rc=-114
LustreError: 12396:0:(mgs_handler.c:783:mgs_handle()) Skipped 1 previous similar message
LustreError: 3694:0:(genops.c:934:class_import_put()) LBUG
Pid: 3694, comm: ll_mgs_01