Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-1592

ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.3.0, Lustre 2.1.2
    • None
    • 2.1.2 servers and clients.
    • 2
    • 4468

    Description

      We had three mds/mgs crashed within an hour.
      The service110 crashed in class_import_destroy() while service150 and service170 crashed in class_import_put(). The class_import_destroy() failed on an ASSERT because the refcount was -1. The other two cases look like ORI-710.

      Unfortunately, we were not able to get a vmcore. Kdump crashed.

      Service110:
      Lustre: nbp4-MDT0000: Export ffff8806560aa400 already connecting from 10.151.5.8@o2ib^M
      Lustre: nbp4-MDT0000: denying duplicate export for 81811d25-ee59-5ea0-fbaf-31ee49f5aeb7, -114^M
      Lustre: Skipped 1 previous similar message^M
      LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 10.151.46.203@o2ib rejected: o2iblnd fatal error^M
      LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 7 previous similar messages^M
      LustreError: 4621:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114^M
      Lustre: nbp4-MDT0000: denying duplicate export for 000566b2-2e24-9e9d-b38c-24016bb34ecd, -114^M
      Lustre: nbp3-MDT0000: Export ffff880756260c00 already connecting from 10.151.4.136@o2ib^M
      Lustre: Skipped 3 previous similar messages^M
      LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 10.151.17.141@o2ib rejected: o2iblnd fatal error^M
      LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 6 previous similar messages^M
      LustreError: 4650:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114^M
      LustreError: 3541:0:(genops.c:930:class_import_destroy()) ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1^M
      Lustre: nbp4-MDT0000: denying duplicate export for b7f2dcde-c1be-3701-2f14-fec0e7c1b513, -114^M
      LustreError: 3541:0:(genops.c:930:class_import_destroy()) LBUG^M
      Pid: 3541, comm: obd_zombid^M

      We were not able to product a vmcore. Kdump crashed.

      Service150 (and also service170):
      (The crash on both MDS looks like ORI-710)
      Lustre: 3437:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1404442415660130 sent from nbp5-OST002c-osc-MDT0000 to NID 10.151.25.241@o2ib has timed out for sent delay: [sent 1341262190] [real_sent 0] [current 1341262295] [deadline 105s] [delay 0s] req@ffff880549250800 x1404442415660130/t0(0) o-1->nbp5-OST002c_UUID@10.151.25.241@o2ib:28/4 lens 368/512 e 0 to 1 dl 1341262295 ref 2 fl Rpc:XN/ffffffff/ffffffff rc 0/-1^M
      Lustre: 3437:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 599 previous similar messages^M
      Lustre: nbp5-MDT0000: haven't heard from client b04675d7-083f-bc2b-0fa5-2863afc271db (at 10.151.41.133@o2ib) in 279 seconds. I think it's dead, and I am evicting it. exp ffff880bfaff3800, cur 1341262407 expire 1341262257 last 1341262128^M
      Lustre: Skipped 43832 previous similar messages^M
      LustreError: 1812:0:(o2iblnd_cb.c:2613:kiblnd_rejected()) 10.151.13.180@o2ib rejected: o2iblnd fatal error^M
      LustreError: 1812:0:(o2iblnd_cb.c:2613:kiblnd_rejected()) Skipped 3013 previous similar messages^M
      LustreError: 3694:0:(genops.c:934:class_import_put()) ASSERTION(__v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a)) failed: value: 0^M
      LustreError: 12396:0:(ldlm_lib.c:965:target_handle_connect()) ee0eaddd-4f30-488b-720c-5ffbddbd6ae9: 10.151.25.237@o2ib already connected at higher conn_cnt: 8 > 6^M
      LustreError: 12389:0:(ldlm_lib.c:965:target_handle_connect()) ee0eaddd-4f30-488b-720c-5ffbddbd6ae9: 10.151.25.237@o2ib already connected at higher conn_cnt: 8 > 7^M
      LustreError: 12396:0:(mgs_handler.c:783:mgs_handle()) MGS handle cmd=250 rc=-114^M
      LustreError: 12396:0:(mgs_handler.c:783:mgs_handle()) Skipped 1 previous similar message^M
      LustreError: 3694:0:(genops.c:934:class_import_put()) LBUG^M
      Pid: 3694, comm: ll_mgs_01^M

      Attachments

        Activity

          [LU-1592] ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1
          bobijam Zhenyu Xu added a comment -

          b2_1 patch port tracking at http://review.whamcloud.com/3869

          bobijam Zhenyu Xu added a comment - b2_1 patch port tracking at http://review.whamcloud.com/3869

          Could you please land this patch to b2_1 branch? Thanks!

          jaylan Jay Lan (Inactive) added a comment - Could you please land this patch to b2_1 branch? Thanks!
          pjones Peter Jones added a comment -

          Landed for 2.3 and 2.4

          pjones Peter Jones added a comment - Landed for 2.3 and 2.4
          bobijam Zhenyu Xu added a comment -

          patch tracking at http://review.whamcloud.com/3684

          patch description
          LU-1592 ldlm: protect obd_export:exp_imp_reverse's change
          
          * Protect obd_export::exp_imp_reverse from reconnect and destroy race.
          * Add an assertion in class_import_put() to catch race in the first
            place.
          
          bobijam Zhenyu Xu added a comment - patch tracking at http://review.whamcloud.com/3684 patch description LU-1592 ldlm: protect obd_export:exp_imp_reverse's change * Protect obd_export::exp_imp_reverse from reconnect and destroy race. * Add an assertion in class_import_put() to catch race in the first place.

          The output of "bt -a" command from crash.

          jaylan Jay Lan (Inactive) added a comment - The output of "bt -a" command from crash.
          bobijam Zhenyu Xu added a comment -

          Jay Lan,

          Can you grab and upload all thread stacks when this happens, so that we can know what is racing the import destroy.

          bobijam Zhenyu Xu added a comment - Jay Lan, Can you grab and upload all thread stacks when this happens, so that we can know what is racing the import destroy.
          bobijam Zhenyu Xu added a comment -

          from the descriptions, there should be somewhere class_import_put() does not match its get operation, also the two threads calling class_import_put() should be in race contidion, since at the beginning of class_import_put() there has two assertions to make sure there is no additional put be called after its last refcount reaches 0

          LASSERT(cfs_list_empty(&imp->imp_zombie_chain));
          LASSERT_ATOMIC_GT_LT(&imp->imp_refcount, 0, LI_POISON);

          still investigating...

          bobijam Zhenyu Xu added a comment - from the descriptions, there should be somewhere class_import_put() does not match its get operation, also the two threads calling class_import_put() should be in race contidion, since at the beginning of class_import_put() there has two assertions to make sure there is no additional put be called after its last refcount reaches 0 LASSERT(cfs_list_empty(&imp->imp_zombie_chain)); LASSERT_ATOMIC_GT_LT(&imp->imp_refcount, 0, LI_POISON); still investigating...

          Please reopen this ticket. Last Friday 2 mds in our production systems crashed on this bug. Both system ran 2.1.2-2nasS, which contains patch of LU-1432.
          https://github.com/jlan/lustre-nas/commits/nas-2.1.2

          The console showed the messages:

          LustreError: 3669:0:(genops.c:930:class_import_destroy()) ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1^M
          LustreError: 3669:0:(genops.c:930:class_import_destroy()) LBUG^M
          Pid: 3669, comm: obd_zombid^M
          ^M
          Call Trace:^M
          [<ffffffffa0598855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M
          [ <7ff fofutf fofffa 80 59cp8ue9s5 >in] lkdbbug,_ wwaiitht_ilngo cf+0oxr 7t5h/0ex re0e st[,l ibtcimfeso]^Mu
          t in 10 second(s)^M
          [<ffffffffa0663a06>] class_import_destroy+0x3a6/0x3b0 [obdclass]^M
          [<ffffffffa06678ba>] obd_zombie_impexp_cull+0xda/0x5a0 [obdclass]^M
          [<ffffffff810903ac>] ? remove_wait_queue+0x3c/0x50^M
          [<ffffffffa0667e85>] obd_zombie_impexp_thread+0x105/0x270 [obdclass]^M
          [<ffffffff8105fff0>] ? default_wake_function+0x0/0x20^M
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]^M
          [<ffffffff8100c14a>] child_rip+0xa/0x20^M
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]^M
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]^M
          [<ffffffff8100c140>] ? child_rip+0x0/0x20^M
          ^M
          Kernel panic - not syncing: LBUG^M
          Pid: 3669, comm: obd_zombid Tainted: G I---------------- 2.6.32-220.4.1.el6.20120607.x86_64.lustre212 #1^M
          Call Trace:^M
          [<ffffffff81520c56>] ? panic+0x78/0x164^M
          [<ffffffffa0598eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]^M
          [<ffffffffa0663a06>] ? class_import_destroy+0x3a6/0x3b0 [obdclass]^M
          [<ffffffffa06678ba>] ? obd_zombie_impexp_cull+0xda/0x5a0 [obdclass]^M
          [<ffffffff810903ac>] ? remove_wait_queue+0x3c/0x50^M
          [<ffffffffa0667e85>] ? obd_zombie_impexp_thread+0x105/0x270 [obdclass]^M
          [<ffffffff8105fff0>] ? default_wake_function+0x0/0x20^M
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]^M
          [<ffffffff8100c14a>] ? child_rip+0xa/0x20^M
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]^M
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]^M
          [<ffffffff8100c140>] ? child_rip+0x0/0x20^M

          jaylan Jay Lan (Inactive) added a comment - Please reopen this ticket. Last Friday 2 mds in our production systems crashed on this bug. Both system ran 2.1.2-2nasS, which contains patch of LU-1432 . https://github.com/jlan/lustre-nas/commits/nas-2.1.2 The console showed the messages: LustreError: 3669:0:(genops.c:930:class_import_destroy()) ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1^M LustreError: 3669:0:(genops.c:930:class_import_destroy()) LBUG^M Pid: 3669, comm: obd_zombid^M ^M Call Trace:^M [<ffffffffa0598855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] ^M [ <7ff fofutf fofffa 80 59cp8ue9s5 >in] lkdbbug,_ wwaiitht_ilngo cf+0oxr 7t5h/0ex re0e st [,l ibtcimfeso] ^Mu t in 10 second(s)^M [<ffffffffa0663a06>] class_import_destroy+0x3a6/0x3b0 [obdclass] ^M [<ffffffffa06678ba>] obd_zombie_impexp_cull+0xda/0x5a0 [obdclass] ^M [<ffffffff810903ac>] ? remove_wait_queue+0x3c/0x50^M [<ffffffffa0667e85>] obd_zombie_impexp_thread+0x105/0x270 [obdclass] ^M [<ffffffff8105fff0>] ? default_wake_function+0x0/0x20^M [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass] ^M [<ffffffff8100c14a>] child_rip+0xa/0x20^M [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass] ^M [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass] ^M [<ffffffff8100c140>] ? child_rip+0x0/0x20^M ^M Kernel panic - not syncing: LBUG^M Pid: 3669, comm: obd_zombid Tainted: G I---------------- 2.6.32-220.4.1.el6.20120607.x86_64.lustre212 #1^M Call Trace:^M [<ffffffff81520c56>] ? panic+0x78/0x164^M [<ffffffffa0598eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs] ^M [<ffffffffa0663a06>] ? class_import_destroy+0x3a6/0x3b0 [obdclass] ^M [<ffffffffa06678ba>] ? obd_zombie_impexp_cull+0xda/0x5a0 [obdclass] ^M [<ffffffff810903ac>] ? remove_wait_queue+0x3c/0x50^M [<ffffffffa0667e85>] ? obd_zombie_impexp_thread+0x105/0x270 [obdclass] ^M [<ffffffff8105fff0>] ? default_wake_function+0x0/0x20^M [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass] ^M [<ffffffff8100c14a>] ? child_rip+0xa/0x20^M [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass] ^M [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass] ^M [<ffffffff8100c140>] ? child_rip+0x0/0x20^M
          pjones Peter Jones added a comment -

          Closing as a duplicate of LU-1432 which landed for 2.3 on July 19th. Please reopen if this issue is encountered with that code in place

          pjones Peter Jones added a comment - Closing as a duplicate of LU-1432 which landed for 2.3 on July 19th. Please reopen if this issue is encountered with that code in place

          Also observed at LLNL on Orion

          ian Ian Colle (Inactive) added a comment - Also observed at LLNL on Orion

          People

            bobijam Zhenyu Xu
            jaylan Jay Lan (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: