Lustre / LU-1592

ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.3.0, Lustre 2.1.2
    • Labels: None
    • Environment: 2.1.2 servers and clients.
    • Severity: 2
    • Rank (Obsolete): 4468

    Description

      We had three MDS/MGS servers crash within an hour.
      Service110 crashed in class_import_destroy(), while service150 and service170 crashed in class_import_put(). class_import_destroy() failed an assertion because the import refcount was -1. The other two cases look like ORI-710.

      Unfortunately, we were not able to get a vmcore. Kdump crashed.
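
      For context, a minimal compilable sketch of the refcounting pattern involved (a hypothetical simplification, not the actual Lustre code): one unbalanced put leaves the counter at -1 by the time the zombie thread runs the destroy, which is exactly the value in the assertion.

      #include <assert.h>

      struct obd_import {
              int imp_refcount;       /* cfs_atomic_t in the real code */
      };

      static void import_get(struct obd_import *imp)
      {
              imp->imp_refcount++;    /* cfs_atomic_inc() in the real code */
      }

      /* Returns nonzero when the last reference is dropped and the import
       * should be handed to the obd_zombid thread for destruction. */
      static int import_put(struct obd_import *imp)
      {
              return --imp->imp_refcount == 0;
      }

      static void import_destroy(struct obd_import *imp)
      {
              /* class_import_destroy() insists the count is exactly 0; an
               * extra import_put() makes it -1, producing the reported LBUG. */
              assert(imp->imp_refcount == 0);
      }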

      Service110:
      Lustre: nbp4-MDT0000: Export ffff8806560aa400 already connecting from 10.151.5.8@o2ib
      Lustre: nbp4-MDT0000: denying duplicate export for 81811d25-ee59-5ea0-fbaf-31ee49f5aeb7, -114
      Lustre: Skipped 1 previous similar message
      LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 10.151.46.203@o2ib rejected: o2iblnd fatal error
      LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 7 previous similar messages
      LustreError: 4621:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114
      Lustre: nbp4-MDT0000: denying duplicate export for 000566b2-2e24-9e9d-b38c-24016bb34ecd, -114
      Lustre: nbp3-MDT0000: Export ffff880756260c00 already connecting from 10.151.4.136@o2ib
      Lustre: Skipped 3 previous similar messages
      LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 10.151.17.141@o2ib rejected: o2iblnd fatal error
      LustreError: 1439:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 6 previous similar messages
      LustreError: 4650:0:(mgs_handler.c:782:mgs_handle()) MGS handle cmd=250 rc=-114
      LustreError: 3541:0:(genops.c:930:class_import_destroy()) ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1
      Lustre: nbp4-MDT0000: denying duplicate export for b7f2dcde-c1be-3701-2f14-fec0e7c1b513, -114
      LustreError: 3541:0:(genops.c:930:class_import_destroy()) LBUG
      Pid: 3541, comm: obd_zombid

      We were not able to produce a vmcore; kdump crashed.

      Service150 (and also service170):
      (The crashes on both of these MDSes look like ORI-710.)
      Lustre: 3437:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1404442415660130 sent from nbp5-OST002c-osc-MDT0000 to NID 10.151.25.241@o2ib has timed out for sent delay: [sent 1341262190] [real_sent 0] [current 1341262295] [deadline 105s] [delay 0s] req@ffff880549250800 x1404442415660130/t0(0) o-1->nbp5-OST002c_UUID@10.151.25.241@o2ib:28/4 lens 368/512 e 0 to 1 dl 1341262295 ref 2 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
      Lustre: 3437:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 599 previous similar messages
      Lustre: nbp5-MDT0000: haven't heard from client b04675d7-083f-bc2b-0fa5-2863afc271db (at 10.151.41.133@o2ib) in 279 seconds. I think it's dead, and I am evicting it. exp ffff880bfaff3800, cur 1341262407 expire 1341262257 last 1341262128
      Lustre: Skipped 43832 previous similar messages
      LustreError: 1812:0:(o2iblnd_cb.c:2613:kiblnd_rejected()) 10.151.13.180@o2ib rejected: o2iblnd fatal error
      LustreError: 1812:0:(o2iblnd_cb.c:2613:kiblnd_rejected()) Skipped 3013 previous similar messages
      LustreError: 3694:0:(genops.c:934:class_import_put()) ASSERTION(__v > 0 && __v < ((int)0x5a5a5a5a5a5a5a5a)) failed: value: 0
      LustreError: 12396:0:(ldlm_lib.c:965:target_handle_connect()) ee0eaddd-4f30-488b-720c-5ffbddbd6ae9: 10.151.25.237@o2ib already connected at higher conn_cnt: 8 > 6
      LustreError: 12389:0:(ldlm_lib.c:965:target_handle_connect()) ee0eaddd-4f30-488b-720c-5ffbddbd6ae9: 10.151.25.237@o2ib already connected at higher conn_cnt: 8 > 7
      LustreError: 12396:0:(mgs_handler.c:783:mgs_handle()) MGS handle cmd=250 rc=-114
      LustreError: 12396:0:(mgs_handler.c:783:mgs_handle()) Skipped 1 previous similar message
      LustreError: 3694:0:(genops.c:934:class_import_put()) LBUG
      Pid: 3694, comm: ll_mgs_01

      Attachments

        Activity

          [LU-1592] ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1
          bobijam Zhenyu Xu added a comment -

          Patch tracking at http://review.whamcloud.com/3684

          Patch description:
          LU-1592 ldlm: protect obd_export:exp_imp_reverse's change
          
          * Protect obd_export::exp_imp_reverse from reconnect and destroy race.
          * Add an assertion in class_import_put() to catch race in the first
            place.
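
          The gist of the change, as a rough sketch (assumptions: the export's exp_lock spinlock is the guard and the swapped-out import gets exactly one put; see the patch itself for the real code):

          static void export_set_reverse_import(struct obd_export *exp,
                                                struct obd_import *new_imp)
          {
                  struct obd_import *old_imp;

                  /* Serialize replacement of exp_imp_reverse so a reconnect
                   * cannot swap in a new reverse import while destruction is
                   * putting the old one. */
                  spin_lock(&exp->exp_lock);
                  old_imp = exp->exp_imp_reverse;
                  exp->exp_imp_reverse = new_imp;
                  spin_unlock(&exp->exp_lock);

                  /* Exactly one put for the import that was swapped out. */
                  if (old_imp != NULL)
                          class_import_put(old_imp);
          }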
          

          jaylan Jay Lan (Inactive) added a comment -

          The output of the "bt -a" command from crash.
          bobijam Zhenyu Xu added a comment -

          Jay Lan,

          Could you grab and upload all thread stacks when this happens, so that we can see what is racing with the import destroy?

          bobijam Zhenyu Xu added a comment -

          From the descriptions, there must be somewhere that a class_import_put() does not match its get operation. Also, the two threads calling class_import_put() must have been in a race condition, since at the beginning of class_import_put() there are two assertions that make sure no additional put is issued after the last refcount reaches 0:

          LASSERT(cfs_list_empty(&imp->imp_zombie_chain));
          LASSERT_ATOMIC_GT_LT(&imp->imp_refcount, 0, LI_POISON);
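
          For reference, LASSERT_ATOMIC_GT_LT checks that the counter lies strictly between its bounds. Roughly, reconstructed from the assertion text in the service150 log (a sketch, not copied verbatim from the source):

          #define LASSERT_ATOMIC_GT_LT(a, gt, lt)                         \
          do {                                                            \
                  int __v = cfs_atomic_read(a);                           \
                  LASSERTF(__v > (gt) && __v < (lt), "value: %d\n", __v); \
          } while (0)

          A put on an import whose refcount has already reached 0 therefore fires it immediately (the "value: 0" in the log above), as would the 0x5a5a5a5a poison pattern left in freed memory.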

          still investigating...


          jaylan Jay Lan (Inactive) added a comment -

          Please reopen this ticket. Last Friday two MDSes in our production systems crashed on this bug. Both systems ran 2.1.2-2nasS, which contains the patch from LU-1432:
          https://github.com/jlan/lustre-nas/commits/nas-2.1.2

          The console showed the following messages:

          LustreError: 3669:0:(genops.c:930:class_import_destroy()) ASSERTION(cfs_atomic_read(&imp->imp_refcount) == 0) failed: value: -1
          LustreError: 3669:0:(genops.c:930:class_import_destroy()) LBUG
          Pid: 3669, comm: obd_zombid

          Call Trace:
          [<ffffffffa0598855>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          [<ffffffffa0598e95>] lbug_with_loc+0x75/0xe0 [libcfs]
          7 out of 8 cpus in kdb, waiting for the rest, timeout in 10 second(s)
          [<ffffffffa0663a06>] class_import_destroy+0x3a6/0x3b0 [obdclass]
          [<ffffffffa06678ba>] obd_zombie_impexp_cull+0xda/0x5a0 [obdclass]
          [<ffffffff810903ac>] ? remove_wait_queue+0x3c/0x50
          [<ffffffffa0667e85>] obd_zombie_impexp_thread+0x105/0x270 [obdclass]
          [<ffffffff8105fff0>] ? default_wake_function+0x0/0x20
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]
          [<ffffffff8100c14a>] child_rip+0xa/0x20
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]
          [<ffffffff8100c140>] ? child_rip+0x0/0x20

          Kernel panic - not syncing: LBUG
          Pid: 3669, comm: obd_zombid Tainted: G I---------------- 2.6.32-220.4.1.el6.20120607.x86_64.lustre212 #1
          Call Trace:
          [<ffffffff81520c56>] ? panic+0x78/0x164
          [<ffffffffa0598eeb>] ? lbug_with_loc+0xcb/0xe0 [libcfs]
          [<ffffffffa0663a06>] ? class_import_destroy+0x3a6/0x3b0 [obdclass]
          [<ffffffffa06678ba>] ? obd_zombie_impexp_cull+0xda/0x5a0 [obdclass]
          [<ffffffff810903ac>] ? remove_wait_queue+0x3c/0x50
          [<ffffffffa0667e85>] ? obd_zombie_impexp_thread+0x105/0x270 [obdclass]
          [<ffffffff8105fff0>] ? default_wake_function+0x0/0x20
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]
          [<ffffffff8100c14a>] ? child_rip+0xa/0x20
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]
          [<ffffffffa0667d80>] ? obd_zombie_impexp_thread+0x0/0x270 [obdclass]
          [<ffffffff8100c140>] ? child_rip+0x0/0x20

          pjones Peter Jones added a comment -

          Closing as a duplicate of LU-1432, which landed for 2.3 on July 19th. Please reopen if this issue is encountered with that code in place.


          ian Ian Colle (Inactive) added a comment -

          Also observed at LLNL on Orion.


          dmoreno Diego Moreno (Inactive) added a comment -

          At Bull we also hit this issue.

          We're going to wait for patch approval before installing it.


          jaylan Jay Lan (Inactive) added a comment -

          I cherry-picked the patches from LU-1432 and LU-1428 and built a new set of images. The new images will take effect the next time our Lustre servers LBUG or have to reboot for whatever reason.


          jaylan Jay Lan (Inactive) added a comment -

          Of the five MGS crashes yesterday, three were on S170, one on S110, and one on S150. S110 runs 2.1.2, while S170 and S150 run two slightly different versions of 2.1.1. Of the three filesystems, nobackupp1 (i.e. S17*) is the most heavily used, which probably explains why it crashed three times yesterday.

          The other very heavily used filesystem is nobackupp2, which survived yesterday; nobackupp2 runs 2.1.2, the same version as S110. We have a planned upgrade of nobackupp1 to 2.1.2 today, and we will proceed as planned.

          I want to document that here so that we all know S170 will be upgraded to 2.1.2 today and will no longer run the code that crashed three times yesterday. However, since S110 also crashed yesterday, I believe the problem still exists in 2.1.2.

          bobijam Zhenyu Xu added a comment -

          This looks like a race between obd cleanup and target_handle_connect(), like LU-1432; http://review.whamcloud.com/#change,3244 should avoid this kind of race.
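
          A sketch of the suspected racy pattern, for illustration only (an assumption, not traced from a vmcore; new_reverse_import() is a hypothetical helper):

          static void racy_reconnect(struct obd_export *exp)
          {
                  /* Unlocked read: a concurrent cleanup may be about to
                   * put this same import. */
                  struct obd_import *old_imp = exp->exp_imp_reverse;

                  exp->exp_imp_reverse = new_reverse_import(exp);

                  /* If cleanup already put old_imp, this second put drops
                   * imp_refcount from 0 to -1, and obd_zombid later trips
                   * ASSERTION(imp_refcount == 0) in class_import_destroy(). */
                  class_import_put(old_imp);
          }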


          People

            Assignee: bobijam Zhenyu Xu
            Reporter: jaylan Jay Lan (Inactive)
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: