
ASSERTION(cfs_atomic_read(&exp->exp_refcount) == 0) failed

    Description

      We have multiple Lustre 2.1 OSS nodes crashing repeatedly during recovery. This is on our classified Lustre cluster, which was updated from 1.8 to 2.1 on Tuesday. The assertion in the summary is one observed symptom. We have also seen the following two assertions appear together.

      ASSERTION(cfs_list_empty(&imp->imp_zombie_chain)) failed
      ASSERTION(cfs_atomic_read(&exp->exp_refcount) == 0) failed

      We don't have backtraces for the assertions because STONITH kicked in before the crash dump completed.

      Other OSS nodes are crashing in kernel string handling functions with stacks like

      machine_kexec
      crash_kexec
      oops_end
      die
      do_general_protection
      general_protection
      (exception RIP: strlen+9)
      strlen
      string
      vsnprintf
      libcfs_debug_vmsg2
      _debug_req
      target_send_replay_msg
      target_send_reply
      ost_handle
      ptlrpc_main

      So it appears we are passing a bad value in a debug message.

      Another stack trace:

      BUG: unable to handle kernel NULL ptr dereference at 000...38
      IP: [<fffffffa0a8706>] filter_export_stats_init+0x1f1/0x500 [obdfilter]

      machine_kexec
      crash_kexec
      oops_end
      no_context
      __bad_area_semaphore
      bad_area_semaphore
      __do_page_fault
      do_page_fault
      page_fault
      filter_reconnect
      target_handle_connect
      ost_handle
      ptlrpc_main

      We have multiple symptoms here that may or may not be due to the same bug. We may need to open a separate issue to track the root cause. Note that our branch contains LU-874 patches that touched the ptlrpc queue management code, so we should be on the lookout for any races introduced there. Also note that we can't send debug data from this system.

        Activity

          [LU-1085] ASSERTION(cfs_atomic_read(&exp->exp_refcount) == 0) failed
          pjones Peter Jones added a comment -

          Believed to be a duplicate of LU-1092


          nedbass Ned Bass (Inactive) added a comment -

          Yes, we'll pull the patch in to our tree, and it will eventually get rolled out to our production systems.
          green Oleg Drokin added a comment -

          Yes, this looks related.
          Any chance you can try it?


          nedbass Ned Bass (Inactive) added a comment -

          Oleg,

          Do you think the LU-1092 patch will help with these assertions? Mikhail made a comment to that effect in LU-1336.

          http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=893cf2014a38c5bd94890d3522fafe55f024a958

          nedbass Ned Bass (Inactive) added a comment -

          Here's a log message and backtrace for the cfs_list_empty assertion.

          LustreError: 24458:0:(genops.c:931:class_import_put()) ASSERTION(cfs_list_empty(&imp->imp_zombie_chain)) failed

          COMMAND: "ll_ost_54"
          #0 machine_kexec
          #1 crash_kexec
          #2 panic
          #3 lbug_with_loc
          #4 libcfs_assertion_failed
          #5 class_import_put
          #6 client_destroy_import
          #7 target_handle_connect
          #8 ost_handle
          #9 ptlrpc_main
          #10 kernel_thread
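One way to read the backtrace above is that a reference was dropped on an import that had already been queued for zombie cleanup. The following user-space sketch illustrates that invariant only. It is not the Lustre source: every sketch_ name is hypothetical, C11 atomics stand in for the kernel primitives, and the function reports the violation through a return code instead of panicking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal intrusive list node; an "empty" node points at itself. */
struct sketch_list {
        struct sketch_list *next, *prev;
};

static void sketch_list_init(struct sketch_list *n)
{
        n->next = n->prev = n;
}

static bool sketch_list_empty(const struct sketch_list *n)
{
        return n->next == n;
}

static void sketch_list_add(struct sketch_list *n, struct sketch_list *head)
{
        n->next = head->next;
        n->prev = head;
        head->next->prev = n;
        head->next = n;
}

/* Hypothetical stand-in for an import: a refcount plus a zombie-list link. */
struct sketch_import {
        atomic_int refcount;
        struct sketch_list zombie_chain;
};

enum sketch_put_result {
        SKETCH_PUT_OK,             /* reference dropped, others remain */
        SKETCH_PUT_QUEUED,         /* last reference dropped, queued for cleanup */
        SKETCH_PUT_ALREADY_ZOMBIE  /* invariant violated: already queued */
};

static enum sketch_put_result
sketch_import_put(struct sketch_import *imp, struct sketch_list *zombies)
{
        /* Stand-in for the failing assertion: report instead of panicking. */
        if (!sketch_list_empty(&imp->zombie_chain))
                return SKETCH_PUT_ALREADY_ZOMBIE;

        /* atomic_fetch_sub returns the previous value; 1 means last reference. */
        if (atomic_fetch_sub(&imp->refcount, 1) == 1) {
                sketch_list_add(&imp->zombie_chain, zombies);
                return SKETCH_PUT_QUEUED;
        }
        return SKETCH_PUT_OK;
}
```

In this sketch, an import that starts with two references returns SKETCH_PUT_OK on the first put and SKETCH_PUT_QUEUED on the second. A third, unbalanced put returns SKETCH_PUT_ALREADY_ZOMBIE, which corresponds to the condition the assertion guards against. Whether the real crash comes from an extra put or from a race elsewhere is not established in this ticket.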

          nedbass Ned Bass (Inactive) added a comment -

          Here's a log message and backtrace for the exp_refcount assertion.

          LustreError: 24253:0:(genops.c:717:class_export_destroy()) ASSERTION(cfs_atomic_read(&exp->exp_refcount) == 0) failed: value: 1

          COMMAND: "obd_zombid"
          #0 machine_kexec
          #1 crash_kexec
          #2 panic
          #3 lbug_with_loc
          #4 obd_zombie_impexp_cull
          #5 obd_zombie_impexp_thread
          #6 kernel_thread

          nedbass Ned Bass (Inactive) added a comment -

          Value reported has been either 1, 2, or 3.
          green Oleg Drokin added a comment -

          LASSERT_ATOMIC_ZERO is defined in terms of LASSERTF internally:

          #define LASSERT_ATOMIC_EQ(a, v)                                 \
          do {                                                            \
                  LASSERTF(cfs_atomic_read(a) == v,                       \
                           "value: %d\n", cfs_atomic_read((a)));          \
          } while (0)
          #define LASSERT_ATOMIC_ZERO(a)                  LASSERT_ATOMIC_EQ(a, 0)
          

          What was the value reported?

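To show how the macro quoted above yields the trailing "value: N" seen in the log, here is a user-space sketch. It assumes C11 atomics in place of cfs_atomic_t and a hypothetical SKETCH_ASSERTF that formats the failure line into a buffer and returns 0 instead of calling LBUG. All SKETCH_ and sketch_ names are illustrative and are not part of Lustre.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for LASSERTF: evaluates to 1 if cond holds,
 * otherwise writes a failure line into buf and evaluates to 0. */
#define SKETCH_ASSERTF(cond, buf, len, fmt, ...)                           \
        ((cond) ? 1                                                        \
                : (snprintf((buf), (len), "ASSERTION(%s) failed: " fmt,    \
                            #cond, __VA_ARGS__), 0))

static inline int sketch_atomic_read(atomic_int *a)
{
        return atomic_load(a);
}

/* Same shape as the quoted macros, with the buffer passed through. */
#define SKETCH_ASSERT_ATOMIC_EQ(a, v, buf, len)                            \
        SKETCH_ASSERTF(sketch_atomic_read(a) == v, buf, len,               \
                       "value: %d\n", sketch_atomic_read((a)))
#define SKETCH_ASSERT_ATOMIC_ZERO(a, buf, len)                             \
        SKETCH_ASSERT_ATOMIC_EQ(a, 0, buf, len)

/* Hypothetical export carrying only the field the check needs. */
struct sketch_export {
        atomic_int refcount;
};

/* Destroy-time check: 1 if no references remain, else 0 with msg filled. */
static int sketch_export_destroy_check(struct sketch_export *exp,
                                       char *msg, size_t len)
{
        return SKETCH_ASSERT_ATOMIC_ZERO(&exp->refcount, msg, len);
}

/* Usage: one leftover reference at destroy time trips the check. */
static int sketch_demo(char *msg, size_t len)
{
        struct sketch_export exp;

        atomic_init(&exp.refcount, 1);  /* a reference that was never dropped */
        return sketch_export_destroy_check(&exp, msg, len);
}
```

With one leftover reference, the check fails and the message ends in "failed: value: 1", the same shape as the reported log line. The reported values of 1, 2, and 3 would then correspond to that many references still being held when the export was destroyed. Who holds them is the open question in this ticket.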

          nedbass Ned Bass (Inactive) added a comment -

          So with many more crashes, any luck in getting the backtraces for the first two assertions? Any successful crashdumps?

          I will review the logs and crash dumps and let you know.

          There are two imp_zombie_chain assertions, do you know which one of them you hit? In obd_zombie_import_add or in class_import_put?

          We hit the one in class_import_put.

          As for the second assertion, there is no such assertion in the code? I checked out your tree and here's what I see:

          Here's the second one in our tree:

          https://github.com/chaos/lustre/blob/2.1.1-3chaos/lustre/obdclass/genops.c#L717
          green Oleg Drokin added a comment -

          So with many more crashes, any luck in getting the backtraces for the first two assertions?
          Any successful crashdumps?

          There are two imp_zombie_chain assertions, do you know which one of them you hit? In obd_zombie_import_add or in class_import_put?

          As for the second assertion, there is no such assertion in the code? I checked out your tree and here's what I see:

          Oleg-Drokins-MacBook-Pro-2:lustre green$ grep -r exp_refcount * | grep ASSERT
          lustre/obdclass/genops.c:        LASSERT_ATOMIC_ZERO(&exp->exp_refcount);
          lustre/obdclass/genops.c:        LASSERT_ATOMIC_GT_LT(&exp->exp_refcount, 0, 0x5a5a5a);
          lustre/obdecho/echo_client.c:        LASSERT(cfs_atomic_read(&ec->ec_exp->exp_refcount) > 0);
          

          The first one actually expands to LASSERTF, so it does not exactly match your output, I think?
          Can you elaborate a bit more on where that might have come from?


          People

            green Oleg Drokin
            nedbass Ned Bass (Inactive)