Lustre / LU-1085

ASSERTION(cfs_atomic_read(&exp->exp_refcount) == 0) failed

Details

    Description

      We have multiple Lustre 2.1 OSS nodes crashing repeatedly during recovery. This is on our classified Lustre cluster, which was updated from 1.8 to 2.1 on Tuesday. The assertion in the issue summary is one observed symptom. We have also seen the following two assertions appear together:

      ASSERTION(cfs_list_empty(&imp->imp_zombie_chain)) failed
      ASSERTION(cfs_atomic_read(&exp->exp_refcount) == 0) failed

      We don't have backtraces for the assertions because STONITH kicked in before the crash dump completed.
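
      For readers less familiar with this class of assertion, the following is a toy userspace model of the invariant that a check like `exp_refcount == 0` usually guards: teardown may only run after the last reference is dropped. It is not Lustre code. The names `toy_export`, `toy_export_get`, `toy_export_put`, and `toy_export_destroy` are hypothetical, and the report does not establish where in the real code the assertion fired.

```c
/* Toy userspace model of a reference-counted object; NOT Lustre code.
 * Every name below is hypothetical and exists only for illustration. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_export {
    atomic_int refcount;   /* number of live references */
    bool       destroyed;  /* set once teardown has run; kept for inspection */
};

/* Allocate an object; the creator holds the first reference. */
static struct toy_export *toy_export_new(void)
{
    struct toy_export *exp = calloc(1, sizeof(*exp));
    if (exp == NULL)
        return NULL;
    atomic_init(&exp->refcount, 1);
    return exp;
}

/* Take an additional reference. */
static void toy_export_get(struct toy_export *exp)
{
    atomic_fetch_add(&exp->refcount, 1);
}

/* Teardown must only run once nobody holds a reference. */
static void toy_export_destroy(struct toy_export *exp)
{
    assert(atomic_load(&exp->refcount) == 0);
    exp->destroyed = true;
}

/* Drop one reference; returns true if it was the last and teardown ran. */
static bool toy_export_put(struct toy_export *exp)
{
    if (atomic_fetch_sub(&exp->refcount, 1) == 1) {
        toy_export_destroy(exp);
        return true;
    }
    return false;
}
```

      In a model like this, the assertion can only fail if teardown is reached by some path other than the final put, or if a get races with that final put. Either would be consistent with the symptom above, but neither is confirmed here.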

      Other OSS nodes are crashing in kernel string handling functions with stacks like:

      machine_kexec
      crash_kexec
      oops_end
      die
      do_general_protection
      general_protection
      (exception RIP: strlen+9)
      strlen
      string
      vsnprintf
      libcfs_debug_vmsg2
      _debug_req
      target_send_reply_msg
      target_send_reply
      ost_handle
      ptlrpc_main

      So it appears we are passing a bad value in a debug message.
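
      The `strlen`, `string`, `vsnprintf` frames are the path a `%s` conversion takes, which is why a bad string argument faults deep inside the formatter and not in the caller. The sketch below is a simplified userspace model of that step, not the kernel's implementation. `toy_format_string` is a hypothetical helper written only for this illustration.

```c
/* Simplified userspace model of a "%s" conversion; NOT kernel code.
 * toy_format_string is a hypothetical name used only for illustration. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy s into buf (at most size - 1 bytes plus a NUL terminator) and
 * return the full length of s, similar in spirit to snprintf("%s"). */
static size_t toy_format_string(char *buf, size_t size, const char *s)
{
    if (s == NULL)
        s = "(null)";            /* a NULL pointer is easy to catch ... */

    /* ... but a stale or garbage pointer is first dereferenced here,
     * so the fault is reported inside strlen, far from the caller. */
    size_t len = strlen(s);

    if (size > 0) {
        size_t n = len < size - 1 ? len : size - 1;
        memcpy(buf, s, n);
        buf[n] = '\0';
    }
    return len;
}
```

      A NULL check of this kind does not help with a pointer into freed or corrupted memory, so the argument itself has to be valid for as long as the message is being formatted.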

      Another stack trace:

      BUG: unable to handle kernel NULL pointer dereference at 000...38
      IP: [<fffffffa0a8706>] filter_export_stats_init+0x1f1/0x500 [obdfilter]

      machine_kexec
      crash_kexec
      oops_end
      no_context
      __bad_area_nosemaphore
      bad_area_nosemaphore
      __do_page_fault
      do_page_fault
      page_fault
      filter_reconnect
      target_handle_connect
      ost_handle
      ptlrpc_main

      We have multiple symptoms here that may or may not be due to the same bug, so we may need to open separate issues once the root causes are known. Note that our branch contains LU-874 patches that touched the ptlrpc queue management code, so we should be on the lookout for any races introduced there. Also note that we can't send debug data from this system.

          People

            green Oleg Drokin
            nedbass Ned Bass (Inactive)