[LU-1085] ASSERTION(cfs_atomic_read(&exp->exp_refcount) == 0) failed - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Duplicate
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.1.1
Labels:
- paj
Environment:
RHEL 6.2
Our branch: https://github.com/chaos/lustre/commits/2.1.0-llnl

Severity:
3
Rank (Obsolete):
6467

Description

We have multiple Lustre 2.1 OSS nodes crashing repeatedly during recovery. This is on our classified Lustre cluster which was updated from 1.8 to 2.1 on Tuesday. The summary is one observed symptom. We have also seen these assertions appearing together.

ASSERTION(cfs_list_empty(&imp->imp_zombie_chain)) failed
ASSERTION(cfs_atomic_read(&exp->exp_refcount) == 0) failed

We don't have backtraces for the assertions because STONITH kicked in before the crash dump completed.

Other OSS nodes are crashing in kernel string handling functions with stacks like

machine_kexec
crash_kexec
oops_end
die
do_general_protection
general_protection
(exception RIP: strlen+9)
strlen
string
vsnprintf
libcfs_debug_vmsg2
_debug_req
target_send_reply_msg
target_send_reply
ost_handle
ptlrpc_main

So it appears we are passing a bad value in a debug message.

Another stack trace:

BUG: unable to handle kernel NULL ptr dereference at 000...38
IP: [<fffffffa0a8706>] filter_export_stats_init+0x1f1/0x500 [obdfilter]

machine_kexec
crash_kexec
oops_end
no_context
__bad_area_nosemaphore
bad_area_nosemaphore
__do_page_fault
do_page_fault
page_fault
filter_reconnect
target_handle_connect
ost_handle
ptlrpc_main

We have multiple symptoms here that may or may not be due to the same bug, so we may need to open separate issues once the root cause is clearer. Note that our branch contains ~~LU-874~~ patches that touched the ptlrpc queue management code, so we should be on the lookout for any races introduced there. Also note that we can't send debug data from this system.

Activity

Peter Jones added a comment - 30/Apr/12 11:56 AM

Believed to be a duplicate of ~~LU-1092~~

Ned Bass (Inactive) added a comment - 19/Apr/12 11:57 AM

Yes, we'll pull the patch in to our tree, and it will eventually get rolled out to our production systems.

Oleg Drokin added a comment - 19/Apr/12 11:51 AM

Yes, this looks related.
Any chance you can try it?

Ned Bass (Inactive) added a comment - 19/Apr/12 11:41 AM

Oleg,

Do you think the ~~LU-1092~~ patch will help with these assertions? Mikhail made a comment to that effect in ~~LU-1336~~.

http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=893cf2014a38c5bd94890d3522fafe55f024a958

Ned Bass (Inactive) added a comment - 18/Apr/12 8:04 PM

Here's a log message and backtrace for the cfs_list_empty assertion.

LustreError: 24458:0:(genops.c:931:class_import_put()) ASSERTION(cfs_list_empty(&imp->imp_zombie_chain)) failed

COMMAND: "ll_ost_54"
#0 machine_kexec
#1 crash_kexec
#2 panic
#3 lbug_with_loc
#4 libcfs_assertion_failed
#5 class_import_put
#6 client_destroy_import
#7 target_handle_connect
#8 ost_handle
#9 ptlrpc_main
#10 kernel_thread

Ned Bass (Inactive) added a comment - 18/Apr/12 8:01 PM

Here's a log message and backtrace for the exp_refcount assertion.

LustreError: 24253:0:(genops.c:717:class_export_destroy()) ASSERTION(cfs_atomic_read(&exp->exp_refcount) == 0) failed: value: 1

COMMAND: "obd_zombid"
#0 machine_kexec
#1 crash_kexec
#2 panic
#3 lbug_with_loc
#4 obd_zombie_impexp_cull
#5 obd_zombie_impexp_thread
#6 kernel_thread

Ned Bass (Inactive) added a comment - 18/Apr/12 6:52 PM

Value reported has been either 1, 2, or 3.

Oleg Drokin added a comment - 18/Apr/12 6:48 PM

LASSERT_ATOMIC_ZERO is defined in terms of LASSERTF internally:

#define LASSERT_ATOMIC_EQ(a, v)                                 \
do {                                                            \
        LASSERTF(cfs_atomic_read(a) == v,                       \
                 "value: %d\n", cfs_atomic_read((a)));          \
} while (0)
#define LASSERT_ATOMIC_ZERO(a)                  LASSERT_ATOMIC_EQ(a, 0)

What was the value reported?
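
To make the connection between the macro and the log output concrete, here is a minimal user-space sketch of why an assertion written as LASSERT_ATOMIC_ZERO shows up in the log with a trailing "value: N". It is not Lustre code: the `cfs_atomic_t` and `obd_export` definitions, the `last_failure` buffer, the `check_export` helper, and the mock `LASSERTF` (which records a message instead of panicking the node) are all assumptions made for illustration. Only the two LASSERT_ATOMIC macros are copied from the definition quoted above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical user-space stand-ins for the kernel types; not Lustre code. */
typedef struct { int counter; } cfs_atomic_t;
struct obd_export { cfs_atomic_t exp_refcount; };

static int cfs_atomic_read(const cfs_atomic_t *a) { return a->counter; }

/* Where the mock assertion records its message instead of calling LBUG. */
static char last_failure[256];

/*
 * Mock LASSERTF (an assumption, not the real macro): the format string is
 * the first variadic argument, so it concatenates with the prefix literal.
 */
#define LASSERTF(cond, ...)                                             \
do {                                                                    \
        if (!(cond))                                                    \
                snprintf(last_failure, sizeof(last_failure),            \
                         "ASSERTION(" #cond ") failed: " __VA_ARGS__);  \
} while (0)

/* The two macros as quoted in the comment above. */
#define LASSERT_ATOMIC_EQ(a, v)                                 \
do {                                                            \
        LASSERTF(cfs_atomic_read(a) == v,                       \
                 "value: %d\n", cfs_atomic_read((a)));          \
} while (0)
#define LASSERT_ATOMIC_ZERO(a)                  LASSERT_ATOMIC_EQ(a, 0)

/* Hypothetical helper: run the check against an export holding `refs`. */
static const char *check_export(int refs)
{
        struct obd_export e = { { refs } };
        struct obd_export *exp = &e;

        memset(last_failure, 0, sizeof(last_failure));
        LASSERT_ATOMIC_ZERO(&exp->exp_refcount);
        return last_failure;
}
```

In this sketch, `check_export(0)` leaves the buffer empty, while `check_export(1)` records a message ending in "failed: value: 1". The number after "value:" is simply the reference count still held when the export was being destroyed, which is why it is useful to know.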

Ned Bass (Inactive) added a comment - 18/Apr/12 5:00 PM

> So with many more crashes, any luck in getting the backtraces for the first two assertions? Any successful crashdumps?

I will review the logs and crash dumps and let you know.

> There are two imp_zombie_chain assertions, do you know which one of them did you hit? in obd_zombie_import_add or in class_import_put?

We hit the one in class_import_put.

> As for the second assertion, there is no such assertion in the code? I checked out your tree and here's what I see:

Here's the second one in our tree:

https://github.com/chaos/lustre/blob/2.1.1-3chaos/lustre/obdclass/genops.c#L717

Oleg Drokin added a comment - 17/Apr/12 10:18 PM

So with many more crashes, any luck in getting the backtraces for the first two assertions?
Any successful crashdumps?

There are two imp_zombie_chain assertions. Do you know which one you hit: the one in obd_zombie_import_add or the one in class_import_put?

As for the second assertion, there is no such assertion in the code? I checked out your tree and here's what I see:

Oleg-Drokins-MacBook-Pro-2:lustre green$ grep -r exp_refcount * | grep ASSERT
lustre/obdclass/genops.c:        LASSERT_ATOMIC_ZERO(&exp->exp_refcount);
lustre/obdclass/genops.c:        LASSERT_ATOMIC_GT_LT(&exp->exp_refcount, 0, 0x5a5a5a);
lustre/obdecho/echo_client.c:        LASSERT(cfs_atomic_read(&ec->ec_exp->exp_refcount) > 0);

The first one actually expands to LASSERTF, so I don't think it exactly matches your output?
Can you elaborate a bit more on where that might have come from?

People

Assignee: Oleg Drokin

Reporter: Ned Bass (Inactive)

Votes: 0

Watchers: 6

Dates

Created: 09/Feb/12 3:43 PM

Updated: 30/Apr/12 11:56 AM

Resolved: 30/Apr/12 11:56 AM