Lustre / LU-1085

ASSERTION(cfs_atomic_read(&exp->exp_refcount) == 0) failed


    Description

      We have multiple Lustre 2.1 OSS nodes crashing repeatedly during recovery. This is on our classified Lustre cluster, which was upgraded from 1.8 to 2.1 on Tuesday. The issue summary is one observed symptom; we have also seen these assertions appearing together:

      ASSERTION(cfs_list_empty(&imp->imp_zombie_chain)) failed
      ASSERTION(cfs_atomic_read(&exp->exp_refcount) == 0) failed

      We don't have backtraces for the assertions because STONITH kicked in before the crash dump completed.

      Other OSS nodes are crashing in kernel string-handling functions with stacks like:

      machine_kexec
      crash_kexec
      oops_end
      die
      do_general_protection
      general_protection
      (exception RIP: strlen+9)
      strlen
      string
      vsnprintf
      libcfs_debug_vmsg2
      _debug_req
      target_send_replay_msg
      target_send_reply
      ost_handle
      ptlrpc_main

      So it appears we are passing a bad value in a debug message.
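
      A minimal userspace sketch of that failure mode follows (illustration only, not Lustre code; exactly which request or export field carries the stale string here is an assumption). The stack above shows vsnprintf() resolving a "%s" argument via its string() helper, which calls strlen(), so if _debug_req() hands libcfs_debug_vmsg2() a pointer into freed or corrupted memory, the string walk faults.

      /*
       * Minimal sketch (userspace, not Lustre code): a "%s" argument that
       * points at freed memory makes the formatter's strlen() walk garbage.
       * In the kernel the freed buffer may be unmapped or poisoned, so this
       * shows up as "exception RIP: strlen"; in userspace it is undefined
       * behaviour and may or may not crash.
       */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      struct fake_export {
              char uuid[40];                  /* stands in for exp_client_uuid */
      };

      int main(void)
      {
              struct fake_export *exp = malloc(sizeof(*exp));
              char buf[128];

              strcpy(exp->uuid, "some-client-uuid@o2ib");
              free(exp);                      /* export torn down... */

              /* ...but a debug path still formats a message from it; the
               * "%s" conversion calls strlen() on the stale pointer. */
              snprintf(buf, sizeof(buf), "req from %s\n", exp->uuid);
              return 0;
      }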

      Another stack trace:

      BUG: unable to handle kernel NULL ptr dereference at 000...38
      IP: [<fffffffa0a8706>] filter_export_stats_init+0x1f1/0x500 [obdfilter]

      machine_kexec
      crash_kexec
      oops_end
      no_context
      __bad_area_nosemaphore
      bad_area_nosemaphore
      __do_page_fault
      do_page_fault
      page_fault
      filter_reconnect
      target_handle_connect
      ost_handle
      ptlrpc_main

      We have multiple symptoms here that may or may not be due to the same bug. We may need to open a separate issue to track the root cause. Note that our branch contains LU-874 patches that touched the ptlrpc queue management code, so we should be on the lookout for any races introduced there. Also note we can't send debug data from this system.

        Activity

          green Oleg Drokin added a comment -

          LASSERT_ATOMIC_ZERO is defined in terms of LASSERTF internally:

          #define LASSERT_ATOMIC_EQ(a, v)                                 \
          do {                                                            \
                  LASSERTF(cfs_atomic_read(a) == v,                       \
                           "value: %d\n", cfs_atomic_read((a)));          \
          } while (0)
          #define LASSERT_ATOMIC_ZERO(a)                  LASSERT_ATOMIC_EQ(a, 0)
          

          What was the value reported?


          nedbass Ned Bass (Inactive) added a comment -

          So with many more crashes, any luck in getting the backtraces for the first two assertions? Any successful crashdumps?

          I will review the logs and crash dumps and let you know.

          There are two imp_zombie_chain assertions, do you know which one of them did you hit? in obd_zombie_import_add or in class_import_put?

          We hit the one in class_import_put.

          As for the second assertion, I don't see any such assertion in the code. I checked out your tree and here's what I see:

          Here's the second one in our tree:

          https://github.com/chaos/lustre/blob/2.1.1-3chaos/lustre/obdclass/genops.c#L717
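
          For context on which imp_zombie_chain assertion is which, here is a condensed, approximate sketch of the two call sites Oleg asked about. This is an illustration only, not a verbatim quote of the 2.1 tree, and the zombie-list lock and list names are recalled from memory and may differ:

          /* Approximate sketch, for illustration only (not verbatim 2.1 code). */
          void class_import_put(struct obd_import *imp)
          {
                  /* site 1: the import must not already be queued for cleanup */
                  LASSERT(cfs_list_empty(&imp->imp_zombie_chain));

                  if (cfs_atomic_dec_and_test(&imp->imp_refcount))
                          /* last reference gone: hand it to the zombie thread */
                          obd_zombie_import_add(imp);
          }

          void obd_zombie_import_add(struct obd_import *imp)
          {
                  /* site 2: the same check before linking onto the zombie list */
                  LASSERT(cfs_list_empty(&imp->imp_zombie_chain));

                  cfs_spin_lock(&obd_zombie_impexp_lock);
                  cfs_list_add(&imp->imp_zombie_chain, &obd_zombie_imports);
                  cfs_spin_unlock(&obd_zombie_impexp_lock);
          }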

          green Oleg Drokin added a comment -

          So with many more crashes, any luck in getting the backtraces for the first two assertions?
          Any successful crashdumps?

          There are two imp_zombie_chain assertions, do you know which one of them did you hit? in obd_zombie_import_add or in class_import_put?

          As for the second assertion, I don't see any such assertion in the code. I checked out your tree and here's what I see:

          Oleg-Drokins-MacBook-Pro-2:lustre green$ grep -r exp_refcount * | grep ASSERT
          lustre/obdclass/genops.c:        LASSERT_ATOMIC_ZERO(&exp->exp_refcount);
          lustre/obdclass/genops.c:        LASSERT_ATOMIC_GT_LT(&exp->exp_refcount, 0, 0x5a5a5a);
          lustre/obdecho/echo_client.c:        LASSERT(cfs_atomic_read(&ec->ec_exp->exp_refcount) > 0);
          

          The first one is actually defined in terms of LASSERTF, so it does not exactly match your output, I think.
          Can you elaborate a bit more on where that might have come from?


          morrone Christopher Morrone (Inactive) added a comment -

          Ignore the comment about being in recovery. So far I don't think the logs show that.

          I've looked at a few nodes, and it looks like there is some kind of client timeout/eviction and then reconnection storm going on before the assertions.

          Generally there are tens of thousands of "haven't heard from client X in 226 seconds. I think it's dead, and I am evicting it" messages. A couple of minutes later, clients begin reconnecting in droves. There is a mix of OST "connection from" and OST "Not available for connect" messages.

          The "haven't heard from client" messages and the client connect messages are interleaved in the logs, often repeated 30,000+ times (Lustre squashes them into "previous similar messages" lines).

          And then we hit one of the two assertions listed at the beginning of this bug.

          Note that for several of the OSS nodes that I have looked at so far, the clients all seem to be from one particular cluster, which is running 1.8. (servers are all 2.1.0-24chaos).

          green Oleg Drokin added a comment -

          I wonder if that's related to LU-1166 to some degree.

          Unfortunately the current patch is not fully complete, but it does not make things worse.

          Any other messages before that crash that you can share with us?


          morrone Christopher Morrone (Inactive) added a comment -

          Apparently we hit this over 20 times on various OSS nodes of multiple filesystems over the weekend. Note that last week we re-enabled the OSS read cache (write cache is still disabled).

          The admins tell me that this was often hit while in (or perhaps shortly after) recovery. I'll need to look into why we were in recovery in the first place.


          nedbass Ned Bass (Inactive) added a comment -

          Also I guess I must ask: how was the original recovery triggered? Some unrelated crash, or was this a recovery test of some sort? (Purely for purposes of seeing if there is some other bug at play here.)

          This was on a production system, not a recovery test.

          In the case of this stack trace the server had crashed about twenty minutes earlier with ASSERTION(cfs_atomic_read(&exp->exp_refcount) == 0).

          We are still trying to understand what led to that crash, and the others in LU-[1092,1093,1094], but the factors at play seem to be high server load, slow server response, and many clients dropping their connection to the server and reconnecting. There was a four hour window in which we had 16 crashes involving 5 OSS nodes. As mentioned above, things seemed to stabilize after we disabled the OSS read and writethrough caches.

          green Oleg Drokin added a comment -

          Also I guess I must ask: how was the original recovery triggered? Some unrelated crash, or was this a recovery test of some sort? (Purely for purposes of seeing if there is some other bug at play here.)


          nedbass Ned Bass (Inactive) added a comment -

          It was lustre/obdclass/genops.c:728:

           711 /* Export management functions */
           712 static void class_export_destroy(struct obd_export *exp)
           713 {
           714         struct obd_device *obd = exp->exp_obd;
           715         ENTRY;
           716 
           717         LASSERT_ATOMIC_ZERO(&exp->exp_refcount);
           718 
           719         CDEBUG(D_IOCTL, "destroying export %p/%s for %s\n", exp,
           720                exp->exp_client_uuid.uuid, obd->obd_name);
           721 
           722         LASSERT(obd != NULL);
           723 
           724         /* "Local" exports (lctl, LOV->{mdc,osc}) have no connection. */
           725         if (exp->exp_connection)
           726                 ptlrpc_put_connection_superhack(exp->exp_connection);
           727 
           728         LASSERT(cfs_list_empty(&exp->exp_outstanding_replies));
           729         LASSERT(cfs_list_empty(&exp->exp_uncommitted_replies));
           730         LASSERT(cfs_list_empty(&exp->exp_req_replay_queue));
           731         LASSERT(cfs_list_empty(&exp->exp_hp_rpcs));
           732         obd_destroy_export(exp);
           733         class_decref(obd, "export", exp);
           734 
           735         OBD_FREE_RCU(exp, sizeof(*exp), &exp->exp_handle);
           736         EXIT;
           737 }
          
          green Oleg Drokin added a comment -

          For the zombie backtrace, what was the line number from which lbug_with_loc was called? I know you did not get the LASSERT in the logs, but you can run gdb on obdclass.ko and then "l *(obd_zombie_impexp_cull+0x...)" (the +0x... value from the backtrace) and it will tell you the line number.
          Hm, I just noticed that neither obd_zombie_impexp_cull nor obd_zombie_impexp_thread has any LBUGs or LASSERTs.
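
          For reference, that lookup looks roughly like this (the module path is a placeholder for wherever the debuginfo-matching obdclass.ko lives, and the +0x... offset is taken from the LBUG backtrace; no output is shown since none was captured here):

          $ gdb lustre/obdclass/obdclass.ko
          (gdb) list *(obd_zombie_impexp_cull+0x...)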


          nedbass Ned Bass (Inactive) added a comment -

          Opened the following new issues:

          • LU-1092 NULL pointer dereference in filter_export_stats_init()
          • LU-1093 unable to handle kernel paging request in target_handle_connect()
          • LU-1094 general protection fault in _debug_req()

          People

            green Oleg Drokin
            nedbass Ned Bass (Inactive)
            Votes: 0
            Watchers: 6
