[LU-1432] Race condition between lprocfs_exp_setup() and lprocfs_free_per_client_stats() causes LBUG - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.3.0, Lustre 2.1.3
Affects Version/s: None
Labels:
None
Environment:
FEFS based on Lustre 1.8.5, RIKEN K-computer

Severity:
3
Rank (Obsolete):
4541

Description

We've often happened to see the below LBUG until now, nevertheless we already applied the putch of ~~LU-39~~ to FEFS.

        LASSERTF(atomic_read(&client_stat->nid_exp_ref_count) == 0,
                 "nid %s:count %d\n", libcfs_nid2str(client_stat->nid),
                 atomic_read(&client_stat->nid_exp_ref_count));

Then, today, I finally found out the race condition between lprocfs_exp_setup() and lprocfs_free_per_client_stats(). And as far as I can see, it appears that the LBUG could be the case in Lustre-1.8 and Lustre-2.x. So I decided to report it here.

the account of the case is below.

umount command is executed and set obd_stopping 1 in class_cleanup()

but some ll_ost_xxx threads which have already passed the below if-statement in target_handle_connect() can go ahead anyway.

int target_handle_connect(struct ptlrpc_request *req, svc_handler_t handler)
{

        ...
        if (!target || target->obd_stopping || !target->obd_set_up) {
                LCONSOLE_ERROR_MSG(0x137, "UUID '%s' is not available "
                                   " for connect (%s)\n", str,
                                   !target ? "no target" :
                                   (target->obd_stopping ? "stopping" :
                                   "not set up"));
                GOTO(out, rc = -ENODEV);
        }
        ...

}

the threads will reach out the below codes

int lprocfs_exp_setup(struct obd_export *exp, lnet_nid_t *nid, int *newnid)
{
        ...
        old_stat = lustre_hash_findadd_unique(obd->obd_nid_stats_hash,
                                              nid, &new_stat->nid_hash);
        CDEBUG(D_INFO, "Found stats %p for nid %s - ref %d\n",
               old_stat, libcfs_nid2str(*nid),
               atomic_read(&new_stat->nid_exp_ref_count));

        /* we need to release old stats because lprocfs_exp_cleanup() hasn't
         * been and will never be called. */
        if (exp->exp_nid_stats) {
                nidstat_putref(exp->exp_nid_stats);
                exp->exp_nid_stats = NULL;
        }
        ...
}

if exp->exp_nid_stats is not NULL, lustre_hash_findadd_unique() increments the ref count of old_stat and returns it. So, it means that the old_stat is temporalily one bigger, until nidstat_putref() is called.

Then, at the same timing, if the umount is running in lprocfs_free_client_stats() and is handling the same stats as the above old_stat. it causes the LBUG beause the nid_stat ref count is one bigger than expected.

You can cause the same problem when you add some waiting code into the part of lprocfs_exp_setup() and execute umount command. in this way, I've actually caused the problem in purpose.

and now, I'm creating the patch to the issue and I'll upload that soon but it's for 1.8.8.
I'm really sorry that I don't have the machine for Lustre-2.x X(

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

lu-1432.debug.log
5.59 MB
19/Jun/12 2:05 AM

Issue Links

Trackbacks

Lustre 1.8.x known issues tracker While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA

Changelog 2.1 Changes from version 2.1.2 to version 2.1.3 Server support for kernels: 2.6.18308.13.1.el5 (RHEL5) 2.6.32279.2.1.el6 (RHEL6) Client support for unpatched kernels: 2.6.18308.13.1.el5 (RHEL5) 2.6.32279.2.1....

Activity

[LU-1432] Race condition between lprocfs_exp_setup() and lprocfs_free_per_client_stats() causes LBUG

Lai Siyao added a comment - 19/Jun/12 11:40 PM

Hmm, one possible fix for this issue is to serialize target obd device cleanup and connect operation. I'll consider it.

As for request not taking export refcount in 2.x, it happens to be correct:

mdt_obd_connect() takes one extra export refcount.

But fix it the way I do is more clear, I will fix it in 2.x later.

Lai Siyao added a comment - 19/Jun/12 11:40 PM Hmm, one possible fix for this issue is to serialize target obd device cleanup and connect operation. I'll consider it. As for request not taking export refcount in 2.x, it happens to be correct: mdt_obd_connect() takes one extra export refcount. But fix it the way I do is more clear, I will fix it in 2.x later.

Hiroya Nozaki (Inactive) added a comment - 19/Jun/12 11:39 PM

I added a new member, obd_conn_inprogress into obd_device to count the number of threads which are doing connecting in target_handle_connect(), and do like below pseudo code. so far, it looks working well. So, I'll upload the patch soon.

target_handle_connect() {
        ...
        atomic_inc(obd_conn_inprogress);
        // do connect
        atomic_dec(obd_conn_inprogress);
        ...
}
 
class_cleanup() {
        ...
        while (atomic_read(obd_conn_inprogress) > 0) {
                cfs_cond_resched();
        }
        lustre_hash_exit(obd_nid_hash);
        ...
}

Hiroya Nozaki (Inactive) added a comment - 19/Jun/12 11:39 PM I added a new member, obd_conn_inprogress into obd_device to count the number of threads which are doing connecting in target_handle_connect(), and do like below pseudo code. so far, it looks working well. So, I'll upload the patch soon. target_handle_connect() { ... atomic_inc(obd_conn_inprogress); // do connect atomic_dec(obd_conn_inprogress); ... } class_cleanup() { ... while (atomic_read(obd_conn_inprogress) > 0) { cfs_cond_resched(); } lustre_hash_exit(obd_nid_hash); ... }

Hiroya Nozaki (Inactive) added a comment - 19/Jun/12 9:26 PM - edited

But when the case, there could be a case, and actually I've encountered the case, that target_handle_connect() touch the obd_nid_hash which have been already freed by class_cleanup(). So we should remove the race condition this time.

and ... I'm afraid to say that I'm still wondering whether the class_export_get() is right or not. The original code, "req->rq_export = export" instead of class_export_get(), looks obviously wrong to me too but it had exited so far, and it has existed in current Lustre-2.1 too. so I think there is a special considering to do so. do you think so?

Hiroya Nozaki (Inactive) added a comment - 19/Jun/12 9:26 PM - edited But when the case, there could be a case, and actually I've encountered the case, that target_handle_connect() touch the obd_nid_hash which have been already freed by class_cleanup(). So we should remove the race condition this time. and ... I'm afraid to say that I'm still wondering whether the class_export_get() is right or not. The original code, "req->rq_export = export" instead of class_export_get(), looks obviously wrong to me too but it had exited so far, and it has existed in current Lustre-2.1 too. so I think there is a special considering to do so. do you think so?

Lai Siyao added a comment - 19/Jun/12 3:15 AM

IMO this is not a bug, please see below log snippet:

00000020:00020000:10:1340084622.090102:0:15134:0:(lprocfs_status.c:1733:lprocfs_exp_setup()) ===== WAIT END =====
...
00002000:00000001:2:1340084622.090148:0:15134:0:(filter.c:2554:filter_reconnect()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:2:1340084622.090149:0:15134:0:(obd_class.h:775:obd_reconnect()) Process leaving (rc=0 : 0 : 0)
00010000:00000040:2:1340084622.090150:0:15134:0:(ldlm_lib.c:1020:target_handle_connect()) GETting export ffff81063ed80e00 : new refcount 2
...
00000100:00000040:2:1340084622.090167:0:15134:0:(connection.c:98:ptlrpc_connection_get()) conn=ffff8105f150cac0 refcount 3 to 192.168.128.38@o2ib
00000001:00000001:2:1340084622.090169:0:15134:0:(class_hash.c:238:lustre_hash_add()) Process entered
00000020:00000040:2:1340084622.090170:0:15134:0:(obd_config.c:1518:nid_export_get()) GETting export ffff81063ed80e00 : new refcount 3
00000020:00000001:2:1340084622.090172:0:15134:0:(obd_config.c:1520:nid_export_get()) Process leaving (rc=18446604462556974592 : -139611152577024 : ffff81063ed80e00)
00000001:00000001:2:1340084622.090174:0:15134:0:(class_hash.c:257:lustre_hash_add()) Process leaving

The cause of umount hang is that target has run .cleanup (client disconnected already), but the 'connect' is after that, and export is not in nid_hash, so that this export is added into obd_nid_hash. And it takes one refcount of this export, but since server has done .cleanup already, and is waiting in a loop in server_wait_finish(), it will always hang there. However server_wait_finish() checks signals, so by any chance you may send a SIGKILL to this process, and manually do `umount ...` again; it may finish successfully.

Lai Siyao added a comment - 19/Jun/12 3:15 AM IMO this is not a bug, please see below log snippet: 00000020:00020000:10:1340084622.090102:0:15134:0:(lprocfs_status.c:1733:lprocfs_exp_setup()) ===== WAIT END ===== ... 00002000:00000001:2:1340084622.090148:0:15134:0:(filter.c:2554:filter_reconnect()) Process leaving (rc=0 : 0 : 0) 00010000:00000001:2:1340084622.090149:0:15134:0:(obd_class.h:775:obd_reconnect()) Process leaving (rc=0 : 0 : 0) 00010000:00000040:2:1340084622.090150:0:15134:0:(ldlm_lib.c:1020:target_handle_connect()) GETting export ffff81063ed80e00 : new refcount 2 ... 00000100:00000040:2:1340084622.090167:0:15134:0:(connection.c:98:ptlrpc_connection_get()) conn=ffff8105f150cac0 refcount 3 to 192.168.128.38@o2ib 00000001:00000001:2:1340084622.090169:0:15134:0:(class_hash.c:238:lustre_hash_add()) Process entered 00000020:00000040:2:1340084622.090170:0:15134:0:(obd_config.c:1518:nid_export_get()) GETting export ffff81063ed80e00 : new refcount 3 00000020:00000001:2:1340084622.090172:0:15134:0:(obd_config.c:1520:nid_export_get()) Process leaving (rc=18446604462556974592 : -139611152577024 : ffff81063ed80e00) 00000001:00000001:2:1340084622.090174:0:15134:0:(class_hash.c:257:lustre_hash_add()) Process leaving The cause of umount hang is that target has run .cleanup (client disconnected already), but the 'connect' is after that, and export is not in nid_hash, so that this export is added into obd_nid_hash. And it takes one refcount of this export, but since server has done .cleanup already, and is waiting in a loop in server_wait_finish(), it will always hang there. However server_wait_finish() checks signals, so by any chance you may send a SIGKILL to this process, and manually do `umount ...` again; it may finish successfully.

Hiroya Nozaki (Inactive) added a comment - 19/Jun/12 2:05 AM

Hi, Lai. this is the debug log you demanded. pid 15133 is ll_ost thread and pid 15400 is umount thread.

the environment is the below
mds :1
oss (ost x1):1
client :1

Finally, umount-thread continues to display the sort of messages like this and never return.

(obd_mount.c:1401:server_wait_finished()) Mount still busy with 7 refs after 90 secs.

so, I think some request remains and

Hiroya Nozaki (Inactive) added a comment - 19/Jun/12 2:05 AM Hi, Lai. this is the debug log you demanded. pid 15133 is ll_ost thread and pid 15400 is umount thread. the environment is the below mds :1 oss (ost x1):1 client :1 Finally, umount-thread continues to display the sort of messages like this and never return. (obd_mount.c:1401:server_wait_finished()) Mount still busy with 7 refs after 90 secs. so, I think some request remains and

Hiroya Nozaki (Inactive) added a comment - 11/Jun/12 6:24 AM

I had wondered why filter_cleanup() is executed while some export objects of obdfilter exist, and finally I realized that this is not a problem of race condition but a problem of refcount of export object, almost same as ~~LU-1092~~. So I created the 1.8 patch from the ~~LU-1092~~ one.

Hiroya Nozaki (Inactive) added a comment - 11/Jun/12 6:24 AM I had wondered why filter_cleanup() is executed while some export objects of obdfilter exist, and finally I realized that this is not a problem of race condition but a problem of refcount of export object, almost same as LU-1092 . So I created the 1.8 patch from the LU-1092 one.

Hiroya Nozaki (Inactive) added a comment - 06/Jun/12 4:47 AM

I added some spin_trylock() in the above patch to fix the problem, but I realized lustre_hash_findadd_unique() calls kmalloc() its inside. so I have to change these into something resembled. It's easy to fix if I change spin locks into semaphores and add a new semaphore into struct obd_device. But I wonder I'm allowed to do so because I know the struct is a really basic and important structure.

Hiroya Nozaki (Inactive) added a comment - 06/Jun/12 4:47 AM I added some spin_trylock() in the above patch to fix the problem, but I realized lustre_hash_findadd_unique() calls kmalloc() its inside. so I have to change these into something resembled. It's easy to fix if I change spin locks into semaphores and add a new semaphore into struct obd_device. But I wonder I'm allowed to do so because I know the struct is a really basic and important structure.

Hiroya Nozaki (Inactive) added a comment - 23/May/12 8:09 AM

http://review.whamcloud.com/#change,2888

Hiroya Nozaki (Inactive) added a comment - 23/May/12 8:09 AM http://review.whamcloud.com/#change,2888

People

Assignee:: Lai Siyao

Reporter:: Hiroya Nozaki (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 23/May/12 7:12 AM

Updated:: 29/Nov/12 9:07 PM

Resolved:: 02/Aug/12 9:56 AM