[LU-2889] There's a race between starting and stopping service threads Created: 28/Feb/13  Updated: 16/Oct/13  Resolved: 16/Oct/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.5, Lustre 1.8.9
Fix Version/s: Lustre 2.5.0

Type: Bug Priority: Critical
Reporter: Hiroya Nozaki Assignee: Keith Mannthey (Inactive)
Resolution: Fixed Votes: 0
Labels: mn1, mn8, patch
Environment:

no specific environment is needed


Severity: 3
Rank (Obsolete): 6971

 Description   

When ptlrpc_start_thread fails to create a new thread, it will finalize and free a struct ptlrpc_thread created and used here. Considering this, it can be problem when ptlrpc_svcpt_stop_thread is driven and handles the struct ptlrpc_thread right after or right before failure of cfs_create_thread.

This situation let the both of ptlrpc_start_thread and ptlrpc_svcpt_stop_threads access the freed ptlrpc_thread and cause OS panic.

ptlrpc_thread_start
int ptlrpc_start_thread(struct ptlrpc_service_part *svcpt, int wait)
{
        ...

        spin_lock(&svcpt->scp_lock);

        ...

        cfs_list_add(&thread->t_link, &svcpt->scp_threads);
        spin_unlock(&svcpt->scp_lock);

        if (svcpt->scp_cpt >= 0) {
                snprintf(thread->t_name, PTLRPC_THR_NAME_LEN, "%s%02d_%03d",
                         svc->srv_thread_name, svcpt->scp_cpt, thread->t_id);
        } else {
                snprintf(thread->t_name, PTLRPC_THR_NAME_LEN, "%s_%04d",
                         svc->srv_thread_name, thread->t_id);
        }

        CDEBUG(D_RPCTRACE, "starting thread '%s'\n", thread->t_name);
        /*
         * CLONE_VM and CLONE_FILES just avoid a needless copy, because we
         * just drop the VM and FILES in cfs_daemonize_ctxt() right away.
         */
        rc = cfs_create_thread(ptlrpc_main, thread, CFS_DAEMON_FLAGS);
        if (rc < 0) {
                CERROR("cannot start thread '%s': rc %d\n",
                       thread->t_name, rc); 
                                            //////////////////////////////////////
                                            // <---- let's say when
                                            // ptlrpc_svcpt_stop_thread is driven here
                                            //////////////////////////////////////
                spin_lock(&svcpt->scp_lock);
                cfs_list_del(&thread->t_link);
                --svcpt->scp_nthrs_starting;
                spin_unlock(&svcpt->scp_lock);

                OBD_FREE(thread, sizeof(*thread));
                RETURN(rc);
        }

    ...

}
ptlrpc_svcpt_stop_threads
static void ptlrpc_svcpt_stop_threads(struct ptlrpc_service_part *svcpt)
{
        struct l_wait_info      lwi = { 0 };
        struct ptlrpc_thread    *thread;
        CFS_LIST_HEAD           (zombie);

        ENTRY;

        CDEBUG(D_INFO, "Stopping threads for service %s\n",
               svcpt->scp_service->srv_name);

        spin_lock(&svcpt->scp_lock);
        /* let the thread know that we would like it to stop asap */
        list_for_each_entry(thread, &svcpt->scp_threads, t_link) {
                CDEBUG(D_INFO, "Stopping thread %s #%u\n",
                       svcpt->scp_service->srv_thread_name, thread->t_id);
                thread_add_flags(thread, SVC_STOPPING);
        }

        cfs_waitq_broadcast(&svcpt->scp_waitq);

        while (!cfs_list_empty(&svcpt->scp_threads)) {
                thread = cfs_list_entry(svcpt->scp_threads.next,
                                        struct ptlrpc_thread, t_link);
                if (thread_is_stopped(thread)) {
                        cfs_list_del(&thread->t_link);
                        cfs_list_add(&thread->t_link, &zombie);
                        continue;
                }
                spin_unlock(&svcpt->scp_lock);

                CDEBUG(D_INFO, "waiting for stopping-thread %s #%u\n",
                       svcpt->scp_service->srv_thread_name, thread->t_id);
                l_wait_event(thread->t_ctl_waitq,
                             thread_is_stopped(thread), &lwi);

                spin_lock(&svcpt->scp_lock);
        }

        spin_unlock(&svcpt->scp_lock);

        while (!cfs_list_empty(&zombie)) {
                thread = cfs_list_entry(zombie.next,
                                        struct ptlrpc_thread, t_link);
                cfs_list_del(&thread->t_link);
                OBD_FREE_PTR(thread);
        }
        EXIT;
}


 Comments   
Comment by Hiroya Nozaki [ 28/Feb/13 ]

The patch for master
http://review.whamcloud.com/#change,5552

Comment by Hiroya Nozaki [ 28/Feb/13 ]

Here is the patch for b2_1
http://review.whamcloud.com/5557

BTW, I'll upload the patch for b1_8 after completing the review for b2_1. Because the patch for b1_8 will be almost the same as b2_1

Comment by Keith Mannthey (Inactive) [ 09/Aug/13 ]

Master patch has been landed.

Comment by Peter Jones [ 16/Oct/13 ]

Fix landed for 2.5.0. Fix will also be tracked as a candidate for inclusion in any future 1.8.x/2.1.x releases

Generated at Sat Feb 10 01:29:05 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.