[LU-4985] Return value of kthread_run truncated to 32-bit int in ll_sa thread and elsewhere Created: 30/Apr/14  Updated: 05/Jun/14  Resolved: 14/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Minor
Reporter: Ryan Haasken Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 13809

 Description   

Several places in the Lustre code store the result of kthread_run, which is of 64-bit type task_struct *, in a 32-bit integer. Depending on the value of the pointer, this will be interpreted as an error by the IS_ERR_VALUE macro.

This problem exists, for example, with the creation of statahead threads, ll_sa.

static int start_statahead_thread(struct inode *dir, struct dentry *dentry)
{
...
        int rc;
...
        rc = PTR_ERR(kthread_run(ll_statahead_thread, dentry->d_parent,
                                 "ll_sa_%u", lli->lli_opendir_pid));
        thread = &sai->sai_thread;
        if (IS_ERR_VALUE(rc)) {
                CERROR("can't start ll_sa thread, rc: %d\n", rc);

                spin_lock(&lli->lli_sa_lock);
                thread_set_flags(thread, SVC_STOPPED);
                thread_set_flags(&sai->sai_agl_thread, SVC_STOPPED);
                spin_unlock(&lli->lli_sa_lock);

                ll_sai_put(sai);
                GOTO(out, rc = -EAGAIN);
        }
...

This can cause a general protection fault if the sai struct is accessed by the ll_sa thread after it has been freed. Here is an example in which the return value from kthread_run is incorrectly interpreted as an error:

2014-04-18T04:37:49.706791-05:00 c0-0c1s2n3 LustreError: 11453:0:(statahead.c:1588:start_statahead_thread()) can't start ll_sa thread, rc: -3968

The -3968 which is reported as the rc of the kthread_run is not a valid error number. It is the result of truncating the returned task_struct pointer to a 32-bit int.

There are other places in Lustre which handle the return value of kthread_run incorrectly. We need to go through and make sure the return value of kthread_run is handled correctly in all cases. The correct way to handle it would be like this:

struct task_struct *task = kthread_run(...)

if (IS_ERR(task)) {
        rc = PTR_ERR(task);
        CERROR(..)
}


 Comments   
Comment by Jodi Levi (Inactive) [ 30/Apr/14 ]

John,
Can you look at this one and comment on the priority?
Thank you!

Comment by John Hammond [ 30/Apr/14 ]

This looks like 2.5.*. Is that correct Ryan?

This is fixed in 2.6.0 by http://review.whamcloud.com/#/c/6759/ for LU-3498.

I would say major/critical. Most of http://review.whamcloud.com/#/c/6759/ can probably be ported to 2.5 without too much fuss.

Comment by Ryan Haasken [ 30/Apr/14 ]

John, that's correct, it's 2.5.*.

I didn't see LU-3498 before, but it looks like http://review.whamcloud.com/#/c/6759/ fixes most of the kthread_run calls. Should we port that to b2_5 then?

It looks like there are still a few remaining spots that need fixing in master as well:

  File              Function                              Line
0 lfsck_layout.c    lfsck_layout_master_prep              4318 rc = PTR_ERR(kthread_run(lfsck_layout_assistant, lta, "lfsck_layout"));
1 lfsck_lib.c       lfsck_start                           2238 rc = PTR_ERR(kthread_run(lfsck_master_engine, lta, "lfsck"));
2 lfsck_namespace.c lfsck_namespace_double_scan           1534 rc = PTR_ERR(kthread_run(lfsck_namespace_double_scan_main, lta,
3 ofd_io.c          ofd_start_inconsistency_verification_  230 rc = PTR_ERR(kthread_run(ofd_inconsistency_verification_main, ofd,
Comment by Ryan Haasken [ 30/Apr/14 ]

Actually, I take back the comment about the four calls to kthread_run above. In each of those cases, rc is a long int, so it should be okay.

Comment by Ryan Haasken [ 30/Apr/14 ]

Here is the port for b2_5: http://review.whamcloud.com/#/c/10178/

Comment by Andreas Dilger [ 01/May/14 ]

It wouldn't be terrible to fix the four remaining cases to assign the return value to a task_struct pointer for consistency with the other callers.

Comment by Ryan Haasken [ 08/May/14 ]

Here's a patch against master to change the remaining four cases to match the rest: http://review.whamcloud.com/#/c/10266/

Comment by James A Simmons [ 14/May/14 ]

Patch landed to master. Ticket can be closed.

Comment by Peter Jones [ 14/May/14 ]

Landed for 2.6

Comment by Ryan Haasken [ 14/May/14 ]

The b2_5 port of the patch which fixed LU-3498 is ready to land as well:

http://review.whamcloud.com/#/c/10178/

Generated at Sat Feb 10 01:47:33 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.