[LU-5417] lod_load_striping_locked() fails to detect errors from lod_get_lmv_ea() Created: 25/Jul/14  Updated: 30/May/15  Resolved: 03/Sep/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Minor
Reporter: John Hammond Assignee: Dmitry Eremin (Inactive)
Resolution: Fixed Votes: 0
Labels: fault, lod

Issue Links:
Related
is related to LU-6506 Always false comparison in osc_init_g... Open
Severity: 3
Rank (Obsolete): 15066

 Description   

In lod_load_striping_locked() if lod_get_lmv_ea() fails then we fail to detect this due to a signed to unsigned comparison bug.

        } else if (S_ISDIR(lu_object_attr(lod2lu_obj(lo)))) {
                rc = lod_get_lmv_ea(env, lo);
                if (rc < sizeof(struct lmv_mds_md_v1))
                        GOTO(out, rc = rc > 0 ? -EINVAL : rc);
                buf->lb_buf = info->lti_ea_store;
                buf->lb_len = info->lti_ea_store_size;
                if (rc == sizeof(struct lmv_mds_md_v1)) {
                                ...
                }

                /*
                 * there is LOV EA (striping information) in this object
                 * let's parse it and create in-core objects for the stripes
                 */
                rc = lod_parse_dir_striping(env, lo, buf);
}

This causes a subsequent NULL pointer dereference in lod_parse_dir_striping():

[ 1402.003864] BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
[ 1402.004401] IP: [<ffffffffa0d181a1>] lod_parse_dir_striping+0x101/0x730 [lod]
[ 1402.004401] PGD 1e4e32067 PUD 1f2b69067 PMD 0
[ 1402.004401] Oops: 0000 [#1] SMP
...
[ 1402.004401] Pid: 6183, comm: mdt01_001 Not tainted 2.6.32-431.5.1.el6.lustre.x86_64 #1 Bochs Bochs
[ 1402.004401] RIP: 0010:[<ffffffffa0d181a1>]  [<ffffffffa0d181a1>] lod_parse_dir_striping+0x101/0x730 [lod]
[ 1402.004401] RSP: 0018:ffff8801f318d9c0  EFLAGS: 00010286
[ 1402.004401] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
[ 1402.004401] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8802168d4418
[ 1402.004401] RBP: ffff8801f318da30 R08: 0000000000000000 R09: 0000000000000001
[ 1402.004401] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8801f2b50000
[ 1402.004401] R13: ffff8801f3197a68 R14: ffff8801f30c6b38 R15: ffff8801f3197a78
[ 1402.004401] FS:  0000000000000000(0000) GS:ffff88002fe00000(0000) knlGS:00000000000000\
00
[ 1402.004401] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 1402.004401] CR2: 000000000000000c CR3: 00000001e4d68000 CR4: 00000000000006e0
[ 1402.004401] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1402.004401] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1402.004401] Process mdt01_001 (pid: 6183, threadinfo ffff8801f318c000, task ffff8801f318a3c0)
[ 1402.004401] Stack:
[ 1402.004401]  fffffffffffffff2 00000000fffffffe ffff8801f318d9f0 ffff8801f3197a98
[ 1402.004401] <d> ffff8801f30c6b38 ffff8801f2b3bb58 ffff8801f318da30 ffffffffa0d0c314
[ 1402.004401] <d> ffffffffa0afb6ef ffff8801f2b3bb58 00000000fffffff2 ffff8801f3197a68
[ 1402.004401] Call Trace:
[ 1402.004401]  [<ffffffffa0d0c314>] ? lod_get_ea+0x514/0x520 [lod]
[ 1402.004401]  [<ffffffffa0afb6ef>] ? osd_object_write_lock+0x9f/0x130 [osd_ldiskfs]
[ 1402.004401]  [<ffffffffa0d0c674>] lod_load_striping_locked+0x354/0x5d0 [lod]
[ 1402.004401]  [<ffffffffa0d0c959>] lod_load_striping+0x69/0x190 [lod]
[ 1402.004401]  [<ffffffffa0d2066e>] lod_declare_attr_set+0x26e/0x760 [lod]
[ 1402.004401]  [<ffffffffa0be8378>] mdd_unlink+0x448/0xe80 [mdd]
[ 1402.004401]  [<ffffffffa0c5050a>] ? mdt_reint_unlink+0x9ca/0x10b0 [mdt]
[ 1402.004401]  [<ffffffffa02cf001>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[ 1402.004401]  [<ffffffffa0c47628>] mdo_unlink+0x18/0x50 [mdt]
[ 1402.004401]  [<ffffffffa0c50544>] mdt_reint_unlink+0xa04/0x10b0 [mdt]
[ 1402.004401]  [<ffffffffa0c473c1>] mdt_reint_rec+0x41/0xe0 [mdt]
[ 1402.004401]  [<ffffffffa0c2cc63>] mdt_reint_internal+0x4c3/0x7c0 [mdt]
[ 1402.004401]  [<ffffffffa0c2d4eb>] mdt_reint+0x6b/0x120 [mdt]
[ 1402.004401]  [<ffffffffa06f1445>] tgt_request_handle+0x245/0xad0 [ptlrpc]
[ 1402.004401]  [<ffffffffa06a1e01>] ptlrpc_main+0xce1/0x1960 [ptlrpc]
[ 1402.004401]  [<ffffffffa06a1120>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[ 1402.004401]  [<ffffffff8109eab6>] kthread+0x96/0xa0
[ 1402.004401]  [<ffffffff8100c30a>] child_rip+0xa/0x20
[ 1402.004401]  [<ffffffff81554710>] ? _spin_unlock_irq+0x30/0x40
[ 1402.004401]  [<ffffffff8100bb10>] ? restore_args+0x0/0x30
[ 1402.004401]  [<ffffffff8109ea20>] ? kthread+0x0/0xa0
[ 1402.004401]  [<ffffffff8100c300>] ? child_rip+0x0/0x20

This issue was found through DT API fault injection.



 Comments   
Comment by Oleg Drokin [ 25/Jul/14 ]

So this might be a broder class of problems that we need to do a separate audit of this.
Idid some grepping around and there's plenty of "if (rc < sizeof ....)" kind of code. Closer look shows that most of cases are ok because there's usually a preceeding "if (rc < 0)" before that statement.
But theern there's e.g. lod_xattr_get that while correct, is correct for non obvious reasons.

I wonder if we should run some static tool for signed/unsigned comparison and ensure all matches are checked for correctness/fixed/made more obvious?

Comment by Andreas Dilger [ 25/Jul/14 ]

Ouch, it took me a few looks to even see what the problem is, but Oleg confirmed that rc is "promoted" to unsigned int to match sizeof() and does the comparison as an unsigned value.

Comment by Peter Jones [ 25/Jul/14 ]

Dmitry

Could you please look into this one?

Thanks

Peter

Comment by Dmitry Eremin (Inactive) [ 28/Jul/14 ]

Uff. I observed few other places with such weird code. For example, in tgt_handle_lfsck_query() we have the following:

reply->lr_status = tgt_lfsck_query(tsi->tsi_env, tsi->tsi_tgt->lut_bottom, request);
if (reply->lr_status < 0)
	rc = reply->lr_status;
RETURN(rc);

The reply->lr_status is unsigned int. Therefore if is always false. So, we need a more careful look through such kind of code.

Comment by John Hammond [ 28/Jul/14 ]

You could do

make EXTRA_POST_CFLAGS='-Wsign-compare -Wno-error 2>&1 1> /dev/null | grep warning | sort | uniq

There are only 1059 comparisons to check!

Comment by Dmitry Eremin (Inactive) [ 31/Jul/14 ]

Originally I got 1084 warnings for current master. After my patch http://review.whamcloud.com/11281/ it's reduced to 1017. And lod now have only 3 warnings:

lustre-release/lustre/lod/lod_qos.c:1169: warning: comparison between signed and unsigned integer expressions
lustre-release/lustre/lod/lod_qos.c:182: warning: comparison between signed and unsigned integer expressions
lustre-release/lustre/lod/lod_qos.c:711: warning: comparison between signed and unsigned integer expressions
Comment by Dmitry Eremin (Inactive) [ 04/Aug/14 ]

After series of pathes all warnings from lustre/lod/* goes away and total amount of warnings become 1004.

Comment by Gerrit Updater [ 27/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/13903
Subject: Revert "LU-5417 lfs: fix comparison between signed and unsigned"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 91bfc8b2f8c37ea8d3fbdb059e3c51a8164ca8a6

Comment by Gerrit Updater [ 27/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13903/
Subject: Revert "LU-5417 lfs: fix comparison between signed and unsigned"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bbadfbefd9e0323172ad0a37b4268e68bf0968b7

Generated at Sat Feb 10 01:51:19 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.