[LU-5417] lod_load_striping_locked() fails to detect errors from lod_get_lmv_ea() Created: 25/Jul/14 Updated: 30/May/15 Resolved: 03/Sep/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | John Hammond | Assignee: | Dmitry Eremin (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | fault, lod | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 15066 | ||||||||
| Description |
|
In lod_load_striping_locked() if lod_get_lmv_ea() fails then we fail to detect this due to a signed to unsigned comparison bug. } else if (S_ISDIR(lu_object_attr(lod2lu_obj(lo)))) { rc = lod_get_lmv_ea(env, lo); if (rc < sizeof(struct lmv_mds_md_v1)) GOTO(out, rc = rc > 0 ? -EINVAL : rc); buf->lb_buf = info->lti_ea_store; buf->lb_len = info->lti_ea_store_size; if (rc == sizeof(struct lmv_mds_md_v1)) { ... } /* * there is LOV EA (striping information) in this object * let's parse it and create in-core objects for the stripes */ rc = lod_parse_dir_striping(env, lo, buf); } This causes a subsequent NULL pointer dereference in lod_parse_dir_striping(): [ 1402.003864] BUG: unable to handle kernel NULL pointer dereference at 000000000000000c [ 1402.004401] IP: [<ffffffffa0d181a1>] lod_parse_dir_striping+0x101/0x730 [lod] [ 1402.004401] PGD 1e4e32067 PUD 1f2b69067 PMD 0 [ 1402.004401] Oops: 0000 [#1] SMP ... [ 1402.004401] Pid: 6183, comm: mdt01_001 Not tainted 2.6.32-431.5.1.el6.lustre.x86_64 #1 Bochs Bochs [ 1402.004401] RIP: 0010:[<ffffffffa0d181a1>] [<ffffffffa0d181a1>] lod_parse_dir_striping+0x101/0x730 [lod] [ 1402.004401] RSP: 0018:ffff8801f318d9c0 EFLAGS: 00010286 [ 1402.004401] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000 [ 1402.004401] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8802168d4418 [ 1402.004401] RBP: ffff8801f318da30 R08: 0000000000000000 R09: 0000000000000001 [ 1402.004401] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8801f2b50000 [ 1402.004401] R13: ffff8801f3197a68 R14: ffff8801f30c6b38 R15: ffff8801f3197a78 [ 1402.004401] FS: 0000000000000000(0000) GS:ffff88002fe00000(0000) knlGS:00000000000000\ 00 [ 1402.004401] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b [ 1402.004401] CR2: 000000000000000c CR3: 00000001e4d68000 CR4: 00000000000006e0 [ 1402.004401] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1402.004401] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 1402.004401] Process mdt01_001 (pid: 6183, threadinfo ffff8801f318c000, task ffff8801f318a3c0) [ 1402.004401] Stack: [ 1402.004401] fffffffffffffff2 00000000fffffffe ffff8801f318d9f0 ffff8801f3197a98 [ 1402.004401] <d> ffff8801f30c6b38 ffff8801f2b3bb58 ffff8801f318da30 ffffffffa0d0c314 [ 1402.004401] <d> ffffffffa0afb6ef ffff8801f2b3bb58 00000000fffffff2 ffff8801f3197a68 [ 1402.004401] Call Trace: [ 1402.004401] [<ffffffffa0d0c314>] ? lod_get_ea+0x514/0x520 [lod] [ 1402.004401] [<ffffffffa0afb6ef>] ? osd_object_write_lock+0x9f/0x130 [osd_ldiskfs] [ 1402.004401] [<ffffffffa0d0c674>] lod_load_striping_locked+0x354/0x5d0 [lod] [ 1402.004401] [<ffffffffa0d0c959>] lod_load_striping+0x69/0x190 [lod] [ 1402.004401] [<ffffffffa0d2066e>] lod_declare_attr_set+0x26e/0x760 [lod] [ 1402.004401] [<ffffffffa0be8378>] mdd_unlink+0x448/0xe80 [mdd] [ 1402.004401] [<ffffffffa0c5050a>] ? mdt_reint_unlink+0x9ca/0x10b0 [mdt] [ 1402.004401] [<ffffffffa02cf001>] ? libcfs_debug_msg+0x41/0x50 [libcfs] [ 1402.004401] [<ffffffffa0c47628>] mdo_unlink+0x18/0x50 [mdt] [ 1402.004401] [<ffffffffa0c50544>] mdt_reint_unlink+0xa04/0x10b0 [mdt] [ 1402.004401] [<ffffffffa0c473c1>] mdt_reint_rec+0x41/0xe0 [mdt] [ 1402.004401] [<ffffffffa0c2cc63>] mdt_reint_internal+0x4c3/0x7c0 [mdt] [ 1402.004401] [<ffffffffa0c2d4eb>] mdt_reint+0x6b/0x120 [mdt] [ 1402.004401] [<ffffffffa06f1445>] tgt_request_handle+0x245/0xad0 [ptlrpc] [ 1402.004401] [<ffffffffa06a1e01>] ptlrpc_main+0xce1/0x1960 [ptlrpc] [ 1402.004401] [<ffffffffa06a1120>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [ 1402.004401] [<ffffffff8109eab6>] kthread+0x96/0xa0 [ 1402.004401] [<ffffffff8100c30a>] child_rip+0xa/0x20 [ 1402.004401] [<ffffffff81554710>] ? _spin_unlock_irq+0x30/0x40 [ 1402.004401] [<ffffffff8100bb10>] ? restore_args+0x0/0x30 [ 1402.004401] [<ffffffff8109ea20>] ? kthread+0x0/0xa0 [ 1402.004401] [<ffffffff8100c300>] ? child_rip+0x0/0x20 This issue was found through DT API fault injection. |
| Comments |
| Comment by Oleg Drokin [ 25/Jul/14 ] |
|
So this might be a broder class of problems that we need to do a separate audit of this. I wonder if we should run some static tool for signed/unsigned comparison and ensure all matches are checked for correctness/fixed/made more obvious? |
| Comment by Andreas Dilger [ 25/Jul/14 ] |
|
Ouch, it took me a few looks to even see what the problem is, but Oleg confirmed that rc is "promoted" to unsigned int to match sizeof() and does the comparison as an unsigned value. |
| Comment by Peter Jones [ 25/Jul/14 ] |
|
Dmitry Could you please look into this one? Thanks Peter |
| Comment by Dmitry Eremin (Inactive) [ 28/Jul/14 ] |
|
Uff. I observed few other places with such weird code. For example, in tgt_handle_lfsck_query() we have the following: reply->lr_status = tgt_lfsck_query(tsi->tsi_env, tsi->tsi_tgt->lut_bottom, request);
if (reply->lr_status < 0)
rc = reply->lr_status;
RETURN(rc);
The reply->lr_status is unsigned int. Therefore if is always false. So, we need a more careful look through such kind of code. |
| Comment by John Hammond [ 28/Jul/14 ] |
|
You could do make EXTRA_POST_CFLAGS='-Wsign-compare -Wno-error 2>&1 1> /dev/null | grep warning | sort | uniq
There are only 1059 comparisons to check! |
| Comment by Dmitry Eremin (Inactive) [ 31/Jul/14 ] |
|
Originally I got 1084 warnings for current master. After my patch http://review.whamcloud.com/11281/ it's reduced to 1017. And lod now have only 3 warnings: lustre-release/lustre/lod/lod_qos.c:1169: warning: comparison between signed and unsigned integer expressions lustre-release/lustre/lod/lod_qos.c:182: warning: comparison between signed and unsigned integer expressions lustre-release/lustre/lod/lod_qos.c:711: warning: comparison between signed and unsigned integer expressions |
| Comment by Dmitry Eremin (Inactive) [ 04/Aug/14 ] |
|
After series of pathes all warnings from lustre/lod/* goes away and total amount of warnings become 1004. |
| Comment by Gerrit Updater [ 27/Feb/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) uploaded a new patch: http://review.whamcloud.com/13903 |
| Comment by Gerrit Updater [ 27/Feb/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13903/ |