[LU-3346] The assignment of obt_sb was removed Created: 15/May/13 Updated: 21/Oct/13 Resolved: 14/Jul/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.5.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Sebastien Buisson (Inactive) | Assignee: | Dmitry Eremin (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | build, coverity | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 8274 | ||||||||||||
| Description |
|
Hi, Coverity has found a strange piece of code in check_and_start_recovery_timer() function: if (!(lsi->lsi_flags | LDD_F_IR_CAPABLE)) According to the tool, 'lsi->lsi_flags | LDD_F_IR_CAPABLE' is useless because it is always true. So it seems the developer meant to use '&' instead of '|'. But if we change the code accordingly, then we hit a number of issues because of unprotected access to recovery timer, as seen in https://maloo.whamcloud.com/test_sessions/96125f26-bc36-11e2-b013-52540035b04c |
| Comments |
| Comment by Dmitry Eremin (Inactive) [ 15/May/13 ] |
|
The stack for crash is following: 05:23:08:Lustre: lustre-MDT0000: used disk, loading 05:23:08:LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11. 05:23:08:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u 05:23:08:LNet: 11898:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release. 05:23:08:LNet: 11898:0:(debug.c:324:libcfs_debug_str2mask()) Skipped 1 previous similar message 05:23:08:Lustre: DEBUG MARKER: e2label /dev/lvm-MDS/P1 2>/dev/null 05:23:08:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x194 05:23:08:Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect 05:23:08:BUG: unable to handle kernel NULL pointer dereference at 0000000000000290 05:23:08:IP: [<ffffffffa074059f>] target_handle_connect+0x249f/0x2d00 [ptlrpc] 05:23:08:PGD 0 05:23:08:Oops: 0000 [#1] SMP 05:23:08:last sysfs file: /sys/devices/system/cpu/possible 05:23:08:CPU 0 05:23:08:Modules linked in: osp(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) mdd(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) ldiskfs(U) jbd2 nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib] 05:23:08: 05:23:08:Pid: 11705, comm: mdt00_002 Not tainted 2.6.32-358.6.1.el6_lustre.g309e05a.x86_64 #1 Red Hat KVM 05:23:08:RIP: 0010:[<ffffffffa074059f>] [<ffffffffa074059f>] target_handle_connect+0x249f/0x2d00 [ptlrpc] 05:23:08:RSP: 0018:ffff880056b9fb70 EFLAGS: 00010246 05:23:08:RAX: 0000000000000000 RBX: ffff88006481e278 RCX: 0000000000000014 05:23:08:RDX: 0000000000000005 RSI: 0000000000000010 RDI: 0000000000000010 05:23:08:RBP: ffff880056b9fd10 R08: ffffffff81c07720 R09: 0000000000000000 05:23:08:R10: ffff8800667f42d1 R11: 0000000000010000 R12: ffff88007bcf4c00 05:23:08:R13: 0000000000000001 R14: ffff88006481e278 R15: ffff88007bcf4fa0 05:23:08:FS: 00007f06d06ba700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000 05:23:08:CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 05:23:08:CR2: 0000000000000290 CR3: 0000000037446000 CR4: 00000000000006f0 05:23:08:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 05:23:08:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 05:23:08:Process mdt00_002 (pid: 11705, threadinfo ffff880056b9e000, task ffff880065172ae0) 05:23:08:Stack: 05:23:08: ffffffffa07e1bac ffffffffa07e1b9f 000000005190db27 000000005190db26 05:23:08:<d> 0000000200000000 0000000affffffff ffffffffffffffff ffff88006578a800 05:23:08:<d> ffff88006578a800 000000005190db27 ffffffffa07e1ba0 0000000200000001 05:23:08:Call Trace: 05:23:08: [<ffffffffa0dd745a>] mdt_connect+0x3a/0x4e0 [mdt] 05:23:08: [<ffffffffa0dc8df8>] mdt_handle_common+0x648/0x1660 [mdt] 05:23:08: [<ffffffffa0e058c5>] mds_regular_handle+0x15/0x20 [mdt] 05:23:08: [<ffffffffa078b3a8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc] 05:23:08: [<ffffffffa04795de>] ? cfs_timer_arm+0xe/0x10 [libcfs] 05:23:08: [<ffffffffa048ad8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs] 05:23:08: [<ffffffffa0782709>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc] 05:23:08: [<ffffffff81055ab3>] ? __wake_up+0x53/0x70 05:23:08: [<ffffffffa078c73e>] ptlrpc_main+0xace/0x1700 [ptlrpc] 05:23:08: [<ffffffffa078bc70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] 05:23:08: [<ffffffff8100c0ca>] child_rip+0xa/0x20 05:23:08: [<ffffffffa078bc70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] 05:23:08: [<ffffffffa078bc70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc] 05:23:08: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 05:23:08:Code: 00 00 00 31 d2 89 c8 f7 f6 ba 05 00 00 00 44 89 ef 83 f8 05 0f 42 c2 c1 ef 02 41 8d 44 45 05 8d 3c 38 48 8b 83 10 05 00 00 89 fe <48> 8b 80 90 02 00 00 8b 00 f6 c4 20 75 15 83 f9 05 b8 32 00 00 05:23:08:RIP [<ffffffffa074059f>] target_handle_connect+0x249f/0x2d00 [ptlrpc] 05:23:08: RSP <ffff880056b9fb70> 05:23:08:CR2: 0000000000000290 static void
check_and_start_recovery_timer(struct obd_device *obd,
struct ptlrpc_request *req,
int new_client)
{
int service_time = lustre_msg_get_service_time(req->rq_reqmsg);
struct obd_device_target *obt = &obd->u.obt;
struct lustre_sb_info *lsi;
if (!new_client && service_time)
/* Teach server about old server's estimates, as first guess
* at how long new requests will take. */
at_measured(&req->rq_rqbd->rqbd_svcpt->scp_at_estimate,
service_time);
target_start_recovery_timer(obd);
/* Convert the service time to RPC timeout,
* and reuse service_time to limit stack usage. */
service_time = at_est2timeout(service_time);
/* We expect other clients to timeout within service_time, then try
* to reconnect, then try the failover server. The max delay between
* connect attempts is SWITCH_MAX + SWITCH_INC + INITIAL. */
service_time += 2 * INITIAL_CONNECT_TIMEOUT;
LASSERT(obt->obt_magic == OBT_MAGIC);
lsi = s2lsi(obt->obt_sb); <== obt->obt_sb is NULL here!!!
if (!(lsi->lsi_flags & LDD_F_IR_CAPABLE))
service_time += 2 * (CONNECTION_SWITCH_MAX +
CONNECTION_SWITCH_INC);
if (service_time > obd->obd_recovery_timeout && !new_client)
extend_recovery_timer(obd, service_time, false);
}
|
| Comment by Dmitry Eremin (Inactive) [ 16/May/13 ] |
|
The assignment of obt->obt_sb was removed in a recent project. The peace of code with obt->obt_sb usage from this function has never executed because of smart compiler remove it. This BUG was not discovered because we have IR so that even the timeout is much short it can still survive. |
| Comment by Dmitry Eremin (Inactive) [ 18/May/13 ] |
|
Patch set: http://review.whamcloud.com/6357 |
| Comment by Dmitry Eremin (Inactive) [ 14/Jul/13 ] |
|
It was covered by |