[LU-3346] The assignment of obt_sb was removed Created: 15/May/13  Updated: 21/Oct/13  Resolved: 14/Jul/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.5.0

Type: Bug Priority: Major
Reporter: Sebastien Buisson (Inactive) Assignee: Dmitry Eremin (Inactive)
Resolution: Fixed Votes: 0
Labels: build, coverity

Issue Links:
Related
is related to LU-2753 Tracking bug for static code analysis... Resolved
is related to LU-1330 split/cleanup for obdclass/fid/fld mo... Resolved
Severity: 3
Rank (Obsolete): 8274

 Description   

Hi,

Coverity has found a strange piece of code in check_and_start_recovery_timer() function:

if (!(lsi->lsi_flags | LDD_F_IR_CAPABLE))
service_time += 2 * (CONNECTION_SWITCH_MAX +
CONNECTION_SWITCH_INC);

According to the tool, 'lsi->lsi_flags | LDD_F_IR_CAPABLE' is useless because it is always true. So it seems the developer meant to use '&' instead of '|'.

But if we change the code accordingly, then we hit a number of issues because of unprotected access to recovery timer, as seen in https://maloo.whamcloud.com/test_sessions/96125f26-bc36-11e2-b013-52540035b04c



 Comments   
Comment by Dmitry Eremin (Inactive) [ 15/May/13 ]

The stack for crash is following:

05:23:08:Lustre: lustre-MDT0000: used disk, loading
05:23:08:LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
05:23:08:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
05:23:08:LNet: 11898:0:(debug.c:324:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
05:23:08:LNet: 11898:0:(debug.c:324:libcfs_debug_str2mask()) Skipped 1 previous similar message
05:23:08:Lustre: DEBUG MARKER: e2label /dev/lvm-MDS/P1 2>/dev/null
05:23:08:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x194
05:23:08:Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
05:23:08:BUG: unable to handle kernel NULL pointer dereference at 0000000000000290
05:23:08:IP: [<ffffffffa074059f>] target_handle_connect+0x249f/0x2d00 [ptlrpc]
05:23:08:PGD 0 
05:23:08:Oops: 0000 [#1] SMP 
05:23:08:last sysfs file: /sys/devices/system/cpu/possible
05:23:08:CPU 0 
05:23:08:Modules linked in: osp(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) mdd(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) ldiskfs(U) jbd2 nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
05:23:08:
05:23:08:Pid: 11705, comm: mdt00_002 Not tainted 2.6.32-358.6.1.el6_lustre.g309e05a.x86_64 #1 Red Hat KVM
05:23:08:RIP: 0010:[<ffffffffa074059f>]  [<ffffffffa074059f>] target_handle_connect+0x249f/0x2d00 [ptlrpc]
05:23:08:RSP: 0018:ffff880056b9fb70  EFLAGS: 00010246
05:23:08:RAX: 0000000000000000 RBX: ffff88006481e278 RCX: 0000000000000014
05:23:08:RDX: 0000000000000005 RSI: 0000000000000010 RDI: 0000000000000010
05:23:08:RBP: ffff880056b9fd10 R08: ffffffff81c07720 R09: 0000000000000000
05:23:08:R10: ffff8800667f42d1 R11: 0000000000010000 R12: ffff88007bcf4c00
05:23:08:R13: 0000000000000001 R14: ffff88006481e278 R15: ffff88007bcf4fa0
05:23:08:FS:  00007f06d06ba700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
05:23:08:CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
05:23:08:CR2: 0000000000000290 CR3: 0000000037446000 CR4: 00000000000006f0
05:23:08:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
05:23:08:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
05:23:08:Process mdt00_002 (pid: 11705, threadinfo ffff880056b9e000, task ffff880065172ae0)
05:23:08:Stack:
05:23:08: ffffffffa07e1bac ffffffffa07e1b9f 000000005190db27 000000005190db26
05:23:08:<d> 0000000200000000 0000000affffffff ffffffffffffffff ffff88006578a800
05:23:08:<d> ffff88006578a800 000000005190db27 ffffffffa07e1ba0 0000000200000001
05:23:08:Call Trace:
05:23:08: [<ffffffffa0dd745a>] mdt_connect+0x3a/0x4e0 [mdt]
05:23:08: [<ffffffffa0dc8df8>] mdt_handle_common+0x648/0x1660 [mdt]
05:23:08: [<ffffffffa0e058c5>] mds_regular_handle+0x15/0x20 [mdt]
05:23:08: [<ffffffffa078b3a8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
05:23:08: [<ffffffffa04795de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
05:23:08: [<ffffffffa048ad8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
05:23:08: [<ffffffffa0782709>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
05:23:08: [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
05:23:08: [<ffffffffa078c73e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
05:23:08: [<ffffffffa078bc70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
05:23:08: [<ffffffff8100c0ca>] child_rip+0xa/0x20
05:23:08: [<ffffffffa078bc70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
05:23:08: [<ffffffffa078bc70>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
05:23:08: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
05:23:08:Code: 00 00 00 31 d2 89 c8 f7 f6 ba 05 00 00 00 44 89 ef 83 f8 05 0f 42 c2 c1 ef 02 41 8d 44 45 05 8d 3c 38 48 8b 83 10 05 00 00 89 fe <48> 8b 80 90 02 00 00 8b 00 f6 c4 20 75 15 83 f9 05 b8 32 00 00 
05:23:08:RIP  [<ffffffffa074059f>] target_handle_connect+0x249f/0x2d00 [ptlrpc]
05:23:08: RSP <ffff880056b9fb70>
05:23:08:CR2: 0000000000000290
static void
check_and_start_recovery_timer(struct obd_device *obd,
                               struct ptlrpc_request *req,
                               int new_client)
{
        int service_time = lustre_msg_get_service_time(req->rq_reqmsg);
        struct obd_device_target *obt = &obd->u.obt;
        struct lustre_sb_info *lsi;

        if (!new_client && service_time)
                /* Teach server about old server's estimates, as first guess
                 * at how long new requests will take. */
		at_measured(&req->rq_rqbd->rqbd_svcpt->scp_at_estimate,
                            service_time);

        target_start_recovery_timer(obd);

	/* Convert the service time to RPC timeout,
	 * and reuse service_time to limit stack usage. */
	service_time = at_est2timeout(service_time);

	/* We expect other clients to timeout within service_time, then try
	 * to reconnect, then try the failover server.  The max delay between
	 * connect attempts is SWITCH_MAX + SWITCH_INC + INITIAL. */
        service_time += 2 * INITIAL_CONNECT_TIMEOUT;

        LASSERT(obt->obt_magic == OBT_MAGIC);
        lsi = s2lsi(obt->obt_sb);                <== obt->obt_sb is NULL here!!!
	if (!(lsi->lsi_flags & LDD_F_IR_CAPABLE))
                service_time += 2 * (CONNECTION_SWITCH_MAX +
                                     CONNECTION_SWITCH_INC);
        if (service_time > obd->obd_recovery_timeout && !new_client)
                extend_recovery_timer(obd, service_time, false);
}
Comment by Dmitry Eremin (Inactive) [ 16/May/13 ]

The assignment of obt->obt_sb was removed in a recent project. The peace of code with obt->obt_sb usage from this function has never executed because of smart compiler remove it. This BUG was not discovered because we have IR so that even the timeout is much short it can still survive.

Comment by Dmitry Eremin (Inactive) [ 18/May/13 ]

Patch set: http://review.whamcloud.com/6357

Comment by Dmitry Eremin (Inactive) [ 14/Jul/13 ]

It was covered by LU-1330.

Generated at Sat Feb 10 01:33:07 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.