[LU-1736] simul - clients soft CPU lockup Created: 11/Aug/12  Updated: 20/Aug/12  Resolved: 19/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.3
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Hyperion RHEL6 servers/clients


Severity: 3
Rank (Obsolete): 6351

 Description   

simul, running test #2 - file stat, shared mode
Clients are hitting repeated soft CPU lockups.


Aug 11 10:45:27 ehyperion686 kernel: BUG: soft lockup - CPU#0 stuck for 67s! [simul:24453]
Aug 11 10:45:27 ehyperion686 kernel: Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm dcdbas i2c_i801 i2c_core ahci iTCO_wdt iTCO_vendor_support ioatdma dca i7core_edac edac_core shpchp ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core e1000e [last unloaded: cpufreq_ondemand]
Aug 11 10:45:27 ehyperion686 kernel: CPU 0
Aug 11 10:45:27 ehyperion686 kernel:
Aug 11 10:45:27 ehyperion686 kernel: Pid: 24453, comm: simul Not tainted 2.6.32-279.2.1.el6.x86_64 #1 Dell XS23-TY /XS23-TY
Aug 11 10:45:27 ehyperion686 kernel: RIP: 0010:[<ffffffff8150024e>] [<ffffffff8150024e>] _spin_lock+0x1e/0x30
Aug 11 10:45:27 ehyperion686 kernel: RSP: 0018:ffff880028203d90 EFLAGS: 00000297
Aug 11 10:45:27 ehyperion686 kernel: RAX: 000000000000ad35 RBX: ffff880028203d90 RCX: 0000000000001fa4
Aug 11 10:45:27 ehyperion686 kernel: RDX: 000000000000ad34 RSI: 0000000000000000 RDI: ffff8801a60486e8
Aug 11 10:45:27 ehyperion686 kernel: RBP: ffffffff8100bc13 R08: 8000000000000000 R09: 00004f07ef0d6d80
Aug 11 10:45:27 ehyperion686 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff880028203d10
Aug 11 10:45:27 ehyperion686 kernel: R13: ffffffff81aa5700 R14: ffff8801a60486e0 R15: ffffffff81505d0b
Aug 11 10:45:27 ehyperion686 kernel: FS: 00002aaaafc522c0(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
Aug 11 10:45:27 ehyperion686 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 11 10:45:27 ehyperion686 kernel: CR2: 00002aaab10fe044 CR3: 0000000179036000 CR4: 00000000000006f0
Aug 11 10:45:27 ehyperion686 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 11 10:45:27 ehyperion686 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 11 10:45:27 ehyperion686 kernel: Process simul (pid: 24453, threadinfo ffff88033ac38000, task ffff88032e585500)
Aug 11 10:45:27 ehyperion686 kernel: Stack:
Aug 11 10:45:27 ehyperion686 kernel: ffff880028203dc0 ffffffffa026d120 0000000000000230 0000000000000000
Aug 11 10:45:27 ehyperion686 kernel: <d> ffffffff81aa5780 ffff88032dcfb480 ffff880028203e10 ffffffffa05860ca
Aug 11 10:45:27 ehyperion686 kernel: <d> ffff880028203e30 ffffffff810e0d44 0000000000000100 0000000000000008
Aug 11 10:45:27 ehyperion686 kernel: Call Trace:
Aug 11 10:45:27 ehyperion686 kernel: <IRQ>
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa026d120>] ? lprocfs_counter_sub+0xc0/0x100 [lvfs]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa05860ca>] ? ldlm_lock_free+0x3a/0x110 [ptlrpc]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff810e0d44>] ? rcu_process_dyntick+0x114/0x120
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa04b9e7f>] ? class_handle_free_cb+0x2f/0x110 [obdclass]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff810e0dce>] ? force_quiescent_state+0x7e/0x1a0
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff810e12fd>] ? __rcu_process_callbacks+0x10d/0x330
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff810e154b>] ? rcu_process_callbacks+0x2b/0x50
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff81096c50>] ? hrtimer_interrupt+0x140/0x250
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff8100de85>] ? do_softirq+0x65/0xa0
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff81073ca5>] ? irq_exit+0x85/0x90
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff81505d10>] ? smp_apic_timer_interrupt+0x70/0x9b
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
Aug 11 10:45:27 ehyperion686 kernel: <EOI>
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa04eec60>] ? check_and_discard_cb+0x0/0x140 [obdclass]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa026d140>] ? lprocfs_counter_sub+0xe0/0x100 [lvfs]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa026d120>] ? lprocfs_counter_sub+0xc0/0x100 [lvfs]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa07dcc81>] ? lov_io_fini+0x211/0x3d0 [lov]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa04f0548>] ? cl_io_init0+0xa8/0x190 [obdclass]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa04f35a6>] ? cl_io_fini+0x76/0x160 [obdclass]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa04efdd4>] ? cl_2queue_init+0x34/0xe0 [obdclass]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa04eabeb>] ? cl_lock_page_out+0xeb/0x3b0 [obdclass]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa04e3686>] ? cl_env_peek+0x36/0x110 [obdclass]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa040a0b7>] ? cfs_hash_bd_lookup_intent+0x37/0x120 [libcfs]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa04eec60>] ? check_and_discard_cb+0x0/0x140 [obdclass]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa0779f0f>] ? osc_lock_flush+0x4f/0x90 [osc]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa0779faa>] ? osc_lock_cancel+0x5a/0x170 [osc]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa04e9a65>] ? cl_lock_cancel0+0x75/0x130 [obdclass]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa04ea5ab>] ? cl_lock_cancel+0x11b/0x120 [obdclass]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa077b538>] ? osc_ldlm_blocking_ast+0x148/0x300 [osc]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa0588be3>] ? ldlm_cancel_callback+0x63/0xf0 [ptlrpc]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa05a3421>] ? ldlm_cli_cancel_local+0x71/0x350 [ptlrpc]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa05a66af>] ? ldlm_cli_cancel_list_local+0xef/0x230 [ptlrpc]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa058db72>] ? ldlm_res_hop_get_locked+0x12/0x20 [ptlrpc]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa05a6958>] ? ldlm_cancel_resource_local+0x168/0x280 [ptlrpc]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa0590044>] ? ldlm_resource_get+0xa4/0x6e0 [ptlrpc]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa07655ff>] ? osc_destroy+0x11f/0x8f0 [osc]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa07baa5d>] ? lov_set_add_req+0x2d/0x50 [lov]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa07bf9c2>] ? lov_prep_destroy_set+0x372/0xa40 [lov]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa07aac37>] ? lov_destroy+0x357/0xcd0 [lov]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa05d0465>] ? lustre_msg_buf+0x85/0x90 [ptlrpc]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa07d00c3>] ? lsm_unpackmd_v1+0xf3/0x200 [lov]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa07b5474>] ? lov_alloc_memmd+0x54/0x360 [lov]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa05fc611>] ? req_capsule_set_size+0x71/0x1b0 [ptlrpc]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa085af51>] ? ll_objects_destroy+0x571/0x1a90 [lustre]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa05d0465>] ? lustre_msg_buf+0x85/0x90 [ptlrpc]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa05fcc4b>] ? __req_capsule_get+0x14b/0x6b0 [ptlrpc]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa0855546>] ? ll_update_times+0x26/0x110 [lustre]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffffa085cf7f>] ? ll_unlink+0x35f/0x780 [lustre]
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff81188d3f>] ? vfs_unlink+0x9f/0xe0
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff81187a8a>] ? lookup_hash+0x3a/0x50
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff8118b273>] ? do_unlinkat+0x183/0x1c0
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff81180786>] ? sys_newstat+0x36/0x50
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff8100c5b5>] ? math_state_restore+0x45/0x60
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff8118b2c6>] ? sys_unlink+0x16/0x20
Aug 11 10:45:27 ehyperion686 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
Aug 11 10:45:27 ehyperion686 kernel: Code: 00 00 00 01 74 05 e8 e2 e3 d7 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 17 eb f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89



 Comments   
Comment by Peter Jones [ 11/Aug/12 ]

Oleg

Could you please look into this one?

Thanks

Peter

Comment by Peter Jones [ 12/Aug/12 ]

Bobijam

Oleg thought that this issue was in an area of code that you have been working on recently. Can you please treat this as your top priority and look into what has changed between 2.1.2 and 2.1.3 to cause this issue?

Thanks

Peter

Comment by Zhenyu Xu [ 12/Aug/12 ]

We need to port http://review.whamcloud.com/2451, http://review.whamcloud.com/3026, and http://review.whamcloud.com/3240 to b2_1; I'll push a combined patch.

Comment by Peter Jones [ 13/Aug/12 ]

Combined patch http://review.whamcloud.com/#change,3607

Comment by Christopher Morrone [ 17/Aug/12 ]

This is a bit confusing.

What does static allocation of stats structures at mount time have to do with soft-lockups when running simul? If there is a connection, we need a clear explanation.

Why are these commits being squashed into one? That makes comparing the two branches much more difficult in the future. I am really opposed to this practice.

I do see that one of the three commits combines a revert with additional new code. I would really like to see that avoided in the future as well. Can we have rules to that effect added to the wiki somewhere?

Comment by Zhenyu Xu [ 18/Aug/12 ]

In the b2_1 branch, when percpu stats are disabled, lprocfs_counter_sub() takes a cfs_spin_lock to protect the stat but does not disable interrupts. An interrupt can fire while the lock is held, and the interrupt handler can call lprocfs_counter_sub() again, spinning forever on the lock the interrupted task already holds and causing the lockup.
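
To illustrate, here is a minimal sketch of the deadlock and of the usual fix. This is not the actual Lustre code; the struct and function names are made up:

/*
 * With percpu stats disabled, the counter is a single shared
 * structure guarded by one spinlock (illustrative names only).
 */
#include <linux/spinlock.h>

struct stat_counter {
        spinlock_t lock;
        long       value;
};

/* Buggy version: spin_lock() does not disable local interrupts. */
static void counter_sub_buggy(struct stat_counter *c, long amount)
{
        spin_lock(&c->lock);    /* an interrupt can fire here...      */
        c->value -= amount;     /* ...and if the IRQ path calls this  */
        spin_unlock(&c->lock);  /* same function, it spins forever on */
                                /* the lock this CPU already holds    */
}

/* Fixed version: disable local interrupts while the lock is held, so
 * an IRQ-context caller can never preempt the critical section. */
static void counter_sub_fixed(struct stat_counter *c, long amount)
{
        unsigned long flags;

        spin_lock_irqsave(&c->lock, flags);
        c->value -= amount;
        spin_unlock_irqrestore(&c->lock, flags);
}

That is exactly the pattern in the trace above: lprocfs_counter_sub() shows up both inside the <IRQ> section (via the RCU callback freeing an LDLM lock) and below <EOI> in the process-context unlink path on the same CPU.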

There is a glitch in patch #2451 (ls_biggest_alloc_num was not protected) and in #3026 (clearing the stats is not multithread safe), and #3240 fixes the glitch and reverts the stat-clearing code, so only the three patches taken together form a complete fix. That is why I squashed them into one for the b2_1 branch.
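
For the #2451 glitch, the race is the classic unlocked "record the maximum" update. A minimal sketch (again not the actual Lustre code; the struct and lock names here are assumptions):

#include <linux/spinlock.h>

struct lprocfs_stats_sketch {
        spinlock_t   ls_lock;               /* hypothetical guard       */
        unsigned int ls_biggest_alloc_num;  /* field named in the patch */
};

/* Racy: two CPUs can both pass the comparison, and the smaller value
 * can be stored last, losing the true maximum. */
static void record_max_racy(struct lprocfs_stats_sketch *s, unsigned int n)
{
        if (n > s->ls_biggest_alloc_num)
                s->ls_biggest_alloc_num = n;
}

/* Safe: the compare-and-store happens atomically under the lock. */
static void record_max_locked(struct lprocfs_stats_sketch *s, unsigned int n)
{
        spin_lock(&s->ls_lock);
        if (n > s->ls_biggest_alloc_num)
                s->ls_biggest_alloc_num = n;
        spin_unlock(&s->ls_lock);
}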

Comment by Peter Jones [ 19/Aug/12 ]

Closing as duplicate of LU-1282

Comment by Oleg Drokin [ 20/Aug/12 ]

Actually, I sort of think that when we port stuff from current branches back into stable branches, it's better to merge whatever fixes were made on top of the needed patches into one. It makes for a cleaner changelog, and we pull in known-good (combined) patches instead of the whole experimental stream of "try X, X did not work, revert; try Y, Y did not work; try Z, OK it works, here are some final adjustments".
The code ends up the same as in the newer branches, though.

Comment by Christopher Morrone [ 20/Aug/12 ]

But it makes it much harder to compare the commits on multiple branches. I want to be able to look at two branches with "git log --oneline" and have some hope of figuring out whether a change is in both branches. If we are always squashing things together, the logs are never consistent enough to compare.
