[LU-1243] Lustre 2.1.1 client stack overrun Created: 20/Mar/12 Updated: 29/May/17 Resolved: 29/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Ned Bass | Assignee: | Hongchao Zhang |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl | ||
| Environment: | |||
| Severity: | 3 |
| Rank (Obsolete): | 9741 |
| Description |
|
During I/O testing we hit a client stack overrun on a Lustre 2.1.1 client. We have the following console log from the crash:

2012-03-18 06:05:23 BUG: unable to handle kernel paging request at 000000032cbc7960
2012-03-18 06:05:23 IP: [<ffffffff81052814>] update_curr+0x144/0x1f0
2012-03-18 06:05:23 PGD 417be5067 PUD 0
2012-03-18 06:05:23 Thread overran stack, or stack corrupted
2012-03-18 06:05:23 Oops: 0000 [#1] SMP
2012-03-18 06:05:23 last sysfs file: /sys/devices/system/cpu/cpu31/online
2012-03-18 06:05:23 CPU 6
2012-03-18 06:05:23 Modules linked in: xt_owner nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) acpi_cpufreq freq_table mperf ko2iblnd(U) lnet(U) libcfs(U) ipt_LOG xt_multiport iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm uinput sg sd_mod crc_t10dif isci libsas scsi_transport_sas ahci microcode sb_edac edac_core iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core ib_qib(U) ib_mad ib_core wmi ioatdma ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc igb dca [last unloaded: cpufreq_ondemand]
2012-03-18 06:05:23
2012-03-18 06:05:23 Pid: 68872, comm: lmp Not tainted 2.6.32-220.7.1.3chaos.ch5.x86_64 #1 appro appro-512x/S2600JF
2012-03-18 06:05:23 RIP: 0010:[<ffffffff81052814>] [<ffffffff81052814>] update_curr+0x144/0x1f0
2012-03-18 06:05:23 RSP: 0018:ffff8800366c3db8 EFLAGS: 00010086
2012-03-18 06:05:23 RAX: ffff8808334c9580 RBX: 00000000755fa0c8 RCX: ffff880437f111c0
2012-03-18 06:05:23 RDX: 0000000000018b48 RSI: 0000000000000000 RDI: ffff8808334c95b8
2012-03-18 06:05:23 RBP: ffff8800366c3de8 R08: ffffffff8160b665 R09: 0000000000000000
2012-03-18 06:05:23 R10: 0000000000000010 R11: 0000000000000000 R12: ffff8800366d5fe8
2012-03-18 06:05:23 R13: 00000000000f3c4d R14: 000081fe1182f6c5 R15: ffff8808334c9580
2012-03-18 06:05:23 FS: 00002aaaacfd0800(0000) GS:ffff8800366c0000(0000) knlGS:0000000000000000
2012-03-18 06:05:23 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2012-03-18 06:05:23 CR2: 000000032cbc7960 CR3: 0000000326499000 CR4: 00000000000406e0
2012-03-18 06:05:23 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2012-03-18 06:05:23 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2012-03-18 06:05:23 Process lmp (pid: 68872, threadinfo ffff8807755fa000, task ffff8808334c9580)
2012-03-18 06:05:23 Stack:
2012-03-18 06:05:23  ffff8800366c3dc8 ffffffff81013783 ffff8808334c95b8 ffff8800366d5fe8
2012-03-18 06:05:23 <0> 0000000000000000 0000000000000000 ffff8800366c3e18 ffffffff81052e2b
2012-03-18 06:05:23 <0> ffff8800366d5f80 0000000000000006 0000000000015f80 0000000000000006
2012-03-18 06:05:23 Call Trace:
2012-03-18 06:05:23  <IRQ>
2012-03-18 06:05:23  [<ffffffff81013783>] ? native_sched_clock+0x13/0x80
2012-03-18 06:05:23  [<ffffffff81052e2b>] task_tick_fair+0xdb/0x160
2012-03-18 06:05:23  [<ffffffff81056891>] scheduler_tick+0xc1/0x260
2012-03-18 06:05:23  [<ffffffff810a0e70>] ? tick_sched_timer+0x0/0xc0
2012-03-18 06:05:23  [<ffffffff8107c512>] update_process_times+0x52/0x70
2012-03-18 06:05:23  [<ffffffff810a0ed6>] tick_sched_timer+0x66/0xc0
2012-03-18 06:05:23  [<ffffffff8109555e>] __run_hrtimer+0x8e/0x1a0
2012-03-18 06:05:23  [<ffffffff81012b59>] ? read_tsc+0x9/0x20
2012-03-18 06:05:23  [<ffffffff81095906>] hrtimer_interrupt+0xe6/0x250
2012-03-18 06:05:23  [<ffffffff814f6f0b>] smp_apic_timer_interrupt+0x6b/0x9b
2012-03-18 06:05:23  [<ffffffff8100bc13>] apic_timer_interrupt+0x13/0x20
2012-03-18 06:05:23  <EOI>
2012-03-18 06:05:23  [<ffffffff8112adbe>] ? shrink_inactive_list+0x2de/0x740
2012-03-18 06:05:23  [<ffffffff8112ad72>] ? shrink_inactive_list+0x292/0x740
2012-03-18 06:05:23  [<ffffffff8112baef>] shrink_zone+0x38f/0x520
2012-03-18 06:05:23  [<ffffffff8112c894>] zone_reclaim+0x354/0x410
2012-03-18 06:05:23  [<ffffffff8112d4e0>] ? isolate_pages_global+0x0/0x350
2012-03-18 06:05:23  [<ffffffff81122d94>] get_page_from_freelist+0x694/0x820
2012-03-18 06:05:23  [<ffffffffa0475b6f>] ? cfs_hash_bd_from_key+0x3f/0xc0 [libcfs]
2012-03-18 06:05:23  [<ffffffffa05d4232>] ? lu_object_put+0x92/0x200 [obdclass]
2012-03-18 06:05:23  [<ffffffff811337c9>] ? zone_statistics+0x99/0xc0
2012-03-18 06:05:23  [<ffffffff81124011>] __alloc_pages_nodemask+0x111/0x940
2012-03-18 06:05:23  [<ffffffff8115e592>] kmem_getpages+0x62/0x170
2012-03-18 06:05:23  [<ffffffff8115ebff>] cache_grow+0x2cf/0x320
2012-03-18 06:05:23  [<ffffffff8115ee52>] cache_alloc_refill+0x202/0x240
2012-03-18 06:05:23  [<ffffffffa0469863>] ? cfs_alloc+0x63/0x90 [libcfs]
2012-03-18 06:05:23  [<ffffffff8115fb79>] __kmalloc+0x1a9/0x220
2012-03-18 06:05:23  [<ffffffffa0469863>] cfs_alloc+0x63/0x90 [libcfs]
2012-03-18 06:05:23  [<ffffffffa0702237>] ptlrpc_request_alloc_internal+0x167/0x360 [ptlrpc]
2012-03-18 06:05:23  [<ffffffffa070243e>] ptlrpc_request_alloc_pool+0xe/0x10 [ptlrpc]
2012-03-18 06:05:23  [<ffffffffa08d684b>] osc_brw_prep_request+0x1ab/0xcc0 [osc]
2012-03-18 06:05:23  [<ffffffffa08e7fbb>] ? osc_req_attr_set+0xfb/0x250 [osc]
2012-03-18 06:05:23  [<ffffffffa09f7638>] ? ccc_req_attr_set+0x78/0x150 [lustre]
2012-03-18 06:05:23  [<ffffffffa05e4e81>] ? cl_req_attr_set+0xd1/0x1a0 [obdclass]
2012-03-18 06:05:23  [<ffffffffa05e476c>] ? cl_req_prep+0x8c/0x130 [obdclass]
2012-03-18 06:05:23  [<ffffffffa08d821d>] osc_send_oap_rpc+0xebd/0x1690 [osc]
2012-03-18 06:05:23  [<ffffffffa08d17a6>] ? loi_list_maint+0xa6/0x130 [osc]
2012-03-18 06:05:23  [<ffffffffa08d8c9e>] osc_check_rpcs+0x2ae/0x3b0 [osc]
2012-03-18 06:05:23  [<ffffffffa08e86df>] osc_io_submit+0x1df/0x4a0 [osc]
2012-03-18 06:05:23  [<ffffffffa05e4938>] cl_io_submit_rw+0x78/0x130 [obdclass]
2012-03-18 06:05:23  [<ffffffffa094160f>] lov_io_submit+0x2bf/0x9d0 [lov]
2012-03-18 06:05:23  [<ffffffffa05e4938>] cl_io_submit_rw+0x78/0x130 [obdclass]
2012-03-18 06:05:23  [<ffffffffa09ddf15>] ll_writepage+0x205/0x3b0 [lustre]
2012-03-18 06:05:23  [<ffffffff8112a34b>] pageout.clone.1+0x12b/0x300
2012-03-18 06:05:23  [<ffffffff81129a01>] ? __remove_mapping+0x61/0x160
2012-03-18 06:05:23  [<ffffffff8112a8d2>] shrink_page_list.clone.0+0x3b2/0x5c0
2012-03-18 06:05:23  [<ffffffff811280bc>] ? release_pages+0x6c/0x250
2012-03-18 06:05:23  [<ffffffff8116804d>] ? mem_cgroup_del_lru_list+0x2d/0xc0
2012-03-18 06:05:23  [<ffffffff81168169>] ? mem_cgroup_del_lru+0x39/0x40
2012-03-18 06:05:23  [<ffffffff811689ae>] ? mem_cgroup_isolate_pages+0xee/0x1d0
2012-03-18 06:05:23  [<ffffffff8112addb>] shrink_inactive_list+0x2fb/0x740
2012-03-18 06:05:23  [<ffffffff8112baef>] shrink_zone+0x38f/0x520
2012-03-18 06:05:23  [<ffffffff811a0e4d>] ? bdi_queue_work+0x7d/0x110
2012-03-18 06:05:23  [<ffffffff8112bd7e>] do_try_to_free_pages+0xfe/0x520
2012-03-18 06:05:23  [<ffffffff8112c2ec>] try_to_free_mem_cgroup_pages+0x8c/0x90
2012-03-18 06:05:23  [<ffffffff811688c0>] ? mem_cgroup_isolate_pages+0x0/0x1d0
2012-03-18 06:05:23  [<ffffffff8116a450>] mem_cgroup_hierarchical_reclaim+0x2f0/0x460
2012-03-18 06:05:23  [<ffffffff8116b992>] __mem_cgroup_try_charge+0x202/0x460
2012-03-18 06:05:23  [<ffffffff811241e7>] ? __alloc_pages_nodemask+0x2e7/0x940
2012-03-18 06:05:23  [<ffffffff8116c4e7>] mem_cgroup_charge_common+0x87/0xd0
2012-03-18 06:05:23  [<ffffffff8116c658>] mem_cgroup_cache_charge+0x128/0x140
2012-03-18 06:05:23  [<ffffffff8111161a>] add_to_page_cache_locked+0x4a/0x140
2012-03-18 06:05:23  [<ffffffff8111173c>] add_to_page_cache_lru+0x2c/0x80
2012-03-18 06:05:23  [<ffffffff811122f9>] grab_cache_page_write_begin+0x99/0xc0
2012-03-18 06:05:23  [<ffffffffa09efa98>] ll_write_begin+0x58/0x170 [lustre]
2012-03-18 06:05:23  [<ffffffff81111b7e>] generic_file_buffered_write+0x10e/0x2a0
2012-03-18 06:05:23  [<ffffffff81070637>] ? current_fs_time+0x27/0x30
2012-03-18 06:05:23  [<ffffffff811134d0>] __generic_file_aio_write+0x250/0x480
2012-03-18 06:05:23  [<ffffffffa05d8d95>] ? cl_env_info+0x15/0x20 [obdclass]
2012-03-18 06:05:23  [<ffffffff8111376f>] generic_file_aio_write+0x6f/0xe0
2012-03-18 06:05:23  [<ffffffffa09fdf9e>] vvp_io_write_start+0x9e/0x1e0 [lustre]
2012-03-18 06:05:23  [<ffffffffa05e4a5a>] cl_io_start+0x6a/0x100 [obdclass]
2012-03-18 06:05:23  [<ffffffffa05e8094>] cl_io_loop+0xd4/0x160 [obdclass]
2012-03-18 06:05:23  [<ffffffffa09ba43b>] ll_file_io_generic+0x3bb/0x4b0 [lustre]
2012-03-18 06:05:23  [<ffffffffa09bf4ec>] ll_file_aio_write+0x13c/0x280 [lustre]
2012-03-18 06:05:23  [<ffffffffa09bf79c>] ll_file_write+0x16c/0x290 [lustre]
2012-03-18 06:05:23  [<ffffffff81177ae8>] vfs_write+0xb8/0x1a0
2012-03-18 06:05:23  [<ffffffff811784f1>] sys_write+0x51/0x90
2012-03-18 06:05:23  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
2012-03-18 06:05:23 Code: 00 8b 15 cc 36 a4 00 85 d2 74 34 48 8b 50 08 8b 5a 18 48 8b 90 10 09 00 00 48 8b 4a 50 48 85 c9 74 1d 48 63 db 66 90 48 8b 51 20 <48> 03 14 dd 20 73 bf 81 4c 01 2a 48 8b 49 78 48 85 c9 75 e8 48
2012-03-18 06:05:23 RIP [<ffffffff81052814>] update_curr+0x144/0x1f0
2012-03-18 06:05:23 RSP <ffff8800366c3db8>
2012-03-18 06:05:23 CR2: 000000032cbc7960 |
| Comments |
| Comment by Peter Jones [ 20/Mar/12 ] |
|
Hongchao, could you please comment on this one? Thanks. Peter |
| Comment by Hongchao Zhang [ 20/Mar/12 ] |
|
It is similar to the code in osc_brw_prep_request():

static int osc_brw_prep_request(int cmd, struct client_obd *cli, struct obdo *oa, ...
        ENTRY;
        ... |
| Comment by Hongchao Zhang [ 22/Mar/12 ] |
|
The patch to move some data in OSC into lu_env is being created and tested; I will attach it as soon as possible. |
| Comment by Hongchao Zhang [ 28/Mar/12 ] |
|
status update: |
| Comment by Hongchao Zhang [ 30/Mar/12 ] |
|
The patch is tracked at http://review.whamcloud.com/#change,2418. Moving the variables into lu_env does not reduce the stack usage noticeably, so the attached patch instead uses a separate thread (see the sketch below). |
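For illustration, here is a minimal userspace sketch of the separate-thread idea (this is not the attached patch; brw_args, build_and_send_rpc and send_brw_on_fresh_stack are made-up names): the deep part of the call chain runs on another thread, which starts from an empty stack, and the caller blocks until it completes.

#include <pthread.h>

struct brw_args {
        int   cmd;          /* read or write; illustrative only */
        void *pages;        /* pages to send; illustrative only */
};

/* Runs on the worker thread's own, freshly allocated stack. */
static void *build_and_send_rpc(void *p)
{
        struct brw_args *args = p;

        /* ... build the request and send it; in Lustre this would be the
         * part of the path below osc_send_oap_rpc() that overran the stack. */
        (void)args;
        return NULL;
}

/* Caller side: hand the deep part of the call chain to another thread and
 * wait for it.  A real implementation would queue work to a long-lived
 * thread rather than spawning one per request; the point here is only that
 * the hand-off buys stack headroom at the price of a context switch. */
static int send_brw_on_fresh_stack(struct brw_args *args)
{
        pthread_t tid;
        int rc;

        rc = pthread_create(&tid, NULL, build_and_send_rpc, args);
        if (rc != 0)
                return rc;
        return pthread_join(tid, NULL);
}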
| Comment by Andreas Dilger [ 30/Mar/12 ] |
|
We already have threads that send RPCs - that is ptlrpcd. Adding more threads in the very common IO path just to reduce stack usage will increase complexity and hurt performance due to extra context switches. We already have to do work to improve the NUMA CPU page locality for multiple ptlrpcds, and we would need to repeat this work for this new OSC thread to keep the page handling local to the CPU on which the pages were previously handled. It looks like the main problem is that the write path is going into kernel memory allocation/reclamation in ptlrpc_request_alloc_pool(), but this thread is already in the memory reclaim path from the top-level write: ll_file_write
->generic_file_buffered_write
->ll_write_begin
->grab_cache_page_write_begin
->do_try_to_free_pages
->shrink_zone
->shrink_page_list
->ll_writepage
->osc_send_oap_rpc
->ptlrpc_request_alloc_pool
->cfs_alloc
->get_page_from_freelist
->zone_reclaim
->shrink_zone
so adding another thread will not help very much because there will still be memory pressure, but it will slow down EVERY IO operation. A much better approach is to figure out why the ptlrpc request pool is not being used for this write request (a sketch of the pool idea follows below). It is entirely possible that the 2.x CLIO changes broke the initialization or usage of the ptlrpc request pool, which was added specifically to avoid memory allocations in the write path under memory pressure. |
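To make the pool argument concrete, here is a minimal userspace sketch of the preallocated-request idea (not the ptlrpc code; req_pool, pool_get and pool_put are made-up names): a fixed set of request buffers is allocated at setup time, so the writeback path can take one without calling the allocator while it is already inside memory reclaim.

#include <pthread.h>
#include <stdlib.h>

#define POOL_SIZE 8
#define REQ_SIZE  4096                  /* stand-in for the real request size */

struct req_pool {
        pthread_mutex_t lock;
        void *free[POOL_SIZE];
        int   nfree;
};

/* Called once at setup time, when allocating memory is still safe. */
static int pool_init(struct req_pool *p)
{
        pthread_mutex_init(&p->lock, NULL);
        p->nfree = 0;
        for (int i = 0; i < POOL_SIZE; i++) {
                void *buf = malloc(REQ_SIZE);
                if (buf == NULL)
                        return -1;
                p->free[p->nfree++] = buf;
        }
        return 0;
}

/* On the write path: take a preallocated buffer if one is available and
 * only fall back to the allocator when the pool is empty. */
static void *pool_get(struct req_pool *p)
{
        void *buf = NULL;

        pthread_mutex_lock(&p->lock);
        if (p->nfree > 0)
                buf = p->free[--p->nfree];
        pthread_mutex_unlock(&p->lock);

        return buf != NULL ? buf : malloc(REQ_SIZE);
}

/* Return the buffer to the pool instead of freeing it. */
static void pool_put(struct req_pool *p, void *buf)
{
        pthread_mutex_lock(&p->lock);
        if (p->nfree < POOL_SIZE) {
                p->free[p->nfree++] = buf;
                buf = NULL;
        }
        pthread_mutex_unlock(&p->lock);
        free(buf);                      /* no-op if the buffer went back to the pool */
}

In the trace above, ptlrpc_request_alloc_pool() nevertheless ends up in __kmalloc() and zone_reclaim(), which is exactly the behaviour worth investigating.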
| Comment by Build Master (Inactive) [ 30/Mar/12 ] |
|
Integrated in Result = SUCCESS
|
| Comment by Hongchao Zhang [ 31/Mar/12 ] |
|
1. The stack used by osc_check_rpcs/osc_send_oap_rpc/osc_build_req/osc_brw_prep_request is ~350 bytes.
2. The ptlrpc_request allocation is reached from osc_check_rpcs, which cannot guarantee that the request is a write; for a read request the allocation still goes through cfs_alloc+0x63/0x90 [libcfs], since osc_brw_prep_request() then uses ptlrpc_request_alloc() rather than the pool:

static int osc_brw_prep_request(int cmd, struct client_obd *cli, struct obdo *oa, ...
        ...
        else {
                opc = OST_READ;
                req = ptlrpc_request_alloc(cli->cl_import, &RQF_OST_BRW_READ);
        }
        ...

struct ptlrpc_request *ptlrpc_request_alloc(struct obd_import *imp, ...

3. ptlrpcd is for async IO; the newly introduced thread will block the calling thread until it finishes. Which is better?
4. The separate thread will only be used if the calling thread is low on stack space; will it still affect IO much? |
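Point 4 implies a check of the remaining stack space before deciding whether to hand the work off. A minimal sketch of what such a check could look like, in plain C (the constant and the way the stack base is obtained are assumptions, not the actual patch; in the kernel the stack is THREAD_SIZE bytes and grows downwards on x86_64):

#include <stdint.h>

#define STACK_LOW_WATER 2048            /* hand work off below this much headroom */

/* stack_base is the lowest address of the current thread's stack.  Since the
 * stack grows downwards, the headroom left is the distance between the
 * current stack pointer and that base. */
static int stack_running_low(uintptr_t stack_base)
{
        uintptr_t sp = (uintptr_t)&sp;  /* address of a local approximates the SP */

        return (sp - stack_base) < STACK_LOW_WATER;
}

The calling thread would then build the RPC inline when plenty of stack remains and only pay for the hand-off (and its context switch) when stack_running_low() returns true.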
| Comment by Andreas Dilger [ 03/Apr/12 ] |
|
Even when the top-level operation is a read, if free memory is low and the VM is trying to flush dirty data from cache, it will be a write operation at the bottom of the stack that uses the RPC request pool. Getting the RPC request pool working will not only (hopefully) fix this stack usage issue, it also avoids memory allocation deadlocks when the client and server run on the same machine, and avoids making large RPC request allocations frequently (which is why it was added originally). So there are several reasons to get this code working properly, not just the stack usage. Also, it is best to avoid making the code more complex (i.e. adding another thread) if it is not necessary. |
| Comment by Jinshan Xiong (Inactive) [ 16/Apr/12 ] |
|
I'd like to suggest issuing the BRW RPCs asynchronously so that we can reduce stack usage dramatically. I have added the functionality to issue async RPCs in commit 2064c2a7e616b172f72462884b23d899bfc040ff, so I think we can port it to b2_1 and fire the writeback request in osc_io_submit() too. To be honest, the obd interfaces are almost obsolete in the client IO path, so any attempt to add new obd ops is not a good idea. |
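To contrast with the blocking hand-off sketched earlier, here is a minimal userspace sketch of asynchronous submission (illustrative names, not the code from the commit above): the submitter queues a work item for a long-lived worker thread and returns immediately, so the deep RPC-building path never runs on the submitter's stack and the submitter never waits on it.

#include <pthread.h>
#include <stdlib.h>

struct work {
        void (*fn)(void *);
        void *arg;
        struct work *next;
};

static struct work *pending;            /* pushed LIFO for brevity; a real
                                         * implementation would keep FIFO order */
static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pending_cv   = PTHREAD_COND_INITIALIZER;

/* Submitter side: queue the RPC build/send and return immediately. */
static int submit_async(void (*fn)(void *), void *arg)
{
        struct work *w = malloc(sizeof(*w));

        if (w == NULL)
                return -1;
        w->fn = fn;
        w->arg = arg;

        pthread_mutex_lock(&pending_lock);
        w->next = pending;
        pending = w;
        pthread_cond_signal(&pending_cv);
        pthread_mutex_unlock(&pending_lock);
        return 0;
}

/* Long-lived worker: drains the queue on its own stack, so the submitter's
 * stack depth no longer matters. */
static void *worker(void *unused)
{
        (void)unused;
        for (;;) {
                struct work *w;

                pthread_mutex_lock(&pending_lock);
                while (pending == NULL)
                        pthread_cond_wait(&pending_cv, &pending_lock);
                w = pending;
                pending = w->next;
                pthread_mutex_unlock(&pending_lock);

                w->fn(w->arg);
                free(w);
        }
        return NULL;
}

The worker would be started once at setup time with pthread_create(&tid, NULL, worker, NULL); in Lustre that long-lived role is what ptlrpcd already plays.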
| Comment by Hongchao Zhang [ 17/Apr/12 ] |
|
The updated patch based on the async PTLRPC approach above is being created and tested. |
| Comment by Hongchao Zhang [ 19/Apr/12 ] |
|
The updated patch is under testing and will be attached soon. |
| Comment by Peter Jones [ 20/Apr/12 ] |
| Comment by Hongchao Zhang [ 07/Jun/12 ] |
|
After rebasing the patch, two new problems were found. One is related to IO requests advancing, and the IO performance is very poor. The updated patch is under testing; I hope to attach it tomorrow. |
| Comment by Hongchao Zhang [ 14/Aug/12 ] |
|
There have been many changes in the OSC module recently, and the patch needs to be updated against them. |
| Comment by Andreas Dilger [ 29/May/17 ] |
|
Close old ticket. |