Details
- Bug
- Resolution: Fixed
- Blocker
- Lustre 2.4.0
- hyperion/LLNL
- 3
- 8119
Description
Testing the fix for LU-3188 (http://review.whamcloud.com/#change,6191).
The client panics immediately when running IOR:
2013-05-05 12:06:15 Lustre: DEBUG MARKER: == test iorssf: iorssf == 12:06:15
2013-05-05 12:30:42 BUG: scheduling while atomic: ior/5692/0x10000002
2013-05-05 12:30:42 BUG: unable to handle kernel paging request at 0000000315c2e000
2013-05-05 12:30:42 IP: [<ffffffff810568e4>] update_curr+0x144/0x1f0
2013-05-05 12:30:42 PGD 106a964067 PUD 0
2013-05-05 12:30:42 Thread overran stack, or stack corrupted
2013-05-05 12:30:42 Oops: 0000 [#1] SMP
2013-05-05 12:30:42 last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0/infiniband/mlx4_0/ports/1/pkeys/127
2013-05-05 12:30:42 CPU 9
2013-05-05 12:30:42 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ipmi_devintf acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad iw_cxgb4 iw_cxgb3 ib_core ext4 mbcache jbd2 dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm sg sd_mod crc_t10dif wmi dcdbas sb_edac edac_core ahci i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en mlx4_core igb dca ptp pps_core be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: cpufreq_ondemand]
2013-05-05 12:30:43
2013-05-05 12:30:43 Pid: 5692, comm: ior Not tainted 2.6.32-358.2.1.el6.x86_64 #1 Dell Inc. PowerEdge C6220/0HYFFG
2013-05-05 12:30:43 RIP: 0010:[<ffffffff810568e4>] [<ffffffff810568e4>] update_curr+0x144/0x1f0
2013-05-05 12:30:43 RSP: 0018:ffff88089c423db8 EFLAGS: 00010086
2013-05-05 12:30:43 RAX: ffff880840d79540 RBX: 0000000072806048 RCX: ffff880877f101c0
2013-05-05 12:30:43 RDX: 00000000000192d8 RSI: 0000000000000000 RDI: ffff880840d79578
2013-05-05 12:30:43 RBP: ffff88089c423de8 R08: ffffffff8160bb65 R09: 0000000000000007
2013-05-05 12:30:43 R10: 0000000000000010 R11: 0000000000000007 R12: ffff88089c436768
2013-05-05 12:30:43 R13: 00000000007c9fa8 R14: 0000082565f22284 R15: ffff880840d79540
2013-05-05 12:30:43 FS: 00002aaaafebf8c0(0000) GS:ffff88089c420000(0000) knlGS:0000000000000000
2013-05-05 12:30:43 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2013-05-05 12:30:43 CR2: 0000000315c2e000 CR3: 000000106aea6000 CR4: 00000000000407e0
2013-05-05 12:30:43 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2013-05-05 12:30:43 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2013-05-05 12:30:43 Process ior (pid: 5692, threadinfo ffff880872806000, task ffff880840d79540)
2013-05-05 12:30:43 Stack:
2013-05-05 12:30:43 ffff88089c423dc8 ffffffff81013783 ffff880840d79578 ffff88089c436768
2013-05-05 12:30:43 <d> 0000000000000000 0000000000000000 ffff88089c423e18 ffffffff81056e9b
2013-05-05 12:30:43 <d> ffff88089c436700 0000000000000009 0000000000016700 0000000000000009
2013-05-05 12:30:43 Call Trace:
2013-05-05 12:30:43 <IRQ>
2013-05-05 12:30:43 [<ffffffff81013783>] ? native_sched_clock+0x13/0x80
2013-05-05 12:30:43 BUG: unable to handle kernel paging request at 000000000001182f
2013-05-05 12:30:43 IP: [<ffffffff8100f4dd>] print_context_stack+0xad/0x140
2013-05-05 12:30:43 PGD 106a964067 PUD 106a825067 PMD 0
2013-05-05 12:30:43 Thread overran stack, or stack corrupted
2013-05-05 12:30:43 Oops: 0000 [#2] SMP
2013-05-05 12:30:43 last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0/infiniband/mlx4_0/ports/1/pkeys/127
2013-05-05 12:30:43 CPU 9
Same as LU-3188: continuous stack dumps until the node crashes.
Attachments
- console.iwc113 (103 kB)
- console.iwc21 (79 kB)
- console.iwc4 (93 kB)
Issue Links
- is related to LU-3188 "IOR fails due to client stack overrun" (Resolved)
Activity
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/16692
Subject: LU-3281 obdclass: remove structure holes to reduce memory
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d4bee9e4e76e3e7e9701595f6f45b7e27cff3a78
Closing ticket since the patch landed on master. Please let me know if more work is needed and I will reopen.
OK, I ran v3 of the patch through checkstack:
Overall it is a good reduction. Note: I believe gcc is handling some of the static symbols differently, which explains the disappearance of osc_send_read_rpc from the list.
376 osc:osc_io_unplug0
200 osc:osc_build_rpc
184 osc:osc_lru_shrink
168 osc:osc_brw_prep_request
144 obdclass:cl_page_find0
104 osc:osc_io_submit
96 lustre:vvp_io_read_start
96 lustre:ll_file_read
88 lov:lov_io_submit
80 ptlrpc:__ptlrpc_request_bufs_pack
80 obdclass:cl_page_put
72 osc:osc_lru_del
72 lustre:ll_file_io_generic
64 ptlrpc:ptlrpc_request_bufs_pack
64 lustre:ll_releasepage
56 osc:osc_queue_sync_pages
56 obdclass:lu_object_put
48 osc:osc_req_attr_set
48 lustre:ll_readpage
40 ptlrpc:lustre_pack_request
40 obdclass:lprocfs_alloc_stats
40 obdclass:cl_req_attr_set
40 lustre:ll_file_aio_read
32 ptlrpc:sptlrpc_cli_alloc_reqbuf
32 ptlrpc:ptlrpc_request_pack
32 ptlrpc:null_alloc_reqbuf
32 osc:osc_page_delete
32 obdclass:lprocfs_alloc_obd_stats
32 obdclass:lprocfs_alloc_md_stats
32 libcfs:cfs_hash_dual_bd_unlock
32 libcfs:cfs_hash_bd_from_key
32 libcfs:cfs_alloc
24 osc:discard_pagevec
24 obdclass:cl_page_assume
24 obdclass:cl_io_submit_rw
24 obdclass:cl_io_read_page
16 obdclass:cl_page_delete
8 obdclass:cl_page_free
8 obdclass:cl_page_delete0
8 obdclass:cl_io_start
8 obdclass:cl_env_hops_keycmp
Is iwc21 still using the old client image (without the http://review.whamcloud.com/#change,6191 patch)? Its stack seems to indicate that.
And iwc4 shows another long call path (memory pressure leading to a long call chain):
2013-05-05 13:05:45 WARNING: at kernel/sched_fair.c:1846 hrtick_start_fair+0x18b/0x190() (Not tainted)
2013-05-05 13:05:45 Hardware name: PowerEdge C6220
2013-05-05 13:05:45 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ipmi_devintf acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad iw_cxgb4 iw_cxgb3 ib_core ext4 mbcache jbd2 dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm sg sd_mod crc_t10dif wmi dcdbas sb_edac edac_core i2c_i801 i2c_core ahci iTCO_wdt iTCO_vendor_support shpchp ioatdma nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en mlx4_core igb dca ptp pps_core be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: cpufreq_ondemand]
2013-05-05 13:05:45 Pid: 7442, comm: ior Not tainted 2.6.32-358.2.1.el6.x86_64 #1
2013-05-05 13:05:45 Call Trace:
2013-05-05 13:05:45 [<ffffffff8106e2e7>] ? warn_slowpath_common+0x87/0xc0
2013-05-05 13:05:45 [<ffffffff8106e33a>] ? warn_slowpath_null+0x1a/0x20
2013-05-05 13:05:45 [<ffffffff810575eb>] ? hrtick_start_fair+0x18b/0x190
2013-05-05 13:05:45 [<ffffffff81064a00>] ? pick_next_task_fair+0xd0/0x130
2013-05-05 13:05:45 [<ffffffff8150d6d8>] ? schedule+0x178/0x3b2
2013-05-05 13:05:45 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
2013-05-05 13:05:45 [<ffffffff81064d6a>] ? __cond_resched+0x2a/0x40
2013-05-05 13:05:45 [<ffffffff8150e320>] ? _cond_resched+0x30/0x40
2013-05-05 13:05:45 [<ffffffffa0712df5>] ? cl_page_free+0xd5/0x5b0 [obdclass]
2013-05-05 13:05:45 [<ffffffffa07134bf>] ? cl_page_put+0x1ef/0x4c0 [obdclass]
2013-05-05 13:05:45 [<ffffffff8112c74f>] ? free_hot_page+0x2f/0x60
2013-05-05 13:05:45 [<ffffffff8112f36e>] ? __put_single_page+0x1e/0x30
2013-05-05 13:05:45 [<ffffffffa0a9beca>] ? lov_page_fini+0x6a/0x170 [lov]
2013-05-05 13:05:45 [<ffffffffa0712e2b>] ? cl_page_free+0x10b/0x5b0 [obdclass]
2013-05-05 13:05:45 [<ffffffffa07134bf>] ? cl_page_put+0x1ef/0x4c0 [obdclass]
2013-05-05 13:05:45 [<ffffffffa0711628>] ? cl_page_disown0+0x88/0x180 [obdclass]
2013-05-05 13:05:45 [<ffffffffa0712502>] ? cl_page_disown+0xa2/0x120 [obdclass]
2013-05-05 13:05:45 [<ffffffffa0a087e7>] ? discard_pagevec+0x47/0x110 [osc]
2013-05-05 13:05:45 [<ffffffffa0a092ac>] ? osc_lru_shrink+0x9fc/0xe40 [osc]
2013-05-05 13:05:45 [<ffffffffa0a0a346>] ? osc_lru_del+0x3c6/0x560 [osc]
2013-05-05 13:05:45 [<ffffffffa0588d84>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
2013-05-05 13:05:45 [<ffffffffa0a0a9c4>] ? osc_page_delete+0xe4/0x320 [osc]
2013-05-05 13:05:45 [<ffffffffa0713eb5>] ? cl_page_delete0+0xc5/0x4e0 [obdclass]
2013-05-05 13:05:45 [<ffffffffa0714312>] ? cl_page_delete+0x42/0x120 [obdclass]
2013-05-05 13:05:45 [<ffffffffa0b6274b>] ? ll_releasepage+0x12b/0x1a0 [lustre]
2013-05-05 13:05:45 [<ffffffff811190a0>] ? try_to_release_page+0x30/0x60
2013-05-05 13:05:45 [<ffffffff81132827>] ? shrink_page_list.clone.3+0x517/0x650
2013-05-05 13:05:45 [<ffffffff8117086b>] ? mem_cgroup_lru_del_list+0x2b/0xb0
2013-05-05 13:05:45 [<ffffffff81132bf7>] ? isolate_lru_pages.clone.0+0xd7/0x170
2013-05-05 13:05:45 [<ffffffff81133343>] ? shrink_inactive_list+0x343/0x830
2013-05-05 13:05:45 [<ffffffff8112d04a>] ? determine_dirtyable_memory+0x1a/0x30
2013-05-05 13:05:45 [<ffffffff8112d0f7>] ? get_dirty_limits+0x27/0x2f0
2013-05-05 13:05:45 [<ffffffff81133bde>] ? shrink_mem_cgroup_zone+0x3ae/0x610
2013-05-05 13:05:45 [<ffffffff8117257d>] ? mem_cgroup_iter+0xfd/0x280
2013-05-05 13:05:45 [<ffffffff81133ea3>] ? shrink_zone+0x63/0xb0
2013-05-05 13:05:46 [<ffffffff81134aa9>] ? zone_reclaim+0x349/0x400
2013-05-05 13:05:46 [<ffffffff8112a62c>] ? get_page_from_freelist+0x69c/0x830
2013-05-05 13:05:46 [<ffffffffa0710005>] ? cl_page_slice_add+0x55/0x140 [obdclass]
2013-05-05 13:05:46 [<ffffffff8112bbc3>] ? __alloc_pages_nodemask+0x113/0x8d0
2013-05-05 13:05:46 [<ffffffff8116095a>] ? alloc_pages_current+0xaa/0x110
2013-05-05 13:05:46 [<ffffffff8111a1d7>] ? __page_cache_alloc+0x87/0x90
2013-05-05 13:05:46 [<ffffffff81119bbe>] ? find_get_page+0x1e/0xa0
2013-05-05 13:05:46 [<ffffffff8111b560>] ? grab_cache_page_nowait+0x70/0xc0
2013-05-05 13:05:46 [<ffffffffa0b4afeb>] ? ll_readahead+0x8bb/0x10f0 [lustre]
2013-05-05 13:05:46 [<ffffffffa0b75af5>] ? vvp_io_read_page+0x305/0x340 [lustre]
2013-05-05 13:05:46 [<ffffffffa071ea0d>] ? cl_io_read_page+0x8d/0x170 [obdclass]
2013-05-05 13:05:46 [<ffffffffa07127f7>] ? cl_page_assume+0xf7/0x220 [obdclass]
2013-05-05 13:05:46 [<ffffffffa0b4a196>] ? ll_readpage+0x96/0x1a0 [lustre]
2013-05-05 13:05:46 [<ffffffff81096ce0>] ? wake_bit_function+0x0/0x50
2013-05-05 13:05:46 [<ffffffff8111b7ac>] ? generic_file_aio_read+0x1fc/0x700
2013-05-05 13:05:46 [<ffffffffa0b77297>] ? vvp_io_read_start+0x257/0x470 [lustre]
2013-05-05 13:05:46 [<ffffffffa071c56a>] ? cl_io_start+0x6a/0x140 [obdclass]
2013-05-05 13:05:46 [<ffffffffa0720ca4>] ? cl_io_loop+0xb4/0x1b0 [obdclass]
2013-05-05 13:05:46 [<ffffffffa0b1efcf>] ? ll_file_io_generic+0x33f/0x600 [lustre]
2013-05-05 13:05:46 [<ffffffffa0b1f3cf>] ? ll_file_aio_read+0x13f/0x2c0 [lustre]
2013-05-05 13:05:46 [<ffffffffa0b1fc6c>] ? ll_file_read+0x16c/0x2a0 [lustre]
2013-05-05 13:05:46 [<ffffffff811817a5>] ? vfs_read+0xb5/0x1a0
2013-05-05 13:05:46 [<ffffffff811818e1>] ? sys_read+0x51/0x90
2013-05-05 13:05:46 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
2013-05-05 13:05:46 ---[ end trace a96e3bc7fcd2c0a4 ]---
I just took a quick pass at extracting the stack info.
checkstack and my v1 script don't seem to resolve all of the stack calls, but this is what I got as a first pass.
There are a few heavy functions in this code path.
360 osc:osc_send_read_rpc
296 osc:osc_build_rpc
216 osc:osc_io_unplug0
184 osc:osc_lru_shrink
168 osc:osc_brw_prep_request
144 obdclass:cl_page_find0
104 osc:osc_io_submit
96 lustre:vvp_io_read_start
96 lustre:ll_file_read
88 lov:lov_io_submit
80 ptlrpc:__ptlrpc_request_bufs_pack
80 obdclass:cl_page_put
72 osc:osc_lru_del
72 lustre:ll_file_io_generic
64 ptlrpc:ptlrpc_request_bufs_pack
64 lustre:ll_releasepage
56 osc:osc_queue_sync_pages
56 obdclass:lu_object_put
48 osc:osc_req_attr_set
48 lustre:ll_readpage
40 ptlrpc:lustre_pack_request
40 obdclass:lprocfs_alloc_stats
40 obdclass:cl_req_attr_set
40 lustre:ll_file_aio_read
32 ptlrpc:sptlrpc_cli_alloc_reqbuf
32 ptlrpc:ptlrpc_request_pack
32 ptlrpc:null_alloc_reqbuf
32 osc:osc_page_delete
32 obdclass:lprocfs_alloc_obd_stats
32 obdclass:lprocfs_alloc_md_stats
32 libcfs:cfs_hash_dual_bd_unlock
32 libcfs:cfs_hash_bd_from_key
32 libcfs:cfs_alloc
24 osc:discard_pagevec
24 obdclass:cl_page_assume
24 obdclass:cl_io_submit_rw
24 obdclass:cl_io_read_page
16 obdclass:cl_page_delete
8 obdclass:cl_page_free
8 obdclass:cl_page_delete0
8 obdclass:cl_io_start
8 obdclass:cl_env_hops_keycmp
Has anyone actually looked at the stack usage of these functions? It is entirely possible that there is some stack-hogging function that could be cleaned up. Please run "make checkstack" and look at all of the functions in these stacks to see which are the biggest stack users.
This is the long read stack seen on client 21:
2013-05-05 14:07:48 ------------[ cut here ]------------
2013-05-05 14:07:48 WARNING: at kernel/sched_fair.c:1846 hrtick_start_fair+0x18b/0x190() (Not tainted)
2013-05-05 14:07:48 Hardware name: PowerEdge C6220
2013-05-05 14:07:48 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ipmi_devintf acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad iw_cxgb4 iw_cxgb3 ib_core ext4 mbcache jbd2 dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm sg sd_mod crc_t10dif wmi dcdbas sb_edac edac_core i2c_i801 i2c_core ahci iTCO_wdt iTCO_vendor_support shpchp ioatdma nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en mlx4_core igb dca ptp pps_core be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: cpufreq_ondemand]
2013-05-05 14:07:49 Pid: 10797, comm: ior Not tainted 2.6.32-358.2.1.el6.x86_64 #1
2013-05-05 14:07:49 Call Trace:
2013-05-05 14:07:49 [<ffffffff8106e2e7>] ? warn_slowpath_common+0x87/0xc0
2013-05-05 14:07:49 BUG: unable to handle kernel NULL pointer dereference at 0000000000000a08
2013-05-05 14:07:49 IP: [<ffffffff8100f4dd>] print_context_stack+0xad/0x140
2013-05-05 14:07:49 PGD c3255b067 PUD fe2643067 PMD 0
2013-05-05 14:07:49 Thread overran stack, or stack corrupted
2013-05-05 14:07:49 Oops: 0000 [#1] SMP
2013-05-05 14:07:49 last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0/infiniband/mlx4_0/ports/1/pkeys/127
2013-05-05 14:07:49 CPU 28
2013-05-05 14:07:49 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ipmi_devintf acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad iw_cxgb4 iw_cxgb3 ib_core ext4 mbcache jbd2 dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm sg sd_mod crc_t10dif wmi dcdbas sb_edac edac_core i2c_i801 i2c_core ahci iTCO_wdt iTCO_vendor_support shpchp ioatdma nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en mlx4_core igb dca ptp pps_core be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: cpufreq_ondemand]
2013-05-05 14:07:49
2013-05-05 14:07:49 Pid: 10797, comm: ior Not tainted 2.6.32-358.2.1.el6.x86_64 #1 Dell Inc. PowerEdge C6220/0W6W6G
2013-05-05 14:07:49 RIP: 0010:[<ffffffff8100f4dd>] [<ffffffff8100f4dd>] print_context_stack+0xad/0x140
2013-05-05 14:07:49 RSP: 0018:ffff8808725282d8 EFLAGS: 00010006
2013-05-05 14:07:49 RAX: 0000000000000020 RBX: ffff880872528460 RCX: 00000000000015d3
2013-05-05 14:07:49 RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
2013-05-05 14:07:49 RBP: ffff880872528338 R08: 0000000000000000 R09: ffffffff8163fde0
2013-05-05 14:07:49 R10: 0000000000000001 R11: 0000000000000000 R12: ffff8808725283d8
2013-05-05 14:07:49 R13: ffff880872528000 R14: ffffffff81600460 R15: 0000000000000000
2013-05-05 14:07:49 FS: 00002aaaafebf8c0(0000) GS:ffff88089c580000(0000) knlGS:0000000000000000
2013-05-05 14:07:49 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2013-05-05 14:07:49 CR2: 0000000000000a08 CR3: 0000000dfedb4000 CR4: 00000000000407e0
2013-05-05 14:07:49 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2013-05-05 14:07:49 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2013-05-05 14:07:49 Process ior (pid: 10797, threadinfo ffff880872528000, task ffff88085d734080)
2013-05-05 14:07:49 Stack:
2013-05-05 14:07:49 ffff880872528378 ffff880872529ff8 ffff8808725283d8 ffffffffffffe000
2013-05-05 14:07:49 <d> ffffffff817c2fa0 ffffffff8106e2e7 ffff880872528348 ffff880872528440
2013-05-05 14:07:49 <d> 000000000000cbe0 ffffffff81600460 ffffffff817c2fa0 ffff88089c583fc0
2013-05-05 14:07:49 Call Trace:
2013-05-05 14:07:49 [<ffffffff8106e2e7>] ? warn_slowpath_common+0x87/0xc0
2013-05-05 14:07:49 [<ffffffff8100e4a0>] dump_trace+0x190/0x3b0
2013-05-05 14:07:49 [<ffffffff810575eb>] ? hrtick_start_fair+0x18b/0x190
2013-05-05 14:07:49 [<ffffffff8100f245>] show_trace_log_lvl+0x55/0x70
2013-05-05 14:07:49 [<ffffffff8100f275>] show_trace+0x15/0x20
2013-05-05 14:07:49 [<ffffffff8150d19a>] dump_stack+0x6f/0x76
2013-05-05 14:07:49 [<ffffffff810b2dfa>] ? print_modules+0x5a/0xf0
2013-05-05 14:07:49 [<ffffffff8106e2e7>] warn_slowpath_common+0x87/0xc0
2013-05-05 14:07:49 [<ffffffff8106e33a>] warn_slowpath_null+0x1a/0x20
2013-05-05 14:07:49 [<ffffffff810575eb>] hrtick_start_fair+0x18b/0x190
2013-05-05 14:07:49 [<ffffffff81064a00>] pick_next_task_fair+0xd0/0x130
2013-05-05 14:07:49 [<ffffffff8150d6d8>] schedule+0x178/0x3b2
2013-05-05 14:07:49 [<ffffffff81064d6a>] __cond_resched+0x2a/0x40
2013-05-05 14:07:49 [<ffffffff8150e320>] _cond_resched+0x30/0x40
2013-05-05 14:07:50 [<ffffffffa0708df5>] cl_page_free+0xd5/0x5b0 [obdclass]
2013-05-05 14:07:50 [<ffffffffa07094bf>] cl_page_put+0x1ef/0x4c0 [obdclass]
2013-05-05 14:07:50 [<ffffffffa0707628>] ? cl_page_disown0+0x88/0x180 [obdclass]
2013-05-05 14:07:50 [<ffffffffa0708502>] ? cl_page_disown+0xa2/0x120 [obdclass]
2013-05-05 14:07:50 [<ffffffffa0a0c7e7>] discard_pagevec+0x47/0x110 [osc]
2013-05-05 14:07:50 [<ffffffffa0a0d2ac>] osc_lru_shrink+0x9fc/0xe40 [osc]
2013-05-05 14:07:50 [<ffffffffa0a0e346>] osc_lru_del+0x3c6/0x560 [osc]
2013-05-05 14:07:50 [<ffffffffa057ed84>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
2013-05-05 14:07:50 [<ffffffffa0a0e9c4>] osc_page_delete+0xe4/0x320 [osc]
2013-05-05 14:07:50 [<ffffffffa0709eb5>] cl_page_delete0+0xc5/0x4e0 [obdclass]
2013-05-05 14:07:50 [<ffffffffa070a312>] cl_page_delete+0x42/0x120 [obdclass]
2013-05-05 14:07:50 [<ffffffffa0b6674b>] ll_releasepage+0x12b/0x1a0 [lustre]
2013-05-05 14:07:50 [<ffffffff81172160>] ? mem_cgroup_uncharge_cache_page+0x10/0x20
2013-05-05 14:07:50 [<ffffffff811190a0>] try_to_release_page+0x30/0x60
2013-05-05 14:07:50 [<ffffffff81132827>] shrink_page_list.clone.3+0x517/0x650
2013-05-05 14:07:50 [<ffffffff8117086b>] ? mem_cgroup_lru_del_list+0x2b/0xb0
2013-05-05 14:07:50 [<ffffffff81132bf7>] ? isolate_lru_pages.clone.0+0xd7/0x170
2013-05-05 14:07:50 [<ffffffff81133343>] shrink_inactive_list+0x343/0x830
2013-05-05 14:07:50 [<ffffffff81133bde>] shrink_mem_cgroup_zone+0x3ae/0x610
2013-05-05 14:07:50 [<ffffffff8117257d>] ? mem_cgroup_iter+0xfd/0x280
2013-05-05 14:07:50 [<ffffffff81133ea3>] shrink_zone+0x63/0xb0
2013-05-05 14:07:50 [<ffffffff81134aa9>] zone_reclaim+0x349/0x400
2013-05-05 14:07:50 [<ffffffffa0703279>] ? cl_env_hops_keycmp+0x19/0x70 [obdclass]
2013-05-05 14:07:50 [<ffffffff8112a62c>] get_page_from_freelist+0x69c/0x830
2013-05-05 14:07:50 [<ffffffff8112bbc3>] __alloc_pages_nodemask+0x113/0x8d0
2013-05-05 14:07:50 [<ffffffff810e6dd5>] ? call_rcu_sched+0x15/0x20
2013-05-05 14:07:50 [<ffffffffa057ed84>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
2013-05-05 14:07:50 [<ffffffff81166ab2>] kmem_getpages+0x62/0x170
2013-05-05 14:07:50 [<ffffffff8116711f>] cache_grow+0x2cf/0x320
2013-05-05 14:07:50 [<ffffffff81167372>] cache_alloc_refill+0x202/0x240
2013-05-05 14:07:50 [<ffffffffa056ab60>] ? cfs_alloc+0x30/0x60 [libcfs]
2013-05-05 14:07:50 [<ffffffff81168239>] __kmalloc+0x1a9/0x220
2013-05-05 14:07:50 [<ffffffffa056ab60>] cfs_alloc+0x30/0x60 [libcfs]
2013-05-05 14:07:50 [<ffffffffa088ba00>] null_alloc_reqbuf+0x190/0x420 [ptlrpc]
2013-05-05 14:07:50 [<ffffffffa087b5c9>] sptlrpc_cli_alloc_reqbuf+0x69/0x220 [ptlrpc]
2013-05-05 14:07:50 [<ffffffffa084f891>] lustre_pack_request+0x81/0x180 [ptlrpc]
2013-05-05 14:07:50 [<ffffffffa083d085>] __ptlrpc_request_bufs_pack+0xe5/0x3b0 [ptlrpc]
2013-05-05 14:07:50 [<ffffffffa083d3ac>] ptlrpc_request_bufs_pack+0x5c/0x80 [ptlrpc]
2013-05-05 14:07:50 [<ffffffffa083d3f4>] ptlrpc_request_pack+0x24/0x70 [ptlrpc]
2013-05-05 14:07:50 [<ffffffffa09fe6e5>] osc_brw_prep_request+0x155/0x1140 [osc]
2013-05-05 14:07:50 [<ffffffffa0a12cec>] ? osc_req_attr_set+0x16c/0x5b0 [osc]
2013-05-05 14:07:50 [<ffffffffa0712aa1>] ? cl_req_attr_set+0xd1/0x230 [obdclass]
2013-05-05 14:07:50 [<ffffffffa0a0491b>] osc_build_rpc+0x86b/0x1730 [osc]
2013-05-05 14:07:50 [<ffffffffa0a1c003>] osc_send_read_rpc+0x6a3/0x880 [osc]
2013-05-05 14:07:50 [<ffffffffa057f522>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs]
2013-05-05 14:07:50 [<ffffffffa06fcabc>] ? lu_object_put+0x12c/0x330 [obdclass]
2013-05-05 14:07:50 [<ffffffffa0a1fb76>] osc_io_unplug0+0xb46/0x12c0 [osc]
2013-05-05 14:07:50 [<ffffffffa070a83c>] ? cl_page_find0+0x44c/0x810 [obdclass]
2013-05-05 14:07:50 [<ffffffff8112bbc3>] ? __alloc_pages_nodemask+0x113/0x8d0
2013-05-05 14:07:50 [<ffffffffa0a22091>] osc_io_unplug+0x11/0x20 [osc]
2013-05-05 14:07:50 [<ffffffffa0a28c80>] osc_queue_sync_pages+0x1d0/0x360 [osc]
2013-05-05 14:07:50 [<ffffffffa0a137e0>] osc_io_submit+0x330/0x4a0 [osc]
2013-05-05 14:07:50 [<ffffffffa071240c>] cl_io_submit_rw+0x6c/0x160 [obdclass]
2013-05-05 14:07:50 [<ffffffffa0aa904a>] lov_io_submit+0x4ea/0xbc0 [lov]
2013-05-05 14:07:50 [<ffffffffa071240c>] cl_io_submit_rw+0x6c/0x160 [obdclass]
2013-05-05 14:07:50 [<ffffffffa0714a2e>] cl_io_read_page+0xae/0x170 [obdclass]
2013-05-05 14:07:50 [<ffffffffa07087f7>] ? cl_page_assume+0xf7/0x220 [obdclass]
2013-05-05 14:07:51 [<ffffffffa0b4e196>] ll_readpage+0x96/0x1a0 [lustre]
2013-05-05 14:07:51 [<ffffffff8111b7ac>] generic_file_aio_read+0x1fc/0x700
2013-05-05 14:07:51 [<ffffffffa0b7b297>] vvp_io_read_start+0x257/0x470 [lustre]
2013-05-05 14:07:51 [<ffffffffa071256a>] cl_io_start+0x6a/0x140 [obdclass]
2013-05-05 14:07:51 [<ffffffffa0716ca4>] cl_io_loop+0xb4/0x1b0 [obdclass]
2013-05-05 14:07:51 [<ffffffffa0b22fcf>] ll_file_io_generic+0x33f/0x600 [lustre]
2013-05-05 14:07:51 [<ffffffffa0b233cf>] ll_file_aio_read+0x13f/0x2c0 [lustre]
2013-05-05 14:07:51 [<ffffffffa0b23c6c>] ll_file_read+0x16c/0x2a0 [lustre]
2013-05-05 14:07:51 [<ffffffff811817a5>] vfs_read+0xb5/0x1a0
2013-05-05 14:07:51 [<ffffffff811818e1>] sys_read+0x51/0x90
2013-05-05 14:07:51 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
2013-05-05 14:07:51 Code: 38 08 00 85 c0 74 2c 49 8d 44 24 08 48 39 c3 74 7d 31 d2 48 8b 75 c8 48 8b 7d c0 41 ff 56 10 48 81 7d c8 41 ae 00 81 49 8b 45 00 <8b> 90 e8 09 00 00 74 1b 48 83 c3 08 4d 85 ff 75 92 4c 39 eb 76
2013-05-05 14:07:51 RIP [<ffffffff8100f4dd>] print_context_stack+0xad/0x140
2013-05-05 14:07:51 RSP <ffff8808725282d8>
2013-05-05 14:07:51 CR2: 0000000000000a08
I thought one of the goals of the last patch was to move to osc_io_unplug_async? In general, it seems like calling osc_io_unplug from any context can be unsafe for the stack, since memory pressure can really lengthen the call chain.
Another client, iwc21, was able to dump a stack into the kernel logs. This looks to be very relevant.
If possible, would you try this patch?
http://review.whamcloud.com/6270
LU-3281 osc: don't hold a spinlock before waking up process
ptlrpcd_add_req() will wake_up another process; do not hold a spinlock before calling ptlrpcd_queue_work()->ptlrpcd_add_req().
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16692/
Subject: LU-3281 obdclass: remove structure holes to reduce memory
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 790ca7b3c47d27639390bd7a2b9693182d3f24d6