[LU-3188] IOR fails due to client stack overrun Created: 17/Apr/13 Updated: 11/Oct/14 Resolved: 23/Jun/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.4 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB | ||
| Environment: |
Hyperion/LLNL |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 7781 | ||||||||||||
| Description |
|
This is currently killing all IOR runs on Hyperion: 2013-04-17 15:55:20 BUG: scheduling while atomic: ior/44672/0x10000002 2013-04-17 15:55:20 BUG: unable to handle kernel paging request at fffffffceb9ee000 2013-04-17 15:55:20 IP: [<ffffffff810568e4>] update_curr+0x144/0x1f0 2013-04-17 15:55:20 PGD 1a87067 PUD 0 2013-04-17 15:55:20 Thread overran stack, or stack corrupted 2013-04-17 15:55:20 Oops: 0000 [#1] SMP 2013-04-17 15:55:20 last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0/infiniband/mlx4_0/ports/1/pkeys/127 2013-04-17 15:55:20 CPU 25 2013-04-17 15:55:20 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) sha512_generic sha256_generic ipmi_devintf acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad iw_cxgb4 iw_cxgb3 ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm sg sd_mod crc_t10dif wmi dcdbas sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ahci shpchp ioatdma nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en mlx4_core igb dca ptp pps_core be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: cpufreq_ondemand] 2013-04-17 15:55:20 2013-04-17 15:55:20 Pid: 44672, comm: ior Not tainted 2.6.32-358.2.1.el6.x86_64 #1 Dell Inc. PowerEdge C6220/0HYFFG 2013-04-17 15:55:20 RIP: 0010:[<ffffffff810568e4>] [<ffffffff810568e4>] update_curr+0x144/0x1f0 2013-04-17 15:55:20 RSP: 0018:ffff88089c523db8 EFLAGS: 00010086 2013-04-17 15:55:20 RAX: ffff88086f748080 RBX: ffffffffad3be048 RCX: ffff880877f101c0 2013-04-17 15:55:20 RDX: 00000000000192d8 RSI: 0000000000000000 RDI: ffff88086f7480b8 2013-04-17 15:55:20 RBP: ffff88089c523de8 R08: ffffffff8160bb65 R09: 0000000000000007 2013-04-17 15:55:20 R10: 0000000000000010 R11: 0000000000000007 R12: ffff88089c536768 2013-04-17 15:55:20 R13: 000000000080f9df R14: 0000a8ac18cddce3 R15: ffff88086f748080 2013-04-17 15:55:20 FS: 00002aaaafebf8c0(0000) GS:ffff88089c520000(0000) knlGS:0000000000000000 2013-04-17 15:55:20 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 2013-04-17 15:55:20 CR2: fffffffceb9ee000 CR3: 000000105cb6c000 CR4: 00000000000407e0 2013-04-17 15:55:20 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2013-04-17 15:55:20 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2013-04-17 15:55:20 Process ior (pid: 44672, threadinfo ffff8807ad3be000, task ffff88086f748080) 2013-04-17 15:55:20 Stack: 2013-04-17 15:55:20 ffff88089c523dc8 ffffffff81013783 ffff88086f7480b8 ffff88089c536768 2013-04-17 15:55:20 <d> 0000000000000000 0000000000000000 ffff88089c523e18 ffffffff81056e9b 2013-04-17 15:55:20 <d> ffff88089c536700 0000000000000019 0000000000016700 0000000000000019 2013-04-17 15:55:20 Call Trace: 2013-04-17 15:55:20 <IRQ> 2013-04-17 15:55:20 [<ffffffff81013783>] ? native_sched_clock+0x13/0x80 2013-04-17 15:55:20 BUG: unable to handle kernel paging request at 000000000001400d 2013-04-17 15:55:20 IP: [<ffffffff8100f4dd>] print_context_stack+0xad/0x140 2013-04-17 15:55:20 PGD 1067f2c067 PUD 105b956067 PMD 0 2013-04-17 15:55:20 Thread overran stack, or stack corrupted 2013-04-17 15:55:20 Oops: 0000 [#2] SMP 2013-04-17 15:55:20 last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0/infiniband/mlx4_0/ports/1/pkeys/127 2013-04-17 15:55:20 CPU 25 2013-04-17 15:55:20 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) sha512_generic sha256_generic ipmi_devintf acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad iw_cxgb4 iw_cxgb3 ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm sg sd_mod crc_t10dif wmi dcdbas sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ahci shpchp ioatdma nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en mlx4_core igb dca ptp pps_core be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: cpufreq_ondemand] 2013-04-17 15:55:21 2013-04-17 15:55:21 Pid: 44672, comm: ior Not tainted 2.6.32-358.2.1.el6.x86_64 #1 Dell Inc. PowerEdge C6220/0HYFFG 2013-04-17 15:55:21 RIP: 0010:[<ffffffff8100f4dd>] [<ffffffff8100f4dd>] print_context_stack+0xad/0x140 2013-04-17 15:55:21 RSP: 0018:ffff88089c5238c8 EFLAGS: 00010006 2013-04-17 15:55:21 RAX: 0000000000013625 RBX: ffff88089c523dc0 RCX: 00000000000016f5 2013-04-17 15:55:21 RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046 2013-04-17 15:55:21 RBP: ffff88089c523928 R08: 0000000000000000 R09: ffffffff8163fde0 2013-04-17 15:55:21 R10: 0000000000000001 R11: 0000000000000000 R12: ffff88089c523de8 2013-04-17 15:55:21 R13: ffff8807ad3be000 R14: ffffffff81600460 R15: ffff88089c523fc0 2013-04-17 15:55:21 FS: 00002aaaafebf8c0(0000) GS:ffff88089c520000(0000) knlGS:0000000000000000 2013-04-17 15:55:21 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 2013-04-17 15:55:21 CR2: 000000000001400d CR3: 000000105cb6c000 CR4: 00000000000407e0 2013-04-17 15:55:21 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2013-04-17 15:55:21 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2013-04-17 15:55:21 Process ior (pid: 44672, threadinfo ffff8807ad3be000, task ffff88086f748080) It never completes dumping the stack, instead it hits this BUG in a loop until the kernel stack is corrupt, then the node reboots. |
| Comments |
| Comment by Cliff White (Inactive) [ 17/Apr/13 ] |
|
Chris from LLNL indicates the stack overflow might be indicating a potential stack overflow issue with Lustre, or our kernel. |
| Comment by Peter Jones [ 18/Apr/13 ] |
|
Bobijam Could you please look into this one? Thanks Peter |
| Comment by Zhenyu Xu [ 18/Apr/13 ] |
|
is there any more information? better get another hit and collect more information. |
| Comment by Cliff White (Inactive) [ 18/Apr/13 ] |
|
I will see what I can do. It never dumped a full stack, due to repeatedly hitting the BUG |
| Comment by Keith Mannthey (Inactive) [ 18/Apr/13 ] |
|
Cliff what IOR configuration where you running? Was that seen on the Client, MDS or OSS? It does look like stack corruption. Do you have more kernel message context before the scheduling while atomic? |
| Comment by Keith Mannthey (Inactive) [ 18/Apr/13 ] |
|
I have a kernel panics seen on similar clients while running. It is big of mess but but they might be related to the stack overflow.. Not sure. The client (iwc16) has been up and running for maybe a day so doing tests. Then it blew up. 2013-04-17 08:10:47 WARNING: at kernel/sched_fair.c:1846 hrtick_start_fair+0x18b/0x190() (Not tainted) 2013-04-17 08:10:47 Hardware name: PowerEdge C6220 2013-04-17 08:10:47 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ipmi_devintf acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad iw_cxgb4 iw_cxgb3 ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm sg sd_mod crc_t10dif wmi dcdbas sb_edac edac_core i2c_i801 i2c_core ahci iTCO_wdt iTCO_vendor_support shpchp ioatdma nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en mlx4_core igb dca ptp pps_core be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: cpufreq_ondemand] 2013-04-17 08:10:48 Pid: 31339, comm: ior Not tainted 2.6.32-358.2.1.el6.x86_64 #1 2013-04-17 08:10:48 Call Trace: 2013-04-17 08:10:48 <IRQ> [<ffffffff8106e2e7>] ? warn_slowpath_common+0x87/0xc0 2013-04-17 08:10:48 [<ffffffff8106e33a>] ? warn_slowpath_null+0x1a/0x20 2013-04-17 08:10:48 [<ffffffff810575eb>] ? hrtick_start_fair+0x18b/0x190 2013-04-17 08:10:48 [<ffffffff81065ceb>] ? enqueue_task_fair+0xfb/0x100 2013-04-17 08:10:48 [<ffffffff81055f76>] ? enqueue_task+0x66/0x80 2013-04-17 08:10:48 [<ffffffff81055fb3>] ? activate_task+0x23/0x30 2013-04-17 08:10:48 [<ffffffff81063120>] ? try_to_wake_up+0x1f0/0x3e0 2013-04-17 08:10:48 [<ffffffff81063322>] ? default_wake_function+0x12/0x20 2013-04-17 08:10:48 [<ffffffff81096cb6>] ? autoremove_wake_function+0x16/0x40 2013-04-17 08:10:48 [<ffffffff81051419>] ? __wake_up_common+0x59/0x90 2013-04-17 08:10:48 [<ffffffff81055aa8>] ? __wake_up+0x48/0x70 2013-04-17 08:10:48 [<ffffffff810912e0>] ? delayed_work_timer_fn+0x0/0x50 2013-04-17 08:10:48 [<ffffffff81090d8d>] ? insert_work+0x6d/0xb0 2013-04-17 08:10:48 [<ffffffff810912c6>] ? __queue_work+0x36/0x50 2013-04-17 08:10:48 [<ffffffff81091319>] ? delayed_work_timer_fn+0x39/0x50 2013-04-17 08:10:48 [<ffffffff81081837>] ? run_timer_softirq+0x197/0x340 2013-04-17 08:10:48 [<ffffffff810a7ff0>] ? tick_sched_timer+0x0/0xc0 2013-04-17 08:10:48 [<ffffffff8102e94d>] ? lapic_next_event+0x1d/0x30 2013-04-17 08:10:48 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0 2013-04-17 08:10:48 [<ffffffff8109b77b>] ? hrtimer_interrupt+0x14b/0x260 2013-04-17 08:10:48 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30 2013-04-17 08:10:48 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0 2013-04-17 08:10:48 [<ffffffff81076d95>] ? irq_exit+0x85/0x90 2013-04-17 08:10:48 [<ffffffff81517000>] ? smp_apic_timer_interrupt+0x70/0x9b 2013-04-17 08:10:48 [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20 2013-04-17 08:10:48 <EOI> [<ffffffff8100b2f2>] ? int_check_syscall_exit_work+0x0/0x3d 2013-04-17 08:10:48 ---[ end trace c5d166a5a1430d97 ]--- 2013-04-17 08:10:48 ------------[ cut here ]------------ 2013-04-17 08:10:48 kernel BUG at kernel/sched.c:1412! 2013-04-17 08:10:48 invalid opcode: 0000 [#1] SMP 2013-04-17 08:10:48 last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0/infiniband/mlx4_0/ports/1/pkeys/127 2013-04-17 08:10:48 CPU 31 2013-04-17 08:10:48 Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ipmi_devintf acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad iw_cxgb4 iw_cxgb3 ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm sg sd_mod crc_t10dif wmi dcdbas sb_edac edac_core i2c_i801 i2c_core ahci iTCO_wdt iTCO_vendor_support shpchp ioatdma nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en mlx4_core igb dca ptp pps_core be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: cpufreq_ondemand] 2013-04-17 08:10:48 2013-04-17 08:10:48 Pid: 31339, comm: ior Tainted: G W --------------- 2.6.32-358.2.1.el6.x86_64 #1 Dell Inc. PowerEdge C6220/0W6W6G 2013-04-17 08:10:48 RIP: 0010:[<ffffffff8105225a>] [<ffffffff8105225a>] resched_task+0x6a/0x80 2013-04-17 08:10:48 RSP: 0018:ffff88089c5e3c10 EFLAGS: 00010046 2013-04-17 08:10:48 RAX: 0000000035ee35ee RBX: ffff88089c5f6700 RCX: 00000000000035ee 2013-04-17 08:10:48 RDX: ffff8808625fa000 RSI: ffff880873d18b18 RDI: ffff880577359540 2013-04-17 08:10:48 RBP: ffff88089c5e3c10 R08: 0000000000000000 R09: ffffffff8163fde0 2013-04-17 08:10:48 R10: 0000000000000001 R11: 0000000000000000 R12: ffff880577359578 2013-04-17 08:10:49 R13: ffff880873d18b18 R14: ffff880577359540 R15: 0000000000000000 2013-04-17 08:10:49 FS: 00002aaaafebf8c0(0000) GS:ffff88089c5e0000(0000) knlGS:0000000000000000 2013-04-17 08:10:49 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 2013-04-17 08:10:49 CR2: 0000000000899000 CR3: 0000001055f74000 CR4: 00000000000407e0 2013-04-17 08:10:49 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2013-04-17 08:10:49 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2013-04-17 08:10:49 Process ior (pid: 31339, threadinfo ffff8808625fa000, task ffff880577359540) 2013-04-17 08:10:49 Stack: 2013-04-17 08:10:49 ffff88089c5e3c70 ffffffff810570e0 ffff88089c5e3c50 ffffffff81065ceb 2013-04-17 08:10:49 <d> ffff88089c5e3c50 0000000073d18ae0 ffff88089c5f6700 ffff88089c5f6700 2013-04-17 08:10:49 <d> ffff88089c5f6700 0000000000000000 000000000000001f 0000000000000000 2013-04-17 08:10:49 Call Trace: 2013-04-17 08:10:49 <IRQ> 2013-04-17 08:10:49 [<ffffffff810570e0>] check_preempt_wakeup+0x1c0/0x260 2013-04-17 08:10:49 [<ffffffff81065ceb>] ? enqueue_task_fair+0xfb/0x100 2013-04-17 08:10:49 [<ffffffff8105230c>] check_preempt_curr+0x7c/0x90 2013-04-17 08:10:49 [<ffffffff81063145>] try_to_wake_up+0x215/0x3e0 2013-04-17 08:10:49 [<ffffffff81063322>] default_wake_function+0x12/0x20 2013-04-17 08:10:49 [<ffffffff81096cb6>] autoremove_wake_function+0x16/0x40 2013-04-17 08:10:49 [<ffffffff81051419>] __wake_up_common+0x59/0x90 2013-04-17 08:10:49 [<ffffffff81055aa8>] __wake_up+0x48/0x70 2013-04-17 08:10:49 [<ffffffff810912e0>] ? delayed_work_timer_fn+0x0/0x50 2013-04-17 08:10:49 [<ffffffff81090d8d>] insert_work+0x6d/0xb0 2013-04-17 08:10:49 [<ffffffff810912c6>] __queue_work+0x36/0x50 2013-04-17 08:10:49 [<ffffffff81091319>] delayed_work_timer_fn+0x39/0x50 2013-04-17 08:10:49 [<ffffffff81081837>] run_timer_softirq+0x197/0x340 2013-04-17 08:10:49 [<ffffffff810a7ff0>] ? tick_sched_timer+0x0/0xc0 2013-04-17 08:10:49 [<ffffffff8102e94d>] ? lapic_next_event+0x1d/0x30 2013-04-17 08:10:49 [<ffffffff81076fb1>] __do_softirq+0xc1/0x1e0 2013-04-17 08:10:49 [<ffffffff8109b77b>] ? hrtimer_interrupt+0x14b/0x260 2013-04-17 08:10:49 [<ffffffff8100c1cc>] call_softirq+0x1c/0x30 2013-04-17 08:10:49 [<ffffffff8100de05>] do_softirq+0x65/0xa0 2013-04-17 08:10:49 [<ffffffff81076d95>] irq_exit+0x85/0x90 2013-04-17 08:10:49 [<ffffffff81517000>] smp_apic_timer_interrupt+0x70/0x9b 2013-04-17 08:10:49 [<ffffffff8100bb93>] apic_timer_interrupt+0x13/0x20 013-04-17 08:10:49 <EOI> 2013-04-17 08:10:49 [<ffffffff8100b2f2>] ? int_check_syscall_exit_work+0x0/0x3d 2013-04-17 08:10:49 Code: 25 b8 e0 00 00 8b 40 18 39 d0 74 0d 0f ae f0 48 8b 57 08 f6 42 14 04 74 08 c9 c3 66 0f 1f 44 00 00 89 c7 ff 15 28 43 a4 00 c9 c3 <0f> 0b 0f 1f 40 00 eb fa 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 2013-04-17 08:10:49 RIP [<ffffffff8105225a>] resched_task+0x6a/0x80 2013-04-17 08:10:49 RSP <ffff88089c5e3c10> 2013-04-17 08:10:50 Initializing cgroup subsys cpuset 2013-04-17 08:10:50 Initializing cgroup subsys cpu 2013-04-17 08:10:50 Linux version 2.6.32-358.2.1.el6.x86_64 .... Kernel reboots. |
| Comment by Keith Mannthey (Inactive) [ 18/Apr/13 ] |
|
Conman output from a client that got BUG: scheduling while atomic: ior/19445/0x10010100 |
| Comment by Keith Mannthey (Inactive) [ 18/Apr/13 ] |
|
Another schedule while atomic with some real debug. |
| Comment by Andreas Dilger [ 22/Apr/13 ] |
|
There is a stack overflow in the attached console logs: 2013-04-15 04:52:31 WARNING: at kernel/sched_fair.c:1846 hrtick_start_fair+0x18b/0x190() (Tainted: G W 2013-04-15 04:52:32 Pid: 19445, comm: ior Tainted: G W --------------- 2.6.32-358.2.1.el6.x86_64 #1 2013-04-15 04:52:32 Call Trace: 2013-04-15 04:52:32 <IRQ> [<ffffffff8106e2e7>] ? warn_slowpath_common+0x87/0xc0 2013-04-15 04:52:32 [<ffffffff8106e33a>] ? warn_slowpath_null+0x1a/0x20 2013-04-15 04:52:32 [<ffffffff810575eb>] ? hrtick_start_fair+0x18b/0x190 2013-04-15 04:52:32 [<ffffffff81065ceb>] ? enqueue_task_fair+0xfb/0x100 2013-04-15 04:52:32 [<ffffffff81055f76>] ? enqueue_task+0x66/0x80 2013-04-15 04:52:32 [<ffffffff81055fb3>] ? activate_task+0x23/0x30 2013-04-15 04:52:32 [<ffffffff81063120>] ? try_to_wake_up+0x1f0/0x3e0 2013-04-15 04:52:32 [<ffffffff81063322>] ? default_wake_function+0x12/0x20 2013-04-15 04:52:32 [<ffffffff81096cb6>] ? autoremove_wake_function+0x16/0x40 2013-04-15 04:52:32 [<ffffffff81051419>] ? __wake_up_common+0x59/0x90 2013-04-15 04:52:32 [<ffffffff81055aa8>] ? __wake_up+0x48/0x70 2013-04-15 04:52:32 [<ffffffff810912e0>] ? delayed_work_timer_fn+0x0/0x50 2013-04-15 04:52:32 [<ffffffff81090d8d>] ? insert_work+0x6d/0xb0 2013-04-15 04:52:32 [<ffffffff810912c6>] ? __queue_work+0x36/0x50 2013-04-15 04:52:32 [<ffffffff81091319>] ? delayed_work_timer_fn+0x39/0x50 2013-04-15 04:52:32 [<ffffffff81081837>] ? run_timer_softirq+0x197/0x340 2013-04-15 04:52:32 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0 2013-04-15 04:52:32 [<ffffffff8109b77b>] ? hrtimer_interrupt+0x14b/0x260 2013-04-15 04:52:32 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30 2013-04-15 04:52:32 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0 2013-04-15 04:52:32 [<ffffffff81076d95>] ? irq_exit+0x85/0x90 2013-04-15 04:52:32 [<ffffffff81517000>] ? smp_apic_timer_interrupt+0x70/0x9b 2013-04-15 04:52:32 [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20 2013-04-15 04:52:32 <EOI> [<ffffffff81510297>] ? _spin_unlock_irqrestore+0x17/0x20 2013-04-15 04:52:32 [<ffffffff81055ab3>] ? __wake_up+0x53/0x70 2013-04-15 04:52:32 [<ffffffffa04cf75a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs] 2013-04-15 04:52:32 [<ffffffffa07a18ca>] ? ptlrpc_set_add_new_req+0x9a/0x150 [ptlrpc] 2013-04-15 04:52:32 [<ffffffffa07d703f>] ? ptlrpcd_add_req+0x18f/0x2d0 [ptlrpc] 2013-04-15 04:52:32 [<ffffffffa0677c51>] ? cl_req_attr_set+0xd1/0x230 [obdclass] 2013-04-15 04:52:32 [<ffffffffa07b423c>] ? lustre_msg_get_opc+0x9c/0x110 [ptlrpc] 2013-04-15 04:52:32 [<ffffffffa0971e2e>] ? osc_build_rpc+0xd7e/0x1730 [osc] 2013-04-15 04:52:32 [<ffffffffa0988ff3>] ? osc_send_read_rpc+0x6a3/0x880 [osc] 2013-04-15 04:52:32 [<ffffffffa04e4522>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs] 2013-04-15 04:52:32 [<ffffffffa0661c6c>] ? lu_object_put+0x12c/0x330 [obdclass] 2013-04-15 04:52:32 [<ffffffffa098cb66>] ? osc_io_unplug0+0xb46/0x12c0 [osc] 2013-04-15 04:52:32 [<ffffffff81167330>] ? cache_alloc_refill+0x1c0/0x240 2013-04-15 04:52:32 [<ffffffffa098f081>] ? osc_io_unplug+0x11/0x20 [osc] 2013-04-15 04:52:32 [<ffffffffa0995c70>] ? osc_queue_sync_pages+0x1d0/0x360 [osc] 2013-04-15 04:52:32 [<ffffffffa09807d0>] ? osc_io_submit+0x330/0x4a0 [osc] 2013-04-15 04:52:32 [<ffffffffa06775bc>] ? cl_io_submit_rw+0x6c/0x160 [obdclass] 2013-04-15 04:52:32 [<ffffffffa0a1604a>] ? lov_io_submit+0x4ea/0xbc0 [lov] 2013-04-15 04:52:32 [<ffffffffa06775bc>] ? cl_io_submit_rw+0x6c/0x160 [obdclass] 2013-04-15 04:52:32 [<ffffffffa0679bde>] ? cl_io_read_page+0xae/0x170 [obdclass] 2013-04-15 04:52:32 [<ffffffffa066d9a7>] ? cl_page_assume+0xf7/0x220 [obdclass] 2013-04-15 04:52:33 [<ffffffffa0aba5d6>] ? ll_readpage+0x96/0x1a0 [lustre] 2013-04-15 04:52:33 [<ffffffff81096ce0>] ? wake_bit_function+0x0/0x50 2013-04-15 04:52:33 [<ffffffff8111b7ac>] ? generic_file_aio_read+0x1fc/0x700 2013-04-15 04:52:33 [<ffffffffa0ae7707>] ? vvp_io_read_start+0x257/0x470 [lustre] 2013-04-15 04:52:33 [<ffffffffa067771a>] ? cl_io_start+0x6a/0x140 [obdclass] 2013-04-15 04:52:33 [<ffffffffa067be54>] ? cl_io_loop+0xb4/0x1b0 [obdclass] 2013-04-15 04:52:33 [<ffffffffa0a8e6df>] ? ll_file_io_generic+0x33f/0x600 [lustre] 2013-04-15 04:52:33 [<ffffffffa0a8eadf>] ? ll_file_aio_read+0x13f/0x2c0 [lustre] 2013-04-15 04:52:33 [<ffffffffa0a8f68c>] ? ll_file_read+0x16c/0x2a0 [lustre] 2013-04-15 04:52:33 [<ffffffff811817a5>] ? vfs_read+0xb5/0x1a0 2013-04-15 04:52:33 [<ffffffff811818e1>] ? sys_read+0x51/0x90 2013-04-15 04:52:33 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b 2013-04-15 04:52:33 ---[ end trace 47344f6fbb7a59e2 ]--- 2013-04-15 04:52:33 BUG: scheduling while atomic: ior/19445/0x10010100 2013-04-15 04:52:33 Pid: 19445, comm: ior Tainted: G W --------------- 2.6.32-358.2.1.el6.x86_64 #1 Dell Inc. PowerEdge C6220/0HYFFG 2013-04-15 04:52:33 RIP: 0010:[<ffffffff81510297>] [<ffffffff81510297>] _spin_unlock_irqrestore 2013-04-15 04:52:34 Call Trace: 2013-04-15 04:52:34 <IRQ> 2013-04-15 04:52:34 [<ffffffff810912d1>] ? __queue_work+0x41/0x50 2013-04-15 04:52:34 [<ffffffff81091319>] ? delayed_work_timer_fn+0x39/0x50 2013-04-15 04:52:34 [<ffffffff81081837>] ? run_timer_softirq+0x197/0x340 2013-04-15 04:52:34 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0 2013-04-15 04:52:34 [<ffffffff8109b77b>] ? hrtimer_interrupt+0x14b/0x260 2013-04-15 04:52:34 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30 2013-04-15 04:52:34 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0 2013-04-15 04:52:34 [<ffffffff81076d95>] ? irq_exit+0x85/0x90 2013-04-15 04:52:34 [<ffffffff81517000>] ? smp_apic_timer_interrupt+0x70/0x9b 2013-04-15 04:52:34 [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20 2013-04-15 04:52:34 <EOI> 2013-04-15 04:52:34 [<ffffffff81510297>] ? _spin_unlock_irqrestore+0x17/0x20 2013-04-15 04:52:34 [<ffffffff81055ab3>] ? __wake_up+0x53/0x70 2013-04-15 04:52:34 [<ffffffffa04cf75a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs] 2013-04-15 04:52:34 [<ffffffffa07a18ca>] ? ptlrpc_set_add_new_req+0x9a/0x150 [ptlrpc] 2013-04-15 04:52:34 [<ffffffffa07d703f>] ? ptlrpcd_add_req+0x18f/0x2d0 [ptlrpc] 2013-04-15 04:52:34 [<ffffffffa0677c51>] ? cl_req_attr_set+0xd1/0x230 [obdclass] 2013-04-15 04:52:34 [<ffffffffa07b423c>] ? lustre_msg_get_opc+0x9c/0x110 [ptlrpc] 2013-04-15 04:52:34 [<ffffffffa0971e2e>] ? osc_build_rpc+0xd7e/0x1730 [osc] 2013-04-15 04:52:34 [<ffffffffa0988ff3>] ? osc_send_read_rpc+0x6a3/0x880 [osc] 2013-04-15 04:52:34 [<ffffffffa04e4522>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs] 2013-04-15 04:52:34 [<ffffffffa0661c6c>] ? lu_object_put+0x12c/0x330 [obdclass] 2013-04-15 04:52:34 [<ffffffffa098cb66>] ? osc_io_unplug0+0xb46/0x12c0 [osc] 2013-04-15 04:52:34 [<ffffffff81167330>] ? cache_alloc_refill+0x1c0/0x240 2013-04-15 04:52:34 [<ffffffffa098f081>] ? osc_io_unplug+0x11/0x20 [osc] 2013-04-15 04:52:34 [<ffffffffa0995c70>] ? osc_queue_sync_pages+0x1d0/0x360 [osc] 2013-04-15 04:52:34 [<ffffffffa09807d0>] ? osc_io_submit+0x330/0x4a0 [osc] 2013-04-15 04:52:34 [<ffffffffa06775bc>] ? cl_io_submit_rw+0x6c/0x160 [obdclass] 2013-04-15 04:52:34 [<ffffffffa0a1604a>] ? lov_io_submit+0x4ea/0xbc0 [lov] 2013-04-15 04:52:34 [<ffffffffa06775bc>] ? cl_io_submit_rw+0x6c/0x160 [obdclass] 2013-04-15 04:52:34 [<ffffffffa0679bde>] ? cl_io_read_page+0xae/0x170 [obdclass] 2013-04-15 04:52:34 [<ffffffffa066d9a7>] ? cl_page_assume+0xf7/0x220 [obdclass] 2013-04-15 04:52:34 [<ffffffffa0aba5d6>] ? ll_readpage+0x96/0x1a0 [lustre] 2013-04-15 04:52:34 [<ffffffff81096ce0>] ? wake_bit_function+0x0/0x50 2013-04-15 04:52:34 [<ffffffff8111b7ac>] ? generic_file_aio_read+0x1fc/0x700 2013-04-15 04:52:34 [<ffffffffa0ae7707>] ? vvp_io_read_start+0x257/0x470 [lustre] 2013-04-15 04:52:34 [<ffffffffa067771a>] ? cl_io_start+0x6a/0x140 [obdclass] 2013-04-15 04:52:34 [<ffffffffa067be54>] ? cl_io_loop+0xb4/0x1b0 [obdclass] 2013-04-15 04:52:34 [<ffffffffa0a8e6df>] ? ll_file_io_generic+0x33f/0x600 [lustre] 2013-04-15 04:52:34 [<ffffffffa0a8eadf>] ? ll_file_aio_read+0x13f/0x2c0 [lustre] 2013-04-15 04:52:34 [<ffffffffa0a8f68c>] ? ll_file_read+0x16c/0x2a0 [lustre] 2013-04-15 04:52:34 [<ffffffff811817a5>] ? vfs_read+0xb5/0x1a0 2013-04-15 04:52:34 [<ffffffff811818e1>] ? sys_read+0x51/0x90 2013-04-15 04:52:34 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b |
| Comment by Zhenyu Xu [ 24/Apr/13 ] |
|
Jinshan, Can osc_queue_sync_pages() call osc_io_unplug_async() instead of osc_io_unplug()? My understanding of their difference is that whether we build the RPC request on this thread context or on ptlrpcd context. |
| Comment by Jinshan Xiong (Inactive) [ 24/Apr/13 ] |
|
Yes, we can use osc_io_unplug_async() here. However, I don't think the calling stack is super long so I'm not sure if there exists a large local variable somewhere. |
| Comment by Zhenyu Xu [ 28/Apr/13 ] |
| Comment by Zhenyu Xu [ 03/May/13 ] |
|
Keith, Can you try http://review.whamcloud.com/6191 as you've reproduced the issue? |
| Comment by Keith Mannthey (Inactive) [ 03/May/13 ] |
|
All those stack traces can from Hyperion. Cliff can we re-run this on on the cluster with the fix patch? |
| Comment by Keith Mannthey (Inactive) [ 03/May/13 ] |
|
Ok after chatting with Peter it is clear we want to be sure we don't regress on lu-2909. I am setting up to test for that issue with this fix patch. I will update this LU and the patch when I pass at least 24 hours of testing or if I encounter the fsx crash again. |
| Comment by Keith Mannthey (Inactive) [ 05/May/13 ] |
|
I have been testing for close to 48 hours with out encountering any issues with the fsx test. I will let the tests continue for run and update if I do encounter anything. I acked the patch. |
| Comment by Peter Jones [ 06/May/13 ] |
|
Proposed fix landed for 2.4. Will open a new ticket if that change still proves insufficient for Hyperion test runs. |
| Comment by Keith Mannthey (Inactive) [ 06/May/13 ] |
|
Just for clarity Hyperion got worse: |
| Comment by Oleg Drokin [ 13/May/13 ] |
|
patch 6191 reverted from master |
| Comment by Cory Spitz [ 14/May/13 ] |
|
Was 6191 reverted due to |
| Comment by Peter Jones [ 14/May/13 ] |
|
Hi Cory Basically, Peter |
| Comment by Zhenyu Xu [ 04/May/14 ] |
|
per |
| Comment by Peter Jones [ 23/Jun/14 ] |
|
Landed for 2.6 |