[LU-1836] BUG: unable to handle kernel paging request at ffffffff8504dee0 IP: [<ffffffff810528a4>] update_curr+0x144/0x1f0
Created: 05/Sep/12 Updated: 06/Sep/12 Resolved: 06/Sep/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.1.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Brian Murrell (Inactive) | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
lustre-client-2.1.1-2.6.32_220.13.1.el6.x86_64.x86_64 |
||
| Severity: | 3 |
| Rank (Obsolete): | 6338 |
| Description |
|
On the Chroma Lustre builder instance we got the following oops:
BUG: unable to handle kernel paging request at ffffffff8504dee0
IP: [<ffffffff810528a4>] update_curr+0x144/0x1f0
PGD 1a87067 PUD 1a8b063 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:06.0/local_cpus
CPU 0
Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
Pid: 1983, comm: java Not tainted 2.6.32-220.13.1.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffff810528a4>] [<ffffffff810528a4>] update_curr+0x144/0x1f0
RSP: 0000:ffff880002203db8 EFLAGS: 00010082
RAX: ffff880037bde040 RBX: 000000000068ae30 RCX: ffff88007e782f40
RDX: 0000000000018b48 RSI: 0000000000000000 RDI: ffff880037bde078
RBP: ffff880002203de8 R08: ffffffff8160b6a5 R09: 0000000000000000
R10: 0000000000000010 R11: 0000000000000000 R12: ffff880002215fe8
R13: 00000000000c54d5 R14: 0000068ae585f1ae R15: ffff880037bde040
FS: 00007f01c4a32700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffff8504dee0 CR3: 000000007d3f3000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process java (pid: 1983, threadinfo ffff88007b8a4000, task ffff880037bde040)
Stack:
ffff880002203dc8 0000000000000000 ffff880037bde078 ffff880002215fe8
<0> 0000000000000000 0000000000000000 ffff880002203e18 ffffffff81052e5b
<0> ffff880002215f80 0000000000000000 0000000000015f80 0000000000000000
Call Trace:
<IRQ>
[<ffffffff81052e5b>] task_tick_fair+0xdb/0x160
[<ffffffff810566a1>] scheduler_tick+0xc1/0x260
[<ffffffff810a0cc0>] ? tick_sched_timer+0x0/0xc0
[<ffffffff8107c382>] update_process_times+0x52/0x70
[<ffffffff810a0d26>] tick_sched_timer+0x66/0xc0
[<ffffffff810953ae>] __run_hrtimer+0x8e/0x1a0
[<ffffffff81037599>] ? kvm_clock_get_cycles+0x9/0x10
[<ffffffff81095756>] hrtimer_interrupt+0xe6/0x250
[<ffffffff814f52fb>] smp_apic_timer_interrupt+0x6b/0x9b
[<ffffffff8100bc13>] apic_timer_interrupt+0x13/0x20
<EOI>
[<ffffffffa0623419>] ? cfs_hash_dual_bd_unlock+0x19/0x60 [libcfs]
[<ffffffffa0624197>] ? cfs_hash_dual_bd_lookup_locked+0x37/0x70 [libcfs]
[<ffffffffa0625336>] cfs_hash_lookup+0x76/0xa0 [libcfs]
[<ffffffffa01bea99>] cl_env_fetch+0x29/0x70 [obdclass]
[<ffffffffa01c0ca4>] cl_env_reexit+0x14/0x140 [obdclass]
[<ffffffffa073c433>] ll_releasepage+0x33/0x50 [lustre]
[<ffffffff8110ffb0>] try_to_release_page+0x30/0x60
[<ffffffff8112a3f1>] shrink_page_list.clone.0+0x4f1/0x5c0
[<ffffffff8112a7bb>] shrink_inactive_list+0x2fb/0x740
[<ffffffff81038488>] ? pvclock_clocksource_read+0x58/0xd0
[<ffffffff8112b4cf>] shrink_zone+0x38f/0x520
[<ffffffff8112b75e>] do_try_to_free_pages+0xfe/0x520
[<ffffffff81114cef>] ? zone_watermark_ok+0x1f/0x30
[<ffffffff8112bd6d>] try_to_free_pages+0x9d/0x130
[<ffffffff8112cec0>] ? isolate_pages_global+0x0/0x350
[<ffffffff81123ced>] __alloc_pages_nodemask+0x40d/0x940
[<ffffffff8115e002>] kmem_getpages+0x62/0x170
[<ffffffff8115ec1a>] fallback_alloc+0x1ba/0x270
[<ffffffff8115e66f>] ? cache_grow+0x2cf/0x320
[<ffffffff8115e999>] ____cache_alloc_node+0x99/0x160
[<ffffffff8115f77b>] kmem_cache_alloc+0x11b/0x190
[<ffffffffa06158c8>] cfs_mem_cache_alloc+0x48/0x50 [libcfs]
[<ffffffffa04f1419>] osc_page_init+0x59/0x330 [osc]
[<ffffffffa06c4f3b>] ? lovsub_page_init+0xdb/0x2f0 [lov]
[<ffffffffa01c510b>] cl_page_find0+0x1eb/0x8a0 [obdclass]
[<ffffffffa01c57d8>] cl_page_find_sub+0x18/0x20 [obdclass]
[<ffffffffa06baad1>] lov_page_init_raid0+0x1a1/0x6d0 [lov]
[<ffffffffa01c1982>] ? cl_page_slice_add+0x52/0x110 [obdclass]
[<ffffffffa06b7bbd>] lov_page_init+0x6d/0xe0 [lov]
[<ffffffffa01c510b>] cl_page_find0+0x1eb/0x8a0 [obdclass]
[<ffffffff8113327e>] ? __inc_zone_page_state+0x2e/0x30
[<ffffffff81128020>] ? __lru_cache_add+0x40/0x90
[<ffffffffa01c57f1>] cl_page_find+0x11/0x20 [obdclass]
[<ffffffffa072405d>] ll_readahead+0xedd/0x1290 [lustre]
[<ffffffffa0747f6a>] ? ccc_page_is_under_lock+0x1aa/0x200 [lustre]
[<ffffffffa0721f28>] ? ras_update+0x58/0xe30 [lustre]
[<ffffffffa06bf87e>] ? lov_page_stripe+0x3e/0x150 [lov]
[<ffffffffa074dee5>] vvp_io_read_page+0x385/0x3c0 [lustre]
[<ffffffffa01cf2e5>] cl_io_read_page+0x95/0x1a0 [obdclass]
[<ffffffffa01c3939>] ? cl_page_assume+0xe9/0x250 [obdclass]
[<ffffffffa0724988>] ll_readpage+0x98/0x1f0 [lustre]
[<ffffffff81128020>] ? __lru_cache_add+0x40/0x90
[<ffffffff811120c3>] filemap_fault+0x313/0x500
[<ffffffffa074e8bf>] vvp_io_fault_start+0x12f/0x5a0 [lustre]
[<ffffffffa01c87f5>] ? cl_wait+0xb5/0x280 [obdclass]
[<ffffffffa01cc9f8>] cl_io_start+0x68/0x170 [obdclass]
[<ffffffffa01d1570>] cl_io_loop+0x110/0x1c0 [obdclass]
[<ffffffffa0731bb4>] ll_fault0+0xb4/0x280 [lustre]
[<ffffffffa01bff08>] ? cl_object_attr_get+0x88/0x1b0 [obdclass]
[<ffffffffa0731fc8>] ll_fault+0x48/0x160 [lustre]
[<ffffffff8113b414>] __do_fault+0x54/0x510
[<ffffffffa01bfd5c>] ? cl_object_attr_set+0x8c/0x1b0 [obdclass]
[<ffffffff8113b9c7>] handle_pte_fault+0xf7/0xb50
[<ffffffff8120c0fa>] ? security_capable+0x2a/0x30
[<ffffffff81076973>] ? capable+0x13/0x50
[<ffffffffa061862e>] ? cfs_capable+0xe/0x10 [libcfs]
[<ffffffff8113c604>] handle_mm_fault+0x1e4/0x2b0
[<ffffffff81042b79>] __do_page_fault+0x139/0x480
[<ffffffff810564b3>] ? perf_event_task_sched_out+0x33/0x80
[<ffffffffa06c28a7>] ? lov_io_commit_write+0xa7/0x1d0 [lov]
[<ffffffff810532b0>] ? __dequeue_entity+0x30/0x50
[<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
[<ffffffff814f298e>] do_page_fault+0x3e/0xa0
[<ffffffff814efd45>] page_fault+0x25/0x30
[<ffffffff8110fe0f>] ? iov_iter_fault_in_readable+0x2f/0x60
[<ffffffff814ed910>] ? _cond_resched+0x30/0x40
[<ffffffff8111152e>] generic_file_buffered_write+0xde/0x2a0
[<ffffffff810707c7>] ? current_fs_time+0x27/0x30
[<ffffffff81112eb0>] __generic_file_aio_write+0x250/0x480
[<ffffffffa01be7e5>] ? cl_env_info+0x15/0x20 [obdclass]
[<ffffffff8111314f>] generic_file_aio_write+0x6f/0xe0
[<ffffffffa074e101>] vvp_io_write_start+0xa1/0x270 [lustre]
[<ffffffffa01cc9f8>] cl_io_start+0x68/0x170 [obdclass]
[<ffffffffa01d1570>] cl_io_loop+0x110/0x1c0 [obdclass]
[<ffffffffa0625342>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
[<ffffffffa06f59db>] ll_file_io_generic+0x44b/0x580 [lustre]
[<ffffffffa0623434>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
[<ffffffffa01c05c9>] ? cl_env_get+0x29/0x350 [obdclass]
[<ffffffffa06f5c4f>] ll_file_aio_write+0x13f/0x310 [lustre]
[<ffffffffa01c073e>] ? cl_env_get+0x19e/0x350 [obdclass]
[<ffffffffa06fc2e1>] ll_file_write+0x171/0x310 [lustre]
[<ffffffff81176818>] vfs_write+0xb8/0x1a0
[<ffffffff810d4832>] ? audit_syscall_entry+0x272/0x2a0
[<ffffffff811772e2>] sys_pwrite64+0x82/0xa0
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Code: 00 8b 15 3c 36 a4 00 85 d2 74 34 48 8b 50 08 8b 5a 18 48 8b 90 10 09 00 00 48 8b 4a 50 48 85 c9 74 1d 48 63 db 66 90 48 8b 51 20 <48> 03 14 dd 60 6d bf 81 4c 01 2a 48 8b 49 78 48 85 c9 75 e8 48
RIP [<ffffffff810528a4>] update_curr+0x144/0x1f0
RSP <ffff880002203db8>
CR2: ffffffff8504dee0
---[ end trace 1c726fc394ca3f3f ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 1983, comm: java Tainted: G D ---------------- 2.6.32-220.13.1.el6.x86_64 #1
Call Trace:
<IRQ> [<ffffffff814ec83a>] ? panic+0x78/0x143
[<ffffffff814f09e2>] ? oops_end+0xf2/0x100
[<ffffffff8104234b>] ? no_context+0xfb/0x260
[<ffffffff810425d5>] ? __bad_area_nosemaphore+0x125/0x1e0
[<ffffffff810426a3>] ? bad_area_nosemaphore+0x13/0x20
[<ffffffff81042d5d>] ? __do_page_fault+0x31d/0x480
[<ffffffff814626bd>] ? ip_local_deliver_finish+0xdd/0x2d0
[<ffffffff81462948>] ? ip_local_deliver+0x98/0xa0
[<ffffffff814f298e>] ? do_page_fault+0x3e/0xa0
[<ffffffff814efd45>] ? page_fault+0x25/0x30
[<ffffffff810528a4>] ? update_curr+0x144/0x1f0
[<ffffffff81052868>] ? update_curr+0x108/0x1f0
[<ffffffff81052e5b>] ? task_tick_fair+0xdb/0x160
[<ffffffff810566a1>] ? scheduler_tick+0xc1/0x260
[<ffffffff810a0cc0>] ? tick_sched_timer+0x0/0xc0
[<ffffffff8107c382>] ? update_process_times+0x52/0x70
[<ffffffff810a0d26>] ? tick_sched_timer+0x66/0xc0
[<ffffffff810953ae>] ? __run_hrtimer+0x8e/0x1a0
[<ffffffff81037599>] ? kvm_clock_get_cycles+0x9/0x10
[<ffffffff81095756>] ? hrtimer_interrupt+0xe6/0x250
[<ffffffff814f52fb>] ? smp_apic_timer_interrupt+0x6b/0x9b
[<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
<EOI> [<ffffffffa0623419>] ? cfs_hash_dual_bd_unlock+0x19/0x60 [libcfs]
[<ffffffffa0624197>] ? cfs_hash_dual_bd_lookup_locked+0x37/0x70 [libcfs]
[<ffffffffa0625336>] ? cfs_hash_lookup+0x76/0xa0 [libcfs]
[<ffffffffa01bea99>] ? cl_env_fetch+0x29/0x70 [obdclass]
[<ffffffffa01c0ca4>] ? cl_env_reexit+0x14/0x140 [obdclass]
[<ffffffffa073c433>] ? ll_releasepage+0x33/0x50 [lustre]
[<ffffffff8110ffb0>] ? try_to_release_page+0x30/0x60
[<ffffffff8112a3f1>] ? shrink_page_list.clone.0+0x4f1/0x5c0
[<ffffffff8112a7bb>] ? shrink_inactive_list+0x2fb/0x740
[<ffffffff81038488>] ? pvclock_clocksource_read+0x58/0xd0
[<ffffffff8112b4cf>] ? shrink_zone+0x38f/0x520
[<ffffffff8112b75e>] ? do_try_to_free_pages+0xfe/0x520
[<ffffffff81114cef>] ? zone_watermark_ok+0x1f/0x30
[<ffffffff8112bd6d>] ? try_to_free_pages+0x9d/0x130
[<ffffffff8112cec0>] ? isolate_pages_global+0x0/0x350
[<ffffffff81123ced>] ? __alloc_pages_nodemask+0x40d/0x940
[<ffffffff8115e002>] ? kmem_getpages+0x62/0x170
[<ffffffff8115ec1a>] ? fallback_alloc+0x1ba/0x270
[<ffffffff8115e66f>] ? cache_grow+0x2cf/0x320
[<ffffffff8115e999>] ? ____cache_alloc_node+0x99/0x160
[<ffffffff8115f77b>] ? kmem_cache_alloc+0x11b/0x190
[<ffffffffa06158c8>] ? cfs_mem_cache_alloc+0x48/0x50 [libcfs]
[<ffffffffa04f1419>] ? osc_page_init+0x59/0x330 [osc]
[<ffffffffa06c4f3b>] ? lovsub_page_init+0xdb/0x2f0 [lov]
[<ffffffffa01c510b>] ? cl_page_find0+0x1eb/0x8a0 [obdclass]
[<ffffffffa01c57d8>] ? cl_page_find_sub+0x18/0x20 [obdclass]
[<ffffffffa06baad1>] ? lov_page_init_raid0+0x1a1/0x6d0 [lov]
[<ffffffffa01c1982>] ? cl_page_slice_add+0x52/0x110 [obdclass]
[<ffffffffa06b7bbd>] ? lov_page_init+0x6d/0xe0 [lov]
[<ffffffffa01c510b>] ? cl_page_find0+0x1eb/0x8a0 [obdclass]
[<ffffffff8113327e>] ? __inc_zone_page_state+0x2e/0x30
[<ffffffff81128020>] ? __lru_cache_add+0x40/0x90
[<ffffffffa01c57f1>] ? cl_page_find+0x11/0x20 [obdclass]
[<ffffffffa072405d>] ? ll_readahead+0xedd/0x1290 [lustre]
[<ffffffffa0747f6a>] ? ccc_page_is_under_lock+0x1aa/0x200 [lustre]
[<ffffffffa0721f28>] ? ras_update+0x58/0xe30 [lustre]
[<ffffffffa06bf87e>] ? lov_page_stripe+0x3e/0x150 [lov]
[<ffffffffa074dee5>] ? vvp_io_read_page+0x385/0x3c0 [lustre]
[<ffffffffa01cf2e5>] ? cl_io_read_page+0x95/0x1a0 [obdclass]
[<ffffffffa01c3939>] ? cl_page_assume+0xe9/0x250 [obdclass]
[<ffffffffa0724988>] ? ll_readpage+0x98/0x1f0 [lustre]
[<ffffffff81128020>] ? __lru_cache_add+0x40/0x90
[<ffffffff811120c3>] ? filemap_fault+0x313/0x500
[<ffffffffa074e8bf>] ? vvp_io_fault_start+0x12f/0x5a0 [lustre]
[<ffffffffa01c87f5>] ? cl_wait+0xb5/0x280 [obdclass]
[<ffffffffa01cc9f8>] ? cl_io_start+0x68/0x170 [obdclass]
[<ffffffffa01d1570>] ? cl_io_loop+0x110/0x1c0 [obdclass]
[<ffffffffa0731bb4>] ? ll_fault0+0xb4/0x280 [lustre]
[<ffffffffa01bff08>] ? cl_object_attr_get+0x88/0x1b0 [obdclass]
[<ffffffffa0731fc8>] ? ll_fault+0x48/0x160 [lustre]
[<ffffffff8113b414>] ? __do_fault+0x54/0x510
[<ffffffffa01bfd5c>] ? cl_object_attr_set+0x8c/0x1b0 [obdclass]
[<ffffffff8113b9c7>] ? handle_pte_fault+0xf7/0xb50
[<ffffffff8120c0fa>] ? security_capable+0x2a/0x30
[<ffffffff81076973>] ? capable+0x13/0x50
[<ffffffffa061862e>] ? cfs_capable+0xe/0x10 [libcfs]
[<ffffffff8113c604>] ? handle_mm_fault+0x1e4/0x2b0
[<ffffffff81042b79>] ? __do_page_fault+0x139/0x480
[<ffffffff810564b3>] ? perf_event_task_sched_out+0x33/0x80
[<ffffffffa06c28a7>] ? lov_io_commit_write+0xa7/0x1d0 [lov]
[<ffffffff810532b0>] ? __dequeue_entity+0x30/0x50
[<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
[<ffffffff814f298e>] ? do_page_fault+0x3e/0xa0
[<ffffffff814efd45>] ? page_fault+0x25/0x30
[<ffffffff8110fe0f>] ? iov_iter_fault_in_readable+0x2f/0x60
[<ffffffff814ed910>] ? _cond_resched+0x30/0x40
[<ffffffff8111152e>] ? generic_file_buffered_write+0xde/0x2a0
[<ffffffff810707c7>] ? current_fs_time+0x27/0x30
[<ffffffff81112eb0>] ? __generic_file_aio_write+0x250/0x480
[<ffffffffa01be7e5>] ? cl_env_info+0x15/0x20 [obdclass]
[<ffffffff8111314f>] ? generic_file_aio_write+0x6f/0xe0
[<ffffffffa074e101>] ? vvp_io_write_start+0xa1/0x270 [lustre]
[<ffffffffa01cc9f8>] ? cl_io_start+0x68/0x170 [obdclass]
[<ffffffffa01d1570>] ? cl_io_loop+0x110/0x1c0 [obdclass]
[<ffffffffa0625342>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
[<ffffffffa06f59db>] ? ll_file_io_generic+0x44b/0x580 [lustre]
[<ffffffffa0623434>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
[<ffffffffa01c05c9>] ? cl_env_get+0x29/0x350 [obdclass]
[<ffffffffa06f5c4f>] ? ll_file_aio_write+0x13f/0x310 [lustre]
[<ffffffffa01c073e>] ? cl_env_get+0x19e/0x350 [obdclass]
[<ffffffffa06fc2e1>] ? ll_file_write+0x171/0x310 [lustre]
[<ffffffff81176818>] ? vfs_write+0xb8/0x1a0
[<ffffffff810d4832>] ? audit_syscall_entry+0x272/0x2a0
[<ffffffff811772e2>] ? sys_pwrite64+0x82/0xa0
[<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
|
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 05/Sep/12 ] |
|
stack overflow? |
| Comment by Brian Murrell (Inactive) [ 05/Sep/12 ] |
|
I don't know TBH, but it did not escape me how deep the stack was when I was pasting it. |
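To put a rough number on "how deep the stack was", the call trace can be tallied per module. The sketch below is only an illustration: `summarize_trace` and `SAMPLE_TRACE` are hypothetical names, the sample holds a handful of lines copied from the trace in the description, and a frame count is a coarse proxy for stack depth, not a measurement of bytes used. Entries marked `?` are counted separately because the kernel flags them as unreliable.

```python
import re

# A few lines copied from the first call trace in the description; paste the
# full trace here to count every frame.
SAMPLE_TRACE = """\
[<ffffffffa0625336>] cfs_hash_lookup+0x76/0xa0 [libcfs]
[<ffffffffa01bea99>] cl_env_fetch+0x29/0x70 [obdclass]
[<ffffffff8110ffb0>] try_to_release_page+0x30/0x60
[<ffffffff81038488>] ? pvclock_clocksource_read+0x58/0xd0
[<ffffffffa073c433>] ll_releasepage+0x33/0x50 [lustre]
"""

# Matches "[<addr>] [? ]symbol+0xoff/0xsize [module]"; the module is optional.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<module>\w+)\])?"
)


def summarize_trace(text):
    """Return (frames per module, number of '?' frames) for an oops trace."""
    counts = {}
    unreliable = 0
    for match in FRAME_RE.finditer(text):
        if match.group("unreliable"):
            unreliable += 1
            continue
        # Frames without a "[module]" suffix belong to the core kernel.
        module = match.group("module") or "kernel"
        counts[module] = counts.get(module, 0) + 1
    return counts, unreliable


counts, unreliable = summarize_trace(SAMPLE_TRACE)
print(counts, unreliable)
```

Running it over the whole trace gives a quick view of how many reliable frames come from the Lustre modules versus the core kernel in the single `pwrite` call.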
| Comment by Keith Mannthey (Inactive) [ 05/Sep/12 ] |
|
So the base code was trying to do a write, but the page was swapped out, so it moved on to allocate memory and appears to have had difficulty? Any idea how much memory was left on the system? No sign of OOM yet? Then a timer interrupt fired, and it looks like it blew up while in the scheduler. This looks related to http://jira.whamcloud.com/browse/LU-1474 and http://jira.whamcloud.com/browse/LU-969. Does your code have the fix from |
| Comment by Oleg Drokin [ 05/Sep/12 ] |
|
I vote stack overflow too. 2.1.1 does not have the LU-969 patches, but they were included in 2.1.3.
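One way to sanity-check the stack-overflow theory against the oops itself is to redo the address arithmetic of the faulting instruction. The bytes after the `<48>` marker in the `Code:` line decode as an `add` from memory at a 32-bit displacement plus `RBX * 8`, and that sum should equal `CR2`. The sketch below handles only this one addressing form and is not a general disassembler; `decode_scaled_index_address` is a hypothetical helper name. Reading the table as a per-CPU array indexed by a CPU number is an assumption drawn from the `Code:` bytes, not something stated in the ticket.

```python
# Register values and instruction bytes copied from the oops above.
RBX = 0x000000000068AE30
CR2 = 0xFFFFFFFF8504DEE0
FAULTING_INSN = bytes.fromhex("48 03 14 dd 60 6d bf 81")  # starts at "<48>"


def decode_scaled_index_address(insn, index_value):
    """Compute disp32 + index * scale for the SIB form with no base register."""
    rex, opcode, modrm, sib = insn[0], insn[1], insn[2], insn[3]
    assert rex == 0x48 and opcode == 0x03  # REX.W + ADD r64, r/m64
    assert modrm >> 6 == 0 and modrm & 0x7 == 0x4  # memory operand via SIB
    assert sib & 0x7 == 0x5  # no base register, disp32 follows
    assert (sib >> 3) & 0x7 == 0x3  # index register is RBX
    scale = 1 << (sib >> 6)
    # The displacement is a sign-extended little-endian 32-bit value.
    disp = int.from_bytes(insn[4:8], "little", signed=True)
    return (disp + index_value * scale) & 0xFFFFFFFFFFFFFFFF


address = decode_scaled_index_address(FAULTING_INSN, RBX)
print(hex(address))
```

The computed address matches `CR2`, so the fault comes from using `RBX` (0x68ae30) as an array index. If that index is meant to be a CPU number, as assumed here, the value is far out of range on this guest, which would be consistent with the stack having overrun data stored below it. That last step remains an inference, not a confirmed root cause.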
| Comment by Brian Murrell (Inactive) [ 06/Sep/12 ] |
|
OK. I've updated that machine to b2_1. Let's see if it's more stable. |
| Comment by Brian Murrell (Inactive) [ 06/Sep/12 ] |
|
I will re-open if problems continue after having done the upgrade to b2_1. |