[LU-1836] BUG: unable to handle kernel paging request at ffffffff8504dee0 IP: [<ffffffff810528a4>] update_curr+0x144/0x1f0
Created: 05/Sep/12  Updated: 06/Sep/12  Resolved: 06/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.1.3

Type: Bug
Priority: Major
Reporter: Brian Murrell (Inactive)
Assignee: WC Triage
Resolution: Fixed
Votes: 0
Labels: None
Environment:

lustre-client-2.1.1-2.6.32_220.13.1.el6.x86_64.x86_64
lustre-client-modules-2.1.1-2.6.32_220.13.1.el6.x86_64.x86_64


Severity: 3
Rank (Obsolete): 6338

 Description   

On the Chroma Lustre builder instance we got the following oops:

BUG: unable to handle kernel paging request at ffffffff8504dee0
IP: [<ffffffff810528a4>] update_curr+0x144/0x1f0
PGD 1a87067 PUD 1a8b063 PMD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:06.0/local_cpus
CPU 0 
Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ipv6 microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]

Pid: 1983, comm: java Not tainted 2.6.32-220.13.1.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffff810528a4>]  [<ffffffff810528a4>] update_curr+0x144/0x1f0
RSP: 0000:ffff880002203db8  EFLAGS: 00010082
RAX: ffff880037bde040 RBX: 000000000068ae30 RCX: ffff88007e782f40
RDX: 0000000000018b48 RSI: 0000000000000000 RDI: ffff880037bde078
RBP: ffff880002203de8 R08: ffffffff8160b6a5 R09: 0000000000000000
R10: 0000000000000010 R11: 0000000000000000 R12: ffff880002215fe8
R13: 00000000000c54d5 R14: 0000068ae585f1ae R15: ffff880037bde040
FS:  00007f01c4a32700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffff8504dee0 CR3: 000000007d3f3000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process java (pid: 1983, threadinfo ffff88007b8a4000, task ffff880037bde040)
Stack:
 ffff880002203dc8 0000000000000000 ffff880037bde078 ffff880002215fe8
<0> 0000000000000000 0000000000000000 ffff880002203e18 ffffffff81052e5b
<0> ffff880002215f80 0000000000000000 0000000000015f80 0000000000000000
Call Trace:
 <IRQ> 
 [<ffffffff81052e5b>] task_tick_fair+0xdb/0x160
 [<ffffffff810566a1>] scheduler_tick+0xc1/0x260
 [<ffffffff810a0cc0>] ? tick_sched_timer+0x0/0xc0
 [<ffffffff8107c382>] update_process_times+0x52/0x70
 [<ffffffff810a0d26>] tick_sched_timer+0x66/0xc0
 [<ffffffff810953ae>] __run_hrtimer+0x8e/0x1a0
 [<ffffffff81037599>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffff81095756>] hrtimer_interrupt+0xe6/0x250
 [<ffffffff814f52fb>] smp_apic_timer_interrupt+0x6b/0x9b
 [<ffffffff8100bc13>] apic_timer_interrupt+0x13/0x20
 <EOI> 
 [<ffffffffa0623419>] ? cfs_hash_dual_bd_unlock+0x19/0x60 [libcfs]
 [<ffffffffa0624197>] ? cfs_hash_dual_bd_lookup_locked+0x37/0x70 [libcfs]
 [<ffffffffa0625336>] cfs_hash_lookup+0x76/0xa0 [libcfs]
 [<ffffffffa01bea99>] cl_env_fetch+0x29/0x70 [obdclass]
 [<ffffffffa01c0ca4>] cl_env_reexit+0x14/0x140 [obdclass]
 [<ffffffffa073c433>] ll_releasepage+0x33/0x50 [lustre]
 [<ffffffff8110ffb0>] try_to_release_page+0x30/0x60
 [<ffffffff8112a3f1>] shrink_page_list.clone.0+0x4f1/0x5c0
 [<ffffffff8112a7bb>] shrink_inactive_list+0x2fb/0x740
 [<ffffffff81038488>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8112b4cf>] shrink_zone+0x38f/0x520
 [<ffffffff8112b75e>] do_try_to_free_pages+0xfe/0x520
 [<ffffffff81114cef>] ? zone_watermark_ok+0x1f/0x30
 [<ffffffff8112bd6d>] try_to_free_pages+0x9d/0x130
 [<ffffffff8112cec0>] ? isolate_pages_global+0x0/0x350
 [<ffffffff81123ced>] __alloc_pages_nodemask+0x40d/0x940
 [<ffffffff8115e002>] kmem_getpages+0x62/0x170
 [<ffffffff8115ec1a>] fallback_alloc+0x1ba/0x270
 [<ffffffff8115e66f>] ? cache_grow+0x2cf/0x320
 [<ffffffff8115e999>] ____cache_alloc_node+0x99/0x160
 [<ffffffff8115f77b>] kmem_cache_alloc+0x11b/0x190
 [<ffffffffa06158c8>] cfs_mem_cache_alloc+0x48/0x50 [libcfs]
 [<ffffffffa04f1419>] osc_page_init+0x59/0x330 [osc]
 [<ffffffffa06c4f3b>] ? lovsub_page_init+0xdb/0x2f0 [lov]
 [<ffffffffa01c510b>] cl_page_find0+0x1eb/0x8a0 [obdclass]
 [<ffffffffa01c57d8>] cl_page_find_sub+0x18/0x20 [obdclass]
 [<ffffffffa06baad1>] lov_page_init_raid0+0x1a1/0x6d0 [lov]
 [<ffffffffa01c1982>] ? cl_page_slice_add+0x52/0x110 [obdclass]
 [<ffffffffa06b7bbd>] lov_page_init+0x6d/0xe0 [lov]
 [<ffffffffa01c510b>] cl_page_find0+0x1eb/0x8a0 [obdclass]
 [<ffffffff8113327e>] ? __inc_zone_page_state+0x2e/0x30
 [<ffffffff81128020>] ? __lru_cache_add+0x40/0x90
 [<ffffffffa01c57f1>] cl_page_find+0x11/0x20 [obdclass]
 [<ffffffffa072405d>] ll_readahead+0xedd/0x1290 [lustre]
 [<ffffffffa0747f6a>] ? ccc_page_is_under_lock+0x1aa/0x200 [lustre]
 [<ffffffffa0721f28>] ? ras_update+0x58/0xe30 [lustre]
 [<ffffffffa06bf87e>] ? lov_page_stripe+0x3e/0x150 [lov]
 [<ffffffffa074dee5>] vvp_io_read_page+0x385/0x3c0 [lustre]
 [<ffffffffa01cf2e5>] cl_io_read_page+0x95/0x1a0 [obdclass]
 [<ffffffffa01c3939>] ? cl_page_assume+0xe9/0x250 [obdclass]
 [<ffffffffa0724988>] ll_readpage+0x98/0x1f0 [lustre]
 [<ffffffff81128020>] ? __lru_cache_add+0x40/0x90
 [<ffffffff811120c3>] filemap_fault+0x313/0x500
 [<ffffffffa074e8bf>] vvp_io_fault_start+0x12f/0x5a0 [lustre]
 [<ffffffffa01c87f5>] ? cl_wait+0xb5/0x280 [obdclass]
 [<ffffffffa01cc9f8>] cl_io_start+0x68/0x170 [obdclass]
 [<ffffffffa01d1570>] cl_io_loop+0x110/0x1c0 [obdclass]
 [<ffffffffa0731bb4>] ll_fault0+0xb4/0x280 [lustre]
 [<ffffffffa01bff08>] ? cl_object_attr_get+0x88/0x1b0 [obdclass]
 [<ffffffffa0731fc8>] ll_fault+0x48/0x160 [lustre]
 [<ffffffff8113b414>] __do_fault+0x54/0x510
 [<ffffffffa01bfd5c>] ? cl_object_attr_set+0x8c/0x1b0 [obdclass]
 [<ffffffff8113b9c7>] handle_pte_fault+0xf7/0xb50
 [<ffffffff8120c0fa>] ? security_capable+0x2a/0x30
 [<ffffffff81076973>] ? capable+0x13/0x50
 [<ffffffffa061862e>] ? cfs_capable+0xe/0x10 [libcfs]
 [<ffffffff8113c604>] handle_mm_fault+0x1e4/0x2b0
 [<ffffffff81042b79>] __do_page_fault+0x139/0x480
 [<ffffffff810564b3>] ? perf_event_task_sched_out+0x33/0x80
 [<ffffffffa06c28a7>] ? lov_io_commit_write+0xa7/0x1d0 [lov]
 [<ffffffff810532b0>] ? __dequeue_entity+0x30/0x50
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff814f298e>] do_page_fault+0x3e/0xa0
 [<ffffffff814efd45>] page_fault+0x25/0x30
 [<ffffffff8110fe0f>] ? iov_iter_fault_in_readable+0x2f/0x60
 [<ffffffff814ed910>] ? _cond_resched+0x30/0x40
 [<ffffffff8111152e>] generic_file_buffered_write+0xde/0x2a0
 [<ffffffff810707c7>] ? current_fs_time+0x27/0x30
 [<ffffffff81112eb0>] __generic_file_aio_write+0x250/0x480
 [<ffffffffa01be7e5>] ? cl_env_info+0x15/0x20 [obdclass]
 [<ffffffff8111314f>] generic_file_aio_write+0x6f/0xe0
 [<ffffffffa074e101>] vvp_io_write_start+0xa1/0x270 [lustre]
 [<ffffffffa01cc9f8>] cl_io_start+0x68/0x170 [obdclass]
 [<ffffffffa01d1570>] cl_io_loop+0x110/0x1c0 [obdclass]
 [<ffffffffa0625342>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
 [<ffffffffa06f59db>] ll_file_io_generic+0x44b/0x580 [lustre]
 [<ffffffffa0623434>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
 [<ffffffffa01c05c9>] ? cl_env_get+0x29/0x350 [obdclass]
 [<ffffffffa06f5c4f>] ll_file_aio_write+0x13f/0x310 [lustre]
 [<ffffffffa01c073e>] ? cl_env_get+0x19e/0x350 [obdclass]
 [<ffffffffa06fc2e1>] ll_file_write+0x171/0x310 [lustre]
 [<ffffffff81176818>] vfs_write+0xb8/0x1a0
 [<ffffffff810d4832>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff811772e2>] sys_pwrite64+0x82/0xa0
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Code: 00 8b 15 3c 36 a4 00 85 d2 74 34 48 8b 50 08 8b 5a 18 48 8b 90 10 09 00 00 48 8b 4a 50 48 85 c9 74 1d 48 63 db 66 90 48 8b 51 20 <48> 03 14 dd 60 6d bf 81 4c 01 2a 48 8b 49 78 48 85 c9 75 e8 48 
RIP  [<ffffffff810528a4>] update_curr+0x144/0x1f0
 RSP <ffff880002203db8>
CR2: ffffffff8504dee0
---[ end trace 1c726fc394ca3f3f ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 1983, comm: java Tainted: G      D    ----------------   2.6.32-220.13.1.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff814ec83a>] ? panic+0x78/0x143
 [<ffffffff814f09e2>] ? oops_end+0xf2/0x100
 [<ffffffff8104234b>] ? no_context+0xfb/0x260
 [<ffffffff810425d5>] ? __bad_area_nosemaphore+0x125/0x1e0
 [<ffffffff810426a3>] ? bad_area_nosemaphore+0x13/0x20
 [<ffffffff81042d5d>] ? __do_page_fault+0x31d/0x480
 [<ffffffff814626bd>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff81462948>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff814f298e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff814efd45>] ? page_fault+0x25/0x30
 [<ffffffff810528a4>] ? update_curr+0x144/0x1f0
 [<ffffffff81052868>] ? update_curr+0x108/0x1f0
 [<ffffffff81052e5b>] ? task_tick_fair+0xdb/0x160
 [<ffffffff810566a1>] ? scheduler_tick+0xc1/0x260
 [<ffffffff810a0cc0>] ? tick_sched_timer+0x0/0xc0
 [<ffffffff8107c382>] ? update_process_times+0x52/0x70
 [<ffffffff810a0d26>] ? tick_sched_timer+0x66/0xc0
 [<ffffffff810953ae>] ? __run_hrtimer+0x8e/0x1a0
 [<ffffffff81037599>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffff81095756>] ? hrtimer_interrupt+0xe6/0x250
 [<ffffffff814f52fb>] ? smp_apic_timer_interrupt+0x6b/0x9b
 [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffffa0623419>] ? cfs_hash_dual_bd_unlock+0x19/0x60 [libcfs]
 [<ffffffffa0624197>] ? cfs_hash_dual_bd_lookup_locked+0x37/0x70 [libcfs]
 [<ffffffffa0625336>] ? cfs_hash_lookup+0x76/0xa0 [libcfs]
 [<ffffffffa01bea99>] ? cl_env_fetch+0x29/0x70 [obdclass]
 [<ffffffffa01c0ca4>] ? cl_env_reexit+0x14/0x140 [obdclass]
 [<ffffffffa073c433>] ? ll_releasepage+0x33/0x50 [lustre]
 [<ffffffff8110ffb0>] ? try_to_release_page+0x30/0x60
 [<ffffffff8112a3f1>] ? shrink_page_list.clone.0+0x4f1/0x5c0
 [<ffffffff8112a7bb>] ? shrink_inactive_list+0x2fb/0x740
 [<ffffffff81038488>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8112b4cf>] ? shrink_zone+0x38f/0x520
 [<ffffffff8112b75e>] ? do_try_to_free_pages+0xfe/0x520
 [<ffffffff81114cef>] ? zone_watermark_ok+0x1f/0x30
 [<ffffffff8112bd6d>] ? try_to_free_pages+0x9d/0x130
 [<ffffffff8112cec0>] ? isolate_pages_global+0x0/0x350
 [<ffffffff81123ced>] ? __alloc_pages_nodemask+0x40d/0x940
 [<ffffffff8115e002>] ? kmem_getpages+0x62/0x170
 [<ffffffff8115ec1a>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8115e66f>] ? cache_grow+0x2cf/0x320
 [<ffffffff8115e999>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff8115f77b>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffffa06158c8>] ? cfs_mem_cache_alloc+0x48/0x50 [libcfs]
 [<ffffffffa04f1419>] ? osc_page_init+0x59/0x330 [osc]
 [<ffffffffa06c4f3b>] ? lovsub_page_init+0xdb/0x2f0 [lov]
 [<ffffffffa01c510b>] ? cl_page_find0+0x1eb/0x8a0 [obdclass]
 [<ffffffffa01c57d8>] ? cl_page_find_sub+0x18/0x20 [obdclass]
 [<ffffffffa06baad1>] ? lov_page_init_raid0+0x1a1/0x6d0 [lov]
 [<ffffffffa01c1982>] ? cl_page_slice_add+0x52/0x110 [obdclass]
 [<ffffffffa06b7bbd>] ? lov_page_init+0x6d/0xe0 [lov]
 [<ffffffffa01c510b>] ? cl_page_find0+0x1eb/0x8a0 [obdclass]
 [<ffffffff8113327e>] ? __inc_zone_page_state+0x2e/0x30
 [<ffffffff81128020>] ? __lru_cache_add+0x40/0x90
 [<ffffffffa01c57f1>] ? cl_page_find+0x11/0x20 [obdclass]
 [<ffffffffa072405d>] ? ll_readahead+0xedd/0x1290 [lustre]
 [<ffffffffa0747f6a>] ? ccc_page_is_under_lock+0x1aa/0x200 [lustre]
 [<ffffffffa0721f28>] ? ras_update+0x58/0xe30 [lustre]
 [<ffffffffa06bf87e>] ? lov_page_stripe+0x3e/0x150 [lov]
 [<ffffffffa074dee5>] ? vvp_io_read_page+0x385/0x3c0 [lustre]
 [<ffffffffa01cf2e5>] ? cl_io_read_page+0x95/0x1a0 [obdclass]
 [<ffffffffa01c3939>] ? cl_page_assume+0xe9/0x250 [obdclass]
 [<ffffffffa0724988>] ? ll_readpage+0x98/0x1f0 [lustre]
 [<ffffffff81128020>] ? __lru_cache_add+0x40/0x90
 [<ffffffff811120c3>] ? filemap_fault+0x313/0x500
 [<ffffffffa074e8bf>] ? vvp_io_fault_start+0x12f/0x5a0 [lustre]
 [<ffffffffa01c87f5>] ? cl_wait+0xb5/0x280 [obdclass]
 [<ffffffffa01cc9f8>] ? cl_io_start+0x68/0x170 [obdclass]
 [<ffffffffa01d1570>] ? cl_io_loop+0x110/0x1c0 [obdclass]
 [<ffffffffa0731bb4>] ? ll_fault0+0xb4/0x280 [lustre]
 [<ffffffffa01bff08>] ? cl_object_attr_get+0x88/0x1b0 [obdclass]
 [<ffffffffa0731fc8>] ? ll_fault+0x48/0x160 [lustre]
 [<ffffffff8113b414>] ? __do_fault+0x54/0x510
 [<ffffffffa01bfd5c>] ? cl_object_attr_set+0x8c/0x1b0 [obdclass]
 [<ffffffff8113b9c7>] ? handle_pte_fault+0xf7/0xb50
 [<ffffffff8120c0fa>] ? security_capable+0x2a/0x30
 [<ffffffff81076973>] ? capable+0x13/0x50
 [<ffffffffa061862e>] ? cfs_capable+0xe/0x10 [libcfs]
 [<ffffffff8113c604>] ? handle_mm_fault+0x1e4/0x2b0
 [<ffffffff81042b79>] ? __do_page_fault+0x139/0x480
 [<ffffffff810564b3>] ? perf_event_task_sched_out+0x33/0x80
 [<ffffffffa06c28a7>] ? lov_io_commit_write+0xa7/0x1d0 [lov]
 [<ffffffff810532b0>] ? __dequeue_entity+0x30/0x50
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff814f298e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff814efd45>] ? page_fault+0x25/0x30
 [<ffffffff8110fe0f>] ? iov_iter_fault_in_readable+0x2f/0x60
 [<ffffffff814ed910>] ? _cond_resched+0x30/0x40
 [<ffffffff8111152e>] ? generic_file_buffered_write+0xde/0x2a0
 [<ffffffff810707c7>] ? current_fs_time+0x27/0x30
 [<ffffffff81112eb0>] ? __generic_file_aio_write+0x250/0x480
 [<ffffffffa01be7e5>] ? cl_env_info+0x15/0x20 [obdclass]
 [<ffffffff8111314f>] ? generic_file_aio_write+0x6f/0xe0
 [<ffffffffa074e101>] ? vvp_io_write_start+0xa1/0x270 [lustre]
 [<ffffffffa01cc9f8>] ? cl_io_start+0x68/0x170 [obdclass]
 [<ffffffffa01d1570>] ? cl_io_loop+0x110/0x1c0 [obdclass]
 [<ffffffffa0625342>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
 [<ffffffffa06f59db>] ? ll_file_io_generic+0x44b/0x580 [lustre]
 [<ffffffffa0623434>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
 [<ffffffffa01c05c9>] ? cl_env_get+0x29/0x350 [obdclass]
 [<ffffffffa06f5c4f>] ? ll_file_aio_write+0x13f/0x310 [lustre]
 [<ffffffffa01c073e>] ? cl_env_get+0x19e/0x350 [obdclass]
 [<ffffffffa06fc2e1>] ? ll_file_write+0x171/0x310 [lustre]
 [<ffffffff81176818>] ? vfs_write+0xb8/0x1a0
 [<ffffffff810d4832>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff811772e2>] ? sys_pwrite64+0x82/0xa0
 [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b


 Comments   
Comment by Jinshan Xiong (Inactive) [ 05/Sep/12 ]

stack overflow?

Comment by Brian Murrell (Inactive) [ 05/Sep/12 ]

I don't know TBH, but it did not escape me how deep the stack was when I was pasting it.
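
One rough way to put a number on "how deep the stack was" is to count the frames in the pasted trace. The sketch below is illustrative only: the regular expression and the `summarize_trace` helper are assumptions written for this note, not part of any Lustre or kernel tooling. It skips entries marked with `?`, which the kernel prints for addresses it found on the stack but could not confirm as real return addresses, and groups the remaining frames by module. The sample lines are copied from the oops above.

```python
import re
from collections import Counter

# Matches one call-trace entry such as:
#   [<ffffffffa0625336>] cfs_hash_lookup+0x76/0xa0 [libcfs]
#   [<ffffffff81038488>] ? pvclock_clocksource_read+0x58/0xd0
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def summarize_trace(text):
    """Return (reliable_frame_count, Counter mapping module -> frames)."""
    per_module = Counter()
    for match in FRAME_RE.finditer(text):
        if match.group("unreliable"):
            continue  # '?' entries are unverified, so leave them out
        # Frames without a [module] suffix belong to the core kernel image.
        per_module[match.group("module") or "kernel"] += 1
    return sum(per_module.values()), per_module


SAMPLE = """\
 [<ffffffffa0625336>] cfs_hash_lookup+0x76/0xa0 [libcfs]
 [<ffffffffa01bea99>] cl_env_fetch+0x29/0x70 [obdclass]
 [<ffffffff8110ffb0>] try_to_release_page+0x30/0x60
 [<ffffffff81038488>] ? pvclock_clocksource_read+0x58/0xd0
"""

total, by_module = summarize_trace(SAMPLE)
print(total, dict(by_module))
```

Frame count is only a proxy: actual stack consumption depends on the size of each function's frame, which the trace does not show.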

Comment by Keith Mannthey (Inactive) [ 05/Sep/12 ]

So the base code was trying to do a write, but the page was swapped out, so it moved on to allocate memory and looks to have had difficulty? Any idea how much memory was left on the system? No sign of OOM yet? Then a timer interrupt fired and it looks like it blew up while in the scheduler.

Looks related to http://jira.whamcloud.com/browse/LU-1474 and http://jira.whamcloud.com/browse/LU-969. Does your code have the fix from LU-969?

Comment by Oleg Drokin [ 05/Sep/12 ]

I vote stack overflow too. 2.1.1 does not have the LU-969 patches, but they were included in 2.1.3.
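
The register dump can be cross-checked against the faulting address. In the `Code:` line, the bytes starting at the `<48>` marker (`48 03 14 dd 60 6d bf 81`) decode, on my reading, to an `add` whose memory operand is a 32-bit displacement plus `RBX * 8`. The sketch below simply redoes that address arithmetic with the values from the oops; the helper names are made up for this note. What the table at that displacement holds is not stated in the ticket. Assuming it is a small per-CPU style array, an index as large as the `RBX` value here would suggest the index was read from memory that had already been overwritten, which would fit the stack overflow theory, but that interpretation is a guess.

```python
MASK64 = (1 << 64) - 1


def sign_extend_32(value):
    """Sign-extend a 32-bit displacement to 64 bits, as x86-64 addressing does."""
    if value & 0x80000000:
        return (value | 0xFFFFFFFF00000000) & MASK64
    return value


def effective_address(disp32, index, scale):
    """Compute disp32 + index * scale, wrapped to 64 bits."""
    return (sign_extend_32(disp32) + index * scale) & MASK64


# Displacement bytes following '<48> 03 14 dd' in the Code: line (little-endian).
disp32 = int.from_bytes(bytes.fromhex("606dbf81"), "little")
rbx = 0x000000000068AE30  # RBX from the register dump
cr2 = 0xFFFFFFFF8504DEE0  # faulting address reported in CR2

addr = effective_address(disp32, rbx, 8)
print(hex(addr), addr == cr2)
```

The computed address equals the `CR2` value, so the fault comes from the out-of-range index in `RBX` rather than from a bad base pointer.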

Comment by Brian Murrell (Inactive) [ 06/Sep/12 ]

OK. I've updated that machine to b2_1. Let's see if it's more stable.

Comment by Brian Murrell (Inactive) [ 06/Sep/12 ]

I will re-open if problems continue after having done the upgrade to b2_1.

Generated at Sat Feb 10 01:20:06 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.