Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Major
- Fix Version/s: None
- Affects Version/s: Lustre 1.8.6
- Labels: None
- Environment: RHEL 6 2.6.32-71.el6.x86_64 kernel
- Severity: 3
- Old issue number: 6447
Description
Customer reports that a few compute nodes have been panicking. They have seen the behavior on 7 nodes, and each node has hit the problem numerous times. It looks like it may be similar to LU-93. I'd like Whamcloud to weigh in on whether you think it is related, or whether it is a known issue. The tracebacks and console messages are attached.
Attachments
Issue Links
- Trackbacks
- Lustre 1.8.x known issues tracker: While testing against the Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla (https://bugzilla.lustre.org/). In order to move away from relying on Bugzilla, we would create a JIRA
Activity
Just looking at open cases. Customer found this was not a Lustre issue after all. I believe that they upgraded the kernel to fix the issue. Please close this.
When did this situation start happening? Did it begin after switching to RHEL 6, after upgrading from an older Lustre version to 1.8.6, or after starting to use a specific kernel version or other software?
Bobi,
Andreas is rather busy at the moment, so could you please review and comment on this latest information from our customer?
Thanks,
Peter
I received the following from the customer today:
Please ask WC to stand down on it being P1. We found Lustre in a sample trace, so we went with that. Once we started looking at all of the traces, Lustre is present in SOME of the stack traces, but it is not in the most common ones. I would appreciate it if Andreas could have a look at some more stack traces to see if there is anything he's seen before, though.
ftp://shell.sgi.com/collect/jhanson/nodeswithsoftlockupconsoles.tar.bz2
What I've found by looking at these:
Once there is a "BUG: soft lockup", the next lines look like this (example chosen at random):
BUG: soft lockup - CPU#0 stuck for 61s! [global_fcst:30024]
Modules linked in: acpi_cpufreq freq_table mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad mlx4_ib iw_cxgb3 ko2iblnd(U) rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr obdclass(U) lnet(U) lvfs(U) libcfs(U) xpmem(U) xp gru xvma(U) numatools(U) microcode serio_raw i2c_i801 i2c_core iTCO_wdt
iTCO_vendor_support ioatdma ahci mlx4_en mlx4_core igb dca dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi qla4xxx scsi_transport_iscsi [last unloaded: ipmi_msghandler]
CPU 0:
Modules linked in: acpi_cpufreq freq_table mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad mlx4_ib iw_cxgb3 ko2iblnd(U) rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr obdclass(U) lnet(U) lvfs(U) libcfs(U) xpmem(U) xp gru xvma(U) numatools(U) microcode serio_raw i2c_i801 i2c_core iTCO_wdt
iTCO_vendor_support ioatdma ahci mlx4_en mlx4_core igb dca dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi qla4xxx scsi_transport_iscsi [last unloaded: ipmi_msghandler]
Pid: 30024, comm: global_fcst Tainted: G W ---------------- 2.6.32-71.el6.x86_64 #1 AltixICE8400IP105
RIP: 0010:[<ffffffff814caa3e>] [<ffffffff814caa3e>] _spin_lock+0x1e/0x30
RSP: 0018:ffff8802e9b3fc38 EFLAGS: 00000297
RAX: 000000000000e364 RBX: ffff8802e9b3fc38 RCX: ffff8804b764de80
RDX: 0000000000000000 RSI: ffff88033d53d208 RDI: ffff880637837268
RBP: ffffffff81013c8e R08: ffff8802e9b3fe10 R09: 0000000000100000
R10: 00007fffffff2dc0 R11: 0000000000000213 R12: ffff88033b712100
R13: ffffffff817300c0 R14: ffff88033b7126b8 R15: 0000000000010518
FS: 00002aaaaf3e0800(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaae8f0840 CR3: 000000033ca90000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
[<ffffffffa0304fb1>] ? xpmem_tg_ref_by_tgid+0x41/0xe0 [xpmem]
[<ffffffff81068598>] ? get_task_mm+0x28/0x70
[<ffffffffa030073a>] ? xpmem_make+0x9a/0x360 [xpmem]
[<ffffffff8110c037>] ? __lock_page+0x67/0x70
[<ffffffffa02ff19d>] ? xpmem_ioctl+0xdd/0x3f0 [xpmem]
[<ffffffff8110dade>] ? filemap_fault+0xbe/0x510
[<ffffffff8110c177>] ? unlock_page+0x27/0x30
[<ffffffff81135837>] ? handle_pte_fault+0xf7/0xad0
[<ffffffff811502a7>] ? alloc_pages_current+0x87/0xd0
[<ffffffff8117f182>] ? vfs_ioctl+0x22/0xa0
[<ffffffff81258ae5>] ? _atomic_dec_and_lock+0x55/0x80
[<ffffffff81013c8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff8117f324>] ? do_vfs_ioctl+0x84/0x580
[<ffffffff811363fd>] ? handle_mm_fault+0x1ed/0x2b0
[<ffffffff8117f8a1>] ? sys_ioctl+0x81/0xa0
[<ffffffff81013172>] ? system_call_fastpath+0x16/0x1b
So I went to look for commonality after "Call Trace:" and found little, with lots of possible places to check.
guest@globe:/cores/people/jhanson/noaa/softlockup/nodeswithsoftlockupconsoles> grep --binary-files=text -h -A1 "Call Trace" r* | sort | uniq
--
Call Trace:
[<ffffffff810117bc>] ? __switch_to+0x1ac/0x320
[<ffffffff81013ace>] ? common_interrupt+0xe/0x13
[<ffffffff81013b76>] retint_careful+0x14/0x32
[<ffffffff81013c8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff81013cee>] ? invalidate_interrupt1+0xe/0x20
[<ffffffff81013d4e>] ? invalidate_interrupt4+0xe/0x20
[<ffffffff81013d6e>] ? invalidate_interrupt5+0xe/0x20
[<ffffffff81014162>] ? kernel_thread+0x82/0xe0
[<ffffffff81014645>] ? math_state_restore+0x45/0x60
[<ffffffff8101660f>] ? dump_trace+0x1af/0x3a0
[<ffffffff8101a4f9>] ? read_tsc+0x9/0x20
[<ffffffff8104f61c>] ? enqueue_task+0x5c/0x70
[<ffffffff8104fff9>] ? __wake_up_common+0x59/0x90
[<ffffffff810507f8>] ? resched_task+0x68/0x80
[<ffffffff810508a5>] ? check_preempt_curr_idle+0x15/0x20
[<ffffffff81056303>] ? __wake_up+0x53/0x70
[<ffffffff81056630>] ? __dequeue_entity+0x30/0x50
[<ffffffff81059d12>] ? finish_task_switch+0x42/0xd0
[<ffffffff8105a808>] ? pull_task+0x58/0x70
[<ffffffff8105c490>] ? default_wake_function+0x0/0x20
[<ffffffff8105c4a2>] ? default_wake_function+0x12/0x20
[<ffffffff8105c4e5>] ? wake_up_process+0x15/0x20
[<ffffffff8105c756>] ? update_curr+0xe6/0x1e0
[<ffffffff8105fa72>] ? enqueue_entity+0x122/0x320
[<ffffffff8105fcb3>] ? enqueue_task_fair+0x43/0x90
[<ffffffff81061b71>] ? dequeue_entity+0x1a1/0x1e0
[<ffffffff81062b84>] ? find_busiest_group+0x254/0xb40
[<ffffffff8106329a>] ? find_busiest_group+0x96a/0xb40
[<ffffffff81066d6e>] ? select_task_rq_fair+0x9ee/0xab0
[<ffffffff810670c1>] ? check_preempt_wakeup+0x41/0x3c0
[<ffffffff81067244>] ? check_preempt_wakeup+0x1c4/0x3c0
[<ffffffff81067732>] migration_thread+0x1d2/0x310
[<ffffffff81069207>] ? dup_mm+0x2a7/0x520
[<ffffffff8106b857>] warn_slowpath_common+0x87/0xc0
[<ffffffff8106b9f5>] ? __call_console_drivers+0x75/0x90
[<ffffffff8106d0a1>] do_syslog+0x461/0x4c0
[<ffffffff8106f805>] do_wait+0x1c5/0x250
[<ffffffff8107064f>] do_exit+0x56f/0x820
[<ffffffff810737a5>] ksoftirqd+0xd5/0x110
[<ffffffff8107d5ac>] ? lock_timer_base+0x3c/0x70
[<ffffffff8107e616>] ? mod_timer+0x146/0x230
[<ffffffff8107e718>] ? add_timer+0x18/0x30
[<ffffffff8108ac20>] ? __call_usermodehelper+0x0/0xa0
[<ffffffff8108c4a0>] ? worker_thread+0x0/0x2a0
[<ffffffff8108cc82>] ? queue_work_on+0x42/0x60
[<ffffffff81091cb6>] ? autoremove_wake_function+0x16/0x40
[<ffffffff81091eae>] ? prepare_to_wait_exclusive+0x4e/0x80
[<ffffffff81091f8e>] ? prepare_to_wait+0x4e/0x80
[<ffffffff81095da3>] ? __hrtimer_start_range_ns+0x1a3/0x430
[<ffffffff8109638a>] ? down_read_trylock+0x1a/0x30
[<ffffffff81096bff>] ? up+0x2f/0x50
[<ffffffff81098f05>] async_manager_thread+0xc5/0x100
[<ffffffff8109b9a9>] ? ktime_get_ts+0xa9/0xe0
[<ffffffff810a25a9>] futex_wait_queue_me+0xb9/0xf0
[<ffffffff810a666b>] ? rt_mutex_adjust_pi+0x7b/0x90
[<ffffffff810c2b01>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff810c2b01>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff810ca7b6>] ? audit_hold_skb+0x26/0x50
[<ffffffff810cab7b>] ? kauditd_send_skb+0x3b/0x90
[<ffffffff810d3d4b>] ? audit_syscall_exit+0x25b/0x290
[<ffffffff8110351b>] slow_work_thread+0x32b/0x3a0
[<ffffffff81108047>] ? perf_event_exit_task+0x37/0x160
[<ffffffff8110b832>] ? iov_iter_copy_from_user_atomic+0x92/0x130
[<ffffffff8110bb70>] ? find_get_pages_tag+0x40/0x120
[<ffffffff8110c060>] ? sync_page+0x0/0x50
[<ffffffff8110c0b0>] ? sync_page_killable+0x0/0x40
[<ffffffff8110eecb>] oom_kill_process+0xcb/0x2e0
[<ffffffff8111b3a5>] ? __rmqueue+0xc5/0x490
[<ffffffff8111bd57>] bad_page+0x107/0x160
[<ffffffff8111cf91>] ? get_page_from_freelist+0x3d1/0x820
[<ffffffff8111e1c6>] ? __alloc_pages_nodemask+0xf6/0x810
[<ffffffff8111e48d>] ? __alloc_pages_nodemask+0x3bd/0x810
[<ffffffff8111e745>] __alloc_pages_nodemask+0x675/0x810
[<ffffffff8111f78a>] ? determine_dirtyable_memory+0x1a/0x30
[<ffffffff81120951>] ? do_writepages+0x21/0x40
[<ffffffff8112bc27>] ? vma_prio_tree_next+0x47/0x70
[<ffffffff8112d14d>] ? zone_statistics+0x7d/0xa0
[<ffffffff8112d980>] ? vmstat_update+0x0/0x40
[<ffffffff8112de70>] ? bdi_sync_supers+0x0/0x60
[<ffffffff811336b5>] ? unmap_vmas+0xa85/0xc00
[<ffffffff811345a2>] ? unmap_mapping_range+0x72/0x150
[<ffffffff81135a85>] ? handle_pte_fault+0x345/0xad0
[<ffffffff81136455>] ? handle_mm_fault+0x245/0x2b0
[<ffffffff81139582>] ? unlink_file_vma+0x42/0x70
[<ffffffff8113e59d>] ? rmap_walk+0x7d/0x1c0
[<ffffffff8113f2de>] ? page_referenced+0x9e/0x2f0
[<ffffffff8113fb72>] ? try_to_unmap_file+0x42/0x750
[<ffffffff81156007>] ? cache_grow+0x217/0x320
[<ffffffff811560bf>] ? cache_grow+0x2cf/0x320
[<ffffffff81157e51>] ? drain_array+0xe1/0x100
[<ffffffff81158d38>] ? drain_freelist+0x78/0xc0
[<ffffffff81158d80>] ? cache_reap+0x0/0x260
[<ffffffff8115fe28>] ? __mem_cgroup_uncharge_common+0x78/0x260
[<ffffffff81161c89>] ? mem_cgroup_charge_common+0x99/0xc0
[<ffffffff81165218>] khugepaged+0x958/0x1190
[<ffffffff8116c65a>] ? do_sync_read+0xfa/0x140
[<ffffffff81175fdb>] pipe_wait+0x5b/0x80
[<ffffffff81258839>] ? cpumask_next_and+0x29/0x50
[<ffffffff81262a54>] ? vsnprintf+0x484/0x5f0
[<ffffffff81264025>] ? memmove+0x45/0x50
[<ffffffff812fcaa0>] ? flush_to_ldisc+0x0/0x1b0
[<ffffffff812fee81>] vt_event_wait+0xa1/0x100
[<ffffffff8137fe39>] hub_thread+0x369/0x17f0
[<ffffffff8138a164>] ? usb_suspend_both+0x1a4/0x320
[<ffffffff814277d0>] ? eth_type_trans+0x40/0x140
[<ffffffff81445e95>] ? ip_local_out+0x25/0x30
[<ffffffff8144e7e6>] ? tcp_sendmsg+0x756/0xa30
[<ffffffff8149b2d6>] ? unix_stream_sendmsg+0x3c6/0x3e0
[<ffffffff814c7b23>] panic+0x78/0x137
[<ffffffff814c8286>] ? thread_return+0x4e/0x778
[<ffffffff814c8b00>] ? _cond_resched+0x30/0x40
[<ffffffff814c8c5c>] ? wait_for_common+0x14c/0x180
[<ffffffff814c8d4d>] ? wait_for_completion+0x1d/0x20
[<ffffffff814c8f34>] schedule_timeout+0x194/0x2f0
[<ffffffff814c8f3c>] ? schedule_timeout+0x19c/0x2f0
[<ffffffff814c8fc5>] schedule_timeout+0x225/0x2f0
[<ffffffff814c96e0>] ? __mutex_lock_slowpath+0x70/0x180
[<ffffffff814c97ae>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff814c9ad8>] schedule_hrtimeout_range+0xc8/0x160
[<ffffffff814c9b4d>] schedule_hrtimeout_range+0x13d/0x160
[<ffffffff814c9c1b>] do_nanosleep+0x8b/0xc0
[<ffffffff814ca6b5>] rwsem_down_failed_common+0x95/0x1d0
[<ffffffff814cac1b>] ? _spin_unlock_bh+0x1b/0x20
[<ffffffff814cd766>] ? notifier_call_chain+0x16/0x80
[<ffffffffa00a78be>] ? __put_nfs_open_context+0x3e/0xc0 [nfs]
[<ffffffffa00a9e10>] ? fib6_clean_node+0x0/0xd0 [ipv6]
[<ffffffffa00b0540>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
[<ffffffffa01407fd>] ? call_transmit_status+0x4d/0xe0 [sunrpc]
[<ffffffffa01433e9>] ? xprt_release_xprt+0x89/0x90 [sunrpc]
[<ffffffffa01435bf>] ? xprt_reserve+0x1cf/0x1f0 [sunrpc]
[<ffffffffa01444a0>] ? xprt_autoclose+0x0/0x70 [sunrpc]
[<ffffffffa0146210>] ? xs_tcp_connect_worker4+0x0/0x30 [sunrpc]
[<ffffffffa01488a0>] ? rpc_async_release+0x0/0x20 [sunrpc]
[<ffffffffa0148d00>] ? rpc_wait_bit_killable+0x0/0x40 [sunrpc]
[<ffffffffa0149760>] ? rpc_async_schedule+0x0/0x20 [sunrpc]
[<ffffffffa01e68be>] ? __put_nfs_open_context+0x3e/0xc0 [nfs]
[<ffffffffa01e7560>] ? nfs_wait_bit_killable+0x0/0x40 [nfs]
[<ffffffffa01ef540>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
[<ffffffffa01f40cd>] ? nfs_commit_free+0x3d/0x50 [nfs]
[<ffffffffa01f4688>] ? nfs_writeback_release_full+0x128/0x1b0 [nfs]
[<ffffffffa01fe3a5>] xpmem_clear_PFNtable+0x185/0x340 [xpmem]
[<ffffffffa02467b0>] ? process_req+0x0/0x1a0 [ib_addr]
[<ffffffffa02745ae>] ? mlx4_ib_post_send+0x4be/0xf10 [mlx4_ib]
[<ffffffffa02a80cd>] ? mcast_work_handler+0xed/0x830 [ib_sa]
[<ffffffffa030073a>] xpmem_make+0x9a/0x360 [xpmem]
[<ffffffffa0304fb1>] ? xpmem_tg_ref_by_tgid+0x41/0xe0 [xpmem]
[<ffffffffa03054f1>] ? xpmem_PFNs_exist_in_range_l3+0x51/0xa0 [xpmem]
[<ffffffffa0308445>] xpmem_clear_PFNtable+0x185/0x340 [xpmem]
[<ffffffffa0309ec8>] ? xpmem_recall_PFNs_of_tg+0xf8/0x2d0 [xpmem]
[<ffffffffa030a40b>] xpmem_pgcl_thread+0x1db/0x220 [xpmem]
[<ffffffffa0320ab2>] lcw_dispatch_main+0xd2/0x400 [libcfs]
[<ffffffffa0353b8b>] ? mlx4_ib_poll_cq+0x2ab/0x780 [mlx4_ib]
[<ffffffffa0379c9d>] ? LNetMDAttach+0x35d/0x4c0 [lnet]
[<ffffffffa03dbc5a>] obd_zombie_impexp_thread+0x15a/0x2b0 [obdclass]
[<ffffffffa046a330>] ? ipoib_reap_ah+0x0/0x50 [ib_ipoib]
[<ffffffffa04e6c3a>] ? kiblnd_queue_tx+0x4a/0x60 [ko2iblnd]
[<ffffffffa04f3eb6>] ? loi_list_maint+0xa6/0x130 [osc]
[<ffffffffa050fb64>] ? cache_add_extent+0x134/0x640 [osc]
[<ffffffffa056efd0>] ? ib_mad_completion_handler+0x0/0x810 [ib_mad]
[<ffffffffa057a492>] ? cm_process_work+0x32/0x110 [ib_cm]
[<ffffffffa057bcff>] ? cm_rep_handler+0x31f/0x590 [ib_cm]
[<ffffffffa057bf70>] ? cm_work_handler+0x0/0x11d6 [ib_cm]
[<ffffffffa0584330>] ? cma_work_handler+0x0/0xb0 [rdma_cm]
[<ffffffffa059fc81>] ? kiblnd_init_tx_msg+0x91/0x200 [ko2iblnd]
[<ffffffffa05a4465>] kiblnd_scheduler+0x325/0x760 [ko2iblnd]
[<ffffffffa05bafed>] ? ldlm_lock_put+0x19d/0x450 [ptlrpc]
[<ffffffffa05bffb1>] ? ldlm_lock_decref+0x41/0xb0 [ptlrpc]
[<ffffffffa05c0af3>] ? ldlm_resource_putref_internal+0xb3/0x4c0 [ptlrpc]
[<ffffffffa05e3397>] ? ldlm_callback_handler+0xa57/0x1e10 [ptlrpc]
[<ffffffffa05e6140>] ldlm_bl_thread_main+0x3f0/0x440 [ptlrpc]
[<ffffffffa060d1d0>] ptlrpc_wait_event+0x3b0/0x3c0 [ptlrpc]
[<ffffffffa060e6a7>] ? lov_merge_lvb+0xb7/0x240 [lov]
[<ffffffffa0684ac2>] ? ll_removepage+0x352/0x8d0 [lustre]
[<ffffffffa0695c9c>] ? ll_file_mmap+0x12c/0x180 [lustre]
[<ffffffffa06ef6a7>] ? lov_merge_lvb+0xb7/0x240 [lov]
[<ffffffffa06f20f5>] ? lov_finish_set+0x435/0x710 [lov]
[<ffffffffa07056a7>] ? lov_merge_lvb+0xb7/0x240 [lov]
[<ffffffffa073f1a4>] ll_close_thread+0x124/0x260 [lustre]
[<ffffffffa075aac2>] ? ll_removepage+0x352/0x8d0 [lustre]
[<ffffffffa09d7c9c>] ? ll_file_mmap+0x12c/0x180 [lustre]
<IRQ>
<IRQ> [<ffffffff8106b857>] warn_slowpath_common+0x87/0xc0
<IRQ> [<ffffffff810d8740>] ? handle_IRQ_event+0x60/0x170
<IRQ> [<ffffffff814c7b23>] panic+0x78/0x137
It is probably not unexpected that there are many places, because:
guest@globe:/cores/people/jhanson/noaa/softlockup/nodeswithsoftlockupconsoles> grep --binary-files=text -h -A1 "Call Trace" r* | wc -l
372554
In the history of this cluster (as reflected in the console logs) we have had "BUG: soft lockup" 119496 times.
There are a wide variety of places where the back trace starts, but the two most dominant are:
grep --binary-files=text -h -A1 "Call Trace" r* | grep -v "Call Trace" | grep -v ^- | grep unmap_mapping_range | wc -l
49558
grep --binary-files=text -h -A1 "Call Trace" r* | grep -v "Call Trace" | grep -v ^- | grep xpmem_tg_ref_by_tgid | wc -l
30446
After the first function, the dominant traces start to diverge. For unmap_mapping_range:
grep --binary-files=text -h -A1 "unmap_mapping_range" r* | sort | uniq
--
[<ffffffff81013cce>] ? invalidate_interrupt0+0xe/0x20
[<ffffffff810ddc95>] ? call_rcu_sched+0x15/0x20
[<ffffffff811343b4>] unmap_mapping_range_vma+0x64/0xf0
[<ffffffff811343ea>] ? unmap_mapping_range_vma+0x9a/0xf0
[<ffffffff811344d7>] ? unmap_mapping_range_tree+0x97/0xf0
[<ffffffff811344d7>] unmap_mapping_range_tree+0x97/0xf0
[<ffffffff811345a2>] ? unmap_mapping_range+0x72/0x150
[<ffffffff811345a2>] unmap_mapping_range+0x72/0x150
[<ffffffff81134661>] ? unmap_mapping_range+0x131/0x150
[<ffffffff81134661>] unmap_mapping_range+0x131/0x150
[<ffffffff814caa3e>] ? _spin_lock+0x1e/0x30
[<ffffffff814caa41>] ? _spin_lock+0x21/0x30
[<ffffffffa01fb4f1>] ? xpmem_PFNs_exist_in_range_l3+0x51/0xa0 [xpmem]
[<ffffffffa042231c>] ? ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa042231c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa069631c>] ? ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa069631c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa076c31c>] ? ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa076c31c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa09d831c>] ? ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa09d831c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa0ac631c>] ? ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa0ac631c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre]
For xpmem_tg_ref_by_tgid, the only function that follows is get_task_mm.
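The per-function tallying above can also be done in one pass, counting every distinct first frame after "Call Trace:" rather than grepping for each candidate. A minimal sketch, assuming the console logs are plain-text files named r*; the directory name and the two tiny sample logs below are hypothetical stand-ins so the pipeline can be run on its own:

```shell
#!/bin/sh
# Sketch: tally the first frame after each "Call Trace:" across console logs.
# /tmp/softlockup_demo and the r1/r2 contents are made-up sample data; on the
# real system you would run the final pipeline in the directory holding r*.
mkdir -p /tmp/softlockup_demo && cd /tmp/softlockup_demo

cat > r1 <<'EOF'
Call Trace:
 [<ffffffff811345a2>] unmap_mapping_range+0x72/0x150
Call Trace:
 [<ffffffffa0304fb1>] xpmem_tg_ref_by_tgid+0x41/0xe0 [xpmem]
EOF
cat > r2 <<'EOF'
Call Trace:
 [<ffffffff811345a2>] unmap_mapping_range+0x72/0x150
EOF

# Take the line after each "Call Trace:", drop grep's "--" group separators,
# strip the "[<address>]" prefix and the "+offset/size [module]" suffix,
# then count how often each leading function appears.
grep --binary-files=text -h -A1 "Call Trace" r* \
  | grep -v "Call Trace" | grep -v '^--' \
  | sed 's/.*\] //; s/+.*//' \
  | sort | uniq -c | sort -rn
```

On the sample data this prints unmap_mapping_range with count 2 and xpmem_tg_ref_by_tgid with count 1, mirroring the ranking the customer found on the full logs.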
This doesn't appear to be the same as LU-93, which was causing the client to crash.
In this case, it looks like all of the threads are stuck in ll_teardown_mmaps->unmap_mapping_range() because the node is trying to free memory under memory pressure.
This is a somewhat unusual workload for Lustre, because while mmap IO is functional, it is quite inefficient (single page RPCs) and rarely used.
Has this application been running in the past on Lustre? Are there any changes in the environment that might have caused the application to start failing (e.g. kernel, Lustre, or application upgrade)?
Customer just asked me to bump up the priority on this one. They just reported that this issue has caused hundreds of nodes to become unresponsive on their system.
OK, thanks for the update, Dennis.