Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Major
- Fix Version/s: None
- Affects Version/s: Lustre 1.8.6
- Labels: None
- Environment: RHEL 6 2.6.32-71.el6.x86_64 kernel
- Severity: 3
- Old issue number: 6447
Description
Customer reports that a few compute nodes have been panicking. They have seen the behavior on 7 nodes, and each node has hit the problem numerous times. It looks like it may be similar to LU-93. I'd like Whamcloud to weigh in on whether you think it is related, or whether it is a known issue. The tracebacks and console messages are attached.
Attachments
Issue Links
- Trackbacks
- Lustre 1.8.x known issues tracker: While testing against the Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla (https://bugzilla.lustre.org/). In order to move away from relying on Bugzilla, we would create a JIRA
Activity
Just looking at open cases. Customer found this was not a Lustre issue after all. I believe that they upgraded the kernel to fix the issue. Please close this.
When did this situation start happening? Did it begin after switching to RHEL 6, after upgrading from an older Lustre version to 1.8.6, or after starting to use a specific kernel version or other software?
Bobi,
Andreas is rather busy at the moment, so could you please review and comment on this latest information from our customer?
Thanks,
Peter
I received the following from the customer today:
Please ask WC to stand down on it being P1. We found Lustre in a sample trace, so we went with that. Once we started looking at all of the traces, Lustre is present in SOME of the stack traces, but it is not in the most common ones. I would appreciate it if Andreas could have a look at some more stack traces to see if there is anything he's seen before, though.
ftp://shell.sgi.com/collect/jhanson/nodeswithsoftlockupconsoles.tar.bz2
What I've found by looking at these:
Once there is a "BUG: soft lockup", the next lines look like this (example chosen at random):
BUG: soft lockup - CPU#0 stuck for 61s! [global_fcst:30024]
Modules linked in: acpi_cpufreq freq_table mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad mlx4_ib iw_cxgb3 ko2iblnd(U) rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr obdclass(U) lnet(U) lvfs(U) libcfs(U) xpmem(U) xp gru xvma(U) numatools(U) microcode serio_raw i2c_i801 i2c_core iTCO_wdt
iTCO_vendor_support ioatdma ahci mlx4_en mlx4_core igb dca dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi qla4xxx scsi_transport_iscsi [last unloaded: ipmi_msghandler]
CPU 0:
Modules linked in: acpi_cpufreq freq_table mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad mlx4_ib iw_cxgb3 ko2iblnd(U) rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr obdclass(U) lnet(U) lvfs(U) libcfs(U) xpmem(U) xp gru xvma(U) numatools(U) microcode serio_raw i2c_i801 i2c_core iTCO_wdt
iTCO_vendor_support ioatdma ahci mlx4_en mlx4_core igb dca dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi qla4xxx scsi_transport_iscsi [last unloaded: ipmi_msghandler]
Pid: 30024, comm: global_fcst Tainted: G W ---------------- 2.6.32-71.el6.x86_64 #1 AltixICE8400IP105
RIP: 0010:[<ffffffff814caa3e>] [<ffffffff814caa3e>] _spin_lock+0x1e/0x30
RSP: 0018:ffff8802e9b3fc38 EFLAGS: 00000297
RAX: 000000000000e364 RBX: ffff8802e9b3fc38 RCX: ffff8804b764de80
RDX: 0000000000000000 RSI: ffff88033d53d208 RDI: ffff880637837268
RBP: ffffffff81013c8e R08: ffff8802e9b3fe10 R09: 0000000000100000
R10: 00007fffffff2dc0 R11: 0000000000000213 R12: ffff88033b712100
R13: ffffffff817300c0 R14: ffff88033b7126b8 R15: 0000000000010518
FS: 00002aaaaf3e0800(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaae8f0840 CR3: 000000033ca90000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
[<ffffffffa0304fb1>] ? xpmem_tg_ref_by_tgid+0x41/0xe0 [xpmem]
[<ffffffff81068598>] ? get_task_mm+0x28/0x70
[<ffffffffa030073a>] ? xpmem_make+0x9a/0x360 [xpmem]
[<ffffffff8110c037>] ? __lock_page+0x67/0x70
[<ffffffffa02ff19d>] ? xpmem_ioctl+0xdd/0x3f0 [xpmem]
[<ffffffff8110dade>] ? filemap_fault+0xbe/0x510
[<ffffffff8110c177>] ? unlock_page+0x27/0x30
[<ffffffff81135837>] ? handle_pte_fault+0xf7/0xad0
[<ffffffff811502a7>] ? alloc_pages_current+0x87/0xd0
[<ffffffff8117f182>] ? vfs_ioctl+0x22/0xa0
[<ffffffff81258ae5>] ? _atomic_dec_and_lock+0x55/0x80
[<ffffffff81013c8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff8117f324>] ? do_vfs_ioctl+0x84/0x580
[<ffffffff811363fd>] ? handle_mm_fault+0x1ed/0x2b0
[<ffffffff8117f8a1>] ? sys_ioctl+0x81/0xa0
[<ffffffff81013172>] ? system_call_fastpath+0x16/0x1b
So I went to look for commonality after "Call Trace:" and found little, with lots of possible places to check.
guest@globe:/cores/people/jhanson/noaa/softlockup/nodeswithsoftlockupconsoles> grep --binary-files=text -h -A1 "Call Trace" r* | sort | uniq
--
Call Trace:
[<ffffffff810117bc>] ? __switch_to+0x1ac/0x320
[<ffffffff81013ace>] ? common_interrupt+0xe/0x13
[<ffffffff81013b76>] retint_careful+0x14/0x32
[<ffffffff81013c8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff81013cee>] ? invalidate_interrupt1+0xe/0x20
[<ffffffff81013d4e>] ? invalidate_interrupt4+0xe/0x20
[<ffffffff81013d6e>] ? invalidate_interrupt5+0xe/0x20
[<ffffffff81014162>] ? kernel_thread+0x82/0xe0
[<ffffffff81014645>] ? math_state_restore+0x45/0x60
[<ffffffff8101660f>] ? dump_trace+0x1af/0x3a0
[<ffffffff8101a4f9>] ? read_tsc+0x9/0x20
[<ffffffff8104f61c>] ? enqueue_task+0x5c/0x70
[<ffffffff8104fff9>] ? __wake_up_common+0x59/0x90
[<ffffffff810507f8>] ? resched_task+0x68/0x80
[<ffffffff810508a5>] ? check_preempt_curr_idle+0x15/0x20
[<ffffffff81056303>] ? __wake_up+0x53/0x70
[<ffffffff81056630>] ? __dequeue_entity+0x30/0x50
[<ffffffff81059d12>] ? finish_task_switch+0x42/0xd0
[<ffffffff8105a808>] ? pull_task+0x58/0x70
[<ffffffff8105c490>] ? default_wake_function+0x0/0x20
[<ffffffff8105c4a2>] ? default_wake_function+0x12/0x20
[<ffffffff8105c4e5>] ? wake_up_process+0x15/0x20
[<ffffffff8105c756>] ? update_curr+0xe6/0x1e0
[<ffffffff8105fa72>] ? enqueue_entity+0x122/0x320
[<ffffffff8105fcb3>] ? enqueue_task_fair+0x43/0x90
[<ffffffff81061b71>] ? dequeue_entity+0x1a1/0x1e0
[<ffffffff81062b84>] ? find_busiest_group+0x254/0xb40
[<ffffffff8106329a>] ? find_busiest_group+0x96a/0xb40
[<ffffffff81066d6e>] ? select_task_rq_fair+0x9ee/0xab0
[<ffffffff810670c1>] ? check_preempt_wakeup+0x41/0x3c0
[<ffffffff81067244>] ? check_preempt_wakeup+0x1c4/0x3c0
[<ffffffff81067732>] migration_thread+0x1d2/0x310
[<ffffffff81069207>] ? dup_mm+0x2a7/0x520
[<ffffffff8106b857>] warn_slowpath_common+0x87/0xc0
[<ffffffff8106b9f5>] ? __call_console_drivers+0x75/0x90
[<ffffffff8106d0a1>] do_syslog+0x461/0x4c0
[<ffffffff8106f805>] do_wait+0x1c5/0x250
[<ffffffff8107064f>] do_exit+0x56f/0x820
[<ffffffff810737a5>] ksoftirqd+0xd5/0x110
[<ffffffff8107d5ac>] ? lock_timer_base+0x3c/0x70
[<ffffffff8107e616>] ? mod_timer+0x146/0x230
[<ffffffff8107e718>] ? add_timer+0x18/0x30
[<ffffffff8108ac20>] ? __call_usermodehelper+0x0/0xa0
[<ffffffff8108c4a0>] ? worker_thread+0x0/0x2a0
[<ffffffff8108cc82>] ? queue_work_on+0x42/0x60
[<ffffffff81091cb6>] ? autoremove_wake_function+0x16/0x40
[<ffffffff81091eae>] ? prepare_to_wait_exclusive+0x4e/0x80
[<ffffffff81091f8e>] ? prepare_to_wait+0x4e/0x80
[<ffffffff81095da3>] ? __hrtimer_start_range_ns+0x1a3/0x430
[<ffffffff8109638a>] ? down_read_trylock+0x1a/0x30
[<ffffffff81096bff>] ? up+0x2f/0x50
[<ffffffff81098f05>] async_manager_thread+0xc5/0x100
[<ffffffff8109b9a9>] ? ktime_get_ts+0xa9/0xe0
[<ffffffff810a25a9>] futex_wait_queue_me+0xb9/0xf0
[<ffffffff810a666b>] ? rt_mutex_adjust_pi+0x7b/0x90
[<ffffffff810c2b01>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff810c2b01>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff810ca7b6>] ? audit_hold_skb+0x26/0x50
[<ffffffff810cab7b>] ? kauditd_send_skb+0x3b/0x90
[<ffffffff810d3d4b>] ? audit_syscall_exit+0x25b/0x290
[<ffffffff8110351b>] slow_work_thread+0x32b/0x3a0
[<ffffffff81108047>] ? perf_event_exit_task+0x37/0x160
[<ffffffff8110b832>] ? iov_iter_copy_from_user_atomic+0x92/0x130
[<ffffffff8110bb70>] ? find_get_pages_tag+0x40/0x120
[<ffffffff8110c060>] ? sync_page+0x0/0x50
[<ffffffff8110c0b0>] ? sync_page_killable+0x0/0x40
[<ffffffff8110eecb>] oom_kill_process+0xcb/0x2e0
[<ffffffff8111b3a5>] ? __rmqueue+0xc5/0x490
[<ffffffff8111bd57>] bad_page+0x107/0x160
[<ffffffff8111cf91>] ? get_page_from_freelist+0x3d1/0x820
[<ffffffff8111e1c6>] ? __alloc_pages_nodemask+0xf6/0x810
[<ffffffff8111e48d>] ? __alloc_pages_nodemask+0x3bd/0x810
[<ffffffff8111e745>] __alloc_pages_nodemask+0x675/0x810
[<ffffffff8111f78a>] ? determine_dirtyable_memory+0x1a/0x30
[<ffffffff81120951>] ? do_writepages+0x21/0x40
[<ffffffff8112bc27>] ? vma_prio_tree_next+0x47/0x70
[<ffffffff8112d14d>] ? zone_statistics+0x7d/0xa0
[<ffffffff8112d980>] ? vmstat_update+0x0/0x40
[<ffffffff8112de70>] ? bdi_sync_supers+0x0/0x60
[<ffffffff811336b5>] ? unmap_vmas+0xa85/0xc00
[<ffffffff811345a2>] ? unmap_mapping_range+0x72/0x150
[<ffffffff81135a85>] ? handle_pte_fault+0x345/0xad0
[<ffffffff81136455>] ? handle_mm_fault+0x245/0x2b0
[<ffffffff81139582>] ? unlink_file_vma+0x42/0x70
[<ffffffff8113e59d>] ? rmap_walk+0x7d/0x1c0
[<ffffffff8113f2de>] ? page_referenced+0x9e/0x2f0
[<ffffffff8113fb72>] ? try_to_unmap_file+0x42/0x750
[<ffffffff81156007>] ? cache_grow+0x217/0x320
[<ffffffff811560bf>] ? cache_grow+0x2cf/0x320
[<ffffffff81157e51>] ? drain_array+0xe1/0x100
[<ffffffff81158d38>] ? drain_freelist+0x78/0xc0
[<ffffffff81158d80>] ? cache_reap+0x0/0x260
[<ffffffff8115fe28>] ? __mem_cgroup_uncharge_common+0x78/0x260
[<ffffffff81161c89>] ? mem_cgroup_charge_common+0x99/0xc0
[<ffffffff81165218>] khugepaged+0x958/0x1190
[<ffffffff8116c65a>] ? do_sync_read+0xfa/0x140
[<ffffffff81175fdb>] pipe_wait+0x5b/0x80
[<ffffffff81258839>] ? cpumask_next_and+0x29/0x50
[<ffffffff81262a54>] ? vsnprintf+0x484/0x5f0
[<ffffffff81264025>] ? memmove+0x45/0x50
[<ffffffff812fcaa0>] ? flush_to_ldisc+0x0/0x1b0
[<ffffffff812fee81>] vt_event_wait+0xa1/0x100
[<ffffffff8137fe39>] hub_thread+0x369/0x17f0
[<ffffffff8138a164>] ? usb_suspend_both+0x1a4/0x320
[<ffffffff814277d0>] ? eth_type_trans+0x40/0x140
[<ffffffff81445e95>] ? ip_local_out+0x25/0x30
[<ffffffff8144e7e6>] ? tcp_sendmsg+0x756/0xa30
[<ffffffff8149b2d6>] ? unix_stream_sendmsg+0x3c6/0x3e0
[<ffffffff814c7b23>] panic+0x78/0x137
[<ffffffff814c8286>] ? thread_return+0x4e/0x778
[<ffffffff814c8b00>] ? _cond_resched+0x30/0x40
[<ffffffff814c8c5c>] ? wait_for_common+0x14c/0x180
[<ffffffff814c8d4d>] ? wait_for_completion+0x1d/0x20
[<ffffffff814c8f34>] schedule_timeout+0x194/0x2f0
[<ffffffff814c8f3c>] ? schedule_timeout+0x19c/0x2f0
[<ffffffff814c8fc5>] schedule_timeout+0x225/0x2f0
[<ffffffff814c96e0>] ? __mutex_lock_slowpath+0x70/0x180
[<ffffffff814c97ae>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff814c9ad8>] schedule_hrtimeout_range+0xc8/0x160
[<ffffffff814c9b4d>] schedule_hrtimeout_range+0x13d/0x160
[<ffffffff814c9c1b>] do_nanosleep+0x8b/0xc0
[<ffffffff814ca6b5>] rwsem_down_failed_common+0x95/0x1d0
[<ffffffff814cac1b>] ? _spin_unlock_bh+0x1b/0x20
[<ffffffff814cd766>] ? notifier_call_chain+0x16/0x80
[<ffffffffa00a78be>] ? __put_nfs_open_context+0x3e/0xc0 [nfs]
[<ffffffffa00a9e10>] ? fib6_clean_node+0x0/0xd0 [ipv6]
[<ffffffffa00b0540>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
[<ffffffffa01407fd>] ? call_transmit_status+0x4d/0xe0 [sunrpc]
[<ffffffffa01433e9>] ? xprt_release_xprt+0x89/0x90 [sunrpc]
[<ffffffffa01435bf>] ? xprt_reserve+0x1cf/0x1f0 [sunrpc]
[<ffffffffa01444a0>] ? xprt_autoclose+0x0/0x70 [sunrpc]
[<ffffffffa0146210>] ? xs_tcp_connect_worker4+0x0/0x30 [sunrpc]
[<ffffffffa01488a0>] ? rpc_async_release+0x0/0x20 [sunrpc]
[<ffffffffa0148d00>] ? rpc_wait_bit_killable+0x0/0x40 [sunrpc]
[<ffffffffa0149760>] ? rpc_async_schedule+0x0/0x20 [sunrpc]
[<ffffffffa01e68be>] ? __put_nfs_open_context+0x3e/0xc0 [nfs]
[<ffffffffa01e7560>] ? nfs_wait_bit_killable+0x0/0x40 [nfs]
[<ffffffffa01ef540>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
[<ffffffffa01f40cd>] ? nfs_commit_free+0x3d/0x50 [nfs]
[<ffffffffa01f4688>] ? nfs_writeback_release_full+0x128/0x1b0 [nfs]
[<ffffffffa01fe3a5>] xpmem_clear_PFNtable+0x185/0x340 [xpmem]
[<ffffffffa02467b0>] ? process_req+0x0/0x1a0 [ib_addr]
[<ffffffffa02745ae>] ? mlx4_ib_post_send+0x4be/0xf10 [mlx4_ib]
[<ffffffffa02a80cd>] ? mcast_work_handler+0xed/0x830 [ib_sa]
[<ffffffffa030073a>] xpmem_make+0x9a/0x360 [xpmem]
[<ffffffffa0304fb1>] ? xpmem_tg_ref_by_tgid+0x41/0xe0 [xpmem]
[<ffffffffa03054f1>] ? xpmem_PFNs_exist_in_range_l3+0x51/0xa0 [xpmem]
[<ffffffffa0308445>] xpmem_clear_PFNtable+0x185/0x340 [xpmem]
[<ffffffffa0309ec8>] ? xpmem_recall_PFNs_of_tg+0xf8/0x2d0 [xpmem]
[<ffffffffa030a40b>] xpmem_pgcl_thread+0x1db/0x220 [xpmem]
[<ffffffffa0320ab2>] lcw_dispatch_main+0xd2/0x400 [libcfs]
[<ffffffffa0353b8b>] ? mlx4_ib_poll_cq+0x2ab/0x780 [mlx4_ib]
[<ffffffffa0379c9d>] ? LNetMDAttach+0x35d/0x4c0 [lnet]
[<ffffffffa03dbc5a>] obd_zombie_impexp_thread+0x15a/0x2b0 [obdclass]
[<ffffffffa046a330>] ? ipoib_reap_ah+0x0/0x50 [ib_ipoib]
[<ffffffffa04e6c3a>] ? kiblnd_queue_tx+0x4a/0x60 [ko2iblnd]
[<ffffffffa04f3eb6>] ? loi_list_maint+0xa6/0x130 [osc]
[<ffffffffa050fb64>] ? cache_add_extent+0x134/0x640 [osc]
[<ffffffffa056efd0>] ? ib_mad_completion_handler+0x0/0x810 [ib_mad]
[<ffffffffa057a492>] ? cm_process_work+0x32/0x110 [ib_cm]
[<ffffffffa057bcff>] ? cm_rep_handler+0x31f/0x590 [ib_cm]
[<ffffffffa057bf70>] ? cm_work_handler+0x0/0x11d6 [ib_cm]
[<ffffffffa0584330>] ? cma_work_handler+0x0/0xb0 [rdma_cm]
[<ffffffffa059fc81>] ? kiblnd_init_tx_msg+0x91/0x200 [ko2iblnd]
[<ffffffffa05a4465>] kiblnd_scheduler+0x325/0x760 [ko2iblnd]
[<ffffffffa05bafed>] ? ldlm_lock_put+0x19d/0x450 [ptlrpc]
[<ffffffffa05bffb1>] ? ldlm_lock_decref+0x41/0xb0 [ptlrpc]
[<ffffffffa05c0af3>] ? ldlm_resource_putref_internal+0xb3/0x4c0 [ptlrpc]
[<ffffffffa05e3397>] ? ldlm_callback_handler+0xa57/0x1e10 [ptlrpc]
[<ffffffffa05e6140>] ldlm_bl_thread_main+0x3f0/0x440 [ptlrpc]
[<ffffffffa060d1d0>] ptlrpc_wait_event+0x3b0/0x3c0 [ptlrpc]
[<ffffffffa060e6a7>] ? lov_merge_lvb+0xb7/0x240 [lov]
[<ffffffffa0684ac2>] ? ll_removepage+0x352/0x8d0 [lustre]
[<ffffffffa0695c9c>] ? ll_file_mmap+0x12c/0x180 [lustre]
[<ffffffffa06ef6a7>] ? lov_merge_lvb+0xb7/0x240 [lov]
[<ffffffffa06f20f5>] ? lov_finish_set+0x435/0x710 [lov]
[<ffffffffa07056a7>] ? lov_merge_lvb+0xb7/0x240 [lov]
[<ffffffffa073f1a4>] ll_close_thread+0x124/0x260 [lustre]
[<ffffffffa075aac2>] ? ll_removepage+0x352/0x8d0 [lustre]
[<ffffffffa09d7c9c>] ? ll_file_mmap+0x12c/0x180 [lustre]
<IRQ>
<IRQ> [<ffffffff8106b857>] warn_slowpath_common+0x87/0xc0
<IRQ> [<ffffffff810d8740>] ? handle_IRQ_event+0x60/0x170
<IRQ> [<ffffffff814c7b23>] panic+0x78/0x137
It is probably not unexpected that there are many places, because:
guest@globe:/cores/people/jhanson/noaa/softlockup/nodeswithsoftlockupconsoles> grep --binary-files=text -h -A1 "Call Trace" r* | wc -l
372554
In the history of this cluster (as reflected in the console logs) we have had "BUG: soft lockup" 119496 times.
There are a wide variety of places where the back trace starts, but the two most dominant are:
grep --binary-files=text -h -A1 "Call Trace" r* | grep -v "Call Trace" | grep -v ^- | grep unmap_mapping_range | wc -l
49558
grep --binary-files=text -h -A1 "Call Trace" r* | grep -v "Call Trace" | grep -v ^- | grep xpmem_tg_ref_by_tgid | wc -l
30446
After the first function, the dominant traces start to diverge. For unmap_mapping_range:
grep --binary-files=text -h -A1 "unmap_mapping_range" r* | sort | uniq
--
[<ffffffff81013cce>] ? invalidate_interrupt0+0xe/0x20
[<ffffffff810ddc95>] ? call_rcu_sched+0x15/0x20
[<ffffffff811343b4>] unmap_mapping_range_vma+0x64/0xf0
[<ffffffff811343ea>] ? unmap_mapping_range_vma+0x9a/0xf0
[<ffffffff811344d7>] ? unmap_mapping_range_tree+0x97/0xf0
[<ffffffff811344d7>] unmap_mapping_range_tree+0x97/0xf0
[<ffffffff811345a2>] ? unmap_mapping_range+0x72/0x150
[<ffffffff811345a2>] unmap_mapping_range+0x72/0x150
[<ffffffff81134661>] ? unmap_mapping_range+0x131/0x150
[<ffffffff81134661>] unmap_mapping_range+0x131/0x150
[<ffffffff814caa3e>] ? _spin_lock+0x1e/0x30
[<ffffffff814caa41>] ? _spin_lock+0x21/0x30
[<ffffffffa01fb4f1>] ? xpmem_PFNs_exist_in_range_l3+0x51/0xa0 [xpmem]
[<ffffffffa042231c>] ? ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa042231c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa069631c>] ? ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa069631c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa076c31c>] ? ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa076c31c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa09d831c>] ? ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa09d831c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa0ac631c>] ? ll_teardown_mmaps+0x6c/0x1c0 [lustre]
[<ffffffffa0ac631c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre]
For xpmem_tg_ref_by_tgid, the only function that follows is get_task_mm.
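The per-function tallying above can also be done in one pass, counting every distinct first frame after "Call Trace:" rather than grepping for each candidate. A minimal sketch, assuming the console logs are plain-text files named r*; the directory name and the two tiny sample logs below are hypothetical stand-ins so the pipeline can be run on its own:

```shell
#!/bin/sh
# Sketch: tally the first frame after each "Call Trace:" across console logs.
# /tmp/softlockup_demo and the r1/r2 contents are made-up sample data; on the
# real system you would run the final pipeline in the directory holding r*.
mkdir -p /tmp/softlockup_demo && cd /tmp/softlockup_demo

cat > r1 <<'EOF'
Call Trace:
 [<ffffffff811345a2>] unmap_mapping_range+0x72/0x150
Call Trace:
 [<ffffffffa0304fb1>] xpmem_tg_ref_by_tgid+0x41/0xe0 [xpmem]
EOF
cat > r2 <<'EOF'
Call Trace:
 [<ffffffff811345a2>] unmap_mapping_range+0x72/0x150
EOF

# Take the line after each "Call Trace:", drop grep's "--" group separators,
# strip the "[<address>]" prefix and the "+offset/size [module]" suffix,
# then count how often each leading function appears.
grep --binary-files=text -h -A1 "Call Trace" r* \
  | grep -v "Call Trace" | grep -v '^--' \
  | sed 's/.*\] //; s/+.*//' \
  | sort | uniq -c | sort -rn
```

On the sample data this prints unmap_mapping_range with count 2 and xpmem_tg_ref_by_tgid with count 1, mirroring the ranking the customer found on the full logs.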
This doesn't appear to be the same as LU-93, which was causing the client to crash.
In this case, it looks like all of the threads are stuck in ll_teardown_mmaps->unmap_mapping_range() because the node is trying to free memory under memory pressure.
This is a somewhat unusual workload for Lustre, because while mmap IO is functional, it is quite inefficient (single page RPCs) and rarely used.
Has this application been running in the past on Lustre? Are there any changes in the environment that might have caused the application to start failing (e.g. kernel, Lustre, or application upgrade)?
Customer just asked me to bump up the priority on this one. They just reported that this issue has caused hundreds of nodes to become unresponsive on their system.
OK, thanks for the update, Dennis.