Client Panic on Lustre 1.8.6 and RHEL 6

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 1.8.6
    • None
    • RHEL 6 2.6.32-71.el6.x86_64 kernel
    • 3
    • 6447

    Description

      The customer reports that a few compute nodes have been panicking. They have seen the behavior on 7 nodes, and each node has hit the problem numerous times. It looks like it may be similar to LU-93. I'd like Whamcloud to weigh in on whether this is related to LU-93 or is another known issue. The tracebacks and console messages are attached.

          Activity

            [LU-1138] Client Panic on Lustre 1.8.6 and RHEL 6
            pjones Peter Jones added a comment -

            ok thanks for the update Dennis


            dnelson@ddn.com Dennis Nelson added a comment -

            Just looking at open cases. Customer found this was not a Lustre issue after all. I believe that they upgraded the kernel to fix the issue. Please close this.
            bobijam Zhenyu Xu added a comment -

            When did this start happening? Did it begin after switching to RHEL 6, after upgrading from an older Lustre release to 1.8.6, or after starting to use a specific kernel version or other software?

            pjones Peter Jones added a comment -

            Bobi

            Andreas is rather busy at the moment so could you please review and comment on this latest information from our customer?

            Thanks

            Peter

            dnelson@ddn.com Dennis Nelson added a comment -

            I received the following from the customer today:

            Please ask WC to stand down on it being P1. We found Lustre in a sample
            trace, so we went with that. Once we started looking at all of the traces, Lustre is present in SOME of
            the stack traces, but it is not in the most common ones. I would appreciate it if Andreas
            could have a look at some more stack traces to see if there is anything
            he's seen before, though.

            ftp://shell.sgi.com/collect/jhanson/nodeswithsoftlockupconsoles.tar.bz2

            What I've found by looking at these

            Once there is a BUG: soft lockup the next lines are like this (example chosen at random)

            BUG: soft lockup - CPU#0 stuck for 61s! [global_fcst:30024]
            Modules linked in: acpi_cpufreq freq_table mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad mlx4_ib iw_cxgb3 ko2iblnd(U) rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr obdclass(U) lnet(U) lvfs(U) libcfs(U) xpmem(U) xp gru xvma(U) numatools(U) microcode serio_raw i2c_i801 i2c_core iTCO_wdt
            iTCO_vendor_support ioatdma ahci mlx4_en mlx4_core igb dca dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi qla4xxx scsi_transport_iscsi [last unloaded: ipmi_msghandler]
            CPU 0:
            Modules linked in: acpi_cpufreq freq_table mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad mlx4_ib iw_cxgb3 ko2iblnd(U) rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr obdclass(U) lnet(U) lvfs(U) libcfs(U) xpmem(U) xp gru xvma(U) numatools(U) microcode serio_raw i2c_i801 i2c_core iTCO_wdt
            iTCO_vendor_support ioatdma ahci mlx4_en mlx4_core igb dca dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi qla4xxx scsi_transport_iscsi [last unloaded: ipmi_msghandler]
            Pid: 30024, comm: global_fcst Tainted: G W ---------------- 2.6.32-71.el6.x86_64 #1 AltixICE8400IP105
            RIP: 0010:[<ffffffff814caa3e>] [<ffffffff814caa3e>] _spin_lock+0x1e/0x30
            RSP: 0018:ffff8802e9b3fc38 EFLAGS: 00000297
            RAX: 000000000000e364 RBX: ffff8802e9b3fc38 RCX: ffff8804b764de80
            RDX: 0000000000000000 RSI: ffff88033d53d208 RDI: ffff880637837268
            RBP: ffffffff81013c8e R08: ffff8802e9b3fe10 R09: 0000000000100000
            R10: 00007fffffff2dc0 R11: 0000000000000213 R12: ffff88033b712100
            R13: ffffffff817300c0 R14: ffff88033b7126b8 R15: 0000000000010518
            FS: 00002aaaaf3e0800(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
            CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
            CR2: 00002aaaae8f0840 CR3: 000000033ca90000 CR4: 00000000000006f0
            DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            Call Trace:
            [<ffffffffa0304fb1>] ? xpmem_tg_ref_by_tgid+0x41/0xe0 [xpmem]
            [<ffffffff81068598>] ? get_task_mm+0x28/0x70
            [<ffffffffa030073a>] ? xpmem_make+0x9a/0x360 [xpmem]
            [<ffffffff8110c037>] ? __lock_page+0x67/0x70
            [<ffffffffa02ff19d>] ? xpmem_ioctl+0xdd/0x3f0 [xpmem]
            [<ffffffff8110dade>] ? filemap_fault+0xbe/0x510
            [<ffffffff8110c177>] ? unlock_page+0x27/0x30
            [<ffffffff81135837>] ? handle_pte_fault+0xf7/0xad0
            [<ffffffff811502a7>] ? alloc_pages_current+0x87/0xd0
            [<ffffffff8117f182>] ? vfs_ioctl+0x22/0xa0
            [<ffffffff81258ae5>] ? _atomic_dec_and_lock+0x55/0x80
            [<ffffffff81013c8e>] ? apic_timer_interrupt+0xe/0x20
            [<ffffffff8117f324>] ? do_vfs_ioctl+0x84/0x580
            [<ffffffff811363fd>] ? handle_mm_fault+0x1ed/0x2b0
            [<ffffffff8117f8a1>] ? sys_ioctl+0x81/0xa0
            [<ffffffff81013172>] ? system_call_fastpath+0x16/0x1b

            So I went to look for commonality after Call Trace: and found little with lots
            of possible places to check.

            guest@globe:/cores/people/jhanson/noaa/softlockup/nodeswithsoftlockupconsoles> grep --binary-files=text -h -A1 "Call Trace" r* | sort | uniq

            Call Trace:
            [<ffffffff810117bc>] ? __switch_to+0x1ac/0x320
            [<ffffffff81013ace>] ? common_interrupt+0xe/0x13
            [<ffffffff81013b76>] retint_careful+0x14/0x32
            [<ffffffff81013c8e>] ? apic_timer_interrupt+0xe/0x20
            [<ffffffff81013cee>] ? invalidate_interrupt1+0xe/0x20
            [<ffffffff81013d4e>] ? invalidate_interrupt4+0xe/0x20
            [<ffffffff81013d6e>] ? invalidate_interrupt5+0xe/0x20
            [<ffffffff81014162>] ? kernel_thread+0x82/0xe0
            [<ffffffff81014645>] ? math_state_restore+0x45/0x60
            [<ffffffff8101660f>] ? dump_trace+0x1af/0x3a0
            [<ffffffff8101a4f9>] ? read_tsc+0x9/0x20
            [<ffffffff8104f61c>] ? enqueue_task+0x5c/0x70
            [<ffffffff8104fff9>] ? __wake_up_common+0x59/0x90
            [<ffffffff810507f8>] ? resched_task+0x68/0x80
            [<ffffffff810508a5>] ? check_preempt_curr_idle+0x15/0x20
            [<ffffffff81056303>] ? __wake_up+0x53/0x70
            [<ffffffff81056630>] ? __dequeue_entity+0x30/0x50
            [<ffffffff81059d12>] ? finish_task_switch+0x42/0xd0
            [<ffffffff8105a808>] ? pull_task+0x58/0x70
            [<ffffffff8105c490>] ? default_wake_function+0x0/0x20
            [<ffffffff8105c4a2>] ? default_wake_function+0x12/0x20
            [<ffffffff8105c4e5>] ? wake_up_process+0x15/0x20
            [<ffffffff8105c756>] ? update_curr+0xe6/0x1e0
            [<ffffffff8105fa72>] ? enqueue_entity+0x122/0x320
            [<ffffffff8105fcb3>] ? enqueue_task_fair+0x43/0x90
            [<ffffffff81061b71>] ? dequeue_entity+0x1a1/0x1e0
            [<ffffffff81062b84>] ? find_busiest_group+0x254/0xb40
            [<ffffffff8106329a>] ? find_busiest_group+0x96a/0xb40
            [<ffffffff81066d6e>] ? select_task_rq_fair+0x9ee/0xab0
            [<ffffffff810670c1>] ? check_preempt_wakeup+0x41/0x3c0
            [<ffffffff81067244>] ? check_preempt_wakeup+0x1c4/0x3c0
            [<ffffffff81067732>] migration_thread+0x1d2/0x310
            [<ffffffff81069207>] ? dup_mm+0x2a7/0x520
            [<ffffffff8106b857>] warn_slowpath_common+0x87/0xc0
            [<ffffffff8106b9f5>] ? __call_console_drivers+0x75/0x90
            [<ffffffff8106d0a1>] do_syslog+0x461/0x4c0
            [<ffffffff8106f805>] do_wait+0x1c5/0x250
            [<ffffffff8107064f>] do_exit+0x56f/0x820
            [<ffffffff810737a5>] ksoftirqd+0xd5/0x110
            [<ffffffff8107d5ac>] ? lock_timer_base+0x3c/0x70
            [<ffffffff8107e616>] ? mod_timer+0x146/0x230
            [<ffffffff8107e718>] ? add_timer+0x18/0x30
            [<ffffffff8108ac20>] ? __call_usermodehelper+0x0/0xa0
            [<ffffffff8108c4a0>] ? worker_thread+0x0/0x2a0
            [<ffffffff8108cc82>] ? queue_work_on+0x42/0x60
            [<ffffffff81091cb6>] ? autoremove_wake_function+0x16/0x40
            [<ffffffff81091eae>] ? prepare_to_wait_exclusive+0x4e/0x80
            [<ffffffff81091f8e>] ? prepare_to_wait+0x4e/0x80
            [<ffffffff81095da3>] ? __hrtimer_start_range_ns+0x1a3/0x430
            [<ffffffff8109638a>] ? down_read_trylock+0x1a/0x30
            [<ffffffff81096bff>] ? up+0x2f/0x50
            [<ffffffff81098f05>] async_manager_thread+0xc5/0x100
            [<ffffffff8109b9a9>] ? ktime_get_ts+0xa9/0xe0
            [<ffffffff810a25a9>] futex_wait_queue_me+0xb9/0xf0
            [<ffffffff810a666b>] ? rt_mutex_adjust_pi+0x7b/0x90
            [<ffffffff810c2b01>] ? cpuset_print_task_mems_allowed+0x91/0xb0
            [<ffffffff810c2b01>] ? cpuset_print_task_mems_allowed+0x91/0xb0
            [<ffffffff810ca7b6>] ? audit_hold_skb+0x26/0x50
            [<ffffffff810cab7b>] ? kauditd_send_skb+0x3b/0x90
            [<ffffffff810d3d4b>] ? audit_syscall_exit+0x25b/0x290
            [<ffffffff8110351b>] slow_work_thread+0x32b/0x3a0
            [<ffffffff81108047>] ? perf_event_exit_task+0x37/0x160
            [<ffffffff8110b832>] ? iov_iter_copy_from_user_atomic+0x92/0x130
            [<ffffffff8110bb70>] ? find_get_pages_tag+0x40/0x120
            [<ffffffff8110c060>] ? sync_page+0x0/0x50
            [<ffffffff8110c0b0>] ? sync_page_killable+0x0/0x40
            [<ffffffff8110eecb>] oom_kill_process+0xcb/0x2e0
            [<ffffffff8111b3a5>] ? __rmqueue+0xc5/0x490
            [<ffffffff8111bd57>] bad_page+0x107/0x160
            [<ffffffff8111cf91>] ? get_page_from_freelist+0x3d1/0x820
            [<ffffffff8111e1c6>] ? __alloc_pages_nodemask+0xf6/0x810
            [<ffffffff8111e48d>] ? __alloc_pages_nodemask+0x3bd/0x810
            [<ffffffff8111e745>] __alloc_pages_nodemask+0x675/0x810
            [<ffffffff8111f78a>] ? determine_dirtyable_memory+0x1a/0x30
            [<ffffffff81120951>] ? do_writepages+0x21/0x40
            [<ffffffff8112bc27>] ? vma_prio_tree_next+0x47/0x70
            [<ffffffff8112d14d>] ? zone_statistics+0x7d/0xa0
            [<ffffffff8112d980>] ? vmstat_update+0x0/0x40
            [<ffffffff8112de70>] ? bdi_sync_supers+0x0/0x60
            [<ffffffff811336b5>] ? unmap_vmas+0xa85/0xc00
            [<ffffffff811345a2>] ? unmap_mapping_range+0x72/0x150
            [<ffffffff81135a85>] ? handle_pte_fault+0x345/0xad0
            [<ffffffff81136455>] ? handle_mm_fault+0x245/0x2b0
            [<ffffffff81139582>] ? unlink_file_vma+0x42/0x70
            [<ffffffff8113e59d>] ? rmap_walk+0x7d/0x1c0
            [<ffffffff8113f2de>] ? page_referenced+0x9e/0x2f0
            [<ffffffff8113fb72>] ? try_to_unmap_file+0x42/0x750
            [<ffffffff81156007>] ? cache_grow+0x217/0x320
            [<ffffffff811560bf>] ? cache_grow+0x2cf/0x320
            [<ffffffff81157e51>] ? drain_array+0xe1/0x100
            [<ffffffff81158d38>] ? drain_freelist+0x78/0xc0
            [<ffffffff81158d80>] ? cache_reap+0x0/0x260
            [<ffffffff8115fe28>] ? __mem_cgroup_uncharge_common+0x78/0x260
            [<ffffffff81161c89>] ? mem_cgroup_charge_common+0x99/0xc0
            [<ffffffff81165218>] khugepaged+0x958/0x1190
            [<ffffffff8116c65a>] ? do_sync_read+0xfa/0x140
            [<ffffffff81175fdb>] pipe_wait+0x5b/0x80
            [<ffffffff81258839>] ? cpumask_next_and+0x29/0x50
            [<ffffffff81262a54>] ? vsnprintf+0x484/0x5f0
            [<ffffffff81264025>] ? memmove+0x45/0x50
            [<ffffffff812fcaa0>] ? flush_to_ldisc+0x0/0x1b0
            [<ffffffff812fee81>] vt_event_wait+0xa1/0x100
            [<ffffffff8137fe39>] hub_thread+0x369/0x17f0
            [<ffffffff8138a164>] ? usb_suspend_both+0x1a4/0x320
            [<ffffffff814277d0>] ? eth_type_trans+0x40/0x140
            [<ffffffff81445e95>] ? ip_local_out+0x25/0x30
            [<ffffffff8144e7e6>] ? tcp_sendmsg+0x756/0xa30
            [<ffffffff8149b2d6>] ? unix_stream_sendmsg+0x3c6/0x3e0
            [<ffffffff814c7b23>] panic+0x78/0x137
            [<ffffffff814c8286>] ? thread_return+0x4e/0x778
            [<ffffffff814c8b00>] ? _cond_resched+0x30/0x40
            [<ffffffff814c8c5c>] ? wait_for_common+0x14c/0x180
            [<ffffffff814c8d4d>] ? wait_for_completion+0x1d/0x20
            [<ffffffff814c8f34>] schedule_timeout+0x194/0x2f0
            [<ffffffff814c8f3c>] ? schedule_timeout+0x19c/0x2f0
            [<ffffffff814c8fc5>] schedule_timeout+0x225/0x2f0
            [<ffffffff814c96e0>] ? __mutex_lock_slowpath+0x70/0x180
            [<ffffffff814c97ae>] __mutex_lock_slowpath+0x13e/0x180
            [<ffffffff814c9ad8>] schedule_hrtimeout_range+0xc8/0x160
            [<ffffffff814c9b4d>] schedule_hrtimeout_range+0x13d/0x160
            [<ffffffff814c9c1b>] do_nanosleep+0x8b/0xc0
            [<ffffffff814ca6b5>] rwsem_down_failed_common+0x95/0x1d0
            [<ffffffff814cac1b>] ? _spin_unlock_bh+0x1b/0x20
            [<ffffffff814cd766>] ? notifier_call_chain+0x16/0x80
            [<ffffffffa00a78be>] ? __put_nfs_open_context+0x3e/0xc0 [nfs]
            [<ffffffffa00a9e10>] ? fib6_clean_node+0x0/0xd0 [ipv6]
            [<ffffffffa00b0540>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
            [<ffffffffa01407fd>] ? call_transmit_status+0x4d/0xe0 [sunrpc]
            [<ffffffffa01433e9>] ? xprt_release_xprt+0x89/0x90 [sunrpc]
            [<ffffffffa01435bf>] ? xprt_reserve+0x1cf/0x1f0 [sunrpc]
            [<ffffffffa01444a0>] ? xprt_autoclose+0x0/0x70 [sunrpc]
            [<ffffffffa0146210>] ? xs_tcp_connect_worker4+0x0/0x30 [sunrpc]
            [<ffffffffa01488a0>] ? rpc_async_release+0x0/0x20 [sunrpc]
            [<ffffffffa0148d00>] ? rpc_wait_bit_killable+0x0/0x40 [sunrpc]
            [<ffffffffa0149760>] ? rpc_async_schedule+0x0/0x20 [sunrpc]
            [<ffffffffa01e68be>] ? __put_nfs_open_context+0x3e/0xc0 [nfs]
            [<ffffffffa01e7560>] ? nfs_wait_bit_killable+0x0/0x40 [nfs]
            [<ffffffffa01ef540>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
            [<ffffffffa01f40cd>] ? nfs_commit_free+0x3d/0x50 [nfs]
            [<ffffffffa01f4688>] ? nfs_writeback_release_full+0x128/0x1b0 [nfs]
            [<ffffffffa01fe3a5>] xpmem_clear_PFNtable+0x185/0x340 [xpmem]
            [<ffffffffa02467b0>] ? process_req+0x0/0x1a0 [ib_addr]
            [<ffffffffa02745ae>] ? mlx4_ib_post_send+0x4be/0xf10 [mlx4_ib]
            [<ffffffffa02a80cd>] ? mcast_work_handler+0xed/0x830 [ib_sa]
            [<ffffffffa030073a>] xpmem_make+0x9a/0x360 [xpmem]
            [<ffffffffa0304fb1>] ? xpmem_tg_ref_by_tgid+0x41/0xe0 [xpmem]
            [<ffffffffa03054f1>] ? xpmem_PFNs_exist_in_range_l3+0x51/0xa0 [xpmem]
            [<ffffffffa0308445>] xpmem_clear_PFNtable+0x185/0x340 [xpmem]
            [<ffffffffa0309ec8>] ? xpmem_recall_PFNs_of_tg+0xf8/0x2d0 [xpmem]
            [<ffffffffa030a40b>] xpmem_pgcl_thread+0x1db/0x220 [xpmem]
            [<ffffffffa0320ab2>] lcw_dispatch_main+0xd2/0x400 [libcfs]
            [<ffffffffa0353b8b>] ? mlx4_ib_poll_cq+0x2ab/0x780 [mlx4_ib]
            [<ffffffffa0379c9d>] ? LNetMDAttach+0x35d/0x4c0 [lnet]
            [<ffffffffa03dbc5a>] obd_zombie_impexp_thread+0x15a/0x2b0 [obdclass]
            [<ffffffffa046a330>] ? ipoib_reap_ah+0x0/0x50 [ib_ipoib]
            [<ffffffffa04e6c3a>] ? kiblnd_queue_tx+0x4a/0x60 [ko2iblnd]
            [<ffffffffa04f3eb6>] ? loi_list_maint+0xa6/0x130 [osc]
            [<ffffffffa050fb64>] ? cache_add_extent+0x134/0x640 [osc]
            [<ffffffffa056efd0>] ? ib_mad_completion_handler+0x0/0x810 [ib_mad]
            [<ffffffffa057a492>] ? cm_process_work+0x32/0x110 [ib_cm]
            [<ffffffffa057bcff>] ? cm_rep_handler+0x31f/0x590 [ib_cm]
            [<ffffffffa057bf70>] ? cm_work_handler+0x0/0x11d6 [ib_cm]
            [<ffffffffa0584330>] ? cma_work_handler+0x0/0xb0 [rdma_cm]
            [<ffffffffa059fc81>] ? kiblnd_init_tx_msg+0x91/0x200 [ko2iblnd]
            [<ffffffffa05a4465>] kiblnd_scheduler+0x325/0x760 [ko2iblnd]
            [<ffffffffa05bafed>] ? ldlm_lock_put+0x19d/0x450 [ptlrpc]
            [<ffffffffa05bffb1>] ? ldlm_lock_decref+0x41/0xb0 [ptlrpc]
            [<ffffffffa05c0af3>] ? ldlm_resource_putref_internal+0xb3/0x4c0 [ptlrpc]
            [<ffffffffa05e3397>] ? ldlm_callback_handler+0xa57/0x1e10 [ptlrpc]
            [<ffffffffa05e6140>] ldlm_bl_thread_main+0x3f0/0x440 [ptlrpc]
            [<ffffffffa060d1d0>] ptlrpc_wait_event+0x3b0/0x3c0 [ptlrpc]
            [<ffffffffa060e6a7>] ? lov_merge_lvb+0xb7/0x240 [lov]
            [<ffffffffa0684ac2>] ? ll_removepage+0x352/0x8d0 [lustre]
            [<ffffffffa0695c9c>] ? ll_file_mmap+0x12c/0x180 [lustre]
            [<ffffffffa06ef6a7>] ? lov_merge_lvb+0xb7/0x240 [lov]
            [<ffffffffa06f20f5>] ? lov_finish_set+0x435/0x710 [lov]
            [<ffffffffa07056a7>] ? lov_merge_lvb+0xb7/0x240 [lov]
            [<ffffffffa073f1a4>] ll_close_thread+0x124/0x260 [lustre]
            [<ffffffffa075aac2>] ? ll_removepage+0x352/0x8d0 [lustre]
            [<ffffffffa09d7c9c>] ? ll_file_mmap+0x12c/0x180 [lustre]
            <IRQ>
            <IRQ> [<ffffffff8106b857>] warn_slowpath_common+0x87/0xc0
            <IRQ> [<ffffffff810d8740>] ? handle_IRQ_event+0x60/0x170
            <IRQ> [<ffffffff814c7b23>] panic+0x78/0x137

            It is probably not surprising that there are so many places, because
            guest@globe:/cores/people/jhanson/noaa/softlockup/nodeswithsoftlockupconsoles> grep --binary-files=text -h -A1 "Call Trace" r* | wc -l
            372554

            In the history of this cluster (as reflected in the console logs) we have had BUG: soft lockup 119496 times.
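A total count does not show which jobs were stuck. The sketch below tallies soft-lockup reports by the command name in the trailing `[comm:pid]` field, assuming every report follows the `BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]` layout of the example above. The function name `lockups_by_comm` is hypothetical and is not part of the customer's tooling.

```shell
#!/bin/sh
# Hypothetical helper: count soft-lockup reports per command name.
# Assumes console lines shaped like:
#   BUG: soft lockup - CPU#0 stuck for 61s! [global_fcst:30024]
# Usage: lockups_by_comm FILE...
lockups_by_comm() {
    grep --binary-files=text -h "BUG: soft lockup" "$@" |
        sed -n 's/^.*stuck for [0-9]*s! \[\(.*\):[0-9]*\].*$/\1/p' |
        sort | uniq -c | sort -rn | sed 's/^ *//'
}
```

Run from the directory holding the console logs, `lockups_by_comm r*` would print one `count command` line per command, most frequent first.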

            There are a wide variety of places where the back trace starts but the two most dominant are

            grep --binary-files=text -h -A1 "Call Trace" r* | grep -v "Call Trace" | grep -v ^- | grep unmap_mapping_range | wc -l
            49558
            grep --binary-files=text -h -A1 "Call Trace" r* | grep -v "Call Trace" | grep -v ^- | grep xpmem_tg_ref_by_tgid | wc -l
            30446
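Instead of grepping for one candidate function at a time, the same pipeline can rank every first frame by frequency. This is a minimal sketch, assuming frames are printed as `[<address>] [? ]function+0xOFF/0xLEN [module]` as in the traces above. The function name `rank_first_frames` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rank the first frame after each "Call Trace:" line.
# Stages: take the line after every "Call Trace", drop the header lines and
# grep's "--" group separators, strip address/offset/module so only the
# function name remains, then count and sort by frequency.
# Usage: rank_first_frames FILE...
rank_first_frames() {
    grep --binary-files=text -h -A1 "Call Trace" "$@" |
        grep -v "Call Trace" |
        grep -v -e '^--$' |
        sed -n 's/^.*\[<[0-9a-f]*>\] \(? \)\{0,1\}\([A-Za-z0-9_.]*\)+0x.*$/\2/p' |
        sort | uniq -c | sort -rn | sed 's/^ *//'
}
```

Something like `rank_first_frames r* | head` would list the dominant entry points in one pass. Lines that carry no symbol (for example a bare `<IRQ>`) are skipped by the `sed` filter, so totals can be slightly lower than a raw `wc -l`.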

            After the first function in the dominant ones it starts to diverge for unmap_mapping_range
            grep --binary-files=text -h -A1 "unmap_mapping_range" r* | sort | uniq

            [<ffffffff81013cce>] ? invalidate_interrupt0+0xe/0x20
            [<ffffffff810ddc95>] ? call_rcu_sched+0x15/0x20
            [<ffffffff811343b4>] unmap_mapping_range_vma+0x64/0xf0
            [<ffffffff811343ea>] ? unmap_mapping_range_vma+0x9a/0xf0
            [<ffffffff811344d7>] ? unmap_mapping_range_tree+0x97/0xf0
            [<ffffffff811344d7>] unmap_mapping_range_tree+0x97/0xf0
            [<ffffffff811345a2>] ? unmap_mapping_range+0x72/0x150
            [<ffffffff811345a2>] unmap_mapping_range+0x72/0x150
            [<ffffffff81134661>] ? unmap_mapping_range+0x131/0x150
            [<ffffffff81134661>] unmap_mapping_range+0x131/0x150
            [<ffffffff814caa3e>] ? _spin_lock+0x1e/0x30
            [<ffffffff814caa41>] ? _spin_lock+0x21/0x30
            [<ffffffffa01fb4f1>] ? xpmem_PFNs_exist_in_range_l3+0x51/0xa0 [xpmem]
            [<ffffffffa042231c>] ? ll_teardown_mmaps+0x6c/0x1c0 [lustre]
            [<ffffffffa042231c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre]
            [<ffffffffa069631c>] ? ll_teardown_mmaps+0x6c/0x1c0 [lustre]
            [<ffffffffa069631c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre]
            [<ffffffffa076c31c>] ? ll_teardown_mmaps+0x6c/0x1c0 [lustre]
            [<ffffffffa076c31c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre]
            [<ffffffffa09d831c>] ? ll_teardown_mmaps+0x6c/0x1c0 [lustre]
            [<ffffffffa09d831c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre]
            [<ffffffffa0ac631c>] ? ll_teardown_mmaps+0x6c/0x1c0 [lustre]
            [<ffffffffa0ac631c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre]

            For xpmem_tg_ref_by_tgid it is only get_task_mm
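The "what follows this function" check can be made repeatable for any symbol. The sketch below tallies the frame printed immediately after a given function, under the same assumption about the frame layout as in the traces above. The name `frames_after` is hypothetical, and matching on the literal string ` FUNC+0x` is a design choice so that `unmap_mapping_range` does not also match `unmap_mapping_range_vma`.

```shell
#!/bin/sh
# Hypothetical helper: tally the frame that follows FUNC in the traces.
# Usage: frames_after FUNC FILE...
frames_after() {
    func=$1
    shift
    grep --binary-files=text -h -A1 -F -e " ${func}+0x" "$@" |
        grep -v -F -e " ${func}+0x" |
        grep -v -e '^--$' |
        sed -n 's/^.*\[<[0-9a-f]*>\] \(? \)\{0,1\}\([A-Za-z0-9_.]*\)+0x.*$/\2/p' |
        sort | uniq -c | sort -rn | sed 's/^ *//'
}
```

For example, `frames_after xpmem_tg_ref_by_tgid r*` would be expected to list only `get_task_mm` if the observation above holds across all logs.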

ll_teardown_mmaps+0x6c/0x1c0 [lustre] [<ffffffffa09d831c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre] [<ffffffffa0ac631c>] ? ll_teardown_mmaps+0x6c/0x1c0 [lustre] [<ffffffffa0ac631c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre] For xpmem_tg_ref_by_tgid it is only get_task_mm
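
The per-function grep counts above can also be gathered in one pass. The following is a minimal sketch, not part of the customer's tooling: it assumes the console logs have been read as text lines, and the helper name `count_first_frames`, the regular expression, and the sample lines are all hypothetical, made up for illustration. It tallies the function named on the line that follows each "Call Trace" line, which is roughly what `grep -A1 "Call Trace"` followed by a per-function grep does.

```python
import re
from collections import Counter

# Matches a kernel back trace frame such as
#   [<ffffffff811345a2>] ? unmap_mapping_range+0x72/0x150
# and captures the function name. The optional "? " marks an unreliable frame.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def count_first_frames(lines):
    """Count the function in the first frame after each "Call Trace" line."""
    counts = Counter()
    expect_frame = False
    for line in lines:
        if "Call Trace" in line:
            # The next line should hold the first frame of this trace.
            expect_frame = True
            continue
        if expect_frame:
            match = FRAME_RE.search(line)
            if match:
                counts[match.group(1)] += 1
            expect_frame = False
    return counts


# Made-up sample input, only to show the shape of the data.
sample = [
    "Call Trace:",
    " [<ffffffff811345a2>] ? unmap_mapping_range+0x72/0x150",
    " [<ffffffffa069631c>] ll_teardown_mmaps+0x6c/0x1c0 [lustre]",
    "Call Trace:",
    " [<ffffffffa0304fb1>] ? xpmem_tg_ref_by_tgid+0x41/0xe0 [xpmem]",
    "Call Trace:",
    " <IRQ>  [<ffffffff811345a2>] unmap_mapping_range+0x72/0x150",
]
print(count_first_frames(sample).most_common())
```

On real logs, the lines could come from opening each console file with `errors="replace"` (similar in spirit to `--binary-files=text`) and passing the file object to `count_first_frames`; `most_common()` then lists the dominant starting functions in descending order.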

            This doesn't appear to be the same as LU-93, which was causing the client to crash.

            In this case, it looks like all of the threads are stuck in ll_teardown_mmaps->unmap_mapping_range() because the node is trying to free memory under memory pressure.

            This is a somewhat unusual workload for Lustre, because while mmap IO is functional, it is quite inefficient (single page RPCs) and rarely used.

            Has this application been running in the past on Lustre? Are there any changes in the environment that might have caused the application to start failing (e.g. kernel, Lustre, or application upgrade)?

            adilger Andreas Dilger added a comment

            Customer just asked me to bump up the priority on this one. They just reported that this issue has caused hundreds of nodes to become unresponsive on their system.

            dnelson@ddn.com Dennis Nelson added a comment

            People

              bobijam Zhenyu Xu
              dnelson@ddn.com Dennis Nelson