Lustre / LU-13212

Lustre client hangs machine under memory pressure

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: Lustre 2.10.3

    Description

      Hello,

      When a userspace process goes crazy with memory allocation, the OOM killer sometimes fails to kick in because Lustre is still trying to free its memory.
      I am not sure whether it deadlocked or there are simply too many locks it is trying to free, but the machine had been in this state for more than 12 hours before it was manually crashed.
      This is CentOS 7.4 with kernel 3.10.0-693.5.2.el7.x86_64.
      The machine still responds to pings while it is in this state.

      Here is one of the kernel task stacks:

      [223483.032862]  [<ffffffff81196b27>] ? putback_inactive_pages+0x117/0x2d0
      [223483.050260]  [<ffffffff81196f0a>] ? shrink_inactive_list+0x22a/0x5d0
      [223483.062319]  [<ffffffff811979a5>] shrink_lruvec+0x385/0x730
      [223483.073571]  [<ffffffffc085ee07>] ? ldlm_cli_pool_shrink+0x67/0x100 [ptlrpc]
      [223483.086214]  [<ffffffff81197dc6>] shrink_zone+0x76/0x1a0
      [223483.096773]  [<ffffffff811982d0>] do_try_to_free_pages+0xf0/0x4e0
      [223483.108086]  [<ffffffff811987bc>] try_to_free_pages+0xfc/0x180
      [223483.119023]  [<ffffffff8169fbcb>] __alloc_pages_slowpath+0x457/0x724
      [223483.130417]  [<ffffffff8118cdb5>] __alloc_pages_nodemask+0x405/0x420
      [223483.141673]  [<ffffffff811d081a>] alloc_page_interleave+0x3a/0xa0
      [223483.152526]  [<ffffffff811d4133>] alloc_pages_vma+0x143/0x200
      [223483.162848]  [<ffffffff811c37a0>] ? end_swap_bio_write+0x80/0x80
      [223483.173345]  [<ffffffff811c44ad>] read_swap_cache_async+0xed/0x160
      [223483.183938]  [<ffffffff811c45c8>] swapin_readahead+0xa8/0x110
      [223483.193933]  [<ffffffff811b22cb>] handle_mm_fault+0xadb/0xfa0
      [223483.203823]  [<ffffffff816b00b4>] __do_page_fault+0x154/0x450
      [223483.213621]  [<ffffffff816b03e5>] do_page_fault+0x35/0x90
      [223483.222983]  [<ffffffff816ac608>] page_fault+0x28/0x30
      

      Please let me know if you need more information.
      Regards.
      Jacek Tomaka


          Activity

            [LU-13212] Lustre client hangs machine under memory pressure
            pjones Peter Jones added a comment -

            Landed for 2.15


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43281/
            Subject: LU-13212 osc: fall back to vmalloc for large RPCs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 037a9e2cf6d5b8d6fdbcde02c1c22e22272c5c07
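The patch subject describes the approach: a large physically-contiguous kmalloc() for an RPC descriptor is what drags the caller into direct reclaim, so the fix is to fail that attempt fast and fall back to vmalloc(), which only needs page-sized contiguity (this is essentially what the kernel's kvmalloc() helper does). A userspace sketch of the same fallback pattern, with hypothetical helper names and a made-up size threshold standing in for the kernel's behavior:

```c
#include <stdlib.h>

/* Model: contiguous allocations above this size "fail fast", the way
 * kmalloc(..., __GFP_NORETRY) would give up instead of entering
 * direct reclaim. Threshold is illustrative only. */
#define CONTIG_LIMIT (4 * 4096)

static void *try_contig_alloc(size_t size)
{
	if (size > CONTIG_LIMIT)
		return NULL;		/* models __GFP_NORETRY giving up */
	return malloc(size);
}

static void *vmalloc_like(size_t size)
{
	void *buf;
	/* page-aligned, no physical-contiguity requirement in the model */
	if (posix_memalign(&buf, 4096, size) != 0)
		return NULL;
	return buf;
}

/* Hypothetical name; sketches the kvmalloc-style fallback the patch
 * subject ("fall back to vmalloc for large RPCs") refers to. */
void *rpc_buf_alloc(size_t size)
{
	void *buf = try_contig_alloc(size);

	if (buf == NULL)
		buf = vmalloc_like(size);
	return buf;
}
```

The point of the design is that the caller never blocks in reclaim on behalf of a large buffer: either the cheap contiguous path succeeds immediately, or the allocation is satisfied from the non-contiguous path.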


            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43281
            Subject: LU-13212 osc: fall back to vmalloc for large RPCs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: af1bc8b2f62b47e79633768ef4a5182a737de6ae

            lflis Lukasz Flis added a comment -

            I am not sure if it's the same issue, but we've seen the problem of tasks getting hung on memory allocation instead of being killed by the OOM killer.

            We can trigger the problem on 2.10 (also ddn22) and 2.12.5 simply by running a process that allocates 18GB of RAM inside a Singularity container, itself inside a mem cgroup with a memory limit of 16GB.

            [632523.103572] allocator-rss R running task 0 2310 2258 0x00000002
            [632523.137483] Call Trace:
            [632523.150045] [<ffffffff82b06142>] ? ktime_get_ts64+0x52/0xf0
            [632523.177593] [<ffffffffc074fc05>] ? ktime_get_seconds+0x25/0x40 [libcfs]
            [632523.209245] [<ffffffffc0e33c35>] ? osc_cache_shrink_count+0x15/0x90 [osc]
            [632523.241671] [<ffffffffc0e1f162>] ? osc_cache_shrink+0x42/0x60 [osc]
            [632523.271743] [<ffffffff82bd15ce>] ? shrink_slab+0xae/0x340
            [632523.297716] [<ffffffff82bd495a>] ? do_try_to_free_pages+0x3ca/0x520
            [632523.327818] [<ffffffff82bd4bac>] ? try_to_free_pages+0xfc/0x180
            [632523.356911] [<ffffffff82c81c1e>] ? free_more_memory+0xae/0x100
            [632523.385397] [<ffffffff82c82f8b>] ? __getblk+0x15b/0x300
            [632523.410757] [<ffffffffc05c715c>] ? squashfs_read_data+0x15c/0x620 [squashfs]
            [632523.444063] [<ffffffffc05c78e6>] ? squashfs_cache_get+0x2c6/0x3c0 [squashfs]
            [632523.477907] [<ffffffffc05c78da>] ? squashfs_cache_get+0x2ba/0x3c0 [squashfs]
            [632523.511713] [<ffffffffc05c7ec8>] ? squashfs_read_metadata+0x58/0x130 [squashfs]
            [632523.546580] [<ffffffffc05c7ff1>] ? squashfs_get_datablock+0x21/0x30 [squashfs]
            [632523.581404] [<ffffffffc05c9272>] ? squashfs_readpage+0x8a2/0xc30 [squashfs]
            [632523.616332] [<ffffffff82bcb0a8>] ? __do_page_cache_readahead+0x248/0x260
            [632523.648564] [<ffffffff82bcb691>] ? ra_submit+0x21/0x30
            [632523.674147] [<ffffffff82bc0045>] ? filemap_fault+0x105/0x490
            [632523.701357] [<ffffffff82bec15a>] ? __do_fault.isra.61+0x8a/0x100
            [632523.730618] [<ffffffff82bec70c>] ? do_read_fault.isra.63+0x4c/0x1b0
            [632523.760999] [<ffffffff82bf11ba>] ? handle_pte_fault+0x22a/0xe20
            [632523.789538] [<ffffffff82be94cc>] ? __get_user_pages+0x16c/0x7a0
            [632523.817942] [<ffffffff82bf3ecd>] ? handle_mm_fault+0x39d/0x9b0
            [632523.846014] [<ffffffff83188653>] ? __do_page_fault+0x213/0x500
            [632523.874110] [<ffffffff83188975>] ? do_page_fault+0x35/0x90
            [632523.900554] [<ffffffff83184778>] ? page_fault+0x28/0x30

             

            For some reason, running the same allocator outside of the container but inside the same cgroup with the 16GB limit always ends in a successful OOM kill.
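The reproducer described above can be sketched as a trivial allocator that keeps memory resident until the cgroup limit is exceeded. Sizes are parameters here, not anything from the kernel; run it inside a memory cgroup whose limit is below `total` (e.g. total=18GB, limit=16GB) to exercise the reclaim/OOM path:

```c
#include <stdlib.h>
#include <string.h>

/* Allocate and touch `total` bytes in `chunk`-sized pieces. The
 * memset forces the pages to be faulted in so they count as RSS;
 * the memory is intentionally leaked, since the reproducer is meant
 * to hold it until the OOM killer (hopefully) intervenes. */
int allocate_rss(size_t total, size_t chunk)
{
	size_t done = 0;

	while (done < total) {
		char *p = malloc(chunk);

		if (p == NULL)
			return -1;
		memset(p, 0xab, chunk);
		done += chunk;
	}
	return 0;
}
```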

            Tomaka Jacek Tomaka (Inactive) added a comment - edited

            Here are some more stack traces (not necessarily from the same instance of the incident):

            #23 [ffff881e9d09b620] shrink_slab at ffffffff81195389
            #24 [ffff881e9d09b6c0] do_try_to_free_pages at ffffffff811985a2
            #25 [ffff881e9d09b738] try_to_free_pages at ffffffff811987bc
            #26 [ffff881e9d09b7d0] __alloc_pages_slowpath at ffffffff8169fbcb
            #27 [ffff881e9d09b8c0] __alloc_pages_nodemask at ffffffff8118cdb5
            #28 [ffff881e9d09b970] alloc_pages_current at ffffffff811d1078
            #29 [ffff881e9d09b9b8] __get_free_pages at ffffffff8118761e
            #30 [ffff881e9d09b9c8] kmalloc_order_trace at ffffffff811dca2e
            #31 [ffff881e9d09ba10] __kmalloc at ffffffff811e05c1
            #32 [ffff881e9d09ba50] osc_build_rpc at ffffffffc0a3588e [osc]
            #33 [ffff881e9d09bb10] osc_io_unplug0 at ffffffffc0a4e380 [osc]
            #34 [ffff881e9d09bc50] osc_io_unplug at ffffffffc0a50780 [osc]
            #35 [ffff881e9d09bc60] brw_queue_work at ffffffffc0a2b881 [osc]
            #36 [ffff881e9d09bc80] work_interpreter at ffffffffc08531d7 [ptlrpc]
            #37 [ffff881e9d09bca8] ptlrpc_check_set at ffffffffc084ff58 [ptlrpc]
            #38 [ffff881e9d09bd48] ptlrpc_check_set at ffffffffc08518fb [ptlrpc]
            #39 [ffff881e9d09bd68] ptlrpcd_check at ffffffffc087e74b [ptlrpc]
            #40 [ffff881e9d09bdb8] ptlrpcd at ffffffffc087eb69 [ptlrpc]
            #41 [ffff881e9d09bec8] kthread at ffffffff810b099f
            #42 [ffff881e9d09bf50] ret_from_fork at ffffffff816b5018
            
            
             #4 [ffff881d986f3a78] native_queued_spin_lock_slowpath at ffffffff810fa336
             #5 [ffff881d986f3a80] queued_spin_lock_slowpath at ffffffff8169e6bf
             #6 [ffff881d986f3a90] _raw_spin_lock at ffffffff816abc50
             #7 [ffff881d986f3aa0] osc_cache_shrink_count at ffffffffc0a3e9e5 [osc]
             #8 [ffff881d986f3ab0] osc_cache_shrink at ffffffffc0a2b152 [osc]
             #9 [ffff881d986f3ae0] shrink_slab at ffffffff81195389
            #10 [ffff881d986f3b80] do_try_to_free_pages at ffffffff811985a2
            #11 [ffff881d986f3bf8] try_to_free_pages at ffffffff811987bc
            #12 [ffff881d986f3c90] __alloc_pages_slowpath at ffffffff8169fbcb
            #13 [ffff881d986f3d80] __alloc_pages_nodemask at ffffffff8118cdb5
            #14 [ffff881d986f3e30] copy_process at ffffffff8108511d
            #15 [ffff881d986f3ec0] do_fork at ffffffff81086a61
            #16 [ffff881d986f3f38] sys_clone at ffffffff81086d76
            #17 [ffff881d986f3f48] stub_clone at ffffffff816b5419
            
            
            
             #4 [ffff8801736af938] osq_lock at ffffffff810fa6e5
             #5 [ffff8801736af948] __mutex_lock_slowpath at ffffffff816a837a
             #6 [ffff8801736af9a8] mutex_lock at ffffffff816a77ef
             #7 [ffff8801736af9c0] ldlm_pools_shrink at ffffffffc0842cb4 [ptlrpc]
             #8 [ffff8801736afa08] ldlm_pools_cli_shrink at ffffffffc084308b [ptlrpc]
             #9 [ffff8801736afa18] shrink_slab at ffffffff81195389
            #10 [ffff8801736afab8] do_try_to_free_pages at ffffffff811985a2
            #11 [ffff8801736afb30] try_to_free_pages at ffffffff811987bc
            #12 [ffff8801736afbc8] __alloc_pages_slowpath at ffffffff8169fbcb
            #13 [ffff8801736afcb8] __alloc_pages_nodemask at ffffffff8118cdb5
            #14 [ffff8801736afd68] copy_process at ffffffff8108511d
            #15 [ffff8801736afdf8] do_fork at ffffffff81086a61
            #16 [ffff8801736afe70] kernel_thread at ffffffff81086d16
            #17 [ffff8801736afe80] kthreadd at ffffffff810b1351
            #18 [ffff8801736aff50] ret_from_fork at ffffffff816b5018
            
            
             #4 [ffff881eea3935c0] prune_super at ffffffff81203768
             #5 [ffff881eea3935f0] shrink_slab at ffffffff81195389
             #6 [ffff881eea393690] do_try_to_free_pages at ffffffff811985a2
             #7 [ffff881eea393708] try_to_free_pages at ffffffff811987bc
             #8 [ffff881eea3937a0] __alloc_pages_slowpath at ffffffff8169fbcb
             #9 [ffff881eea393890] __alloc_pages_nodemask at ffffffff8118cdb5
            #10 [ffff881eea393940] alloc_pages_current at ffffffff811d1078
            #11 [ffff881eea393988] new_slab at ffffffff811dbdfc
            #12 [ffff881eea3939c0] ___slab_alloc at ffffffff811dd68c
            #13 [ffff881eea393a98] __slab_alloc at ffffffff816a118e
            #14 [ffff881eea393ad8] kmem_cache_alloc at ffffffff811df623
            #15 [ffff881eea393b18] ptlrpc_request_cache_alloc at ffffffffc084a1b7 [ptlrpc]
            #16 [ffff881eea393b30] ptlrpc_request_alloc_internal at ffffffffc084a2c5 [ptlrpc]
            #17 [ffff881eea393b68] ptlrpc_request_alloc at ffffffffc084a6f3 [ptlrpc]
            #18 [ffff881eea393b78] ptlrpc_connect_import at ffffffffc087a29e [ptlrpc]
            #19 [ffff881eea393c30] ptlrpc_request_handle_notconn at ffffffffc0854128 [ptlrpc]
            #20 [ffff881eea393c50] after_reply at ffffffffc084f662 [ptlrpc]
            #21 [ffff881eea393ca8] ptlrpc_check_set at ffffffffc0850cd4 [ptlrpc]
            #22 [ffff881eea393d48] ptlrpc_check_set at ffffffffc08518fb [ptlrpc]
            #23 [ffff881eea393d68] ptlrpcd_check at ffffffffc087e74b [ptlrpc]
            #24 [ffff881eea393db8] ptlrpcd at ffffffffc087eb0b [ptlrpc]
            #25 [ffff881eea393ec8] kthread at ffffffff810b099f
            #26 [ffff881eea393f50] ret_from_fork at ffffffff816b5018
            
            
             #4 [ffff880b26186fa8] native_queued_spin_lock_slowpath at ffffffff810fa336
             #5 [ffff880b26186fb0] queued_spin_lock_slowpath at ffffffff8169e6bf
             #6 [ffff880b26186fc0] _raw_spin_lock at ffffffff816abc50
             #7 [ffff880b26186fd0] osc_cache_shrink_count at ffffffffc0a3e9e5 [osc]
             #8 [ffff880b26186fe0] osc_cache_shrink at ffffffffc0a2b152 [osc]
             #9 [ffff880b26187010] shrink_slab at ffffffff81195389
            #10 [ffff880b261870b0] do_try_to_free_pages at ffffffff811985a2
            #11 [ffff880b26187128] try_to_free_pages at ffffffff811987bc
            #12 [ffff880b261871c0] __alloc_pages_slowpath at ffffffff8169fbcb
            #13 [ffff880b261872b0] __alloc_pages_nodemask at ffffffff8118cdb5
            #14 [ffff880b26187360] alloc_pages_current at ffffffff811d1078
            #15 [ffff880b261873a8] new_slab at ffffffff811dbdfc
            #16 [ffff880b261873e0] ___slab_alloc at ffffffff811dd68c
            #17 [ffff880b261874b8] __slab_alloc at ffffffff816a118e
            #18 [ffff880b261874f8] kmem_cache_alloc at ffffffff811df623
            #19 [ffff880b26187538] ptlrpc_request_cache_alloc at ffffffffc084a1b7 [ptlrpc]
            #20 [ffff880b26187550] ptlrpc_request_alloc_internal at ffffffffc084a2c5 [ptlrpc]
            #21 [ffff880b26187588] ptlrpc_request_alloc_pool at ffffffffc084a70e [ptlrpc]
            #22 [ffff880b26187598] osc_brw_prep_request at ffffffffc0a32393 [osc]
            #23 [ffff880b26187670] osc_build_rpc at ffffffffc0a35c82 [osc]
            #24 [ffff880b26187730] osc_io_unplug0 at ffffffffc0a4e657 [osc]
            #25 [ffff880b26187870] osc_cache_writeback_range at ffffffffc0a57500 [osc]
            #26 [ffff880b261879c8] osc_io_fsync_start at ffffffffc0a44438 [osc]
            #27 [ffff880b26187a10] cl_io_start at ffffffffc0615eb5 [obdclass]
            #28 [ffff880b26187a38] lov_io_call at ffffffffc0aa7565 [lov]
            #29 [ffff880b26187a68] lov_io_start at ffffffffc0aa7726 [lov]
            #30 [ffff880b26187a88] cl_io_start at ffffffffc0615eb5 [obdclass]
            #31 [ffff880b26187ab0] cl_io_loop at ffffffffc061827e [obdclass]
            #32 [ffff880b26187b28] cl_sync_file_range at ffffffffc0b591db [lustre]
            #33 [ffff880b26187b80] ll_writepages at ffffffffc0b7a4b7 [lustre]
            #34 [ffff880b26187bb8] do_writepages at ffffffff8118f02e
            #35 [ffff880b26187bc8] __writeback_single_inode at ffffffff8122d8e0
            #36 [ffff880b26187c08] writeback_sb_inodes at ffffffff8122e524
            #37 [ffff880b26187cb0] __writeback_inodes_wb at ffffffff8122e88f
            #38 [ffff880b26187cf8] wb_writeback at ffffffff8122f0c3
            #39 [ffff880b26187d70] bdi_writeback_workfn at ffffffff8122f46c
            #40 [ffff880b26187e20] process_one_work at ffffffff810a882a
            #41 [ffff880b26187e68] worker_thread at ffffffff810a94f6
            #42 [ffff880b26187ec8] kthread at ffffffff810b099f
            #43 [ffff880b26187f50] ret_from_fork at ffffffff816b5018
             #4 [ffff881fb3c3b9c8] native_queued_spin_lock_slowpath at ffffffff810fa336
             #5 [ffff881fb3c3b9d0] queued_spin_lock_slowpath at ffffffff8169e6bf
             #6 [ffff881fb3c3b9e0] _raw_spin_lock at ffffffff816abc50
             #7 [ffff881fb3c3b9f0] osc_cache_shrink_count at ffffffffc0a3e9e5 [osc]
             #8 [ffff881fb3c3ba00] osc_cache_shrink at ffffffffc0a2b152 [osc]
             #9 [ffff881fb3c3ba30] shrink_slab at ffffffff81195389
            #10 [ffff881fb3c3bad0] do_try_to_free_pages at ffffffff811985a2
            #11 [ffff881fb3c3bb48] try_to_free_pages at ffffffff811987bc
            #12 [ffff881fb3c3bbe0] __alloc_pages_slowpath at ffffffff8169fbcb
            #13 [ffff881fb3c3bcd0] __alloc_pages_nodemask at ffffffff8118cdb5
            #14 [ffff881fb3c3bd80] alloc_pages_current at ffffffff811d1078
            #15 [ffff881fb3c3bdc8] __get_free_pages at ffffffff8118761e
            #16 [ffff881fb3c3bdd8] kmalloc_order_trace at ffffffff811dca2e
            #17 [ffff881fb3c3be20] __kmalloc at ffffffff811e05c1
            #18 [ffff881fb3c3be60] seq_buf_alloc at ffffffff81225f07
            #19 [ffff881fb3c3be78] seq_read at ffffffff8122645e
            #20 [ffff881fb3c3bee8] proc_reg_read at ffffffff812702cd
            #21 [ffff881fb3c3bf08] vfs_read at ffffffff81200b1c
            #22 [ffff881fb3c3bf38] sys_read at ffffffff812019df
            #23 [ffff881fb3c3bf80] tracesys at ffffffff816b52ce (via system_call)
            
            
            
            
            #4 [ffff881fba2ef838] osq_lock at ffffffff810fa6e5
             #5 [ffff881fba2ef848] __mutex_lock_slowpath at ffffffff816a837a
             #6 [ffff881fba2ef8a8] mutex_lock at ffffffff816a77ef
             #7 [ffff881fba2ef8c0] ldlm_pools_shrink at ffffffffc0842cb4 [ptlrpc]
             #8 [ffff881fba2ef908] ldlm_pools_cli_shrink at ffffffffc084308b [ptlrpc]
             #9 [ffff881fba2ef918] shrink_slab at ffffffff81195389
            #10 [ffff881fba2ef9b8] do_try_to_free_pages at ffffffff811985a2
            #11 [ffff881fba2efa30] try_to_free_pages at ffffffff811987bc
            #12 [ffff881fba2efac8] __alloc_pages_slowpath at ffffffff8169fbcb
            #13 [ffff881fba2efbb8] __alloc_pages_nodemask at ffffffff8118cdb5
            #14 [ffff881fba2efc68] alloc_pages_current at ffffffff811d1078
            #15 [ffff881fba2efcb0] __page_cache_alloc at ffffffff81182927
            #16 [ffff881fba2efce8] filemap_fault at ffffffff81184ec0
            #17 [ffff881fba2efd48] ext4_filemap_fault at ffffffffc0331156 [ext4]
            #18 [ffff881fba2efd70] __do_fault at ffffffff811ad0d2
            #19 [ffff881fba2efdd0] do_read_fault at ffffffff811ad57b
            #20 [ffff881fba2efe28] handle_mm_fault at ffffffff811b1e81
            #21 [ffff881fba2efec0] __do_page_fault at ffffffff816b00b4
            #22 [ffff881fba2eff20] do_page_fault at ffffffff816b03e5
            #23 [ffff881fba2eff50] page_fault at ffffffff816ac608
            

            I have the crash dump, so I can upload it if you are interested.
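A pattern visible in several of the traces above is that threads in direct reclaim queue up on Lustre's own shrinker locks (the osc_cache_shrink_count spinlock, the ldlm_pools_shrink mutex), serializing reclaim instead of making progress. One common mitigation for this class of problem (not necessarily what Lustre adopted) is to make the shrinker's count callback non-blocking: if the lock is contended, report zero freeable objects and let reclaim move on. A userspace sketch using pthreads, with hypothetical names:

```c
#include <pthread.h>

/* Hypothetical shrinker "count" callback. If the cache lock is
 * contended (another thread is already scanning or reclaiming),
 * refuse to block: report 0 freeable objects so direct reclaim
 * skips this shrinker instead of queueing on the lock. */
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long cached_objects = 1024;

unsigned long cache_shrink_count(void)
{
	unsigned long count;

	if (pthread_mutex_trylock(&cache_lock) != 0)
		return 0;		/* contended: claim nothing freeable */
	count = cached_objects;
	pthread_mutex_unlock(&cache_lock);
	return count;
}
```

The trade-off is that a contended shrinker temporarily under-reports freeable memory, but that is usually preferable to every reclaiming thread spinning on one lock, which is what the stacks above show.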


            People

              Assignee: adilger Andreas Dilger
              Reporter: Tomaka Jacek Tomaka (Inactive)
              Votes: 1
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: