Lustre / LU-16690

kernel: obd_memory max: 1854996506, obd_memory current: 1854996506

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical

    Description

      After upgrading to 2.15.2 the server hung with errors

      kernel: obd_memory max: 1854996506, obd_memory current: 1854996506

      See the attached files for the full set of logs.

      Attachments

        1. nbp13_hung
          70 kB
        2. stack.out
          80 kB
        3. stack.out2
          82 kB
        4. stack.out3
          80 kB

        Issue Links

          Activity

            [LU-16690] kernel: obd_memory max: 1854996506, obd_memory current: 1854996506
            pjones Peter Jones added a comment -

            Ok so in that case let's assume that this is fixed unless future evidence comes to light that this is not the case


             mhanafi Mahmoud Hanafi added a comment -

             Peter,

             I haven't been able to reproduce it on our test filesystem; I think it is too small. We may need to wait for a production filesystem during an extended dedicated.

            pjones Peter Jones added a comment -

            Mahmoud

            The LU-16413 fix has been merged to b2_15 and will be in the upcoming 2.15.3 release. Have you tested the effectiveness of this release?

            Peter 

            dongyang Dongyang Li added a comment -

            Andreas,
            From the logs the kernel is 4.18.0-425.3.1.el8_lustre.x86_64.

             I think the memory allocation issue we are seeing is because of LU-16413; the patch landed in master but was not in 2.15.2.
             Before that patch, the bio-integrity kernel patch was broken for 4.18 kernels: it removed the check in bio_integrity_prep() for whether the integrity payload is already allocated, and from osd we call bio_integrity_prep() twice, so we leak the integrity payload. Both issues are fixed in LU-16413.
            I will port LU-16413 to b2_15.
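
             A minimal userspace sketch of the leak pattern described above (the struct and function names are made up for illustration; they are not the kernel's). With the "already allocated" guard in place, a second prep call is a no-op; with the guard removed, as in the broken block-integrity patch, the second call allocates a fresh payload and the first one is leaked:

             /* Toy model of the missing bio_integrity_prep() guard. */
             #include <stdbool.h>
             #include <stdio.h>
             #include <stdlib.h>

             struct fake_bio {
                 void *integrity_payload;    /* stands in for bio->bi_integrity */
             };

             static bool integrity_prep(struct fake_bio *bio, bool with_guard)
             {
                 /* Upstream "already protected?" check: makes a second call a no-op. */
                 if (with_guard && bio->integrity_payload)
                     return true;

                 /* Without the guard, a second call overwrites (and leaks) the first buffer. */
                 bio->integrity_payload = malloc(4096);
                 return bio->integrity_payload != NULL;
             }

             int main(void)
             {
                 struct fake_bio bio = { 0 };
                 void *first;

                 integrity_prep(&bio, false);    /* first call */
                 first = bio.integrity_payload;
                 integrity_prep(&bio, false);    /* second call: 'first' is now unreachable */

                 printf("payload leaked: %s\n", bio.integrity_payload != first ? "yes" : "no");

                 free(bio.integrity_payload);
                 if (first != bio.integrity_payload)
                     free(first);
                 return 0;
             }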


             adilger Andreas Dilger added a comment -

             Mahmoud, which kernel is running on the servers for this system?


             mhanafi Mahmoud Hanafi added a comment -

             Lowering the number of threads helped, but eventually the service hit the issue. However, a host that had ib_iser T10-PI disabled was fine. So for now we have disabled T10-PI and lowered the number of threads.

             I will try to see if I can get a reproducer to help debug the issue.

            pjones Peter Jones added a comment -

            Dongyang

             What is your advice here?

            Peter


             mhanafi Mahmoud Hanafi added a comment -

             All the servers have 192GB of memory. Memory has never been an issue before the upgrade.

             The filesystems crashing all have small files with a high number of IOPS. We changed the thread count on one filesystem and turned off ib_iser T10-PI checking. I saw threads waiting in bio_integrity_prep.
            green Oleg Drokin added a comment -

             To me this looks like something is wrong on the disk backend: the system is super slow (all the traces are trying to get data from storage), and then eventually clients start to reconnect, and I imagine whatever dirty data there is is not being flushed super fast either.

             And then the system thinks it's running out of memory and frantically asks everybody around to free some (that is the memory message you see). Lustre only allocated 1.8G of RAM though, which is not all that much (how much RAM is there?)


             adilger Andreas Dilger added a comment -

             From the stack traces, it looks like a lot of OSS threads are handling clients connecting to the OST, while at the same time a lot are disconnecting, and all of them are stuck starting journal transactions:

            295816 ll_ost03_057
            [<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
            [<0>] add_transaction_credits+0xd4/0x290 [jbd2]
            [<0>] start_this_handle+0x10a/0x520 [jbd2]
            [<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
            [<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
            [<0>] osd_trans_start+0x13b/0x500 [osd_ldiskfs]
            [<0>] tgt_client_data_update+0x468/0x6c0 [ptlrpc]
            [<0>] tgt_client_new+0x5c2/0x880 [ptlrpc]
            [<0>] ofd_obd_connect+0x385/0x4f0 [ofd]
            [<0>] target_handle_connect+0x611/0x29a0 [ptlrpc]
            [<0>] tgt_request_handle+0x569/0x1a40 [ptlrpc]
            [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
            [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
            [<0>] kthread+0x10a/0x120
            [<0>] ret_from_fork+0x1f/0x40
            
            295822 ll_ost03_063
            [<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
            [<0>] add_transaction_credits+0xd4/0x290 [jbd2]
            [<0>] start_this_handle+0x10a/0x520 [jbd2]
            [<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
            [<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
            [<0>] osd_trans_start+0x13b/0x500 [osd_ldiskfs]
            [<0>] tgt_server_data_update+0x3db/0x5a0 [ptlrpc]
            [<0>] tgt_client_del+0x368/0x710 [ptlrpc]
            [<0>] ofd_obd_disconnect+0x1f8/0x210 [ofd]
            [<0>] target_handle_disconnect+0x22f/0x500 [ptlrpc]
            [<0>] tgt_disconnect+0x4a/0x1a0 [ptlrpc]
            [<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
            [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
            [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
            

             When clients disconnect they need to force a synchronous journal commit, because otherwise, if the OSS crashes shortly thereafter, the clients will not be around to participate in recovery and recovery will take the maximum time. This might be significantly impacting the filesystem IO rate. Is this an HDD or NVMe filesystem?

             There are also a handful of OSS threads handling file unlinks:

            295743 ll_ost06_061
            [<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
            [<0>] add_transaction_credits+0xd4/0x290 [jbd2]
            [<0>] start_this_handle+0x10a/0x520 [jbd2]
            [<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
            [<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
            [<0>] osd_trans_start+0x13b/0x500 [osd_ldiskfs]
            [<0>] ofd_destroy+0x247/0xb20 [ofd]
            [<0>] ofd_destroy_by_fid+0x25e/0x4a0 [ofd]
            [<0>] ofd_destroy_hdl+0x263/0xa10 [ofd]
            [<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
            [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
            [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
            

            and

            295763 ll_ost06_063
            [<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
            [<0>] add_transaction_credits+0xd4/0x290 [jbd2]
            [<0>] start_this_handle+0x10a/0x520 [jbd2]
            [<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
            [<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
            [<0>] ldiskfs_evict_inode+0x273/0x6b0 [ldiskfs]
            [<0>] evict+0xd2/0x1a0
            [<0>] osd_object_delete+0x21a/0x320 [osd_ldiskfs]
            [<0>] lu_object_free.isra.37+0x90/0x1e0 [obdclass]
            [<0>] ofd_destroy_by_fid+0x2d6/0x4a0 [ofd]
            [<0>] ofd_destroy_hdl+0x263/0xa10 [ofd]
            [<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
            [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
            [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
            

            some threads are doing writes:

            293423 ll_ost_io02_061
            [<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
            [<0>] add_transaction_credits+0xd4/0x290 [jbd2]
            [<0>] start_this_handle+0x10a/0x520 [jbd2]
            [<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
            [<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
            [<0>] osd_trans_start+0x13b/0x500 [osd_ldiskfs]
            [<0>] ofd_write_attr_set+0x11d/0x1070 [ofd]
            [<0>] ofd_commitrw_write+0x226/0x1ad0 [ofd]
            [<0>] ofd_commitrw+0x5b4/0xd20 [ofd]
            [<0>] obd_commitrw+0x1b0/0x380 [ptlrpc]
            [<0>] tgt_brw_write+0x139f/0x1ce0 [ptlrpc]
            [<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
            [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
            [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
            

            A bunch of threads are doing reads (and shouldn't have journal handles), but appear to be stuck in memory allocations:

            293418 ll_ost_io02_056
            [<0>] shrink_lruvec+0x2f1/0x6c0
            [<0>] shrink_node+0x22e/0x700
            [<0>] do_try_to_free_pages+0xc9/0x3e0
            [<0>] try_to_free_pages+0xf3/0x1c0
            [<0>] __alloc_pages_slowpath+0x372/0xd10
            [<0>] __alloc_pages_nodemask+0x2e2/0x320
            [<0>] kmalloc_order+0x28/0x90
            [<0>] kmalloc_order_trace+0x1d/0xb0
            [<0>] __kmalloc+0x203/0x250
            [<0>] bio_integrity_prep+0xee/0x2b0
            [<0>] blk_mq_make_request+0xad/0x5b0
            [<0>] generic_make_request_no_check+0xe1/0x330
            [<0>] submit_bio+0x3c/0x160
            [<0>] osd_do_bio.constprop.51+0x510/0xc40 [osd_ldiskfs]
            [<0>] osd_read_prep+0x456/0x540 [osd_ldiskfs]
            [<0>] ofd_preprw_read.isra.27+0x4a0/0x13b0 [ofd]
            [<0>] ofd_preprw+0x6ef/0x900 [ofd]
            [<0>] tgt_brw_read+0x6e0/0x1fc0 [ptlrpc]
            [<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
            [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
            [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
            

            and

            293127 ll_ost_io01_057
            [<0>] __lock_page+0x12d/0x230
            [<0>] pagecache_get_page+0x1e5/0x310
            [<0>] osd_bufs_get+0x56a/0xb20 [osd_ldiskfs]
            [<0>] ofd_preprw_read.isra.27+0x3dc/0x13b0 [ofd]
            [<0>] ofd_preprw+0x6ef/0x900 [ofd]
            [<0>] tgt_brw_read+0x6e0/0x1fc0 [ptlrpc]
            [<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
            [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
            [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
            

            All of the above threads are blocked waiting to access the journal or for memory, so they shouldn't actually be doing anything to modify the filesystem yet. There are a handful of threads that are actually holding a transaction open:

            293304 ll_ost_io05_051
            [<0>] jbd2_log_wait_commit+0xac/0x120 [jbd2]
            [<0>] jbd2_journal_stop+0x2d5/0x330 [jbd2]
            [<0>] __ldiskfs_journal_stop+0x36/0xb0 [ldiskfs]
            [<0>] osd_trans_stop+0x235/0xa60 [osd_ldiskfs]
            [<0>] ofd_object_punch+0xdbf/0x1330 [ofd]
            [<0>] ofd_punch_hdl+0x4b0/0x710 [ofd]
            [<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
            [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
            [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
            

            and

            293411 ll_ost_io04_061
            [<0>] range_lock+0x148/0x270 [obdclass]
            [<0>] ofd_preprw_write.isra.28+0x720/0x1240 [ofd]
            [<0>] ofd_preprw+0x7b2/0x900 [ofd]
            [<0>] obd_preprw+0x1a1/0x360 [ptlrpc]
            [<0>] tgt_brw_write+0x11cf/0x1ce0 [ptlrpc]
            [<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
            [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
            [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
            

            One possible cause of this blockage is a handful of threads that are holding an open transaction, but are blocked on memory allocation:

            227209 ll_ost_io06_004
            292970 ll_ost_io00_021
            292989 ll_ost_io06_054
            227821 ll_ost_io06_006
            271578 ll_ost_io03_023
            293355 ll_ost_io00_059
            [<0>] shrink_lruvec+0x2f1/0x6c0
            [<0>] shrink_node+0x22e/0x700
            [<0>] do_try_to_free_pages+0xc9/0x3e0
            [<0>] try_to_free_pages+0xf3/0x1c0
            [<0>] __alloc_pages_slowpath+0x372/0xd10
            [<0>] __alloc_pages_nodemask+0x2e2/0x320
            [<0>] kmalloc_order+0x28/0x90
            [<0>] kmalloc_order_trace+0x1d/0xb0
            [<0>] __kmalloc+0x203/0x250
            [<0>] bio_integrity_prep+0xee/0x2b0
            [<0>] blk_mq_make_request+0xad/0x5b0
            [<0>] generic_make_request_no_check+0xe1/0x330
            [<0>] submit_bio+0x3c/0x160
            [<0>] osd_do_bio.constprop.51+0xb63/0xc40 [osd_ldiskfs]
            [<0>] osd_ldiskfs_map_inode_pages+0x873/0x8f0 [osd_ldiskfs]
            [<0>] osd_write_commit+0x5e2/0x990 [osd_ldiskfs]
            [<0>] ofd_commitrw_write+0x77e/0x1ad0 [ofd]
            [<0>] ofd_commitrw+0x5b4/0xd20 [ofd]
            [<0>] obd_commitrw+0x1b0/0x380 [ptlrpc]
            [<0>] tgt_brw_write+0x139f/0x1ce0 [ptlrpc]
            [<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
            [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
            [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
            

             It looks like the underlying storage is using T10-PI at the block level, which does an allocation late in the IO submission process; that is bad. The kmalloc() itself looks like it is using GFP_NOIO, which is good, so it doesn't recurse into the filesystem again, but there doesn't appear to be much memory available for the number of running threads, and this is causing it to block.
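
             As background, a kernel-context sketch (illustrative only; it is not code from the Lustre or kernel trees and would only build inside a kernel source tree) of the two usual ways to keep an allocation on the IO submission path from recursing back into filesystem reclaim: pass GFP_NOIO at the call site, as the integrity-payload kmalloc() in the traces appears to do, or wrap the region with memalloc_noio_save()/memalloc_noio_restore() so nested allocations are treated as GFP_NOIO:

             #include <linux/bio.h>
             #include <linux/sched/mm.h>
             #include <linux/slab.h>

             /* 1. Pass GFP_NOIO at the call site, as the integrity payload
              *    allocation in the stacks above appears to do. */
             static void *alloc_protection_buf(size_t len)
             {
                 return kmalloc(len, GFP_NOIO);
             }

             /* 2. Or mark the whole submission region NOIO, so even nested
              *    GFP_KERNEL allocations cannot re-enter the filesystem. */
             static void submit_bio_noio(struct bio *bio)
             {
                 unsigned int noio_flags = memalloc_noio_save();

                 submit_bio(bio);
                 memalloc_noio_restore(noio_flags);
             }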

            That may also be problematic because kswapd is blocked in the filesystem when trying to free an inode, so it is unable to do any memory reclaim:

            349 kswapd0
            [<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
            [<0>] add_transaction_credits+0xd4/0x290 [jbd2]
            [<0>] start_this_handle+0x10a/0x520 [jbd2]
            [<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
            [<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
            [<0>] ldiskfs_release_dquot+0x60/0xb0 [ldiskfs]
            [<0>] dqput.part.19+0x82/0x1e0
            [<0>] __dquot_drop+0x69/0x90
            [<0>] ldiskfs_clear_inode+0x1e/0x80 [ldiskfs]
            [<0>] ldiskfs_evict_inode+0x58/0x6b0 [ldiskfs]
            [<0>] evict+0xd2/0x1a0
            [<0>] dispose_list+0x48/0x70
            [<0>] prune_icache_sb+0x52/0x80
            [<0>] super_cache_scan+0x123/0x1b0
            [<0>] do_shrink_slab+0x11d/0x330
            [<0>] shrink_slab+0xbe/0x2f0
            [<0>] shrink_node+0x246/0x700
            [<0>] balance_pgdat+0x2d7/0x550
            [<0>] kswapd+0x201/0x3c0
            

            I see the same kswapd0 process stuck in nbp13_hung so that seems significant.

             I don't have the configuration, system, and storage details, but it may just be that there are too many threads running on this system for the amount of RAM it has. You could try tuning oss_max_threads=256 or similar to reduce the concurrent thread count so the OSS memory isn't running out. If large RPCs are in use (e.g. obdfilter.*.brw_size=16 or =32) then this could be reduced so the OST threads are not preallocating so much memory for large RPCs.


            People

              Assignee: dongyang Dongyang Li
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 7
