Details
- Type: Bug
- Resolution: Duplicate
- Priority: Critical
Description
After upgrading to 2.15.2, the server hung with errors:
kernel: obd_memory max: 1854996506, obd_memory current: 1854996506
See the attached logs for the full set of messages.
Attachments
- nbp13_hung (70 kB)
- stack.out (80 kB)
- stack.out2 (82 kB)
- stack.out3 (80 kB)
Issue Links
- is related to LU-16413 T10PI is broken for CentOS 8.x (Resolved)
Activity
Peter,
I haven't been able to reproduce it on our test filesystem; I think it is too small. We may need to wait for a production filesystem during an extended dedicated time.
Mahmoud
The LU-16413 fix has been merged to b2_15 and will be in the upcoming 2.15.3 release. Have you tested the effectiveness of this release?
Peter
Andreas,
From the logs the kernel is 4.18.0-425.3.1.el8_lustre.x86_64.
I think the memory allocation issue we are seeing is because of LU-16413; the patch landed in master but not in 2.15.2.
Before that fix, the bio-integrity kernel patch is broken for 4.18 kernels: it removed the check in bio_integrity_prep() for whether the integrity payload is already allocated, and osd calls bio_integrity_prep() twice, so we are leaking the integrity payload. Both issues are fixed in LU-16413.
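To make the leak concrete, here is a minimal userspace sketch of the double-allocation pattern described above (not the actual kernel code; struct io and the payload_prep_*() helpers are hypothetical stand-ins for the bio and bio_integrity_prep()):

#include <stdlib.h>

struct io {
	void *integrity_payload;
};

static void payload_prep_broken(struct io *io)
{
	/* the "already allocated?" check is missing, so a second call
	 * overwrites the pointer and the first allocation is leaked */
	io->integrity_payload = malloc(4096);
}

static void payload_prep_fixed(struct io *io)
{
	/* restored guard: a repeated call becomes a no-op */
	if (io->integrity_payload)
		return;
	io->integrity_payload = malloc(4096);
}

int main(void)
{
	struct io io = { 0 };

	/* the prep function gets called twice per IO; without the guard
	 * the first 4 KB payload below is never freed */
	payload_prep_broken(&io);
	payload_prep_broken(&io);	/* leaks the first payload */
	free(io.integrity_payload);
	io.integrity_payload = NULL;

	payload_prep_fixed(&io);
	payload_prep_fixed(&io);	/* harmless second call */
	free(io.integrity_payload);
	return 0;
}

With the guard removed, every IO that goes through the prep path twice leaks one payload allocation.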
I will port LU-16413 to b2_15.
Lowering the number of threads helps, but eventually the service hits the issue. However, a host that had ib_iser T10-PI disabled was fine. So for now we have disabled T10-PI and lowered the number of threads.
I will try to see if I can get a reproducer to help debug the issue.
All the servers have 192GB of memory. Memory has never been an issue before the upgrade.
The filesystems that are crashing all have small files with a high number of IOPS. We changed the thread count on one filesystem and turned off ib_iser T10-PI checking. I saw threads waiting in bio_integrity_prep().
To me this looks like something is wrong on the disk backend: the system is extremely slow (all the traces are trying to get data from storage), and eventually clients start to reconnect; I imagine whatever dirty data there is is not being flushed very fast either.
Then the system thinks it is running out of memory and frantically asks everybody around to free some (that is the memory message you see), but Lustre has only allocated 1.8 GB of RAM, which is not all that much (how much RAM is there?).
From the stack traces, it looks like a lot of OSS threads are handling clients connecting to the OST, while at the same time a lot are disconnecting, and all of them are stuck starting journal transactions:
295816 ll_ost03_057
[<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
[<0>] add_transaction_credits+0xd4/0x290 [jbd2]
[<0>] start_this_handle+0x10a/0x520 [jbd2]
[<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
[<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
[<0>] osd_trans_start+0x13b/0x500 [osd_ldiskfs]
[<0>] tgt_client_data_update+0x468/0x6c0 [ptlrpc]
[<0>] tgt_client_new+0x5c2/0x880 [ptlrpc]
[<0>] ofd_obd_connect+0x385/0x4f0 [ofd]
[<0>] target_handle_connect+0x611/0x29a0 [ptlrpc]
[<0>] tgt_request_handle+0x569/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
[<0>] kthread+0x10a/0x120
[<0>] ret_from_fork+0x1f/0x40

295822 ll_ost03_063
[<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
[<0>] add_transaction_credits+0xd4/0x290 [jbd2]
[<0>] start_this_handle+0x10a/0x520 [jbd2]
[<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
[<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
[<0>] osd_trans_start+0x13b/0x500 [osd_ldiskfs]
[<0>] tgt_server_data_update+0x3db/0x5a0 [ptlrpc]
[<0>] tgt_client_del+0x368/0x710 [ptlrpc]
[<0>] ofd_obd_disconnect+0x1f8/0x210 [ofd]
[<0>] target_handle_disconnect+0x22f/0x500 [ptlrpc]
[<0>] tgt_disconnect+0x4a/0x1a0 [ptlrpc]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
When clients disconnect they need to force a synchronous journal commit, because otherwise, if the OSS crashes shortly thereafter, the clients will not be around to participate in recovery and recovery will take the maximum time. This might be significantly impacting the filesystem IO rate. Is this an HDD or NVMe filesystem?
There are also a handful of OSS threads handling file unlinks:
295743 ll_ost06_061
[<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
[<0>] add_transaction_credits+0xd4/0x290 [jbd2]
[<0>] start_this_handle+0x10a/0x520 [jbd2]
[<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
[<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
[<0>] osd_trans_start+0x13b/0x500 [osd_ldiskfs]
[<0>] ofd_destroy+0x247/0xb20 [ofd]
[<0>] ofd_destroy_by_fid+0x25e/0x4a0 [ofd]
[<0>] ofd_destroy_hdl+0x263/0xa10 [ofd]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
and
295763 ll_ost06_063
[<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
[<0>] add_transaction_credits+0xd4/0x290 [jbd2]
[<0>] start_this_handle+0x10a/0x520 [jbd2]
[<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
[<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
[<0>] ldiskfs_evict_inode+0x273/0x6b0 [ldiskfs]
[<0>] evict+0xd2/0x1a0
[<0>] osd_object_delete+0x21a/0x320 [osd_ldiskfs]
[<0>] lu_object_free.isra.37+0x90/0x1e0 [obdclass]
[<0>] ofd_destroy_by_fid+0x2d6/0x4a0 [ofd]
[<0>] ofd_destroy_hdl+0x263/0xa10 [ofd]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
Some threads are doing writes:
293423 ll_ost_io02_061
[<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
[<0>] add_transaction_credits+0xd4/0x290 [jbd2]
[<0>] start_this_handle+0x10a/0x520 [jbd2]
[<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
[<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
[<0>] osd_trans_start+0x13b/0x500 [osd_ldiskfs]
[<0>] ofd_write_attr_set+0x11d/0x1070 [ofd]
[<0>] ofd_commitrw_write+0x226/0x1ad0 [ofd]
[<0>] ofd_commitrw+0x5b4/0xd20 [ofd]
[<0>] obd_commitrw+0x1b0/0x380 [ptlrpc]
[<0>] tgt_brw_write+0x139f/0x1ce0 [ptlrpc]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
A bunch of threads are doing reads (and shouldn't have journal handles), but appear to be stuck in memory allocations:
293418 ll_ost_io02_056
[<0>] shrink_lruvec+0x2f1/0x6c0
[<0>] shrink_node+0x22e/0x700
[<0>] do_try_to_free_pages+0xc9/0x3e0
[<0>] try_to_free_pages+0xf3/0x1c0
[<0>] __alloc_pages_slowpath+0x372/0xd10
[<0>] __alloc_pages_nodemask+0x2e2/0x320
[<0>] kmalloc_order+0x28/0x90
[<0>] kmalloc_order_trace+0x1d/0xb0
[<0>] __kmalloc+0x203/0x250
[<0>] bio_integrity_prep+0xee/0x2b0
[<0>] blk_mq_make_request+0xad/0x5b0
[<0>] generic_make_request_no_check+0xe1/0x330
[<0>] submit_bio+0x3c/0x160
[<0>] osd_do_bio.constprop.51+0x510/0xc40 [osd_ldiskfs]
[<0>] osd_read_prep+0x456/0x540 [osd_ldiskfs]
[<0>] ofd_preprw_read.isra.27+0x4a0/0x13b0 [ofd]
[<0>] ofd_preprw+0x6ef/0x900 [ofd]
[<0>] tgt_brw_read+0x6e0/0x1fc0 [ptlrpc]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
and
293127 ll_ost_io01_057
[<0>] __lock_page+0x12d/0x230
[<0>] pagecache_get_page+0x1e5/0x310
[<0>] osd_bufs_get+0x56a/0xb20 [osd_ldiskfs]
[<0>] ofd_preprw_read.isra.27+0x3dc/0x13b0 [ofd]
[<0>] ofd_preprw+0x6ef/0x900 [ofd]
[<0>] tgt_brw_read+0x6e0/0x1fc0 [ptlrpc]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
All of the above threads are blocked waiting to access the journal or for memory, so they shouldn't actually be doing anything to modify the filesystem yet. There are a handful of threads that are actually holding a transaction open:
293304 ll_ost_io05_051
[<0>] jbd2_log_wait_commit+0xac/0x120 [jbd2]
[<0>] jbd2_journal_stop+0x2d5/0x330 [jbd2]
[<0>] __ldiskfs_journal_stop+0x36/0xb0 [ldiskfs]
[<0>] osd_trans_stop+0x235/0xa60 [osd_ldiskfs]
[<0>] ofd_object_punch+0xdbf/0x1330 [ofd]
[<0>] ofd_punch_hdl+0x4b0/0x710 [ofd]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
and
293411 ll_ost_io04_061
[<0>] range_lock+0x148/0x270 [obdclass]
[<0>] ofd_preprw_write.isra.28+0x720/0x1240 [ofd]
[<0>] ofd_preprw+0x7b2/0x900 [ofd]
[<0>] obd_preprw+0x1a1/0x360 [ptlrpc]
[<0>] tgt_brw_write+0x11cf/0x1ce0 [ptlrpc]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
One possible cause of this blockage is a handful of threads that are holding an open transaction, but are blocked on memory allocation:
227209 ll_ost_io06_004
292970 ll_ost_io00_021
292989 ll_ost_io06_054
227821 ll_ost_io06_006
271578 ll_ost_io03_023
293355 ll_ost_io00_059
[<0>] shrink_lruvec+0x2f1/0x6c0
[<0>] shrink_node+0x22e/0x700
[<0>] do_try_to_free_pages+0xc9/0x3e0
[<0>] try_to_free_pages+0xf3/0x1c0
[<0>] __alloc_pages_slowpath+0x372/0xd10
[<0>] __alloc_pages_nodemask+0x2e2/0x320
[<0>] kmalloc_order+0x28/0x90
[<0>] kmalloc_order_trace+0x1d/0xb0
[<0>] __kmalloc+0x203/0x250
[<0>] bio_integrity_prep+0xee/0x2b0
[<0>] blk_mq_make_request+0xad/0x5b0
[<0>] generic_make_request_no_check+0xe1/0x330
[<0>] submit_bio+0x3c/0x160
[<0>] osd_do_bio.constprop.51+0xb63/0xc40 [osd_ldiskfs]
[<0>] osd_ldiskfs_map_inode_pages+0x873/0x8f0 [osd_ldiskfs]
[<0>] osd_write_commit+0x5e2/0x990 [osd_ldiskfs]
[<0>] ofd_commitrw_write+0x77e/0x1ad0 [ofd]
[<0>] ofd_commitrw+0x5b4/0xd20 [ofd]
[<0>] obd_commitrw+0x1b0/0x380 [ptlrpc]
[<0>] tgt_brw_write+0x139f/0x1ce0 [ptlrpc]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
It looks like the underlying storage is using T10-PI at the block level, which does an allocation late in the IO submission process; that is bad. The kmalloc() itself looks like it is using GFP_NOIO, which is good, so it doesn't recurse into the filesystem again, but there doesn't appear to be much memory available for the number of running threads, and this is causing it to block.
That may also be problematic because kswapd is blocked in the filesystem when trying to free an inode, so it is unable to do any memory reclaim:
349 kswapd0
[<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
[<0>] add_transaction_credits+0xd4/0x290 [jbd2]
[<0>] start_this_handle+0x10a/0x520 [jbd2]
[<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
[<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
[<0>] ldiskfs_release_dquot+0x60/0xb0 [ldiskfs]
[<0>] dqput.part.19+0x82/0x1e0
[<0>] __dquot_drop+0x69/0x90
[<0>] ldiskfs_clear_inode+0x1e/0x80 [ldiskfs]
[<0>] ldiskfs_evict_inode+0x58/0x6b0 [ldiskfs]
[<0>] evict+0xd2/0x1a0
[<0>] dispose_list+0x48/0x70
[<0>] prune_icache_sb+0x52/0x80
[<0>] super_cache_scan+0x123/0x1b0
[<0>] do_shrink_slab+0x11d/0x330
[<0>] shrink_slab+0xbe/0x2f0
[<0>] shrink_node+0x246/0x700
[<0>] balance_pgdat+0x2d7/0x550
[<0>] kswapd+0x201/0x3c0
I see the same kswapd0 process stuck in nbp13_hung, so that seems significant.
I don't have the configuration, system, and storage details, but it may just be that there are too many threads running on this system for the amount of RAM it has. You could try tuning oss_max_threads=256 or similar to reduce the concurrent thread count so the OSS does not run out of memory. If large RPCs are in use (e.g. obdfilter.*.brw_size=16 or =32), then this could be reduced so the OST threads are not preallocating so much memory for large RPCs (see the rough estimate below).
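As a rough, illustrative estimate only (the thread counts and brw_size values below are assumptions, not numbers taken from this system), the worst-case bulk-buffer footprint scales roughly as threads x brw_size:

#include <stdio.h>

int main(void)
{
	/* illustrative values; not measured on the affected servers */
	const int brw_size_mb[] = { 4, 16, 32 };	/* obdfilter.*.brw_size in MB */
	const int threads[] = { 256, 512, 1024 };	/* concurrent OST IO threads */

	for (unsigned t = 0; t < sizeof(threads) / sizeof(threads[0]); t++)
		for (unsigned b = 0; b < sizeof(brw_size_mb) / sizeof(brw_size_mb[0]); b++)
			printf("%4d threads x %2d MB brw_size ~ %5d MB of bulk buffers\n",
			       threads[t], brw_size_mb[b],
			       threads[t] * brw_size_mb[b]);
	return 0;
}

For example, 1024 IO threads with brw_size=32 could pin on the order of 32 GB in bulk buffers, a sizeable fraction of 192 GB before the journal, page cache, and T10-PI payloads are counted.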
OK, so in that case let's assume that this is fixed, unless future evidence comes to light that this is not the case.