Details
- Type: Bug
- Resolution: Duplicate
- Priority: Critical
Description
After upgrading to 2.15.2, the server hung with errors:
kernel: obd_memory max: 1854996506, obd_memory current: 1854996506
See the attached logs for the full set of messages.
Attachments
- nbp13_hung (70 kB)
- stack.out (80 kB)
- stack.out2 (82 kB)
- stack.out3 (80 kB)
Issue Links
- is related to LU-16413 T10PI is broken for CentOS 8.x (Resolved)
Activity
Peter,
I haven't been able to reproduce it on our test filesystem; I think it is too small. We may need to wait for a production filesystem during an extended dedicated time.
Mahmoud
The LU-16413 fix has been merged to b2_15 and will be in the upcoming 2.15.3 release. Have you tested the effectiveness of this release?
Peter
Andreas,
From the logs the kernel is 4.18.0-425.3.1.el8_lustre.x86_64.
I think the memory allocation issue we are seeing is because of LU-16413; the patch landed in master but not in 2.15.2.
Before that fix, the bio-integrity kernel patch is broken for 4.18 kernels: it removed the check in bio_integrity_prep() for whether the integrity payload is already allocated, and osd calls bio_integrity_prep() twice, so we are leaking the integrity payload. Both issues are fixed in LU-16413.
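To make the leak concrete, here is a minimal userspace sketch of the double-allocation pattern described above (not the actual kernel code; struct io and the payload_prep_*() helpers are hypothetical stand-ins for the bio and bio_integrity_prep()):

#include <stdlib.h>

struct io {
	void *integrity_payload;
};

static void payload_prep_broken(struct io *io)
{
	/* the "already allocated?" check is missing, so a second call
	 * overwrites the pointer and the first allocation is leaked */
	io->integrity_payload = malloc(4096);
}

static void payload_prep_fixed(struct io *io)
{
	/* restored guard: a repeated call becomes a no-op */
	if (io->integrity_payload)
		return;
	io->integrity_payload = malloc(4096);
}

int main(void)
{
	struct io io = { 0 };

	/* the prep function gets called twice per IO; without the guard
	 * the first 4 KB payload below is never freed */
	payload_prep_broken(&io);
	payload_prep_broken(&io);	/* leaks the first payload */
	free(io.integrity_payload);
	io.integrity_payload = NULL;

	payload_prep_fixed(&io);
	payload_prep_fixed(&io);	/* harmless second call */
	free(io.integrity_payload);
	return 0;
}

With the guard removed, every IO that goes through the prep path twice leaks one payload allocation.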
I will port LU-16413 to b2_15.
Lowering the number of threads helps, but eventually the service hits the issue. However, a host that had ib_iser T10-PI disabled was fine. So for now we have disabled T10-PI and lowered the number of threads.
I will try to see if I can get a reproducer to help debug the issue.
All the servers have 192GB of memory. Memory has never been an issue before the upgrade.
The filesystems that are crashing all have small files with a high number of IOPS. We changed the thread count on one filesystem and turned off ib_iser T10-PI checking. I saw threads waiting in bio_integrity_prep().
To me this looks like something is wrong on the disk backend: the system is extremely slow (all the traces are trying to get data from storage), and eventually clients start to reconnect; I imagine whatever dirty data there is is not being flushed very fast either.
Then the system thinks it is running out of memory and frantically asks everybody around to free some (that is the memory message you see), but Lustre has only allocated 1.8 GB of RAM, which is not all that much (how much RAM is there?).
From the stack traces, it looks like a lot of OSS threads are handling clients connecting to the OST, while at the same time a lot are disconnecting, and all of them are stuck starting journal transactions:
295816 ll_ost03_057
[<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
[<0>] add_transaction_credits+0xd4/0x290 [jbd2]
[<0>] start_this_handle+0x10a/0x520 [jbd2]
[<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
[<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
[<0>] osd_trans_start+0x13b/0x500 [osd_ldiskfs]
[<0>] tgt_client_data_update+0x468/0x6c0 [ptlrpc]
[<0>] tgt_client_new+0x5c2/0x880 [ptlrpc]
[<0>] ofd_obd_connect+0x385/0x4f0 [ofd]
[<0>] target_handle_connect+0x611/0x29a0 [ptlrpc]
[<0>] tgt_request_handle+0x569/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
[<0>] kthread+0x10a/0x120
[<0>] ret_from_fork+0x1f/0x40

295822 ll_ost03_063
[<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
[<0>] add_transaction_credits+0xd4/0x290 [jbd2]
[<0>] start_this_handle+0x10a/0x520 [jbd2]
[<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
[<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
[<0>] osd_trans_start+0x13b/0x500 [osd_ldiskfs]
[<0>] tgt_server_data_update+0x3db/0x5a0 [ptlrpc]
[<0>] tgt_client_del+0x368/0x710 [ptlrpc]
[<0>] ofd_obd_disconnect+0x1f8/0x210 [ofd]
[<0>] target_handle_disconnect+0x22f/0x500 [ptlrpc]
[<0>] tgt_disconnect+0x4a/0x1a0 [ptlrpc]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
When clients disconnect they need to force a synchronous journal commit, because otherwise, if the OSS crashes shortly thereafter, the clients will not be around to participate in recovery and recovery will take the maximum time. This might be significantly impacting the filesystem IO rate. Is this an HDD or NVMe filesystem?
There are also a handful of OSS threads handling file unlinks:
295743 ll_ost06_061
[<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
[<0>] add_transaction_credits+0xd4/0x290 [jbd2]
[<0>] start_this_handle+0x10a/0x520 [jbd2]
[<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
[<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
[<0>] osd_trans_start+0x13b/0x500 [osd_ldiskfs]
[<0>] ofd_destroy+0x247/0xb20 [ofd]
[<0>] ofd_destroy_by_fid+0x25e/0x4a0 [ofd]
[<0>] ofd_destroy_hdl+0x263/0xa10 [ofd]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
and
295763 ll_ost06_063
[<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
[<0>] add_transaction_credits+0xd4/0x290 [jbd2]
[<0>] start_this_handle+0x10a/0x520 [jbd2]
[<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
[<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
[<0>] ldiskfs_evict_inode+0x273/0x6b0 [ldiskfs]
[<0>] evict+0xd2/0x1a0
[<0>] osd_object_delete+0x21a/0x320 [osd_ldiskfs]
[<0>] lu_object_free.isra.37+0x90/0x1e0 [obdclass]
[<0>] ofd_destroy_by_fid+0x2d6/0x4a0 [ofd]
[<0>] ofd_destroy_hdl+0x263/0xa10 [ofd]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
Some threads are doing writes:
293423 ll_ost_io02_061
[<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
[<0>] add_transaction_credits+0xd4/0x290 [jbd2]
[<0>] start_this_handle+0x10a/0x520 [jbd2]
[<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
[<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
[<0>] osd_trans_start+0x13b/0x500 [osd_ldiskfs]
[<0>] ofd_write_attr_set+0x11d/0x1070 [ofd]
[<0>] ofd_commitrw_write+0x226/0x1ad0 [ofd]
[<0>] ofd_commitrw+0x5b4/0xd20 [ofd]
[<0>] obd_commitrw+0x1b0/0x380 [ptlrpc]
[<0>] tgt_brw_write+0x139f/0x1ce0 [ptlrpc]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
A bunch of threads are doing reads (and shouldn't have journal handles), but appear to be stuck in memory allocations:
293418 ll_ost_io02_056
[<0>] shrink_lruvec+0x2f1/0x6c0
[<0>] shrink_node+0x22e/0x700
[<0>] do_try_to_free_pages+0xc9/0x3e0
[<0>] try_to_free_pages+0xf3/0x1c0
[<0>] __alloc_pages_slowpath+0x372/0xd10
[<0>] __alloc_pages_nodemask+0x2e2/0x320
[<0>] kmalloc_order+0x28/0x90
[<0>] kmalloc_order_trace+0x1d/0xb0
[<0>] __kmalloc+0x203/0x250
[<0>] bio_integrity_prep+0xee/0x2b0
[<0>] blk_mq_make_request+0xad/0x5b0
[<0>] generic_make_request_no_check+0xe1/0x330
[<0>] submit_bio+0x3c/0x160
[<0>] osd_do_bio.constprop.51+0x510/0xc40 [osd_ldiskfs]
[<0>] osd_read_prep+0x456/0x540 [osd_ldiskfs]
[<0>] ofd_preprw_read.isra.27+0x4a0/0x13b0 [ofd]
[<0>] ofd_preprw+0x6ef/0x900 [ofd]
[<0>] tgt_brw_read+0x6e0/0x1fc0 [ptlrpc]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
and
293127 ll_ost_io01_057
[<0>] __lock_page+0x12d/0x230
[<0>] pagecache_get_page+0x1e5/0x310
[<0>] osd_bufs_get+0x56a/0xb20 [osd_ldiskfs]
[<0>] ofd_preprw_read.isra.27+0x3dc/0x13b0 [ofd]
[<0>] ofd_preprw+0x6ef/0x900 [ofd]
[<0>] tgt_brw_read+0x6e0/0x1fc0 [ptlrpc]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
All of the above threads are blocked waiting to access the journal or for memory, so they shouldn't actually be doing anything to modify the filesystem yet. There are a handful of threads that are actually holding a transaction open:
293304 ll_ost_io05_051
[<0>] jbd2_log_wait_commit+0xac/0x120 [jbd2]
[<0>] jbd2_journal_stop+0x2d5/0x330 [jbd2]
[<0>] __ldiskfs_journal_stop+0x36/0xb0 [ldiskfs]
[<0>] osd_trans_stop+0x235/0xa60 [osd_ldiskfs]
[<0>] ofd_object_punch+0xdbf/0x1330 [ofd]
[<0>] ofd_punch_hdl+0x4b0/0x710 [ofd]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
and
293411 ll_ost_io04_061
[<0>] range_lock+0x148/0x270 [obdclass]
[<0>] ofd_preprw_write.isra.28+0x720/0x1240 [ofd]
[<0>] ofd_preprw+0x7b2/0x900 [ofd]
[<0>] obd_preprw+0x1a1/0x360 [ptlrpc]
[<0>] tgt_brw_write+0x11cf/0x1ce0 [ptlrpc]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
One possible cause of this blockage is a handful of threads that are holding an open transaction, but are blocked on memory allocation:
227209 ll_ost_io06_004
292970 ll_ost_io00_021
292989 ll_ost_io06_054
227821 ll_ost_io06_006
271578 ll_ost_io03_023
293355 ll_ost_io00_059
[<0>] shrink_lruvec+0x2f1/0x6c0
[<0>] shrink_node+0x22e/0x700
[<0>] do_try_to_free_pages+0xc9/0x3e0
[<0>] try_to_free_pages+0xf3/0x1c0
[<0>] __alloc_pages_slowpath+0x372/0xd10
[<0>] __alloc_pages_nodemask+0x2e2/0x320
[<0>] kmalloc_order+0x28/0x90
[<0>] kmalloc_order_trace+0x1d/0xb0
[<0>] __kmalloc+0x203/0x250
[<0>] bio_integrity_prep+0xee/0x2b0
[<0>] blk_mq_make_request+0xad/0x5b0
[<0>] generic_make_request_no_check+0xe1/0x330
[<0>] submit_bio+0x3c/0x160
[<0>] osd_do_bio.constprop.51+0xb63/0xc40 [osd_ldiskfs]
[<0>] osd_ldiskfs_map_inode_pages+0x873/0x8f0 [osd_ldiskfs]
[<0>] osd_write_commit+0x5e2/0x990 [osd_ldiskfs]
[<0>] ofd_commitrw_write+0x77e/0x1ad0 [ofd]
[<0>] ofd_commitrw+0x5b4/0xd20 [ofd]
[<0>] obd_commitrw+0x1b0/0x380 [ptlrpc]
[<0>] tgt_brw_write+0x139f/0x1ce0 [ptlrpc]
[<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
It looks like the underlying storage is using T10-PI at the block level, which does an allocation late in the IO submission process; that is bad. The kmalloc() itself looks like it is using GFP_NOIO, which is good, so it doesn't recurse into the filesystem again, but there doesn't appear to be much memory available for the number of running threads, and this is causing it to block.
That may also be problematic because kswapd is blocked in the filesystem when trying to free an inode, so it is unable to do any memory reclaim:
349 kswapd0
[<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
[<0>] add_transaction_credits+0xd4/0x290 [jbd2]
[<0>] start_this_handle+0x10a/0x520 [jbd2]
[<0>] jbd2__journal_start+0xee/0x1f0 [jbd2]
[<0>] __ldiskfs_journal_start_sb+0x6e/0x140 [ldiskfs]
[<0>] ldiskfs_release_dquot+0x60/0xb0 [ldiskfs]
[<0>] dqput.part.19+0x82/0x1e0
[<0>] __dquot_drop+0x69/0x90
[<0>] ldiskfs_clear_inode+0x1e/0x80 [ldiskfs]
[<0>] ldiskfs_evict_inode+0x58/0x6b0 [ldiskfs]
[<0>] evict+0xd2/0x1a0
[<0>] dispose_list+0x48/0x70
[<0>] prune_icache_sb+0x52/0x80
[<0>] super_cache_scan+0x123/0x1b0
[<0>] do_shrink_slab+0x11d/0x330
[<0>] shrink_slab+0xbe/0x2f0
[<0>] shrink_node+0x246/0x700
[<0>] balance_pgdat+0x2d7/0x550
[<0>] kswapd+0x201/0x3c0
I see the same kswapd0 process stuck in nbp13_hung, so that seems significant.
I don't have the configuration, system, and storage details, but it may just be that there are too many threads running on this system for the amount of RAM it has. You could try tuning oss_max_threads=256 or similar to reduce the concurrent thread count so the OSS does not run out of memory. If large RPCs are in use (e.g. obdfilter.*.brw_size=16 or =32), then this could be reduced so the OST threads are not preallocating so much memory for large RPCs (see the rough estimate below).
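As a rough, illustrative estimate only (the thread counts and brw_size values below are assumptions, not numbers taken from this system), the worst-case bulk-buffer footprint scales roughly as threads x brw_size:

#include <stdio.h>

int main(void)
{
	/* illustrative values; not measured on the affected servers */
	const int brw_size_mb[] = { 4, 16, 32 };	/* obdfilter.*.brw_size in MB */
	const int threads[] = { 256, 512, 1024 };	/* concurrent OST IO threads */

	for (unsigned t = 0; t < sizeof(threads) / sizeof(threads[0]); t++)
		for (unsigned b = 0; b < sizeof(brw_size_mb) / sizeof(brw_size_mb[0]); b++)
			printf("%4d threads x %2d MB brw_size ~ %5d MB of bulk buffers\n",
			       threads[t], brw_size_mb[b],
			       threads[t] * brw_size_mb[b]);
	return 0;
}

For example, 1024 IO threads with brw_size=32 could pin on the order of 32 GB in bulk buffers, a sizeable fraction of 192 GB before the journal, page cache, and T10-PI payloads are counted.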
OK, so in that case let's assume that this is fixed, unless future evidence comes to light that this is not the case.