[LU-13063] sanity test 411 times out for RHEL8.1 Created: 11/Dec/19 Updated: 29/Mar/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Dongyang Li |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | always_except, rhel8 |
| Environment: | RHEL 8.1 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The last thing seen in the suite_log for sanity test 411 is:

== sanity test 411: Slab allocation error with cgroup does not LBUG ================================== 04:54:33 (1575953673)
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 3.88888 s, 27.0 MB/s

Normally, on successful runs, we would see a dd error reading the file just created, but the test hangs at this point.

Looking at the console logs, it's not clear why the test is hanging, but we see lnet-selftest processes hung. Looking at the stack trace on the first client (vm10), we see that there is an lnet-selftest process stuck in D state:

[14127.185129] lst_t_00_00 S 0 14488 2 0x80000080
[14127.186075] Call Trace:
[14127.186561] ? __schedule+0x253/0x830
[14127.187236] ? sfw_test_unit_done.isra.14+0x9d/0x150 [lnet_selftest]
[14127.188348] schedule+0x28/0x70
[14127.188929] cfs_wi_scheduler+0x40d/0x420 [libcfs]
[14127.189783] ? finish_wait+0x80/0x80
[14127.190466] ? cfs_wi_sched_create+0x5a0/0x5a0 [libcfs]
[14127.191397] kthread+0x112/0x130
[14127.191984] ? kthread_flush_work_fn+0x10/0x10
[14127.192782] ret_from_fork+0x35/0x40
[14127.193448] st_timer D 0 14636 2 0x80000080
[14127.194413] Call Trace:
[14127.194882] ? __schedule+0x253/0x830
[14127.195555] schedule+0x28/0x70
[14127.196142] schedule_timeout+0x16b/0x390
[14127.196859] ? __next_timer_interrupt+0xc0/0xc0
[14127.197678] ? prepare_to_wait_event+0xbb/0x140
[14127.198496] stt_timer_main+0x215/0x230 [lnet_selftest]
[14127.199436] ? finish_wait+0x80/0x80
[14127.200083] ? sfw_startup+0x540/0x540 [lnet_selftest]
[14127.200989] kthread+0x112/0x130
[14127.201595] ? kthread_flush_work_fn+0x10/0x10
[14127.202393] ret_from_fork+0x35/0x40

Similarly, in the stack-trace log on the MDS (vm12), we see the lnet process:

[14034.774700] st_timer D ffff9cb15b62a080 0 28114 2 0x00000080
[14034.776068] Call Trace:
[14034.776493] [<ffffffffb0f6af19>] schedule+0x29/0x70
[14034.777425] [<ffffffffb0f68968>] schedule_timeout+0x168/0x2d0
[14034.778391] [<ffffffffb08cfeb4>] ? __wake_up+0x44/0x50
[14034.779358] [<ffffffffb08aab30>] ? __internal_add_timer+0x130/0x130
[14034.780432] [<ffffffffb08c3a46>] ? prepare_to_wait+0x56/0x90
[14034.781474] [<ffffffffc1542a98>] stt_timer_main+0x168/0x220 [lnet_selftest]
[14034.782654] [<ffffffffb08c3f50>] ? wake_up_atomic_t+0x30/0x30
[14034.783688] [<ffffffffc1542930>] ? sfw_startup+0x580/0x580 [lnet_selftest]
[14034.784856] [<ffffffffb08c2e81>] kthread+0xd1/0xe0
[14034.785787] [<ffffffffb08c2db0>] ? insert_kthread_work+0x40/0x40
[14034.786818] [<ffffffffb0f77c37>] ret_from_fork_nospec_begin+0x21/0x21
[14034.788077] [<ffffffffb08c2db0>] ? insert_kthread_work+0x40/0x40

lnet-selftest did run and fail (

We've seen this test hang twice for RHEL 8.1 testing, both in December.

In addition, we've seen this once in the past 3 months in PPC testing for a patch for |
| Comments |
| Comment by James A Simmons [ 11/Dec/19 ] |
|
I have a patch that replaces cfs_wi_scheduler with a workqueue. Give me a bit to get the patch working and I will push it. |
| Comment by Andreas Dilger [ 12/Dec/19 ] |
|
The patch looks like https://review.whamcloud.com/36991 "LU-9859 lnet: convert selftest to use workqueues" |
| Comment by James Nunez (Inactive) [ 18/Dec/19 ] |
|
sanity test 411 still hangs for RHEL 8.1 when rebased on top of James's patch https://review.whamcloud.com/#/c/36991/. Results at https://testing.whamcloud.com/test_sets/ca87349e-21d2-11ea-80b4-52540065bddc |
| Comment by Peter Jones [ 19/Dec/19 ] |
|
dongyang could you please advise here? |
| Comment by Dongyang Li [ 24/Dec/19 ] |
|
From one of the test logs:

[14170.962056] dd R running task 0 27130 26934 0x00000080
[14170.963279] Call Trace:
[14170.963757] ? finish_task_switch+0x76/0x2b0
[14170.964525] ? __schedule+0x25b/0x830
[14170.965207] ? shrink_page_list+0x48e/0xc50
[14170.965970] ? page_evictable+0xe/0x40
[14170.966675] ? putback_inactive_pages+0x1f8/0x4f0
[14170.967517] ? shrink_inactive_list+0x207/0x570
[14170.968339] ? inactive_list_is_low+0xe0/0x220
[14170.969148] ? shrink_node_memcg+0x204/0x770
[14170.969920] ? shrink_node+0xce/0x440
[14170.970589] ? do_try_to_free_pages+0xc3/0x360
[14170.971396] ? try_to_free_mem_cgroup_pages+0xf9/0x210
[14170.972316] ? try_charge+0x192/0x780
[14170.972992] ? mem_cgroup_commit_charge+0x7a/0x560
[14170.973856] ? mem_cgroup_try_charge+0x8b/0x1a0
[14170.974677] ? __add_to_page_cache_locked+0x64/0x240
[14170.975560] ? add_to_page_cache_lru+0x4a/0xc0
[14170.976366] ? pagecache_get_page+0xf2/0x2c0
[14170.977294] ? ll_read_ahead_pages+0x1e8/0x8b0 [lustre]
[14170.978305] ? osc_io_fini+0x10/0x10 [osc]
[14170.979072] ? ll_readahead.constprop.32+0x641/0x9d0 [lustre]
[14170.980105] ? ll_io_read_page+0x355/0x4c0 [lustre]
[14170.980992] ? ll_readpage+0xe3/0x650 [lustre]
[14170.981801] ? find_get_entry+0x19/0xf0
[14170.982508] ? pagecache_get_page+0x30/0x2c0
[14170.983286] ? generic_file_buffered_read+0x601/0xb10
[14170.984188] ? atime_needs_update+0x77/0xe0
[14170.984960] ? vvp_io_read_start+0x3ef/0x720 [lustre]
[14170.986080] ? cl_lock_request+0x62/0x1b0 [obdclass]
[14170.986989] ? cl_io_start+0x58/0x100 [obdclass]
[14170.987851] ? cl_io_loop+0xdc/0x1b0 [obdclass]
[14170.988682] ? ll_file_io_generic+0x23d/0x960 [lustre]
[14170.989611] ? ll_file_read_iter+0x244/0x2e0 [lustre]
[14170.990510] ? new_sync_read+0x121/0x170
[14170.991233] ? vfs_read+0x91/0x140
[14170.991864] ? ksys_read+0x4f/0xb0
[14170.992493] ? do_syscall_64+0x5b/0x1b0
[14170.993198] ? entry_SYSCALL_64_after_hwframe+0x65/0xca

Looks like dd is not actually stuck: we are reading a page, readahead kicks in and tries to allocate a page in the page cache, hits the cgroup limit, and then just keeps trying to reclaim in a loop? |
| Comment by Andreas Dilger [ 07/Jan/20 ] |
|
Dongyang, in the case of a cgroup limit being hit, is an error returned to the caller, or where is the allocation looping? It seems that ll_read_ahead_page() is already calling grab_cache_page_nowait(), which calls into pagecache_get_page() with FGP_NOWAIT, so it shouldn't block for pages that cannot be allocated. Maybe this is a bug in the kernel? |
| Comment by Dongyang Li [ 08/Jan/20 ] |
|
Andreas, if we hit the cgroup limit in grab_cache_page_nowait(), it calls into mem_cgroup_try_charge() and eventually OOMs inside the cgroup. Before doing that, try_charge() will try to reclaim pages, which is what we see here. Yes, ll_read_ahead_page() is using FGP_NOWAIT, but that only means we won't wait for the page lock when getting the page reference; mapping->gfp_mask is still used by the cgroup reclaim. It does feel like a bug in the kernel, but I'm having difficulty locating it, as well as reproducing it. |
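For context, FGP_NOWAIT only affects the trylock on an already-present page in pagecache_get_page(); the allocation and memcg charge path is untouched by it. A rough excerpt of that part of the function (mm/filemap.c, quoted from memory, so the exact RHEL 8.1 source may differ slightly):

	page = find_get_entry(mapping, offset);
	...
	if (fgp_flags & FGP_LOCK) {
		if (fgp_flags & FGP_NOWAIT) {
			/* FGP_NOWAIT: bail out rather than sleeping on the
			 * page lock -- this is all the flag buys us */
			if (!trylock_page(page)) {
				put_page(page);
				return NULL;
			}
		} else {
			lock_page(page);
		}
	}

The memcg charge happens later, when a missing page is allocated and added to the LRU, which is where the reclaim loop in the dd stack trace is entered. |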
| Comment by Gerrit Updater [ 08/Jan/20 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37165 |
| Comment by Andreas Dilger [ 09/Jan/20 ] |
|
I was wondering about that. Are we sure that gfp_mask excludes __GFP_FS at this point? The comment says this, but it isn't clear to me how that is working, since it doesn't actually clear __GFP_FS that I can see:

/*
 * Clear __GFP_FS when allocating the page to avoid recursion into the fs
 * and deadlock against the caller's locked page.
 */
static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
				pgoff_t index)
{
	return pagecache_get_page(mapping, index,
			FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT,
			mapping_gfp_mask(mapping));
}

If we can reproduce this failure fairly easily, we might consider using our own version of this function that explicitly clears __GFP_FS, though this is just speculation (I don't actually see recursion in the stack):

/* Clear __GFP_FS when allocating the page to avoid recursion into the fs
 * and deadlock against the caller's locked page. */
static inline struct page *ll_grab_cache_page_nowait(struct address_space *mapping,
				pgoff_t index)
{
	return pagecache_get_page(mapping, index,
			FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT,
			mapping_gfp_mask(mapping) & ~__GFP_FS);
}

It would be worthwhile to see what has changed under linux/mm/ between the RHEL 8.0 and 8.1 kernels, to see if that identifies what is causing the failures. |
| Comment by Dongyang Li [ 09/Jan/20 ] |
|
__GFP_FS is cleared in pagecache_get_page() before we use the flag to allocate the page and add it to the LRU. The comments on grab_cache_page_nowait() are misleading; see kernel commit 45f87de57f8fad59302fd263dd81ffa4843b5b24. grab_cache_page_nowait() passes FGP_NOFS to pagecache_get_page(), so we should be good. |
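For reference, the FGP_CREAT branch of pagecache_get_page() that handles a missing page looks roughly like this (abridged and quoted from memory, so the exact RHEL 8.1 source may differ slightly):

	if (!page && (fgp_flags & FGP_CREAT)) {
		int err;

		/* this is where grab_cache_page_nowait()'s FGP_NOFS takes
		 * effect: __GFP_FS is stripped before the allocation */
		if (fgp_flags & FGP_NOFS)
			gfp_mask &= ~__GFP_FS;

		page = __page_cache_alloc(gfp_mask);
		if (!page)
			return NULL;

		/* the same (now GFP_NOFS) mask is used to charge the new
		 * page to the memory cgroup, which is where the reclaim
		 * loop in the dd stack trace is entered */
		err = add_to_page_cache_lru(page, mapping, offset, gfp_mask);
		...
	}

So the readahead allocation is effectively a GFP_NOFS memcg charge, even though __GFP_FS is not cleared in grab_cache_page_nowait() itself. |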
| Comment by Andreas Dilger [ 09/Jan/20 ] |
|
Can you please compare the RHEL 8.0 and 8.1 mm/ trees and cgroup code to see what has changed? The test fails 100% of the time on RHEL 8.1 and never on 8.0, so there must be some kernel code change that is causing this. |
| Comment by Dongyang Li [ 10/Jan/20 ] |
|
The changes are significant and I'm going through them. On the plus side, after enabling centos8-stream I'm able to get 4.18.0-147.3.1.el8_1.x86_64 and the newer 4.18.0-151.el8.x86_64, and I can reproduce it on my VM with both kernels. You are right that it's a regression in 8.1; 8.0 is fine. The issue actually has nothing to do with Lustre and can be reproduced doing dd from a local file. |
| Comment by Dongyang Li [ 13/Jan/20 ] |
|
OK, it turns out Red Hat is missing a patch for the 8.1 kernel: f9c645621a28e37813a1de96d9cbd89cde94a1e4 ("memcg, oom: don't require __GFP_FS when invoking memcg OOM killer"). How do we proceed? I can create a new 4.18-rhel8.1.series under kernel_patches and add the patch there, but that's for the server side; we need the kernel patch on the clients... |
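For reference, the missing upstream commit is a one-line change to the early bail-out check in out_of_memory() (mm/oom_kill.c). Quoted from memory, so the exact backport may differ slightly:

	/* before the fix: any !__GFP_FS charge that hits the memcg limit
	 * returns here without killing anything, so try_charge() keeps
	 * retrying reclaim forever -- the loop seen in the dd stack trace */
	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
		return true;

	/* after f9c645621a28: still skip the global OOM killer for GFP_NOFS
	 * allocations, but let the memcg OOM killer run */
	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
		return true;

With the fix, the GFP_NOFS readahead charge either succeeds after the memcg OOM killer frees memory or fails with an error, so dd gets the expected read error instead of hanging. |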
| Comment by Andreas Dilger [ 13/Jan/20 ] |
|
It makes sense to add the patch to the RHEL 8.1 series, and we can test with the server packages installed on the client. As for opening a ticket with Red Hat, I hope Oleg or Peter know the details on that. |
| Comment by Gerrit Updater [ 13/Jan/20 ] |
|
Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/37197 |
| Comment by Dongyang Li [ 16/Jan/20 ] |
|
How can we tell Maloo to use the server packages (where the kernel patch actually applies) on the clients? I looked into https://wiki.whamcloud.com/display/PUB/Changing+Test+Parameters+with+Gerrit+Commit+Messages but could not find anything related. |
| Comment by Minh Diep [ 17/Jan/20 ] |
|
Currently you can't install the Lustre server packages on a client (unless you patch the client). |
| Comment by Gerrit Updater [ 17/Jan/20 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37270 |
| Comment by Gerrit Updater [ 17/Jan/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37272 |
| Comment by Gerrit Updater [ 28/Jan/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37270/ |
| Comment by Jian Yu [ 29/Jan/20 ] |
|
The failure also occurred on Lustre b2_12 branch: https://testing.whamcloud.com/test_sets/0e3d7c8e-3ea7-11ea-9543-52540065bddc |
| Comment by Gerrit Updater [ 30/Jan/20 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37376 |
| Comment by Gerrit Updater [ 01/Feb/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37272/ |
| Comment by Gerrit Updater [ 16/Apr/20 ] |
|
Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38260 |
| Comment by Gerrit Updater [ 01/May/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38260/ |
| Comment by Patrick Farrell [ 29/Mar/23 ] |
|
I don't think we're testing on RHEL 8.1 clients any more, are we? I've removed this ALWAYS_EXCEPT as part of https://review.whamcloud.com/c/fs/lustre-release/+/50460 and hopefully I'm right about the testing. |