[LU-13063] sanity test 411 times out for RHEL8.1 Created: 11/Dec/19  Updated: 29/Mar/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Dongyang Li
Resolution: Unresolved Votes: 0
Labels: always_except, rhel8
Environment:

RHEL 8.1


Issue Links:
Related
is related to LU-10073 lnet-selftest test_smoke: lst Error f... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The last thing seen in the suite_log for sanity test 411 is

== sanity test 411: Slab allocation error with cgroup does not LBUG ================================== 04:54:33 (1575953673)
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 3.88888 s, 27.0 MB/s

Normally, on successful runs, we would see a dd error reading the file just created, but the test hangs at this point. Looking at the console logs, it's not clear why the test is hanging, but we see lnet-selftest processes hung. Looking at the stack trace on the first client (vm10), we see that there is an lnet-selftest process stuck in the D state:

[14127.185129] lst_t_00_00     S    0 14488      2 0x80000080
[14127.186075] Call Trace:
[14127.186561]  ? __schedule+0x253/0x830
[14127.187236]  ? sfw_test_unit_done.isra.14+0x9d/0x150 [lnet_selftest]
[14127.188348]  schedule+0x28/0x70
[14127.188929]  cfs_wi_scheduler+0x40d/0x420 [libcfs]
[14127.189783]  ? finish_wait+0x80/0x80
[14127.190466]  ? cfs_wi_sched_create+0x5a0/0x5a0 [libcfs]
[14127.191397]  kthread+0x112/0x130
[14127.191984]  ? kthread_flush_work_fn+0x10/0x10
[14127.192782]  ret_from_fork+0x35/0x40
[14127.193448] st_timer        D    0 14636      2 0x80000080
[14127.194413] Call Trace:
[14127.194882]  ? __schedule+0x253/0x830
[14127.195555]  schedule+0x28/0x70
[14127.196142]  schedule_timeout+0x16b/0x390
[14127.196859]  ? __next_timer_interrupt+0xc0/0xc0
[14127.197678]  ? prepare_to_wait_event+0xbb/0x140
[14127.198496]  stt_timer_main+0x215/0x230 [lnet_selftest]
[14127.199436]  ? finish_wait+0x80/0x80
[14127.200083]  ? sfw_startup+0x540/0x540 [lnet_selftest]
[14127.200989]  kthread+0x112/0x130
[14127.201595]  ? kthread_flush_work_fn+0x10/0x10
[14127.202393]  ret_from_fork+0x35/0x40

Similarly, in the stack-trace log on the MDS (vm12), we see the lnet_selftest st_timer thread:

[14034.774700] st_timer        D ffff9cb15b62a080     0 28114      2 0x00000080
[14034.776068] Call Trace:
[14034.776493]  [<ffffffffb0f6af19>] schedule+0x29/0x70
[14034.777425]  [<ffffffffb0f68968>] schedule_timeout+0x168/0x2d0
[14034.778391]  [<ffffffffb08cfeb4>] ? __wake_up+0x44/0x50
[14034.779358]  [<ffffffffb08aab30>] ? __internal_add_timer+0x130/0x130
[14034.780432]  [<ffffffffb08c3a46>] ? prepare_to_wait+0x56/0x90
[14034.781474]  [<ffffffffc1542a98>] stt_timer_main+0x168/0x220 [lnet_selftest]
[14034.782654]  [<ffffffffb08c3f50>] ? wake_up_atomic_t+0x30/0x30
[14034.783688]  [<ffffffffc1542930>] ? sfw_startup+0x580/0x580 [lnet_selftest]
[14034.784856]  [<ffffffffb08c2e81>] kthread+0xd1/0xe0
[14034.785787]  [<ffffffffb08c2db0>] ? insert_kthread_work+0x40/0x40
[14034.786818]  [<ffffffffb0f77c37>] ret_from_fork_nospec_begin+0x21/0x21
[14034.788077]  [<ffffffffb08c2db0>] ? insert_kthread_work+0x40/0x40

lnet-selftest did run and fail (LU-10073) before sanity ran. It's not clear whether lnet-selftest is a cause of this test hang.

We've seen this test hang twice in RHEL 8.1 testing, both in December:
https://testing.whamcloud.com/test_sets/293b5216-1b13-11ea-a9d7-52540065bddc
https://testing.whamcloud.com/test_sets/133daa46-1b8a-11ea-b1e8-52540065bddc

In addition, we've seen this once in the past 3 months in PPC testing for a patch for LU-11997 at https://testing.whamcloud.com/test_sets/b4851392-f175-11e9-b62b-52540065bddc.



 Comments   
Comment by James A Simmons [ 11/Dec/19 ]

I have a patch that replaces cfs_wi_scheduler with a workqueue. Give me a bit to get the patch working and I will push it.
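
For context, converting from the libcfs cfs_wi_scheduler threads to a kernel workqueue basically means queueing work items to a workqueue and letting the kernel manage the worker threads. Below is only a generic sketch of that pattern, not the actual patch; the names lst_wq, lst_work and lst_queue_unit are made up for illustration:

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/slab.h>

struct lst_work {                               /* hypothetical per-item state */
        struct work_struct work;
        int unit_id;
};

static struct workqueue_struct *lst_wq;         /* replaces the cfs_wi scheduler thread */

static void lst_work_fn(struct work_struct *work)
{
        struct lst_work *lw = container_of(work, struct lst_work, work);

        /* process one selftest work item here */
        kfree(lw);
}

static void lst_queue_unit(int unit_id)
{
        struct lst_work *lw = kzalloc(sizeof(*lw), GFP_NOFS);

        if (!lw)
                return;
        lw->unit_id = unit_id;
        INIT_WORK(&lw->work, lst_work_fn);
        queue_work(lst_wq, &lw->work);
}

static int __init lst_sketch_init(void)
{
        lst_wq = alloc_workqueue("lst_wq", WQ_UNBOUND, 0);
        if (!lst_wq)
                return -ENOMEM;
        lst_queue_unit(0);                      /* example submission */
        return 0;
}

static void __exit lst_sketch_exit(void)
{
        destroy_workqueue(lst_wq);              /* drains pending work first */
}

module_init(lst_sketch_init);
module_exit(lst_sketch_exit);
MODULE_LICENSE("GPL");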

Comment by Andreas Dilger [ 12/Dec/19 ]

The patch looks like https://review.whamcloud.com/36991 "LU-9859 lnet: convert selftest to use workqueues"

Comment by James Nunez (Inactive) [ 18/Dec/19 ]

sanity test 411 still hangs for RHEL 8.1 when rebased on top of James's patch https://review.whamcloud.com/#/c/36991/. Results are at https://testing.whamcloud.com/test_sets/ca87349e-21d2-11ea-80b4-52540065bddc

Comment by Peter Jones [ 19/Dec/19 ]

Dongyang, could you please advise here?

Comment by Dongyang Li [ 24/Dec/19 ]

from one of the test logs:

[14170.962056] dd              R  running task        0 27130  26934 0x00000080
[14170.963279] Call Trace:
[14170.963757]  ? finish_task_switch+0x76/0x2b0
[14170.964525]  ? __schedule+0x25b/0x830
[14170.965207]  ? shrink_page_list+0x48e/0xc50
[14170.965970]  ? page_evictable+0xe/0x40
[14170.966675]  ? putback_inactive_pages+0x1f8/0x4f0
[14170.967517]  ? shrink_inactive_list+0x207/0x570
[14170.968339]  ? inactive_list_is_low+0xe0/0x220
[14170.969148]  ? shrink_node_memcg+0x204/0x770
[14170.969920]  ? shrink_node+0xce/0x440
[14170.970589]  ? do_try_to_free_pages+0xc3/0x360
[14170.971396]  ? try_to_free_mem_cgroup_pages+0xf9/0x210
[14170.972316]  ? try_charge+0x192/0x780
[14170.972992]  ? mem_cgroup_commit_charge+0x7a/0x560
[14170.973856]  ? mem_cgroup_try_charge+0x8b/0x1a0
[14170.974677]  ? __add_to_page_cache_locked+0x64/0x240
[14170.975560]  ? add_to_page_cache_lru+0x4a/0xc0
[14170.976366]  ? pagecache_get_page+0xf2/0x2c0
[14170.977294]  ? ll_read_ahead_pages+0x1e8/0x8b0 [lustre]
[14170.978305]  ? osc_io_fini+0x10/0x10 [osc]
[14170.979072]  ? ll_readahead.constprop.32+0x641/0x9d0 [lustre]
[14170.980105]  ? ll_io_read_page+0x355/0x4c0 [lustre]
[14170.980992]  ? ll_readpage+0xe3/0x650 [lustre]
[14170.981801]  ? find_get_entry+0x19/0xf0
[14170.982508]  ? pagecache_get_page+0x30/0x2c0
[14170.983286]  ? generic_file_buffered_read+0x601/0xb10
[14170.984188]  ? atime_needs_update+0x77/0xe0
[14170.984960]  ? vvp_io_read_start+0x3ef/0x720 [lustre]
[14170.986080]  ? cl_lock_request+0x62/0x1b0 [obdclass]
[14170.986989]  ? cl_io_start+0x58/0x100 [obdclass]
[14170.987851]  ? cl_io_loop+0xdc/0x1b0 [obdclass]
[14170.988682]  ? ll_file_io_generic+0x23d/0x960 [lustre]
[14170.989611]  ? ll_file_read_iter+0x244/0x2e0 [lustre]
[14170.990510]  ? new_sync_read+0x121/0x170
[14170.991233]  ? vfs_read+0x91/0x140
[14170.991864]  ? ksys_read+0x4f/0xb0
[14170.992493]  ? do_syscall_64+0x5b/0x1b0
[14170.993198]  ? entry_SYSCALL_64_after_hwframe+0x65/0xca

It looks like dd is not stuck: we are reading a page and readahead kicks in, trying to allocate a page in the page cache; it hits the cgroup limit and then just keeps trying to reclaim in a loop?

Comment by Andreas Dilger [ 07/Jan/20 ]

Dongyang, in the case of a cgroup limit being hit, is there an error returned to the caller, or where is the allocation looping? It seems that ll_read_ahead_page() is already calling grab_cache_page_nowait() into pagecache_get_page(FGP_NOWAIT) so it shouldn't be blocking for pages that cannot be allocated. Maybe this is a bug in the kernel?

Comment by Dongyang Li [ 08/Jan/20 ]

Andreas, if we hit the cgroup limit in grab_cache_page_nowait(), it calls into mem_cgroup_try_charge() and eventually OOMs inside the cgroup.

Before doing that, try_charge() will try to reclaim pages; that's what we see here.

Yes, ll_read_ahead_page() is using FGP_NOWAIT, but that only means we won't wait for the page lock when getting the page ref. mapping->gfp_mask is still used by the cgroup reclaim.

It does feel like a bug in the kernel, but I'm having difficulty locating it, as well as reproducing it.

Comment by Gerrit Updater [ 08/Jan/20 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37165
Subject: LU-13063 testing: reproduce sanity 411 error
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a0560ae8ebcc4559be7823ab0a5975e057979f4a

Comment by Andreas Dilger [ 09/Jan/20 ]

mapping->gfp_mask is still used by the cgroup reclaim.

I was wondering about that. Are we sure that gfp_mask excludes __GFP_FS at this point? The comment says this, but it isn't clear to me how that is working since it doesn't actually clear __GFP_FS that I can see:

 * Clear __GFP_FS when allocating the page to avoid recursion into the fs
 * and deadlock against the caller's locked page.
 */             
static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
                                pgoff_t index)
{       
        return pagecache_get_page(mapping, index,
                        FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT,
                        mapping_gfp_mask(mapping));
}

If we can reproduce this failure fairly easily, we might consider using our own version of this function that explicitly clears __GFP_FS, though this is just speculation (I don't actually see recursion in the stack):

/* Clear __GFP_FS when allocating the page to avoid recursion into the fs
 * and deadlock against the caller's locked page.
 */             
static inline struct page *ll_grab_cache_page_nowait(struct address_space *mapping,
                                   pgoff_t index)
{       
        return pagecache_get_page(mapping, index,
                        FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT,
                        mapping_gfp_mask(mapping) & ~__GFP_FS);
}

It would be worthwhile to see what has changed under linux/mm/ between the RHEL 8.0 and 8.1 kernels, to identify what is causing the failures.

Comment by Dongyang Li [ 09/Jan/20 ]

__GFP_FS is cleared in pagecache_get_page() before we use the flag to allocate the page and add it to the LRU.

The comments on grab_cache_page_nowait() are misleading; see upstream commit 45f87de57f8fad59302fd263dd81ffa4843b5b24.

grab_cache_page_nowait() passes FGP_NOFS to pagecache_get_page(), so we should be good.
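
For reference, the relevant part of pagecache_get_page() in a 4.18-era kernel looks roughly like the fragment below (paraphrased from memory rather than quoted, so check mm/filemap.c for the authoritative code):

        if (!page && (fgp_flags & FGP_CREAT)) {
                /* FGP_NOFS (set by grab_cache_page_nowait) clears __GFP_FS
                 * from the mask used to allocate the page and insert it
                 * into the page cache, so the fs is not re-entered here */
                if (fgp_flags & FGP_NOFS)
                        gfp_mask &= ~__GFP_FS;

                page = __page_cache_alloc(gfp_mask);
                if (!page)
                        return NULL;

                err = add_to_page_cache_lru(page, mapping, offset, gfp_mask);
                /* error handling / retry elided */
        }

If I read it right, the memcg charge done under add_to_page_cache_lru() then runs with this same NOFS mask, so the charge/reclaim path above is operating on a !__GFP_FS allocation.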

Comment by Andreas Dilger [ 09/Jan/20 ]

Can you please compare the RHEL 8.0 and 8.1 mm/ trees and cgroup code to see what has changed? The test fails 100% of the time on RHEL 8.1 and never on 8.0, so there must be some code change in the kernel that is causing this.

Comment by Dongyang Li [ 10/Jan/20 ]

The changes are significant and I'm going through them. On the plus side, after enabling centos8-stream I'm able to get 4.18.0-147.3.1.el8_1.x86_64 and the newer 4.18.0-151.el8.x86_64. I can reproduce it on my VM with both kernels, and you are right, it's a regression in 8.1; 8.0 is fine. The issue actually has nothing to do with Lustre and can be reproduced by doing dd from a local file.
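
For reference, a minimal sketch of such a local (non-Lustre) reproducer, assuming a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory as on RHEL 8, root privileges, a pre-existing /tmp/bigfile much larger than the limit, and with error handling omitted (the group name lu13063 is arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        const char *cg = "/sys/fs/cgroup/memory/lu13063";
        char path[256], buf[1 << 16];
        ssize_t n;
        int fd;

        mkdir(cg, 0755);                        /* create the memory cgroup */

        snprintf(path, sizeof(path), "%s/memory.limit_in_bytes", cg);
        fd = open(path, O_WRONLY);
        write(fd, "8388608", 7);                /* 8 MiB hard limit */
        close(fd);

        snprintf(path, sizeof(path), "%s/tasks", cg);
        fd = open(path, O_WRONLY);
        dprintf(fd, "%d", getpid());            /* move this process into it */
        close(fd);

        /* Buffered read of a file much larger than the limit: every
         * page-cache allocation now has to charge/reclaim inside the
         * cgroup, which is where the affected kernel gets stuck. */
        fd = open("/tmp/bigfile", O_RDONLY);
        while ((n = read(fd, buf, sizeof(buf))) > 0)
                ;
        close(fd);
        return 0;
}

On an affected kernel the read loop should wedge in the charge/reclaim path much like the dd in the stack trace above, while on 8.0 the read fails with an error instead, which is what the test expects.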

Comment by Dongyang Li [ 13/Jan/20 ]

OK, it turns out Red Hat is missing a patch in the 8.1 kernel:

f9c645621a28e37813a1de96d9cbd89cde94a1e4 memcg, oom: don't require __GFP_FS when invoking memcg OOM killer
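
For context, as best I can reconstruct that upstream change from memory (paraphrased, not an exact quote, so check the mainline commit for the authoritative diff), it relaxes the early bail-out in mm/oom_kill.c:out_of_memory() so the OOM killer is still invoked for a memcg OOM even on a !__GFP_FS allocation:

        /* Pre-fix, this check returns early for any !__GFP_FS allocation,
         * including a memcg OOM, so a GFP_NOFS page-cache allocation at
         * the cgroup limit never gets anything killed and try_charge()
         * keeps retrying forever.  The fix adds the memcg exemption: */
        if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
                return true;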

How do we proceed? I can create a new 4.18-rhel8.1.series under kernel_patches and add the patch there, but that's for the server side; we need the kernel patch on the clients...

Comment by Andreas Dilger [ 13/Jan/20 ]

It makes sense to add the patch to the RHEL 8.1 series, and we can test with the server packages installed on the client.

As for opening a ticket with Red Hat, I hope Oleg or Peter knows the details on that.

Comment by Gerrit Updater [ 13/Jan/20 ]

Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/37197
Subject: LU-13063 kernel: fix try_charge retrying forever
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d4eae90c1ec4676f775bf7c0582c999968edf408

Comment by Dongyang Li [ 16/Jan/20 ]

How can we tell Maloo to use the server packages (where the kernel patch actually applies) on the clients?

I looked into https://wiki.whamcloud.com/display/PUB/Changing+Test+Parameters+with+Gerrit+Commit+Messages but could not find anything related.

Comment by Minh Diep [ 17/Jan/20 ]

Currently you can't install the Lustre server packages on a client (unless you patch the client).

Comment by Gerrit Updater [ 17/Jan/20 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37270
Subject: LU-13063 tests: stop running sanity test 411
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0681c9075cfb28745bdb90f660c36878a8b94679

Comment by Gerrit Updater [ 17/Jan/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37272
Subject: LU-13063 tests: remove checks for old RHEL versions
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 734072a4ff70ff044fe44c3ff018f28720f38617

Comment by Gerrit Updater [ 28/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37270/
Subject: LU-13063 tests: stop running sanity test 411
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 34e4c37474b3d9328aac1dd9228a018ac7d4f47e

Comment by Jian Yu [ 29/Jan/20 ]

The failure also occurred on the Lustre b2_12 branch: https://testing.whamcloud.com/test_sets/0e3d7c8e-3ea7-11ea-9543-52540065bddc

Comment by Gerrit Updater [ 30/Jan/20 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37376
Subject: LU-13063 tests: fortestonly RHEL8.0 vs 8.1
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: d20b45141a917549858dc443c18c7b8b4f108ca8

Comment by Gerrit Updater [ 01/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37272/
Subject: LU-13063 tests: remove checks for old RHEL versions
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4a8e06fb5af88f93a936ed8ba0718ff5d3554c9f

Comment by Gerrit Updater [ 16/Apr/20 ]

Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38260
Subject: LU-13063 tests: stop running sanity test 411
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: f857eb47ea816d4a74b0012049125ae8cbeb2149

Comment by Gerrit Updater [ 01/May/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38260/
Subject: LU-13063 tests: stop running sanity test 411
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 495a6f01a30855c72aa0ec4fda12f961342a97a3

Comment by Patrick Farrell [ 29/Mar/23 ]

I don't think we're testing on RHEL 8.1 clients any more, are we? I've removed this ALWAYS_EXCEPT as part of https://review.whamcloud.com/c/fs/lustre-release/+/50460; hopefully I'm right about the testing.
