Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.14.0, Lustre 2.12.4
- Environment: RHEL 8.1
- Severity: 3
- Rank (Obsolete): 9223372036854775807
Description
The last thing seen in the suite_log for sanity test 411 is
== sanity test 411: Slab allocation error with cgroup does not LBUG ================================== 04:54:33 (1575953673)
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 3.88888 s, 27.0 MB/s
Normally, on successful runs, we would see a dd error reading the file just created, but the test hangs at this point. Looking at the console logs, it’s not clear why the test is hanging, but we see hung lnet-selftest processes. In the stack trace on the first client (vm10), there is an lnet-selftest process (st_timer) stuck in D state:
[14127.185129] lst_t_00_00 S 0 14488 2 0x80000080
[14127.186075] Call Trace:
[14127.186561] ? __schedule+0x253/0x830
[14127.187236] ? sfw_test_unit_done.isra.14+0x9d/0x150 [lnet_selftest]
[14127.188348] schedule+0x28/0x70
[14127.188929] cfs_wi_scheduler+0x40d/0x420 [libcfs]
[14127.189783] ? finish_wait+0x80/0x80
[14127.190466] ? cfs_wi_sched_create+0x5a0/0x5a0 [libcfs]
[14127.191397] kthread+0x112/0x130
[14127.191984] ? kthread_flush_work_fn+0x10/0x10
[14127.192782] ret_from_fork+0x35/0x40
[14127.193448] st_timer D 0 14636 2 0x80000080
[14127.194413] Call Trace:
[14127.194882] ? __schedule+0x253/0x830
[14127.195555] schedule+0x28/0x70
[14127.196142] schedule_timeout+0x16b/0x390
[14127.196859] ? __next_timer_interrupt+0xc0/0xc0
[14127.197678] ? prepare_to_wait_event+0xbb/0x140
[14127.198496] stt_timer_main+0x215/0x230 [lnet_selftest]
[14127.199436] ? finish_wait+0x80/0x80
[14127.200083] ? sfw_startup+0x540/0x540 [lnet_selftest]
[14127.200989] kthread+0x112/0x130
[14127.201595] ? kthread_flush_work_fn+0x10/0x10
[14127.202393] ret_from_fork+0x35/0x40
Similarly, in the stack-trace log on the MDS (vm12), we see the lnet-selftest st_timer process in D state:
[14034.774700] st_timer D ffff9cb15b62a080 0 28114 2 0x00000080
[14034.776068] Call Trace:
[14034.776493] [<ffffffffb0f6af19>] schedule+0x29/0x70
[14034.777425] [<ffffffffb0f68968>] schedule_timeout+0x168/0x2d0
[14034.778391] [<ffffffffb08cfeb4>] ? __wake_up+0x44/0x50
[14034.779358] [<ffffffffb08aab30>] ? __internal_add_timer+0x130/0x130
[14034.780432] [<ffffffffb08c3a46>] ? prepare_to_wait+0x56/0x90
[14034.781474] [<ffffffffc1542a98>] stt_timer_main+0x168/0x220 [lnet_selftest]
[14034.782654] [<ffffffffb08c3f50>] ? wake_up_atomic_t+0x30/0x30
[14034.783688] [<ffffffffc1542930>] ? sfw_startup+0x580/0x580 [lnet_selftest]
[14034.784856] [<ffffffffb08c2e81>] kthread+0xd1/0xe0
[14034.785787] [<ffffffffb08c2db0>] ? insert_kthread_work+0x40/0x40
[14034.786818] [<ffffffffb0f77c37>] ret_from_fork_nospec_begin+0x21/0x21
[14034.788077] [<ffffffffb08c2db0>] ? insert_kthread_work+0x40/0x40
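When triaging similar hangs, it can help to list every task that a console stack dump reports in D state. The following is only a rough sketch: the function name `tasks_in_state` and the regular expression are hypothetical and are based solely on the task header lines quoted in this ticket (with an optional 16-digit hex column, as in the MDS log). Other kernels may print a different header layout.

```python
import re

# Task header lines as they appear in this ticket, for example:
#   [14127.193448] st_timer D 0 14636 2 0x80000080
#   [14034.774700] st_timer D ffff9cb15b62a080 0 28114 2 0x00000080
# The layout is an assumption drawn from these two samples only.
TASK_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<comm>\S+)\s+(?P<state>[RSDTtXZI])\s+"
    r"(?:[0-9a-f]{16}\s+)?\d+\s+(?P<pid>\d+)\s+\d+\s+0x[0-9a-f]+\s*$"
)


def tasks_in_state(log_text, state="D"):
    """Return (comm, pid) for each task header line with the given state."""
    found = []
    for line in log_text.splitlines():
        m = TASK_RE.match(line.strip())
        if m and m.group("state") == state:
            found.append((m.group("comm"), int(m.group("pid"))))
    return found


# A few lines copied from the traces above, one log entry per line.
SAMPLE = """\
[14127.185129] lst_t_00_00 S 0 14488 2 0x80000080
[14127.186075] Call Trace:
[14127.187236] ? sfw_test_unit_done.isra.14+0x9d/0x150 [lnet_selftest]
[14127.193448] st_timer D 0 14636 2 0x80000080
[14127.194413] Call Trace:
[14034.774700] st_timer D ffff9cb15b62a080 0 28114 2 0x00000080
"""

if __name__ == "__main__":
    for comm, pid in tasks_in_state(SAMPLE):
        print(f"{comm} (pid {pid}) is in D state")
```

On the sample lines, only the two st_timer threads are reported; lst_t_00_00 is in S state and is skipped. The sketch expects one log entry per line, so a log flattened onto a single line would have to be split first.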
lnet-selftest did run and fail (LU-10073) before sanity ran. It’s not clear whether lnet-selftest is a cause of this test hang.
We’ve seen this test hang twice in RHEL 8.1 testing, both times in December:
https://testing.whamcloud.com/test_sets/293b5216-1b13-11ea-a9d7-52540065bddc
https://testing.whamcloud.com/test_sets/133daa46-1b8a-11ea-b1e8-52540065bddc
In addition, we've seen this once in the past 3 months in PPC testing for a patch for LU-11997 at https://testing.whamcloud.com/test_sets/b4851392-f175-11e9-b62b-52540065bddc .
Issue Links
- is related to: LU-10073 lnet-selftest test_smoke: lst Error found (Resolved)