[LU-14868] sanity: all subtests pass but test suite fails with TIMEOUT Created: 20/Jul/21 Updated: 27/Jul/21 Resolved: 25/Jul/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Chris Horn <hornc@cray.com>

This issue relates to the following test suite run:
https://testing.whamcloud.com/test_sets/d1a16088-4017-4128-935f-b366895d3df1

All 832/832 subtests pass, but the test suite failed with TIMEOUT status. Nothing is obviously wrong in the available logs.
| Comments |
| Comment by Olaf Faaland [ 20/Jul/21 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/4878a6f1-c343-41bf-88b3-9baa2b1e25fd |
| Comment by Olaf Faaland [ 21/Jul/21 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/fc893a44-4e28-42a1-9ba8-a9a52345bbd8 |
| Comment by Andreas Dilger [ 21/Jul/21 ] |
|
+1 on master: https://testing.whamcloud.com/test_sessions/a6bffc7d-e7ad-4289-980a-1ffc677d0189 |
| Comment by Andreas Dilger [ 21/Jul/21 ] |
|
It appears that this is caused by slow cleanup of test files from a subtest that left a large number of files behind. I am investigating which test is responsible.
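(One way to narrow down which subtest leaves files behind is to watch inode usage of the client mount between subtests. A rough sketch only, not part of the original triage: the mount point and subtest list below are placeholders, and sanity.sh is simply being driven via ONLY= from lustre/tests.)

```bash
#!/bin/bash
# Sketch: run selected sanity subtests one at a time and record inode
# usage of the client mount afterwards, so a test that leaves many
# files behind stands out.  MOUNT and the subtest list are placeholders.
MOUNT=${MOUNT:-/mnt/lustre}

cd lustre/tests || exit 1
for t in 101a 102b 103a; do        # hypothetical subtest numbers
        ONLY=$t bash sanity.sh
        echo "== inode usage after test_$t =="
        lfs df -i "$MOUNT"
done
```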
| Comment by Andreas Dilger [ 21/Jul/21 ] |
|
According to my Maloo search, sanity started having timeouts with "832/832" (i.e. all) subtests passed on 2021-07-08. There were other sanity timeouts before then, but they always happened while a subtest was still running (e.g. "n-1/n" subtests run) and only 3-5 per day. On 2021-07-09 there were 29 timeouts with "832/832" subtests run, and similar numbers on later days (excluding a rash of patch-specific timeouts on 2021-07-12).

Patches that landed to lustre/ (not including lnet/) on 2021-07-08, at about 2:00 am and therefore available for nearly a full day of test runs that day, are:

f21c507fbc LU-14779 utils: no DNS lookups for NID in get_param ***
4fc127428f LU-14778 readahead: fix to reserve min pages
791f656a03 LU-14776 ldiskfs: Add Ubuntu 20.04 HWE support
a81c093a93 LU-14767 utils: mkfs.lustre allow lazy_itable_init=1
b31792b0e7 LU-11872 quota: add get/set project support for non-dir/file
d775f9ae37 LU-11698 libcfs: Add checksum speed under /sys/fs ***
25aa852737 LU-14648 lod: protect lod_object layout info
002c2a8026 LU-14762 lmv: compare space to mkdir on parent MDT
8d1895f2f6 LU-14654 tests: Ensure recovery_limit zero works as expected
a15eb4f132 LU-13055 mdd: per-user changelog names and mask ***
3070ca9b18 LU-4684 tests: enable racer directory migration
cbaaa7cde4 LU-14139 llite: simplify callback handling for async getattr
5c98de8566 LU-6142 obdclass: resolve lu_ref checkpatch issues
f5967b06aa LU-14734 osd-ldiskfs: enable large_dir automatically

The patches modifying sanity.sh are marked with ***, but none of them look like they create more than a handful of files.
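(For reference, a landing list like the one above can be reproduced against a local lustre-release checkout; a sketch only, assuming the checkout path and date window shown below. Maloo itself was queried through its web UI.)

```bash
#!/bin/bash
# Sketch: list commits that landed on master during 2021-07-08,
# restricted to the lustre/ tree (lnet/ is a separate top-level
# directory, so it is excluded simply by the pathspec).
cd /path/to/lustre-release || exit 1    # hypothetical checkout location

git log --oneline --first-parent master \
        --since='2021-07-08 00:00' --until='2021-07-09 00:00' -- lustre/

# The "***" entries above are the ones that touch sanity.sh:
git log --oneline --first-parent master \
        --since='2021-07-08 00:00' --until='2021-07-09 00:00' \
        -- lustre/tests/sanity.sh
```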
| Comment by Andreas Dilger [ 21/Jul/21 ] |
|
There are stack traces being dumped during cleanup for the same process (i.e. it is really stuck), after all of the subtests have finished:

[19582.720606] Lustre: DEBUG MARKER: == sanity test complete, duration 19117 sec =============== 20:32:33 (1626899553)
[20640.853345] Lustre: lustre-OST0000-osc-ffff9a44ff09b000: Connection to lustre-OST0000 (at 10.9.7.173@tcp) was lost; in progress operations using this service will wait for recovery to complete
[20640.857025] LustreError: 167-0: lustre-OST0000-osc-ffff9a44ff09b000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
[20767.316538] INFO: task rm:1023489 blocked for more than 120 seconds.
[20767.317692] Tainted: G OE --------- - - 4.18.0-240.22.1.el8_3.x86_64 #1
[20767.319022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[20767.320261] rm D 0 1023489 19363 0x00004080
[20767.321183] Call Trace:
[20767.321710] __schedule+0x2c4/0x700
[20767.323543] schedule+0x38/0xa0
[20767.324085] bit_wait+0xd/0x50
[20767.324620] __wait_on_bit+0x6c/0x80
[20767.326186] out_of_line_wait_on_bit+0x91/0xb0
[20767.327678] ilookup5.part.26+0x69/0x90
[20767.329948] iget5_locked+0x26/0x80
[20767.330557] ll_iget+0x4d/0x320 [lustre]
[20767.331230] ll_prep_inode+0x374/0x990 [lustre]
[20767.332004] ll_lookup_it_finish.constprop.28+0x21d/0x1100 [lustre]
[20767.335562] ll_lookup_it+0x586/0x18a0 [lustre]
[20767.336889] ll_lookup_nd+0xfc/0x1a0 [lustre]
[20767.337629] __lookup_slow+0x97/0x150
[20767.338252] lookup_slow+0x35/0x50
[20767.338839] walk_component+0x1bf/0x2f0
[20767.339487] path_lookupat.isra.48+0x75/0x200
[20767.341916] filename_lookup.part.64+0xa0/0x170
[20767.343381] vfs_statx+0x74/0xd0
[20767.343990] __do_sys_newfstatat+0x31/0x70
[20767.346290] do_syscall_64+0x5b/0x1a0

This makes me think that patch https://review.whamcloud.com/40712 "..."
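(The hung-task messages above come from the kernel's hung_task detector. For anyone reproducing this, a sketch of how one might capture the same stacks on a client that is stuck after "sanity test complete"; the sysctl and sysrq paths are standard kernel interfaces, nothing Lustre-specific.)

```bash
#!/bin/bash
# Sketch: capture blocked-task stacks on a client that hangs after the
# test suite has finished (run as root).

# Report every blocked task, not just the first few, and keep the
# default 120 s threshold mentioned in the console log above.
echo -1  > /proc/sys/kernel/hung_task_warnings
echo 120 > /proc/sys/kernel/hung_task_timeout_secs

# Dump stacks of all uninterruptible (D-state) tasks right now.
echo w > /proc/sysrq-trigger

# The traces land in the kernel log.
dmesg | grep -A 30 'blocked for more than'
```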
| Comment by Gerrit Updater [ 21/Jul/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44371 |
| Comment by Andreas Dilger [ 22/Jul/21 ] |
|
Another timed-out test session showed a different stack, with rm stuck in obd_get_request_slot() (as were a number of other threads), but it was similarly stuck after the tests had finished:

[24074.551230] rm S 0 1087612 21959 0x00000080
[24074.552328] Call Trace:
[24074.552845] __schedule+0x2c4/0x700
[24074.554393] schedule+0x38/0xa0
[24074.555077] obd_get_request_slot+0x19e/0x2b0 [obdclass]
[24074.556886] ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
[24074.561142] mdc_enqueue_base+0x335/0x1340 [mdc]
[24074.562069] mdc_intent_lock+0x210/0x530 [mdc]
[24074.567209] lmv_intent_open+0x296/0xa60 [lmv]
[24074.570167] lmv_intent_lock+0x19c/0x390 [lmv]
[24074.572094] ll_intent_file_open+0x37c/0x8c0 [lustre]
[24074.573969] ll_file_open+0x26a/0xc40 [lustre]
[24074.575771] do_dentry_open+0x132/0x340
[24074.576567] path_openat+0x53e/0x14f0
[24074.579404] do_filp_open+0x93/0x100
[24074.581006] do_sys_open+0x184/0x220

[24074.412684] ptlrpcd_01_00 S 0 1076811 2 0x80004080
[24074.413774] Call Trace:
[24074.414315] __schedule+0x2c4/0x700
[24074.415837] schedule+0x38/0xa0
[24074.416715] obd_get_request_slot+0x19e/0x2b0 [obdclass]
[24074.418541] ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
[24074.423070] mdc_enqueue_base+0x335/0x1340 [mdc]
[24074.424995] mdc_intent_lock+0x210/0x530 [mdc]
[24074.429177] lmv_revalidate_slaves+0x607/0x9e0 [lmv]
[24074.431189] lmv_merge_attr+0x32/0x170 [lmv]
[24074.432059] ll_update_lsm_md+0xad7/0xe20 [lustre]
[24074.433025] ll_update_inode+0x41f/0x650 [lustre]
[24074.433971] ll_read_inode2+0x4e/0x3a0 [lustre]
[24074.434876] ll_iget+0xbd/0x320 [lustre]
[24074.435696] ll_prep_inode+0x374/0x990 [lustre]
[24074.438680] ll_statahead_interpret+0x51b/0xb90 [lustre]
[24074.439727] mdc_intent_getattr_async_interpret+0x13b/0x3d0 [mdc]
[24074.440921] ptlrpc_check_set+0x579/0x2070 [ptlrpc]
[24074.442769] ptlrpcd_check+0x3f3/0x5e0 [ptlrpc]
[24074.443698] ptlrpcd+0x3cb/0x4c0 [ptlrpc]

[24074.610008] ll_sa_1087612 S 0 1088357 2 0x80004080
[24074.611119] Call Trace:
[24074.611645] __schedule+0x2c4/0x700
[24074.613188] schedule+0x38/0xa0
[24074.613842] obd_get_request_slot+0x19e/0x2b0 [obdclass]
[24074.615639] ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
[24074.619882] mdc_intent_getattr_async+0x25f/0x320 [mdc]
[24074.620907] lmv_intent_getattr_async+0x25c/0x290 [lmv]
[24074.621944] sa_statahead+0x8c9/0xe40 [lustre]
[24074.623585] ll_statahead_thread+0x913/0xaf0 [lustre]
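(All three stacks are waiting in obd_get_request_slot(), i.e. for a free MDC RPC slot, so the client-side RPC and statahead tunables are the obvious things to inspect. A sketch only; the choice of parameters is my assumption.)

```bash
#!/bin/bash
# Sketch: inspect the client-side tunables that bound the threads seen
# waiting in obd_get_request_slot() above.

# How many metadata RPCs the client may keep in flight per MDC.
lctl get_param mdc.*.max_rpcs_in_flight

# Statahead/AGL limits and statistics for the llite mounts.
lctl get_param llite.*.statahead_max
lctl get_param llite.*.statahead_agl
lctl get_param llite.*.statahead_stats
```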
| Comment by Andreas Dilger [ 22/Jul/21 ] |
|
This latter form seems to be more common in the different timeouts I looked at, along with a clearer indication that statahead/AGL are problematic:

[21523.815920] LustreError: 1084347:0:(import.c:377:ptlrpc_invalidate_import()) @@@ still on sending list req@00000000b5d7744a x1704730918725248/t0(0) o101->lustre-MDT0001-mdc-ffff900242516000@10.9.5.67@tcp:12/10 lens 592/744 e 0 to 0 dl 1625758991 ref 1 fl Interpret:REQU/0/0 rc -5/0 job:'ll_sa_1083589.0'
[21523.820258] LustreError: 1084347:0:(import.c:388:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting for them to error out.
[21544.292771] LustreError: 1084347:0:(import.c:354:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: timeout waiting for callback (6 != 0) :
[21790.041768] LustreError: 1084387:0:(import.c:382:ptlrpc_invalidate_import()) @@@ still on delayed list req@000000002b0a711a x1704730918725632/t0(0) o101->lustre-OST0005-osc-ffff900242516000@10.9.5.65@tcp:28/4 lens 328/400 e 0 to 0 dl 0 ref 1 fl Rpc:EWQU/0/ffffffff rc -5/-1 job:'ll_agl_1083589.0'
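(Given the ll_sa_*/ll_agl_* job names on the stuck RPCs, an obvious diagnostic experiment, not a fix, is to turn statahead and AGL off on the client and see whether the post-run hang disappears; a sketch assuming the standard llite tunables.)

```bash
#!/bin/bash
# Sketch: temporarily disable statahead and AGL on a client as a
# diagnostic for the hang above.

# Record the current values so they can be restored afterwards.
lctl get_param llite.*.statahead_max llite.*.statahead_agl

# 0 disables directory statahead entirely; AGL is the asynchronous
# glimpse locking done on behalf of statahead.
lctl set_param llite.*.statahead_max=0
lctl set_param llite.*.statahead_agl=0
```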
| Comment by Andreas Dilger [ 23/Jul/21 ] |
|
More information in this issue:
|
| Comment by Gerrit Updater [ 25/Jul/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/44371/ |
| Comment by Andreas Dilger [ 25/Jul/21 ] |
|
Patch landed on master, reverting the suspect change.
| Comment by Alex Zhuravlev [ 26/Jul/21 ] |
|
My bisecting points to ...
| Comment by Andreas Dilger [ 26/Jul/21 ] |
|
| Comment by Gerrit Updater [ 27/Jul/21 ] |
|
James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/44402 |