Details
Type: Bug
Resolution: Fixed
Priority: Minor
Description
This issue was created by maloo for Chris Horn <hornc@cray.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d1a16088-4017-4128-935f-b366895d3df1
Subtests passed 832/832, but the test suite failed with TIMEOUT status. Nothing is obviously wrong in the available logs.
Activity
More information in this issue:
- searching the Maloo failures shows no hits for non-DNE test sessions (0/173 runs of review-ldiskfs, review-ldiskfs-arm, review-ldiskfs-ubuntu), so the problem definitely seems DNE related (61/168 = 36.8% of review-dne-part-1 and review-dne-zfs-part-1 runs)
- the revert of the LU-14139 patch did not (completely?) solve the problem. I ran that patch 10 times without DNE (5x ldiskfs and 5x zfs) and did not hit the problem, then ran with DNE 13x and hit the problem only once (on a retest of an unrelated failure). That is a failure rate of about 7.7%, compared to the 36.8% DNE failure rate. It might just be random chance, or it may be an aggravating factor.
- the one failure with the LU-14139 revert did not show the obd_get_request_slot blockage in the stack traces or watchdog timers, though it still had an "rm" process in the stack trace. It is possible that the revert fixed one problem and we are now left with a second problem of very slow "rm -rf" of the files created by e.g. test_413a/b/c, though 1h would be very slow.
This latter form seems to be more common in the different timeouts I looked at, along with a clearer indication that statahead/AGL are problematic:
[21523.815920] LustreError: 1084347:0:(import.c:377:ptlrpc_invalidate_import()) @@@ still on sending list req@00000000b5d7744a x1704730918725248/t0(0) o101->lustre-MDT0001-mdc-ffff900242516000@10.9.5.67@tcp:12/10 lens 592/744 e 0 to 0 dl 1625758991 ref 1 fl Interpret:REQU/0/0 rc -5/0 job:'ll_sa_1083589.0'
[21523.820258] LustreError: 1084347:0:(import.c:388:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting for them to error out.
[21544.292771] LustreError: 1084347:0:(import.c:354:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: timeout waiting for callback (6 != 0) :
[21790.041768] LustreError: 1084387:0:(import.c:382:ptlrpc_invalidate_import()) @@@ still on delayed list req@000000002b0a711a x1704730918725632/t0(0) o101->lustre-OST0005-osc-ffff900242516000@10.9.5.65@tcp:28/4 lens 328/400 e 0 to 0 dl 0 ref 1 fl Rpc:EWQU/0/ffffffff rc -5/-1 job:'ll_agl_1083589.0'
Another timed-out test session showed a different stack, with rm stuck in obd_get_request_slot(), as were a number of other threads, but it was similarly stuck after the tests finished:
[24074.551230] rm S 0 1087612 21959 0x00000080
[24074.552328] Call Trace:
[24074.552845] __schedule+0x2c4/0x700
[24074.554393] schedule+0x38/0xa0
[24074.555077] obd_get_request_slot+0x19e/0x2b0 [obdclass]
[24074.556886] ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
[24074.561142] mdc_enqueue_base+0x335/0x1340 [mdc]
[24074.562069] mdc_intent_lock+0x210/0x530 [mdc]
[24074.567209] lmv_intent_open+0x296/0xa60 [lmv]
[24074.570167] lmv_intent_lock+0x19c/0x390 [lmv]
[24074.572094] ll_intent_file_open+0x37c/0x8c0 [lustre]
[24074.573969] ll_file_open+0x26a/0xc40 [lustre]
[24074.575771] do_dentry_open+0x132/0x340
[24074.576567] path_openat+0x53e/0x14f0
[24074.579404] do_filp_open+0x93/0x100
[24074.581006] do_sys_open+0x184/0x220
[24074.412684] ptlrpcd_01_00 S 0 1076811 2 0x80004080
[24074.413774] Call Trace:
[24074.414315] __schedule+0x2c4/0x700
[24074.415837] schedule+0x38/0xa0
[24074.416715] obd_get_request_slot+0x19e/0x2b0 [obdclass]
[24074.418541] ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
[24074.423070] mdc_enqueue_base+0x335/0x1340 [mdc]
[24074.424995] mdc_intent_lock+0x210/0x530 [mdc]
[24074.429177] lmv_revalidate_slaves+0x607/0x9e0 [lmv]
[24074.431189] lmv_merge_attr+0x32/0x170 [lmv]
[24074.432059] ll_update_lsm_md+0xad7/0xe20 [lustre]
[24074.433025] ll_update_inode+0x41f/0x650 [lustre]
[24074.433971] ll_read_inode2+0x4e/0x3a0 [lustre]
[24074.434876] ll_iget+0xbd/0x320 [lustre]
[24074.435696] ll_prep_inode+0x374/0x990 [lustre]
[24074.438680] ll_statahead_interpret+0x51b/0xb90 [lustre]
[24074.439727] mdc_intent_getattr_async_interpret+0x13b/0x3d0 [mdc]
[24074.440921] ptlrpc_check_set+0x579/0x2070 [ptlrpc]
[24074.442769] ptlrpcd_check+0x3f3/0x5e0 [ptlrpc]
[24074.443698] ptlrpcd+0x3cb/0x4c0 [ptlrpc]
[24074.610008] ll_sa_1087612 S 0 1088357 2 0x80004080
[24074.611119] Call Trace:
[24074.611645] __schedule+0x2c4/0x700
[24074.613188] schedule+0x38/0xa0
[24074.613842] obd_get_request_slot+0x19e/0x2b0 [obdclass]
[24074.615639] ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
[24074.619882] mdc_intent_getattr_async+0x25f/0x320 [mdc]
[24074.620907] lmv_intent_getattr_async+0x25c/0x290 [lmv]
[24074.621944] sa_statahead+0x8c9/0xe40 [lustre]
[24074.623585] ll_statahead_thread+0x913/0xaf0 [lustre]
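Reading the three stacks together, one plausible interpretation is request-slot starvation: the ll_sa statahead thread and the rm open path are waiting for a slot, while the single ptlrpcd thread that would interpret the replies (and thereby release slots) is itself blocked waiting for a slot inside the lmv_revalidate_slaves() path. Below is a minimal userspace sketch of that pattern, not Lustre code; cl_rpc_sem, MAX_RPCS_IN_FLIGHT and the function names are hypothetical stand-ins for the obd_get_request_slot()/obd_put_request_slot() machinery.

/*
 * Not Lustre code: a minimal sketch of the request-slot starvation
 * pattern suggested by the stacks above.
 */
#include <semaphore.h>

#define MAX_RPCS_IN_FLIGHT 8

static sem_t cl_rpc_sem;                /* counts free request slots */

static void get_request_slot(void)     /* cf. obd_get_request_slot() */
{
        sem_wait(&cl_rpc_sem);          /* sleeps while all slots are busy */
}

static void put_request_slot(void)     /* cf. obd_put_request_slot() */
{
        sem_post(&cl_rpc_sem);
}

/* Statahead sends async getattr RPCs; each one holds a slot until its
 * reply has been interpreted by the (single) ptlrpcd thread. */
static void send_async_getattr(void)
{
        get_request_slot();
        /* ... RPC sent; slot released only after reply interpretation ... */
}

/* Reply interpretation, run from the ptlrpcd thread.  In the hang the
 * callback path (ll_statahead_interpret -> lmv_revalidate_slaves ->
 * mdc_enqueue_base) sends a second, synchronous RPC, so it needs a
 * slot of its own while the original one is still held. */
static void getattr_async_interpret(void)
{
        get_request_slot();     /* blocks forever once all slots are taken */
        /* ... nested synchronous RPC ... */
        put_request_slot();     /* nested RPC's slot */
        put_request_slot();     /* original async getattr's slot, freed only
                                 * after interpretation completes */
}

int main(void)
{
        sem_init(&cl_rpc_sem, 0, MAX_RPCS_IN_FLIGHT);
        /* statahead fills every slot with async getattrs ... */
        for (int i = 0; i < MAX_RPCS_IN_FLIGHT; i++)
                send_async_getattr();
        /* ... then the ptlrpcd thread stalls in get_request_slot(), no
         * slot can ever be released, and every later caller (rm, ll_sa_*)
         * sleeps exactly as in the traces above. */
        getattr_async_interpret();
        return 0;
}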
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44371
Subject: LU-14868 llite: revert 'simplify callback handling for async getattr'
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 512f2c446b1388dfbad79d848b2ccf99f0b9222d
There are stack traces being dumped during cleanup for the same process (i.e. it is really stuck), after all of the subtests have finished:
[19582.720606] Lustre: DEBUG MARKER: == sanity test complete, duration 19117 sec =============== 20:32:33 (1626899553)
[20640.853345] Lustre: lustre-OST0000-osc-ffff9a44ff09b000: Connection to lustre-OST0000 (at 10.9.7.173@tcp) was lost; in progress operations using this service will wait for recovery to complete
[20640.857025] LustreError: 167-0: lustre-OST0000-osc-ffff9a44ff09b000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
[20767.316538] INFO: task rm:1023489 blocked for more than 120 seconds.
[20767.317692] Tainted: G OE --------- - - 4.18.0-240.22.1.el8_3.x86_64 #1
[20767.319022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[20767.320261] rm D 0 1023489 19363 0x00004080
[20767.321183] Call Trace:
[20767.321710] __schedule+0x2c4/0x700
[20767.323543] schedule+0x38/0xa0
[20767.324085] bit_wait+0xd/0x50
[20767.324620] __wait_on_bit+0x6c/0x80
[20767.326186] out_of_line_wait_on_bit+0x91/0xb0
[20767.327678] ilookup5.part.26+0x69/0x90
[20767.329948] iget5_locked+0x26/0x80
[20767.330557] ll_iget+0x4d/0x320 [lustre]
[20767.331230] ll_prep_inode+0x374/0x990 [lustre]
[20767.332004] ll_lookup_it_finish.constprop.28+0x21d/0x1100 [lustre]
[20767.335562] ll_lookup_it+0x586/0x18a0 [lustre]
[20767.336889] ll_lookup_nd+0xfc/0x1a0 [lustre]
[20767.337629] __lookup_slow+0x97/0x150
[20767.338252] lookup_slow+0x35/0x50
[20767.338839] walk_component+0x1bf/0x2f0
[20767.339487] path_lookupat.isra.48+0x75/0x200
[20767.341916] filename_lookup.part.64+0xa0/0x170
[20767.343381] vfs_statx+0x74/0xd0
[20767.343990] __do_sys_newfstatat+0x31/0x70
[20767.346290] do_syscall_64+0x5b/0x1a0
This makes me suspect patch https://review.whamcloud.com/40712 "LU-14139 llite: simplify callback handling for async getattr" is the cause, but there were no timeouts for any of the review tests on that patch, so possibly it is a combination of factors. The thread is blocked in ilookup5->wait_on_inode->wait_on_bit(&inode->i_state, __I_NEW), which may relate to the changes to statahead in that patch.
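For reference, the VFS contract the rm thread is relying on: iget5_locked() can return an inode with I_NEW set, and every concurrent lookup of the same inode (including ilookup5()) sleeps in wait_on_inode() until the allocating thread clears I_NEW. The sketch below illustrates that contract; it is not the actual ll_iget()/ll_read_inode2() code, and my_test(), my_set(), my_fill_inode() and the fid argument are hypothetical placeholders.

/*
 * Not Lustre code: a minimal sketch of the VFS I_NEW contract.  If the
 * thread that got the new inode never reaches unlock_new_inode() or
 * iget_failed() -- e.g. because it is itself stuck in
 * obd_get_request_slot() -- every other lookup of that inode sleeps
 * forever in wait_on_inode(), as seen in the ilookup5 stack above.
 */
#include <linux/fs.h>
#include <linux/err.h>

static int my_test(struct inode *inode, void *data) { return 1; }      /* stub */
static int my_set(struct inode *inode, void *data) { return 0; }       /* stub */
static int my_fill_inode(struct inode *inode, void *fid) { return 0; } /* stub: may issue RPCs */

struct inode *my_iget(struct super_block *sb, unsigned long hashval, void *fid)
{
        struct inode *inode;

        inode = iget5_locked(sb, hashval, my_test, my_set, fid);
        if (!inode)
                return ERR_PTR(-ENOMEM);

        if (inode->i_state & I_NEW) {
                int rc = my_fill_inode(inode, fid);

                if (rc) {
                        /* marks the inode bad and wakes all waiters */
                        iget_failed(inode);
                        return ERR_PTR(rc);
                }
                /* clears I_NEW and wakes anyone in wait_on_inode() */
                unlock_new_inode(inode);
        }
        return inode;
}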
According to my Maloo search, it looks like the timeouts with '832/832' (i.e. all) subtests passed started on 2021-07-08. There were other sanity timeouts before then, but they always happened while running a subtest (e.g. 'n-1/n' subtests run) and only 3-5 per day. On 2021-07-09 there were 29 timeouts with '832/832' subtests run, and similar numbers on later days (excluding a rash of patch-specific timeouts on 2021-07-12 due to LU-13799). From 2021-07-08 to 2021-07-21 there have been 189/1042 = 20% sanity timeouts (of all kinds), compared to 15/462 = 3% for the previous week.
Patches that landed to lustre/ on 2021-07-08 (not including lnet/; they landed at 2:00 am, so were available for nearly a full day of test runs that day) are:
f21c507fbc LU-14779 utils: no DNS lookups for NID in get_param ***
4fc127428f LU-14778 readahead: fix to reserve min pages
791f656a03 LU-14776 ldiskfs: Add Ubuntu 20.04 HWE support
a81c093a93 LU-14767 utils: mkfs.lustre allow lazy_itable_init=1
b31792b0e7 LU-11872 quota: add get/set project support for non-dir/file
d775f9ae37 LU-11698 libcfs: Add checksum speed under /sys/fs ***
25aa852737 LU-14648 lod: protect lod_object layout info
002c2a8026 LU-14762 lmv: compare space to mkdir on parent MDT
8d1895f2f6 LU-14654 tests: Ensure recovery_limit zero works as expected
a15eb4f132 LU-13055 mdd: per-user changelog names and mask ***
3070ca9b18 LU-4684 tests: enable racer directory migration
cbaaa7cde4 LU-14139 llite: simplify callback handling for async getattr
5c98de8566 LU-6142 obdclass: resolve lu_ref checkpatch issues
f5967b06aa LU-14734 osd-ldiskfs: enable large_dir automatically
The patches modifying sanity.sh are marked with ***, but none of them looks like it creates more than a handful of files.
It appears that this is caused by a slow cleanup of test files from a subtest that left a lot of files behind. Investigating which test is causing this.
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/44371/
Subject: LU-14868 llite: revert 'simplify callback handling for async getattr'
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e90794af4bfac3a591983f567b4acd27bf5a0990