[LU-14868] sanity: all subtests pass but test suite fails with TIMEOUT Created: 20/Jul/21  Updated: 27/Jul/21  Resolved: 25/Jul/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
Related
is related to LU-14139 batched statahead processing Resolved
is related to LU-13440 DNE3: limit directory default layout ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Chris Horn <hornc@cray.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d1a16088-4017-4128-935f-b366895d3df1

All 832/832 subtests passed, but the test suite failed with TIMEOUT status. Nothing obviously wrong in the available logs.



 Comments   
Comment by Olaf Faaland [ 20/Jul/21 ]

+1 on master  https://testing.whamcloud.com/test_sets/4878a6f1-c343-41bf-88b3-9baa2b1e25fd

Comment by Olaf Faaland [ 21/Jul/21 ]

+1 on master https://testing.whamcloud.com/test_sets/fc893a44-4e28-42a1-9ba8-a9a52345bbd8

Comment by Andreas Dilger [ 21/Jul/21 ]

+1 on master: https://testing.whamcloud.com/test_sessions/a6bffc7d-e7ad-4289-980a-1ffc677d0189

Comment by Andreas Dilger [ 21/Jul/21 ]

It appears that this is caused by a slow cleanup of test files from a subtest that left a lot of files behind. Investigating which test is causing this.

Comment by Andreas Dilger [ 21/Jul/21 ]

According to my Maloo search, it looks like this started having timeouts with '832/832' (i.e. all) subtests passed on 2021-07-08. There were other sanity timeouts before then, but they always happened while running a subtest (e.g. 'n-1/n' subtests run) and only 3-5 per day. On 2021-07-09 there were 29 timeouts with '832/832' subtests run and similar on later days (excluding a rash of patch-specific timeouts on 2021-07-12 due to LU-13799). From 2021-07-08 to 2021-07-21 there have been 189/1042=20% sanity timeouts (of all kinds), compared to 15/462=3% for the previous week.

Patches that landed to lustre/ on 2021-07-08 (excluding lnet/ patches; they landed at 2:00 am, so were available for nearly a full day of test runs that day) are:

f21c507fbc LU-14779 utils: no DNS lookups for NID in get_param ***
4fc127428f LU-14778 readahead: fix to reserve min pages
791f656a03 LU-14776 ldiskfs: Add Ubuntu 20.04 HWE support
a81c093a93 LU-14767 utils: mkfs.lustre allow lazy_itable_init=1
b31792b0e7 LU-11872 quota: add get/set project support for non-dir/file
d775f9ae37 LU-11698 libcfs: Add checksum speed under /sys/fs ***
25aa852737 LU-14648 lod: protect lod_object layout info
002c2a8026 LU-14762 lmv: compare space to mkdir on parent MDT
8d1895f2f6 LU-14654 tests: Ensure recovery_limit zero works as expected
a15eb4f132 LU-13055 mdd: per-user changelog names and mask ***
3070ca9b18 LU-4684 tests: enable racer directory migration
cbaaa7cde4 LU-14139 llite: simplify callback handling for async getattr
5c98de8566 LU-6142 obdclass: resolve lu_ref checkpatch issues
f5967b06aa LU-14734 osd-ldiskfs: enable large_dir automatically

with the patches modifying sanity.sh marked with ***, but none of those patches looks like it creates more than a handful of files.

Comment by Andreas Dilger [ 21/Jul/21 ]

Stack traces are being dumped during cleanup for the same process (i.e. it is really stuck) after all of the subtests have finished:

[19582.720606] Lustre: DEBUG MARKER: == sanity test complete, duration 19117 sec =============== 20:32:33 (1626899553)
[20640.853345] Lustre: lustre-OST0000-osc-ffff9a44ff09b000: Connection to lustre-OST0000 (at 10.9.7.173@tcp) was lost; in progress operations using this service will wait for recovery to complete
[20640.857025] LustreError: 167-0: lustre-OST0000-osc-ffff9a44ff09b000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
[20767.316538] INFO: task rm:1023489 blocked for more than 120 seconds.
[20767.317692]       Tainted: G           OE    --------- -  - 4.18.0-240.22.1.el8_3.x86_64 #1
[20767.319022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[20767.320261] rm              D    0 1023489  19363 0x00004080
[20767.321183] Call Trace:
[20767.321710]  __schedule+0x2c4/0x700
[20767.323543]  schedule+0x38/0xa0
[20767.324085]  bit_wait+0xd/0x50
[20767.324620]  __wait_on_bit+0x6c/0x80
[20767.326186]  out_of_line_wait_on_bit+0x91/0xb0
[20767.327678]  ilookup5.part.26+0x69/0x90
[20767.329948]  iget5_locked+0x26/0x80
[20767.330557]  ll_iget+0x4d/0x320 [lustre]
[20767.331230]  ll_prep_inode+0x374/0x990 [lustre]
[20767.332004]  ll_lookup_it_finish.constprop.28+0x21d/0x1100 [lustre]
[20767.335562]  ll_lookup_it+0x586/0x18a0 [lustre]
[20767.336889]  ll_lookup_nd+0xfc/0x1a0 [lustre]
[20767.337629]  __lookup_slow+0x97/0x150
[20767.338252]  lookup_slow+0x35/0x50
[20767.338839]  walk_component+0x1bf/0x2f0
[20767.339487]  path_lookupat.isra.48+0x75/0x200
[20767.341916]  filename_lookup.part.64+0xa0/0x170
[20767.343381]  vfs_statx+0x74/0xd0
[20767.343990]  __do_sys_newfstatat+0x31/0x70
[20767.346290]  do_syscall_64+0x5b/0x1a0

This makes me suspect patch https://review.whamcloud.com/40712 "LU-14139 llite: simplify callback handling for async getattr", but there were no timeouts for any of the review tests on that patch, so possibly it is a combination of factors. The thread is blocked in ilookup5->wait_on_inode->wait_on_bit(&inode->i_state, __I_NEW), which may relate to the changes to statahead in that patch.
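
For background on the ilookup5() wait above: iget5_locked() returns a newly allocated inode with I_NEW set in i_state, and every later lookup of the same inode (via ilookup5() or another iget5_locked()) sleeps on that bit until the creating thread calls unlock_new_inode() or iget_failed(). Below is a minimal sketch of this generic VFS contract; it is kernel-module context only, the function name is made up for illustration, and it is not Lustre's actual ll_iget() code.

/* example_iget: hypothetical helper showing the generic VFS iget5_locked()
 * contract that ll_iget() follows; kernel context, not a standalone program. */
#include <linux/fs.h>
#include <linux/err.h>

static struct inode *example_iget(struct super_block *sb, unsigned long hashval,
				  int (*test)(struct inode *, void *),
				  int (*set)(struct inode *, void *),
				  void *data)
{
	struct inode *inode;

	/* Concurrent lookups of the same inode block here (or in ilookup5())
	 * on wait_on_bit(&inode->i_state, __I_NEW, ...) until the first
	 * caller finishes initialization below. */
	inode = iget5_locked(sb, hashval, test, set, data);
	if (!inode)
		return ERR_PTR(-ENOMEM);

	if (inode->i_state & I_NEW) {
		/* ... initialize the inode from the MDS reply ... */

		/* Waiters are only woken here; if initialization stalls
		 * (e.g. waiting on a statahead RPC), every other lookup of
		 * this inode hangs exactly as in the trace above. */
		unlock_new_inode(inode);
		/* On an initialization error the creator must instead call
		 * iget_failed(inode), or the waiters sleep forever. */
	}
	return inode;
}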

Comment by Gerrit Updater [ 21/Jul/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44371
Subject: LU-14868 llite: revert 'simplify callback handling for async getattr'
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 512f2c446b1388dfbad79d848b2ccf99f0b9222d

Comment by Andreas Dilger [ 22/Jul/21 ]

Another timed-out test session showed a different stack, with rm stuck in obd_get_request_slot() (as were a number of other threads), but it was similarly stuck after the tests had finished:

[24074.551230] rm              S    0 1087612  21959 0x00000080
[24074.552328] Call Trace:
[24074.552845]  __schedule+0x2c4/0x700
[24074.554393]  schedule+0x38/0xa0
[24074.555077]  obd_get_request_slot+0x19e/0x2b0 [obdclass]
[24074.556886]  ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
[24074.561142]  mdc_enqueue_base+0x335/0x1340 [mdc]
[24074.562069]  mdc_intent_lock+0x210/0x530 [mdc]
[24074.567209]  lmv_intent_open+0x296/0xa60 [lmv]
[24074.570167]  lmv_intent_lock+0x19c/0x390 [lmv]
[24074.572094]  ll_intent_file_open+0x37c/0x8c0 [lustre]
[24074.573969]  ll_file_open+0x26a/0xc40 [lustre]
[24074.575771]  do_dentry_open+0x132/0x340
[24074.576567]  path_openat+0x53e/0x14f0
[24074.579404]  do_filp_open+0x93/0x100
[24074.581006]  do_sys_open+0x184/0x220
[24074.412684] ptlrpcd_01_00   S    0 1076811      2 0x80004080
[24074.413774] Call Trace:
[24074.414315]  __schedule+0x2c4/0x700
[24074.415837]  schedule+0x38/0xa0
[24074.416715]  obd_get_request_slot+0x19e/0x2b0 [obdclass]
[24074.418541]  ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
[24074.423070]  mdc_enqueue_base+0x335/0x1340 [mdc]
[24074.424995]  mdc_intent_lock+0x210/0x530 [mdc]
[24074.429177]  lmv_revalidate_slaves+0x607/0x9e0 [lmv]
[24074.431189]  lmv_merge_attr+0x32/0x170 [lmv]
[24074.432059]  ll_update_lsm_md+0xad7/0xe20 [lustre]
[24074.433025]  ll_update_inode+0x41f/0x650 [lustre]
[24074.433971]  ll_read_inode2+0x4e/0x3a0 [lustre]
[24074.434876]  ll_iget+0xbd/0x320 [lustre]
[24074.435696]  ll_prep_inode+0x374/0x990 [lustre]
[24074.438680]  ll_statahead_interpret+0x51b/0xb90 [lustre]
[24074.439727]  mdc_intent_getattr_async_interpret+0x13b/0x3d0 [mdc]
[24074.440921]  ptlrpc_check_set+0x579/0x2070 [ptlrpc]
[24074.442769]  ptlrpcd_check+0x3f3/0x5e0 [ptlrpc]
[24074.443698]  ptlrpcd+0x3cb/0x4c0 [ptlrpc]
[24074.610008] ll_sa_1087612   S    0 1088357      2 0x80004080
[24074.611119] Call Trace:
[24074.611645]  __schedule+0x2c4/0x700
[24074.613188]  schedule+0x38/0xa0
[24074.613842]  obd_get_request_slot+0x19e/0x2b0 [obdclass]
[24074.615639]  ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
[24074.619882]  mdc_intent_getattr_async+0x25f/0x320 [mdc]
[24074.620907]  lmv_intent_getattr_async+0x25c/0x290 [lmv]
[24074.621944]  sa_statahead+0x8c9/0xe40 [lustre]
[24074.623585]  ll_statahead_thread+0x913/0xaf0 [lustre]
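
To make the slot contention easier to follow, here is a toy userspace model (assumed names and values, not Lustre code) of the bounded RPC-slot pool that obd_get_request_slot() implements: at most max_rpcs_in_flight metadata RPCs hold a slot at once, and a slot is only returned when the reply has been interpreted. The traces above show the interpret path itself (ll_statahead_interpret() on a ptlrpcd thread) blocking for another slot, so progress can stall once every slot is held.

/* Toy userspace model of a bounded request-slot pool; names and the
 * MAX_RPCS_IN_FLIGHT value are illustrative, this is not Lustre code. */
#include <pthread.h>

#define MAX_RPCS_IN_FLIGHT 8	/* stands in for cl_max_rpcs_in_flight */

static pthread_mutex_t slot_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t slot_cond = PTHREAD_COND_INITIALIZER;
static int rpcs_in_flight;

/* Rough analogue of obd_get_request_slot(): block until a slot is free.
 * This is the wait where rm, ptlrpcd_01_00 and ll_sa_* all sit above. */
static void get_request_slot(void)
{
	pthread_mutex_lock(&slot_lock);
	while (rpcs_in_flight >= MAX_RPCS_IN_FLIGHT)
		pthread_cond_wait(&slot_cond, &slot_lock);
	rpcs_in_flight++;
	pthread_mutex_unlock(&slot_lock);
}

/* Rough analogue of obd_put_request_slot(): called once the reply has been
 * interpreted.  If the interpreter is itself stuck in get_request_slot(),
 * no slot is ever returned and all of the blocked threads stay blocked. */
static void put_request_slot(void)
{
	pthread_mutex_lock(&slot_lock);
	rpcs_in_flight--;
	pthread_cond_signal(&slot_cond);
	pthread_mutex_unlock(&slot_lock);
}
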
Comment by Andreas Dilger [ 22/Jul/21 ]

The latter form seems to be more common in the timeouts I looked at, along with a clearer indication that statahead/AGL are problematic:

[21523.815920] LustreError: 1084347:0:(import.c:377:ptlrpc_invalidate_import()) @@@ still on sending list  req@00000000b5d7744a x1704730918725248/t0(0) o101->lustre-MDT0001-mdc-ffff900242516000@10.9.5.67@tcp:12/10 lens 592/744 e 0 to 0 dl 1625758991 ref 1 fl Interpret:REQU/0/0 rc -5/0 job:'ll_sa_1083589.0'
[21523.820258] LustreError: 1084347:0:(import.c:388:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting for them to error out.
[21544.292771] LustreError: 1084347:0:(import.c:354:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: timeout waiting for callback (6 != 0)
:
[21790.041768] LustreError: 1084387:0:(import.c:382:ptlrpc_invalidate_import()) @@@ still on delayed list  req@000000002b0a711a x1704730918725632/t0(0) o101->lustre-OST0005-osc-ffff900242516000@10.9.5.65@tcp:28/4 lens 328/400 e 0 to 0 dl 0 ref 1 fl Rpc:EWQU/0/ffffffff rc -5/-1 job:'ll_agl_1083589.0'

Comment by Andreas Dilger [ 23/Jul/21 ]

More information in this issue:

  • searching the Maloo failures shows no hits for non-DNE test sessions (0/173 runs of review-ldiskfs, review-ldiskfs-arm, review-ldiskfs-ubuntu), so the problem definitely seems DNE-related (61/168 = 36.8% of review-dne-part-1 and review-dne-zfs-part-1 runs)
  • the revert of the LU-14139 patch did not (completely?) solve the problem. I ran that patch 10 times without DNE (5x ldiskfs and 5x zfs) and did not hit the problem, then ran with DNE 13x and hit the problem only once (on a retest of an unrelated failure). That is a failure rate of about 7.7%, compared to the 36.8% DNE failure rate above. It might just be random chance, or the LU-14139 patch may be an aggravating factor rather than the root cause
  • the one failure with the LU-14139 revert did not show the obd_get_request_slot blockage in any of the stack traces or watchdog timers, though it still had an "rm" process in the stack trace. It is possible that the revert fixed one problem, and we are now left with a second problem of very slow "rm -rf" of files created by e.g. test_413a/b/c, though 1h would be very slow for that
Comment by Gerrit Updater [ 25/Jul/21 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/44371/
Subject: LU-14868 llite: revert 'simplify callback handling for async getattr'
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e90794af4bfac3a591983f567b4acd27bf5a0990

Comment by Andreas Dilger [ 25/Jul/21 ]

The revert patch has landed on master.

Comment by Alex Zhuravlev [ 26/Jul/21 ]

my bisecting points to LU-14868 as well

Comment by Andreas Dilger [ 26/Jul/21 ]

> my bisecting points to LU-14868 as well

LU-14868 is this ticket. Did you mean to write LU-14139?

Comment by Gerrit Updater [ 27/Jul/21 ]

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/44402
Subject: LU-14868 test: lets see if patch 44041 resolves regression
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 73a70a11cf5494f087428e506aa546e974232775
