Lustre / LU-14868

sanity: all subtests pass but test suite fails with TIMEOUT

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor

    Description

      This issue was created by maloo for Chris Horn <hornc@cray.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d1a16088-4017-4128-935f-b366895d3df1

      All 832/832 subtests passed, but the test suite failed with TIMEOUT status. Nothing is obviously wrong in the available logs.

Activity


Gerrit Updater added a comment:

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/44402
Subject: LU-14868 test: lets see if patch 44041 resolves regression
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 73a70a11cf5494f087428e506aa546e974232775

Andreas Dilger added a comment:

> my bisecting points to LU-14868 as well

LU-14868 is this ticket. Did you mean to write LU-14139?

Alex Zhuravlev added a comment:

my bisecting points to LU-14868 as well

Andreas Dilger added a comment:

Patch landed/reverted on master.

Gerrit Updater added a comment:

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/44371/
Subject: LU-14868 llite: revert 'simplify callback handling for async getattr'
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e90794af4bfac3a591983f567b4acd27bf5a0990

Andreas Dilger added a comment (edited):

More information on this issue:

• Searching the Maloo failures shows no hits for non-DNE test sessions (0/173 runs of review-ldiskfs, review-ldiskfs-arm, review-ldiskfs-ubuntu), so the problem definitely seems DNE-related (61/168 = 36.8% of review-dne-part-1 and review-dne-zfs-part-1 runs).
• The revert of the LU-14139 patch did not (completely?) solve the problem. I ran that patch 10 times without DNE (5x ldiskfs and 5x zfs) and did not hit the problem, then ran with DNE 13 times and hit the problem only once (on a retest of an unrelated failure). That is a failure rate of about 7.7%, compared to the 36.8% DNE failure rate. It might just be random chance, or the patch may be an aggravating factor (see the rough probability check after this list).
• The one failure with the revert of LU-14139 did not show the obd_get_request_slot() blockage for functions or watchdog timers, though it still had an "rm" process in the stack trace. It is possible that the revert fixed one problem, and now we are left with a second problem of a very slow "rm -rf" of the files created by e.g. test_413a/b/c, though 1h would be very slow.
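
As a rough back-of-the-envelope check of the "random chance" question, assuming the failure rate with the revert applied were still the full 36.8%, the probability of seeing at most one failure in 13 DNE runs would be

    P(X \le 1) = \binom{13}{0}(0.632)^{13} + \binom{13}{1}(0.368)(0.632)^{12} \approx 0.0026 + 0.0194 \approx 0.022

i.e. only about 2%, which suggests the revert probably did lower the failure rate, although 13 runs is still a small sample.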

Andreas Dilger added a comment:

This latter form seems to be more common in the different timeouts I looked at, along with a clearer indication that statahead/AGL are problematic:

            [21523.815920] LustreError: 1084347:0:(import.c:377:ptlrpc_invalidate_import()) @@@ still on sending list  req@00000000b5d7744a x1704730918725248/t0(0) o101->lustre-MDT0001-mdc-ffff900242516000@10.9.5.67@tcp:12/10 lens 592/744 e 0 to 0 dl 1625758991 ref 1 fl Interpret:REQU/0/0 rc -5/0 job:'ll_sa_1083589.0'
            [21523.820258] LustreError: 1084347:0:(import.c:388:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting for them to error out.
            [21544.292771] LustreError: 1084347:0:(import.c:354:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: timeout waiting for callback (6 != 0)
            :
            [21790.041768] LustreError: 1084387:0:(import.c:382:ptlrpc_invalidate_import()) @@@ still on delayed list  req@000000002b0a711a x1704730918725632/t0(0) o101->lustre-OST0005-osc-ffff900242516000@10.9.5.65@tcp:28/4 lens 328/400 e 0 to 0 dl 0 ref 1 fl Rpc:EWQU/0/ffffffff rc -5/-1 job:'ll_agl_1083589.0'
            

Andreas Dilger added a comment:

Another timed-out test session showed a different stack, with rm stuck in obd_get_request_slot() (as were a number of other threads), but it was similarly stuck after the tests finished:

            [24074.551230] rm              S    0 1087612  21959 0x00000080
            [24074.552328] Call Trace:
            [24074.552845]  __schedule+0x2c4/0x700
            [24074.554393]  schedule+0x38/0xa0
            [24074.555077]  obd_get_request_slot+0x19e/0x2b0 [obdclass]
            [24074.556886]  ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
            [24074.561142]  mdc_enqueue_base+0x335/0x1340 [mdc]
            [24074.562069]  mdc_intent_lock+0x210/0x530 [mdc]
            [24074.567209]  lmv_intent_open+0x296/0xa60 [lmv]
            [24074.570167]  lmv_intent_lock+0x19c/0x390 [lmv]
            [24074.572094]  ll_intent_file_open+0x37c/0x8c0 [lustre]
            [24074.573969]  ll_file_open+0x26a/0xc40 [lustre]
            [24074.575771]  do_dentry_open+0x132/0x340
            [24074.576567]  path_openat+0x53e/0x14f0
            [24074.579404]  do_filp_open+0x93/0x100
            [24074.581006]  do_sys_open+0x184/0x220
            [24074.412684] ptlrpcd_01_00   S    0 1076811      2 0x80004080
            [24074.413774] Call Trace:
            [24074.414315]  __schedule+0x2c4/0x700
            [24074.415837]  schedule+0x38/0xa0
            [24074.416715]  obd_get_request_slot+0x19e/0x2b0 [obdclass]
            [24074.418541]  ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
            [24074.423070]  mdc_enqueue_base+0x335/0x1340 [mdc]
            [24074.424995]  mdc_intent_lock+0x210/0x530 [mdc]
            [24074.429177]  lmv_revalidate_slaves+0x607/0x9e0 [lmv]
            [24074.431189]  lmv_merge_attr+0x32/0x170 [lmv]
            [24074.432059]  ll_update_lsm_md+0xad7/0xe20 [lustre]
            [24074.433025]  ll_update_inode+0x41f/0x650 [lustre]
            [24074.433971]  ll_read_inode2+0x4e/0x3a0 [lustre]
            [24074.434876]  ll_iget+0xbd/0x320 [lustre]
            [24074.435696]  ll_prep_inode+0x374/0x990 [lustre]
            [24074.438680]  ll_statahead_interpret+0x51b/0xb90 [lustre]
            [24074.439727]  mdc_intent_getattr_async_interpret+0x13b/0x3d0 [mdc]
            [24074.440921]  ptlrpc_check_set+0x579/0x2070 [ptlrpc]
            [24074.442769]  ptlrpcd_check+0x3f3/0x5e0 [ptlrpc]
            [24074.443698]  ptlrpcd+0x3cb/0x4c0 [ptlrpc]
            [24074.610008] ll_sa_1087612   S    0 1088357      2 0x80004080
            [24074.611119] Call Trace:
            [24074.611645]  __schedule+0x2c4/0x700
            [24074.613188]  schedule+0x38/0xa0
            [24074.613842]  obd_get_request_slot+0x19e/0x2b0 [obdclass]
            [24074.615639]  ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
            [24074.619882]  mdc_intent_getattr_async+0x25f/0x320 [mdc]
            [24074.620907]  lmv_intent_getattr_async+0x25c/0x290 [lmv]
            [24074.621944]  sa_statahead+0x8c9/0xe40 [lustre]
            [24074.623585]  ll_statahead_thread+0x913/0xaf0 [lustre]
            
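
For readers unfamiliar with obd_get_request_slot(): roughly, it enforces the client's max_rpcs_in_flight limit, blocking the caller until an earlier RPC completes and releases its slot. The sketch below is a minimal, hypothetical userspace illustration of that behaviour (invented names, pthreads in place of the kernel's wait queues; it is not the actual Lustre implementation). If reply processing stalls so that the "put" side never runs, every new enqueue blocks on the "get" side, which matches the rm, ptlrpcd_01_00 and ll_sa_* stacks above.

    /* Hypothetical sketch of a per-target RPC slot limiter in the spirit of
     * obd_get_request_slot()/obd_put_request_slot(); not Lustre code. */
    #include <pthread.h>

    struct rpc_slot_limiter {
            pthread_mutex_t lock;
            pthread_cond_t  slot_freed;
            unsigned int    in_flight;      /* RPCs currently holding a slot */
            unsigned int    max_in_flight;  /* cf. the max_rpcs_in_flight tunable */
    };

    /* Claim a slot, sleeping until one is available. */
    void slot_get(struct rpc_slot_limiter *lim)
    {
            pthread_mutex_lock(&lim->lock);
            while (lim->in_flight >= lim->max_in_flight)
                    pthread_cond_wait(&lim->slot_freed, &lim->lock);
            lim->in_flight++;
            pthread_mutex_unlock(&lim->lock);
    }

    /* Release a slot when the RPC completes.  If a reply callback never
     * finishes, this is never reached and every waiter in slot_get() hangs. */
    void slot_put(struct rpc_slot_limiter *lim)
    {
            pthread_mutex_lock(&lim->lock);
            lim->in_flight--;
            pthread_cond_signal(&lim->slot_freed);
            pthread_mutex_unlock(&lim->lock);
    }

All three stacks above are parked at the equivalent of the pthread_cond_wait() here, consistent with slots being held by async getattr RPCs whose completions never finish.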

Gerrit Updater added a comment:

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44371
Subject: LU-14868 llite: revert 'simplify callback handling for async getattr'
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 512f2c446b1388dfbad79d848b2ccf99f0b9222d

Andreas Dilger added a comment:

There are stack traces being dumped during cleanup for the same process (i.e. it is really stuck), after all of the subtests have finished:

            [19582.720606] Lustre: DEBUG MARKER: == sanity test complete, duration 19117 sec =============== 20:32:33 (1626899553)
            [20640.853345] Lustre: lustre-OST0000-osc-ffff9a44ff09b000: Connection to lustre-OST0000 (at 10.9.7.173@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [20640.857025] LustreError: 167-0: lustre-OST0000-osc-ffff9a44ff09b000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
            [20767.316538] INFO: task rm:1023489 blocked for more than 120 seconds.
            [20767.317692]       Tainted: G           OE    --------- -  - 4.18.0-240.22.1.el8_3.x86_64 #1
            [20767.319022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [20767.320261] rm              D    0 1023489  19363 0x00004080
            [20767.321183] Call Trace:
            [20767.321710]  __schedule+0x2c4/0x700
            [20767.323543]  schedule+0x38/0xa0
            [20767.324085]  bit_wait+0xd/0x50
            [20767.324620]  __wait_on_bit+0x6c/0x80
            [20767.326186]  out_of_line_wait_on_bit+0x91/0xb0
            [20767.327678]  ilookup5.part.26+0x69/0x90
            [20767.329948]  iget5_locked+0x26/0x80
            [20767.330557]  ll_iget+0x4d/0x320 [lustre]
            [20767.331230]  ll_prep_inode+0x374/0x990 [lustre]
            [20767.332004]  ll_lookup_it_finish.constprop.28+0x21d/0x1100 [lustre]
            [20767.335562]  ll_lookup_it+0x586/0x18a0 [lustre]
            [20767.336889]  ll_lookup_nd+0xfc/0x1a0 [lustre]
            [20767.337629]  __lookup_slow+0x97/0x150
            [20767.338252]  lookup_slow+0x35/0x50
            [20767.338839]  walk_component+0x1bf/0x2f0
            [20767.339487]  path_lookupat.isra.48+0x75/0x200
            [20767.341916]  filename_lookup.part.64+0xa0/0x170
            [20767.343381]  vfs_statx+0x74/0xd0
            [20767.343990]  __do_sys_newfstatat+0x31/0x70
            [20767.346290]  do_syscall_64+0x5b/0x1a0
            

This makes me suspect patch https://review.whamcloud.com/40712 "LU-14139 llite: simplify callback handling for async getattr", but there were no timeouts for any of the review tests on that patch, so possibly it is a combination of factors. The thread is blocked in ilookup5->wait_on_inode->wait_on_bit(&inode->i_state, __I_NEW), which may relate to the changes to statahead in that patch.
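
For background on that wait chain (a sketch of the generic VFS contract, not the actual ll_iget() code): iget5_locked() returns a newly allocated inode with I_NEW set, and any concurrent lookup of the same inode, such as the ilookup5() in the rm stack above, sleeps in wait_on_inode() until the creator clears I_NEW with unlock_new_inode() (or iget_failed() on error). If the statahead/async-getattr path that allocated the inode never reaches that step, later lookups hang exactly like this. In the sketch below, fill_inode_from_reply() is a hypothetical placeholder for copying attributes out of the MDS reply.

    #include <linux/fs.h>
    #include <linux/err.h>
    #include <linux/errno.h>

    /* Hypothetical placeholder: fill the inode from the getattr reply. */
    static int fill_inode_from_reply(struct inode *inode, void *opaque)
    {
            return 0;
    }

    /* Illustrative sketch of the iget5_locked()/I_NEW contract the hung
     * "rm" thread depends on; not the real ll_iget(). */
    struct inode *example_iget(struct super_block *sb, unsigned long hash,
                               int (*test)(struct inode *, void *),
                               int (*set)(struct inode *, void *),
                               void *opaque)
    {
            struct inode *inode = iget5_locked(sb, hash, test, set, opaque);

            if (!inode)
                    return ERR_PTR(-ENOMEM);

            if (inode->i_state & I_NEW) {
                    /* We allocated it, so we own its initialization. */
                    if (fill_inode_from_reply(inode, opaque)) {
                            /* Waiters must be woken even on failure. */
                            iget_failed(inode);
                            return ERR_PTR(-EIO);
                    }
                    /* Clears I_NEW and wakes everyone sleeping in
                     * ilookup5()/wait_on_inode(); if this is skipped,
                     * those lookups wait forever. */
                    unlock_new_inode(inode);
            }
            return inode;
    }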


People

    Assignee: WC Triage
    Reporter: Maloo
    Votes: 0
    Watchers: 8
