Lustre / LU-14868

sanity: all subtests pass but test suite fails with TIMEOUT

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor

    Description

      This issue was created by maloo for Chris Horn <hornc@cray.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d1a16088-4017-4128-935f-b366895d3df1

      All 832/832 subtests passed, but the test suite failed with TIMEOUT status. Nothing is obviously wrong in the available logs.

Activity


Gerrit Updater added a comment:

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/44402
Subject: LU-14868 test: lets see if patch 44041 resolves regression
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 73a70a11cf5494f087428e506aa546e974232775

Andreas Dilger added a comment:

> my bisecting points to LU-14868 as well

LU-14868 is this ticket. Did you mean to write LU-14139?

Alex Zhuravlev added a comment:

my bisecting points to LU-14868 as well

Andreas Dilger added a comment:

Patch landed/reverted on master.

Gerrit Updater added a comment:

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/44371/
Subject: LU-14868 llite: revert 'simplify callback handling for async getattr'
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e90794af4bfac3a591983f567b4acd27bf5a0990

Andreas Dilger added a comment (edited):

More information on this issue:

• Searching the Maloo failures shows no hits for non-DNE test sessions (0/173 runs of review-ldiskfs, review-ldiskfs-arm, review-ldiskfs-ubuntu), so the problem definitely seems DNE-related (61/168 = 36.8% of review-dne-part-1 and review-dne-zfs-part-1 runs).
• The revert of the LU-14139 patch did not (completely?) solve the problem. I ran that patch 10 times without DNE (5x ldiskfs and 5x zfs) and did not hit the problem, then ran with DNE 13 times and hit the problem only once (on a retest of an unrelated failure). That is a failure rate of about 7.7%, compared to the 36.8% DNE failure rate. It might just be random chance, or the patch may be an aggravating factor (see the rough probability check after this list).
• The one failure with the revert of LU-14139 did not show the obd_get_request_slot() blockage for functions or watchdog timers, though it still had an "rm" process in the stack trace. It is possible that the revert fixed one problem, and now we are left with a second problem of a very slow "rm -rf" of the files created by e.g. test_413a/b/c, though 1h would be very slow.
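
As a rough back-of-the-envelope check of the "random chance" question, assuming the failure rate with the revert applied were still the full 36.8%, the probability of seeing at most one failure in 13 DNE runs would be

    P(X \le 1) = \binom{13}{0}(0.632)^{13} + \binom{13}{1}(0.368)(0.632)^{12} \approx 0.0026 + 0.0194 \approx 0.022

i.e. only about 2%, which suggests the revert probably did lower the failure rate, although 13 runs is still a small sample.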

Andreas Dilger added a comment:

This latter form seems to be more common in the different timeouts I looked at, along with a clearer indication that statahead/AGL are problematic:

            [21523.815920] LustreError: 1084347:0:(import.c:377:ptlrpc_invalidate_import()) @@@ still on sending list  req@00000000b5d7744a x1704730918725248/t0(0) o101->lustre-MDT0001-mdc-ffff900242516000@10.9.5.67@tcp:12/10 lens 592/744 e 0 to 0 dl 1625758991 ref 1 fl Interpret:REQU/0/0 rc -5/0 job:'ll_sa_1083589.0'
            [21523.820258] LustreError: 1084347:0:(import.c:388:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting for them to error out.
            [21544.292771] LustreError: 1084347:0:(import.c:354:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: timeout waiting for callback (6 != 0)
            :
            [21790.041768] LustreError: 1084387:0:(import.c:382:ptlrpc_invalidate_import()) @@@ still on delayed list  req@000000002b0a711a x1704730918725632/t0(0) o101->lustre-OST0005-osc-ffff900242516000@10.9.5.65@tcp:28/4 lens 328/400 e 0 to 0 dl 0 ref 1 fl Rpc:EWQU/0/ffffffff rc -5/-1 job:'ll_agl_1083589.0'
            

Andreas Dilger added a comment:

Another timed-out test session showed a different stack, with rm stuck in obd_get_request_slot() (as were a number of other threads), but it was similarly stuck after the tests finished:

            [24074.551230] rm              S    0 1087612  21959 0x00000080
            [24074.552328] Call Trace:
            [24074.552845]  __schedule+0x2c4/0x700
            [24074.554393]  schedule+0x38/0xa0
            [24074.555077]  obd_get_request_slot+0x19e/0x2b0 [obdclass]
            [24074.556886]  ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
            [24074.561142]  mdc_enqueue_base+0x335/0x1340 [mdc]
            [24074.562069]  mdc_intent_lock+0x210/0x530 [mdc]
            [24074.567209]  lmv_intent_open+0x296/0xa60 [lmv]
            [24074.570167]  lmv_intent_lock+0x19c/0x390 [lmv]
            [24074.572094]  ll_intent_file_open+0x37c/0x8c0 [lustre]
            [24074.573969]  ll_file_open+0x26a/0xc40 [lustre]
            [24074.575771]  do_dentry_open+0x132/0x340
            [24074.576567]  path_openat+0x53e/0x14f0
            [24074.579404]  do_filp_open+0x93/0x100
            [24074.581006]  do_sys_open+0x184/0x220
            [24074.412684] ptlrpcd_01_00   S    0 1076811      2 0x80004080
            [24074.413774] Call Trace:
            [24074.414315]  __schedule+0x2c4/0x700
            [24074.415837]  schedule+0x38/0xa0
            [24074.416715]  obd_get_request_slot+0x19e/0x2b0 [obdclass]
            [24074.418541]  ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
            [24074.423070]  mdc_enqueue_base+0x335/0x1340 [mdc]
            [24074.424995]  mdc_intent_lock+0x210/0x530 [mdc]
            [24074.429177]  lmv_revalidate_slaves+0x607/0x9e0 [lmv]
            [24074.431189]  lmv_merge_attr+0x32/0x170 [lmv]
            [24074.432059]  ll_update_lsm_md+0xad7/0xe20 [lustre]
            [24074.433025]  ll_update_inode+0x41f/0x650 [lustre]
            [24074.433971]  ll_read_inode2+0x4e/0x3a0 [lustre]
            [24074.434876]  ll_iget+0xbd/0x320 [lustre]
            [24074.435696]  ll_prep_inode+0x374/0x990 [lustre]
            [24074.438680]  ll_statahead_interpret+0x51b/0xb90 [lustre]
            [24074.439727]  mdc_intent_getattr_async_interpret+0x13b/0x3d0 [mdc]
            [24074.440921]  ptlrpc_check_set+0x579/0x2070 [ptlrpc]
            [24074.442769]  ptlrpcd_check+0x3f3/0x5e0 [ptlrpc]
            [24074.443698]  ptlrpcd+0x3cb/0x4c0 [ptlrpc]
            [24074.610008] ll_sa_1087612   S    0 1088357      2 0x80004080
            [24074.611119] Call Trace:
            [24074.611645]  __schedule+0x2c4/0x700
            [24074.613188]  schedule+0x38/0xa0
            [24074.613842]  obd_get_request_slot+0x19e/0x2b0 [obdclass]
            [24074.615639]  ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
            [24074.619882]  mdc_intent_getattr_async+0x25f/0x320 [mdc]
            [24074.620907]  lmv_intent_getattr_async+0x25c/0x290 [lmv]
            [24074.621944]  sa_statahead+0x8c9/0xe40 [lustre]
            [24074.623585]  ll_statahead_thread+0x913/0xaf0 [lustre]
            
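
For readers unfamiliar with obd_get_request_slot(): roughly, it enforces the client's max_rpcs_in_flight limit, blocking the caller until an earlier RPC completes and releases its slot. The sketch below is a minimal, hypothetical userspace illustration of that behaviour (invented names, pthreads in place of the kernel's wait queues; it is not the actual Lustre implementation). If reply processing stalls so that the "put" side never runs, every new enqueue blocks on the "get" side, which matches the rm, ptlrpcd_01_00 and ll_sa_* stacks above.

    /* Hypothetical sketch of a per-target RPC slot limiter in the spirit of
     * obd_get_request_slot()/obd_put_request_slot(); not Lustre code. */
    #include <pthread.h>

    struct rpc_slot_limiter {
            pthread_mutex_t lock;
            pthread_cond_t  slot_freed;
            unsigned int    in_flight;      /* RPCs currently holding a slot */
            unsigned int    max_in_flight;  /* cf. the max_rpcs_in_flight tunable */
    };

    /* Claim a slot, sleeping until one is available. */
    void slot_get(struct rpc_slot_limiter *lim)
    {
            pthread_mutex_lock(&lim->lock);
            while (lim->in_flight >= lim->max_in_flight)
                    pthread_cond_wait(&lim->slot_freed, &lim->lock);
            lim->in_flight++;
            pthread_mutex_unlock(&lim->lock);
    }

    /* Release a slot when the RPC completes.  If a reply callback never
     * finishes, this is never reached and every waiter in slot_get() hangs. */
    void slot_put(struct rpc_slot_limiter *lim)
    {
            pthread_mutex_lock(&lim->lock);
            lim->in_flight--;
            pthread_cond_signal(&lim->slot_freed);
            pthread_mutex_unlock(&lim->lock);
    }

All three stacks above are parked at the equivalent of the pthread_cond_wait() here, consistent with slots being held by async getattr RPCs whose completions never finish.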

Gerrit Updater added a comment:

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44371
Subject: LU-14868 llite: revert 'simplify callback handling for async getattr'
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 512f2c446b1388dfbad79d848b2ccf99f0b9222d

Andreas Dilger added a comment:

There are stack traces being dumped during cleanup for the same process (i.e. it is really stuck), after all of the subtests have finished:

            [19582.720606] Lustre: DEBUG MARKER: == sanity test complete, duration 19117 sec =============== 20:32:33 (1626899553)
            [20640.853345] Lustre: lustre-OST0000-osc-ffff9a44ff09b000: Connection to lustre-OST0000 (at 10.9.7.173@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [20640.857025] LustreError: 167-0: lustre-OST0000-osc-ffff9a44ff09b000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
            [20767.316538] INFO: task rm:1023489 blocked for more than 120 seconds.
            [20767.317692]       Tainted: G           OE    --------- -  - 4.18.0-240.22.1.el8_3.x86_64 #1
            [20767.319022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [20767.320261] rm              D    0 1023489  19363 0x00004080
            [20767.321183] Call Trace:
            [20767.321710]  __schedule+0x2c4/0x700
            [20767.323543]  schedule+0x38/0xa0
            [20767.324085]  bit_wait+0xd/0x50
            [20767.324620]  __wait_on_bit+0x6c/0x80
            [20767.326186]  out_of_line_wait_on_bit+0x91/0xb0
            [20767.327678]  ilookup5.part.26+0x69/0x90
            [20767.329948]  iget5_locked+0x26/0x80
            [20767.330557]  ll_iget+0x4d/0x320 [lustre]
            [20767.331230]  ll_prep_inode+0x374/0x990 [lustre]
            [20767.332004]  ll_lookup_it_finish.constprop.28+0x21d/0x1100 [lustre]
            [20767.335562]  ll_lookup_it+0x586/0x18a0 [lustre]
            [20767.336889]  ll_lookup_nd+0xfc/0x1a0 [lustre]
            [20767.337629]  __lookup_slow+0x97/0x150
            [20767.338252]  lookup_slow+0x35/0x50
            [20767.338839]  walk_component+0x1bf/0x2f0
            [20767.339487]  path_lookupat.isra.48+0x75/0x200
            [20767.341916]  filename_lookup.part.64+0xa0/0x170
            [20767.343381]  vfs_statx+0x74/0xd0
            [20767.343990]  __do_sys_newfstatat+0x31/0x70
            [20767.346290]  do_syscall_64+0x5b/0x1a0
            

This makes me suspect patch https://review.whamcloud.com/40712 "LU-14139 llite: simplify callback handling for async getattr", but there were no timeouts for any of the review tests on that patch, so possibly it is a combination of factors. The thread is blocked in ilookup5->wait_on_inode->wait_on_bit(&inode->i_state, __I_NEW), which may relate to the changes to statahead in that patch.
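
For background on that wait chain (a sketch of the generic VFS contract, not the actual ll_iget() code): iget5_locked() returns a newly allocated inode with I_NEW set, and any concurrent lookup of the same inode, such as the ilookup5() in the rm stack above, sleeps in wait_on_inode() until the creator clears I_NEW with unlock_new_inode() (or iget_failed() on error). If the statahead/async-getattr path that allocated the inode never reaches that step, later lookups hang exactly like this. In the sketch below, fill_inode_from_reply() is a hypothetical placeholder for copying attributes out of the MDS reply.

    #include <linux/fs.h>
    #include <linux/err.h>
    #include <linux/errno.h>

    /* Hypothetical placeholder: fill the inode from the getattr reply. */
    static int fill_inode_from_reply(struct inode *inode, void *opaque)
    {
            return 0;
    }

    /* Illustrative sketch of the iget5_locked()/I_NEW contract the hung
     * "rm" thread depends on; not the real ll_iget(). */
    struct inode *example_iget(struct super_block *sb, unsigned long hash,
                               int (*test)(struct inode *, void *),
                               int (*set)(struct inode *, void *),
                               void *opaque)
    {
            struct inode *inode = iget5_locked(sb, hash, test, set, opaque);

            if (!inode)
                    return ERR_PTR(-ENOMEM);

            if (inode->i_state & I_NEW) {
                    /* We allocated it, so we own its initialization. */
                    if (fill_inode_from_reply(inode, opaque)) {
                            /* Waiters must be woken even on failure. */
                            iget_failed(inode);
                            return ERR_PTR(-EIO);
                    }
                    /* Clears I_NEW and wakes everyone sleeping in
                     * ilookup5()/wait_on_inode(); if this is skipped,
                     * those lookups wait forever. */
                    unlock_new_inode(inode);
            }
            return inode;
    }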


People

    Assignee: WC Triage
    Reporter: Maloo
    Votes: 0
    Watchers: 8
