Lustre / LU-14868

sanity: all subtests pass but test suite fails with TIMEOUT

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor

    Description

      This issue was created by maloo for Chris Horn <hornc@cray.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d1a16088-4017-4128-935f-b366895d3df1

      All 832/832 subtests passed, but the test suite failed with TIMEOUT status. Nothing is obviously wrong in the available logs.


          Activity


            Gerrit Updater added a comment:

            Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/44371/
            Subject: LU-14868 llite: revert 'simplify callback handling for async getattr'
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e90794af4bfac3a591983f567b4acd27bf5a0990
            Andreas Dilger added a comment (edited):

            More information on this issue:

            • searching the Maloo failures shows no hits for non-DNE test sessions (0/173 runs of review-ldiskfs, review-ldiskfs-arm, review-ldiskfs-ubuntu), so the problem definitely seems DNE-related (61/168 ≈ 36% of review-dne-part-1 and review-dne-zfs-part-1 runs)
            • the revert of the LU-14139 patch did not (completely?) solve the problem. I ran that patch 10 times without DNE (5x ldiskfs and 5x zfs) without hitting the problem, then ran with DNE 13x and hit the problem only once (on a retest of an unrelated failure). That is a failure rate of about 7.7%, compared to the ~36% DNE failure rate without the revert. It might just be random chance, or DNE may be an aggravating factor (see the quick arithmetic check sketched after this list)
            • the one failure with the revert of LU-14139 did not show the obd_get_request_slot blockage in the stack traces or watchdog timers, though it still had an "rm" process in the stack trace. It is possible that the revert fixed one problem, and we are now left with a second problem of very slow "rm -rf" of files created by e.g. test_413a/b/c, though 1h would be very slow
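            As a quick arithmetic check on the "random chance" question (a back-of-the-envelope calculation on the numbers above, not additional test data): if the per-run failure rate with the revert were still around 36%, seeing at most one failure in 13 DNE runs would be expected only about 2% of the time, which suggests the revert does help. A minimal sketch of that binomial calculation:

            /* Binomial check: P(at most 1 failure in 13 runs) given a 61/168 per-run rate. */
            #include <math.h>
            #include <stdio.h>

            int main(void)
            {
                double p = 61.0 / 168.0;  /* observed DNE failure rate without the revert */
                int n = 13;               /* DNE test runs with the revert applied */

                double p0 = pow(1.0 - p, n);              /* probability of zero failures */
                double p1 = n * p * pow(1.0 - p, n - 1);  /* probability of exactly one failure */

                printf("P(<=1 failure in %d runs) = %.3f\n", n, p0 + p1);  /* prints ~0.024 */
                return 0;
            }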

            Andreas Dilger added a comment:

            This latter form seems to be more common in the different timeouts I looked at, along with a clearer indication that statahead/AGL are problematic:

            [21523.815920] LustreError: 1084347:0:(import.c:377:ptlrpc_invalidate_import()) @@@ still on sending list  req@00000000b5d7744a x1704730918725248/t0(0) o101->lustre-MDT0001-mdc-ffff900242516000@10.9.5.67@tcp:12/10 lens 592/744 e 0 to 0 dl 1625758991 ref 1 fl Interpret:REQU/0/0 rc -5/0 job:'ll_sa_1083589.0'
            [21523.820258] LustreError: 1084347:0:(import.c:388:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting for them to error out.
            [21544.292771] LustreError: 1084347:0:(import.c:354:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: timeout waiting for callback (6 != 0)
            :
            [21790.041768] LustreError: 1084387:0:(import.c:382:ptlrpc_invalidate_import()) @@@ still on delayed list  req@000000002b0a711a x1704730918725632/t0(0) o101->lustre-OST0005-osc-ffff900242516000@10.9.5.65@tcp:28/4 lens 328/400 e 0 to 0 dl 0 ref 1 fl Rpc:EWQU/0/ffffffff rc -5/-1 job:'ll_agl_1083589.0'
            

            Andreas Dilger added a comment:

            Another timed-out test session showed a different stack, with rm stuck in obd_get_request_slot(), as were a number of other threads; it was similarly stuck after the tests had finished:

            [24074.551230] rm              S    0 1087612  21959 0x00000080
            [24074.552328] Call Trace:
            [24074.552845]  __schedule+0x2c4/0x700
            [24074.554393]  schedule+0x38/0xa0
            [24074.555077]  obd_get_request_slot+0x19e/0x2b0 [obdclass]
            [24074.556886]  ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
            [24074.561142]  mdc_enqueue_base+0x335/0x1340 [mdc]
            [24074.562069]  mdc_intent_lock+0x210/0x530 [mdc]
            [24074.567209]  lmv_intent_open+0x296/0xa60 [lmv]
            [24074.570167]  lmv_intent_lock+0x19c/0x390 [lmv]
            [24074.572094]  ll_intent_file_open+0x37c/0x8c0 [lustre]
            [24074.573969]  ll_file_open+0x26a/0xc40 [lustre]
            [24074.575771]  do_dentry_open+0x132/0x340
            [24074.576567]  path_openat+0x53e/0x14f0
            [24074.579404]  do_filp_open+0x93/0x100
            [24074.581006]  do_sys_open+0x184/0x220
            [24074.412684] ptlrpcd_01_00   S    0 1076811      2 0x80004080
            [24074.413774] Call Trace:
            [24074.414315]  __schedule+0x2c4/0x700
            [24074.415837]  schedule+0x38/0xa0
            [24074.416715]  obd_get_request_slot+0x19e/0x2b0 [obdclass]
            [24074.418541]  ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
            [24074.423070]  mdc_enqueue_base+0x335/0x1340 [mdc]
            [24074.424995]  mdc_intent_lock+0x210/0x530 [mdc]
            [24074.429177]  lmv_revalidate_slaves+0x607/0x9e0 [lmv]
            [24074.431189]  lmv_merge_attr+0x32/0x170 [lmv]
            [24074.432059]  ll_update_lsm_md+0xad7/0xe20 [lustre]
            [24074.433025]  ll_update_inode+0x41f/0x650 [lustre]
            [24074.433971]  ll_read_inode2+0x4e/0x3a0 [lustre]
            [24074.434876]  ll_iget+0xbd/0x320 [lustre]
            [24074.435696]  ll_prep_inode+0x374/0x990 [lustre]
            [24074.438680]  ll_statahead_interpret+0x51b/0xb90 [lustre]
            [24074.439727]  mdc_intent_getattr_async_interpret+0x13b/0x3d0 [mdc]
            [24074.440921]  ptlrpc_check_set+0x579/0x2070 [ptlrpc]
            [24074.442769]  ptlrpcd_check+0x3f3/0x5e0 [ptlrpc]
            [24074.443698]  ptlrpcd+0x3cb/0x4c0 [ptlrpc]
            [24074.610008] ll_sa_1087612   S    0 1088357      2 0x80004080
            [24074.611119] Call Trace:
            [24074.611645]  __schedule+0x2c4/0x700
            [24074.613188]  schedule+0x38/0xa0
            [24074.613842]  obd_get_request_slot+0x19e/0x2b0 [obdclass]
            [24074.615639]  ldlm_cli_enqueue+0x5f2/0x9c0 [ptlrpc]
            [24074.619882]  mdc_intent_getattr_async+0x25f/0x320 [mdc]
            [24074.620907]  lmv_intent_getattr_async+0x25c/0x290 [lmv]
            [24074.621944]  sa_statahead+0x8c9/0xe40 [lustre]
            [24074.623585]  ll_statahead_thread+0x913/0xaf0 [lustre]
            
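            For context on obd_get_request_slot(): as I understand it, this is the client-side throttle that bounds the number of RPCs in flight per target (cf. the mdc.*.max_rpcs_in_flight tunable), and a caller sleeps there until some other request releases its slot. Note that the ptlrpcd_01_00 stack above is taking a slot from inside an RPC interpret callback, so if every slot were held by requests whose completions are processed by that same ptlrpcd thread, these threads could end up waiting on each other. A minimal sketch of the general acquire/release pattern (illustrative only, with made-up names; this is not the Lustre implementation):

            #include <linux/spinlock.h>
            #include <linux/types.h>
            #include <linux/wait.h>

            struct slot_throttle {
                spinlock_t        lock;
                unsigned int      in_flight;   /* slots currently held */
                unsigned int      max;         /* cf. max_rpcs_in_flight */
                wait_queue_head_t waitq;
            };

            static bool slot_try_get(struct slot_throttle *st)
            {
                bool got = false;

                spin_lock(&st->lock);
                if (st->in_flight < st->max) {
                    st->in_flight++;
                    got = true;
                }
                spin_unlock(&st->lock);
                return got;
            }

            /* Sleep until a slot is free, then take it (cf. obd_get_request_slot()). */
            static int slot_get(struct slot_throttle *st)
            {
                return wait_event_interruptible(st->waitq, slot_try_get(st));
            }

            /* Release a slot and wake a waiter (cf. obd_put_request_slot()). If this
             * release path is never reached, e.g. because completion processing is
             * itself blocked in slot_get(), every waiter above stays stuck. */
            static void slot_put(struct slot_throttle *st)
            {
                spin_lock(&st->lock);
                st->in_flight--;
                spin_unlock(&st->lock);
                wake_up(&st->waitq);
            }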

            Gerrit Updater added a comment:

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44371
            Subject: LU-14868 llite: revert 'simplify callback handling for async getattr'
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 512f2c446b1388dfbad79d848b2ccf99f0b9222d

            Andreas Dilger added a comment:

            There are stack traces being dumped during cleanup for the same process (i.e. it is really stuck), after all of the subtests have finished:

            [19582.720606] Lustre: DEBUG MARKER: == sanity test complete, duration 19117 sec =============== 20:32:33 (1626899553)
            [20640.853345] Lustre: lustre-OST0000-osc-ffff9a44ff09b000: Connection to lustre-OST0000 (at 10.9.7.173@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [20640.857025] LustreError: 167-0: lustre-OST0000-osc-ffff9a44ff09b000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
            [20767.316538] INFO: task rm:1023489 blocked for more than 120 seconds.
            [20767.317692]       Tainted: G           OE    --------- -  - 4.18.0-240.22.1.el8_3.x86_64 #1
            [20767.319022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [20767.320261] rm              D    0 1023489  19363 0x00004080
            [20767.321183] Call Trace:
            [20767.321710]  __schedule+0x2c4/0x700
            [20767.323543]  schedule+0x38/0xa0
            [20767.324085]  bit_wait+0xd/0x50
            [20767.324620]  __wait_on_bit+0x6c/0x80
            [20767.326186]  out_of_line_wait_on_bit+0x91/0xb0
            [20767.327678]  ilookup5.part.26+0x69/0x90
            [20767.329948]  iget5_locked+0x26/0x80
            [20767.330557]  ll_iget+0x4d/0x320 [lustre]
            [20767.331230]  ll_prep_inode+0x374/0x990 [lustre]
            [20767.332004]  ll_lookup_it_finish.constprop.28+0x21d/0x1100 [lustre]
            [20767.335562]  ll_lookup_it+0x586/0x18a0 [lustre]
            [20767.336889]  ll_lookup_nd+0xfc/0x1a0 [lustre]
            [20767.337629]  __lookup_slow+0x97/0x150
            [20767.338252]  lookup_slow+0x35/0x50
            [20767.338839]  walk_component+0x1bf/0x2f0
            [20767.339487]  path_lookupat.isra.48+0x75/0x200
            [20767.341916]  filename_lookup.part.64+0xa0/0x170
            [20767.343381]  vfs_statx+0x74/0xd0
            [20767.343990]  __do_sys_newfstatat+0x31/0x70
            [20767.346290]  do_syscall_64+0x5b/0x1a0
            

            This makes me suspect patch https://review.whamcloud.com/40712 "LU-14139 llite: simplify callback handling for async getattr", but there were no timeouts for any of the review tests on that patch itself, so possibly it is a combination of factors. The thread is blocked in ilookup5->wait_on_inode->wait_on_bit(&inode->i_state, __I_NEW), which may relate to the changes to statahead in that patch.
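            For context, the I_NEW wait is the generic VFS contract: iget5_locked() returns a newly allocated inode with I_NEW set, and every later ilookup5()/iget5_locked() on the same inode sleeps in wait_on_inode() until the allocating thread clears I_NEW via unlock_new_inode() (or iget_failed() on error). So if the reworked statahead interpret path could ever drop an inode without clearing I_NEW, an unrelated "rm"/stat lookup would block exactly as shown above; that is only a hypothesis. A minimal generic sketch of the contract (hypothetical names, not the Lustre code):

            #include <linux/err.h>
            #include <linux/fs.h>

            static int my_test(struct inode *inode, void *opaque) { return 1; }  /* match by fid in real code */
            static int my_set(struct inode *inode, void *opaque)  { return 0; }
            static int my_fill_inode(struct inode *inode, void *opaque) { return 0; }  /* fill attrs from reply */

            static struct inode *my_iget(struct super_block *sb, unsigned long hash, void *opaque)
            {
                struct inode *inode = iget5_locked(sb, hash, my_test, my_set, opaque);

                if (!inode)
                    return ERR_PTR(-ENOMEM);

                if (inode->i_state & I_NEW) {
                    /* This thread owns initialization; concurrent ilookup5()/
                     * iget5_locked() callers now sleep in wait_on_inode() until
                     * one of the two calls below clears I_NEW. Returning without
                     * either call leaves them blocked forever, which is the
                     * symptom in the "rm" stack above. */
                    if (my_fill_inode(inode, opaque)) {
                        iget_failed(inode);       /* error: unhash and wake waiters */
                        return ERR_PTR(-EIO);
                    }
                    unlock_new_inode(inode);      /* success: clear I_NEW and wake waiters */
                }
                return inode;
            }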


            Andreas Dilger added a comment:

            According to my Maloo search, it looks like the timeouts with '832/832' (i.e. all) subtests passed started on 2021-07-08. There were other sanity timeouts before then, but they always happened while running a subtest (e.g. 'n-1/n' subtests run) and only 3-5 per day. On 2021-07-09 there were 29 timeouts with '832/832' subtests run, and similar numbers on later days (excluding a rash of patch-specific timeouts on 2021-07-12 due to LU-13799). From 2021-07-08 to 2021-07-21 there have been 189/1042 ≈ 18% sanity timeouts (of all kinds), compared to 15/462 ≈ 3% for the previous week.

            Patches that landed to lustre/ on 2021-07-08 (not including lnet/; they landed at about 2:00 am, so were available for nearly a full day of test runs that day) are:

            f21c507fbc LU-14779 utils: no DNS lookups for NID in get_param ***
            4fc127428f LU-14778 readahead: fix to reserve min pages
            791f656a03 LU-14776 ldiskfs: Add Ubuntu 20.04 HWE support
            a81c093a93 LU-14767 utils: mkfs.lustre allow lazy_itable_init=1
            b31792b0e7 LU-11872 quota: add get/set project support for non-dir/file
            d775f9ae37 LU-11698 libcfs: Add checksum speed under /sys/fs ***
            25aa852737 LU-14648 lod: protect lod_object layout info
            002c2a8026 LU-14762 lmv: compare space to mkdir on parent MDT
            8d1895f2f6 LU-14654 tests: Ensure recovery_limit zero works as expected
            a15eb4f132 LU-13055 mdd: per-user changelog names and mask ***
            3070ca9b18 LU-4684 tests: enable racer directory migration
            cbaaa7cde4 LU-14139 llite: simplify callback handling for async getattr
            5c98de8566 LU-6142 obdclass: resolve lu_ref checkpatch issues
            f5967b06aa LU-14734 osd-ldiskfs: enable large_dir automatically
            

            The patches modifying sanity.sh are marked with ***, but none of them looks like it creates more than a handful of files.


            Andreas Dilger added a comment:

            It appears that this is caused by a slow cleanup of test files from a subtest that left a lot of files behind. Investigating which test is causing this.

            Andreas Dilger added a comment: +1 on master: https://testing.whamcloud.com/test_sessions/a6bffc7d-e7ad-4289-980a-1ffc677d0189
            Olaf Faaland added a comment (edited): +1 on master: https://testing.whamcloud.com/test_sets/fc893a44-4e28-42a1-9ba8-a9a52345bbd8
            Olaf Faaland added a comment (edited): +1 on master: https://testing.whamcloud.com/test_sets/4878a6f1-c343-41bf-88b3-9baa2b1e25fd

            People

              Assignee: WC Triage
              Reporter: Maloo
              Votes: 0
              Watchers: 8
