[LU-15657] replay-dual test_30: operation ldlm_enqueue failed: rc = -107 Created: 17/Mar/22 Updated: 20/Jul/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/ce9f15e1-c403-44ac-8152-162384f0d6a8

test_30 failed with the following error:

Timeout occurred after 117 mins, last suite running was replay-dual

[Wed Feb 9 10:09:01 2022] Lustre: DEBUG MARKER: == replay-dual test 30: layout lock replay is not blocked on IO ========================================================== 10:09:01 (1644401341)
[Wed Feb 9 10:09:05 2022] LustreError: 11-0: lustre-MDT0000-lwp-OST0001: operation ldlm_enqueue to node 10.240.22.123@tcp failed: rc = -107
[Wed Feb 9 10:09:05 2022] LustreError: Skipped 1 previous similar message
[Wed Feb 9 10:09:25 2022] Lustre: ll_ost00_045: service thread pid 116830 was inactive for 62.178 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[Wed Feb 9 10:09:25 2022] Pid: 116786, comm: ll_ost00_011 4.18.0-348.2.1.el8_lustre.x86_64 #1 SMP Tue Feb 8 01:15:01 UTC 2022
[Wed Feb 9 10:09:25 2022] Lustre: Skipped 1 previous similar message
[Wed Feb 9 10:09:25 2022] Call Trace TBD:
[Wed Feb 9 10:09:25 2022] [<0>] ldlm_completion_ast+0x7ac/0x900 [ptlrpc]
[Wed Feb 9 10:09:25 2022] [<0>] ldlm_cli_enqueue_local+0x307/0x860 [ptlrpc]
[Wed Feb 9 10:09:25 2022] [<0>] ofd_destroy_by_fid+0x235/0x4a0 [ofd]
[Wed Feb 9 10:09:25 2022] [<0>] ofd_destroy_hdl+0x263/0xa10 [ofd]
[Wed Feb 9 10:09:25 2022] [<0>] tgt_request_handle+0xc93/0x1a40 [ptlrpc]
[Wed Feb 9 10:09:25 2022] [<0>] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
[Wed Feb 9 10:09:25 2022] [<0>] ptlrpc_main+0xc06/0x1560 [ptlrpc]
[Wed Feb 9 10:09:25 2022] [<0>] kthread+0x116/0x130
[Wed Feb 9 10:09:25 2022] [<0>] ret_from_fork+0x35/0x40
[Wed Feb 9 10:09:25 2022] Pid: 116830, comm: ll_ost00_045 4.18.0-348.2.1.el8_lustre.x86_64 #1 SMP Tue Feb 8 01:15:01 UTC 2022
[Wed Feb 9 10:09:25 2022] Call Trace TBD:
[Wed Feb 9 10:09:25 2022] [<0>] ldlm_completion_ast+0x7ac/0x900 [ptlrpc]
[Wed Feb 9 10:09:25 2022] [<0>] ldlm_cli_enqueue_local+0x307/0x860 [ptlrpc]
[Wed Feb 9 10:09:25 2022] [<0>] ofd_destroy_by_fid+0x235/0x4a0 [ofd]
[Wed Feb 9 10:09:25 2022] [<0>] ofd_destroy_hdl+0x263/0xa10 [ofd]
[Wed Feb 9 10:09:25 2022] [<0>] tgt_request_handle+0xc93/0x1a40 [ptlrpc]
[Wed Feb 9 10:09:25 2022] [<0>] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
[Wed Feb 9 10:09:25 2022] [<0>] ptlrpc_main+0xc06/0x1560 [ptlrpc]
[Wed Feb 9 10:09:25 2022] [<0>] kthread+0x116/0x130
[Wed Feb 9 10:09:25 2022] [<0>] ret_from_fork+0x35/0x40
[Wed Feb 9 10:09:25 2022] Pid: 116816, comm: ll_ost00_031 4.18.0-348.2.1.el8_lustre.x86_64 #1 SMP Tue Feb 8 01:15:01 UTC 2022
[Wed Feb 9 10:09:25 2022] Call Trace TBD:
[Wed Feb 9 10:09:25 2022] Lustre: ll_ost00_014: service thread pid 116797 was inactive for 62.212 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[Wed Feb 9 10:09:25 2022] [<0>] ldlm_completion_ast+0x7ac/0x900 [ptlrpc]
[Wed Feb 9 10:09:25 2022] [<0>] ldlm_cli_enqueue_local+0x307/0x860 [ptlrpc]
[Wed Feb 9 10:09:25 2022] [<0>] ofd_destroy_by_fid+0x235/0x4a0 [ofd]
[Wed Feb 9 10:09:25 2022] [<0>] ofd_destroy_hdl+0x263/0xa10 [ofd]
[Wed Feb 9 10:09:25 2022] [<0>] tgt_request_handle+0xc93/0x1a40 [ptlrpc]
[Wed Feb 9 10:09:25 2022] [<0>] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
[Wed Feb 9 10:09:25 2022] [<0>] ptlrpc_main+0xc06/0x1560 [ptlrpc]
[Wed Feb 9 10:09:25 2022] [<0>] kthread+0x116/0x130
[Wed Feb 9 10:09:25 2022] [<0>] ret_from_fork+0x35/0x40
[Wed Feb 9 10:09:26 2022] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
[Wed Feb 9 10:09:27 2022] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-26vm4.onyx.whamcloud.com: executing set_default_debug -1 all 4
[Wed Feb 9 10:09:27 2022] Lustre: DEBUG MARKER: onyx-26vm4.onyx.whamcloud.com: executing set_default_debug -1 all 4
[Wed Feb 9 10:11:49 2022] Lustre: ll_ost00_003: service thread pid 62623 was inactive for 144.982 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[Wed Feb 9 10:11:49 2022] Lustre: Skipped 54 previous similar messages
[Wed Feb 9 10:11:53 2022] Lustre: ll_ost00_052: service thread pid 118905 was inactive for 143.741 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[Wed Feb 9 10:11:53 2022] Lustre: Skipped 7 previous similar messages
[Wed Feb 9 10:11:57 2022] Lustre: ll_ost00_072: service thread pid 131846 was inactive for 143.588 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[Wed Feb 9 10:11:57 2022] Lustre: Skipped 3 previous similar messages
[Wed Feb 9 10:12:01 2022] Lustre: ll_ost00_074: service thread pid 131848 was inactive for 144.645 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[Wed Feb 9 10:12:01 2022] Lustre: Skipped 2 previous similar messages
[Wed Feb 9 10:13:31 2022] LustreError: 116830:0:(ldlm_request.c:123:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1644401304, 308s ago); not entering recovery in server code, just going back to sleep ns: filter-lustre-OST0006_UUID lock: 0000000096b27b45/0xca050e0025d0b7f7 lrc: 3/0,1 mode: --/PW res: [0xdee:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) gid 0 flags: 0x40010080000000 nid: local remote: 0x0 expref: -99 pid: 116830 timeout: 0 lvb_type: 0
[Wed Feb 9 10:13:31 2022] LustreError: 116830:0:(ldlm_request.c:123:ldlm_expired_completion_wait()) Skipped 47 previous similar messages
[Wed Feb 9 10:13:31 2022] LustreError: dumping log to /tmp/lustre-log.1644401612.116786
[Wed Feb 9 10:14:36 2022] LustreError: 131848:0:(ldlm_request.c:123:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1644401377, 300s ago); not entering recovery in server code, just going back to sleep ns: filter-lustre-OST0006_UUID lock: 000000008894b201/0xca050e0025d0be8e lrc: 3/0,1 mode: --/PW res: [0xdf5:0x0:0x0].0x0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) gid 0 flags: 0x40010080000000 nid: local remote: 0x0 expref: -99 pid: 131848 timeout: 0 lvb_type: 0
[Wed Feb 9 10:14:36 2022] LustreError: 131848:0:(ldlm_request.c:123:ldlm_expired_completion_wait()) Skipped 25 previous similar messages
[Wed Feb 9 10:15:09 2022] LustreError: 131858:0:(ldlm_request.c:123:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1644401380, 330s ago); not entering recovery in server code, just going back to sleep ns: filter-lustre-OST0006_UUID lock: 000000002e1d934b/0xca050e0025d0be9c lrc: 3/0,1 mode: --/PW res: [0xdf2:0x0:0x0].0x0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) gid 0 flags: 0x40010080000000 nid: local remote: 0x0 expref: -99 pid: 131858 timeout: 0 lvb_type: 0
[Wed Feb 9 10:15:09 2022] LustreError: 131858:0:(ldlm_request.c:123:ldlm_expired_completion_wait()) Skipped 3 previous similar messages
[Wed Feb 9 10:19:18 2022] Lustre: 131861:0:(service.c:1436:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply req@00000000a56db3f7 x1724276360899264/t0(0) o6->lustre-MDT0000-mdtlov_UUID@10.240.22.123@tcp:639/0 lens 544/432 e 20 to 0 dl 1644401964 ref 2 fl Interpret:/0/0 rc 0/0 job:'osp-syn-2-0.0'
[Wed Feb 9 10:19:23 2022] Lustre: 131861:0:(service.c:1436:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply req@00000000ee42b4b0 x1724276360903360/t0(0) o6->lustre-MDT0000-mdtlov_UUID@10.240.22.123@tcp:644/0 lens 544/0 e 20 to 0 dl 1644401969 ref 2 fl New:/0/ffffffff rc 0/-1 job:'osp-syn-4-0.0'
[Wed Feb 9 10:19:23 2022] Lustre: 131861:0:(service.c:1436:ptlrpc_at_send_early_reply()) Skipped 15 previous similar messages
[Wed Feb 9 10:19:24 2022] Lustre: lustre-OST0001: Client lustre-MDT0000-mdtlov_UUID (at 10.240.22.123@tcp) reconnecting
[Wed Feb 9 10:19:29 2022] Lustre: lustre-OST0003: Client lustre-MDT0000-mdtlov_UUID (at 10.240.22.123@tcp) reconnecting

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
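The console log above also notes a binary debug dump being written on the OSS (/tmp/lustre-log.1644401612.116786). If that dump was collected with the test artifacts, it can be converted to text with lctl debug_file; a minimal sketch (the output path and grep pattern are arbitrary):

# Convert the binary Lustre debug dump named in the console log into readable text
lctl debug_file /tmp/lustre-log.1644401612.116786 /tmp/lustre-log.1644401612.txt
# Look for the stuck ldlm enqueue / OST object destroy activity seen in the stack traces
grep -E 'ldlm_|ofd_destroy' /tmp/lustre-log.1644401612.txt | less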
| Comments |
| Comment by Andreas Dilger [ 19/Mar/22 ] |
|
If this is the only regular replay-dual failure, test_30 should be added to ALWAYS_EXCEPT and this test script added to an enforced session. |
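A minimal sketch of what that skip would look like in lustre/tests/replay-dual.sh, assuming the usual exception-list pattern in that script (the exact variable contents there may differ; this is not a proposed patch):

# lustre/tests/replay-dual.sh: skip test_30 until LU-15657 is resolved
ALWAYS_EXCEPT="$REPLAY_DUAL_EXCEPT 30"
# the standard call that applies the exception list to the test run
build_test_filter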
| Comment by James Nunez (Inactive) [ 20/Apr/22 ] |
|
We're seeing the same issue or something very similar for CentOS 8.5 and ZFS at https://testing.whamcloud.com/test_sets/dfd0a347-d14e-4393-84a7-997380b92ba7. |
| Comment by Andreas Dilger [ 20/Apr/22 ] |
|
It looks like this problem was first hit on 2021-12-23 (it has failed in about 1 in 32 test runs since then), and there were a bunch of patches landed on that day:

# git log --oneline master --after 2021-12-22 --before 2021-12-24
3c0b60fbb726 LU-15358 tests: Variable incorrectly defined under sanityn
9348b9a1e061 LU-15360 tests: Use saved value on EXIT/Restore
9b0b7264a8d7 LU-15220 utils: fix gcc-11 -Werror=format-truncation= error
63b5812e4e3b LU-15356 tests: get rid of extra spaces in PERM_CMD
09ca33738a39 LU-14776 ldiskfs: support Ubuntu 5.8.0-63
25606a2ce19e LU-15342 tests: escape "|"
dc015fc0b51b LU-15339 tests: Increase timeout in sanity 208
1ee894a4355e LU-15338 tests: check whole jobid in sanity 205a
0381fa5356e2 LU-15331 kernel: kernel update SLES15 SP2 [5.3.18-24.96.1]
878561880d2a LU-15244 llite: set ra_pages of backing_dev_info with 0
a6b39ad3486b LU-15293 build: Add build support for arm64 centos8
2968a40a163a LU-12678 o2iblnd: convert ibp_refcount to a kref
852a4b264a98 LU-15234 lnet: Race on discovery queue
d6a3b0529d7d LU-15279 ptlrpc: use a cached value
ca7fdbb15e93 LU-15175 tests: fix ldev tests
2c787065441e LU-15095 target: lbug_on_grant_miscount module parameter
28769c65987c LU-15195 ofd: missing OST object
ee9a03d8308c LU-15137 socklnd: expect two control connections maximum
b53438df70eb LU-15109 tests: different quota and usage relations
1097f7ca61a1 LU-9516 tests: fix sanity test_24v
fd8d4485062b LU-14965 ldiskfs: rhel7.6 inode mutex for ldiskfs_orphan_add
75414af6bf31 LU-13717 sec: fix handling of encrypted file with long name
1711e26ae861 LU-15009 ofd: continue precreate if LAST_ID is less on MDT
e42d2d67d3a0 LU-14677 sec: remove MIGRATION_ compatibility defines
fdbf2ffd41fa LU-14677 sec: no encryption key migrate/extend/resync/split
b0f150eba4c2 LU-13783 osd-ldiskfs: use alloc_file_pseudo to create fake files
2f50e9a386f2 LU-14704 tests: disable opencache for sanity/29
cb3b2bb683ce LU-11388 test: enable replay-single test_131b
d083c93c6fd9 LU-15357 mdd: fix changelog context leak
c8b7afe49704 LU-15252 mdt: reduce contention at mdt_lsom_update
2da8f7adbe4a LU-7372 tests: re-enable replay-dual test_26
41c813d14ec9 LU-15262 osd: bio_integrity_prep_fn return value processing

One distinct possibility is the "re-enable replay-dual test_26" patch (2da8f7adbe4a, LU-7372). |
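One way to test that suspicion would be to revert the test_26 re-enable commit on a scratch branch and re-run the affected subtests many times (a ~1/32 failure rate needs a lot of iterations). A rough sketch, assuming an already-configured Lustre test environment; the iteration count and test selection are arbitrary:

# Revert the suspect patch (hash taken from the git log above) on a test branch
git revert --no-edit 2da8f7adbe4a    # LU-7372 tests: re-enable replay-dual test_26
# Repeatedly run only the tests of interest from the replay-dual suite
for i in $(seq 1 64); do
    ONLY="26 30" bash lustre/tests/replay-dual.sh || break
done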