[LU-15809] replay-dual test_29: timeout llog_verify_record() lustre-MDT0000-osp-MDT0001: record is too large: 0 > 32768 Created: 29/Apr/22 Updated: 14/Jun/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run:
https://testing.whamcloud.com/test_sets/9c32c9af-c574-4023-9286-07091f92769c

test_29 failed with the following error:

Started lustre-MDT0001
Timeout occurred after 135 mins, last suite running was replay-dual

It looks like the MDS is having trouble reading the recovery llog and is stuck doing this forever with "retry remote llog process":

[Mon Dec 27 22:49:21 2021] LustreError: 113045:0:(llog.c:472:llog_verify_record()) lustre-MDT0000-osp-MDT0001: record is too large: 0 > 32768
[Mon Dec 27 22:49:21 2021] LustreError: 113045:0:(llog.c:656:llog_process_thread()) lustre-MDT0000-osp-MDT0001: invalid record in llog [0x2:0x11d41:0x2] record for index 0/2: rc = -22
[Mon Dec 27 22:49:21 2021] LustreError: 113045:0:(llog.c:482:llog_verify_record()) lustre-MDT0000-osp-MDT0001: magic 0 is bad
[Mon Dec 27 22:49:21 2021] LustreError: 113045:0:(llog.c:781:llog_process_thread()) lustre-MDT0000-osp-MDT0001 retry remote llog process
[Mon Dec 27 22:49:22 2021] Lustre: lustre-MDT0001: in recovery but waiting for the first client to connect
[Mon Dec 27 22:49:22 2021] LustreError: 113045:0:(llog.c:472:llog_verify_record()) lustre-MDT0000-osp-MDT0001: record is too large: 400547 > 32768
[Mon Dec 27 22:49:22 2021] LustreError: 113045:0:(llog.c:472:llog_verify_record()) Skipped 205 previous similar messages
[Mon Dec 27 22:49:22 2021] LustreError: 113045:0:(llog.c:656:llog_process_thread()) lustre-MDT0000-osp-MDT0001: invalid record in llog [0x2:0x11d41:0x2] record for index 96/0: rc = -22
[Mon Dec 27 22:49:22 2021] LustreError: 113045:0:(llog.c:656:llog_process_thread()) Skipped 309 previous similar messages
:
:
[Mon Dec 27 23:36:25 2021] LustreError: 113045:0:(llog.c:482:llog_verify_record()) lustre-MDT0000-osp-MDT0001: magic 0 is bad
[Mon Dec 27 23:36:25 2021] LustreError: 113045:0:(llog.c:482:llog_verify_record()) Skipped 129784 previous similar messages
[Mon Dec 27 23:36:25 2021] LustreError: 113045:0:(llog.c:781:llog_process_thread()) lustre-MDT0000-osp-MDT0001 retry remote llog process
[Mon Dec 27 23:36:25 2021] LustreError: 113045:0:(llog.c:781:llog_process_thread()) Skipped 32445 previous similar messages
[Mon Dec 27 23:36:29 2021] Lustre: 113052:0:(ldlm_lib.c:1962:extend_recovery_timer()) lustre-MDT0001: extended recovery timer reached hard limit: 180, extend: 1
[Mon Dec 27 23:36:29 2021] Lustre: 113052:0:(ldlm_lib.c:1962:extend_recovery_timer()) Skipped 29 previous similar messages
[Mon Dec 27 23:46:25 2021] LustreError: 113045:0:(llog.c:472:llog_verify_record()) lustre-MDT0000-osp-MDT0001: record is too large: 0 > 32768
[Mon Dec 27 23:46:25 2021] LustreError: 113045:0:(llog.c:472:llog_verify_record()) Skipped 258999 previous similar messages
[Mon Dec 27 23:46:25 2021] LustreError: 113045:0:(llog.c:656:llog_process_thread()) lustre-MDT0000-osp-MDT0001: invalid record in llog [0x2:0x11d41:0x2] record for index 0/0: rc = -22
[Mon Dec 27 23:46:25 2021] LustreError: 113045:0:(llog.c:656:llog_process_thread()) Skipped 388499 previous similar messages
|
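For context, the checks that fire here reject a record whose stored length is zero or larger than the 32768-byte llog chunk, or whose magic is zero, and the caller then treats the resulting -EINVAL on the remote llog as retryable. The sketch below is a minimal, self-contained C illustration of that kind of validation; the struct layout, field names, constants, and helper names are assumptions made for the example, not the actual Lustre definitions in llog.c.

```c
/*
 * Illustrative sketch only (not Lustre code): a record-header sanity check
 * of the kind the log messages above suggest llog_verify_record() performs.
 * Only the rejected conditions ("length 0", "length > chunk size",
 * "magic 0") are taken from the log output; everything else is assumed.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_CHUNK_SIZE 32768      /* matches the "> 32768" in the logs */
#define DEMO_LLOG_MAGIC 0x10645539u /* placeholder magic, not Lustre's */

struct demo_rec_hdr {
	uint32_t len;    /* total record length, including header */
	uint32_t index;  /* record index inside the llog chunk */
	uint32_t magic;  /* expected to match a known magic value */
};

/* Return 0 if the record header looks sane, -EINVAL otherwise. */
static int demo_verify_record(const struct demo_rec_hdr *rec,
			      uint32_t chunk_size)
{
	if (rec->len == 0 || rec->len > chunk_size) {
		/* With len == 0 this prints "0 > 32768", as in the ticket. */
		fprintf(stderr, "record is too large: %u > %u\n",
			rec->len, chunk_size);
		return -EINVAL;
	}
	if (rec->magic != DEMO_LLOG_MAGIC) {
		fprintf(stderr, "magic %u is bad\n", rec->magic);
		return -EINVAL;
	}
	return 0;
}

int main(void)
{
	/* A zeroed header, like the corrupt record behind the messages above. */
	struct demo_rec_hdr corrupt = { 0 };

	printf("verify rc = %d\n", demo_verify_record(&corrupt, DEMO_CHUNK_SIZE));
	return 0;
}
```

Because the same record fails verification on every pass, the retry loop never makes progress, which matches the "Skipped N previous similar messages" counts climbing into the hundreds of thousands above.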
| Comments |
| Comment by Alex Zhuravlev [ 30/Apr/22 ] |
|
this is likely a dup of |
| Comment by Åke Sandgren [ 06/Oct/22 ] |
|
I'm getting this on a production system, but I don't see any traces of what's described in
Running DDN ExaScaler 5.2.5 with Lustre 2.12.8_ddn9. |
| Comment by Stephane Thiell [ 29/Mar/23 ] |
|
We hit the same issue today on 2.12.8 with patches (close to 2.12.9) while failing over MDT0001 to another server. Symptoms:

Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:461:llog_verify_record()) oak-MDT0000-osp-MDT0001: record is too large: 0 > 32768
Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:461:llog_verify_record()) Skipped 409732 previous similar messages
Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:660:llog_process_thread()) oak-MDT0000-osp-MDT0001: invalid record in llog [0x1:0x181e9:0x2] record for index 0/6: rc = -22
Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:660:llog_process_thread()) Skipped 409722 previous similar messages
Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:785:llog_process_thread()) oak-MDT0000-osp-MDT0001 retry remote llog process
Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:785:llog_process_thread()) Skipped 409721 previous similar messages

A thread named lod0001_rec0000 was very active. Live backtrace for this thread:

crash> bt 9824
PID: 9824  TASK: ffff8c1c61d16300  CPU: 19  COMMAND: "lod0001_rec0000" (active)
 #0 [ffff8c1bbc7c75f0] __schedule at ffffffff9e1b78d8
 #1 [ffff8c1bbc7c7650] ptlrpc_check_set at ffffffffc162afb8 [ptlrpc]
 #2 [ffff8c1bbc7c76f8] get_page_from_freelist at ffffffff9dbd34df
 #3 [ffff8c1bbc7c7810] __alloc_pages_nodemask at ffffffff9dbd3dc4
 #4 [ffff8c1bbc7c7940] alloc_pages_current at ffffffff9dc256c8
 #5 [ffff8c1bbc7c7988] kmalloc_order at ffffffff9dbf1688
 #6 [ffff8c1bbc7c79f8] llog_process_thread at ffffffffc135e1ac [obdclass]
 #7 [ffff8c1bbc7c7b38] llog_process_or_fork at ffffffffc135fca9 [obdclass]
 #8 [ffff8c1bbc7c7ba8] llog_cat_process_cb at ffffffffc1365419 [obdclass]
 #9 [ffff8c1bbc7c7c00] llog_process_thread at ffffffffc135eaba [obdclass]
#10 [ffff8c1bbc7c7d18] llog_process_or_fork at ffffffffc135fca9 [obdclass]
#11 [ffff8c1bbc7c7d88] llog_cat_process_or_fork at ffffffffc1361cb9 [obdclass]
#12 [ffff8c1bbc7c7e08] llog_cat_process at ffffffffc1361e6e [obdclass]
#13 [ffff8c1bbc7c7e28] lod_sub_recovery_thread at ffffffffc1299b49 [lod]
#14 [ffff8c1bbc7c7ec8] kthread at ffffffff9dacb511
#15 [ffff8c1bbc7c7f50] ret_from_fork_nospec_begin at ffffffff9e1c51dd

The recovery of MDT0001 could not complete:

[root@oak-md1-s2 ~]# cat /proc/fs/lustre/mdt/oak-MDT0001/recovery_status
status: WAITING
non-ready MDTs: 0000
recovery_start: 677513
time_waited: 1679427391

As a workaround, we used a manual abort_recovery, which succeeded and stopped the lod0001_rec0000 thread:

lctl --device oak-MDT0001 abort_recovery

Mar 29 08:55:39 oak-md1-s2 kernel: Lustre: oak-MDT0001: Recovery over after 16:16, of 1885 clients 1122 recovered and 763 were evicted.
|
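The backtrace is consistent with the description above: llog_cat_process() walks the catalog, llog_cat_process_cb() hands each plain llog to llog_process_thread(), and when the plain llog fails it appears to be retried rather than abandoned, so lod0001_rec0000 stays busy until recovery is aborted. Below is a toy C model of that retry behaviour, not Lustre code; all names are hypothetical, and the retry cap exists only so the demo terminates.

```c
/*
 * Toy model (not Lustre code) of the recovery-thread behaviour visible in
 * the backtrace above: a catalog callback keeps re-processing a plain llog
 * that always fails on the same corrupt record.
 */
#include <stdbool.h>
#include <stdio.h>

/* Pretend to process one plain llog; the corrupt one always fails. */
static int demo_process_plain_llog(bool corrupt)
{
	return corrupt ? -22 /* -EINVAL, as in the logs */ : 0;
}

/* Catalog callback: retries the plain-llog pass while it keeps failing. */
static int demo_cat_process_cb(bool corrupt, unsigned long *passes)
{
	int rc;

	do {
		rc = demo_process_plain_llog(corrupt);
		(*passes)++;
	} while (rc == -22 && *passes < 5 /* cap added only for this demo */);

	return rc;
}

int main(void)
{
	unsigned long passes = 0;
	int rc = demo_cat_process_cb(true, &passes);

	/* On the affected server there is no such cap, hence the hung thread. */
	printf("gave up after %lu passes, rc = %d\n", passes, rc);
	return 0;
}
```

In the demo the loop gives up after five passes; on the affected server the equivalent retries continued for the whole recovery window until `lctl --device oak-MDT0001 abort_recovery` stopped the thread.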
| Comment by Andreas Dilger [ 13/Jun/23 ] |
|
Failed 13x on master over the past week. |