[LU-15809] replay-dual test_29: timeout llog_verify_record() lustre-MDT0000-osp-MDT0001: record is too large: 0 > 32768 Created: 29/Apr/22  Updated: 14/Jun/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16066 replay-dual test_30: timeout, cannot ... Open
is related to LU-16539 replay-dual test_32: Timeout occurred... Open
is related to LU-15139 sanity test_160h: dt_record_write() A... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/9c32c9af-c574-4023-9286-07091f92769c

test_29 failed with the following error:

Started lustre-MDT0001

Timeout occurred after 135 mins, last suite running was replay-dual

It looks like the MDS is having trouble reading the recovery llog and is stuck retrying it forever, logging "retry remote llog process":

[Mon Dec 27 22:49:21 2021] LustreError: 113045:0:(llog.c:472:llog_verify_record()) lustre-MDT0000-osp-MDT0001: record is too large: 0 > 32768
[Mon Dec 27 22:49:21 2021] LustreError: 113045:0:(llog.c:656:llog_process_thread()) lustre-MDT0000-osp-MDT0001: invalid record in llog [0x2:0x11d41:0x2] record for index 0/2: rc = -22
[Mon Dec 27 22:49:21 2021] LustreError: 113045:0:(llog.c:482:llog_verify_record()) lustre-MDT0000-osp-MDT0001: magic 0 is bad
[Mon Dec 27 22:49:21 2021] LustreError: 113045:0:(llog.c:781:llog_process_thread()) lustre-MDT0000-osp-MDT0001 retry remote llog process
[Mon Dec 27 22:49:22 2021] Lustre: lustre-MDT0001: in recovery but waiting for the first client to connect
[Mon Dec 27 22:49:22 2021] LustreError: 113045:0:(llog.c:472:llog_verify_record()) lustre-MDT0000-osp-MDT0001: record is too large: 400547 > 32768
[Mon Dec 27 22:49:22 2021] LustreError: 113045:0:(llog.c:472:llog_verify_record()) Skipped 205 previous similar messages
[Mon Dec 27 22:49:22 2021] LustreError: 113045:0:(llog.c:656:llog_process_thread()) lustre-MDT0000-osp-MDT0001: invalid record in llog [0x2:0x11d41:0x2] record for index 96/0: rc = -22
[Mon Dec 27 22:49:22 2021] LustreError: 113045:0:(llog.c:656:llog_process_thread()) Skipped 309 previous similar messages
:
:
[Mon Dec 27 23:36:25 2021] LustreError: 113045:0:(llog.c:482:llog_verify_record()) lustre-MDT0000-osp-MDT0001: magic 0 is bad
[Mon Dec 27 23:36:25 2021] LustreError: 113045:0:(llog.c:482:llog_verify_record()) Skipped 129784 previous similar messages
[Mon Dec 27 23:36:25 2021] LustreError: 113045:0:(llog.c:781:llog_process_thread()) lustre-MDT0000-osp-MDT0001 retry remote llog process
[Mon Dec 27 23:36:25 2021] LustreError: 113045:0:(llog.c:781:llog_process_thread()) Skipped 32445 previous similar messages
[Mon Dec 27 23:36:29 2021] Lustre: 113052:0:(ldlm_lib.c:1962:extend_recovery_timer()) lustre-MDT0001: extended recovery timer reached hard limit: 180, extend: 1
[Mon Dec 27 23:36:29 2021] Lustre: 113052:0:(ldlm_lib.c:1962:extend_recovery_timer()) Skipped 29 previous similar messages
[Mon Dec 27 23:46:25 2021] LustreError: 113045:0:(llog.c:472:llog_verify_record()) lustre-MDT0000-osp-MDT0001: record is too large: 0 > 32768
[Mon Dec 27 23:46:25 2021] LustreError: 113045:0:(llog.c:472:llog_verify_record()) Skipped 258999 previous similar messages
[Mon Dec 27 23:46:25 2021] LustreError: 113045:0:(llog.c:656:llog_process_thread()) lustre-MDT0000-osp-MDT0001: invalid record in llog [0x2:0x11d41:0x2] record for index 0/0: rc = -22
[Mon Dec 27 23:46:25 2021] LustreError: 113045:0:(llog.c:656:llog_process_thread()) Skipped 388499 previous similar messages
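
For context, the two repeating messages come from the sanity checks in llog_verify_record() (llog.c:472 and llog.c:482). Judging from the messages, the record length must be non-zero and no larger than the llog chunk size (32768 here), and the record type must carry the expected llog magic; in these runs the header reads back with length 0 and magic 0. Below is a minimal, self-contained sketch of that validation logic; the struct, field, and constant names are simplified stand-ins for illustration, not the actual Lustre definitions.

/*
 * Minimal, self-contained sketch of the checks implied by the messages
 * above (llog.c:472 and llog.c:482).  The struct, field and constant
 * names are simplified stand-ins, not the actual Lustre definitions.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_CHUNK_SIZE 32768U       /* chunk size seen in "> 32768" */
#define SKETCH_OP_MAGIC   0x10600000U  /* assumed magic carried in the type */
#define SKETCH_OP_MASK    0xfff00000U

struct sketch_rec_hdr {                /* stand-in for the llog record header */
	uint32_t lrh_len;              /* total record length */
	uint32_t lrh_index;            /* record index in the llog */
	uint32_t lrh_type;             /* opcode; top bits carry the magic */
};

/* Reject zero/oversized lengths and headers without the expected magic. */
static int sketch_verify_record(const struct sketch_rec_hdr *rec)
{
	if (rec->lrh_len == 0 || rec->lrh_len > SKETCH_CHUNK_SIZE) {
		fprintf(stderr, "record is too large: %u > %u\n",
			(unsigned int)rec->lrh_len, SKETCH_CHUNK_SIZE);
		return -EINVAL;
	}
	if ((rec->lrh_type & SKETCH_OP_MASK) != SKETCH_OP_MAGIC) {
		fprintf(stderr, "magic %#x is bad\n",
			(unsigned int)rec->lrh_type);
		return -EINVAL;
	}
	return 0;
}

int main(void)
{
	/* An all-zero header, as read back in the failing runs, trips the
	 * length check first ("0 > 32768") and could never pass the magic
	 * check either, so the caller gets -EINVAL (-22) every time. */
	struct sketch_rec_hdr zeroed = { 0, 0, 0 };

	return sketch_verify_record(&zeroed) ? 1 : 0;
}

A record like this can never validate, so llog_process_thread() returns -EINVAL and the "retry remote llog process" path just tries again, which is why the thread spins until the 135-minute timeout.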

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
replay-dual test_29 - Timeout occurred after 135 mins, last suite running was replay-dual



 Comments   
Comment by Alex Zhuravlev [ 30/Apr/22 ]

This is likely a dup of LU-15139.

Comment by Åke Sandgren [ 06/Oct/22 ]

I'm getting this on a production system, but I don't see any traces of what's described in LU-15139.

Running DDN ExaScaler 5.2.5 with Lustre 2.12.8_ddn9.

Comment by Stephane Thiell [ 29/Mar/23 ]

We hit the same issue today with 2.12.8 plus patches (close to 2.12.9) when failing over MDT0001 to another server.

Symptoms:

Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:461:llog_verify_record()) oak-MDT0000-osp-MDT0001: record is too large: 0 > 32768
Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:461:llog_verify_record()) Skipped 409732 previous similar messages
Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:660:llog_process_thread()) oak-MDT0000-osp-MDT0001: invalid record in llog [0x1:0x181e9:0x2] record for index 0/6: rc = -22
Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:660:llog_process_thread()) Skipped 409722 previous similar messages
Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:785:llog_process_thread()) oak-MDT0000-osp-MDT0001 retry remote llog process
Mar 29 08:42:55 oak-md1-s2 kernel: LustreError: 9824:0:(llog.c:785:llog_process_thread()) Skipped 409721 previous similar messages

And a thread named lod0001_rec0000 was very active. Live backtrace for this thread:

crash> bt 9824
PID: 9824   TASK: ffff8c1c61d16300  CPU: 19  COMMAND: "lod0001_rec0000"
(active)
crash> bt 9824
PID: 9824   TASK: ffff8c1c61d16300  CPU: 19  COMMAND: "lod0001_rec0000"
 #0 [ffff8c1bbc7c75f0] __schedule at ffffffff9e1b78d8
 #1 [ffff8c1bbc7c7650] ptlrpc_check_set at ffffffffc162afb8 [ptlrpc]
 #2 [ffff8c1bbc7c76f8] get_page_from_freelist at ffffffff9dbd34df
 #3 [ffff8c1bbc7c7810] __alloc_pages_nodemask at ffffffff9dbd3dc4
 #4 [ffff8c1bbc7c7940] alloc_pages_current at ffffffff9dc256c8
 #5 [ffff8c1bbc7c7988] kmalloc_order at ffffffff9dbf1688
 #6 [ffff8c1bbc7c79f8] llog_process_thread at ffffffffc135e1ac [obdclass]
 #7 [ffff8c1bbc7c7b38] llog_process_or_fork at ffffffffc135fca9 [obdclass]
 #8 [ffff8c1bbc7c7ba8] llog_cat_process_cb at ffffffffc1365419 [obdclass]
 #9 [ffff8c1bbc7c7c00] llog_process_thread at ffffffffc135eaba [obdclass]
#10 [ffff8c1bbc7c7d18] llog_process_or_fork at ffffffffc135fca9 [obdclass]
#11 [ffff8c1bbc7c7d88] llog_cat_process_or_fork at ffffffffc1361cb9 [obdclass]
#12 [ffff8c1bbc7c7e08] llog_cat_process at ffffffffc1361e6e [obdclass]
#13 [ffff8c1bbc7c7e28] lod_sub_recovery_thread at ffffffffc1299b49 [lod]
#14 [ffff8c1bbc7c7ec8] kthread at ffffffff9dacb511
#15 [ffff8c1bbc7c7f50] ret_from_fork_nospec_begin at ffffffff9e1c51dd
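
The nesting in this backtrace (lod_sub_recovery_thread() → llog_cat_process() → llog_process_thread() on the catalog → llog_cat_process_cb() → llog_process_thread() on a plain log) shows the thread stuck while processing one plain llog referenced from the catalog. Below is a rough, standalone sketch of that shape, with invented names (none of them are real Lustre functions), just to show how a permanently invalid record turns into an endless "retry remote llog process" loop.

/*
 * Rough illustration of the control flow in the backtrace above: the
 * recovery thread walks a catalog llog and processes the plain llogs it
 * references; a record that never validates keeps the retry path busy.
 * All names are invented for illustration only.
 */
#include <errno.h>
#include <stdio.h>

/* Stands in for processing one plain llog referenced from the catalog. */
static int process_plain_llog(void)
{
	/* Every record header reads back zeroed, so verification fails:
	 * "invalid record in llog ... rc = -22". */
	return -EINVAL;
}

/* Stands in for the catalog walk done by the lod*_rec* recovery thread. */
static void catalog_recovery_walk(void)
{
	int attempts = 0;
	int rc;

	do {
		rc = process_plain_llog();
		if (rc)
			fprintf(stderr, "retry remote llog process\n");
	} while (rc && ++attempts < 5);	/* the real thread has no such bound,
					 * so it spins until abort_recovery */
}

int main(void)
{
	catalog_recovery_walk();	/* like lod0001_rec0000 above */
	return 0;
}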

The recovery of MDT1 could not complete.

[root@oak-md1-s2 ~]# cat /proc/fs/lustre/mdt/oak-MDT0001/recovery_status 
status: WAITING
non-ready MDTs:  0000
recovery_start: 677513
time_waited: 1679427391

As a workaround, we issued a manual abort_recovery, which succeeded and stopped the lod0001_rec0000 thread:

lctl --device oak-MDT0001 abort_recovery
Mar 29 08:55:39 oak-md1-s2 kernel: Lustre: oak-MDT0001: Recovery over after 16:16, of 1885 clients 1122 recovered and 763 were evicted.

Comment by Andreas Dilger [ 13/Jun/23 ]

Failed 13x on master over the past week.
