Details
-
Bug
-
Resolution: Duplicate
-
Minor
Description
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1ff8d9a5-c3da-4835-8739-9f790d3c2491
test_1 crashed on the MDS with the following error:
onyx-44vm9 crashed during runtests test_1
LustreError: 138526:0:(llog_cat.c:1162:llog_cat_cleanup()) ASSERTION( index ) failed:
LustreError: 138526:0:(llog_cat.c:1162:llog_cat_cleanup()) LBUG
Pid: 138526, comm: lod0001_rec0000 4.18.0-240.22.1.el8_lustre.x86_64 #1 SMP Fri Jul 30 19:47:15 UTC 2021
header Call Trace TBD:
libcfs_call_trace+0x6f/0x90 [libcfs]
lbug_with_loc+0x43/0x80 [libcfs]
llog_cat_cleanup+0x391/0x3d0 [obdclass]
llog_cat_close+0x193/0x210 [obdclass]
lod_sub_recovery_th6+0x1e3/0xb40 [lod]
kthread+0x112/0x130
LustreError: 143361:0:(llog.c:1149:llog_write_rec()) lustre-MDT0000-osp-MDT0001: loghandle 0000000062d00541 with no
LustreError: 143361:0:(llog_cat.c:602:llog_cat_add_rec()) llog_write_rec -71: lh=0000000062d00541
LustreError: 143361:0:(update_trans.c:1062:top_trans_stop()) lustre-MDT0000-osp-MDT0001: write updates failed: rc = -71
A second test had a similar MDS crash with a slightly different stack:
https://testing.whamcloud.com/test_sets/366c2ba7-795e-4856-b4c4-9f2cce973618
general protection fault: 0000 [#1] SMP PTI
CPU: 0 PID: 139728 Comm: mdt00_002 4.18.0-240.22.1.el8_lustre.x86_64 #1
RIP: 0010:__list_add_valid+0x10/0x50
Call Trace:
llog_cat_prep_log+0x311/0x3c0 [obdclass]
llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
llog_declare_add+0x187/0x1d0 [obdclass]
top_trans_start+0x212/0x940 [ptlrpc]
mdd_unlink+0x4a0/0xb30 [mdd]
mdt_reint_unlink+0xb0c/0x12a0 [mdt]
mdt_reint_rec+0x11f/0x250 [mdt]
mdt_reint_internal+0x498/0x780 [mdt]
mdt_reint+0x5e/0x100 [mdt]
tgt_request_handle+0xc90/0x1940 [ptlrpc]
ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
ptlrpc_main+0xba2/0x1490 [ptlrpc]
A third test crashed the MDS with a different operation, but also in llog list handling:
https://testing.whamcloud.com/test_sets/b7099363-3b2c-4b7a-ad54-795ca4541ddc
general protection fault: 0000 [#1] SMP PTI
CPU: 0 PID: 138567 Comm: mdt00_002 4.18.0-240.22.1.el8_lustre.x86_64 #1
RIP: 0010:__list_add_valid+0x10/0x50
Call Trace:
llog_cat_prep_log+0x311/0x3c0 [obdclass]
llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
llog_declare_add+0x187/0x1d0 [obdclass]
top_trans_start+0x212/0x940 [ptlrpc]
mdd_create+0xb42/0x1870 [mdd]
mdt_create+0x7a7/0xc20 [mdt]
mdt_reint_create+0x30b/0x3c0 [mdt]
mdt_reint_rec+0x11f/0x250 [mdt]
mdt_reint_internal+0x498/0x780 [mdt]
mdt_reint+0x5e/0x100 [mdt]
tgt_request_handle+0xc90/0x1940 [ptlrpc]
ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
ptlrpc_main+0xba2/0x1490 [ptlrpc]
A search back through the Maloo crashes of runtests to the start of the year suggests that runtests started failing with this ASSERTION on 2021-07-31 (there are other, unrelated crashes in runtests over that period, caused by bugs in under-development patches).