[LU-14932] runtests: test_1 llog_cat_cleanup()) ASSERTION( index ) on MDS Created: 11/Aug/21 Updated: 15/Dec/22 Resolved: 14/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1ff8d9a5-c3da-4835-8739-9f790d3c2491

test_1 crashed on the MDS with the following error:

onyx-44vm9 crashed during runtests test_1

LustreError: 138526:0:(llog_cat.c:1162:llog_cat_cleanup()) ASSERTION( index ) failed:
LustreError: 138526:0:(llog_cat.c:1162:llog_cat_cleanup()) LBUG
Pid: 138526, comm: lod0001_rec0000 4.18.0-240.22.1.el8_lustre.x86_64 #1 SMP Fri Jul 30 19:47:15 UTC 2021
header Call Trace TBD:
libcfs_call_trace+0x6f/0x90 [libcfs]
lbug_with_loc+0x43/0x80 [libcfs]
llog_cat_cleanup+0x391/0x3d0 [obdclass]
llog_cat_close+0x193/0x210 [obdclass]
lod_sub_recovery_th6+0x1e3/0xb40 [lod]
kthread+0x112/0x130
LustreError: 143361:0:(llog.c:1149:llog_write_rec()) lustre-MDT0000-osp-MDT0001: loghandle 0000000062d00541 with no
LustreError: 143361:0:(llog_cat.c:602:llog_cat_add_rec()) llog_write_rec -71: lh=0000000062d00541
LustreError: 143361:0:(update_trans.c:1062:top_trans_stop()) lustre-MDT0000-osp-MDT0001: write updates failed: rc = -71

A second test had a similar MDS crash with a slightly different stack:

general protection fault: 0000 [#1] SMP PTI
CPU: 0 PID: 139728 Comm: mdt00_002 4.18.0-240.22.1.el8_lustre.x86_64 #1
RIP: 0010:__list_add_valid+0x10/0x50
Call Trace:
llog_cat_prep_log+0x311/0x3c0 [obdclass]
llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
llog_declare_add+0x187/0x1d0 [obdclass]
top_trans_start+0x212/0x940 [ptlrpc]
mdd_unlink+0x4a0/0xb30 [mdd]
mdt_reint_unlink+0xb0c/0x12a0 [mdt]
mdt_reint_rec+0x11f/0x250 [mdt]
mdt_reint_internal+0x498/0x780 [mdt]
mdt_reint+0x5e/0x100 [mdt]
tgt_request_handle+0xc90/0x1940 [ptlrpc]
ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
ptlrpc_main+0xba2/0x1490 [ptlrpc]

A third test crashed the MDS with a different operation, but also in llog list handling:

general protection fault: 0000 [#1] SMP PTI
CPU: 0 PID: 138567 Comm: mdt00_002 4.18.0-240.22.1.el8_lustre.x86_64 #1
RIP: 0010:__list_add_valid+0x10/0x50
Call Trace:
llog_cat_prep_log+0x311/0x3c0 [obdclass]
llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
llog_declare_add+0x187/0x1d0 [obdclass]
top_trans_start+0x212/0x940 [ptlrpc]
mdd_create+0xb42/0x1870 [mdd]
mdt_create+0x7a7/0xc20 [mdt]
mdt_reint_create+0x30b/0x3c0 [mdt]
mdt_reint_rec+0x11f/0x250 [mdt]
mdt_reint_internal+0x498/0x780 [mdt]
mdt_reint+0x5e/0x100 [mdt]
tgt_request_handle+0xc90/0x1940 [ptlrpc]
ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
ptlrpc_main+0xba2/0x1490 [ptlrpc]

Searching back through the Maloo crashes of runtests to the start of the year, it appears this started failing with this ASSERTION on 2021-07-31 (though there are other, unrelated crashes in runtests due to bugs in under-development patches). |
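When searching many console logs for the same failure, it can help to reduce each LustreError line to its source location. The following is a minimal sketch, assuming the `PID:0:(file:line:function())` prefix seen in the lines quoted above; the regular expression and the helper name `parse_lustre_error` are illustrative and are not part of Lustre or Maloo.

```python
import re
from collections import Counter

# Layout inferred from the console lines quoted in this ticket (an assumption,
# not a documented format): "LustreError: PID:N:(file:line:func()) message"
LUSTRE_ERROR_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<msg>.*)"
)


def parse_lustre_error(text):
    """Return the fields of one LustreError line, or None if it does not match."""
    m = LUSTRE_ERROR_RE.search(text)
    if m is None:
        return None
    fields = m.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["line"] = int(fields["line"])
    return fields


console = [
    "LustreError: 138526:0:(llog_cat.c:1162:llog_cat_cleanup()) ASSERTION( index ) failed:",
    "LustreError: 138526:0:(llog_cat.c:1162:llog_cat_cleanup()) LBUG",
    "LustreError: 143361:0:(llog_cat.c:602:llog_cat_add_rec()) llog_write_rec -71: lh=0000000062d00541",
]

# Count messages per (file, line, function) site to group identical crashes.
sites = Counter(
    (rec["file"], rec["line"], rec["func"])
    for rec in map(parse_lustre_error, console)
    if rec is not None
)
for site, count in sites.most_common():
    print(site, count)
```

Grouping by source site rather than by full message keeps the ASSERTION and LBUG lines from the same crash together.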
| Comments |
| Comment by Andreas Dilger [ 11/Aug/21 ] |
|
It appears from the current failures that these are all happening with ZFS and after replay-single fails with

e9cffb256d LU-14880 libcfs: Use crypto/sha2.h if available

The patch with the highest probability of introducing this failure is marked with "!". The default LMV changes marked with + do not really change the code as much as they change the behavior of the tests themselves to be more likely to use remote DNE directories, but that is largely driven by the client and shouldn't cause the client to crash. The other change to the test environment on 2021-07-28 was the increase of VM RAM for ZFS from 2GB to 3GB, though it would be confusing if more memory caused the MDS to crash.

However, since the crash is only hit every 3-5 days, it may have been first introduced by a batch of landings on 2021-07-26, but none of these patches appear to modify any related code:

adc1bbbf20 LU-13602 pcc: add LCM_FL_PCC_RDONLY layout flag |
| Comment by Andreas Dilger [ 24/Aug/21 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/3b075625-6d30-4bdf-a180-95dc1024dda8 |
| Comment by Andreas Dilger [ 24/Aug/21 ] |
|
Hit 6 other times in the past 4 weeks. |
| Comment by Andreas Dilger [ 30/Sep/21 ] |
|
May be fixed by patch: https://review.whamcloud.com/44998 |
| Comment by Andreas Dilger [ 14/Oct/21 ] |
|
One test failure was seen today after the landing of patch 44998 5 days ago as v2_14_55-16-g4521f6af35, but the failed patch was based on an old parent v2_14_55-1-g1a409a3e6a that did not include that fix: |
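The reasoning in the last comment (a parent at v2_14_55-1 predates a fix that landed as v2_14_55-16) can be sketched as a comparison of `git describe` strings. This is a heuristic under stated assumptions: both commits sit on the same linear branch and share the same base tag. The helper names `parse_describe` and `parent_may_include` are hypothetical.

```python
import re

# `git describe` output has the form <tag>-<commits since tag>-g<abbreviated hash>.
DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(version):
    """Split a git-describe string into (tag, commits_since_tag, short_hash)."""
    m = DESCRIBE_RE.match(version)
    if m is None:
        raise ValueError(f"not a git-describe string: {version!r}")
    return m.group("tag"), int(m.group("count")), m.group("sha")


def parent_may_include(parent, fix):
    """Heuristic: True if `parent` is at or after `fix`, assuming linear history."""
    parent_tag, parent_count, _ = parse_describe(parent)
    fix_tag, fix_count, _ = parse_describe(fix)
    if parent_tag != fix_tag:
        # Commit counts are only comparable relative to the same tag.
        raise ValueError("different base tags; check ancestry with git itself")
    return parent_count >= fix_count


fix = "v2_14_55-16-g4521f6af35"    # patch 44998 as landed
parent = "v2_14_55-1-g1a409a3e6a"  # parent of the patch that failed
print(parent_may_include(parent, fix))  # False
```

In a checked-out repository, `git merge-base --is-ancestor <fix> <parent>` gives an exact answer (exit status 0 when the fix is an ancestor of the parent) and does not rely on the linear-history assumption.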