[LU-14932] runtests: test_1 llog_cat_cleanup()) ASSERTION( index ) on MDS Created: 11/Aug/21  Updated: 15/Dec/22  Resolved: 14/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14474 Oops in llog_cat_prep_log() in sanity... Resolved
is related to LU-15139 sanity test_160h: dt_record_write() A... Resolved
is related to LU-14964 recovery-small: GPF in llog_exist aft... Resolved
is related to LU-16398 ost-pools: FAIL: remove sub-test dirs... Resolved
Severity: 3

Description

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/1ff8d9a5-c3da-4835-8739-9f790d3c2491

test_1 crashed on the MDS with the following error:

onyx-44vm9 crashed during runtests test_1

LustreError: 138526:0:(llog_cat.c:1162:llog_cat_cleanup()) ASSERTION( index ) failed: 
LustreError: 138526:0:(llog_cat.c:1162:llog_cat_cleanup()) LBUG
Pid: 138526, comm: lod0001_rec0000 4.18.0-240.22.1.el8_lustre.x86_64 #1 SMP Fri Jul 30 19:47:15 UTC 2021
header
Call Trace TBD:
libcfs_call_trace+0x6f/0x90 [libcfs]
lbug_with_loc+0x43/0x80 [libcfs]
llog_cat_cleanup+0x391/0x3d0 [obdclass]
llog_cat_close+0x193/0x210 [obdclass]
lod_sub_recovery_thread+0x1e3/0xb40 [lod]
kthread+0x112/0x130

LustreError: 143361:0:(llog.c:1149:llog_write_rec()) lustre-MDT0000-osp-MDT0001: loghandle 0000000062d00541 with no 
LustreError: 143361:0:(llog_cat.c:602:llog_cat_add_rec()) llog_write_rec -71: lh=0000000062d00541
LustreError: 143361:0:(update_trans.c:1062:top_trans_stop()) lustre-MDT0000-osp-MDT0001: write updates failed: rc = -71
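
When comparing many Maloo crash logs for the same signature, it can help to pull the source file, line and function out of each LustreError line mechanically. The following is a minimal sketch, assuming the console lines keep the "pid:N:(file:line:function()) message" layout seen above; the names parse_lustre_error and LustreConsoleError are hypothetical helpers, not part of Lustre or Maloo, and the meaning of the second numeric field is not assumed here.

```python
import re
from typing import NamedTuple, Optional


class LustreConsoleError(NamedTuple):
    """Fields extracted from one 'LustreError:' console line (hypothetical type)."""
    pid: int
    source_file: str
    line: int
    function: str
    message: str


# Layout assumed from the lines quoted in this ticket:
#   LustreError: <pid>:<n>:(<file>:<line>:<function>()) <message>
_PATTERN = re.compile(
    r"^LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) ?"
    r"(?P<msg>.*)$"
)


def parse_lustre_error(text: str) -> Optional[LustreConsoleError]:
    """Return the parsed fields, or None if the line does not match the layout."""
    m = _PATTERN.match(text.strip())
    if m is None:
        return None
    return LustreConsoleError(
        pid=int(m["pid"]),
        source_file=m["file"],
        line=int(m["line"]),
        function=m["func"],
        message=m["msg"].strip(),
    )


if __name__ == "__main__":
    sample = ("LustreError: 138526:0:(llog_cat.c:1162:llog_cat_cleanup()) "
              "ASSERTION( index ) failed: ")
    print(parse_lustre_error(sample))
```

Grouping the parsed records by (source_file, function) would separate the llog_cat_cleanup() assertion from the llog_write_rec() errors that follow it.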

A second test had a similar MDS crash with a slightly different stack:
https://testing.whamcloud.com/test_sets/366c2ba7-795e-4856-b4c4-9f2cce973618

general protection fault: 0000 [#1] SMP PTI
CPU: 0 PID: 139728 Comm: mdt00_002  4.18.0-240.22.1.el8_lustre.x86_64 #1
RIP: 0010:__list_add_valid+0x10/0x50
Call Trace:
 llog_cat_prep_log+0x311/0x3c0 [obdclass]
 llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
 llog_declare_add+0x187/0x1d0 [obdclass]
 top_trans_start+0x212/0x940 [ptlrpc]
 mdd_unlink+0x4a0/0xb30 [mdd]
 mdt_reint_unlink+0xb0c/0x12a0 [mdt]
 mdt_reint_rec+0x11f/0x250 [mdt]
 mdt_reint_internal+0x498/0x780 [mdt]
 mdt_reint+0x5e/0x100 [mdt]
 tgt_request_handle+0xc90/0x1940 [ptlrpc]
 ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
 ptlrpc_main+0xba2/0x1490 [ptlrpc]

A third test crashed the MDS with a different operation, but also in llog list handling:
https://testing.whamcloud.com/test_sets/b7099363-3b2c-4b7a-ad54-795ca4541ddc

general protection fault: 0000 [#1] SMP PTI
CPU: 0 PID: 138567 Comm: mdt00_002 4.18.0-240.22.1.el8_lustre.x86_64 #1
RIP: 0010:__list_add_valid+0x10/0x50
Call Trace:
 llog_cat_prep_log+0x311/0x3c0 [obdclass]
 llog_cat_declare_add_rec+0xbe/0x220 [obdclass]
 llog_declare_add+0x187/0x1d0 [obdclass]
 top_trans_start+0x212/0x940 [ptlrpc]
 mdd_create+0xb42/0x1870 [mdd]
 mdt_create+0x7a7/0xc20 [mdt]
 mdt_reint_create+0x30b/0x3c0 [mdt]
 mdt_reint_rec+0x11f/0x250 [mdt]
 mdt_reint_internal+0x498/0x780 [mdt]
 mdt_reint+0x5e/0x100 [mdt]
 tgt_request_handle+0xc90/0x1940 [ptlrpc]
 ptlrpc_server_handle_request+0x323/0xbc0 [ptlrpc]
 ptlrpc_main+0xba2/0x1490 [ptlrpc]

Searching back through the Maloo crashes of runtests to the start of the year, it appears that runtests started failing with this ASSERTION on 2021-07-31 (though there are other, unrelated crashes in runtests due to bugs in under-development patches).



Comments
Comment by Andreas Dilger [ 11/Aug/21 ]

It appears from the current failures that these are all happening with ZFS and after replay-single fails with LU-10729.
While the LU-10729 failure has been around for quite a while, the runtests crash is new and should be fixed. Patches that landed on 2021-07-31 are:

e9cffb256d LU-14880 libcfs: Use crypto/sha2.h if available
39e4c97530 LU-14093 gss: gcc10 fixes for GSS
db0b09018e LU-13299 lnet: add "stats reset" to lnetctl
4668283cd1 LU-14806 o2iblnd: clear fatal error on successful failover
b9c4dc3c33 LU-14792 llite: enable filesystem-wide default LMV +
b7bd4e3422 LU-14621 mdd: fix lock-tx order in mdd_xattr_merge() !
3e04b0fd6c LU-13417 mdd: set default LMV on ROOT +

The patch most likely to have introduced this failure is marked with "!". The default LMV changes marked with "+" change little code; mainly they change the behavior of the tests themselves to be more likely to use remote DNE directories, but that is largely driven by the client and shouldn't cause the MDS to crash.

The other change to the test environment, made on 2021-07-28, was the increase of VM RAM for ZFS from 2GB to 3GB, though it would be surprising if more memory caused the MDS to crash.

However, since the crash is only hit every 3-5 days, it may have first been introduced by a batch of landings on 2021-07-26, though none of these patches appears to modify any related code:

adc1bbbf20 LU-13602 pcc: add LCM_FL_PCC_RDONLY layout flag
6717c573ed LU-14814 osc: osc: Do not flush on lockless cancel
5ad00e36ec LU-14838 osc: Remove client contention support
6335dba839 LU-14838 osc: Remove lockless truncate
592d9a737b LU-9859 libcfs: make lnet_debugfs_symlink_def local to libcfs/modules.c
4b52ea1d30 LU-14637 flr: get rid of excluding dom+flr support test
5a28f3bc4b LU-14789 tests: make sanity 133f and 133g working
449d046e55 LU-14788 lnet: check memdup_user_nul using IS_ERR
393885c027 LU-13055 doc: update changelog manpages
66dcbd503f LU-14748 build: gcc9 fix address of packed member warning
3ffa5d680f LU-14740 llite: avoid project quota overflow
b1ed8e57da LU-14430 mdd: rename mti_fid to mdi_fid and friends
f18c87cb53 LU-13717 sec: handle null algo for filename encryption
87c4535f7a LU-13799 osc: Improve osc_queue_sync_pages
b855397878 LU-13799 clio: Skip prep for transients
1e4d10af39 LU-13799 llite: Adjust dio refcounting
d31647c017 LU-13799 lov: Improve DIO submit
587e5aa834 LU-13799 llite: Remove transient page counting
b3de247b76 LU-13799 llite: Modify AIO/DIO reference counting
7a2ef25f1f LU-13326 mds: remove MDS_SETATTR_PORTAL and service
618625af42 LU-13417 test: mkdir_on_mdt0() in more tests
d87af24452 LU-14655 lnet: Protect lpni deref in lnet_health_check

Comment by Andreas Dilger [ 24/Aug/21 ]

+1 on master https://testing.whamcloud.com/test_sets/3b075625-6d30-4bdf-a180-95dc1024dda8

Comment by Andreas Dilger [ 24/Aug/21 ]

Hit 6 other times in the past 4 weeks.

Comment by Andreas Dilger [ 30/Sep/21 ]

May be fixed by patch: https://review.whamcloud.com/44998 "LU-14474 llog: reset pointer to the next llog".

Comment by Andreas Dilger [ 14/Oct/21 ]

One test failure was seen today, 5 days after patch 44998 landed as v2_14_55-16-g4521f6af35, but the failing patch was based on an older parent, v2_14_55-1-g1a409a3e6a, that did not include that fix:
https://testing.whamcloud.com/test_sessions/805cf1c4-4678-405e-91a1-2d94b53d345d
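
The reasoning about the parent version can be sketched as a comparison of the two "git describe"-style strings quoted above, which have the form tag-count-gSHA, where count is the number of commits since the tag. This is only a rough check, assuming both strings are relative to the same tag and that the branch history between them is linear; the function names are hypothetical. The authoritative test in a real checkout is "git merge-base --is-ancestor <fix> <parent>".

```python
import re

# "git describe" output shape: <tag>-<commits since tag>-g<abbreviated sha>
_DESCRIBE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(version: str):
    """Split a describe string into (tag, commits_since_tag, short_sha)."""
    m = _DESCRIBE.match(version)
    if m is None:
        raise ValueError(f"not a git-describe style string: {version!r}")
    return m["tag"], int(m["count"]), m["sha"]


def parent_predates_fix(parent: str, fix: str) -> bool:
    """True if 'parent' has fewer commits past the shared tag than 'fix'.

    Assumption: same base tag and linear history, so a smaller count
    means the parent cannot contain the fix.
    """
    parent_tag, parent_count, _ = parse_describe(parent)
    fix_tag, fix_count, _ = parse_describe(fix)
    if parent_tag != fix_tag:
        raise ValueError("different base tags; compare with git instead")
    return parent_count < fix_count


if __name__ == "__main__":
    # Values taken from the comment above.
    print(parent_predates_fix("v2_14_55-1-g1a409a3e6a",
                              "v2_14_55-16-g4521f6af35"))
```

With the two versions from this comment, the parent is 1 commit past v2_14_55 and the fix is 16 commits past it, which is consistent with the failing patch not including the fix.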
