[LU-17385] sanity-lfsck test_26a: only 3 of 4 MDTs are in completed Created: 21/Dec/23 Updated: 04/Feb/24 Resolved: 04/Feb/24 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Alexander Zarochentsev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | always_except |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/e80cc085-ac08-4f47-b354-22551a7da132

test_26a failed with the following error:

(7) only 3 of 4 MDTs are in completed

First started on 2023-12-20 for full runs; may be related to a recent patch landing. |
| Comments |
| Comment by Peter Jones [ 22/Dec/23 ] |
|
Hongchao, this seems to have started failing only very recently. Can you identify which change introduced this issue? Thanks, Peter |
| Comment by Hongchao Zhang [ 22/Dec/23 ] |
|
The LFSCK_NOTIFY request failed to be processed by the MDT, but it is strange that another LFSCK_NOTIFY request succeeded:

00000100:00100000:1.0:1703075647.451663:0:1223156:0:(service.c:2333:ptlrpc_server_handle_request()) Handling RPC req@00000000a373e95a pname:cluuid+ref:pid:xid:nid:opc:job mdt_out00_003:lustre-MDT0000-mdtlov_UUID+5:1744723:x1785766154025664:12345-10.240.26.106@tcp:1101:lctl.0
00000100:00100000:0.0:1703075647.451664:0:1215944:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.240.26.106@tcp, seq: 1159
00000100:00100000:0.0:1703075647.451667:0:1215944:0:(service.c:2333:ptlrpc_server_handle_request()) Handling RPC req@00000000617d7544 pname:cluuid+ref:pid:xid:nid:opc:job mdt_out00_001:lustre-MDT0000-mdtlov_UUID+5:1744723:x1785766154025792:12345-10.240.26.106@tcp:1101:lctl.0
00000100:00100000:1.0:1703075647.451692:0:1223156:0:(service.c:2382:ptlrpc_server_handle_request()) Handled RPC req@00000000a373e95a pname:cluuid+ref:pid:xid:nid:opc:job mdt_out00_003:lustre-MDT0000-mdtlov_UUID+5:1744723:x1785766154025664:12345-10.240.26.106@tcp:1101:lctl.0 Request processed in 29us (98us total) trans 0 rc -95/-95
00000100:00100000:1.0:1703075647.451695:0:1223156:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.240.26.106@tcp, seq: 1158
00100000:10000000:0.0:1703075647.451784:0:1215944:0:(lfsck_lib.c:2707:lfsck_load_one_trace_file()) lustre-MDT0003-osd: unlink lfsck sub trace file lfsck_namespace_01: rc = 0

The request (xid = x1785766154025664) failed with -95 immediately, but the similar request (xid = x1785766154025792) succeeded.

I will create a debug patch to collect the logs. |
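As a reading aid: the two numbers in "rc -95/-95" on the "Handled RPC" line are the handler return code and the reply status, and opcode 1101 in the request summary is LFSCK_NOTIFY. A minimal sketch for pulling these requests out of a dumped debug log (the /tmp/debug.log path is illustrative):

        # dump and clear the kernel debug buffer on the MDS node
        lctl dk /tmp/debug.log
        # list the handled LFSCK_NOTIFY (opc 1101) RPCs and their rc values
        grep -E 'Handled RPC .*:1101:' /tmp/debug.log

Here that matches the failing request x1785766154025664 with rc -95/-95. |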
| Comment by Gerrit Updater [ 22/Dec/23 ] |
|
"Hongchao Zhang <hongchao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53530 |
| Comment by Andreas Dilger [ 23/Dec/23 ] |
|
It looks like this test failure was introduced by patch https://review.whamcloud.com/50998, which was only tested with "Test-Parameters: trivial testlist=sanity-lfsck ... env=ONLY=23d", so it is likely leaving the filesystem in a bad state after test_23d finishes, and this causes test_24 and test_26a to also fail. |
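A quick local check of that theory (a sketch, assuming an already-configured Lustre test environment) is to run just this sequence from lustre/tests:

        # run only the suspected tests back to back
        ONLY="23d 24 26a" bash sanity-lfsck.sh

If test_23d leaves stale LFSCK state behind, test_24 and test_26a should fail the same way. |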
| Comment by Gerrit Updater [ 23/Dec/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53544 |
| Comment by Andreas Dilger [ 23/Dec/23 ] |
|
This is failing in 22 of 62 runs since the patch landed. I'll also push a revert of the patch that added test_23d and confirm that this stops the problem from being hit, so we'll have it ready if there is no quick solution. |
| Comment by Gerrit Updater [ 23/Dec/23 ] |
|
|
| Comment by Alexander Zarochentsev [ 23/Dec/23 ] |
|
> It looks like this test failure was introduced by patch https://review.whamcloud.com/50998

No, the patch adds test_23d; test_23c is an old test with a similar name:

run_test 23c "LFSCK can repair dangling name entry (3)" |
| Comment by Andreas Dilger [ 23/Dec/23 ] |
|
Sorry, that was a typo on my part, now fixed in my comment. It is the new test that landed which caused the problem. |
| Comment by Andreas Dilger [ 25/Dec/23 ] |
|
It looks like MDT0000 has "finished" the LFSCK run, but with an error:

status: partial
flags: incomplete

It isn't clear from the test output why it is "partial". It doesn't look like waiting longer (600s) in wait_all_targets_blocked() is better than using "-w" in this case, because the LFSCK threads are all finished, but with an error. |
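For reference, the check the test performs can be approximated with a one-liner (a sketch run on the MDS nodes, not the actual sanity-lfsck helper):

        # count how many MDTs report the namespace LFSCK as "completed",
        # mirroring the "(7) only 3 of 4 MDTs are in completed" assertion
        lctl get_param -n mdd.*.lfsck_namespace |
                awk '$1 == "status:" && $2 == "completed" { n++ }
                     END { print n, "MDT(s) in completed" }'

In this failure it would report 3 rather than 4. |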
| Comment by Gerrit Updater [ 27/Dec/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53544/ |
| Comment by Hongchao Zhang [ 28/Dec/23 ] |
|
The LFSCK command used in test_23d to start the layout LFSCK on MDT0000 uses option "-o", which broadcasts the LFSCK to all MDTs:

int jt_lfsck_start(int argc, char **argv)
{
...
        case 'o':
                start.ls_flags |= LPF_ALL_TGT | LPF_BROADCAST |
                                  LPF_OST_ORPHAN;
                break;
...
}
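For context, the user-space command that takes this path looks like the following (an illustrative invocation, not the exact line from test_23d):

        # "-o" sets LPF_ALL_TGT | LPF_BROADCAST | LPF_OST_ORPHAN, so this
        # single start request fans out to every MDT
        lctl lfsck_start -M lustre-MDT0000 -t layout -o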
The LFSCK command used in test_24 to start the namespace LFSCK is also sent to all MDTs, but some MDTs could not complete it:

int lfsck_start(const struct lu_env *env, struct dt_device *key,
struct lfsck_start_param *lsp)
{
...
if (!thread_is_init(thread) && !thread_is_stopped(thread)) {
rc = -EALREADY;
if (unlikely(start == NULL)) {
spin_unlock(&lfsck->li_lock);
GOTO(out, rc);
}
while (start->ls_active != 0) {
if (!(type & start->ls_active)) {
type <<= 1;
continue;
}
com = __lfsck_component_find(lfsck, type,
&lfsck->li_list_scan);
if (com == NULL)
com = __lfsck_component_find(lfsck, type,
&lfsck->li_list_double_scan);
if (com == NULL) {
rc = -EOPNOTSUPP; <--------- return with error -EOPNOTSUPP
break;
}
...
}
...
}
...
}
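In other words, if the layout LFSCK started by test_23d is still not stopped on some MDT when test_24's namespace start arrives, that MDT takes the -EALREADY branch above, finds no namespace component in its scan lists, and answers with -EOPNOTSUPP. A hypothetical two-command sketch of that race (timing-dependent, for illustration only):

        # broadcast a layout LFSCK, then immediately request a namespace
        # LFSCK on all MDTs; a target still busy with the first run will
        # fail the second start with -95 (EOPNOTSUPP)
        lctl lfsck_start -M lustre-MDT0000 -t layout -o
        lctl lfsck_start -M lustre-MDT0000 -t namespace -A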
The corresponding logs:

00000020:00000001:1.0:1703243105.771867:0:217993:0:(tgt_handler.c:1621:tgt_handle_lfsck_notify()) Process entered
00100000:00000001:0.0:1703243105.771867:0:228721:0:(lfsck_lib.c:3489:lfsck_in_notify()) Process entered
00100000:00000001:1.0:1703243105.771868:0:217993:0:(lfsck_lib.c:3489:lfsck_in_notify()) Process entered
00100000:00000001:0.0:1703243105.771868:0:228721:0:(lfsck_lib.c:3104:lfsck_start()) Process entered
00100000:00000001:1.0:1703243105.771869:0:217993:0:(lfsck_lib.c:3104:lfsck_start()) Process entered
00100000:00000001:1.0:1703243105.771873:0:217993:0:(lfsck_bookmark.c:107:lfsck_bookmark_store()) Process entered
00080000:00000001:1.0:1703243105.771874:0:217993:0:(osd_handler.c:1912:osd_trans_create()) Process entered
00080000:00000010:1.0:1703243105.771877:0:217993:0:(osd_handler.c:1927:osd_trans_create()) kmalloced '(oh)': 288 at 00000000a80bb82f.
00100000:00000001:0.0:1703243105.771880:0:228721:0:(lfsck_lib.c:3174:lfsck_start()) Process leaving via out (rc=18446744073709551521 : -95 : 0xffffffffffffffa1)
00100000:00000001:0.0:1703243105.771886:0:228721:0:(lfsck_lib.c:3546:lfsck_in_notify()) Process leaving (rc=18446744073709551521 : -95 : ffffffffffffffa1)
00000020:00000001:0.0:1703243105.771887:0:228721:0:(tgt_handler.c:1629:tgt_handle_lfsck_notify()) Process leaving (rc=18446744073709551521 : -95 : ffffffffffffffa1)
00080000:00000001:1.0:1703243105.771888:0:217993:0:(osd_handler.c:1955:osd_trans_create()) Process leaving (rc=18446619811807463424 : -124261902088192 : ffff8efc05778800)
00010000:00000040:0.0:1703243105.771889:0:228721:0:(ldlm_lib.c:3238:target_committed_to_req()) last_committed 0, transno 0, xid 1785973155960256
00010000:00000001:0.0:1703243105.771890:0:228721:0:(ldlm_lib.c:3307:target_send_reply()) Process entered
00010000:00000200:0.0:1703243105.771892:0:228721:0:(ldlm_lib.c:3295:target_send_reply_msg()) @@@ sending reply req@00000000c24bd5b0 x1785973155960256/t0(0) o1101->lustre-MDT0000-mdtlov_UUID@10.240.38.25@tcp:111/0 lens 320/224 e 0 to 0 dl 1703243116 ref 1 fl Interpret:/200/0 rc -95/0 job:'lctl.0' uid:0 gid:0 |
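As an aside, the huge unsigned values in the "Process leaving" lines are just the signed rc printed as a u64: 0xffffffffffffffa1 is -95 (EOPNOTSUPP) in 64-bit two's complement. On a 64-bit bash, whose arithmetic wraps at 64 bits, the hex form converts directly (illustrative one-liner):

        # 0xffffffffffffffa1 == 2^64 - 95
        printf 'rc = %d\n' "$(( 0xffffffffffffffa1 ))"    # prints: rc = -95

The osd_trans_create() leaving line with rc=18446619811807463424 is a pointer return (ffff8efc05778800), not an error. |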
| Comment by Andreas Dilger [ 28/Dec/23 ] |
|
The sanity-lfsck test_23d is now skipped, so any patch that fixes it needs to remove the always_except line. |
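For reference, the skip uses the standard test-framework exception mechanism near the top of sanity-lfsck.sh, i.e. a line of roughly this shape (shown for illustration):

        # skip test_23d until LU-17385 is fixed
        always_except LU-17385 23d

Dropping that line re-enables test_23d once the fix lands. |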
| Comment by Alexander Zarochentsev [ 04/Jan/24 ] |
|
hongchao.zhang, thanks for the analysis; I am submitting a patch to test the idea. |
| Comment by Gerrit Updater [ 04/Jan/24 ] |
|
"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53591 |
| Comment by Gerrit Updater [ 04/Feb/24 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53591/ |
| Comment by Peter Jones [ 04/Feb/24 ] |
|
Merged for 2.16 |