[LU-17385] sanity-lfsck test_26a: only 3 of 4 MDTs are in completed Created: 21/Dec/23  Updated: 04/Feb/24  Resolved: 04/Feb/24

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Alexander Zarochentsev
Resolution: Fixed Votes: 0
Labels: always_except

Issue Links:
Related
is related to LU-16826 MDS nodes panicked running lfsck repa... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/e80cc085-ac08-4f47-b354-22551a7da132

test_26a failed with the following error:

(7) only 3 of 4 MDTs are in completed

Test session details:
clients: https://build.whamcloud.com/job/lustre-master-patchless/840 - 4.18.0-425.10.1.el8_7.x86_64
servers: https://build.whamcloud.com/job/lustre-master-patchless/840 - 4.18.0-425.10.1.el8_7.x86_64

<<Please provide additional information about the failure here>>

First started on 2023-12-20 for full runs, may be related to recent patch landing.

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-lfsck test_26a - (7) only 3 of 4 MDTs are in completed



 Comments   
Comment by Peter Jones [ 22/Dec/23 ]

Hongchao

This seems to have started failing only very recently. Can you identify which change introduced this issue?

Thanks

Peter

Comment by Hongchao Zhang [ 22/Dec/23 ]

The LFSCK_NOTIFY request is failed to be processed by MDT, but it is strange that other LFSCK_NOTIFY request succeed

00000100:00100000:1.0:1703075647.451663:0:1223156:0:(service.c:2333:ptlrpc_server_handle_request()) Handling RPC req@00000000a373e95a pname:cluuid+ref:pid:xid:nid:opc:job mdt_out00_003:lustre-MDT0000-mdtlov_UUID+5:1744723:x1785766154025664:12345-10.240.26.106@tcp:1101:lctl.0
00000100:00100000:0.0:1703075647.451664:0:1215944:0:(nrs_fifo.c:179:nrs_fifo_req_get()) NRS start fifo request from 12345-10.240.26.106@tcp, seq: 1159
00000100:00100000:0.0:1703075647.451667:0:1215944:0:(service.c:2333:ptlrpc_server_handle_request()) Handling RPC req@00000000617d7544 pname:cluuid+ref:pid:xid:nid:opc:job mdt_out00_001:lustre-MDT0000-mdtlov_UUID+5:1744723:x1785766154025792:12345-10.240.26.106@tcp:1101:lctl.0
00000100:00100000:1.0:1703075647.451692:0:1223156:0:(service.c:2382:ptlrpc_server_handle_request()) Handled RPC req@00000000a373e95a pname:cluuid+ref:pid:xid:nid:opc:job mdt_out00_003:lustre-MDT0000-mdtlov_UUID+5:1744723:x1785766154025664:12345-10.240.26.106@tcp:1101:lctl.0 Request processed in 29us (98us total) trans 0 rc -95/-95
00000100:00100000:1.0:1703075647.451695:0:1223156:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.240.26.106@tcp, seq: 1158
00100000:10000000:0.0:1703075647.451784:0:1215944:0:(lfsck_lib.c:2707:lfsck_load_one_trace_file()) lustre-MDT0003-osd: unlink lfsck sub trace file lfsck_namespace_01: rc = 0

the request (xid= x1785766154025664) failed with -95 immediately, but the similar request (xid = x1785766154025792) succeeded

will create a debug patch to collect the logs

Comment by Gerrit Updater [ 22/Dec/23 ]

"Hongchao Zhang <hongchao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53530
Subject: EX-8860 lfsck: debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 17251801b1cf5516132edebd6677e2f34fcbc61c

Comment by Andreas Dilger [ 23/Dec/23 ]

It looks like this test failure was introduced by patch https://review.whamcloud.com/50998 "LU-16826 tests: lfsck to repair a dangling remote entry" landing on 2023-12-20 which added sanity-lfsck.sh test_23d, but used:

Test-Parameters: trivial testlist=sanity-lfsck ... env=ONLY=23d

so it is likely leaving the filesystem in a bad state after test_23d finished and this causes test_24 and test_26a to also fail.

Comment by Gerrit Updater [ 23/Dec/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53544
Subject: LU-17385 tests: add sanity-lfsck/24 debugging
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 94f62d0d5bea764b3b0287662384a524283dd419

Comment by Andreas Dilger [ 23/Dec/23 ]

This is failing 22/62 runs since the LU-16826 test case landed.  I don't see anything obvious in the test logs, like an MDT reconnecting in test_26/test_27 after it was stopped/started in test_23d, so I added some more debugging to see why this is failing.

I'll also push a revert of the patch that added test_23d and confirm that this stops the problem from being hit, and we'll have it ready if there is no quick solution.

Comment by Gerrit Updater [ 23/Dec/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53545
Subject: LU-17385 revert: LU-16826 tests: lfsck to repair a dangling remote entry
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fb6c848ef816ecb17f02ac461c2352ced320c593

Comment by Alexander Zarochentsev [ 23/Dec/23 ]

> It looks like this test failure was introduced by patch https://review.whamcloud.com/50998 "LU-16826 tests: lfsck to repair a dangling remote entry" landing on 2023-12-20 which added sanity-lfsck.sh test_23c, but used:

no, the patch adds test_23d, test_23c is an old one with similar name:

run_test 23c "LFSCK can repair dangling name entry (3)
run_test 23d "LFSCK can repair a dangling name entry to a remote object

Comment by Andreas Dilger [ 23/Dec/23 ]

Sorry, a typo on my part, and fixed in my comment. It is the new test that landed which caused the problem. 

Comment by Andreas Dilger [ 25/Dec/23 ]

It looks like MDT0000 has "finished" the LFSCK run, but with an error:

 

status: partial
flags: incomplete

It isn't clear from the test output why it is "partial". It doesn't look like waiting longer (600s) in wait_all_targets_blocked() is better than using "-w" in this case, because the LFSCK threads are all finished, but with an error.

Comment by Gerrit Updater [ 27/Dec/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53544/
Subject: LU-17385 tests: always_except sanity-lfsck/24
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 955e38051765609fe3a661035c0fab2cfca733ce

Comment by Hongchao Zhang [ 28/Dec/23 ]

The LFSCK command used in test_23d to start layout LFSCK on MDT0000 uses option "-o", which will broadcast LFSCK to all MDTs

int jt_lfsck_start(int argc, char **argv)
{
                ...
                case 'o':
                        start.ls_flags |= LPF_ALL_TGT | LPF_BROADCAST |         
                                          LPF_OST_ORPHAN;
                        break;
                ...
}

the LFSCK command used in test_24 to start namespace LFSCK is also sent to all MDTs, but some MDT could not complete
the previous layout LFSCK yet, then it will find the two LFSCKs are different and return -EOPNOTSUPP

int lfsck_start(const struct lu_env *env, struct dt_device *key,
                struct lfsck_start_param *lsp)
{
        ...
        if (!thread_is_init(thread) && !thread_is_stopped(thread)) {
                rc = -EALREADY;
                if (unlikely(start == NULL)) {
                        spin_unlock(&lfsck->li_lock);
                        GOTO(out, rc);
                }

                while (start->ls_active != 0) {
                        if (!(type & start->ls_active)) {
                                type <<= 1;
                                continue;
                        }

                        com = __lfsck_component_find(lfsck, type,
                                                     &lfsck->li_list_scan);
                        if (com == NULL)
                                com = __lfsck_component_find(lfsck, type,
                                                &lfsck->li_list_double_scan);
                        if (com == NULL) {
                                rc = -EOPNOTSUPP;      <--------- return with error -EOPNOTSUPP
                                break;
                        }
                        ...
                }
                ...
        }
        ...
}

the corresponding logs

00000020:00000001:1.0:1703243105.771867:0:217993:0:(tgt_handler.c:1621:tgt_handle_lfsck_notify()) Process entered
00100000:00000001:0.0:1703243105.771867:0:228721:0:(lfsck_lib.c:3489:lfsck_in_notify()) Process entered
00100000:00000001:1.0:1703243105.771868:0:217993:0:(lfsck_lib.c:3489:lfsck_in_notify()) Process entered
00100000:00000001:0.0:1703243105.771868:0:228721:0:(lfsck_lib.c:3104:lfsck_start()) Process entered
00100000:00000001:1.0:1703243105.771869:0:217993:0:(lfsck_lib.c:3104:lfsck_start()) Process entered
00100000:00000001:1.0:1703243105.771873:0:217993:0:(lfsck_bookmark.c:107:lfsck_bookmark_store()) Process entered
00080000:00000001:1.0:1703243105.771874:0:217993:0:(osd_handler.c:1912:osd_trans_create()) Process entered
00080000:00000010:1.0:1703243105.771877:0:217993:0:(osd_handler.c:1927:osd_trans_create()) kmalloced '(oh)': 288 at 00000000a80bb82f.
00100000:00000001:0.0:1703243105.771880:0:228721:0:(lfsck_lib.c:3174:lfsck_start()) Process leaving via out (rc=18446744073709551521 : -95 : 0xffffffffffffffa1)
00100000:00000001:0.0:1703243105.771886:0:228721:0:(lfsck_lib.c:3546:lfsck_in_notify()) Process leaving (rc=18446744073709551521 : -95 : ffffffffffffffa1)
00000020:00000001:0.0:1703243105.771887:0:228721:0:(tgt_handler.c:1629:tgt_handle_lfsck_notify()) Process leaving (rc=18446744073709551521 : -95 : ffffffffffffffa1)
00080000:00000001:1.0:1703243105.771888:0:217993:0:(osd_handler.c:1955:osd_trans_create()) Process leaving (rc=18446619811807463424 : -124261902088192 : ffff8efc05778800)
00010000:00000040:0.0:1703243105.771889:0:228721:0:(ldlm_lib.c:3238:target_committed_to_req()) last_committed 0, transno 0, xid 1785973155960256
00010000:00000001:0.0:1703243105.771890:0:228721:0:(ldlm_lib.c:3307:target_send_reply()) Process entered
00010000:00000200:0.0:1703243105.771892:0:228721:0:(ldlm_lib.c:3295:target_send_reply_msg()) @@@ sending reply  req@00000000c24bd5b0 x1785973155960256/t0(0) o1101->lustre-MDT0000-mdtlov_UUID@10.240.38.25@tcp:111/0 lens 320/224 e 0 to 0 dl 1703243116 ref 1 fl Interpret:/200/0 rc -95/0 job:'lctl.0' uid:0 gid:0
Comment by Andreas Dilger [ 28/Dec/23 ]

The sanity-lfsck test_23d is now skipped, so any patch that is fixing it needs to remove the always_except line.

Comment by Alexander Zarochentsev [ 04/Jan/24 ]

hongchao.zhang, thanks for the analysis, I am submitting a patch to test the idea.

Comment by Gerrit Updater [ 04/Jan/24 ]

"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53591
Subject: LU-17385 tests: sanity-lfsck 23d fix and enable
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2f4f656947703d8b44a7ea49a8b2c84020591307

Comment by Gerrit Updater [ 04/Feb/24 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53591/
Subject: LU-17385 tests: sanity-lfsck 23d fix and enable
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 028ed64d90cfdeb908fb5574aacf2f71c259e2c2

Comment by Peter Jones [ 04/Feb/24 ]

Merged for 2.16

Generated at Sat Feb 10 03:35:01 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.