sanity-lfsck test_26a: only 3 of 4 MDTs are in completed

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0
    • Lustre 2.16.0
    • None
    • 3

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/e80cc085-ac08-4f47-b354-22551a7da132

      test_26a failed with the following error:

      (7) only 3 of 4 MDTs are in completed
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master-patchless/840 - 4.18.0-425.10.1.el8_7.x86_64
      servers: https://build.whamcloud.com/job/lustre-master-patchless/840 - 4.18.0-425.10.1.el8_7.x86_64

      Failures first appeared in full test runs on 2023-12-20 and may be related to a recently landed patch.

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lfsck test_26a - (7) only 3 of 4 MDTs are in completed

          Activity

            The sanity-lfsck test_23d is now skipped, so any patch that is fixing it needs to remove the always_except line.

            adilger Andreas Dilger added the comment above

            The LFSCK command used in test_23d to start layout LFSCK on MDT0000 uses the "-o" option, which broadcasts LFSCK to all MDTs:

            int jt_lfsck_start(int argc, char **argv)
            {
                            ...
                            case 'o':
                                    start.ls_flags |= LPF_ALL_TGT | LPF_BROADCAST |         
                                                      LPF_OST_ORPHAN;
                                    break;
                            ...
            }
            

            The LFSCK command used in test_24 to start namespace LFSCK is also sent to all MDTs. If some MDT has not yet completed
            the previous layout LFSCK, it finds that the running and the requested LFSCK types differ and returns -EOPNOTSUPP:

            int lfsck_start(const struct lu_env *env, struct dt_device *key,
                            struct lfsck_start_param *lsp)
            {
                    ...
                    if (!thread_is_init(thread) && !thread_is_stopped(thread)) {
                            rc = -EALREADY;
                            if (unlikely(start == NULL)) {
                                    spin_unlock(&lfsck->li_lock);
                                    GOTO(out, rc);
                            }
            
                            while (start->ls_active != 0) {
                                    if (!(type & start->ls_active)) {
                                            type <<= 1;
                                            continue;
                                    }
            
                                    com = __lfsck_component_find(lfsck, type,
                                                                 &lfsck->li_list_scan);
                                    if (com == NULL)
                                            com = __lfsck_component_find(lfsck, type,
                                                            &lfsck->li_list_double_scan);
                                    if (com == NULL) {
                                            rc = -EOPNOTSUPP;      /* <--- returns -EOPNOTSUPP here */
                                            break;
                                    }
                                    ...
                            }
                            ...
                    }
                    ...
            }
            

            The corresponding logs:

            00000020:00000001:1.0:1703243105.771867:0:217993:0:(tgt_handler.c:1621:tgt_handle_lfsck_notify()) Process entered
            00100000:00000001:0.0:1703243105.771867:0:228721:0:(lfsck_lib.c:3489:lfsck_in_notify()) Process entered
            00100000:00000001:1.0:1703243105.771868:0:217993:0:(lfsck_lib.c:3489:lfsck_in_notify()) Process entered
            00100000:00000001:0.0:1703243105.771868:0:228721:0:(lfsck_lib.c:3104:lfsck_start()) Process entered
            00100000:00000001:1.0:1703243105.771869:0:217993:0:(lfsck_lib.c:3104:lfsck_start()) Process entered
            00100000:00000001:1.0:1703243105.771873:0:217993:0:(lfsck_bookmark.c:107:lfsck_bookmark_store()) Process entered
            00080000:00000001:1.0:1703243105.771874:0:217993:0:(osd_handler.c:1912:osd_trans_create()) Process entered
            00080000:00000010:1.0:1703243105.771877:0:217993:0:(osd_handler.c:1927:osd_trans_create()) kmalloced '(oh)': 288 at 00000000a80bb82f.
            00100000:00000001:0.0:1703243105.771880:0:228721:0:(lfsck_lib.c:3174:lfsck_start()) Process leaving via out (rc=18446744073709551521 : -95 : 0xffffffffffffffa1)
            00100000:00000001:0.0:1703243105.771886:0:228721:0:(lfsck_lib.c:3546:lfsck_in_notify()) Process leaving (rc=18446744073709551521 : -95 : ffffffffffffffa1)
            00000020:00000001:0.0:1703243105.771887:0:228721:0:(tgt_handler.c:1629:tgt_handle_lfsck_notify()) Process leaving (rc=18446744073709551521 : -95 : ffffffffffffffa1)
            00080000:00000001:1.0:1703243105.771888:0:217993:0:(osd_handler.c:1955:osd_trans_create()) Process leaving (rc=18446619811807463424 : -124261902088192 : ffff8efc05778800)
            00010000:00000040:0.0:1703243105.771889:0:228721:0:(ldlm_lib.c:3238:target_committed_to_req()) last_committed 0, transno 0, xid 1785973155960256
            00010000:00000001:0.0:1703243105.771890:0:228721:0:(ldlm_lib.c:3307:target_send_reply()) Process entered
            00010000:00000200:0.0:1703243105.771892:0:228721:0:(ldlm_lib.c:3295:target_send_reply_msg()) @@@ sending reply  req@00000000c24bd5b0 x1785973155960256/t0(0) o1101->lustre-MDT0000-mdtlov_UUID@10.240.38.25@tcp:111/0 lens 320/224 e 0 to 0 dl 1703243116 ref 1 fl Interpret:/200/0 rc -95/0 job:'lctl.0' uid:0 gid:0
            
            hongchao.zhang Hongchao Zhang added the comment above

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53544/
            Subject: LU-17385 tests: always_except sanity-lfsck/24
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 955e38051765609fe3a661035c0fab2cfca733ce

            gerrit Gerrit Updater added the comment above

            adilger Andreas Dilger added a comment (edited):

            It looks like MDT0000 has "finished" the LFSCK run, but with an error:

             

            status: partial
            flags: incomplete

            It isn't clear from the test output why it is "partial". It doesn't look like waiting longer (600s) in wait_all_targets_blocked() is better than using "-w" in this case, because the LFSCK threads are all finished, but with an error.

            Sorry, a typo on my part, and fixed in my comment. It is the new test that landed which caused the problem. 

            adilger Andreas Dilger added the comment above

            > It looks like this test failure was introduced by patch https://review.whamcloud.com/50998 "LU-16826 tests: lfsck to repair a dangling remote entry" landing on 2023-12-20 which added sanity-lfsck.sh test_23c, but used:

            No, the patch adds test_23d; test_23c is an older test with a similar name:

            run_test 23c "LFSCK can repair dangling name entry (3)"
            run_test 23d "LFSCK can repair a dangling name entry to a remote object"

            zam Alexander Zarochentsev added the comment above

            gerrit Gerrit Updater added a comment (edited):

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53545
            Subject: LU-17385 revert: LU-16826 tests: lfsck to repair a dangling remote entry
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: fb6c848ef816ecb17f02ac461c2352ced320c593

            This is failing 22/62 runs since the LU-16826 test case landed.  I don't see anything obvious in the test logs, like an MDT reconnecting in test_26/test_27 after it was stopped/started in test_23d, so I added some more debugging to see why this is failing.

            I'll also push a revert of the patch that added test_23d and confirm that this stops the problem from being hit, and we'll have it ready if there is no quick solution.

            adilger Andreas Dilger added the comment above

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53544
            Subject: LU-17385 tests: add sanity-lfsck/24 debugging
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 94f62d0d5bea764b3b0287662384a524283dd419

            gerrit Gerrit Updater added the comment above

            adilger Andreas Dilger added a comment (edited):

            It looks like this test failure was introduced by patch https://review.whamcloud.com/50998 "LU-16826 tests: lfsck to repair a dangling remote entry" landing on 2023-12-20 which added sanity-lfsck.sh test_23d, but used:

            Test-Parameters: trivial testlist=sanity-lfsck ... env=ONLY=23d
            

            so it is likely leaving the filesystem in a bad state after test_23d finished and this causes test_24 and test_26a to also fail.

            "Hongchao Zhang <hongchao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53530
            Subject: EX-8860 lfsck: debug patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 17251801b1cf5516132edebd6677e2f34fcbc61c

            gerrit Gerrit Updater added the comment above

            People

              zam Alexander Zarochentsev
              maloo Maloo