sanity-lfsck test_26a: only 3 of 4 MDTs are in completed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.16.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/e80cc085-ac08-4f47-b354-22551a7da132

      test_26a failed with the following error:

      (7) only 3 of 4 MDTs are in completed
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master-patchless/840 - 4.18.0-425.10.1.el8_7.x86_64
      servers: https://build.whamcloud.com/job/lustre-master-patchless/840 - 4.18.0-425.10.1.el8_7.x86_64

      First started on 2023-12-20 for full runs; may be related to a recent patch landing.

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lfsck test_26a - (7) only 3 of 4 MDTs are in completed

          Activity

            [LU-17385] sanity-lfsck test_26a: only 3 of 4 MDTs are in completed
            pjones Peter Jones added a comment -

            Merged for 2.16


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53591/
            Subject: LU-17385 tests: sanity-lfsck 23d fix and enable
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 028ed64d90cfdeb908fb5574aacf2f71c259e2c2

            gerrit Gerrit Updater added a comment -

            "Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53591
            Subject: LU-17385 tests: sanity-lfsck 23d fix and enable
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2f4f656947703d8b44a7ea49a8b2c84020591307

            zam Alexander Zarochentsev added a comment -

            hongchao.zhang, thanks for the analysis. I am submitting a patch to test the idea.

            adilger Andreas Dilger added a comment -

            The sanity-lfsck test_23d is now skipped, so any patch that fixes it needs to remove the always_except line.

            hongchao.zhang Hongchao Zhang added a comment -

            The LFSCK command used in test_23d to start the layout LFSCK on MDT0000 uses the "-o" option, which broadcasts the LFSCK to all MDTs:

            int jt_lfsck_start(int argc, char **argv)
            {
                            ...
                            case 'o':
                                    start.ls_flags |= LPF_ALL_TGT | LPF_BROADCAST |
                                                      LPF_OST_ORPHAN;
                                    break;
                            ...
            }

            The LFSCK command used in test_24 to start the namespace LFSCK is also sent to all MDTs. If some MDT has not yet completed the previous layout LFSCK, it finds that the two LFSCKs differ and returns -EOPNOTSUPP:

            int lfsck_start(const struct lu_env *env, struct dt_device *key,
                            struct lfsck_start_param *lsp)
            {
                    ...
                    if (!thread_is_init(thread) && !thread_is_stopped(thread)) {
                            rc = -EALREADY;
                            if (unlikely(start == NULL)) {
                                    spin_unlock(&lfsck->li_lock);
                                    GOTO(out, rc);
                            }

                            while (start->ls_active != 0) {
                                    if (!(type & start->ls_active)) {
                                            type <<= 1;
                                            continue;
                                    }

                                    com = __lfsck_component_find(lfsck, type,
                                                                 &lfsck->li_list_scan);
                                    if (com == NULL)
                                            com = __lfsck_component_find(lfsck, type,
                                                            &lfsck->li_list_double_scan);
                                    if (com == NULL) {
                                            rc = -EOPNOTSUPP;      /* <--- returns with error -EOPNOTSUPP */
                                            break;
                                    }
                                    ...
                            }
                            ...
                    }
                    ...
            }

            The corresponding logs:

            00000020:00000001:1.0:1703243105.771867:0:217993:0:(tgt_handler.c:1621:tgt_handle_lfsck_notify()) Process entered
            00100000:00000001:0.0:1703243105.771867:0:228721:0:(lfsck_lib.c:3489:lfsck_in_notify()) Process entered
            00100000:00000001:1.0:1703243105.771868:0:217993:0:(lfsck_lib.c:3489:lfsck_in_notify()) Process entered
            00100000:00000001:0.0:1703243105.771868:0:228721:0:(lfsck_lib.c:3104:lfsck_start()) Process entered
            00100000:00000001:1.0:1703243105.771869:0:217993:0:(lfsck_lib.c:3104:lfsck_start()) Process entered
            00100000:00000001:1.0:1703243105.771873:0:217993:0:(lfsck_bookmark.c:107:lfsck_bookmark_store()) Process entered
            00080000:00000001:1.0:1703243105.771874:0:217993:0:(osd_handler.c:1912:osd_trans_create()) Process entered
            00080000:00000010:1.0:1703243105.771877:0:217993:0:(osd_handler.c:1927:osd_trans_create()) kmalloced '(oh)': 288 at 00000000a80bb82f.
            00100000:00000001:0.0:1703243105.771880:0:228721:0:(lfsck_lib.c:3174:lfsck_start()) Process leaving via out (rc=18446744073709551521 : -95 : 0xffffffffffffffa1)
            00100000:00000001:0.0:1703243105.771886:0:228721:0:(lfsck_lib.c:3546:lfsck_in_notify()) Process leaving (rc=18446744073709551521 : -95 : ffffffffffffffa1)
            00000020:00000001:0.0:1703243105.771887:0:228721:0:(tgt_handler.c:1629:tgt_handle_lfsck_notify()) Process leaving (rc=18446744073709551521 : -95 : ffffffffffffffa1)
            00080000:00000001:1.0:1703243105.771888:0:217993:0:(osd_handler.c:1955:osd_trans_create()) Process leaving (rc=18446619811807463424 : -124261902088192 : ffff8efc05778800)
            00010000:00000040:0.0:1703243105.771889:0:228721:0:(ldlm_lib.c:3238:target_committed_to_req()) last_committed 0, transno 0, xid 1785973155960256
            00010000:00000001:0.0:1703243105.771890:0:228721:0:(ldlm_lib.c:3307:target_send_reply()) Process entered
            00010000:00000200:0.0:1703243105.771892:0:228721:0:(ldlm_lib.c:3295:target_send_reply_msg()) @@@ sending reply  req@00000000c24bd5b0 x1785973155960256/t0(0) o1101->lustre-MDT0000-mdtlov_UUID@10.240.38.25@tcp:111/0 lens 320/224 e 0 to 0 dl 1703243116 ref 1 fl Interpret:/200/0 rc -95/0 job:'lctl.0' uid:0 gid:0
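
The start-while-running check analysed in this comment can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct model_lfsck`, `model_lfsck_start()`, `model_rc_as_u64()` and the `MODEL_TYPE_*` bit values are hypothetical names invented for illustration and are not the kernel's `lfsck_start()`. The model assumes one bit per LFSCK type and a single mask of components that are currently scanning. The helper at the end shows how a negative return code such as -95 ends up printed as 18446744073709551521 in the debug log when viewed as an unsigned 64-bit value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type bits for this model only; not the kernel's values. */
#define MODEL_TYPE_LAYOUT    0x1u
#define MODEL_TYPE_NAMESPACE 0x2u

/* Hypothetical per-target state, reduced to what the check needs. */
struct model_lfsck {
	int running;            /* non-zero while an LFSCK thread is active */
	unsigned int scanning;  /* bitmask of component types being scanned */
};

/*
 * Simplified start request. If nothing is running, the request is
 * accepted. If an LFSCK is already running, every requested type must
 * match a component that is currently scanning; otherwise the request
 * is for a different LFSCK and is rejected with -EOPNOTSUPP.
 */
int model_lfsck_start(struct model_lfsck *l, unsigned int requested)
{
	unsigned int type;

	assert(l != NULL);

	if (!l->running) {
		l->running = 1;
		l->scanning = requested;
		return 0;
	}

	for (type = 1; type != 0 && requested != 0; type <<= 1) {
		if (!(requested & type))
			continue;
		if (!(l->scanning & type))
			return -EOPNOTSUPP; /* a different LFSCK is still running */
		requested &= ~type;
	}
	return -EALREADY; /* the same LFSCK is already running */
}

/* Sign-extends a return code to 64 bits and reinterprets it as unsigned. */
uint64_t model_rc_as_u64(int rc)
{
	return (uint64_t)(int64_t)rc;
}
```

In this model, starting a layout scan on an idle target returns 0, repeating the same request returns -EALREADY, and requesting a namespace scan while the layout scan is still active returns -EOPNOTSUPP, which is the situation the analysis attributes to the MDTs that had not yet finished the LFSCK broadcast by test_23d.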

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53544/
            Subject: LU-17385 tests: always_except sanity-lfsck/24
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 955e38051765609fe3a661035c0fab2cfca733ce

            adilger Andreas Dilger added a comment - - edited

            It looks like MDT0000 has "finished" the LFSCK run, but with an error:

             

            status: partial
            flags: incomplete

            It isn't clear from the test output why it is "partial". It doesn't look like waiting longer (600s) in wait_all_targets_blocked() is better than using "-w" in this case, because the LFSCK threads are all finished, but with an error.
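
The reasoning above (one target is finished but not "completed", so a longer wait cannot help) can be sketched as a small check. This is a hedged illustration, not the test framework's code: `model_status_is_final()`, `model_count_completed()` and `model_waiting_can_help()` are hypothetical helpers, and treating only "scanning-phase1" and "scanning-phase2" as in-progress status names is an assumption made for this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the status means the LFSCK thread is no longer running. */
int model_status_is_final(const char *status)
{
	/* Assumed in-progress status names; anything else counts as final. */
	static const char *const in_progress[] = {
		"scanning-phase1", "scanning-phase2",
	};
	size_t i;

	for (i = 0; i < sizeof(in_progress) / sizeof(in_progress[0]); i++)
		if (strcmp(status, in_progress[i]) == 0)
			return 0;
	return 1;
}

/* Counts targets whose status is exactly "completed". */
int model_count_completed(const char *const *statuses, int n)
{
	int i, count = 0;

	for (i = 0; i < n; i++)
		if (strcmp(statuses[i], "completed") == 0)
			count++;
	return count;
}

/* Returns 1 if at least one target is still scanning, 0 otherwise. */
int model_waiting_can_help(const char *const *statuses, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (!model_status_is_final(statuses[i]))
			return 1;
	return 0;
}
```

With four targets reporting "partial", "completed", "completed" and "completed", the completed count is 3 of 4 and no target is still scanning, so under this model neither a longer timeout nor "-w" would change the result.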


            adilger Andreas Dilger added a comment -

            Sorry, a typo on my part, and fixed in my comment. It is the new test that landed which caused the problem.

            zam Alexander Zarochentsev added a comment -

            > It looks like this test failure was introduced by patch https://review.whamcloud.com/50998 "LU-16826 tests: lfsck to repair a dangling remote entry" landing on 2023-12-20 which added sanity-lfsck.sh test_23c, but used:

            No, the patch adds test_23d; test_23c is an older test with a similar name:

            run_test 23c "LFSCK can repair dangling name entry (3)"
            run_test 23d "LFSCK can repair a dangling name entry to a remote object"

            gerrit Gerrit Updater added a comment - - edited

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53545
            Subject: LU-17385 revert: LU-16826 tests: lfsck to repair a dangling remote entry
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: fb6c848ef816ecb17f02ac461c2352ced320c593


            People

              Assignee: zam Alexander Zarochentsev
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: