Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-17385

sanity-lfsck test_26a: only 3 of 4 MDTs are in completed

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0
    • Lustre 2.16.0
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/e80cc085-ac08-4f47-b354-22551a7da132

      test_26a failed with the following error:

      (7) only 3 of 4 MDTs are in completed
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master-patchless/840 - 4.18.0-425.10.1.el8_7.x86_64
      servers: https://build.whamcloud.com/job/lustre-master-patchless/840 - 4.18.0-425.10.1.el8_7.x86_64

      <<Please provide additional information about the failure here>>

      First started on 2023-12-20 for full runs, may be related to recent patch landing.

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lfsck test_26a - (7) only 3 of 4 MDTs are in completed

      Attachments

        Issue Links

          Activity

            [LU-17385] sanity-lfsck test_26a: only 3 of 4 MDTs are in completed

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57687
            Subject: LU-17385 tests: always_except sanity-lfsck/24
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 34660438769d955968b65816e97e5f45f404198a

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57687 Subject: LU-17385 tests: always_except sanity-lfsck/24 Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 34660438769d955968b65816e97e5f45f404198a

            If I insert "sleep 10" after the "start mds1" before LFSCK start, the test completes successfully 10/10.

            thanks, let me try that..

            bzzz Alex Zhuravlev added a comment - If I insert "sleep 10" after the "start mds1" before LFSCK start, the test completes successfully 10/10. thanks, let me try that..

            >cmp: EOF on /mnt/lustre/d23d.sanity-lfsck/mdt1dir/foo which is empty

            it is a different issue, LFSCK reconnects the orphan inode into the namespace but somehow loses the layout. If I insert "sleep 10" after the "start mds1" before LFSCK start, the test completes successfully 10/10.

            zam Alexander Zarochentsev added a comment - >cmp: EOF on /mnt/lustre/d23d.sanity-lfsck/mdt1dir/foo which is empty it is a different issue, LFSCK reconnects the orphan inode into the namespace but somehow loses the layout. If I insert "sleep 10" after the "start mds1" before LFSCK start, the test completes successfully 10/10.
            bzzz Alex Zhuravlev added a comment - - edited

            23d still fail often locally:

            == sanity-lfsck test 23d: LFSCK can repair a dangling name entry to a remote object ========================================================== 04:50:54 (1712551854)
            Stopping /mnt/lustre-mds1 (opts:) on tmp.LJd1G0O9yk
            debugfs 1.46.2.wc5 (26-Mar-2022)
            
            Starting mds1: -o localrecov  /dev/mapper/mds1_flakey /mnt/lustre-mds1
            Started lustre-MDT0000
            cat: /mnt/lustre/d23d.sanity-lfsck/mdt1dir/foo: Bad address
            Started LFSCK on the device lustre-MDT0001: scrub namespace
            Started LFSCK on the device lustre-MDT0000: scrub layout
            cmp: EOF on /mnt/lustre/d23d.sanity-lfsck/mdt1dir/foo which is empty
             sanity-lfsck test_23d: @@@@@@ FAIL: file body has changed 
              Trace dump:
              = ./../tests/test-framework.sh:6975:error()
              = sanity-lfsck.sh:4252:test_23d()
              = ./../tests/test-framework.sh:7315:run_one()
              = ./../tests/test-framework.sh:7375:run_one_logged()
              = ./../tests/test-framework.sh:7201:run_test()
              = sanity-lfsck.sh:4254:main()
            Dumping lctl log to /tmp/ltest-logs/sanity-lfsck.test_23d.*.1712551889.log
            Dumping logs only on local client.
            FAIL 23d (36s)
            
            bzzz Alex Zhuravlev added a comment - - edited 23d still fail often locally: == sanity-lfsck test 23d: LFSCK can repair a dangling name entry to a remote object ========================================================== 04:50:54 (1712551854) Stopping /mnt/lustre-mds1 (opts:) on tmp.LJd1G0O9yk debugfs 1.46.2.wc5 (26-Mar-2022) Starting mds1: -o localrecov /dev/mapper/mds1_flakey /mnt/lustre-mds1 Started lustre-MDT0000 cat: /mnt/lustre/d23d.sanity-lfsck/mdt1dir/foo: Bad address Started LFSCK on the device lustre-MDT0001: scrub namespace Started LFSCK on the device lustre-MDT0000: scrub layout cmp: EOF on /mnt/lustre/d23d.sanity-lfsck/mdt1dir/foo which is empty sanity-lfsck test_23d: @@@@@@ FAIL: file body has changed Trace dump: = ./../tests/test-framework.sh:6975:error() = sanity-lfsck.sh:4252:test_23d() = ./../tests/test-framework.sh:7315:run_one() = ./../tests/test-framework.sh:7375:run_one_logged() = ./../tests/test-framework.sh:7201:run_test() = sanity-lfsck.sh:4254:main() Dumping lctl log to /tmp/ltest-logs/sanity-lfsck.test_23d.*.1712551889.log Dumping logs only on local client. FAIL 23d (36s)
            pjones Peter Jones added a comment -

            Merged for 2.16

            pjones Peter Jones added a comment - Merged for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53591/
            Subject: LU-17385 tests: sanity-lfsck 23d fix and enable
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 028ed64d90cfdeb908fb5574aacf2f71c259e2c2

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53591/ Subject: LU-17385 tests: sanity-lfsck 23d fix and enable Project: fs/lustre-release Branch: master Current Patch Set: Commit: 028ed64d90cfdeb908fb5574aacf2f71c259e2c2

            "Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53591
            Subject: LU-17385 tests: sanity-lfsck 23d fix and enable
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2f4f656947703d8b44a7ea49a8b2c84020591307

            gerrit Gerrit Updater added a comment - "Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53591 Subject: LU-17385 tests: sanity-lfsck 23d fix and enable Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 2f4f656947703d8b44a7ea49a8b2c84020591307

            hongchao.zhang, thanks for the analysis, I am submitting a patch to test the idea.

            zam Alexander Zarochentsev added a comment - hongchao.zhang , thanks for the analysis, I am submitting a patch to test the idea.

            The sanity-lfsck test_23d is now skipped, so any patch that is fixing it needs to remove the always_except line.

            adilger Andreas Dilger added a comment - The sanity-lfsck test_23d is now skipped, so any patch that is fixing it needs to remove the always_except line.

            The LFSCK command used in test_23d to start layout LFSCK on MDT0000 uses option "-o", which will broadcast LFSCK to all MDTs

            int jt_lfsck_start(int argc, char **argv)
            {
                            ...
                            case 'o':
                                    start.ls_flags |= LPF_ALL_TGT | LPF_BROADCAST |         
                                                      LPF_OST_ORPHAN;
                                    break;
                            ...
            }
            

            the LFSCK command used in test_24 to start namespace LFSCK is also sent to all MDTs, but some MDT could not complete
            the previous layout LFSCK yet, then it will find the two LFSCKs are different and return -EOPNOTSUPP

            int lfsck_start(const struct lu_env *env, struct dt_device *key,
                            struct lfsck_start_param *lsp)
            {
                    ...
                    if (!thread_is_init(thread) && !thread_is_stopped(thread)) {
                            rc = -EALREADY;
                            if (unlikely(start == NULL)) {
                                    spin_unlock(&lfsck->li_lock);
                                    GOTO(out, rc);
                            }
            
                            while (start->ls_active != 0) {
                                    if (!(type & start->ls_active)) {
                                            type <<= 1;
                                            continue;
                                    }
            
                                    com = __lfsck_component_find(lfsck, type,
                                                                 &lfsck->li_list_scan);
                                    if (com == NULL)
                                            com = __lfsck_component_find(lfsck, type,
                                                            &lfsck->li_list_double_scan);
                                    if (com == NULL) {
                                            rc = -EOPNOTSUPP;      <--------- return with error -EOPNOTSUPP
                                            break;
                                    }
                                    ...
                            }
                            ...
                    }
                    ...
            }
            

            the corresponding logs

            00000020:00000001:1.0:1703243105.771867:0:217993:0:(tgt_handler.c:1621:tgt_handle_lfsck_notify()) Process entered
            00100000:00000001:0.0:1703243105.771867:0:228721:0:(lfsck_lib.c:3489:lfsck_in_notify()) Process entered
            00100000:00000001:1.0:1703243105.771868:0:217993:0:(lfsck_lib.c:3489:lfsck_in_notify()) Process entered
            00100000:00000001:0.0:1703243105.771868:0:228721:0:(lfsck_lib.c:3104:lfsck_start()) Process entered
            00100000:00000001:1.0:1703243105.771869:0:217993:0:(lfsck_lib.c:3104:lfsck_start()) Process entered
            00100000:00000001:1.0:1703243105.771873:0:217993:0:(lfsck_bookmark.c:107:lfsck_bookmark_store()) Process entered
            00080000:00000001:1.0:1703243105.771874:0:217993:0:(osd_handler.c:1912:osd_trans_create()) Process entered
            00080000:00000010:1.0:1703243105.771877:0:217993:0:(osd_handler.c:1927:osd_trans_create()) kmalloced '(oh)': 288 at 00000000a80bb82f.
            00100000:00000001:0.0:1703243105.771880:0:228721:0:(lfsck_lib.c:3174:lfsck_start()) Process leaving via out (rc=18446744073709551521 : -95 : 0xffffffffffffffa1)
            00100000:00000001:0.0:1703243105.771886:0:228721:0:(lfsck_lib.c:3546:lfsck_in_notify()) Process leaving (rc=18446744073709551521 : -95 : ffffffffffffffa1)
            00000020:00000001:0.0:1703243105.771887:0:228721:0:(tgt_handler.c:1629:tgt_handle_lfsck_notify()) Process leaving (rc=18446744073709551521 : -95 : ffffffffffffffa1)
            00080000:00000001:1.0:1703243105.771888:0:217993:0:(osd_handler.c:1955:osd_trans_create()) Process leaving (rc=18446619811807463424 : -124261902088192 : ffff8efc05778800)
            00010000:00000040:0.0:1703243105.771889:0:228721:0:(ldlm_lib.c:3238:target_committed_to_req()) last_committed 0, transno 0, xid 1785973155960256
            00010000:00000001:0.0:1703243105.771890:0:228721:0:(ldlm_lib.c:3307:target_send_reply()) Process entered
            00010000:00000200:0.0:1703243105.771892:0:228721:0:(ldlm_lib.c:3295:target_send_reply_msg()) @@@ sending reply  req@00000000c24bd5b0 x1785973155960256/t0(0) o1101->lustre-MDT0000-mdtlov_UUID@10.240.38.25@tcp:111/0 lens 320/224 e 0 to 0 dl 1703243116 ref 1 fl Interpret:/200/0 rc -95/0 job:'lctl.0' uid:0 gid:0
            
            hongchao.zhang Hongchao Zhang added a comment - The LFSCK command used in test_23d to start layout LFSCK on MDT0000 uses option "-o", which will broadcast LFSCK to all MDTs int jt_lfsck_start(int argc, char **argv) { ... case 'o': start.ls_flags |= LPF_ALL_TGT | LPF_BROADCAST | LPF_OST_ORPHAN; break; ... } the LFSCK command used in test_24 to start namespace LFSCK is also sent to all MDTs, but some MDT could not complete the previous layout LFSCK yet, then it will find the two LFSCKs are different and return -EOPNOTSUPP int lfsck_start(const struct lu_env *env, struct dt_device *key, struct lfsck_start_param *lsp) { ... if (!thread_is_init(thread) && !thread_is_stopped(thread)) { rc = -EALREADY; if (unlikely(start == NULL)) { spin_unlock(&lfsck->li_lock); GOTO(out, rc); } while (start->ls_active != 0) { if (!(type & start->ls_active)) { type <<= 1; continue; } com = __lfsck_component_find(lfsck, type, &lfsck->li_list_scan); if (com == NULL) com = __lfsck_component_find(lfsck, type, &lfsck->li_list_double_scan); if (com == NULL) { rc = -EOPNOTSUPP; <--------- return with error -EOPNOTSUPP break; } ... } ... } ... } the corresponding logs 00000020:00000001:1.0:1703243105.771867:0:217993:0:(tgt_handler.c:1621:tgt_handle_lfsck_notify()) Process entered 00100000:00000001:0.0:1703243105.771867:0:228721:0:(lfsck_lib.c:3489:lfsck_in_notify()) Process entered 00100000:00000001:1.0:1703243105.771868:0:217993:0:(lfsck_lib.c:3489:lfsck_in_notify()) Process entered 00100000:00000001:0.0:1703243105.771868:0:228721:0:(lfsck_lib.c:3104:lfsck_start()) Process entered 00100000:00000001:1.0:1703243105.771869:0:217993:0:(lfsck_lib.c:3104:lfsck_start()) Process entered 00100000:00000001:1.0:1703243105.771873:0:217993:0:(lfsck_bookmark.c:107:lfsck_bookmark_store()) Process entered 00080000:00000001:1.0:1703243105.771874:0:217993:0:(osd_handler.c:1912:osd_trans_create()) Process entered 00080000:00000010:1.0:1703243105.771877:0:217993:0:(osd_handler.c:1927:osd_trans_create()) kmalloced '(oh)': 288 at 00000000a80bb82f. 00100000:00000001:0.0:1703243105.771880:0:228721:0:(lfsck_lib.c:3174:lfsck_start()) Process leaving via out (rc=18446744073709551521 : -95 : 0xffffffffffffffa1) 00100000:00000001:0.0:1703243105.771886:0:228721:0:(lfsck_lib.c:3546:lfsck_in_notify()) Process leaving (rc=18446744073709551521 : -95 : ffffffffffffffa1) 00000020:00000001:0.0:1703243105.771887:0:228721:0:(tgt_handler.c:1629:tgt_handle_lfsck_notify()) Process leaving (rc=18446744073709551521 : -95 : ffffffffffffffa1) 00080000:00000001:1.0:1703243105.771888:0:217993:0:(osd_handler.c:1955:osd_trans_create()) Process leaving (rc=18446619811807463424 : -124261902088192 : ffff8efc05778800) 00010000:00000040:0.0:1703243105.771889:0:228721:0:(ldlm_lib.c:3238:target_committed_to_req()) last_committed 0, transno 0, xid 1785973155960256 00010000:00000001:0.0:1703243105.771890:0:228721:0:(ldlm_lib.c:3307:target_send_reply()) Process entered 00010000:00000200:0.0:1703243105.771892:0:228721:0:(ldlm_lib.c:3295:target_send_reply_msg()) @@@ sending reply req@00000000c24bd5b0 x1785973155960256/t0(0) o1101->lustre-MDT0000-mdtlov_UUID@10.240.38.25@tcp:111/0 lens 320/224 e 0 to 0 dl 1703243116 ref 1 fl Interpret:/200/0 rc -95/0 job:'lctl.0' uid:0 gid:0

            People

              zam Alexander Zarochentsev
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: