LU-6992: recovery-random-scale test_fail_client_mds: test_fail_client_mds returned 254

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: Lustre 2.8.0
    • Fix Version/s: Lustre 2.8.0
    • Environment: client and server: lustre-master build #3120 RHEL7.1

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/bd81fa12-35e2-11e5-8c30-5254006e85c2.

      The sub-test test_fail_client_mds failed with the following error:

      test_fail_client_mds returned 254
      

      The MDS (onyx-40vm3) console shows the following messages after rebooting:
      https://testing.hpdd.intel.com/test_logs/bfaf0230-35e2-11e5-8c30-5254006e85c2/show_text

      10:36:13:[  102.850823] LDISKFS-fs warning (device dm-0): ldiskfs_multi_mount_protect:318: MMP interval 42 higher than expected, please wait.
      10:36:13:[  102.850823] 
      10:36:13:[  145.221059] LDISKFS-fs (dm-0): recovery complete
      10:36:13:[  145.245767] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache
      10:36:13:[  146.012109] LustreError: 140-5: Server lustre-MDT0000 requested index 0, but that index is already in use. Use --writeconf to force
      10:36:13:[  146.013180] LustreError: 1800:0:(mgs_handler.c:439:mgs_target_reg()) Failed to write lustre-MDT0000 log (-98)
      10:36:13:[  147.148490] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/li
      10:36:13:[  148.851377] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
      10:36:13:[  149.089600] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
      10:36:13:[  162.165215] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
      10:36:13:[  163.485841] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934593, ql: 1, comp: 1, conn: 2, next: 8589934596, next_update 0 last_committed: 8589934592)
      10:36:13:[  163.591162] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934597, ql: 1, comp: 1, conn: 2, next: 8589934598, next_update 0 last_committed: 8589934592)
      10:36:13:[  163.596731] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934599, ql: 1, comp: 1, conn: 2, next: 8589934608, next_update 0 last_committed: 8589934592)
      10:36:13:[  163.610726] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934615, ql: 1, comp: 1, conn: 2, next: 8589934616, next_update 0 last_committed: 8589934592)
      10:36:13:[  163.619367] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934618, ql: 1, comp: 1, conn: 2, next: 8589934621, next_update 0 last_committed: 8589934592)
      10:36:13:[  163.630728] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934625, ql: 1, comp: 1, conn: 2, next: 8589934626, next_update 0 last_committed: 8589934592)
      10:36:13:[  163.638126] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934628, ql: 1, comp: 1, conn: 2, next: 8589934631, next_update 0 last_committed: 8589934592)
      10:36:13:[  163.660532] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934644, ql: 1, comp: 1, conn: 2, next: 8589934645, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.669816] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934648, ql: 1, comp: 1, conn: 2, next: 8589934651, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.677050] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934652, ql: 1, comp: 1, conn: 2, next: 8589934653, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.687074] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934656, ql: 1, comp: 1, conn: 2, next: 8589934659, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.701600] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934661, ql: 1, comp: 1, conn: 2, next: 8589934662, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.715377] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934664, ql: 1, comp: 1, conn: 2, next: 8589934666, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.727055] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934670, ql: 1, comp: 1, conn: 2, next: 8589934673, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.737215] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934675, ql: 1, comp: 1, conn: 2, next: 8589934678, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.752852] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934682, ql: 1, comp: 1, conn: 2, next: 8589934683, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.770265] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934684, ql: 1, comp: 1, conn: 2, next: 8589934685, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.784502] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934687, ql: 1, comp: 1, conn: 2, next: 8589934690, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.797854] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934697, ql: 1, comp: 1, conn: 2, next: 8589934698, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.806316] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934699, ql: 1, comp: 1, conn: 2, next: 8589934701, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.821236] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934710, ql: 1, comp: 1, conn: 2, next: 8589934713, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.831932] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934716, ql: 1, comp: 1, conn: 2, next: 8589934717, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.841241] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934719, ql: 1, comp: 1, conn: 2, next: 8589934724, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.853399] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934727, ql: 1, comp: 1, conn: 2, next: 8589934730, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.863080] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934731, ql: 1, comp: 1, conn: 2, next: 8589934732, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.872376] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934733, ql: 1, comp: 1, conn: 2, next: 8589934735, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.882055] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934736, ql: 1, comp: 1, conn: 2, next: 8589934737, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.889204] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934738, ql: 1, comp: 1, conn: 2, next: 8589934741, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.897106] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934742, ql: 1, comp: 1, conn: 2, next: 8589934743, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.905043] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934744, ql: 1, comp: 1, conn: 2, next: 8589934747, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.912760] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934748, ql: 1, comp: 1, conn: 2, next: 8589934749, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.920070] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934750, ql: 1, comp: 1, conn: 2, next: 8589934752, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.937850] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934756, ql: 1, comp: 1, conn: 2, next: 8589934757, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.948097] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934758, ql: 1, comp: 1, conn: 2, next: 8589934759, next_update 0 last_committed: 8589934592)
      10:36:14:[  164.103266] Lustre: lustre-MDT0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted.
      10:36:14:[  164.572861] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 12 sec
      10:36:14:[  164.801625] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 12 sec
      10:37:36:[  167.599971] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 12 sec
      10:37:36:[  167.799548] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 12 sec
      10:37:36:[  170.826361] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  auster : @@@@@@ FAIL: import is not in FULL state 
      10:37:36:[  173.608102] Lustre: DEBUG MARKER: auster : @@@@@@ FAIL: import is not in FULL state
      10:37:36:[  176.622637] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /tmp/test_logs/1438165590/auster..debug_log.$(hostname -s).1438166178.log;
      10:37:36:[  176.622637]          dmesg > /tmp/test_logs/1438165590/auster..dmesg.$(hostname -s).1438166178.log
      

      Activity

            What about using $FSNAME.OST0000 ("." instead of "-") for get_param so that it always matches? Or won't it work properly if the MDS isn't fully configured?

            adilger Andreas Dilger added a comment
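The point about the separator can be illustrated with a plain glob (a hypothetical helper for illustration only, not actual test-framework.sh code): a character class that accepts either ":" or "-" matches the target label in both its transitional and final forms, so a check using it never races the rename.

```shell
# Hypothetical illustration: the device label is "lustre:MDT0000" until the
# target first registers, then becomes "lustre-MDT0000". A glob character
# class accepts both forms.
matches_mdt0000() {
    case "$1" in
        lustre[-:]MDT0000) return 0 ;;
        *)                 return 1 ;;
    esac
}

matches_mdt0000 "lustre:MDT0000" && echo "transitional label matches"
matches_mdt0000 "lustre-MDT0000" && echo "permanent label matches"
```

Whether lctl get_param pattern matching behaves the same way when the MDS is not fully configured is exactly the open question in the comment above.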
            yujian Jian Yu added a comment -
            onyx-40vm7: error: get_param: */lustre:MDT0000/recovery_status: Found no match
            

            The server target device label was wrong. "lustre:MDT0000" should be "lustre-MDT0000".

            In mount_facet() in test-framework.sh, maybe we need to wait for the label on the device to change from "lustre:MDT0000" to "lustre-MDT0000" before running e2label?

            init_facet_vars() does something similar: it checks whether the label contains "ffff" and waits for it to become initialized.
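A minimal sketch of that wait (hypothetical helper names; in test-framework.sh the label reader would be "e2label $device") — poll until the transitional ":" separator has been replaced by "-":

```shell
# Hypothetical sketch of the proposed wait: poll a label-reading command until
# the label has flipped to its permanent "fsname-target" form. The reader
# command is passed in as a parameter so the helper is easy to exercise with
# a stub; real code would pass something that invokes e2label.
wait_for_permanent_label() {
    local read_label=$1 timeout=${2:-60} label
    while (( timeout-- > 0 )); do
        label=$($read_label 2>/dev/null)
        if [[ $label == *-* && $label != *:* ]]; then
            echo "$label"
            return 0
        fi
        sleep 1
    done
    return 1
}

# Example with a stub that returns the permanent label immediately:
stub_label() { echo "lustre-MDT0000"; }
wait_for_permanent_label stub_label 5   # -> lustre-MDT0000
```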


            Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/16430
            Subject: LU-6992 test: debug patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4807567bf58a62cdde0536d7a4a4c6d5213fca93

            gerrit Gerrit Updater added a comment

            This should be the same problem as LU-6890, LU-1875, and LU-2415.

            hongchao.zhang Hongchao Zhang added a comment
            green Oleg Drokin added a comment -

            Because it is RHEL7.1, there is an active TEI issue.
            If you need vm3 logs, go to the session and enter the "lustre-initialization" step before the test you are interested in; the console logs for all subsequent tests are there.


            The client connected to "onyx-40vm3" to check whether recovery had completed:

            affected facets: mds1
            CMD: onyx-40vm7 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh _wait_recovery_complete *.lustre:MDT0000.recovery_status 1475 
            onyx-40vm7: error: get_param: */lustre:MDT0000/recovery_status: Found no match
            mds1 recovery is not completed!
            

            There is also no "onyx-40vm3" in the report at https://testing.hpdd.intel.com/test_sets/bd81fa12-35e2-11e5-8c30-5254006e85c2 — could it be a TEI problem?
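One way to reconcile the mismatch shown in the failing command above (a sketch under the assumption that labels stay in the "fsname:target" form until first registration, not the actual fix) is to normalize the transitional label to the obd device name form before building the parameter pattern:

```shell
# Sketch only: map a possibly transitional label ("lustre:MDT0000") to the
# obd device name form ("lustre-MDT0000") that lctl get_param patterns use,
# then build the recovery_status pattern from it.
label2obdname() {
    echo "${1/:/-}"
}

recovery_param() {
    echo "*.$(label2obdname "$1").recovery_status"
}

recovery_param "lustre:MDT0000"   # -> *.lustre-MDT0000.recovery_status
```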

            hongchao.zhang Hongchao Zhang added a comment
            pjones Peter Jones added a comment -

            Hongchao

            Could you please assist with this issue?

            thanks

            Peter


            The check_for_next_transno() messages should be fixed or quieted in some manner, since so many error messages cannot be useful (they were also spewing in LU-5805 and several other bugs).

            It would also be useful to print the NID in the "140-5: Server lustre-MDT0000 requested index 0, but that index is already in use" message, since it isn't clear whether the MDT0000 configuration request is coming from some spurious node.

            adilger Andreas Dilger added a comment
            pjones Peter Jones added a comment -

            Mike

            Could you please advise?

            Thanks

            Peter


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 13