Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6992

recovery-random-scale test_fail_client_mds: test_fail_client_mds returned 254

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.8.0
    • Lustre 2.8.0
    • client and server: lustre-master build #3120 RHEL7.1
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/bd81fa12-35e2-11e5-8c30-5254006e85c2.

      The sub-test test_fail_client_mds failed with the following error:

      test_fail_client_mds returned 254
      

      MDS(onyx-40vm3) console shows following messages after rebooting
      https://testing.hpdd.intel.com/test_logs/bfaf0230-35e2-11e5-8c30-5254006e85c2/show_text

      10:36:13:[  102.850823] LDISKFS-fs warning (device dm-0): ldiskfs_multi_mount_protect:318: MMP interval 42 higher than expected, please wait.
      10:36:13:[  102.850823] 
      10:36:13:[  145.221059] LDISKFS-fs (dm-0): recovery complete
      10:36:13:[  145.245767] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache
      10:36:13:[  146.012109] LustreError: 140-5: Server lustre-MDT0000 requested index 0, but that index is already in use. Use --writeconf to force
      10:36:13:[  146.013180] LustreError: 1800:0:(mgs_handler.c:439:mgs_target_reg()) Failed to write lustre-MDT0000 log (-98)
      10:36:13:[  147.148490] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/li
      10:36:13:[  148.851377] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
      10:36:13:[  149.089600] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
      10:36:13:[  162.165215] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
      10:36:13:[  163.485841] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934593, ql: 1, comp: 1, conn: 2, next: 8589934596, next_update 0 last_committed: 8589934592)
      10:36:13:[  163.591162] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934597, ql: 1, comp: 1, conn: 2, next: 8589934598, next_update 0 last_committed: 8589934592)
      10:36:13:[  163.596731] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934599, ql: 1, comp: 1, conn: 2, next: 8589934608, next_update 0 last_committed: 8589934592)
      10:36:13:[  163.610726] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934615, ql: 1, comp: 1, conn: 2, next: 8589934616, next_update 0 last_committed: 8589934592)
      10:36:13:[  163.619367] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934618, ql: 1, comp: 1, conn: 2, next: 8589934621, next_update 0 last_committed: 8589934592)
      10:36:13:[  163.630728] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934625, ql: 1, comp: 1, conn: 2, next: 8589934626, next_update 0 last_committed: 8589934592)
      10:36:13:[  163.638126] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934628, ql: 1, comp: 1, conn: 2, next: 8589934631, next_update 0 last_committed: 8589934592)
      10:36:13:[  163.660532] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934644, ql: 1, comp: 1, conn: 2, next: 8589934645, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.669816] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934648, ql: 1, comp: 1, conn: 2, next: 8589934651, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.677050] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934652, ql: 1, comp: 1, conn: 2, next: 8589934653, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.687074] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934656, ql: 1, comp: 1, conn: 2, next: 8589934659, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.701600] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934661, ql: 1, comp: 1, conn: 2, next: 8589934662, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.715377] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934664, ql: 1, comp: 1, conn: 2, next: 8589934666, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.727055] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934670, ql: 1, comp: 1, conn: 2, next: 8589934673, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.737215] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934675, ql: 1, comp: 1, conn: 2, next: 8589934678, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.752852] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934682, ql: 1, comp: 1, conn: 2, next: 8589934683, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.770265] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934684, ql: 1, comp: 1, conn: 2, next: 8589934685, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.784502] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934687, ql: 1, comp: 1, conn: 2, next: 8589934690, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.797854] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934697, ql: 1, comp: 1, conn: 2, next: 8589934698, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.806316] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934699, ql: 1, comp: 1, conn: 2, next: 8589934701, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.821236] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934710, ql: 1, comp: 1, conn: 2, next: 8589934713, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.831932] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934716, ql: 1, comp: 1, conn: 2, next: 8589934717, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.841241] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934719, ql: 1, comp: 1, conn: 2, next: 8589934724, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.853399] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934727, ql: 1, comp: 1, conn: 2, next: 8589934730, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.863080] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934731, ql: 1, comp: 1, conn: 2, next: 8589934732, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.872376] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934733, ql: 1, comp: 1, conn: 2, next: 8589934735, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.882055] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934736, ql: 1, comp: 1, conn: 2, next: 8589934737, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.889204] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934738, ql: 1, comp: 1, conn: 2, next: 8589934741, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.897106] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934742, ql: 1, comp: 1, conn: 2, next: 8589934743, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.905043] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934744, ql: 1, comp: 1, conn: 2, next: 8589934747, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.912760] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934748, ql: 1, comp: 1, conn: 2, next: 8589934749, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.920070] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934750, ql: 1, comp: 1, conn: 2, next: 8589934752, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.937850] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934756, ql: 1, comp: 1, conn: 2, next: 8589934757, next_update 0 last_committed: 8589934592)
      10:36:14:[  163.948097] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934758, ql: 1, comp: 1, conn: 2, next: 8589934759, next_update 0 last_committed: 8589934592)
      10:36:14:[  164.103266] Lustre: lustre-MDT0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted.
      10:36:14:[  164.572861] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 12 sec
      10:36:14:[  164.801625] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 12 sec
      10:37:36:[  167.599971] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 12 sec
      10:37:36:[  167.799548] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 12 sec
      10:37:36:[  170.826361] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  auster : @@@@@@ FAIL: import is not in FULL state 
      10:37:36:[  173.608102] Lustre: DEBUG MARKER: auster : @@@@@@ FAIL: import is not in FULL state
      10:37:36:[  176.622637] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /tmp/test_logs/1438165590/auster..debug_log.$(hostname -s).1438166178.log;
      10:37:36:[  176.622637]          dmesg > /tmp/test_logs/1438165590/auster..dmesg.$(hostname -s).1438166178.log
      

      Attachments

        Issue Links

          Activity

            [LU-6992] recovery-random-scale test_fail_client_mds: test_fail_client_mds returned 254

            Landed for 2.8

            jgmitter Joseph Gmitter (Inactive) added a comment - Landed for 2.8

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16781/
            Subject: LU-6992 test: wait device to be registered
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d76285d649545379b2bbf10aa5bcd3fdf4c9b52a

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16781/ Subject: LU-6992 test: wait device to be registered Project: fs/lustre-release Branch: master Current Patch Set: Commit: d76285d649545379b2bbf10aa5bcd3fdf4c9b52a

            Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/16781
            Subject: LU-6992 test: wait device to be registered
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e40de978e757a412f46ddec39b1ea2f8d7e98852

            gerrit Gerrit Updater added a comment - Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/16781 Subject: LU-6992 test: wait device to be registered Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: e40de978e757a412f46ddec39b1ea2f8d7e98852
            pjones Peter Jones added a comment -

            Hongchao, could you please push a patch to try this out? Thanks!

            pjones Peter Jones added a comment - Hongchao, could you please push a patch to try this out? Thanks!

            Jian, it is OK to wait_update() in init_facet_vars() for the label to be updated.

            adilger Andreas Dilger added a comment - Jian, it is OK to wait_update() in init_facet_vars() for the label to be updated.
            yujian Jian Yu added a comment -

            Hi Andreas,

            In wait_recovery_complete():

                            local var_svc=${facet}_svc
                            local param="*.${!var_svc}.recovery_status"
            

            And ${facet}_svc was defined in mount_facet():

                    label=$(devicelabel ${facet} ${!dev})
                    [ -z "$label" ] && echo no label for ${!dev} && exit 1
                    eval export ${facet}_svc=${label}
                    echo Started ${label}
            

            So, we can not specify the label like $FSNAME.OST0000 directly.

            yujian Jian Yu added a comment - Hi Andreas, In wait_recovery_complete(): local var_svc=${facet}_svc local param= "*.${!var_svc}.recovery_status" And ${facet}_svc was defined in mount_facet(): label=$(devicelabel ${facet} ${!dev}) [ -z "$label" ] && echo no label for ${!dev} && exit 1 eval export ${facet}_svc=${label} echo Started ${label} So, we can not specify the label like $FSNAME.OST0000 directly.

            What about using $FSNAME.OST0000 ("." instead of "-") for get_param so that it always matches? Or won't it work properly if the MDS isn't fully configured?

            adilger Andreas Dilger added a comment - What about using $FSNAME.OST0000 ("." instead of "-") for get_param so that it always matches? Or won't it work properly if the MDS isn't fully configured?
            yujian Jian Yu added a comment -
            onyx-40vm7: error: get_param: */lustre:MDT0000/recovery_status: Found no match
            

            The server target device label was wrong. "lustre:MDT0000" should be "lustre-MDT0000".

            In mount_facet() in test-framework.sh, maybe we need wait for the label on device to be changed from "lustre:MDT0000" to "lustre-MDT0000", and then run e2label?

            There is a similar way in init_facet_vars() to check whether label contains ffff or not and wait for it becoming initialized.

            yujian Jian Yu added a comment - onyx-40vm7: error: get_param: */lustre:MDT0000/recovery_status: Found no match The server target device label was wrong. "lustre:MDT0000" should be "lustre-MDT0000". In mount_facet() in test-framework.sh, maybe we need wait for the label on device to be changed from "lustre:MDT0000" to "lustre-MDT0000", and then run e2label? There is a similar way in init_facet_vars() to check whether label contains ffff or not and wait for it becoming initialized.

            Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/16430
            Subject: LU-6992 test: debug patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4807567bf58a62cdde0536d7a4a4c6d5213fca93

            gerrit Gerrit Updater added a comment - Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/16430 Subject: LU-6992 test: debug patch Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 4807567bf58a62cdde0536d7a4a4c6d5213fca93

            this should be the same problem as LU-6890, LU-1875, LU-2415

            hongchao.zhang Hongchao Zhang added a comment - this should be the same problem as LU-6890 , LU-1875 , LU-2415

            People

              hongchao.zhang Hongchao Zhang
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: