[LU-6992] recovery-random-scale test_fail_client_mds: test_fail_client_mds returned 254 Created: 12/Aug/15  Updated: 19/Nov/16  Resolved: 02/Nov/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: p4hc
Environment:

client and server: lustre-master build #3120 RHEL7.1


Issue Links:
Duplicate
is duplicated by LU-6725 replay-single test_0a: FAIL: import i... Resolved
Related
is related to LU-6890 recovery-mds-scale test_failover_mds:... Open
is related to LU-7428 conf-sanity test_84, replay-dual 0a: ... Resolved
is related to LU-2415 recovery-mds-scale test_failover_mds:... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/bd81fa12-35e2-11e5-8c30-5254006e85c2.

The sub-test test_fail_client_mds failed with the following error:

test_fail_client_mds returned 254

MDS(onyx-40vm3) console shows following messages after rebooting
https://testing.hpdd.intel.com/test_logs/bfaf0230-35e2-11e5-8c30-5254006e85c2/show_text

10:36:13:[  102.850823] LDISKFS-fs warning (device dm-0): ldiskfs_multi_mount_protect:318: MMP interval 42 higher than expected, please wait.
10:36:13:[  102.850823] 
10:36:13:[  145.221059] LDISKFS-fs (dm-0): recovery complete
10:36:13:[  145.245767] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache
10:36:13:[  146.012109] LustreError: 140-5: Server lustre-MDT0000 requested index 0, but that index is already in use. Use --writeconf to force
10:36:13:[  146.013180] LustreError: 1800:0:(mgs_handler.c:439:mgs_target_reg()) Failed to write lustre-MDT0000 log (-98)
10:36:13:[  147.148490] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/li
10:36:13:[  148.851377] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
10:36:13:[  149.089600] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
10:36:13:[  162.165215] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
10:36:13:[  163.485841] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934593, ql: 1, comp: 1, conn: 2, next: 8589934596, next_update 0 last_committed: 8589934592)
10:36:13:[  163.591162] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934597, ql: 1, comp: 1, conn: 2, next: 8589934598, next_update 0 last_committed: 8589934592)
10:36:13:[  163.596731] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934599, ql: 1, comp: 1, conn: 2, next: 8589934608, next_update 0 last_committed: 8589934592)
10:36:13:[  163.610726] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934615, ql: 1, comp: 1, conn: 2, next: 8589934616, next_update 0 last_committed: 8589934592)
10:36:13:[  163.619367] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934618, ql: 1, comp: 1, conn: 2, next: 8589934621, next_update 0 last_committed: 8589934592)
10:36:13:[  163.630728] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934625, ql: 1, comp: 1, conn: 2, next: 8589934626, next_update 0 last_committed: 8589934592)
10:36:13:[  163.638126] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934628, ql: 1, comp: 1, conn: 2, next: 8589934631, next_update 0 last_committed: 8589934592)
10:36:13:[  163.660532] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934644, ql: 1, comp: 1, conn: 2, next: 8589934645, next_update 0 last_committed: 8589934592)
10:36:14:[  163.669816] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934648, ql: 1, comp: 1, conn: 2, next: 8589934651, next_update 0 last_committed: 8589934592)
10:36:14:[  163.677050] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934652, ql: 1, comp: 1, conn: 2, next: 8589934653, next_update 0 last_committed: 8589934592)
10:36:14:[  163.687074] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934656, ql: 1, comp: 1, conn: 2, next: 8589934659, next_update 0 last_committed: 8589934592)
10:36:14:[  163.701600] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934661, ql: 1, comp: 1, conn: 2, next: 8589934662, next_update 0 last_committed: 8589934592)
10:36:14:[  163.715377] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934664, ql: 1, comp: 1, conn: 2, next: 8589934666, next_update 0 last_committed: 8589934592)
10:36:14:[  163.727055] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934670, ql: 1, comp: 1, conn: 2, next: 8589934673, next_update 0 last_committed: 8589934592)
10:36:14:[  163.737215] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934675, ql: 1, comp: 1, conn: 2, next: 8589934678, next_update 0 last_committed: 8589934592)
10:36:14:[  163.752852] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934682, ql: 1, comp: 1, conn: 2, next: 8589934683, next_update 0 last_committed: 8589934592)
10:36:14:[  163.770265] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934684, ql: 1, comp: 1, conn: 2, next: 8589934685, next_update 0 last_committed: 8589934592)
10:36:14:[  163.784502] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934687, ql: 1, comp: 1, conn: 2, next: 8589934690, next_update 0 last_committed: 8589934592)
10:36:14:[  163.797854] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934697, ql: 1, comp: 1, conn: 2, next: 8589934698, next_update 0 last_committed: 8589934592)
10:36:14:[  163.806316] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934699, ql: 1, comp: 1, conn: 2, next: 8589934701, next_update 0 last_committed: 8589934592)
10:36:14:[  163.821236] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934710, ql: 1, comp: 1, conn: 2, next: 8589934713, next_update 0 last_committed: 8589934592)
10:36:14:[  163.831932] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934716, ql: 1, comp: 1, conn: 2, next: 8589934717, next_update 0 last_committed: 8589934592)
10:36:14:[  163.841241] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934719, ql: 1, comp: 1, conn: 2, next: 8589934724, next_update 0 last_committed: 8589934592)
10:36:14:[  163.853399] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934727, ql: 1, comp: 1, conn: 2, next: 8589934730, next_update 0 last_committed: 8589934592)
10:36:14:[  163.863080] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934731, ql: 1, comp: 1, conn: 2, next: 8589934732, next_update 0 last_committed: 8589934592)
10:36:14:[  163.872376] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934733, ql: 1, comp: 1, conn: 2, next: 8589934735, next_update 0 last_committed: 8589934592)
10:36:14:[  163.882055] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934736, ql: 1, comp: 1, conn: 2, next: 8589934737, next_update 0 last_committed: 8589934592)
10:36:14:[  163.889204] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934738, ql: 1, comp: 1, conn: 2, next: 8589934741, next_update 0 last_committed: 8589934592)
10:36:14:[  163.897106] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934742, ql: 1, comp: 1, conn: 2, next: 8589934743, next_update 0 last_committed: 8589934592)
10:36:14:[  163.905043] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934744, ql: 1, comp: 1, conn: 2, next: 8589934747, next_update 0 last_committed: 8589934592)
10:36:14:[  163.912760] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934748, ql: 1, comp: 1, conn: 2, next: 8589934749, next_update 0 last_committed: 8589934592)
10:36:14:[  163.920070] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934750, ql: 1, comp: 1, conn: 2, next: 8589934752, next_update 0 last_committed: 8589934592)
10:36:14:[  163.937850] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934756, ql: 1, comp: 1, conn: 2, next: 8589934757, next_update 0 last_committed: 8589934592)
10:36:14:[  163.948097] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934758, ql: 1, comp: 1, conn: 2, next: 8589934759, next_update 0 last_committed: 8589934592)
10:36:14:[  164.103266] Lustre: lustre-MDT0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted.
10:36:14:[  164.572861] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 12 sec
10:36:14:[  164.801625] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 12 sec
10:37:36:[  167.599971] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 12 sec
10:37:36:[  167.799548] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 12 sec
10:37:36:[  170.826361] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  auster : @@@@@@ FAIL: import is not in FULL state 
10:37:36:[  173.608102] Lustre: DEBUG MARKER: auster : @@@@@@ FAIL: import is not in FULL state
10:37:36:[  176.622637] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /tmp/test_logs/1438165590/auster..debug_log.$(hostname -s).1438166178.log;
10:37:36:[  176.622637]          dmesg > /tmp/test_logs/1438165590/auster..dmesg.$(hostname -s).1438166178.log


 Comments   
Comment by Peter Jones [ 12/Aug/15 ]

Mike

Could you please advise?

Thanks

Peter

Comment by Andreas Dilger [ 12/Aug/15 ]

The check_for_next_transno() messages should be fixed or quieted in some manner, since so many error messages cannot be useful (they were also spewing in LU-5805 and several other bugs).

It would also be useful to print the NID in the 140-5: Server lustre-MDT0000 requested index 0, but that index is already in use message, since it isn't clear if the MDT0000 configuration request is coming from some spurious node or something.

Comment by Peter Jones [ 31/Aug/15 ]

Hongchao

Could you please assist with this issue?

thanks

Peter

Comment by Hongchao Zhang [ 01/Sep/15 ]

the client connected to the "onyx-40vm3" to check whether the recovery is completed

affected facets: mds1
CMD: onyx-40vm7 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh _wait_recovery_complete *.lustre:MDT0000.recovery_status 1475 
onyx-40vm7: error: get_param: */lustre:MDT0000/recovery_status: Found no match
mds1 recovery is not completed!

and there is also no "onyx-40vm3" in the report at https://testing.hpdd.intel.com/test_sets/bd81fa12-35e2-11e5-8c30-5254006e85c2
It could be a TEI problem?

Comment by Oleg Drokin [ 08/Sep/15 ]

Because it is rhel7.1, there is an active TEI issue.
If you need vm3 logs - you just go to session and then enter the "lustre-initialization" before your test you ate intrested in and look a tthe logs there, there is your console logsfor all subsequent tests.

Comment by Hongchao Zhang [ 15/Sep/15 ]

this should be the same problem as LU-6890, LU-1875, LU-2415

Comment by Gerrit Updater [ 15/Sep/15 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/16430
Subject: LU-6992 test: debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4807567bf58a62cdde0536d7a4a4c6d5213fca93

Comment by Jian Yu [ 02/Oct/15 ]
onyx-40vm7: error: get_param: */lustre:MDT0000/recovery_status: Found no match

The server target device label was wrong. "lustre:MDT0000" should be "lustre-MDT0000".

In mount_facet() in test-framework.sh, maybe we need wait for the label on device to be changed from "lustre:MDT0000" to "lustre-MDT0000", and then run e2label?

There is a similar way in init_facet_vars() to check whether label contains ffff or not and wait for it becoming initialized.

Comment by Andreas Dilger [ 02/Oct/15 ]

What about using $FSNAME.OST0000 ("." instead of "-") for get_param so that it always matches? Or won't it work properly if the MDS isn't fully configured?

Comment by Jian Yu [ 07/Oct/15 ]

Hi Andreas,

In wait_recovery_complete():

                local var_svc=${facet}_svc
                local param="*.${!var_svc}.recovery_status"

And ${facet}_svc was defined in mount_facet():

        label=$(devicelabel ${facet} ${!dev})
        [ -z "$label" ] && echo no label for ${!dev} && exit 1
        eval export ${facet}_svc=${label}
        echo Started ${label}

So, we can not specify the label like $FSNAME.OST0000 directly.

Comment by Andreas Dilger [ 08/Oct/15 ]

Jian, it is OK to wait_update() in init_facet_vars() for the label to be updated.

Comment by Peter Jones [ 08/Oct/15 ]

Hongchao, could you please push a patch to try this out? Thanks!

Comment by Gerrit Updater [ 09/Oct/15 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/16781
Subject: LU-6992 test: wait device to be registered
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e40de978e757a412f46ddec39b1ea2f8d7e98852

Comment by Gerrit Updater [ 02/Nov/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16781/
Subject: LU-6992 test: wait device to be registered
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d76285d649545379b2bbf10aa5bcd3fdf4c9b52a

Comment by Joseph Gmitter (Inactive) [ 02/Nov/15 ]

Landed for 2.8

Generated at Sat Feb 10 02:05:03 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.