[LU-6992] recovery-random-scale test_fail_client_mds: test_fail_client_mds returned 254 Created: 12/Aug/15 Updated: 19/Nov/16 Resolved: 02/Nov/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | p4hc | ||
| Environment: |
client and server: lustre-master build #3120 RHEL7.1 |
||
| Issue Links: |
|
||||||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||||||
| Description |
|
This issue was created by maloo for sarah_lw <wei3.liu@intel.com> This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/bd81fa12-35e2-11e5-8c30-5254006e85c2. The sub-test test_fail_client_mds failed with the following error: test_fail_client_mds returned 254 MDS(onyx-40vm3) console shows following messages after rebooting 10:36:13:[ 102.850823] LDISKFS-fs warning (device dm-0): ldiskfs_multi_mount_protect:318: MMP interval 42 higher than expected, please wait. 10:36:13:[ 102.850823] 10:36:13:[ 145.221059] LDISKFS-fs (dm-0): recovery complete 10:36:13:[ 145.245767] LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache 10:36:13:[ 146.012109] LustreError: 140-5: Server lustre-MDT0000 requested index 0, but that index is already in use. Use --writeconf to force 10:36:13:[ 146.013180] LustreError: 1800:0:(mgs_handler.c:439:mgs_target_reg()) Failed to write lustre-MDT0000 log (-98) 10:36:13:[ 147.148490] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/li 10:36:13:[ 148.851377] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1 10:36:13:[ 149.089600] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null 10:36:13:[ 162.165215] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect 10:36:13:[ 163.485841] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934593, ql: 1, comp: 1, conn: 2, next: 8589934596, next_update 0 last_committed: 8589934592) 10:36:13:[ 163.591162] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934597, ql: 1, comp: 1, conn: 2, next: 8589934598, next_update 0 last_committed: 8589934592) 10:36:13:[ 163.596731] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934599, ql: 1, comp: 1, conn: 2, next: 8589934608, next_update 0 last_committed: 8589934592) 10:36:13:[ 163.610726] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934615, ql: 1, comp: 1, conn: 2, next: 8589934616, next_update 0 last_committed: 8589934592) 10:36:13:[ 163.619367] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934618, ql: 1, comp: 1, conn: 2, next: 8589934621, next_update 0 last_committed: 8589934592) 10:36:13:[ 163.630728] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934625, ql: 1, comp: 1, conn: 2, next: 8589934626, next_update 0 last_committed: 8589934592) 10:36:13:[ 163.638126] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934628, ql: 1, comp: 1, conn: 2, next: 8589934631, next_update 0 last_committed: 8589934592) 10:36:13:[ 163.660532] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934644, ql: 1, comp: 1, conn: 2, next: 8589934645, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.669816] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934648, ql: 1, comp: 1, conn: 2, next: 8589934651, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.677050] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934652, ql: 1, comp: 1, conn: 2, next: 8589934653, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.687074] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934656, ql: 1, comp: 1, conn: 2, next: 8589934659, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.701600] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934661, ql: 1, comp: 1, conn: 2, next: 8589934662, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.715377] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934664, ql: 1, comp: 1, conn: 2, next: 8589934666, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.727055] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934670, ql: 1, comp: 1, conn: 2, next: 8589934673, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.737215] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934675, ql: 1, comp: 1, conn: 2, next: 8589934678, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.752852] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934682, ql: 1, comp: 1, conn: 2, next: 8589934683, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.770265] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934684, ql: 1, comp: 1, conn: 2, next: 8589934685, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.784502] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934687, ql: 1, comp: 1, conn: 2, next: 8589934690, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.797854] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934697, ql: 1, comp: 1, conn: 2, next: 8589934698, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.806316] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934699, ql: 1, comp: 1, conn: 2, next: 8589934701, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.821236] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934710, ql: 1, comp: 1, conn: 2, next: 8589934713, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.831932] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934716, ql: 1, comp: 1, conn: 2, next: 8589934717, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.841241] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934719, ql: 1, comp: 1, conn: 2, next: 8589934724, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.853399] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934727, ql: 1, comp: 1, conn: 2, next: 8589934730, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.863080] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934731, ql: 1, comp: 1, conn: 2, next: 8589934732, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.872376] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934733, ql: 1, comp: 1, conn: 2, next: 8589934735, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.882055] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934736, ql: 1, comp: 1, conn: 2, next: 8589934737, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.889204] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934738, ql: 1, comp: 1, conn: 2, next: 8589934741, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.897106] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934742, ql: 1, comp: 1, conn: 2, next: 8589934743, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.905043] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934744, ql: 1, comp: 1, conn: 2, next: 8589934747, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.912760] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934748, ql: 1, comp: 1, conn: 2, next: 8589934749, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.920070] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934750, ql: 1, comp: 1, conn: 2, next: 8589934752, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.937850] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934756, ql: 1, comp: 1, conn: 2, next: 8589934757, next_update 0 last_committed: 8589934592) 10:36:14:[ 163.948097] LustreError: 1851:0:(ldlm_lib.c:1831:check_for_next_transno()) lustre-MDT0000: waking for gap in transno, VBR is OFF (skip: 8589934758, ql: 1, comp: 1, conn: 2, next: 8589934759, next_update 0 last_committed: 8589934592) 10:36:14:[ 164.103266] Lustre: lustre-MDT0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted. 10:36:14:[ 164.572861] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 12 sec 10:36:14:[ 164.801625] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 12 sec 10:37:36:[ 167.599971] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 12 sec 10:37:36:[ 167.799548] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 12 sec 10:37:36:[ 170.826361] Lustre: DEBUG MARKER: /usr/sbin/lctl mark auster : @@@@@@ FAIL: import is not in FULL state 10:37:36:[ 173.608102] Lustre: DEBUG MARKER: auster : @@@@@@ FAIL: import is not in FULL state 10:37:36:[ 176.622637] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /tmp/test_logs/1438165590/auster..debug_log.$(hostname -s).1438166178.log; 10:37:36:[ 176.622637] dmesg > /tmp/test_logs/1438165590/auster..dmesg.$(hostname -s).1438166178.log |
| Comments |
| Comment by Peter Jones [ 12/Aug/15 ] |
|
Mike Could you please advise? Thanks Peter |
| Comment by Andreas Dilger [ 12/Aug/15 ] |
|
The check_for_next_transno() messages should be fixed or quieted in some manner, since so many error messages cannot be useful (they were also spewing in It would also be useful to print the NID in the 140-5: Server lustre-MDT0000 requested index 0, but that index is already in use message, since it isn't clear if the MDT0000 configuration request is coming from some spurious node or something. |
| Comment by Peter Jones [ 31/Aug/15 ] |
|
Hongchao Could you please assist with this issue? thanks Peter |
| Comment by Hongchao Zhang [ 01/Sep/15 ] |
|
the client connected to the "onyx-40vm3" to check whether the recovery is completed affected facets: mds1 CMD: onyx-40vm7 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh _wait_recovery_complete *.lustre:MDT0000.recovery_status 1475 onyx-40vm7: error: get_param: */lustre:MDT0000/recovery_status: Found no match mds1 recovery is not completed! and there is also no "onyx-40vm3" in the report at https://testing.hpdd.intel.com/test_sets/bd81fa12-35e2-11e5-8c30-5254006e85c2 |
| Comment by Oleg Drokin [ 08/Sep/15 ] |
|
Because it is rhel7.1, there is an active TEI issue. |
| Comment by Hongchao Zhang [ 15/Sep/15 ] |
|
this should be the same problem as LU-6890, |
| Comment by Gerrit Updater [ 15/Sep/15 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/16430 |
| Comment by Jian Yu [ 02/Oct/15 ] |
onyx-40vm7: error: get_param: */lustre:MDT0000/recovery_status: Found no match The server target device label was wrong. "lustre:MDT0000" should be "lustre-MDT0000". In mount_facet() in test-framework.sh, maybe we need wait for the label on device to be changed from "lustre:MDT0000" to "lustre-MDT0000", and then run e2label? There is a similar way in init_facet_vars() to check whether label contains ffff or not and wait for it becoming initialized. |
| Comment by Andreas Dilger [ 02/Oct/15 ] |
|
What about using $FSNAME.OST0000 ("." instead of "-") for get_param so that it always matches? Or won't it work properly if the MDS isn't fully configured? |
| Comment by Jian Yu [ 07/Oct/15 ] |
|
Hi Andreas, In wait_recovery_complete(): local var_svc=${facet}_svc
local param="*.${!var_svc}.recovery_status"
And ${facet}_svc was defined in mount_facet(): label=$(devicelabel ${facet} ${!dev})
[ -z "$label" ] && echo no label for ${!dev} && exit 1
eval export ${facet}_svc=${label}
echo Started ${label}
So, we can not specify the label like $FSNAME.OST0000 directly. |
| Comment by Andreas Dilger [ 08/Oct/15 ] |
|
Jian, it is OK to wait_update() in init_facet_vars() for the label to be updated. |
| Comment by Peter Jones [ 08/Oct/15 ] |
|
Hongchao, could you please push a patch to try this out? Thanks! |
| Comment by Gerrit Updater [ 09/Oct/15 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/16781 |
| Comment by Gerrit Updater [ 02/Nov/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16781/ |
| Comment by Joseph Gmitter (Inactive) [ 02/Nov/15 ] |
|
Landed for 2.8 |