[LU-16408] replay-dual test_33: unable to mount /mnt/lustre2 Created: 15/Dec/22 Updated: 27/Oct/23 Resolved: 23/Sep/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Etienne Aujames |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/bd566ffb-4675-4852-9ac7-ec33bd93b99f

test_33 failed with the following error:

Started lustre-MDT0000
Starting client: trevis-83vm7.trevis.whamcloud.com: -o user_xattr,flock trevis-80vm7@tcp:/lustre /mnt/lustre2
CMD: trevis-83vm7.trevis.whamcloud.com mkdir -p /mnt/lustre2
CMD: trevis-83vm7.trevis.whamcloud.com mount -t lustre -o user_xattr,flock trevis-80vm7@tcp:/lustre /mnt/lustre2

Dmesg on client:

LustreError: 11-0: lustre-MDT0000-mdc-ffff9aa6d17cc800: operation mds_connect to node 10.240.42.182@tcp failed: rc = -11
LustreError: Skipped 3 previous similar messages
Lustre: lustre-OST0006-osc-ffff9aa6d92e2800: Connection to lustre-OST0006 (at 10.240.42.196@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 6 previous similar messages
Lustre: lustre-OST0001-osc-ffff9aa6d92e2800: Connection restored to (at 10.240.42.196@tcp)
Lustre: Skipped 5 previous similar messages
LustreError: 10874:0:(lmv_obd.c:1287:lmv_statfs()) lustre-MDT0000-mdc-ffff9aa6d17cc800: can't stat MDS #0: rc = -11
Lustre: Unmounted lustre-client
LustreError: 10874:0:(super25.c:181:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -11

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
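For context, rc = -11 corresponds to -EAGAIN: the client-side dmesg above shows the MDC connect to the MDT being refused and the second mount aborting. A minimal sketch of how one might check this by hand, assuming shell access to the nodes named in the log (host names, target name and mount point are taken from the failure output above, not from the test script itself):

# On the MDS node: check whether lustre-MDT0000 has finished recovery
lctl get_param mdt.lustre-MDT0000.recovery_status

# On the client node: retry the second mount once recovery reports COMPLETE
mkdir -p /mnt/lustre2
mount -t lustre -o user_xattr,flock trevis-80vm7@tcp:/lustre /mnt/lustre2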
| Comments |
| Comment by Jian Yu [ 15/Dec/22 ] |
|
The replay-dual test_33 was added by https://review.whamcloud.com/48082. |
| Comment by Etienne Aujames [ 27/Mar/23 ] |
|
Hi,

The mounted client did not completely finish its recovery and was evicted during REPLAY_LOCKS:

[Tue Nov 15 16:29:06 2022] LustreError: 876057:0:(mdt_handler.c:7441:mdt_iocontrol()) lustre-MDT0000: Aborting client recovery
[Tue Nov 15 16:29:06 2022] LustreError: 876057:0:(ldlm_lib.c:2888:target_stop_recovery_thread()) lustre-MDT0000: Aborting recovery
[Tue Nov 15 16:29:06 2022] Lustre: 874819:0:(ldlm_lib.c:2294:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[Tue Nov 15 16:29:06 2022] Lustre: 874819:0:(ldlm_lib.c:2294:target_recovery_overseer()) Skipped 2 previous similar messages
[Tue Nov 15 16:29:06 2022] Lustre: lustre-MDT0000: disconnecting 2 stale clients
[Tue Nov 15 16:29:06 2022] LustreError: 874819:0:(tgt_grant.c:257:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 4194304 != fo_tot_granted 6291456
[Tue Nov 15 16:29:06 2022] LustreError: 874819:0:(ldlm_lib.c:1829:abort_lock_replay_queue()) @@@ aborted: req@00000000b51e723d x1749569008570176/t0(0) o101->1b7e53c5-3301-41c4-8e4d-aab3eade9ae8@10.240.42.242@tcp:151/0 lens 328/0 e 0 to 0 dl 1668529766 ref 1 fl Complete:/40/ffffffff rc 0/-1 job:'ldlm_lock_repla.0'
[Tue Nov 15 16:29:06 2022] LustreError: 874819:0:(ldlm_lib.c:1829:abort_lock_replay_queue()) Skipped 25 previous similar messages
[Tue Nov 15 16:29:06 2022] Lustre: lustre-MDT0000: Denying connection for new client 1b7e53c5-3301-41c4-8e4d-aab3eade9ae8 (at 10.240.42.242@tcp), waiting for 4 known clients (2 recovered, 0 in progress, and 2 evicted) to recover in 1:14
[Tue Nov 15 16:29:06 2022] Lustre: Skipped 3 previous similar messages

But the client import was set to FULL, the MDT recovery did not finish, and then the MDT was unmounted. After remounting the MDT, the second client is unable to remount.

This looks like a new bug. For now, I will stabilize the test. |
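The "Aborting client recovery" messages above correspond to an explicit recovery abort on the MDT. A minimal sketch of what that step looks like with the standard lctl interface, using the device name from the log; this is an illustration, not the replay-dual script itself:

# On the MDS node: abort recovery instead of waiting for all known clients
lctl --device lustre-MDT0000 abort_recovery

# Clients still in REPLAY_LOCKS are then evicted, and (as the log shows) new
# connection attempts are denied until the recovery window on the remounted
# target closes.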
| Comment by Gerrit Updater [ 27/Mar/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50434 |
| Comment by Gerrit Updater [ 23/Sep/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50434/ |
| Comment by Peter Jones [ 23/Sep/23 ] |
|
Landed for 2.16 |
| Comment by Alex Zhuravlev [ 25/Sep/23 ] |
|
The patch breaks replay-dual/33 on a local (single-VM) setup: before, replay-dual completed in ~30 seconds; now it cannot finish within 30 minutes. |
| Comment by Alex Zhuravlev [ 27/Sep/23 ] |
|
replay-dual/33 gets stuck at the following operation:
! combined_mgs_mds || $LCTL get_param mdc.*.ping || true
Can you please explain the purpose of this get_param? |
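For reference, that one-liner is plain shell short-circuit logic: the get_param runs only on combined MGS/MDT configurations, and any failure is ignored. A rough bash equivalent, assuming combined_mgs_mds and $LCTL as provided by the Lustre test framework:

if combined_mgs_mds; then
    # Touch the ping parameter of every MDC import, presumably to prompt an
    # immediate ping of the MDT instead of waiting for the pinger interval
    # (see the reply below); "|| true" keeps a failure here from aborting
    # the test.
    $LCTL get_param mdc.*.ping || true
fi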
| Comment by Alex Zhuravlev [ 27/Sep/23 ] |
|
Actually, it is mount_facet mds1 that fails:
[ 122.710153] LustreError: 15c-8: MGC192.168.127.51@tcp: Confguration from log lustre-MDT0000 failed from MGS -5. Communication error between node & MGS, a bad configuration, or other errors. See syslog for more info
[ 122.715086] LustreError: 7365:0:(tgt_mount.c:1524:server_start_targets()) failed to start server lustre-MDT0000: -5
[ 122.715704] LustreError: 7365:0:(tgt_mount.c:2216:server_fill_super()) Unable to start targets: -5
[ 122.716325] LustreError: 7365:0:(tgt_mount.c:1752:server_put_super()) no obd lustre-MDT0000
[ 122.718737] Lustre: server umount lustre-MDT0000 complete
[ 122.719112] LustreError: 7365:0:(super25.c:188:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -5
mount.lustre: mount /dev/mapper/mds1_flakey at /mnt/lustre-mds1 failed: Input/output error
Is the MGS running?
Start of /dev/mapper/mds1_flakey on mds1 failed 5
|
| Comment by Etienne Aujames [ 27/Sep/23 ] |
|
Yes. For a combined MGT/MDT, when we umount the target (failover) we disable all MGS services. So IR (Imperative Recovery) is disabled and the target states are not updated directly; in the real world, clients have to wait for the next pinger ping (obd_timeout/4 seconds). I think the issue here is that when the client is on the MGS node, it uses a pingless connection on 0@lo.

Do you have a link to a failed test with logs? |
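A small sketch of how to check the timing described above on a client node, assuming the standard lctl parameters (the exact MDC instance name will differ per setup):

# obd_timeout, from which the pinger interval (obd_timeout / 4) is derived
lctl get_param timeout

# Import state and the NID actually in use; a client co-located with the MGS
# can end up on a 0@lo loopback connection, which per the comment above is
# pingless
lctl get_param mdc.*.import | grep -E 'state|current_connection'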