[LU-16408] replay-dual test_33: unable to mount /mnt/lustre2 Created: 15/Dec/22  Updated: 27/Oct/23  Resolved: 23/Sep/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Etienne Aujames
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17229 replay-dual test_33: import is not in... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/bd566ffb-4675-4852-9ac7-ec33bd93b99f

test_33 failed with the following error:

Started lustre-MDT0000
Starting client: trevis-83vm7.trevis.whamcloud.com:  -o user_xattr,flock trevis-80vm7@tcp:/lustre /mnt/lustre2
CMD: trevis-83vm7.trevis.whamcloud.com mkdir -p /mnt/lustre2
CMD: trevis-83vm7.trevis.whamcloud.com mount -t lustre -o user_xattr,flock trevis-80vm7@tcp:/lustre /mnt/lustre2

Dmesg on client:

LustreError: 11-0: lustre-MDT0000-mdc-ffff9aa6d17cc800: operation mds_connect to node 10.240.42.182@tcp failed: rc = -11
LustreError: Skipped 3 previous similar messages
Lustre: lustre-OST0006-osc-ffff9aa6d92e2800: Connection to lustre-OST0006 (at 10.240.42.196@tcp) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 6 previous similar messages
Lustre: lustre-OST0001-osc-ffff9aa6d92e2800: Connection restored to  (at 10.240.42.196@tcp) 
Lustre: Skipped 5 previous similar messages
LustreError: 10874:0:(lmv_obd.c:1287:lmv_statfs()) lustre-MDT0000-mdc-ffff9aa6d17cc800: can't stat MDS #0: rc = -11
Lustre: Unmounted lustre-client
LustreError: 10874:0:(super25.c:181:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -11

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
replay-dual test_33 - Timeout occurred after 434 minutes, last suite running was replay-dual



 Comments   
Comment by Jian Yu [ 15/Dec/22 ]

The replay-dual test_33 was added by https://review.whamcloud.com/48082 ("LU-15935 target: keep track of multirpc slots in last_rcvd").
Hi eaujames, could you please advise?

Comment by Etienne Aujames [ 27/Mar/23 ]

Hi,

The mounted client did not completely finish its recovery and was evicted during the REPLAY_LOCKS phase:

[Tue Nov 15 16:29:06 2022] LustreError: 876057:0:(mdt_handler.c:7441:mdt_iocontrol()) lustre-MDT0000: Aborting client recovery
[Tue Nov 15 16:29:06 2022] LustreError: 876057:0:(ldlm_lib.c:2888:target_stop_recovery_thread()) lustre-MDT0000: Aborting recovery
[Tue Nov 15 16:29:06 2022] Lustre: 874819:0:(ldlm_lib.c:2294:target_recovery_overseer()) recovery is aborted, evict exports in recovery
[Tue Nov 15 16:29:06 2022] Lustre: 874819:0:(ldlm_lib.c:2294:target_recovery_overseer()) Skipped 2 previous similar messages
[Tue Nov 15 16:29:06 2022] Lustre: lustre-MDT0000: disconnecting 2 stale clients
[Tue Nov 15 16:29:06 2022] LustreError: 874819:0:(tgt_grant.c:257:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 4194304 != fo_tot_granted 6291456
[Tue Nov 15 16:29:06 2022] LustreError: 874819:0:(ldlm_lib.c:1829:abort_lock_replay_queue()) @@@ aborted:  req@00000000b51e723d x1749569008570176/t0(0) o101->1b7e53c5-3301-41c4-8e4d-aab3eade9ae8@10.240.42.242@tcp:151/0 lens 328/0 e 0 to 0 dl 1668529766 ref 1 fl Complete:/40/ffffffff rc 0/-1 job:'ldlm_lock_repla.0'
[Tue Nov 15 16:29:06 2022] LustreError: 874819:0:(ldlm_lib.c:1829:abort_lock_replay_queue()) Skipped 25 previous similar messages
[Tue Nov 15 16:29:06 2022] Lustre: lustre-MDT0000: Denying connection for new client 1b7e53c5-3301-41c4-8e4d-aab3eade9ae8 (at 10.240.42.242@tcp), waiting for 4 known clients (2 recovered, 0 in progress, and 2 evicted) to recover in 1:14
[Tue Nov 15 16:29:06 2022] Lustre: Skipped 3 previous similar messages 
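
For reference, the recovery progress and eviction counts seen above can be checked on the MDS with lctl (illustrative command only, not from the test logs; the device name differs per setup):

	# On the MDS: report recovery status and connected/recovered/evicted client counts
	lctl get_param mdt.lustre-MDT0000.recovery_status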

But the client import is set to FULL even though the MDT recovery did not finish, and then the MDT was unmounted. After remounting the MDT, the second client is unable to remount.
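
The import state can be verified on the client with lctl (illustrative command; the obd device name differs per setup):

	# On the client: "state:" in the MDC import should only read FULL once recovery has completed
	lctl get_param mdc.lustre-MDT0000-mdc-*.import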

This looks like a new bug. For now, I will stabilize the test.

Comment by Gerrit Updater [ 27/Mar/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50434
Subject: LU-16408 tests: fix replay-dual test 33
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b513f684834cf0fc7ebab6b319fc2ae098ff60d6

Comment by Gerrit Updater [ 23/Sep/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50434/
Subject: LU-16408 tests: fix replay-dual test 33
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7f89e8c8975fcc82983c5756438861d66e64ec23

Comment by Peter Jones [ 23/Sep/23 ]

Landed for 2.16

Comment by Alex Zhuravlev [ 25/Sep/23 ]

The patch breaks replay-dual/33 on a local (single-VM) setup. Before, it took ~30 seconds to complete replay-dual; now it cannot complete within 30 minutes.

Comment by Alex Zhuravlev [ 27/Sep/23 ]

replay-dual/33 gets stuck at the following operation:

	! combined_mgs_mds || $LCTL get_param mdc.*.ping || true

Can you please explain the purpose of this get_param?

Comment by Alex Zhuravlev [ 27/Sep/23 ]

Actually, it is mount_facet mds1 that fails:

[  122.710153] LustreError: 15c-8: MGC192.168.127.51@tcp: Confguration from log lustre-MDT0000 failed from MGS -5. Communication error between node & MGS, a bad configuration, or other errors. See syslog for more info
[  122.715086] LustreError: 7365:0:(tgt_mount.c:1524:server_start_targets()) failed to start server lustre-MDT0000: -5
[  122.715704] LustreError: 7365:0:(tgt_mount.c:2216:server_fill_super()) Unable to start targets: -5
[  122.716325] LustreError: 7365:0:(tgt_mount.c:1752:server_put_super()) no obd lustre-MDT0000
[  122.718737] Lustre: server umount lustre-MDT0000 complete
[  122.719112] LustreError: 7365:0:(super25.c:188:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -5
mount.lustre: mount /dev/mapper/mds1_flakey at /mnt/lustre-mds1 failed: Input/output error
Is the MGS running?
Start of /dev/mapper/mds1_flakey on mds1 failed 5
Comment by Etienne Aujames [ 27/Sep/23 ]

Yes, for a combined MGT/MDT, when we umount the target (failover) we disable all MGS services. So IR (Imperative Recovery) is disabled and the target states are not updated directly. In the real world the client has to wait for the next pinger ping (obd_timeout/4 seconds).
The "ping" here is a workaround to force an update of the import state.
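
Illustrative commands (not part of the patch) showing the timing involved and the forced ping:

	# Global obd_timeout; the pinger wakes up roughly every obd_timeout/4 seconds
	lctl get_param timeout
	# Reading the per-import "ping" parameter makes the client ping immediately
	# instead of waiting for the next pinger cycle
	lctl get_param mdc.*.ping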

I think the issue here is that if the client is on the MGS node, it uses a pingless connection over 0@lo.
We have to skip this test for single-node testing.
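
A minimal sketch of such a guard (hypothetical, assuming the usual test-framework helpers facet_host and skip_env; not the actual landed change):

	# Skip when the client runs on the MGS node: the local 0@lo MGC connection
	# is pingless, so the forced ping workaround cannot help there.
	if [ "$(facet_host mgs)" = "$(hostname)" ]; then
		skip_env "client on the MGS node uses a pingless 0@lo connection"
	fi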

Do you have a failed test link with logs?
