[LU-16103] replay-dual test_0a: failed with 1 Created: 24/Aug/22  Updated: 24/Aug/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for Alexander Boyko <c17825@cray.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/80540e53-61df-461a-96c3-0494310acb41

test_0a failed with the following error:

Starting mds1: -o localrecov  /dev/mapper/mds1_flakey /mnt/lustre-mds1
CMD: onyx-25vm6 mkdir -p /mnt/lustre-mds1; mount -t lustre -o localrecov  /dev/mapper/mds1_flakey /mnt/lustre-mds1
CMD: onyx-25vm6 /usr/sbin/lctl get_param -n health_check
CMD: onyx-25vm6 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh set_default_debug \"-1\" \"all\" 4 
onyx-25vm6: onyx-25vm6.onyx.whamcloud.com: executing set_default_debug -1 all 4
CMD: onyx-25vm6 e2label /dev/mapper/mds1_flakey 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
pdsh@onyx-70vm5: onyx-25vm6: ssh exited with exit code 1
CMD: onyx-25vm6 e2label /dev/mapper/mds1_flakey 2>/dev/null
Started lustre-MDT0000
14:00:59 (1661263259) targets are mounted
14:00:59 (1661263259) facet_failover done
onyx-70vm5: error: invalid path '/mnt/lustre': Input/output error
pdsh@onyx-70vm5: onyx-70vm5: ssh exited with exit code 5
 replay-dual test_0a: @@@@@@ FAIL: test_0a failed with 1 

MDT0 logs

[ 1016.053375] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre -o localrecov  /dev/mapper/mds1_flakey /mnt/lustre-mds1
[ 1016.758507] LDISKFS-fs (dm-6): recovery complete
[ 1016.769329] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[ 1018.408238] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.240.24.133@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 1018.411430] LustreError: Skipped 37 previous similar messages
[ 1024.172062] Lustre: Evicted from MGS (at 10.240.22.105@tcp) after server handle changed from 0xdfcaf8a1d05f1194 to 0xdfcaf8a1d05f2b6b
[ 1024.174648] Lustre: MGC10.240.22.105@tcp: Connection restored to  (at 0@lo)
[ 1024.498684] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
[ 1024.538571] Lustre: lustre-MDT0000: in recovery but waiting for the first client to connect
[ 1024.972661] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
[ 1025.561493] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
[ 1026.052233] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 7 clients reconnect
[ 1026.871725] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-25vm6.onyx.whamcloud.com: executing set_default_debug -1 all 4
[ 1026.883230] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-25vm6.onyx.whamcloud.com: executing set_default_debug -1 all 4
[ 1027.318845] Lustre: DEBUG MARKER: onyx-25vm6.onyx.whamcloud.com: executing set_default_debug -1 all 4
[ 1027.327146] Lustre: DEBUG MARKER: onyx-25vm6.onyx.whamcloud.com: executing set_default_debug -1 all 4
[ 1027.644129] Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
[ 1028.228622] Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null
[ 1029.808694] Lustre: lustre-MDT0000-lwp-MDT0002: Connection restored to 10.240.22.105@tcp (at 0@lo)
[ 1127.688565] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
[ 1127.690016] Lustre: lustre-MDT0000: disconnecting 1 stale clients
[ 1127.691192] LustreError: 32287:0:(tgt_grant.c:257:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 6291456 != fo_tot_granted 8388608
[ 1128.078197] Lustre: lustre-MDT0000: haven't heard from client e2671c3e-fba6-400e-b342-9ef8ad147c4e (at 10.240.25.236@tcp) in 98 seconds. I think it's dead, and I am evicting it. exp 0000000088422ea8, cur 1661263358 expire 1661263328 last 1661263260
[ 1128.079720] Lustre: 32287:0:(ldlm_lib.c:2823:target_recovery_thread()) too long recovery - read logs
[ 1128.083485] Lustre: lustre-MDT0000-osp-MDT0002: Connection restored to 10.240.22.105@tcp (at 0@lo)
[ 1128.084648] LustreError: dumping log to /tmp/lustre-log.1661263358.32287
[ 1128.194808] Lustre: lustre-MDT0000: Recovery over after 1:42, of 7 clients 6 recovered and 1 was evicted.
[ 1128.575157] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-dual test_0a: @@@@@@ FAIL: test_0a failed with 1 

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
replay-dual test_0a - test_0a failed with 1
