[LU-11795] replay-vbr test 8b fails with 'Restart of mds1 failed!' Created: 17/Dec/18  Updated: 08/Oct/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.1, Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: failover

Issue Links:
Related
is related to LU-9707 Failover: recovery-random-scale test_... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-vbr test_8b fails with 'Restart of mds1 failed!'. So far, this test has failed only once: https://testing.whamcloud.com/test_sets/4fca0808-fd1b-11e8-8512-52540065bddc .

Looking at the client test_log, we see that the MDS has problems:

mount facets: mds1
CMD: trevis-16vm8 dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
CMD: trevis-16vm8 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-16vm8 loop_dev=\$(losetup -j /dev/lvm-Role_MDS/P1 | cut -d : -f 1);
			 if [[ -z \$loop_dev ]]; then
				loop_dev=\$(losetup -f);
				losetup \$loop_dev /dev/lvm-Role_MDS/P1 || loop_dev=;
			 fi;
			 echo -n \$loop_dev
trevis-16vm8: losetup: /dev/lvm-Role_MDS/P1: failed to set up loop device: No such file or directory
CMD: trevis-16vm8 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-16vm8 e2label /dev/lvm-Role_MDS/P1
trevis-16vm8: e2label: No such file or directory while trying to open /dev/lvm-Role_MDS/P1
trevis-16vm8: Couldn't find valid filesystem superblock.
Starting mds1:   -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
CMD: trevis-16vm8 mkdir -p /mnt/lustre-mds1; mount -t lustre   -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
trevis-16vm8: mount: /dev/lvm-Role_MDS/P1: failed to setup loop device: No such file or directory
Start of /dev/lvm-Role_MDS/P1 on mds1 failed 32
 replay-vbr test_8b: @@@@@@ FAIL: Restart of mds1 failed! 
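The mount helper in the log above first looks for a loop device already bound to the backing file and only allocates a fresh one if none exists. A minimal sketch of that logic (this is an illustrative reconstruction, not the actual test-framework code; the device path is taken from the log) shows why losetup reports "No such file or directory" here — the LVM backing device itself is gone, so the failure happens before the Lustre mount is even attempted:

```shell
# Hedged sketch of the loop-device setup seen in the test log.
# check_backing_dev: verify the backing device exists before trying losetup.
check_backing_dev() {
    local dev=$1
    if [[ ! -b $dev && ! -f $dev ]]; then
        # This is the state the logs show: the backing device vanished,
        # so losetup fails with ENOENT and the mount never proceeds.
        echo "backing device $dev missing: losetup would fail with ENOENT"
        return 1
    fi
    # Reuse an existing loop binding if one is present, else allocate one
    # (mirrors the losetup -j / losetup -f sequence from the log).
    local loop_dev
    loop_dev=$(losetup -j "$dev" | cut -d : -f 1)
    if [[ -z $loop_dev ]]; then
        loop_dev=$(losetup -f)
        losetup "$loop_dev" "$dev" || loop_dev=
    fi
    echo "using loop device: ${loop_dev:-none}"
}

check_backing_dev /dev/lvm-Role_MDS/P1
```

Since /dev/lvm-Role_MDS/P1 is absent on the failed node, the check reports the missing-device case, which matches the e2label "No such file or directory" output as well: every consumer of the path fails the same way, pointing at the LVM volume not being recreated or activated after the restart rather than at a loop-device or filesystem problem.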

Looking at the MDS1 (vm8) console log, we see replay-vbr test 8a start, MDS1 disconnect, and some stack traces (possibly from the replay-vbr test_8c hang); the next Lustre test script output is for replay-single test 0a. The MDS2 (vm7) console log shows similar content.

In the dmesg log for the OSS (vm5), we see some errors for test 8b:

[ 8057.763283] Lustre: DEBUG MARKER: == replay-vbr test 8b: create | unlink, create shouldn't fail ======================================== 17:11:00 (1544490660)
[ 8058.291385] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-16vm3: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[ 8058.489932] Lustre: DEBUG MARKER: trevis-16vm3: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[ 8067.329126] LNetError: 6975:0:(socklnd.c:1679:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.4.191@tcp[1], ip 10.9.4.191:7988, with error, wanted: 152, left: 152, last alive is 5 secs ago
[ 8067.330996] LustreError: 6975:0:(events.c:305:request_in_callback()) event type 2, status -5, service ost
[ 8067.331917] LustreError: 26344:0:(pack_generic.c:590:__lustre_unpack_msg()) message length 0 too small for magic/version check
[ 8067.332999] LustreError: 26344:0:(sec.c:2068:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.9.4.191@tcp x1619515714571600
[ 8077.892741] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-vbr test_8b: @@@@@@ FAIL: Restart of mds1 failed! 
[ 8078.081333] Lustre: DEBUG MARKER: replay-vbr test_8b: @@@@@@ FAIL: Restart of mds1 failed!


 Comments   
Comment by James Nunez (Inactive) [ 25/Apr/19 ]

We're seeing a very similar issue, where an OSS cannot be started during recovery-mds-scale test_failover_ost: https://testing.whamcloud.com/test_sets/a2fd473a-6632-11e9-bd0e-52540065bddc .

From the suite_log, we see:

trevis-34vm6: losetup: /dev/lvm-Role_OSS/P1: failed to set up loop device: No such file or directory
CMD: trevis-34vm6 test -b /dev/lvm-Role_OSS/P1
CMD: trevis-34vm6 e2label /dev/lvm-Role_OSS/P1
trevis-34vm6: e2label: No such file or directory while trying to open /dev/lvm-Role_OSS/P1
trevis-34vm6: Couldn't find valid filesystem superblock.
Starting ost1:   -o loop /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
CMD: trevis-34vm6 mkdir -p /mnt/lustre-ost1; mount -t lustre   -o loop /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
trevis-34vm6: mount: /dev/lvm-Role_OSS/P1: failed to setup loop device: No such file or directory
Start of /dev/lvm-Role_OSS/P1 on ost1 failed 32
 recovery-mds-scale test_failover_ost: @@@@@@ FAIL: Restart of ost1 failed! 
Comment by James Nunez (Inactive) [ 17/Oct/19 ]

We're seeing the same errors in non-failover testing as well, with all tests failing for performance-sanity: https://testing.whamcloud.com/test_sets/c9195b96-e591-11e9-9874-52540065bddc .

Generated at Sat Feb 10 02:47:00 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.