[LU-11795] replay-vbr test 8b fails with 'Restart of mds1 failed!'

Details

Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.1, Lustre 2.12.4
Labels:
- failover

Severity:
3

Description

replay-vbr test_8b fails with 'Restart of mds1 failed!'. So far, this test has failed only once: https://testing.whamcloud.com/test_sets/4fca0808-fd1b-11e8-8512-52540065bddc

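The client test_log shows that the backing device for mds1, /dev/lvm-Role_MDS/P1, cannot be found when the MDS is restarted; losetup, e2label, and mount all fail with 'No such file or directory':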

mount facets: mds1
CMD: trevis-16vm8 dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
CMD: trevis-16vm8 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-16vm8 loop_dev=\$(losetup -j /dev/lvm-Role_MDS/P1 | cut -d : -f 1);
			 if [[ -z \$loop_dev ]]; then
				loop_dev=\$(losetup -f);
				losetup \$loop_dev /dev/lvm-Role_MDS/P1 || loop_dev=;
			 fi;
			 echo -n \$loop_dev
trevis-16vm8: losetup: /dev/lvm-Role_MDS/P1: failed to set up loop device: No such file or directory
CMD: trevis-16vm8 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-16vm8 e2label /dev/lvm-Role_MDS/P1
trevis-16vm8: e2label: No such file or directory while trying to open /dev/lvm-Role_MDS/P1
trevis-16vm8: Couldn't find valid filesystem superblock.
Starting mds1:   -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
CMD: trevis-16vm8 mkdir -p /mnt/lustre-mds1; mount -t lustre   -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
trevis-16vm8: mount: /dev/lvm-Role_MDS/P1: failed to setup loop device: No such file or directory
Start of /dev/lvm-Role_MDS/P1 on mds1 failed 32
 replay-vbr test_8b: @@@@@@ FAIL: Restart of mds1 failed!
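
When triaging a failure like the one above, it can help to record what state the backing device path is actually in before the loop setup is attempted. The following is a minimal sketch only; `check_backing_dev` is a hypothetical helper and is not part of the Lustre test framework. It uses only POSIX `test` operators to tell a block device, a regular file, a dangling symlink, and a missing path apart.

```shell
# Hypothetical diagnostic helper (an assumption, not test-framework code):
# classify a backing device path so a failed restart can be triaged faster.
check_backing_dev() {
    dev=$1
    if [ -b "$dev" ]; then
        echo "block device"          # usable directly, no loop device needed
    elif [ -f "$dev" ]; then
        echo "regular file"          # would need a loop device
    elif [ -e "$dev" ]; then
        echo "unexpected type"       # exists, but neither block device nor file
    elif [ -L "$dev" ]; then
        echo "dangling symlink"      # link exists, its target does not
    else
        echo "missing"               # nothing at this path at all
    fi
}

# Example with the path taken from the log above.
check_backing_dev /dev/lvm-Role_MDS/P1
```

A result of "missing" or "dangling symlink" on the MDS node would match the 'No such file or directory' errors in the log, but this sketch only classifies the path and does not explain why the device went away.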

In the MDS1 (vm8) console log, replay-vbr test 8a starts, MDS1 disconnects, and some stack traces follow (possibly for the replay-vbr test_8c hang); the next Lustre test script output is for replay-single test 0a. The MDS2 (vm7) console log has similar content.

In the dmesg log for the OSS (vm5), we see some errors during test 8b:

[ 8057.763283] Lustre: DEBUG MARKER: == replay-vbr test 8b: create | unlink, create shouldn't fail ======================================== 17:11:00 (1544490660)
[ 8058.291385] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-16vm3: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[ 8058.489932] Lustre: DEBUG MARKER: trevis-16vm3: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[ 8067.329126] LNetError: 6975:0:(socklnd.c:1679:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.4.191@tcp[1], ip 10.9.4.191:7988, with error, wanted: 152, left: 152, last alive is 5 secs ago
[ 8067.330996] LustreError: 6975:0:(events.c:305:request_in_callback()) event type 2, status -5, service ost
[ 8067.331917] LustreError: 26344:0:(pack_generic.c:590:__lustre_unpack_msg()) message length 0 too small for magic/version check
[ 8067.332999] LustreError: 26344:0:(sec.c:2068:sptlrpc_svc_unwrap_request()) error unpacking request from 12345-10.9.4.191@tcp x1619515714571600
[ 8077.892741] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-vbr test_8b: @@@@@@ FAIL: Restart of mds1 failed! 
[ 8078.081333] Lustre: DEBUG MARKER: replay-vbr test_8b: @@@@@@ FAIL: Restart of mds1 failed!
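
To pull only the error lines for one test out of a long dmesg or console log, the `DEBUG MARKER: == <suite> test <N>:` lines can be used as window boundaries. The following is a rough sketch under that assumption; `errors_for_test` is a hypothetical helper name, and the marker format is taken from the log excerpt above rather than from any documented interface.

```shell
# Hypothetical helper: read a kernel log on stdin and print the
# LustreError/LNetError lines that fall between the "== <suite> test <N>:"
# DEBUG MARKER for the requested test and the next "== ..." marker.
errors_for_test() {
    awk -v suite="$1" -v t="$2" '
        # A test-start marker opens the window only if it names our test;
        # any other test-start marker closes it.
        /DEBUG MARKER: == / {
            inside = (index($0, "== " suite " test " t ":") > 0)
            next
        }
        # Inside the window, keep only error lines.
        inside && /(LustreError|LNetError):/ { print }
    '
}

# Usage (reads the log from stdin):
#   dmesg | errors_for_test replay-vbr 8b
```

Applied to the excerpt above, this would keep the four LNetError/LustreError lines logged between the test 8b start marker and the FAIL marker, and drop the `lctl mark` lines.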

Issue Links

is related to

LU-9707 Failover: recovery-random-scale test_fail_client_mds: Restart of mds1 failed!

Open

People

Assignee: WC Triage

Reporter: James Nunez (Inactive)

Votes: 0

Watchers: 1

Dates

Created: 17/Dec/18 6:00 PM

Updated: 08/Oct/20 3:03 AM