[LU-12245] replay-vbr test 5b fails with 'Restart of mds1 failed!' Created: 29/Apr/19  Updated: 15/Nov/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.1, Lustre 2.12.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: failover
Environment:

SLES12 SP4


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-vbr test_5b fails with 'Restart of mds1 failed!'; so far, this occurs only on SLES12 SP4.

Looking at the suite_log for a recent failure, with logs at https://testing.whamcloud.com/test_sets/be1853f6-6692-11e9-8bb1-52540065bddc , we see that mounting the failed-over MDS does not work:

Failover mds1 to trevis-35vm7
00:10:41 (1556089841) waiting for trevis-35vm7 network 900 secs ...
00:10:41 (1556089841) network interface is UP
CMD: trevis-35vm7 hostname
mount facets: mds1
CMD: trevis-35vm7 dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
CMD: trevis-35vm7 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-35vm7 loop_dev=\$(losetup -j /dev/lvm-Role_MDS/P1 | cut -d : -f 1);
			 if [[ -z \$loop_dev ]]; then
				loop_dev=\$(losetup -f);
				losetup \$loop_dev /dev/lvm-Role_MDS/P1 || loop_dev=;
			 fi;
			 echo -n \$loop_dev
trevis-35vm7: losetup: /dev/lvm-Role_MDS/P1: failed to set up loop device: No such file or directory
CMD: trevis-35vm7 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-35vm7 e2label /dev/lvm-Role_MDS/P1
trevis-35vm7: e2label: No such file or directory while trying to open /dev/lvm-Role_MDS/P1
trevis-35vm7: Couldn't find valid filesystem superblock.
Starting mds1:   -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
CMD: trevis-35vm7 mkdir -p /mnt/lustre-mds1; mount -t lustre   -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
trevis-35vm7: mount: /dev/lvm-Role_MDS/P1: failed to setup loop device: No such file or directory
Start of /dev/lvm-Role_MDS/P1 on mds1 failed 32
 replay-vbr test_5b: @@@@@@ FAIL: Restart of mds1 failed! 
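The loop-device logic in the log above can be sketched as a small function (a reconstruction of what the suite_log shows, not the framework's actual code; the device path /dev/lvm-Role_MDS/P1 is taken from the log). It reuses a loop device already bound to the backing device, otherwise allocates a free one:

```shell
#!/bin/sh
# Sketch of the loop-device attach step from the suite_log above.
attach_loop() {
    backing=$1
    # Is a loop device already bound to this backing device?
    loop_dev=$(losetup -j "$backing" | cut -d : -f 1)
    if [ -z "$loop_dev" ]; then
        loop_dev=$(losetup -f)                  # next free /dev/loopN
        losetup "$loop_dev" "$backing" || loop_dev=
    fi
    # Empty output here is the failure mode seen in the log:
    # "failed to set up loop device: No such file or directory"
    printf '%s' "$loop_dev"
}
```

In the failing run, losetup reports ENOENT for /dev/lvm-Role_MDS/P1, i.e. the LVM volume does not exist (or was never activated) on trevis-35vm7, so every subsequent step (e2label, mount) fails on the same missing path.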

Looking at the console log for MDS1 (vm8), we see the MDS failover:

 [  266.800298] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-vbr test 5b: link checks version of target parent ========================================== 00:10:22 \(1556089822\)
[  266.982167] Lustre: DEBUG MARKER: == replay-vbr test 5b: link checks version of target parent ========================================== 00:10:22 (1556089822)
[  267.088184] Lustre: lustre-MDT0000: Connection restored to 144cd783-70f4-6475-93c8-b9b0a8f6fd6b (at 10.9.5.120@tcp)
[  267.089958] Lustre: Skipped 6 previous similar messages
[  267.890140] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdd.lustre-MDT0000.sync_permission=0
[  268.211897] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdt.lustre-MDT0000.commit_on_sharing=0
[  268.641380] Lustre: DEBUG MARKER: sync; sync; sync
[  269.842265] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
[  270.163063] Lustre: DEBUG MARKER: modprobe dm-flakey;
[  270.163063] 			 dmsetup targets | grep -q flakey
[  270.487834] Lustre: DEBUG MARKER: dmsetup table /dev/mapper/mds1_flakey
[  270.820089] Lustre: DEBUG MARKER: dmsetup suspend --nolockfs --noflush /dev/mapper/mds1_flakey
[  271.139890] Lustre: DEBUG MARKER: dmsetup load /dev/mapper/mds1_flakey --table "0 20971520 flakey 252:0 0 0 1800 1 drop_writes"
[  271.459528] Lustre: DEBUG MARKER: dmsetup resume /dev/mapper/mds1_flakey
[  271.812452] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
[  271.975373] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[  272.633700] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
[  272.964835] Lustre: DEBUG MARKER: modprobe dm-flakey;
[  272.964835] 			 dmsetup targets | grep -q flakey
[  273.353504] Lustre: DEBUG MARKER: /usr/sbin/lctl dl

<ConMan> Console [trevis-35vm8] disconnected from <trevis-35:6007> at 04-24 07:10.
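The replay-barrier sequence visible in the console log above can be sketched as follows (a reconstruction from the log, not the framework's actual helper; the device name and table geometry `0 20971520 flakey 252:0 ...` are copied from the log, and the RUN variable is added here so the sketch can be dry-run without root). Loading a dm-flakey table with drop_writes makes the MDT device silently discard all writes, so everything after the barrier is lost on the "failed" node and must be recovered by replay after failover:

```shell
#!/bin/sh
# Sketch of the dm-flakey replay barrier from the console log above.
RUN=${RUN:-echo}    # default: dry-run (print commands); set RUN= to execute

replay_barrier() {
    dev=$1          # e.g. /dev/mapper/mds1_flakey
    # Quiesce the device without flushing pending I/O
    $RUN dmsetup suspend --nolockfs --noflush "$dev"
    # New table: up-interval 1800s, 1 feature arg: drop_writes
    $RUN dmsetup load "$dev" --table "0 20971520 flakey 252:0 0 0 1800 1 drop_writes"
    # Resume with the new table in place; writes are now dropped
    $RUN dmsetup resume "$dev"
}
```

The ConMan disconnect right after this sequence suggests the node was powered off as part of the simulated failure, before any further console output could appear.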

Looking at the failover MDS (vm7), we see no indication that the MDS failed over, and replay-vbr test 5c does not start on either MDS.

There are no other replay-vbr test 5b failures like this in the past four months.



 Comments   
Comment by James Nunez (Inactive) [ 15/Oct/19 ]

We see a similar test hang for replay-single test 101 at https://testing.whamcloud.com/test_sets/6c0fb4d6-ea6e-11e9-be86-52540065bddc

Comment by Sebastien Buisson [ 15/Nov/19 ]

Possibly a new occurrence via recovery-small test_136:
https://testing.whamcloud.com/test_sets/f00a801a-0722-11ea-b934-52540065bddc

Generated at Sat Feb 10 02:50:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.