Details
Bug
Resolution: Unresolved
Minor
None
Lustre 2.12.1, Lustre 2.12.3
SLES12 SP4
3
9223372036854775807
Description
replay-vbr test_5b fails with 'Restart of mds1 failed!'. So far, this has been seen only on SLES12 SP4.
Looking at the suite_log for a recent failure, with logs at https://testing.whamcloud.com/test_sets/be1853f6-6692-11e9-8bb1-52540065bddc , we see that mounting the failed-over MDS does not work:
Failover mds1 to trevis-35vm7
00:10:41 (1556089841) waiting for trevis-35vm7 network 900 secs ...
00:10:41 (1556089841) network interface is UP
CMD: trevis-35vm7 hostname
mount facets: mds1
CMD: trevis-35vm7 dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
CMD: trevis-35vm7 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-35vm7 loop_dev=\$(losetup -j /dev/lvm-Role_MDS/P1 | cut -d : -f 1); if [[ -z \$loop_dev ]]; then loop_dev=\$(losetup -f); losetup \$loop_dev /dev/lvm-Role_MDS/P1 || loop_dev=; fi; echo -n \$loop_dev
trevis-35vm7: losetup: /dev/lvm-Role_MDS/P1: failed to set up loop device: No such file or directory
CMD: trevis-35vm7 test -b /dev/lvm-Role_MDS/P1
CMD: trevis-35vm7 e2label /dev/lvm-Role_MDS/P1
trevis-35vm7: e2label: No such file or directory while trying to open /dev/lvm-Role_MDS/P1
trevis-35vm7: Couldn't find valid filesystem superblock.
Starting mds1: -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
CMD: trevis-35vm7 mkdir -p /mnt/lustre-mds1; mount -t lustre -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
trevis-35vm7: mount: /dev/lvm-Role_MDS/P1: failed to setup loop device: No such file or directory
Start of /dev/lvm-Role_MDS/P1 on mds1 failed 32
replay-vbr test_5b: @@@@@@ FAIL: Restart of mds1 failed!
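The "No such file or directory" errors from losetup, e2label and mount in the log above would be consistent with /dev/lvm-Role_MDS/P1 not being present on trevis-35vm7 at the time of the failover, although the log alone does not confirm that. As a rough diagnostic sketch, something like the following could be run on the failover node to see how the path looks there. The helper name classify_device is hypothetical and is not part of the Lustre test framework; only the device path is taken from the log.

```shell
#!/bin/bash
# Hypothetical helper (NOT part of the Lustre test framework): report how a
# target device path appears on the local node.
classify_device() {
    local dev=$1
    if [ -b "$dev" ]; then
        echo "block"      # real block device; no loop device is needed
    elif [ -f "$dev" ]; then
        echo "file"       # regular file; mounting would need a loop device
    elif [ -e "$dev" ]; then
        echo "other"      # exists, but is neither a block device nor a file
    else
        echo "missing"    # path does not exist on this node
    fi
}

# Example: check the MDS device path seen in the suite_log.
classify_device /dev/lvm-Role_MDS/P1
```

A result of "missing" on vm7 and "block" on vm8 would suggest the logical volume is simply not visible on the failover node, rather than a problem with the loop setup itself.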
Looking at the console log for MDS1 (vm8), we see the MDS failover:
[ 266.800298] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-vbr test 5b: link checks version of target parent ========================================== 00:10:22 \(1556089822\)
[ 266.982167] Lustre: DEBUG MARKER: == replay-vbr test 5b: link checks version of target parent ========================================== 00:10:22 (1556089822)
[ 267.088184] Lustre: lustre-MDT0000: Connection restored to 144cd783-70f4-6475-93c8-b9b0a8f6fd6b (at 10.9.5.120@tcp)
[ 267.089958] Lustre: Skipped 6 previous similar messages
[ 267.890140] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdd.lustre-MDT0000.sync_permission=0
[ 268.211897] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdt.lustre-MDT0000.commit_on_sharing=0
[ 268.641380] Lustre: DEBUG MARKER: sync; sync; sync
[ 269.842265] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
[ 270.163063] Lustre: DEBUG MARKER: modprobe dm-flakey;
[ 270.163063] dmsetup targets | grep -q flakey
[ 270.487834] Lustre: DEBUG MARKER: dmsetup table /dev/mapper/mds1_flakey
[ 270.820089] Lustre: DEBUG MARKER: dmsetup suspend --nolockfs --noflush /dev/mapper/mds1_flakey
[ 271.139890] Lustre: DEBUG MARKER: dmsetup load /dev/mapper/mds1_flakey --table "0 20971520 flakey 252:0 0 0 1800 1 drop_writes"
[ 271.459528] Lustre: DEBUG MARKER: dmsetup resume /dev/mapper/mds1_flakey
[ 271.812452] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
[ 271.975373] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[ 272.633700] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
[ 272.964835] Lustre: DEBUG MARKER: modprobe dm-flakey;
[ 272.964835] dmsetup targets | grep -q flakey
[ 273.353504] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
<ConMan> Console [trevis-35vm8] disconnected from <trevis-35:6007> at 04-24 07:10.
Looking at the failover MDS (vm7), we don’t see an indication that the MDS failed over and we don’t see replay-vbr test 5c start on either MDS.
There are no other replay-vbr test 5b failures like this in the past four months.