Lustre / LU-12245

replay-vbr test 5b fails with 'Restart of mds1 failed!'

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.1, Lustre 2.12.3
    • Environment: SLES12 SP4
    • Severity: 3

    Description

      replay-vbr test_5b fails with 'Restart of mds1 failed!'; so far, this has been seen only on SLES12 SP4.

      Looking at the suite_log for a recent failure, with logs at https://testing.whamcloud.com/test_sets/be1853f6-6692-11e9-8bb1-52540065bddc, we see that mounting the failed-over MDS does not work:

      Failover mds1 to trevis-35vm7
      00:10:41 (1556089841) waiting for trevis-35vm7 network 900 secs ...
      00:10:41 (1556089841) network interface is UP
      CMD: trevis-35vm7 hostname
      mount facets: mds1
      CMD: trevis-35vm7 dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
      CMD: trevis-35vm7 test -b /dev/lvm-Role_MDS/P1
      CMD: trevis-35vm7 loop_dev=\$(losetup -j /dev/lvm-Role_MDS/P1 | cut -d : -f 1);
      			 if [[ -z \$loop_dev ]]; then
      				loop_dev=\$(losetup -f);
      				losetup \$loop_dev /dev/lvm-Role_MDS/P1 || loop_dev=;
      			 fi;
      			 echo -n \$loop_dev
      trevis-35vm7: losetup: /dev/lvm-Role_MDS/P1: failed to set up loop device: No such file or directory
      CMD: trevis-35vm7 test -b /dev/lvm-Role_MDS/P1
      CMD: trevis-35vm7 e2label /dev/lvm-Role_MDS/P1
      trevis-35vm7: e2label: No such file or directory while trying to open /dev/lvm-Role_MDS/P1
      trevis-35vm7: Couldn't find valid filesystem superblock.
      Starting mds1:   -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
      CMD: trevis-35vm7 mkdir -p /mnt/lustre-mds1; mount -t lustre   -o loop /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
      trevis-35vm7: mount: /dev/lvm-Role_MDS/P1: failed to setup loop device: No such file or directory
      Start of /dev/lvm-Role_MDS/P1 on mds1 failed 32
       replay-vbr test_5b: @@@@@@ FAIL: Restart of mds1 failed! 
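
      Both the losetup and mount failures report 'No such file or directory' for /dev/lvm-Role_MDS/P1, which points at the LVM device node being absent (or the LV not activated) on the failover node, rather than at the loop setup itself. A minimal diagnostic sketch, assuming the volume group is named lvm-Role_MDS as in the log above; these are standard LVM/util-linux commands, not part of the test framework:

      # Run on the failover node (trevis-35vm7).
      # Is the LV known to LVM, and is it active? (5th lv_attr char is 'a' when active)
      lvs --noheadings -o lv_name,lv_attr lvm-Role_MDS
      test -b /dev/lvm-Role_MDS/P1 || echo "device node missing"
      # If the VG is visible but inactive, activating it recreates the device node:
      vgchange -ay lvm-Role_MDS
      # After that, the framework's loop setup should succeed:
      losetup -f /dev/lvm-Role_MDS/P1 && losetup -j /dev/lvm-Role_MDS/P1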
      

      Looking at the console log for MDS1 (vm8), we see the replay barrier being set up and then the console disconnecting as the node goes down for failover:

       [  266.800298] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-vbr test 5b: link checks version of target parent ========================================== 00:10:22 \(1556089822\)
      [  266.982167] Lustre: DEBUG MARKER: == replay-vbr test 5b: link checks version of target parent ========================================== 00:10:22 (1556089822)
      [  267.088184] Lustre: lustre-MDT0000: Connection restored to 144cd783-70f4-6475-93c8-b9b0a8f6fd6b (at 10.9.5.120@tcp)
      [  267.089958] Lustre: Skipped 6 previous similar messages
      [  267.890140] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdd.lustre-MDT0000.sync_permission=0
      [  268.211897] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdt.lustre-MDT0000.commit_on_sharing=0
      [  268.641380] Lustre: DEBUG MARKER: sync; sync; sync
      [  269.842265] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
      [  270.163063] Lustre: DEBUG MARKER: modprobe dm-flakey;
      [  270.163063] 			 dmsetup targets | grep -q flakey
      [  270.487834] Lustre: DEBUG MARKER: dmsetup table /dev/mapper/mds1_flakey
      [  270.820089] Lustre: DEBUG MARKER: dmsetup suspend --nolockfs --noflush /dev/mapper/mds1_flakey
      [  271.139890] Lustre: DEBUG MARKER: dmsetup load /dev/mapper/mds1_flakey --table "0 20971520 flakey 252:0 0 0 1800 1 drop_writes"
      [  271.459528] Lustre: DEBUG MARKER: dmsetup resume /dev/mapper/mds1_flakey
      [  271.812452] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
      [  271.975373] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
      [  272.633700] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
      [  272.964835] Lustre: DEBUG MARKER: modprobe dm-flakey;
      [  272.964835] 			 dmsetup targets | grep -q flakey
      [  273.353504] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
      
      <ConMan> Console [trevis-35vm8] disconnected from <trevis-35:6007> at 04-24 07:10.
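
      For context, the replay barrier here is implemented with dm-flakey. The table loaded above, "0 20971520 flakey 252:0 0 0 1800 1 drop_writes", parses as <start> <length> flakey <dev> <offset> <up_interval> <down_interval> <num_features> <feature>: the device stays "down" for 1800 seconds with the drop_writes feature, which silently discards writes while reads still succeed. An annotated restatement of the sequence from the log, with values copied verbatim:

      dmsetup suspend --nolockfs --noflush /dev/mapper/mds1_flakey
      # up_interval=0, down_interval=1800: the device is "down" for 1800s;
      # drop_writes discards writes silently instead of failing them,
      # freezing the on-disk image at the barrier point for replay.
      dmsetup load /dev/mapper/mds1_flakey --table "0 20971520 flakey 252:0 0 0 1800 1 drop_writes"
      dmsetup resume /dev/mapper/mds1_flakey

      The failover node is then expected to mount this frozen image, which is the step failing in the suite_log above.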
      

      Looking at the console log for the failover MDS (vm7), we see no indication that the MDS failed over, and we do not see replay-vbr test 5c start on either MDS.

      There are no other replay-vbr test 5b failures like this in the past four months.

    People

    • Assignee: wc-triage (WC Triage)
    • Reporter: jamesanunez James Nunez (Inactive)
