Lustre / LU-11791

recovery-mds-scale test failover_ost fails with 'test_failover_ost returned 1'


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0, Lustre 2.10.7, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.15.3, Lustre 2.15.6
    • Severity: 3
    • 9223372036854775807

    Description

      recovery-mds-scale test_failover_ost fails with 'test_failover_ost returned 1'

      Looking at the client test_log from https://testing.whamcloud.com/test_sets/e36f9e0c-fea5-11e8-b837-52540065bddc , we see that there were several successful OST failovers before one of the client loads failed (the end_run_file mechanism that reports this is sketched after the log):

      Found the END_RUN_FILE file: /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/shared_dir/end_run_file
      trevis-25vm8.trevis.whamcloud.com
      Client load  failed on node trevis-25vm8.trevis.whamcloud.com:
      /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.run__stdout.trevis-25vm8.trevis.whamcloud.com.log
      /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.run__debug.trevis-25vm8.trevis.whamcloud.com.log
      2018-12-11 23:22:47 Terminating clients loads ...
      Duration:               86400
      Server failover period: 1200 seconds
      Exited after:           21768 seconds
      Number of failovers before exit:
      mds1: 0 times
      ost1: 3 times
      ost2: 1 times
      ost3: 6 times
      ost4: 1 times
      ost5: 6 times
      ost6: 0 times
      ost7: 2 times
      Status: FAIL: rc=1
      CMD: trevis-25vm7,trevis-25vm8 test -f /tmp/client-load.pid &&
              { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
      trevis-25vm8: sh: line 1: kill: (11606) - No such process
      trevis-25vm7: sh: line 1: kill: (18301) - No such process
      Dumping lctl log to /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.*.1544570568.log
      CMD: trevis-25vm10,trevis-25vm11,trevis-25vm12,trevis-25vm8.trevis.whamcloud.com,trevis-25vm9 /usr/sbin/lctl dk > /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.debug_log.\$(hostname -s).1544570568.log;
               dmesg > /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.dmesg.\$(hostname -s).1544570568.log
      trevis-25vm9: invalid parameter 'dump_kernel'
      trevis-25vm9: open(dump_kernel) failed: No such file or directory
      trevis-25vm12: invalid parameter 'dump_kernel'
      trevis-25vm12: open(dump_kernel) failed: No such file or directory
      test_failover_ost returned 1
      FAIL failover_ost (22821s)
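
      The harness coordinates the client loads through the shared end_run_file seen above. The following is a minimal sketch of that protocol, reconstructed from the log messages; the function names here are illustrative, not the actual test-framework helpers:

      END_RUN_FILE=$SHARED_DIRECTORY/end_run_file

      # Each run_*.sh client load records its hostname on a fatal error,
      # which is how the log above can name trevis-25vm8 as the failed node.
      signal_load_failure() {
          [ -e "$END_RUN_FILE" ] || echo "$(hostname)" > "$END_RUN_FILE"
      }

      # The master node polls for the file between failovers; its
      # appearance terminates all client loads and the test reports
      # 'test_failover_ost returned 1'.
      check_client_loads() {
          if [ -f "$END_RUN_FILE" ]; then
              echo "Found the END_RUN_FILE file: $END_RUN_FILE"
              cat "$END_RUN_FILE"   # e.g. trevis-25vm8.trevis.whamcloud.com
              return 1
          fi
          return 0
      }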
      

      Looking at the logs from Client 3 (vm8), we can see problems with tar. From the run_tar_debug log, we see the tar client load return a non-zero status (rc=2); a reconstruction of this loop follows the trace:

      2018-12-11 22:58:05: tar run starting
      + mkdir -p /mnt/lustre/d0.tar-trevis-25vm8.trevis.whamcloud.com
      + cd /mnt/lustre/d0.tar-trevis-25vm8.trevis.whamcloud.com
      + sync
      ++ du -s /etc
      ++ awk '{print $1}'
      + USAGE=34864
      + /usr/sbin/lctl set_param 'llite.*.lazystatfs=0'
      + df /mnt/lustre/d0.tar-trevis-25vm8.trevis.whamcloud.com
      + sleep 2
      ++ df /mnt/lustre/d0.tar-trevis-25vm8.trevis.whamcloud.com
      ++ awk '/:/ { print $4 }'
      + FREE_SPACE=9359360
      + AVAIL=4211712
      + '[' 4211712 -lt 34864 ']'
      + do_tar
      + tar cf - /etc
      + tar xf -
      tar: Removing leading `/' from member names
      + return 2
      + RC=2
      ++ grep 'exit delayed from previous errors' /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/recovery-mds-scale.test_failover_ost.run_tar_stdout.trevis-25vm8.log
      + PREV_ERRORS=
      + true
      + '[' 2 -ne 0 -a '' -a '' ']'
      + '[' 2 -eq 0 ']'
      ++ date '+%F %H:%M:%S'
      + echoerr '2018-12-11 23:17:05: tar failed'
      + echo '2018-12-11 23:17:05: tar failed'
      2018-12-11 23:17:05: tar failed
      + '[' -z '' ']'
      ++ hostname
      + echo trevis-25vm8.trevis.whamcloud.com
      + '[' ']'
      + '[' '!' -e /autotest/trevis/2018-12-10/lustre-master-el7_6-x86_64--failover--1_32_1__3837___6af7940a-41a2-4a12-b890-ae54e8237ab3/shared_dir/end_run_file ']'
      ++ date '+%F %H:%M:%S'
      + echoerr '2018-12-11 23:17:05: tar run exiting'
      + echo '2018-12-11 23:17:05: tar run exiting'
      2018-12-11 23:17:05: tar run exiting
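
      The trace above is the main loop of the run_tar.sh client load. Below is a condensed reconstruction of its failure handling, pieced together from the -x output; the two empty operands in the '[ 2 -ne 0 -a '' -a '' ]' test are inferred to be an "errors OK" flag and the PREV_ERRORS grep result, so the variable names ERRORS_OK, LOG, and the echoerr definition are assumptions:

      echoerr() { echo "$@"; }   # the trace shows echoerr expanding to echo

      do_tar() {
          # Stream /etc through tar; the extract side writes into the
          # Lustre test directory, so an OST failover surfaces here as EIO.
          tar cf - /etc | tar xf -
          return ${PIPESTATUS[1]}
      }

      while [ ! -e "$END_RUN_FILE" ]; do
          do_tar
          RC=$?
          # Tolerate tar's delayed exit status only when errors were
          # already recorded and the load is configured to ignore them.
          PREV_ERRORS=$(grep 'exit delayed from previous errors' "$LOG")
          if [ $RC -ne 0 -a "$ERRORS_OK" -a "$PREV_ERRORS" ]; then
              RC=0
          fi
          if [ $RC -eq 0 ]; then
              echoerr "$(date +'%F %H:%M:%S'): tar succeeded"
          else
              echoerr "$(date +'%F %H:%M:%S'): tar failed"
              # A hard failure ends the run for every client load.
              [ -z "$ERRORS_OK" ] && echo "$(hostname)" >> "$END_RUN_FILE"
              break
          fi
      done
      echoerr "$(date +'%F %H:%M:%S'): tar run exiting"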
      

      From the run_tar_stdout log, we see the write error that made tar exit non-zero:

      tar: etc/mke2fs.conf: Cannot write: Input/output error
      tar: Exiting with failure status due to previous errors
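
      The Input/output error is the extract-side tar writing into the Lustre mount while the OST was being failed over; GNU tar keeps archiving after a member error and exits non-zero at the end, which matches the RC=2 in the debug trace. As a hypothetical triage step (not shown in the ticket), the failing file could be mapped to the OST its writes targeted:

      # Hypothetical triage step: 'lfs getstripe' prints the file's stripe
      # layout, and the obdidx column identifies the OST(s) backing it.
      lfs getstripe /mnt/lustre/d0.tar-$(hostname)/etc/mke2fs.conf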
      

            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (jamesanunez) (Inactive)
