Lustre / LU-10540

recovery-small test 104 fails with 'ir status on ost1 should be DISABLED'

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.6, Lustre 2.10.7
    • None
    • SLES12 SP2 and SP3 environments
    • 3

    Description

      recovery-small test_104 fails in full and failover test sessions, so far only on SLES12 SP2 and SLES12 SP3.

      Looking at the client test_log, we see two failures:

      Started lustre-OST0000
      CMD: trevis-7vm7 /usr/sbin/lctl get_param -n obdfilter.lustre-OST0000.recovery_status |
      			awk '/status:/{ print \$2}'
      CMD: trevis-7vm7 lctl get_param -n obdfilter.lustre-OST0000.recovery_status |
                                     awk '/IR:/{ print \$2}'
      /usr/lib64/lustre/tests/recovery-small.sh: line 1630: [: too many arguments
       recovery-small test_104: @@@@@@ FAIL: Error state , must be ENABLED or DISABLED 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5335:error()
        = /usr/lib64/lustre/tests/recovery-small.sh:1631:check_target_ir_state()
        = /usr/lib64/lustre/tests/recovery-small.sh:1873:test_104()
        = /usr/lib64/lustre/tests/test-framework.sh:5611:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5650:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:5497:run_test()
        = /usr/lib64/lustre/tests/recovery-small.sh:1877:main()
      CMD: trevis-7vm5,trevis-7vm6,trevis-7vm7,trevis-7vm8 /usr/sbin/lctl dk > /home/autotest/autotest/logs/test_logs/2018-01-19/lustre-master-patchless-sles12sp3-x86_64--full--1_5_1__58___d9f8a5c0-4038-4a31-8ae1-d00da7add1bf/recovery-small.test_104.debug_log.\$(hostname -s).1516426665.log;
               dmesg > /home/autotest/autotest/logs/test_logs/2018-01-19/lustre-master-patchless-sles12sp3-x86_64--full--1_5_1__58___d9f8a5c0-4038-4a31-8ae1-d00da7add1bf/recovery-small.test_104.dmesg.\$(hostname -s).1516426665.log
      CMD: trevis-7vm5,trevis-7vm6,trevis-7vm7,trevis-7vm8 lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      /usr/lib64/lustre/tests/recovery-small.sh: line 1874: [: too many arguments
       recovery-small test_104: @@@@@@ FAIL: ir status on ost1 should be DISABLED 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5335:error()
        = /usr/lib64/lustre/tests/recovery-small.sh:1875:test_104()
        = /usr/lib64/lustre/tests/test-framework.sh:5611:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5650:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:5497:run_test()
        = /usr/lib64/lustre/tests/recovery-small.sh:1877:main()
      

      The first error comes from the routine check_target_ir_state():

      1616 check_target_ir_state()
      1617 {
      1618         local target=${1}
      1619         local name=${target}_svc
      1620         local recovery_proc=obdfilter.${!name}.recovery_status
      1621         local st
      1622 
      1623         while : ; do
      1624                 st=$(do_facet $target "$LCTL get_param -n $recovery_proc |
      1625                         awk '/status:/{ print \\\$2}'")
      1626                 [ x$st = xRECOVERING ] || break
      1627         done
      1628         st=$(do_facet $target "lctl get_param -n $recovery_proc |
      1629                                awk '/IR:/{ print \\\$2}'")
      1630         [ $st != ON -o $st != OFF -o $st != ENABLED -o $st != DISABLED ] ||
      1631                 error "Error state $st, must be ENABLED or DISABLED"
      1632         echo -n $st
      1633 }
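
      One possible reading of the check at line 1630, offered as a sketch rather than a confirmed diagnosis: four != tests joined with -o cannot all be false for a single value, so whenever [ parses its arguments successfully the condition is true and error() is never called. The error branch is reached only when [ itself fails, for example when the unquoted $st expands to nothing or to several words. The helper names below (is_valid_ir_state, original_logic) are hypothetical and do not exist in recovery-small.sh; this is an illustration, not the project's fix.

```shell
#!/bin/sh
# Hypothetical helper (not in recovery-small.sh): succeeds only when the
# argument is one of the four IR states named in the original check.
is_valid_ir_state() {
	case "$1" in
	ON|OFF|ENABLED|DISABLED) return 0 ;;
	*) return 1 ;;
	esac
}

# Quoted re-expression of the logic at line 1630: four "!=" tests ORed
# together. No single value equals all four strings, so this succeeds
# for every input, valid or not.
original_logic() {
	[ "$1" != ON ] || [ "$1" != OFF ] ||
		[ "$1" != ENABLED ] || [ "$1" != DISABLED ]
}

# Compare both checks on valid, invalid, and empty values.
for st in DISABLED OFF bogus ""; do
	if is_valid_ir_state "$st"; then
		echo "case check:  '$st' accepted"
	else
		echo "case check:  '$st' rejected"
	fi
	if original_logic "$st"; then
		echo "ORed checks: '$st' passes (error branch not taken)"
	fi
done
```

      Quoting "$1" keeps an empty value as a single argument, so the check rejects it instead of failing with a [ syntax error.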
      

      The second error comes from test_104 itself, in the following test code, and is a consequence of the earlier check_target_ir_state() failure:

      1873         local ir_state=$(check_target_ir_state ost1)
      1874         [ $ir_state = "DISABLED" -o $ir_state = "OFF" ] ||
      1875                 error "ir status on ost1 should be DISABLED"
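
      The comparison at line 1874 also leaves $ir_state unquoted, so an empty or multi-word value captured from check_target_ir_state can make [ fail before any comparison happens. A minimal sketch of a quoted form follows, using a hypothetical helper name (ir_is_disabled) that is not part of the test suite:

```shell
#!/bin/sh
# Hypothetical helper (not in recovery-small.sh): succeeds when the
# reported IR state means "disabled". Quoting keeps empty or multi-word
# input as one argument, so [ never sees a malformed expression.
ir_is_disabled() {
	[ "$1" = "DISABLED" ] || [ "$1" = "OFF" ]
}

# Example: an empty state is reported as a mismatch, not a syntax error.
ir_state=""
if ir_is_disabled "$ir_state"; then
	echo "IR is disabled"
else
	echo "unexpected IR state: '$ir_state'"
fi
```

      With this form, a failure in the helper that produces the state surfaces as a plain mismatch, and the captured value can be printed for diagnosis.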
      

      This test started failing on 2018-01-09 for lustre-master-patchless branch build #53 and lustre-master branch build #3693. Logs for these failures are at
      https://testing.hpdd.intel.com/test_sets/ab46a562-f595-11e7-a169-52540065bddc
      https://testing.hpdd.intel.com/test_sets/0f1fc7ca-f68c-11e7-a7cd-52540065bddc
      https://testing.hpdd.intel.com/test_sets/8802771e-fde4-11e7-a7cd-52540065bddc

          Activity

            yujian Jian Yu added a comment - +1 on master branch: https://testing.whamcloud.com/test_sets/ec4ca7c6-c346-11e8-b748-52540065bddc
            mdiep Minh Diep added a comment - +1 on 2.10 https://testing.hpdd.intel.com/test_sets/808b8dda-3aa3-11e8-8f8a-52540065bddc  

            jamesanunez James Nunez (Inactive) added a comment - Here's a recent example of this failure: https://testing.hpdd.intel.com/test_sessions/b012c507-db61-4fe2-8611-3f30754a759a. This session runs SLES12 SP3 with DNE configured: two MDTs on each of two MDSs, a single OSS with seven OSTs, and two clients. This is the same configuration that is run for review-dne-part-* testing.
            bogl Bob Glossman (Inactive) added a comment (edited) - I note that all the reported test sets with the failure are 'full' or 'failover' tests. None are the more common 'review' tests. Is there some difference in the execution environment or test config of 'full' and 'failover' that might be significant?

            gerrit Gerrit Updater added a comment - Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/31233
            Subject: LU-10540 test: try to reproduce reported test fail
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 52683b43d3ada5b3149754928757598d190c370e

            bogl Bob Glossman (Inactive) added a comment - I note that the calls to lctl in the script are inconsistent. Some use 'lctl', others use '$LCTL'. I don't know why that would cause problems, but it does seem odd.

            Doesn't explain why I can't reproduce the failure.

            bogl Bob Glossman (Inactive) added a comment - I can't reproduce the failure in order to see what is wrong. I have tried on both sles12sp2 and sles12sp3 and am not seeing any error at line 1630 as reported in the logs. The call to check_target_ir_state() is returning DISABLED without error.
            pjones Peter Jones added a comment - Bob, can you please advise? Peter

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)