[LU-10540] recovery-small test 104 fails with 'ir status on ost1 should be DISABLED' Created: 21/Jan/18  Updated: 12/Mar/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.6, Lustre 2.10.7
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

SLES12 SP2 and SP3 environments


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

recovery-small test_104 fails in full and failover test sessions. So far, the failure has been seen only on SLES12 SP2 and SLES12 SP3.

Looking at the client test_log, we see two failures:

Started lustre-OST0000
CMD: trevis-7vm7 /usr/sbin/lctl get_param -n obdfilter.lustre-OST0000.recovery_status |
			awk '/status:/{ print \$2}'
CMD: trevis-7vm7 lctl get_param -n obdfilter.lustre-OST0000.recovery_status |
                               awk '/IR:/{ print \$2}'
/usr/lib64/lustre/tests/recovery-small.sh: line 1630: [: too many arguments
 recovery-small test_104: @@@@@@ FAIL: Error state , must be ENABLED or DISABLED 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5335:error()
  = /usr/lib64/lustre/tests/recovery-small.sh:1631:check_target_ir_state()
  = /usr/lib64/lustre/tests/recovery-small.sh:1873:test_104()
  = /usr/lib64/lustre/tests/test-framework.sh:5611:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5650:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5497:run_test()
  = /usr/lib64/lustre/tests/recovery-small.sh:1877:main()
CMD: trevis-7vm5,trevis-7vm6,trevis-7vm7,trevis-7vm8 /usr/sbin/lctl dk > /home/autotest/autotest/logs/test_logs/2018-01-19/lustre-master-patchless-sles12sp3-x86_64--full--1_5_1__58___d9f8a5c0-4038-4a31-8ae1-d00da7add1bf/recovery-small.test_104.debug_log.\$(hostname -s).1516426665.log;
         dmesg > /home/autotest/autotest/logs/test_logs/2018-01-19/lustre-master-patchless-sles12sp3-x86_64--full--1_5_1__58___d9f8a5c0-4038-4a31-8ae1-d00da7add1bf/recovery-small.test_104.dmesg.\$(hostname -s).1516426665.log
CMD: trevis-7vm5,trevis-7vm6,trevis-7vm7,trevis-7vm8 lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
/usr/lib64/lustre/tests/recovery-small.sh: line 1874: [: too many arguments
 recovery-small test_104: @@@@@@ FAIL: ir status on ost1 should be DISABLED 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5335:error()
  = /usr/lib64/lustre/tests/recovery-small.sh:1875:test_104()
  = /usr/lib64/lustre/tests/test-framework.sh:5611:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5650:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5497:run_test()
  = /usr/lib64/lustre/tests/recovery-small.sh:1877:main()

The first error comes from the routine check_target_ir_state():

1616 check_target_ir_state()
1617 {
1618         local target=${1}
1619         local name=${target}_svc
1620         local recovery_proc=obdfilter.${!name}.recovery_status
1621         local st
1622 
1623         while : ; do
1624                 st=$(do_facet $target "$LCTL get_param -n $recovery_proc |
1625                         awk '/status:/{ print \\\$2}'")
1626                 [ x$st = xRECOVERING ] || break
1627         done
1628         st=$(do_facet $target "lctl get_param -n $recovery_proc |
1629                                awk '/IR:/{ print \\\$2}'")
1630         [ $st != ON -o $st != OFF -o $st != ENABLED -o $st != DISABLED ] ||
1631                 error "Error state $st, must be ENABLED or DISABLED"
1632         echo -n $st
1633 }
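
One possible reading of the first failure, offered as a sketch rather than a confirmed diagnosis: the FAIL message prints an empty state ("Error state , must be ..."), which suggests $st was empty when line 1630 ran. With an unquoted empty $st, [ receives a malformed expression and returns a non-zero status, so the "|| error" branch runs. Note also that for any single-word value the chain of != comparisons joined by -o is always true, so the check cannot reject an unexpected state on its own. The helper name ir_state_is_valid below is hypothetical and is not part of recovery-small.sh.

```shell
# Hypothetical helper (not in recovery-small.sh): accept only the four
# state strings named at line 1630. Quoting "$1" keeps an empty or
# multi-word value from breaking the comparison.
ir_state_is_valid() {
	case "$1" in
	ON|OFF|ENABLED|DISABLED) return 0 ;;
	*) return 1 ;;
	esac
}

# Same expression shape as line 1630, evaluated with an empty state.
# The unquoted $st expands to nothing, so [ sees a malformed expression
# and returns a non-zero status instead of performing the comparison.
st=""
if [ $st != ON -o $st != OFF -o $st != ENABLED -o $st != DISABLED ] 2>/dev/null; then
	unquoted_rc=0
else
	unquoted_rc=$?
fi
echo "unquoted test status with empty state: $unquoted_rc"

# The quoted case statement handles the same inputs without a syntax error.
for candidate in DISABLED OFF "" "two words"; do
	if ir_state_is_valid "$candidate"; then
		echo "'$candidate' accepted"
	else
		echo "'$candidate' rejected"
	fi
done
```

This does not explain why the IR: line would be missing from recovery_status on the affected nodes. It only shows how an empty value would produce the logged symptom.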

The second error comes from test_104 itself, in the following test code, and is a consequence of the preceding check_target_ir_state() failure:

1873         local ir_state=$(check_target_ir_state ost1)
1874         [ $ir_state = "DISABLED" -o $ir_state = "OFF" ] ||
1875                 error "ir status on ost1 should be DISABLED"
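
How the first failure can carry through to line 1874 can be sketched as follows. Line 1873 combines "local" with a command substitution, so the status visible afterwards is that of "local" and not of check_target_ir_state(). A failure inside $(...) also does not stop the caller by itself. The test therefore reaches line 1874, where an empty or multi-word $ir_state breaks the unquoted [ comparison. The names fake_ir_state, masked_status and preserved_status are hypothetical stand-ins, not functions from the test framework.

```shell
# Hypothetical stand-in for check_target_ir_state(): prints nothing and
# returns a failure status.
fake_ir_state() {
	return 1
}

# Declaring and assigning in one step: $? reflects 'local', which succeeds.
masked_status() {
	local state=$(fake_ir_state)
	echo "$?"
}

# Declaring first and assigning separately keeps the status of the
# command substitution available to the caller.
preserved_status() {
	local state rc=0
	state=$(fake_ir_state) || rc=$?
	echo "$rc"
}

echo "status seen with 'local state=\$(...)': $(masked_status)"
echo "status seen with separate assignment: $(preserved_status)"
```

With the status preserved, a caller could stop before comparing an empty state. Whether that is the right fix for test_104 is a separate question from the root cause.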

This test started failing on 2018-01-09 with lustre-master-patchless branch build #53 and lustre-master branch build #3693. Logs for these failures are at:
https://testing.hpdd.intel.com/test_sets/ab46a562-f595-11e7-a169-52540065bddc
https://testing.hpdd.intel.com/test_sets/0f1fc7ca-f68c-11e7-a7cd-52540065bddc
https://testing.hpdd.intel.com/test_sets/8802771e-fde4-11e7-a7cd-52540065bddc



 Comments   
Comment by Peter Jones [ 08/Feb/18 ]

Bob

Can you please advise?

Peter

Comment by Bob Glossman (Inactive) [ 08/Feb/18 ]

I can't reproduce the failure, so I can't see what is wrong. I have tried on both sles12sp2 and sles12sp3 and am not seeing the error at line 1630 that is reported in the logs. The call to check_target_ir_state() returns DISABLED without error.

Comment by Bob Glossman (Inactive) [ 08/Feb/18 ]

I note that the calls to lctl in the script are inconsistent. Some use 'lctl', others use '$LCTL'. I don't know why that would cause problems, but it does seem odd.

Doesn't explain why I can't reproduce the failure.

Comment by Gerrit Updater [ 08/Feb/18 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/31233
Subject: LU-10540 test: try to reproduce reported test fail
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 52683b43d3ada5b3149754928757598d190c370e

Comment by Bob Glossman (Inactive) [ 08/Feb/18 ]

I note that all the reported test sets with the failure are 'full' or 'failover' tests. None are the more common 'review' tests. Is there some difference in the execution environment or test config of 'full' and 'failover' that might be significant?

Comment by James Nunez (Inactive) [ 15/Mar/18 ]

Here's a recent example of this failure: https://testing.hpdd.intel.com/test_sessions/b012c507-db61-4fe2-8611-3f30754a759a
As you can see, this session runs the SLES12 SP3 distro with DNE configured: two MDTs on each of two MDSs, a single OSS with seven OSTs, and two clients. This is the same configuration that is run for review-dne-part-* testing.

Comment by Minh Diep [ 09/Apr/18 ]

+1 on 2.10 https://testing.hpdd.intel.com/test_sets/808b8dda-3aa3-11e8-8f8a-52540065bddc

 

Comment by Jian Yu [ 29/Sep/18 ]

+1 on master branch:
https://testing.whamcloud.com/test_sets/ec4ca7c6-c346-11e8-b748-52540065bddc
