[LU-10540] recovery-small test 104 fails with 'ir status on ost1 should be DISABLED' Created: 21/Jan/18 Updated: 12/Mar/19 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.6, Lustre 2.10.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
SLES12 SP2 and SP3 environments |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
recovery-small test_104 fails in full and failover test sessions, so far only on SLES12 SP2 and SLES12 SP3. Looking at the client test_log, we see two failures:
Started lustre-OST0000
CMD: trevis-7vm7 /usr/sbin/lctl get_param -n obdfilter.lustre-OST0000.recovery_status |
awk '/status:/{ print \$2}'
CMD: trevis-7vm7 lctl get_param -n obdfilter.lustre-OST0000.recovery_status |
awk '/IR:/{ print \$2}'
/usr/lib64/lustre/tests/recovery-small.sh: line 1630: [: too many arguments
recovery-small test_104: @@@@@@ FAIL: Error state , must be ENABLED or DISABLED
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5335:error()
= /usr/lib64/lustre/tests/recovery-small.sh:1631:check_target_ir_state()
= /usr/lib64/lustre/tests/recovery-small.sh:1873:test_104()
= /usr/lib64/lustre/tests/test-framework.sh:5611:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:5650:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:5497:run_test()
= /usr/lib64/lustre/tests/recovery-small.sh:1877:main()
CMD: trevis-7vm5,trevis-7vm6,trevis-7vm7,trevis-7vm8 /usr/sbin/lctl dk > /home/autotest/autotest/logs/test_logs/2018-01-19/lustre-master-patchless-sles12sp3-x86_64--full--1_5_1__58___d9f8a5c0-4038-4a31-8ae1-d00da7add1bf/recovery-small.test_104.debug_log.\$(hostname -s).1516426665.log;
dmesg > /home/autotest/autotest/logs/test_logs/2018-01-19/lustre-master-patchless-sles12sp3-x86_64--full--1_5_1__58___d9f8a5c0-4038-4a31-8ae1-d00da7add1bf/recovery-small.test_104.dmesg.\$(hostname -s).1516426665.log
CMD: trevis-7vm5,trevis-7vm6,trevis-7vm7,trevis-7vm8 lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
/usr/lib64/lustre/tests/recovery-small.sh: line 1874: [: too many arguments
recovery-small test_104: @@@@@@ FAIL: ir status on ost1 should be DISABLED
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5335:error()
= /usr/lib64/lustre/tests/recovery-small.sh:1875:test_104()
= /usr/lib64/lustre/tests/test-framework.sh:5611:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:5650:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:5497:run_test()
= /usr/lib64/lustre/tests/recovery-small.sh:1877:main()
The first error comes from the routine check_target_ir_state():
1616 check_target_ir_state()
1617 {
1618 local target=${1}
1619 local name=${target}_svc
1620 local recovery_proc=obdfilter.${!name}.recovery_status
1621 local st
1622
1623 while : ; do
1624 st=$(do_facet $target "$LCTL get_param -n $recovery_proc |
1625 awk '/status:/{ print \\\$2}'")
1626 [ x$st = xRECOVERING ] || break
1627 done
1628 st=$(do_facet $target "lctl get_param -n $recovery_proc |
1629 awk '/IR:/{ print \\\$2}'")
1630 [ $st != ON -o $st != OFF -o $st != ENABLED -o $st != DISABLED ] ||
1631 error "Error state $st, must be ENABLED or DISABLED"
1632 echo -n $st
1633 }
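The `[: too many arguments` message at line 1630 is consistent with `$st` being empty, as the `Error state , must be ...` text in the log suggests: with the variable unquoted, `[` is handed a malformed expression. Separately, a chain of `!=` tests joined with `-o` is true for any single word, so the check would not reject an unexpected value even when it does parse. A minimal sketch of a stricter check follows. `is_valid_ir_state` is a hypothetical helper, not something that exists in recovery-small.sh or test-framework.sh, and this is not necessarily the change uploaded for this ticket.

```shell
#!/bin/bash
# Hypothetical helper (not in the Lustre test suite): succeed only when
# the argument is one of the four IR states the test accepts.
is_valid_ir_state() {
	case "$1" in
	ON|OFF|ENABLED|DISABLED) return 0 ;;
	*) return 1 ;;
	esac
}

# Empty value, mimicking a run where awk found no 'IR:' line.
st=""

# Unquoted form from the test: the empty $st vanishes from the argument
# list, so '[' cannot parse the expression and returns a failure status.
if [ $st != ON -o $st != OFF ] 2>/dev/null; then
	echo "unquoted test: passed"
else
	echo "unquoted test: failed or could not be parsed"
fi

# Quoted value plus case: empty or unknown input is rejected cleanly.
if is_valid_ir_state "$st"; then
	echo "state '$st' is valid"
else
	echo "state '$st' is invalid, must be ON, OFF, ENABLED or DISABLED"
fi
```

Quoting the variable (`"$st"`, and likewise `"$ir_state"` in test_104) would avoid the parse error on its own; the `case` form additionally makes the accepted set explicit. Neither explains why the `IR:` field came back empty on the failing SLES12 nodes.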
The second error comes from test_104 itself, in the following test code, and is a consequence of the earlier check_target_ir_state() failure:
1873 local ir_state=$(check_target_ir_state ost1)
1874 [ $ir_state = "DISABLED" -o $ir_state = "OFF" ] ||
1875 error "ir status on ost1 should be DISABLED"
This test started failing on 2018-01-09 for lustre-master-patchless branch build #53 and lustre-master branch build #3693. Logs for these failures are at |
| Comments |
| Comment by Peter Jones [ 08/Feb/18 ] |
|
Bob,
Can you please advise?
Peter |
| Comment by Bob Glossman (Inactive) [ 08/Feb/18 ] |
|
I can't reproduce the failure in order to see what is wrong. I have tried on both sles12sp2 and sles12sp3 and am not seeing any error at line 1630 as reported in the logs. The call to check_target_ir_state() returns DISABLED without error. |
| Comment by Bob Glossman (Inactive) [ 08/Feb/18 ] |
|
I note that the calls to lctl in the script are inconsistent. Some use 'lctl', others use '$LCTL'. I don't know why that would cause problems, but it does seem odd. It doesn't explain why I can't reproduce the failure. |
| Comment by Gerrit Updater [ 08/Feb/18 ] |
|
Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/31233 |
| Comment by Bob Glossman (Inactive) [ 08/Feb/18 ] |
|
I note that all the reported test sets with the failure are 'full' or 'failover' tests. None are the more common 'review' tests. Is there some difference in the execution environment or test config of 'full' and 'failover' that might be significant? |
| Comment by James Nunez (Inactive) [ 15/Mar/18 ] |
|
Here's a recent example of this failure: https://testing.hpdd.intel.com/test_sessions/b012c507-db61-4fe2-8611-3f30754a759a. As you can see, this session runs the SLES12 SP3 distro with DNE configured: two MDTs on each of two MDSs, a single OSS with seven OSTs, and two clients. So this configuration is the same as the one run for review-dne-part-* testing. |
| Comment by Minh Diep [ 09/Apr/18 ] |
|
+1 on 2.10 https://testing.hpdd.intel.com/test_sets/808b8dda-3aa3-11e8-8f8a-52540065bddc
|
| Comment by Jian Yu [ 29/Sep/18 ] |
|
+1 on master branch: |