[LU-6844] replay-single test 70b failure: 'rundbench load on * failed!' Created: 14/Jul/15 Updated: 29/May/17 Resolved: 29/Aug/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Nunez (Inactive) | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
review-dne-part-2 test group |
||
| Issue Links: |
|
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
replay-single test 70b fails on rename in review-dne-part-2 test sessions. Logs are at:
2015-07-12 00:40:18 - https://testing.hpdd.intel.com/test_sets/4d9d6272-2877-11e5-8d7f-5254006e85c2
Although this failure looks like
onyx-30vm5: [4766] rename ./clients/client0/~dmtmp/WORD/~WRD3497.TMP ./clients/client0/~dmtmp/WORD/TIPS.DOC failed (No such file or directory) - expected NT_STATUS_OK
onyx-30vm5: ERROR: child 0 failed at line 4766
onyx-30vm5: Child failed with status 1
onyx-30vm5: status script Total(sec) E(xcluded) S(low)
onyx-30vm5: ------------------------------------------------------------------------------------
onyx-30vm5:
onyx-30vm5: touch: missing file operand
onyx-30vm5: Try `touch --help' for more information.
onyx-30vm5: mdc.lustre-MDT0002-mdc-*.mds_server_uuid in FULL state after 4 sec
onyx-30vm6: mdc.lustre-MDT0002-mdc-*.mds_server_uuid in FULL state after 4 sec
onyx-30vm6: 1 4685 0.19 MB/sec execute 82 sec latency 22735.691 ms
onyx-30vm6: 1 5047 0.23 MB/sec execute 83 sec latency 559.323 ms
CMD: onyx-30vm5,onyx-30vm6.onyx.hpdd.intel.com killall -0 dbench
onyx-30vm5: dbench: no process killed
replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-30vm5,onyx-30vm6.onyx.hpdd.intel.com!
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4727:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:4758:error()
= /usr/lib64/lustre/tests/replay-single.sh:2099:test_70b()
= /usr/lib64/lustre/tests/test-framework.sh:5020:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:5057:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4907:run_test()
= /usr/lib64/lustre/tests/replay-single.sh:2101:main()
Dumping lctl log to /logdir/test_logs/2015-07-11/lustre-reviews-el6_6-x86_64--review-dne-part-2--1_5_1__33266__-70239005896920-232641/replay-single.test_70b.*.1436664129.log
CMD: onyx-30vm3,onyx-30vm4,onyx-30vm5,onyx-30vm6.onyx.hpdd.intel.com,onyx-30vm7 /usr/sbin/lctl dk > /logdir/test_logs/2015-07-11/lustre-reviews-el6_6-x86_64--review-dne-part-2--1_5_1__33266__-70239005896920-232641/replay-single.test_70b.debug_log.\$(hostname -s).1436664129.log;
dmesg > /logdir/test_logs/2015-07-11/lustre-reviews-el6_6-x86_64--review-dne-part-2--1_5_1__33266__-70239005896920-232641/replay-single.test_70b.dmesg.\$(hostname -s).1436664129.log
Info required for matching: replay-single 70b |
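For context on the liveness check in the log above: `killall -0 dbench` sends signal 0, which delivers nothing and only reports whether a matching process still exists. The sketch below shows the same kind of probe for a single PID in Python. It assumes a POSIX system, and the helper name `is_alive` is hypothetical; it is not part of the Lustre test framework.

```python
import os
import subprocess
import sys


def is_alive(pid: int) -> bool:
    """Probe a PID with signal 0: nothing is delivered, only existence is checked."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no process with this PID
    except PermissionError:
        return True   # process exists but is owned by another user
    return True


if __name__ == "__main__":
    # Start a short-lived stand-in for a load generator, then probe it.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    print("running:", is_alive(child.pid))
    child.terminate()
    child.wait()  # reap the child so its PID no longer refers to a process
    print("after exit:", is_alive(child.pid))
```

A probe like this only says that the load has stopped. The reason it stopped (here, the rename failure) has to be read from the dbench output itself.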
| Comments |
| Comment by Di Wang [ 14/Jul/15 ] |
|
this is probably due to |
| Comment by Joseph Gmitter (Inactive) [ 14/Jul/15 ] |
|
Di, can you have a look and, if it is the same issue as 6840, close it as a duplicate? |
| Comment by James Nunez (Inactive) [ 20/Jul/15 ] |
|
More occurrences in review-dne-*: |
| Comment by Sarah Liu [ 20/Jul/15 ] |
|
another instance in Hard Failover: https://testing.hpdd.intel.com/test_sets/71cf0a7c-2793-11e5-9951-5254006e85c2 |
| Comment by Di Wang [ 23/Jul/15 ] |
|
"Di, can you have a look and if it is the same issue as 6840, close it is a duplicate?" There is not enough information for me to know the real cause here, but the most recent DNE failover failures are indeed related to |
| Comment by James Nunez (Inactive) [ 24/Jul/15 ] |
|
More failures in review-dne-part-2: |
| Comment by Sebastien Buisson (Inactive) [ 29/Jul/15 ] |
|
One more instance: |
| Comment by James Nunez (Inactive) [ 29/Jul/15 ] |
|
More failures in review-dne-part-2: |
| Comment by Jian Yu [ 17/Aug/15 ] |
|
More instances on master branch: |
| Comment by Bob Glossman (Inactive) [ 20/Aug/15 ] |
|
another on master: |
| Comment by Sarah Liu [ 24/Aug/15 ] |
|
another on master: |
| Comment by James Nunez (Inactive) [ 27/Aug/15 ] |
|
Another on master |
| Comment by Di Wang [ 16/Sep/15 ] |
|
Most of these failures happened before 2.7.59. And there are a few fixes like |
| Comment by Andreas Dilger [ 13/Oct/15 ] |
|
Closing this as a duplicate of |
| Comment by James Nunez (Inactive) [ 04/Nov/15 ] |
|
Reopening this ticket because we are seeing this error again on master. The patches being tested are based on master where patches for The following replay-single test 70b failures fail in rename or unlink. Logs are at: |
| Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ] |
|
Another instance found for interop : 2.5.5 Server/EL7 Client |
| Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ] |
|
Another instance occurred for hardfailover : EL7 Server/Client - ZFS |
| Comment by James Nunez (Inactive) [ 01/Feb/16 ] |
|
For the following, the problem is unlink, but unlink fails for the same reason as above, 'No such file or directory':
onyx-42vm1: [993] unlink ./clients/client0/~dmtmp/PWRPNT/PPTC321.TMP failed (No such file or directory) - expected NT_STATUS_OK
onyx-42vm1: ERROR: child 0 failed at line 993
Logs at https://testing.hpdd.intel.com/test_sets/3e6e0f84-c6a8-11e5-8cac-5254006e85c2 |
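The dbench failure lines quoted in this ticket share one shape: host, a bracketed line number, the operation, its arguments, the error text, and the expected status. A small sketch for tallying failures by operation when triaging many test logs follows. The regular expression is inferred only from the lines quoted here, and `parse_failure` is a hypothetical helper, not an existing tool.

```python
import re
from collections import Counter
from typing import Optional

# Pattern inferred from the failure lines quoted in this ticket (an assumption,
# not a documented dbench output format).
FAILURE_RE = re.compile(
    r"^(?P<host>\S+): \[(?P<line>\d+)\] (?P<op>\w+) (?P<args>.+) "
    r"failed \((?P<reason>[^)]*)\) - expected (?P<expected>\S+)$"
)


def parse_failure(text: str) -> Optional[dict]:
    """Return the fields of a dbench failure line, or None if it does not match."""
    match = FAILURE_RE.match(text.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["line"] = int(record["line"])
    record["args"] = record["args"].split()
    return record


if __name__ == "__main__":
    log = [
        "onyx-30vm5: [4766] rename ./clients/client0/~dmtmp/WORD/~WRD3497.TMP "
        "./clients/client0/~dmtmp/WORD/TIPS.DOC failed (No such file or directory) "
        "- expected NT_STATUS_OK",
        "onyx-42vm1: [993] unlink ./clients/client0/~dmtmp/PWRPNT/PPTC321.TMP "
        "failed (No such file or directory) - expected NT_STATUS_OK",
        "onyx-30vm5: dbench: no process killed",
    ]
    parsed = [r for r in map(parse_failure, log) if r is not None]
    print(Counter((r["op"], r["reason"]) for r in parsed))
```

Grouping by operation and error text makes it easier to see whether new occurrences are the same ENOENT-on-rename/unlink pattern or a different failure that happens to hit the same test.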
| Comment by Saurabh Tandan (Inactive) [ 03/Feb/16 ] |
|
Another instance failing with the same error as above for tag 2.7.66 for FULL - EL6.7 Server/EL6.7 Client - DNE , master build# 3314. |
| Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ] |
|
Another instance found for Full tag 2.7.66 - EL6.7 Server/EL6.7 Client - DNE, build# 3314 |
| Comment by nasf (Inactive) [ 07/Mar/16 ] |
|
Another failure instance on master: |
| Comment by Bob Glossman (Inactive) [ 08/Mar/16 ] |
|
another on master: |
| Comment by Andreas Dilger [ 08/Mar/16 ] |
|
Di, there are still ongoing test failures in replay-single test_70b after the landing of the
|
| Comment by Di Wang [ 08/Mar/16 ] |
|
Sure, I will check it. |
| Comment by Richard Henwood (Inactive) [ 29/Mar/16 ] |
|
another instance of this failure from Master review-dne-part-2 on Mar 28th: https://testing.hpdd.intel.com/test_sets/1789dade-f50f-11e5-87ab-5254006e85c2 |
| Comment by Richard Henwood (Inactive) [ 04/Apr/16 ] |
|
again: review-dne-part-2 on April 2nd on Master: https://testing.hpdd.intel.com/test_sets/78352216-f92a-11e5-a22e-5254006e85c2 Error: 'rundbench load on trevis-56vm1.trevis.hpdd.intel.com,trevis-56vm2 failed!' |
| Comment by Jian Yu [ 08/Apr/16 ] |
|
More failure instances on master branch: |
| Comment by Jinshan Xiong (Inactive) [ 11/Apr/16 ] |
|
happened again at: https://testing.hpdd.intel.com/test_sets/e8648d38-fcc7-11e5-abd3-5254006e85c2 |
| Comment by Di Wang [ 12/Apr/16 ] |
|
The test failed because of:
trevis-42vm2: [13576] unlink ./clients/client0/~dmtmp/WORDPRO/BENCHS1.LWP failed (No such file or directory) - expected NT_STATUS_OK
trevis-42vm2: ERROR: child 0 failed at line 13576
I suspect this is related to the slave stripes update.
int lmv_revalidate_slaves(struct obd_export *exp,
                          const struct lmv_stripe_md *lsm,
                          ldlm_blocking_callback cb_blocking,
                          int extra_lock_flags)
{
        .........
        if (body == NULL) {
                if (it.d.lustre.it_lock_mode && lockh) {
                        ldlm_lock_decref(lockh,
                                         it.d.lustre.it_lock_mode);
                        it.d.lustre.it_lock_mode = 0;
                }
                GOTO(cleanup, rc = -ENOENT);
        }
I did not find anything useful in the debug log, though. I am trying to push a debug patch, but it seems I cannot push any patch to the master repository at the moment; I will retry later. |
| Comment by Gerrit Updater [ 12/Apr/16 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/19489 |
| Comment by Bob Glossman (Inactive) [ 22/Apr/16 ] |
|
another on master: |
| Comment by Bob Glossman (Inactive) [ 23/Apr/16 ] |
|
another on master: |
| Comment by Andreas Dilger [ 28/Apr/16 ] |
|
This is failing fairly frequently in testing. It would be good to make some progress with the debugging patch, or actual fix. |
| Comment by Emoly Liu [ 03/May/16 ] |
|
Another on master: |
| Comment by Gerrit Updater [ 06/May/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/20022 |
| Comment by John Hammond [ 06/May/16 ] |
|
https://testing.hpdd.intel.com/test_sets/cdbfbfcc-1386-11e6-855a-5254006e85c2 |
| Comment by Gerrit Updater [ 09/May/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20022/ |
| Comment by Andreas Dilger [ 09/May/16 ] |
|
Note this bug should not be closed because of the above patch landing, which only changed the test to run on a single MDS. |
| Comment by Di Wang [ 28/Jun/16 ] |
|
looks quite similar as |
| Comment by Di Wang [ 22/Jul/16 ] |
|
I pushed a patch http://review.whamcloud.com/19489 to see if http://review.whamcloud.com/#/c/20940/ and http://review.whamcloud.com/#/c/21088/ can fix 6844. |
| Comment by Di Wang [ 24/Jul/16 ] |
|
According to the test, it looks like 20940 and 21088 can fix 6844. I will then make a patch to revert http://review.whamcloud.com/20022 . |
| Comment by Gerrit Updater [ 26/Jul/16 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/21508 |
| Comment by Joseph Gmitter (Inactive) [ 29/Jul/16 ] |
|
For tracking purposes, the patch remaining to be landed here for the fix is from and re-enabling the test is: http://review.whamcloud.com/#/c/21508/ |
| Comment by Joseph Gmitter (Inactive) [ 20/Aug/16 ] |
|
The above patch from |
| Comment by Gerrit Updater [ 29/Aug/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21508/ |
| Comment by Peter Jones [ 29/Aug/16 ] |
|
Landed for 2.9 |