[LU-6844] replay-single test 70b failure: 'rundbench load on * failed!' Created: 14/Jul/15  Updated: 29/May/17  Resolved: 29/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Critical
Reporter: James Nunez (Inactive) Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: None
Environment:

review-dne-part-2 test group


Issue Links:
Duplicate
is duplicated by LU-8107 replay-single test_70b: rundbench can... Closed
Related
is related to LU-6840 update memory reply data in DNE updat... Resolved
is related to LU-7117 replay-single test_70d: timeout and m... Resolved
is related to LU-7739 replay-single test 70b hangs with LBU... Resolved
is related to LU-8353 mdt unlink should lock parent before ... Resolved
is related to LU-4439 Test failure on replay-single test_70... Resolved
is related to LU-7617 replay-single test_70b hangs ASSERTIO... Resolved
is related to LU-7298 replay-single test_70b: ASSERTION( dt... Closed
is related to LU-7309 replay-single test_70b: no space left... Resolved
is related to LU-6919 replay-single test_70b: "Cannot send ... Resolved
is related to LU-7788 replay-single test_70b timeout with M... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-single test 70b fails on rename in review-dne-part-2 test sessions. Logs are at:

2015-07-12 00:40:18 - https://testing.hpdd.intel.com/test_sets/4d9d6272-2877-11e5-8d7f-5254006e85c2
2015-07-13 07:41:04 - https://testing.hpdd.intel.com/test_sets/21c48224-2977-11e5-a9c5-5254006e85c2
2015-07-13 12:08:49 - https://testing.hpdd.intel.com/test_sets/7a85b58a-29aa-11e5-b07d-5254006e85c2

Although this failure looks like LU-4439, the error message in the client test log is different:

onyx-30vm5: [4766] rename ./clients/client0/~dmtmp/WORD/~WRD3497.TMP ./clients/client0/~dmtmp/WORD/TIPS.DOC failed (No such file or directory) - expected NT_STATUS_OK
onyx-30vm5: ERROR: child 0 failed at line 4766
onyx-30vm5: Child failed with status 1
onyx-30vm5: status        script            Total(sec) E(xcluded) S(low) 
onyx-30vm5: ------------------------------------------------------------------------------------
onyx-30vm5: 
onyx-30vm5: touch: missing file operand
onyx-30vm5: Try `touch --help' for more information.
onyx-30vm5: mdc.lustre-MDT0002-mdc-*.mds_server_uuid in FULL state after 4 sec
onyx-30vm6: mdc.lustre-MDT0002-mdc-*.mds_server_uuid in FULL state after 4 sec
onyx-30vm6:    1      4685     0.19 MB/sec  execute  82 sec  latency 22735.691 ms
onyx-30vm6:    1      5047     0.23 MB/sec  execute  83 sec  latency 559.323 ms
CMD: onyx-30vm5,onyx-30vm6.onyx.hpdd.intel.com killall -0 dbench
onyx-30vm5: dbench: no process killed
 replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of onyx-30vm5,onyx-30vm6.onyx.hpdd.intel.com! 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4727:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4758:error()
  = /usr/lib64/lustre/tests/replay-single.sh:2099:test_70b()
  = /usr/lib64/lustre/tests/test-framework.sh:5020:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5057:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:4907:run_test()
  = /usr/lib64/lustre/tests/replay-single.sh:2101:main()
Dumping lctl log to /logdir/test_logs/2015-07-11/lustre-reviews-el6_6-x86_64--review-dne-part-2--1_5_1__33266__-70239005896920-232641/replay-single.test_70b.*.1436664129.log
CMD: onyx-30vm3,onyx-30vm4,onyx-30vm5,onyx-30vm6.onyx.hpdd.intel.com,onyx-30vm7 /usr/sbin/lctl dk > /logdir/test_logs/2015-07-11/lustre-reviews-el6_6-x86_64--review-dne-part-2--1_5_1__33266__-70239005896920-232641/replay-single.test_70b.debug_log.\$(hostname -s).1436664129.log;
         dmesg > /logdir/test_logs/2015-07-11/lustre-reviews-el6_6-x86_64--review-dne-part-2--1_5_1__33266__-70239005896920-232641/replay-single.test_70b.dmesg.\$(hostname -s).1436664129.log

Info required for matching: replay-single 70b
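
For triage, the dbench failure lines in the client log above can be picked out mechanically. The following is a minimal sketch in C, assuming the line layout seen in this log (`host: [line] op ... failed (reason) - expected ...`). The struct `dbench_failure` and the function `parse_dbench_failure` are hypothetical names used for illustration; they are not part of dbench or the Lustre test framework.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one failed dbench operation. */
struct dbench_failure {
    char host[64];     /* client node that printed the line */
    int  line;         /* line number in the dbench load file */
    char op[32];       /* failed operation, e.g. "rename" or "unlink" */
    char reason[128];  /* text between the parentheses */
};

/* Returns 1 and fills *out if text looks like a dbench failure line,
 * returns 0 otherwise. */
int parse_dbench_failure(const char *text, struct dbench_failure *out)
{
    const char *marker = " failed (";
    const char *start, *end;
    size_t len;

    /* "host: [1234] op" prefix */
    if (sscanf(text, "%63[^:]: [%d] %31s",
               out->host, &out->line, out->op) != 3)
        return 0;

    /* reason is whatever sits between " failed (" and the next ')' */
    start = strstr(text, marker);
    if (start == NULL)
        return 0;
    start += strlen(marker);
    end = strchr(start, ')');
    if (end == NULL)
        return 0;

    len = (size_t)(end - start);
    if (len >= sizeof(out->reason))
        len = sizeof(out->reason) - 1;
    memcpy(out->reason, start, len);
    out->reason[len] = '\0';
    return 1;
}
```

Feeding each client log line to `parse_dbench_failure` and grouping the results by `op` and `reason` would separate failures like this one (rename, "No such file or directory") from those with other error messages.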



 Comments   
Comment by Di Wang [ 14/Jul/15 ]

this is probably due to LU-6840.

Comment by Joseph Gmitter (Inactive) [ 14/Jul/15 ]

Di, can you have a look and, if it is the same issue as LU-6840, close it as a duplicate?

Comment by James Nunez (Inactive) [ 20/Jul/15 ]

More occurrences in review-dne-*:
2015-07-18 04:47:54 - https://testing.hpdd.intel.com/test_sets/983347a8-2d4e-11e5-b883-5254006e85c2
2015-07-20 06:05:40 - https://testing.hpdd.intel.com/test_sets/2d98d876-2ee9-11e5-bc70-5254006e85c2
2015-07-21 19:50:14 - https://testing.hpdd.intel.com/test_sets/05744940-302c-11e5-bb7b-5254006e85c2

Comment by Sarah Liu [ 20/Jul/15 ]

another instance in Hard Failover:

https://testing.hpdd.intel.com/test_sets/71cf0a7c-2793-11e5-9951-5254006e85c2

Comment by Di Wang [ 23/Jul/15 ]

"Di, can you have a look and, if it is the same issue as LU-6840, close it as a duplicate?"

There is not enough information for me to know the real cause here, but the most recent DNE failover failure is indeed related to LU-6840. Let's see how the LU-6840 tests go.

Comment by James Nunez (Inactive) [ 24/Jul/15 ]

More failures in review-dne-part-2:
2015-07-23 15:37:17 - https://testing.hpdd.intel.com/test_sets/cfd11ad8-3194-11e5-8dbe-5254006e85c2
2015-07-24 15:21:48 - https://testing.hpdd.intel.com/test_sets/a96a0748-325b-11e5-94c1-5254006e85c2
2015-07-25 06:47:00 - https://testing.hpdd.intel.com/test_sets/538f2a26-32db-11e5-8214-5254006e85c2

Comment by Sebastien Buisson (Inactive) [ 29/Jul/15 ]

One more instance:
2015-07-28 16:04:13 - https://testing.hpdd.intel.com/test_sets/8aee9dd8-358f-11e5-af5d-5254006e85c2

Comment by James Nunez (Inactive) [ 29/Jul/15 ]

More failures in review-dne-part-2:
2015-07-27 13:37:43 - https://testing.hpdd.intel.com/test_sets/d1558c8e-34a7-11e5-ac46-5254006e85c2
2015-07-28 08:07:47 - https://testing.hpdd.intel.com/test_sets/3d0ae810-3549-11e5-b949-5254006e85c2
2015-07-28 16:04:13 - https://testing.hpdd.intel.com/test_sets/8aee9dd8-358f-11e5-af5d-5254006e85c2
2015-07-29 06:14:34 - https://testing.hpdd.intel.com/test_sets/9485875c-35fe-11e5-90a5-5254006e85c2
2015-07-30 20:58:25 - https://testing.hpdd.intel.com/test_sets/c8f5154e-3742-11e5-a8a9-5254006e85c2

Comment by Jian Yu [ 17/Aug/15 ]

Another instance on the master branch:
https://testing.hpdd.intel.com/test_sets/1a8282a6-43d8-11e5-a4bc-5254006e85c2

Comment by Bob Glossman (Inactive) [ 20/Aug/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/268c4872-4714-11e5-b14a-5254006e85c2

Comment by Sarah Liu [ 24/Aug/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/65b0104c-46d3-11e5-90a5-5254006e85c2

Comment by James Nunez (Inactive) [ 27/Aug/15 ]

Another on master
2015-08-26 06:57:16 - https://testing.hpdd.intel.com/test_sets/7d2d65ee-4c05-11e5-b537-5254006e85c2

Comment by Di Wang [ 16/Sep/15 ]

Most of these failures happened before 2.7.59, and a few fixes such as LU-6880, LU-6840 and LU-7050 should help with this issue. Let's see how this goes in the following runs.

Comment by Andreas Dilger [ 13/Oct/15 ]

Closing this as a duplicate of LU-6880 and/or LU-6840 per most recent comments.

Comment by James Nunez (Inactive) [ 04/Nov/15 ]

Reopening this ticket because we are seeing this error again on master. The patches being tested are based on master where patches for LU-6880, LU-6840, and LU-7050 have landed.

The following replay-single test 70b failures fail in rename or unlink. Logs are at:
2015-10-24 14:58:48 - https://testing.hpdd.intel.com/test_sets/73971ed4-7ac3-11e5-a4dd-5254006e85c2
2015-10-29 15:33:48 - https://testing.hpdd.intel.com/test_sets/46363fee-7ea4-11e5-965a-5254006e85c2
2015-10-29 18:05:21 - https://testing.hpdd.intel.com/test_sets/c7260530-7eb3-11e5-8602-5254006e85c2
2015-11-03 19:55:24 - https://testing.hpdd.intel.com/test_sets/e0dc4530-82ae-11e5-b9d3-5254006e85c2
2015-12-19 23:55:53 - https://testing.hpdd.intel.com/test_sets/556a5e64-a6ea-11e5-ab33-5254006e85c2
2015-12-20 04:29:48 - https://testing.hpdd.intel.com/test_sets/40dd4a8e-a71b-11e5-ab33-5254006e85c2
2015-12-24 09:10:55 - https://testing.hpdd.intel.com/test_sets/03ee6f0e-aa6d-11e5-bcca-5254006e85c2
2016-01-11 17:17:30 - https://testing.hpdd.intel.com/test_sets/f3c9ebe6-b8ca-11e5-8c15-5254006e85c2
2016-02-04 08:17:35 - https://testing.hpdd.intel.com/test_sets/c7aac802-cb65-11e5-a59a-5254006e85c2
2016-02-10 03:02:33 - https://testing.hpdd.intel.com/test_sets/a4af3ff4-cfe8-11e5-a49b-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ]

Another instance found for interop : 2.5.5 Server/EL7 Client
Server: 2.5.5, b2_5_fe/62
Client: master, build# 3303, RHEL 7
https://testing.hpdd.intel.com/test_sets/6e9d7126-bb0a-11e5-87b4-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ]

Another instance occurred for hardfailover : EL7 Server/Client - ZFS
build# 3305
https://testing.hpdd.intel.com/test_sets/02982ada-bbc7-11e5-8506-5254006e85c2

Comment by James Nunez (Inactive) [ 01/Feb/16 ]

In the following case the failing operation is unlink rather than rename, but it fails for the same reason as above, 'No such file or directory':

onyx-42vm1: [993] unlink ./clients/client0/~dmtmp/PWRPNT/PPTC321.TMP failed (No such file or directory) - expected NT_STATUS_OK
onyx-42vm1: ERROR: child 0 failed at line 993

Logs at https://testing.hpdd.intel.com/test_sets/3e6e0f84-c6a8-11e5-8cac-5254006e85c2
2016-02-17 16:04:41 - https://testing.hpdd.intel.com/test_sets/11d720fe-d5d1-11e5-afbd-5254006e85c2
2016-02-18 15:18:28 - https://testing.hpdd.intel.com/test_sets/728bbcf0-d696-11e5-8955-5254006e85c2
2016-02-18 15:56:44 - https://testing.hpdd.intel.com/test_sets/dbe15260-d69f-11e5-8955-5254006e85c2
2016-02-18 16:16:04 - https://testing.hpdd.intel.com/test_sets/48cfac56-d69f-11e5-afe8-5254006e85c2
2016-02-21 15:50:36 - https://testing.hpdd.intel.com/test_sets/abb1c872-d8de-11e5-b4e5-5254006e85c2
2016-02-22 00:38:56 - https://testing.hpdd.intel.com/test_sets/ea7684ba-d94e-11e5-8b17-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 03/Feb/16 ]

Another instance failing with the same error as above for tag 2.7.66 for FULL - EL6.7 Server/EL6.7 Client - DNE , master build# 3314.
https://testing.hpdd.intel.com/test_sets/7b2c4326-ca83-11e5-9215-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ]

Another instance found for Full tag 2.7.66 - EL6.7 Server/EL6.7 Client - DNE, build# 3314
https://testing.hpdd.intel.com/test_sets/7b2c4326-ca83-11e5-9215-5254006e85c2

Comment by nasf (Inactive) [ 07/Mar/16 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/73d8bd0c-e3f1-11e5-9e66-5254006e85c2#

Comment by Bob Glossman (Inactive) [ 08/Mar/16 ]

another on master:
https://testing.hpdd.intel.com/test_sets/ae54f072-e4d7-11e5-8045-5254006e85c2

Comment by Andreas Dilger [ 08/Mar/16 ]

Di, there are still ongoing test failures in replay-single test_70b after the landing of the LU-6840 patch. There are quite a few different failure modes, which makes me think that there is some kind of memory corruption happening here:

LU-6919 replay-single test_70b: "Cannot send after transport endpoint shutdown" running dbench
LU-7788 replay-single test_70b timeout with MDS OOM
LU-7298 replay-single test_70b: ASSERTION( dt->do_body_ops->dbo_write ) failed
LU-7617 replay-single test 70b hangs with LBUG 'ASSERTION( ((o)->lo_header->loh_attr & LOHA_EXISTS) != 0 ) failed:'
LU-7739 replay-single test 70b hangs with LBUG '(mgc_request.c:995:mgc_blocking_ast()) ASSERTION( atomic_read(&cld->cld_refcount) > 0 )'

Comment by Di Wang [ 08/Mar/16 ]

Sure, I will check it.

Comment by Richard Henwood (Inactive) [ 29/Mar/16 ]

another instance of this failure from Master review-dne-part-2 on Mar 28th:

https://testing.hpdd.intel.com/test_sets/1789dade-f50f-11e5-87ab-5254006e85c2

Comment by Richard Henwood (Inactive) [ 04/Apr/16 ]

again: review-dne-part-2 on April 2nd on Master:

https://testing.hpdd.intel.com/test_sets/78352216-f92a-11e5-a22e-5254006e85c2

Error: 'rundbench load on trevis-56vm1.trevis.hpdd.intel.com,trevis-56vm2 failed!'

Comment by Jian Yu [ 08/Apr/16 ]

Another failure instance on the master branch:
https://testing.hpdd.intel.com/test_sets/0e03afbc-fd34-11e5-a858-5254006e85c2

Comment by Jinshan Xiong (Inactive) [ 11/Apr/16 ]

happened again at: https://testing.hpdd.intel.com/test_sets/e8648d38-fcc7-11e5-abd3-5254006e85c2

Comment by Di Wang [ 12/Apr/16 ]

The test failed because of:

trevis-42vm2: [13576] unlink ./clients/client0/~dmtmp/WORDPRO/BENCHS1.LWP failed (No such file or directory) - expected NT_STATUS_OK
trevis-42vm2: ERROR: child 0 failed at line 13576

I suspect this is related to the slave stripe update.

int lmv_revalidate_slaves(struct obd_export *exp,
                          const struct lmv_stripe_md *lsm,
                          ldlm_blocking_callback cb_blocking,
                          int extra_lock_flags)
{
.........
                if (body == NULL) {
                        if (it.d.lustre.it_lock_mode && lockh) {
                                ldlm_lock_decref(lockh,
                                                 it.d.lustre.it_lock_mode);
                                it.d.lustre.it_lock_mode = 0;
                        }
                        GOTO(cleanup, rc = -ENOENT);
                }

However, I did not find anything useful in the debug log. I am trying to push a debug patch, but it seems I cannot push any patch to the master repository at the moment; I will retry later.
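
To make the suspected path concrete, here is a simplified, standalone model of the excerpt above. It is only a sketch: `struct stripe_reply` and `revalidate_stripes` are hypothetical stand-ins, not Lustre types or functions, and it assumes the elided part of `lmv_revalidate_slaves` walks the slave stripes one by one. It shows how a single stripe reply without a body would drop that stripe's lock reference and turn the whole revalidation into -ENOENT, which the client reports as "No such file or directory".

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-stripe reply state; NOT a Lustre type. */
struct stripe_reply {
    const void *body;  /* NULL models a reply that carries no body */
    int lock_mode;     /* non-zero models a lock reference still held */
};

/* Simplified model of the suspected path: check every stripe reply and
 * stop at the first one without a body. The lock reference held for that
 * stripe is released (modelled by clearing lock_mode and counting it)
 * before -ENOENT is returned. Returns 0 if every stripe has a body. */
int revalidate_stripes(struct stripe_reply *replies, size_t count,
                       int *locks_dropped)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (replies[i].body == NULL) {
            if (replies[i].lock_mode != 0) {
                replies[i].lock_mode = 0;
                (*locks_dropped)++;
            }
            return -ENOENT;
        }
    }
    return 0;
}
```

In this model, one stripe with a missing body is enough to make the operation on the whole striped directory fail, even though the other stripes revalidated fine.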

Comment by Gerrit Updater [ 12/Apr/16 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/19489
Subject: LU-6844 lmv: add debug message
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5e029e3c772d28edcce6379ecca25ec75b038240

Comment by Bob Glossman (Inactive) [ 22/Apr/16 ]

another on master:
https://testing.hpdd.intel.com/test_sets/339f2d82-0888-11e6-855a-5254006e85c2

Comment by Bob Glossman (Inactive) [ 23/Apr/16 ]

another on master:
https://testing.hpdd.intel.com/test_sets/03deece8-08fd-11e6-855a-5254006e85c2

Comment by Andreas Dilger [ 28/Apr/16 ]

This is failing fairly frequently in testing. It would be good to make some progress with the debugging patch, or actual fix.

Comment by Emoly Liu [ 03/May/16 ]

Another on master:
https://testing.hpdd.intel.com/test_sets/da1ade02-0df8-11e6-855a-5254006e85c2

Comment by Gerrit Updater [ 06/May/16 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/20022
Subject: LU-6844 tests: disable DNE testing of dbench
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 52eda32e8d21bc1a58ca7d96db6e10e0a325d216

Comment by John Hammond [ 06/May/16 ]

https://testing.hpdd.intel.com/test_sets/cdbfbfcc-1386-11e6-855a-5254006e85c2

Comment by Gerrit Updater [ 09/May/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20022/
Subject: LU-6844 tests: disable DNE testing of dbench
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ed30857c852f7cdb0a29e25a2ddb030f76f5c16b

Comment by Andreas Dilger [ 09/May/16 ]

Note this bug should not be closed because of the above patch landing, which only changed the test to run on a single MDS.

Comment by Di Wang [ 28/Jun/16 ]

This looks quite similar to LU-7117. I will see if they are related.

Comment by Di Wang [ 22/Jul/16 ]

I pushed a patch http://review.whamcloud.com/19489 to see if http://review.whamcloud.com/#/c/20940/ and http://review.whamcloud.com/#/c/21088/ can fix 6844.

Comment by Di Wang [ 24/Jul/16 ]

According to the test results, it looks like 20940 and 21088 fix LU-6844. I will then make a patch to revert http://review.whamcloud.com/20022 .

Comment by Gerrit Updater [ 26/Jul/16 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/21508
Subject: LU-6844 tests: re-enable striped dir
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2a35e51dd81cd72d166c1cf14d6a6ebe43a973ef

Comment by Joseph Gmitter (Inactive) [ 29/Jul/16 ]

For tracking purposes, the patch remaining to be landed here for the fix is from LU-7117 http://review.whamcloud.com/#/c/20940/

and re-enabling the test is: http://review.whamcloud.com/#/c/21508/

Comment by Joseph Gmitter (Inactive) [ 20/Aug/16 ]

The above patch from LU-7117 has landed; the only remaining work here is re-enabling the test in patch http://review.whamcloud.com/#/c/21508/

Comment by Gerrit Updater [ 29/Aug/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21508/
Subject: LU-6844 tests: re-enable striped dir
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c34013e21ae4c14cc2eac5ef58c20fee0124e51d

Comment by Peter Jones [ 29/Aug/16 ]

Landed for 2.9

Generated at Sat Feb 10 02:03:45 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.