[LU-7117] replay-single test_70d: timeout and mkdir/rmdir stopped Created: 08/Sep/15  Updated: 21/Sep/17  Resolved: 15/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug
Priority: Critical
Reporter: Maloo
Assignee: Lai Siyao
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-7213 replay-single test_70d: mkdir: create... Resolved
is duplicated by LU-7924 replay-single test_70d: short descrip... Closed
Related
is related to LU-7172 replay-single test_70d hung on MDT un... Resolved
is related to LU-7775 replay-single test_70d: cannot touch ... Open
is related to LU-6844 replay-single test 70b failure: 'rund... Resolved
is related to LU-8353 mdt unlink should lock parent before ... Resolved
is related to LU-8617 replay-single test_70d: stripe alread... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/56e1b56e-53ff-11e5-8f2c-5254006e85c2.

The sub-test test_70d failed with the following error:

error on LL_IOC_LMV_SETSTRIPE '/mnt/lustre/d70d.replay-single/test1' (3): stripe already set
error: mkdir: create stripe dir '/mnt/lustre/d70d.replay-single/test1' failed
mkdir fails
/usr/lib64/lustre/tests/replay-single.sh: line 2236: kill: (25189) - No such process
mkdir/rmdir 25189 stopped
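
One possible reading of the log above, assuming the test runs mkdir/rmdir in a background loop that exits on its first failure and is later probed by PID: once the loop has exited, the probe finds no process, which would match the "No such process" message. The sketch below is not taken from replay-single.sh; `is_alive` and the short-lived worker are hypothetical stand-ins, and it is POSIX only.

```python
import os
import subprocess
import sys


def is_alive(pid: int) -> bool:
    """Probe a PID with signal 0, similar to `kill -0 PID` in a shell."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        # No process with this PID exists any more.
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


# Hypothetical stand-in for the background mkdir/rmdir loop:
# it fails immediately, the way the loop would after "mkdir fails".
worker = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(1)"])
worker.wait()

if not is_alive(worker.pid):
    print(f"mkdir/rmdir {worker.pid} stopped")
```

Under that assumption, the "stopped" message is a symptom of the earlier mkdir failure, not a separate problem.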

There have been several test failures and timeouts for 70d since 2015-09-02, so I suspect that a patch that landed on that day or the day before introduced a regression.

Info required for matching: replay-single 70d



 Comments   
Comment by James Nunez (Inactive) [ 07/Oct/15 ]

Another failure of replay-single test_70d with logs at https://testing.hpdd.intel.com/test_sets/e4bc4b96-6cb0-11e5-ab7f-5254006e85c2
2015-10-26 15:56:26 - https://testing.hpdd.intel.com/test_sets/0f6e4b5c-7c3b-11e5-bc82-5254006e85c2
2015-11-04 01:58:10 - https://testing.hpdd.intel.com/test_sets/40bafc72-82e0-11e5-8b6b-5254006e85c2

Comment by Bob Glossman (Inactive) [ 11/Nov/15 ]

another on master:
https://testing.hpdd.intel.com/test_sets/65711b06-8857-11e5-84a2-5254006e85c2

Comment by James Nunez (Inactive) [ 13/Nov/15 ]

Another failure where the mkdir/rmdir process stopped, but this one takes place inside the random_fail_mdt() routine. I suspect the cause of the failure is the same, since the logs are similar:

shadow-17vm9: CMD: shadow-17vm9.shadow.whamcloud.com lctl get_param -n at_max
shadow-17vm10: CMD: shadow-17vm10.shadow.whamcloud.com lctl get_param -n at_max
touch: cannot touch `/mnt/lustre/d70d.replay-single/test1/a': No such file or directory
touch fails
shadow-17vm9: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 6 sec
shadow-17vm10: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 6 sec
/usr/lib64/lustre/tests/replay-single.sh: line 2114: kill: (16877) - No such process

Logs at
2015-10-27 14:12:35 - https://testing.hpdd.intel.com/test_sets/1637670e-7cf7-11e5-bceb-5254006e85c2
2015-11-12 02:07:27 - https://testing.hpdd.intel.com/test_sets/f8324f14-8924-11e5-8ba4-5254006e85c2

Comment by Jian Yu [ 02/Dec/15 ]

Another instance on master:
https://testing.hpdd.intel.com/test_sets/5fc600d8-9836-11e5-8fa3-5254006e85c2

Comment by James Nunez (Inactive) [ 28/Dec/15 ]

Another failure on master with the 'No such file or directory' error:
2015-12-27 14:00:02 - https://testing.hpdd.intel.com/test_sets/6af615f6-acea-11e5-9134-5254006e85c2
2016-01-07 03:11:56 - https://testing.hpdd.intel.com/test_sets/2f511064-b52d-11e5-aa1f-5254006e85c2
2016-01-12 13:29:58 - https://testing.hpdd.intel.com/test_sets/7aeee7ae-b97d-11e5-825c-5254006e85c2
2016-02-03 10:29:58 - https://testing.hpdd.intel.com/test_sets/7cf30a58-caa8-11e5-a610-5254006e85c2

Comment by Jian Yu [ 29/Jan/16 ]

Another failure instance on the master branch:
https://testing.hpdd.intel.com/test_sets/48380772-c66c-11e5-8cac-5254006e85c2

All of the instances occurred in a DNE configuration. Patch review testing on the master branch is affected by this failure.

Comment by Richard Henwood (Inactive) [ 01/Mar/16 ]

another master branch failure, during review-dne-part-2:

https://testing.hpdd.intel.com/test_sets/0a65754c-dd7d-11e5-ab2a-5254006e85c2

Comment by Jian Yu [ 06/Apr/16 ]

Occurred again on master branch:
https://testing.hpdd.intel.com/test_sets/4d2a6bf2-fb4c-11e5-acc0-5254006e85c2

Comment by Emoly Liu [ 18/Apr/16 ]

Another failure on master:
https://testing.hpdd.intel.com/test_sets/9bbba39e-0326-11e6-b5f1-5254006e85c2
https://testing.hpdd.intel.com/test_sets/21698126-0fb3-11e6-9b34-5254006e85c2

Comment by nasf (Inactive) [ 17/May/16 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/86f8d552-1c04-11e6-b5f1-5254006e85c2

Comment by John Hammond [ 19/May/16 ]

https://testing.hpdd.intel.com/test_sets/59ca4542-1d4a-11e6-9089-5254006e85c2

Comment by nasf (Inactive) [ 13/Jun/16 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/e17c313a-3096-11e6-acf3-5254006e85c2

Comment by Andreas Dilger [ 13/Jun/16 ]

This is now the number one cause of review test failures on master.

Comment by nasf (Inactive) [ 13/Jun/16 ]

More failure instances on master:
https://testing.hpdd.intel.com/test_sets/478fbade-3180-11e6-bbf5-5254006e85c2
https://testing.hpdd.intel.com/test_sets/9d99755c-316f-11e6-a0ce-5254006e85c2

Comment by Gerrit Updater [ 17/Jun/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/20853
Subject: LU-7117 tests: enable full debug for reply-single test_70d
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 61ec7fa12d28daffa43bc7a4f0e927d4e6550061

Comment by nasf (Inactive) [ 23/Jun/16 ]

After painful debugging through millions of lines of logs, I found that the failure appears to be related to the following scenario:
1) The client sends an unlink (PFID/test1) RPC to MDT1.
2) MDT0 fails over.
3) MDT1 is replaying with MDT0.
4) At that time, the unlink from step 1) triggers a lookup(test1) RPC from MDT1 to MDT0 for test1's FID.
5) Because MDT0 recovery has not completed yet, test1's FID on MDT0 is still the old one. That is, MDT0 replies to MDT1 with an invalid FID for test1.
6) MDT1 uses the FID given by MDT0 to locate the target child object to be deleted and finds nothing, so the unlink operation fails.
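
The scenario above can be illustrated with a toy model. This is only a sketch and not Lustre code: the `NameServer` class, the `unlink` function, the FID strings, and the `wait_for_recovery` flag are all hypothetical. The flag only shows why ordering the lookup after recovery matters; it does not represent how either patch on this ticket works.

```python
from dataclasses import dataclass, field


@dataclass
class NameServer:
    """Toy stand-in for MDT0: maps names to FIDs, with updates pending replay."""
    names: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)
    recovering: bool = False

    def lookup(self, name):
        # During recovery this may still return the pre-replay (stale) FID.
        return self.names.get(name)

    def finish_recovery(self):
        # Apply the replayed updates, then leave recovery.
        for name, fid in self.pending:
            self.names[name] = fid
        self.pending.clear()
        self.recovering = False


def unlink(mdt0, objects, name, wait_for_recovery):
    """Toy stand-in for MDT1 handling unlink(name) from the client."""
    if wait_for_recovery and mdt0.recovering:
        mdt0.finish_recovery()  # stands in for waiting until recovery completes
    fid = mdt0.lookup(name)     # step 4: lookup(test1) sent from MDT1 to MDT0
    if fid not in objects:      # step 6: nothing found under the stale FID
        return "ENOENT"
    objects.remove(fid)
    return "OK"


def make_state():
    # test1 was recreated before the failover: MDT0 has not replayed the new FID yet.
    mdt0 = NameServer(names={"test1": "fid-old"},
                      pending=[("test1", "fid-new")],
                      recovering=True)
    return mdt0, {"fid-new"}


print(unlink(*make_state(), "test1", wait_for_recovery=False))  # ENOENT
print(unlink(*make_state(), "test1", wait_for_recovery=True))   # OK
```

In the first call the lookup races with recovery and returns the old FID, so the unlink fails as in step 6. In the second call the lookup happens after the replay and the unlink succeeds.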

Comment by Alex Zhuravlev [ 23/Jun/16 ]

no new RPCS (except update log redo) should be sent until the recovery is completed?

Comment by Gerrit Updater [ 23/Jun/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/20940
Subject: LU-7117 osp: control RPC to be sent when recovery
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5fb68c49af3fa08189110a4e980ad792efd7128b

Comment by nasf (Inactive) [ 23/Jun/16 ]

Alex wrote: "no new RPCS (except update log redo) should be sent until the recovery is completed?"

MDT0 is in recovery, but MDT1 is operating normally, so the RPC from the client to MDT1 is not blocked.

Comment by Bob Glossman (Inactive) [ 25/Jun/16 ]

another on master:
https://testing.hpdd.intel.com/test_sets/6a54094a-3a9e-11e6-a0ce-5254006e85c2

Comment by Gerrit Updater [ 29/Jun/16 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/21064
Subject: LU-7117 mdt: mdt unlink should lock before lookup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0ae3d50ed2745366f677b79586f3fe72645330ee

Comment by Sebastien Buisson (Inactive) [ 06/Jul/16 ]

Another occurrence:
https://testing.hpdd.intel.com/test_sets/823971fa-433e-11e6-acf3-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 14/Jul/16 ]

+1 on master at https://testing.hpdd.intel.com/test_sets/79710324-49a3-11e6-a80f-5254006e85c2

Comment by Sebastien Buisson (Inactive) [ 15/Jul/16 ]

A lot more occurrences recently, like this one on master:
https://testing.hpdd.intel.com/test_sets/4d6f9aea-4a09-11e6-8968-5254006e85c2

Comment by Andreas Dilger [ 20/Jul/16 ]

How do the patches http://review.whamcloud.com/20940 "LU-7117 osp: control RPC to be sent when recovery" and http://review.whamcloud.com/21064 "LU-7117 mdt: mdt unlink should lock before lookup" relate to each other? Are they both needed? Are they different approaches to fixing the same problem, and only one is needed?

Comment by Andreas Dilger [ 20/Jul/16 ]

Never mind, I see that http://review.whamcloud.com/21064 "LU-7117 mdt: mdt unlink should lock before lookup" is abandoned since it was landed as http://review.whamcloud.com/21088 via LU-8353.

Comment by Gerrit Updater [ 15/Aug/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20940/
Subject: LU-7117 osp: set ptlrpc_request::rq_allow_replay properly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e3d507eec50fc1ff79acf2a9f93d52d698c887d7

Comment by Peter Jones [ 15/Aug/16 ]

Landed for 2.9
