[LU-7117] replay-single test_70d: timeout and mkdir/rmdir stopped Created: 08/Sep/15 Updated: 21/Sep/17 Resolved: 15/Aug/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/56e1b56e-53ff-11e5-8f2c-5254006e85c2.

The sub-test test_70d failed with the following error:

error on LL_IOC_LMV_SETSTRIPE '/mnt/lustre/d70d.replay-single/test1' (3): stripe already set
error: mkdir: create stripe dir '/mnt/lustre/d70d.replay-single/test1' failed
mkdir fails
/usr/lib64/lustre/tests/replay-single.sh: line 2236: kill: (25189) - No such process
mkdir/rmdir 25189 stopped

There are several test failures and timeouts for 70d since 2015-09-02 so I suspect a patch landed on that day or the previous day that introduced a regression.

Info required for matching: replay-single 70d |
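
The "kill: (25189) - No such process" line in the log above is what the shell prints when the background worker has already exited before the test tries to stop it. The following is a minimal sketch of that symptom, not the code from replay-single.sh: the helper name stop_worker and the stand-in workers (false, sleep) are hypothetical and only illustrate how a script could tell a worker that died early apart from one that was still running.

```shell
#!/bin/bash
# Hypothetical sketch; stop_worker is NOT a function from replay-single.sh.

# Stop a background worker.
# Returns 0 if the worker was alive and has been stopped, 1 if it had
# already exited (the "No such process" case seen in the log).
stop_worker() {
	local pid=$1

	# kill -0 sends no signal; it only checks that the PID exists.
	if ! kill -0 "$pid" 2>/dev/null; then
		return 1
	fi
	kill "$pid" 2>/dev/null
	wait "$pid" 2>/dev/null
	return 0
}

# A worker that fails immediately, standing in for a mkdir/rmdir loop
# that hit an error and exited on its own.
false &
dead_pid=$!
wait "$dead_pid" || true	# reap it so the PID is gone
if stop_worker "$dead_pid"; then
	dead_result=stopped
else
	dead_result=already_exited
fi

# A worker that is still running when it is stopped.
sleep 30 &
live_pid=$!
if stop_worker "$live_pid"; then
	live_result=stopped
else
	live_result=already_exited
fi

echo "dead worker: $dead_result"
echo "live worker: $live_result"
```

In this sketch the early exit of the worker is the real failure signal; the failed kill is only a follow-on symptom, which matches the order of messages in the log ("mkdir fails" appears before the kill error).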
| Comments |
| Comment by James Nunez (Inactive) [ 07/Oct/15 ] |
|
Another failure of replay-single test_70d with logs at https://testing.hpdd.intel.com/test_sets/e4bc4b96-6cb0-11e5-ab7f-5254006e85c2 |
| Comment by Bob Glossman (Inactive) [ 11/Nov/15 ] |
|
another on master: |
| Comment by James Nunez (Inactive) [ 13/Nov/15 ] |
|
Another failure to stop the mkdir/rmdir process, but this one takes place inside the random_fail_mdt() routine. I suspect the cause of the failure is the same since the logs are similar:

shadow-17vm9: CMD: shadow-17vm9.shadow.whamcloud.com lctl get_param -n at_max
shadow-17vm10: CMD: shadow-17vm10.shadow.whamcloud.com lctl get_param -n at_max
touch: cannot touch `/mnt/lustre/d70d.replay-single/test1/a': No such file or directory
touch fails
shadow-17vm9: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 6 sec
shadow-17vm10: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 6 sec
/usr/lib64/lustre/tests/replay-single.sh: line 2114: kill: (16877) - No such process

Logs at |
| Comment by Jian Yu [ 02/Dec/15 ] |
|
More instances on master: |
| Comment by James Nunez (Inactive) [ 28/Dec/15 ] |
|
Another failure on master with the 'No such file or directory' error: |
| Comment by Jian Yu [ 29/Jan/16 ] |
|
More failure instances on master branch: All of the instances occurred with a DNE configuration. Patch review testing on the master branch is affected by this failure. |
| Comment by Richard Henwood (Inactive) [ 01/Mar/16 ] |
|
another master branch failure, during review-dne-part-2: https://testing.hpdd.intel.com/test_sets/0a65754c-dd7d-11e5-ab2a-5254006e85c2 |
| Comment by Jian Yu [ 06/Apr/16 ] |
|
Occurred again on master branch: |
| Comment by Emoly Liu [ 18/Apr/16 ] |
|
Another failure on master: |
| Comment by nasf (Inactive) [ 17/May/16 ] |
|
Another failure instance on master: |
| Comment by John Hammond [ 19/May/16 ] |
|
https://testing.hpdd.intel.com/test_sets/59ca4542-1d4a-11e6-9089-5254006e85c2 |
| Comment by nasf (Inactive) [ 13/Jun/16 ] |
|
Another failure instance on master: |
| Comment by Andreas Dilger [ 13/Jun/16 ] |
|
This is now the number one cause of review test failures on master. |
| Comment by nasf (Inactive) [ 13/Jun/16 ] |
|
More failure instances on master: |
| Comment by Gerrit Updater [ 17/Jun/16 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/20853 |
| Comment by nasf (Inactive) [ 23/Jun/16 ] |
|
After painful debugging through millions of lines of logs, I found that the failure should be related to the following scenario: |
| Comment by Alex Zhuravlev [ 23/Jun/16 ] |
|
No new RPCs (except update log redo) should be sent until the recovery is completed? |
| Comment by Gerrit Updater [ 23/Jun/16 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/20940 |
| Comment by nasf (Inactive) [ 23/Jun/16 ] |
MDT0 is in recovery, but MDT1 is running normally, so RPCs from the client to MDT1 are not blocked. |
| Comment by Bob Glossman (Inactive) [ 25/Jun/16 ] |
|
another on master: |
| Comment by Gerrit Updater [ 29/Jun/16 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/21064 |
| Comment by Sebastien Buisson (Inactive) [ 06/Jul/16 ] |
|
Another occurrence: |
| Comment by Bruno Faccini (Inactive) [ 14/Jul/16 ] |
|
+1 on master at https://testing.hpdd.intel.com/test_sets/79710324-49a3-11e6-a80f-5254006e85c2 |
| Comment by Sebastien Buisson (Inactive) [ 15/Jul/16 ] |
|
A lot more occurrences recently, like this one on master: |
| Comment by Andreas Dilger [ 20/Jul/16 ] |
|
How do the patches http://review.whamcloud.com/20940 " |
| Comment by Andreas Dilger [ 20/Jul/16 ] |
|
Never mind, I see that http://review.whamcloud.com/21064 " |
| Comment by Gerrit Updater [ 15/Aug/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20940/ |
| Comment by Peter Jones [ 15/Aug/16 ] |
|
Landed for 2.9 |