[LU-8062] recovery-small test_115b: @@@@@@ FAIL: dd success Created: 25/Apr/16 Updated: 03/Aug/17 Resolved: 19/Jul/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.9.0, Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.1, Lustre 2.11.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Bhagyesh Dudhediya (Inactive) | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Epic/Theme: | test |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
== recovery-small test 115b: write: late REQ MDunlink and no bulk == 21:12:09 (1461384729) |
| Comments |
| Comment by Bhagyesh Dudhediya (Inactive) [ 25/Apr/16 ] |
|
In recovery-small/test_115b:
00002000:00000020:1.0:1461384731.470636:0:7993:0:(ofd_obd.c:842:ofd_statfs()) 2317411 blocks: 2307535 free, 2183803 avail; 618944 objects: 618375 free; state 0
00002000:02000000:1.0:1461384731.470641:0:7993:0:(libcfs_fail.h:96:cfs_fail_check_set()) *** cfs_fail_loc=215, val=0***
That is, a statfs is in progress on thread 7993, and it is this thread that catches the 0x215 fail_loc. |
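For reference, the failure mode can be sketched with plain lctl commands. This is an illustrative reproduction, not the actual test_115b script; the mount point, the facet prompts, and the assumption that the fail_loc is armed one-shot are mine, not taken from the test:

    # On the OSS: arm OBD_FAIL_OST_ENOSPC (0x215) so the next RPC that
    # reaches the fail check fails as if the OST were out of space.
    oss# lctl set_param fail_loc=0x215

    # On the client: the test expects this write to be the RPC that
    # trips the fail check and therefore fails.
    client# dd if=/dev/zero of=/mnt/lustre/f115b bs=4k count=1

    # If an unrelated OST_STATFS (e.g. the statfs on thread 7993 in the
    # log above) reaches the check first, it consumes the fail_loc, the
    # dd then succeeds, and the test reports "FAIL: dd success".
    oss# lctl set_param fail_loc=0
 |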
| Comment by Andreas Dilger [ 25/Apr/16 ] |
|
It looks like this failure has only been happening since 2016-04-21, so it is likely related to a patch that landed shortly before that date. |
| Comment by Andreas Dilger [ 25/Apr/16 ] |
|
This appears to be the top failing test on master in the past week, so I am increasing the priority. I checked a couple of the failing patches, and there doesn't seem to be any common source of the failure (i.e. it isn't caused by a bad patch that is repeatedly being retested), and some of the failures are on master. The few that I looked at more closely are based on the following commit:
commit 0f62ba942939c26edf07176d2eb082d38e95caec
CommitDate: Thu Apr 21 02:28:41 2016 +0000
LU-8024 kernel: kernel update [SLES12 SP1 3.12.57-60.35]
Update target and kernel_config files for new version
Test-Parameters: clientdistro=sles12 testgroup=review-ldiskfs \
mdsdistro=sles12 ossdistro=sles12 mdsfilesystemtype=ldiskfs \
mdtfilesystemtype=ldiskfs ostfilesystemtype=ldiskfs
Signed-off-by: Bob Glossman <bob.glossman@intel.com>
Change-Id: Ic9f33c4877f93249b37726d8ad60c98ee624f719
Reviewed-on: http://review.whamcloud.com/19593
This isn't to imply that this commit is the root cause of the problem (many of the test failures are on RHEL 6.7, not SLES at all), but rather that it is the point on which the other failing patches are based, so the problematic commit is somewhere before this point. |
| Comment by Andreas Dilger [ 25/Apr/16 ] |
|
It appears the root cause of these failures is the following patch, which added test_115b and landed on 2016-04-21:
commit 55f8520817a31dabf19fe0a8ac2492b85d039c38
Author: Vitaly Fertman <vitaly.fertman@seagate.com>
CommitDate: Thu Apr 21 02:27:54 2016 +0000
LU-7434 ptlrpc: lost bulk leads to a hang
The reverse order of request_out_callback() and reply_in_callback()
puts the RPC into UNREGISTERING state, which is waiting for RPC &
bulk md unlink, whereas only RPC md unlink has been called so far.
If bulk is lost, even expired_set does not check for UNREGISTERING
state.
The same for write if server returns an error.
This phase is ambiguous, split to UNREG_RPC and UNREG_BULK.
Signed-off-by: Vitaly Fertman <vitaly.fertman@seagate.com>
Reviewed-on: http://review.whamcloud.com/17221
|
| Comment by Andreas Dilger [ 25/Apr/16 ] |
|
Vitaly, could you please take a look at this? Is there a simple fix, or should the patch be reverted to give you more time to look into it? |
| Comment by Gerrit Updater [ 25/Apr/16 ] |
|
Bhagyesh Dudhediya (bhagyesh.dudhediya@seagate.com) uploaded a new patch: http://review.whamcloud.com/19758 |
| Comment by Richard Henwood (Inactive) [ 25/Apr/16 ] |
|
Another occurrence of this failure on master, in review-dne-part-1: https://testing.hpdd.intel.com/sub_tests/637e1824-0a36-11e6-855a-5254006e85c2 |
| Comment by Andreas Dilger [ 26/Apr/16 ] |
|
Closing this bug, since the problematic patch was reverted. |
| Comment by parinay v kondekar (Inactive) [ 26/Apr/16 ] |
| Comment by Vitaly Fertman [ 26/Apr/16 ] |
Actually, the fix was submitted above. Please re-land the original patch with the fix above. |
| Comment by Niu Yawei (Inactive) [ 05/Jul/16 ] |
|
The failure reoccurred in master review: https://testing.hpdd.intel.com/test_sets/5766e584-41c7-11e6-bbf5-5254006e85c2 |
| Comment by Jian Yu [ 05/Aug/16 ] |
|
One more failure instance on master branch: https://testing.hpdd.intel.com/test_sets/11d1b22e-5b2b-11e6-b2e2-5254006e85c2 |
| Comment by nasf (Inactive) [ 03/Nov/16 ] |
|
+1 on master: |
| Comment by Niu Yawei (Inactive) [ 10/Nov/16 ] |
|
Hit on master: https://testing.hpdd.intel.com/test_sets/3c6bacec-a680-11e6-8859-5254006e85c2 |
| Comment by nasf (Inactive) [ 16/Nov/16 ] |
|
+1 on master: |
| Comment by Emoly Liu [ 30/Nov/16 ] |
|
+1 on master: |
| Comment by Jian Yu [ 05/Dec/16 ] |
|
+1 on master: https://testing.hpdd.intel.com/test_sets/c78d6572-b396-11e6-85c4-5254006e85c2 |
| Comment by nasf (Inactive) [ 11/Dec/16 ] |
|
+1 on master: |
| Comment by nasf (Inactive) [ 23/Dec/16 ] |
|
+1 on master: |
| Comment by Bob Glossman (Inactive) [ 21/Jan/17 ] |
|
another on master: |
| Comment by Bob Glossman (Inactive) [ 25/Jan/17 ] |
|
another on master: |
| Comment by Bob Glossman (Inactive) [ 04/Feb/17 ] |
|
another on master: |
| Comment by Sebastien Buisson (Inactive) [ 06/Feb/17 ] |
|
+1 on master: |
| Comment by Andreas Dilger [ 06/Feb/17 ] |
|
As Bhagyesh mentioned in his first comment, this looks to be a problem with the test itself rather than with the code. There is a race condition because fail_loc=0x215 (OBD_FAIL_OST_ENOSPC) is insufficiently specific: it is intended to make only the write fail, but it also triggers for unrelated OST_STATFS RPCs on that OST (e.g. from the MDS). |
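A minimal sketch of the more targeted arming, assuming the server-side check compares cfs_fail_val against a value supplied by the intended code path (the specific value shown here is illustrative, not taken from the eventual patch):

    # Setting fail_val alongside fail_loc makes the value-matching
    # variant of cfs_fail_check_set() fire only when the caller passes
    # a matching value, so RPCs that pass a different value (or none)
    # no longer consume the fail_loc.
    oss# lctl set_param fail_val=1 fail_loc=0x215
 |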
| Comment by Gerrit Updater [ 06/Feb/17 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/25279 |
| Comment by James Casper [ 30/Mar/17 ] |
|
Saw a very similar "write not blocked" issue with master b3541:
fail_loc=0x720
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.0340116 s, 120 kB/s
sanity test_313: @@@@@@ FAIL: write should failed
https://testing.hpdd.intel.com/test_sets/1c10446a-0a05-11e7-9053-5254006e85c2 |
| Comment by Emoly Liu [ 18/Apr/17 ] |
|
+1 on master: |
| Comment by Vitaly Fertman [ 25/Apr/17 ] |
Can you clarify how this happens? There is a protection made with $OSTCOUNT: the fail check is triggered only if the fail_val matches the OST id, so this is not supposed to happen. |
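For context, the protection referred to is shaped roughly like the following test-script fragment; do_facet and OSTCOUNT are real test-framework names, but the index arithmetic is a hypothetical illustration, not the exact code from the test:

    # Arm the fail check on one OST only, with a fail_val derived from
    # the OST index, so that the server-side check fires solely for the
    # RPC targeting that OST (hypothetical arithmetic).
    OSTIDX=0
    do_facet ost$((OSTIDX + 1)) lctl set_param \
        fail_val=$((OSTIDX + OSTCOUNT)) fail_loc=0x215
 |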
| Comment by Gerrit Updater [ 25/Apr/17 ] |
|
Vitaly Fertman (vitaly.fertman@seagate.com) uploaded a new patch: https://review.whamcloud.com/26815 |
| Comment by Gerrit Updater [ 19/Jul/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26815/ |
| Comment by Gerrit Updater [ 21/Jul/17 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/28181 |
| Comment by Gerrit Updater [ 03/Aug/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28181/ |