[LU-8062]  recovery-small test_115b: @@@@@@ FAIL: dd success
Created: 25/Apr/16  Updated: 03/Aug/17  Resolved: 19/Jul/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.9.0, Lustre 2.10.0
Fix Version/s: Lustre 2.10.1, Lustre 2.11.0

Type: Bug Priority: Major
Reporter: Bhagyesh Dudhediya (Inactive) Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-8061 recovery-small: short description of ... Resolved
Related
is related to LU-7434 lost bulk leads to a hang Resolved
is related to LU-8067 sanityn test_31a: read dd: reading `/... Resolved
Epic/Theme: test
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

== recovery-small test 115b: write: late REQ MDunlink and no bulk == 21:12:09 (1461384729)
Filesystem 1K-blocks Used Available Use% Mounted on
onyx-38vm7@tcp:/lustre
74157152 309236 69890576 1% /mnt/lustre
fail_loc=0x8000051b
fail_val=4
Filesystem 1K-blocks Used Available Use% Mounted on
onyx-38vm7@tcp:/lustre
74157152 309236 69890576 1% /mnt/lustre
CMD: onyx-38vm8 lctl set_param fail_val=0 fail_loc=0x80000215
fail_val=0
fail_loc=0x80000215
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 2.13538 s, 1.9 kB/s
recovery-small test_115b: @@@@@@ FAIL: dd success
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4764:error()
= /usr/lib64/lustre/tests/recovery-small.sh:2161:test_115_write()
= /usr/lib64/lustre/tests/recovery-small.sh:2181:test_115b()
= /usr/lib64/lustre/tests/test-framework.sh:5028:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:5067:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4914:run_test()
= /usr/lib64/lustre/tests/recovery-small.sh:2183:main()
Dumping lctl log to /logdir/test_logs/2016-04-22/lustre-reviews-el6_7-x86_64-review-dne-part-1-1_6_1_38438_-70130481106820-100010/recovery-small.test_115b.*.1461384732.log

Comments
Comment by Bhagyesh Dudhediya (Inactive) [ 25/Apr/16 ]

In recovery-small/test_115b:
An IO RPC is injected with a fail_loc value so that it waits for a while on some completion (req/reply LNet buffer unlink, or even bulk unlink). A delay is also injected before the RPC is sent.
The delay gives time to inject another fail_loc, this time on the server side, which makes the server return an error for this particular case. The first fail_loc is injected accurately, since there are no other IO RPCs in flight at that time. The problem with the test is that the second fail_loc is too generic: whatever RPC the server handles next gets the error. So if some other RPC arrives just before our IO RPC, that RPC gets the error and our IO does not.
The OST-side logs show exactly this: the fail_loc that should be caught in the IO path is actually caught by the statfs thread:

00002000:00000020:1.0:1461384731.470636:0:7993:0:(ofd_obd.c:842:ofd_statfs()) 2317411 blocks: 2307535 free, 2183803 avail; 618944 objects: 618375 free; state 0
00002000:02000000:1.0:1461384731.470641:0:7993:0:(libcfs_fail.h:96:cfs_fail_check_set()) *** cfs_fail_loc=215, val=0***

i.e. a statfs is in progress in thread 7993 and it catches fail_loc 0x215.

Comment by Andreas Dilger [ 25/Apr/16 ]

It looks like this failure has only been happening since 2016-04-21:
https://testing.hpdd.intel.com/sub_tests/b09ba16c-07d9-11e6-9b34-5254006e85c2

so it is likely related to some patch that landed shortly before that time.

Comment by Andreas Dilger [ 25/Apr/16 ]

This appears to be the top failing test on master in the past week, so increasing the priority.

I checked a couple of the failing patches, and there doesn't seem to be any common source of the failure (i.e. it isn't caused by a bad patch that is repeatedly being retested) and some of the failures are on master. The few that I looked at more closely are based on commit 0f62ba942939c26edf07176d2eb082d38e95caec:

CommitDate: Thu Apr 21 02:28:41 2016 +0000

    LU-8024 kernel: kernel update [SLES12 SP1 3.12.57-60.35]
    
    Update target and kernel_config files for new version
    
    Test-Parameters: clientdistro=sles12 testgroup=review-ldiskfs \
      mdsdistro=sles12 ossdistro=sles12 mdsfilesystemtype=ldiskfs \
      mdtfilesystemtype=ldiskfs ostfilesystemtype=ldiskfs
    
    Signed-off-by: Bob Glossman <bob.glossman@intel.com>
    Change-Id: Ic9f33c4877f93249b37726d8ad60c98ee624f719
    Reviewed-on: http://review.whamcloud.com/19593

That isn't to imply this commit is the root cause of the problem (many of the test failures are on RHEL6.7, not SLES at all), but rather that this is the point from which the other patches started failing this test, so the problematic commit landed at or before this point.

Comment by Andreas Dilger [ 25/Apr/16 ]

It appears the root cause of these failures is the following patch, which added test_115b and landed on 2016-04-21:

commit 55f8520817a31dabf19fe0a8ac2492b85d039c38
Author:     Vitaly Fertman <vitaly.fertman@seagate.com>
CommitDate: Thu Apr 21 02:27:54 2016 +0000

    LU-7434 ptlrpc: lost bulk leads to a hang
    
    The reverse order of request_out_callback() and reply_in_callback()
    puts the RPC into UNREGISTERING state, which is waiting for RPC &
    bulk md unlink, whereas only RPC md unlink has been called so far.
    If bulk is lost, even expired_set does not check for UNREGISTERING
    state.
    
    The same for write if server returns an error.
    
    This phase is ambiguous, split to UNREG_RPC and UNREG_BULK.
    
    Signed-off-by: Vitaly Fertman <vitaly.fertman@seagate.com>
    Reviewed-on: http://review.whamcloud.com/17221

Comment by Andreas Dilger [ 25/Apr/16 ]

Vitaly, could you please take a look at this? Is there a simple fix, or should the patch be reverted to give you more time to look into it?

Comment by Gerrit Updater [ 25/Apr/16 ]

Bhagyesh Dudhediya (bhagyesh.dudhediya@seagate.com) uploaded a new patch: http://review.whamcloud.com/19758
Subject: LU-8062 test: fix fail_val in recovery-small/115b
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f5eb98a57ed6cfd36321cb2e7e6e2ac2eb309528

Comment by Richard Henwood (Inactive) [ 25/Apr/16 ]

Another occurrence of this failure, on Master, on review-dne-part-1

https://testing.hpdd.intel.com/sub_tests/637e1824-0a36-11e6-855a-5254006e85c2

Comment by Andreas Dilger [ 26/Apr/16 ]

Closing this bug, since the problematic patch was reverted, and LU-7434 is still open to track the re-landing of the fixed patch.

Comment by parinay v kondekar (Inactive) [ 26/Apr/16 ]

s/LU-7343/LU-7434/
(correcting the typo)

Comment by Vitaly Fertman [ 26/Apr/16 ]

Quoting Andreas Dilger:
"Closing this bug, since the problematic patch was reverted, and LU-7434 is still open to track the re-landing of the fixed patch."

Actually, the fix was already submitted above (http://review.whamcloud.com/19758); please re-land the original patch together with that fix.

Comment by Niu Yawei (Inactive) [ 05/Jul/16 ]

https://testing.hpdd.intel.com/test_sets/5766e584-41c7-11e6-bbf5-5254006e85c2

The failure reoccurred in master review.

Comment by Jian Yu [ 05/Aug/16 ]

One more failure instance on master branch: https://testing.hpdd.intel.com/test_sets/11d1b22e-5b2b-11e6-b2e2-5254006e85c2

Comment by nasf (Inactive) [ 03/Nov/16 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/0409e532-a1d3-11e6-9ab0-5254006e85c2

Comment by Niu Yawei (Inactive) [ 10/Nov/16 ]

Hit on master: https://testing.hpdd.intel.com/test_sets/3c6bacec-a680-11e6-8859-5254006e85c2

Comment by nasf (Inactive) [ 16/Nov/16 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/0cf1f83e-ab7d-11e6-a726-5254006e85c2

Comment by Emoly Liu [ 30/Nov/16 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/552d00bc-b663-11e6-b603-5254006e85c2

Comment by Jian Yu [ 05/Dec/16 ]

+1 on master:

https://testing.hpdd.intel.com/test_sets/c78d6572-b396-11e6-85c4-5254006e85c2

Comment by nasf (Inactive) [ 11/Dec/16 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/c1bf78dc-be55-11e6-9f18-5254006e85c2

Comment by nasf (Inactive) [ 23/Dec/16 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/91f16c44-c8d4-11e6-8515-5254006e85c2

Comment by Bob Glossman (Inactive) [ 21/Jan/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/7b0a1850-df81-11e6-be8a-5254006e85c2

Comment by Bob Glossman (Inactive) [ 25/Jan/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/f3a0e4da-e321-11e6-981b-5254006e85c2

Comment by Bob Glossman (Inactive) [ 04/Feb/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/8bb48412-eadd-11e6-9fd4-5254006e85c2

Comment by Sebastien Buisson (Inactive) [ 06/Feb/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/904c2bbc-ea9f-11e6-b844-5254006e85c2

Comment by Andreas Dilger [ 06/Feb/17 ]

As Bhagyesh Dudhediya mentioned in his first comment, this looks to be a problem with the test itself rather than with the code. There is a race condition because fail_loc=0x215 (OBD_FAIL_OST_ENOSPC) is insufficiently specific to cause only the write to fail; it also fires for unrelated OST_STATFS RPCs on that OST (e.g. from the MDS).

Comment by Gerrit Updater [ 06/Feb/17 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/25279
Subject: LU-8062 tests: fix recovery-small test_115b fail_loc
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8b7d81d7c6437ce085870d0c434b7b44c3d6601f

Comment by James Casper [ 30/Mar/17 ]

Saw a very similar issue with master b3541, where a write that should have failed did not:

fail_loc=0x720
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.0340116 s, 120 kB/s
 sanity test_313: @@@@@@ FAIL: write should failed

https://testing.hpdd.intel.com/test_sets/1c10446a-0a05-11e7-9053-5254006e85c2

Comment by Emoly Liu [ 18/Apr/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/41cb5880-23e0-11e7-b742-5254006e85c2

Comment by Vitaly Fertman [ 25/Apr/17 ]

Quoting Andreas Dilger:
"As Bhagyesh Dudhediya mentioned in his first comment, this looks to be a problem with the test itself rather than with the code. There is a race condition because fail_loc=0x215 (OBD_FAIL_OST_ENOSPC) is insufficiently specific to cause only the write to fail; it also fires for unrelated OST_STATFS RPCs on that OST (e.g. from the MDS)."

Can you clarify how that happens? There is a protection based on $OSTCOUNT: the fail check is triggered only if fail_val matches the OST index, so this is not supposed to happen.

Comment by Gerrit Updater [ 25/Apr/17 ]

Vitaly Fertman (vitaly.fertman@seagate.com) uploaded a new patch: https://review.whamcloud.com/26815
Subject: LU-8062 libcfs: schedule_timeout fix
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fbc35bf832d17afa3aa7c7c9e4178986e7cc0458

Comment by Gerrit Updater [ 19/Jul/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26815/
Subject: LU-8062 ptlrpc: increase sleep time in ptlrpc_request_bufs_pack()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e9e744ea7352ea0d1a5d9b2bd05e0e7c19f08596

Comment by Gerrit Updater [ 21/Jul/17 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/28181
Subject: LU-8062 ptlrpc: increase sleep time in ptlrpc_request_bufs_pack()
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: b13e61f2e689efb8de1b558e9499c497921187e1

Comment by Gerrit Updater [ 03/Aug/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28181/
Subject: LU-8062 ptlrpc: increase sleep time in ptlrpc_request_bufs_pack()
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 39c090bdb9beacc0837cf921d87a451308364131
