LU-8062

recovery-small test_115b: @@@@@@ FAIL: dd success

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.10.1, Lustre 2.11.0
    • Affects Version/s: Lustre 2.8.0, Lustre 2.9.0, Lustre 2.10.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      == recovery-small test 115b: write: late REQ MDunlink and no bulk == 21:12:09 (1461384729)
      Filesystem 1K-blocks Used Available Use% Mounted on
      onyx-38vm7@tcp:/lustre
      74157152 309236 69890576 1% /mnt/lustre
      fail_loc=0x8000051b
      fail_val=4
      Filesystem 1K-blocks Used Available Use% Mounted on
      onyx-38vm7@tcp:/lustre
      74157152 309236 69890576 1% /mnt/lustre
      CMD: onyx-38vm8 lctl set_param fail_val=0 fail_loc=0x80000215
      fail_val=0
      fail_loc=0x80000215
      1+0 records in
      1+0 records out
      4096 bytes (4.1 kB) copied, 2.13538 s, 1.9 kB/s
      recovery-small test_115b: @@@@@@ FAIL: dd success
      Trace dump:
      = /usr/lib64/lustre/tests/test-framework.sh:4764:error()
      = /usr/lib64/lustre/tests/recovery-small.sh:2161:test_115_write()
      = /usr/lib64/lustre/tests/recovery-small.sh:2181:test_115b()
      = /usr/lib64/lustre/tests/test-framework.sh:5028:run_one()
      = /usr/lib64/lustre/tests/test-framework.sh:5067:run_one_logged()
      = /usr/lib64/lustre/tests/test-framework.sh:4914:run_test()
      = /usr/lib64/lustre/tests/recovery-small.sh:2183:main()
      Dumping lctl log to /logdir/test_logs/2016-04-22/lustre-reviews-el6_7-x86_64-review-dne-part-1-1_6_1_38438_-70130481106820-100010/recovery-small.test_115b.*.1461384732.log
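
      For readers unfamiliar with the fail_loc lines above: Lustre's fault injection is driven with "lctl set_param fail_loc=<hex> fail_val=<n>", where the low bits of fail_loc select a fault site compiled into the code, the 0x80000000 bit makes the fault fire only once, and fail_val parameterizes it (here presumably a delay in seconds for the "late REQ MDunlink" site). The standalone C program below is illustrative only, not Lustre source; the macro names are invented for the example, and only the one-shot bit and the two values are taken from the log.

      /* decode_fail_loc.c - toy decoder for the fail_loc values in this log.
       * NOT Lustre source; FAIL_ONCE/FAIL_SITE_MASK are illustrative names. */
      #include <stdio.h>
      #include <stdint.h>

      #define FAIL_ONCE       0x80000000u   /* one-shot behaviour flag       */
      #define FAIL_SITE_MASK  0x0000ffffu   /* low bits: fault site selector */

      static void decode(uint32_t loc, uint32_t val)
      {
              printf("fail_loc=0x%08x -> site 0x%04x%s, fail_val=%u\n",
                     loc, loc & FAIL_SITE_MASK,
                     (loc & FAIL_ONCE) ? " (one-shot)" : "", val);
      }

      int main(void)
      {
              decode(0x8000051b, 4);   /* client: "late REQ MDunlink" site      */
              decode(0x80000215, 0);   /* set on onyx-38vm8 per the CMD: line   */
              return 0;
      }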

Attachments

Issue Links

Activity

            niu Niu Yawei (Inactive) added a comment - Hit on master: https://testing.hpdd.intel.com/test_sets/3c6bacec-a680-11e6-8859-5254006e85c2
            yong.fan nasf (Inactive) added a comment - +1 on master: https://testing.hpdd.intel.com/test_sets/0409e532-a1d3-11e6-9ab0-5254006e85c2
            yujian Jian Yu added a comment - One more failure instance on master branch: https://testing.hpdd.intel.com/test_sets/11d1b22e-5b2b-11e6-b2e2-5254006e85c2
            niu Niu Yawei (Inactive) added a comment - https://testing.hpdd.intel.com/test_sets/5766e584-41c7-11e6-bbf5-5254006e85c2 The failure reoccurred in master review.

            vitaly_fertman Vitaly Fertman added a comment - in reply to Andreas Dilger's comment ("Closing this bug, since the problematic patch was reverted, and LU-7434 is still open to track the re-landing of the fixed patch."):

            Actually, the fix was submitted above. Please re-land the original patch with the fix above.

            parinay parinay v kondekar (Inactive) added a comment - s/ LU-7343 / LU-7434 / (correcting the typo)
            adilger Andreas Dilger added a comment - edited

            Closing this bug, since the problematic patch was reverted, and LU-7434 is still open to track the re-landing of the fixed patch.


            rhenwood Richard Henwood (Inactive) added a comment - Another occurrence of this failure, on Master, on review-dne-part-1 https://testing.hpdd.intel.com/sub_tests/637e1824-0a36-11e6-855a-5254006e85c2

            gerrit Gerrit Updater added a comment - Bhagyesh Dudhediya (bhagyesh.dudhediya@seagate.com) uploaded a new patch: http://review.whamcloud.com/19758
            Subject: LU-8062 test: fix fail_val in recovery-small/115b
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f5eb98a57ed6cfd36321cb2e7e6e2ac2eb309528


            adilger Andreas Dilger added a comment - Vitaly, could you please take a look at this? Is there a simple fix, or should the patch be reverted to give you more time to look into it?

            adilger Andreas Dilger added a comment - It appears the root cause of these failures is the following patch, which added test_115b and landed on 2016-04-21:

            commit 55f8520817a31dabf19fe0a8ac2492b85d039c38
            Author:     Vitaly Fertman <vitaly.fertman@seagate.com>
            CommitDate: Thu Apr 21 02:27:54 2016 +0000
            
                LU-7434 ptlrpc: lost bulk leads to a hang
                
                The reverse order of request_out_callback() and reply_in_callback()
                puts the RPC into UNREGISTERING state, which is waiting for RPC &
                bulk md unlink, whereas only RPC md unlink has been called so far.
                If bulk is lost, even expired_set does not check for UNREGISTERING
                state.
                
                The same for write if server returns an error.
                
                This phase is ambiguous, split to UNREG_RPC and UNREG_BULK.
                
                Signed-off-by: Vitaly Fertman <vitaly.fertman@seagate.com>
                Reviewed-on: http://review.whamcloud.com/17221
            
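            To make the phase split described in the commit message concrete: with a single UNREGISTERING phase the timeout logic cannot tell whether it is still waiting on the request MD, the bulk MD, or both, so a lost bulk can wait forever; giving each wait its own phase makes the stuck state observable. The small self-contained C model below is illustrative only, not the Lustre implementation, and all names in it are invented.

            /* toy_phase_split.c - model of splitting one ambiguous "unregistering"
             * phase into UNREG_RPC and UNREG_BULK.  NOT Lustre source code. */
            #include <stdbool.h>
            #include <stdio.h>

            enum phase { PHASE_RPC, PHASE_UNREG_RPC, PHASE_UNREG_BULK, PHASE_DONE };

            struct toy_req {
                    bool req_md_unlinked;    /* request MD unlinked */
                    bool bulk_md_unlinked;   /* bulk MD unlinked    */
                    enum phase phase;
            };

            /* Advance the phase as unlink events arrive; each outstanding MD now
             * has its own phase, so an expiry check can see exactly what is stuck. */
            static void advance(struct toy_req *rq)
            {
                    if (rq->phase == PHASE_RPC)
                            rq->phase = PHASE_UNREG_RPC;
                    if (rq->phase == PHASE_UNREG_RPC && rq->req_md_unlinked)
                            rq->phase = PHASE_UNREG_BULK;
                    if (rq->phase == PHASE_UNREG_BULK && rq->bulk_md_unlinked)
                            rq->phase = PHASE_DONE;
            }

            int main(void)
            {
                    /* Lost-bulk case: the request MD unlinks, the bulk MD never does. */
                    struct toy_req rq = { .req_md_unlinked = true,
                                          .bulk_md_unlinked = false,
                                          .phase = PHASE_RPC };
                    advance(&rq);
                    printf("stuck in phase %d (UNREG_BULK=%d)\n", rq.phase, PHASE_UNREG_BULK);
                    return 0;
            }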

People

  wc-triage WC Triage
  529964 Bhagyesh Dudhediya (Inactive)
  Votes: 0
  Watchers: 16

Dates

  Created:
  Updated:
  Resolved: