[LU-6200] Failover recovery-mds-scale test_failover_ost: test_failover_ost returned 1 Created: 03/Feb/15  Updated: 26/Mar/19  Resolved: 26/Mar/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.8.0, Lustre 2.10.0, Lustre 2.11.0, Lustre 2.10.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Duplicate Votes: 0
Labels: p4hc
Environment:

client and server: lustre-master build # 2835 RHEL6


Issue Links:
Duplicate
is duplicated by LU-463 orphan recovery happens too late, cau... Resolved
Related
is related to LU-5483 recovery-mds-scale test failover_mds:... Reopened
is related to LU-5526 recovery-mds-scale test failover_mds:... Resolved
Severity: 3
Rank (Obsolete): 17329

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/be3ebe76-a817-11e4-93dd-5254006e85c2.

The sub-test test_failover_ost failed with the following error:

test_failover_ost returned 1

Client 3 shows:

tar: etc/sysconfig/quota_nld: Cannot write: No such file or directory
tar: etc/sysconfig/quota_nld: Cannot utime: No such file or directory
tar: etc/sysconfig/sandbox: Cannot write: No such file or directory
tar: etc/sysconfig/nfs: Cannot write: No such file or directory
tar: Exiting with failure status due to previous errors


 Comments   
Comment by Jian Yu [ 03/Feb/15 ]

Hi Hongchao,

Is this similar to LU-4621?

Comment by Andreas Dilger [ 03/Feb/15 ]

Hongchao, what is the severity of this bug? Is it something that will break normal failover for users or is it only affecting testing?

Comment by Andreas Dilger [ 03/Feb/15 ]

Is this a new regression from a recently landed patch?

Comment by Hongchao Zhang [ 04/Feb/15 ]

Hi Andreas,
this issue should not affect only testing.
There is no corresponding object when replaying the "OST_SETATTR" (op=2) request, so the replay returns -2 (ENOENT).

11:14:39:LustreError: 2377:0:(client.c:2809:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff8800448fd980 x1491531070413056/t30064787885(30064787885) o2->lustre-OST0000-osc-ffff880037b57c00@10.2.4.145@tcp:28/4 lens 440/400 e 0 to 0 dl 1422443475 ref 2 fl Interpret:R/4/0 rc -2/-2
11:14:39:LustreError: 2377:0:(client.c:2809:ptlrpc_replay_interpret()) Skipped 12 previous similar messages
11:14:39:Lustre: lustre-OST0000-osc-ffff880037b57c00: Connection restored to lustre-OST0000 (at 10.2.4.145@tcp)
11:14:39:Lustre: Skipped 2 previous similar messages
11:14:39:Lustre: 2377:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1422443458/real 1422443458]  req@ffff88004303c680 x1491531070428380/t0(0) o8->lustre-OST0001-osc-ffff880037b57c00@10.2.4.141@tcp:28/4 lens 400/544 e 0 to 1 dl 1422443483 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
11:14:39:Lustre: 2377:0:(client.c:1942:ptlrpc_expire_one_request()) Skipped 26 previous similar messages
11:14:40:LustreError: 2377:0:(client.c:2809:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff88004293c680 x1491531070413096/t30064787867(30064787867) o2->lustre-OST0001-osc-ffff880037b57c00@10.2.4.145@tcp:28/4 lens 440/400 e 0 to 0 dl 1422443544 ref 2 fl Interpret:R/4/0 rc -2/-2
11:14:40:LustreError: 2377:0:(client.c:2809:ptlrpc_replay_interpret()) Skipped 7 previous similar messages
11:14:40:Lustre: lustre-OST0001-osc-ffff880037b57c00: Connection restored to lustre-OST0001 (at 10.2.4.145@tcp)
11:14:40:Lustre: 2377:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1422443593/real 1422443593]  req@ffff880043240080 x1491531070431228/t0(0) o8->lustre-OST0005-osc-ffff880037b57c00@10.2.4.141@tcp:28/4 lens 400/544 e 0 to 1 dl 1422443619 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
11:14:40:Lustre: 2377:0:(client.c:1942:ptlrpc_expire_one_request()) Skipped 46 previous similar messages
11:14:40:LustreError: 2377:0:(client.c:2809:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff880042d84c80 x1491531070413176/t30064787863(30064787863) o2->lustre-OST0003-osc-ffff880037b57c00@10.2.4.145@tcp:28/4 lens 440/400 e 0 to 0 dl 1422443660 ref 2 fl Interpret:R/4/0 rc -2/-2
11:14:40:LustreError: 2377:0:(client.c:2809:ptlrpc_replay_interpret()) Skipped 17 previous similar messages
11:14:40:Lustre: lustre-OST0003-osc-ffff880037b57c00: Connection restored to lustre-OST0003 (at 10.2.4.145@tcp)
11:14:40:Lustre: Skipped 1 previous similar message
11:14:40:LustreError: 2377:0:(client.c:2809:ptlrpc_replay_interpret()) @@@ status -2, old was 0  req@ffff880042dbec80 x1491531070412976/t30064787924(30064787924) o2->lustre-OST0005-osc-ffff880037b57c00@10.2.4.145@tcp:28/4 lens 440/400 e 0 to 0 dl 1422443725 ref 2 fl Interpret:R/4/0 rc -2/-2
11:31:06:LustreError: 2377:0:(client.c:2809:ptlrpc_replay_interpret()) Skipped 16 previous similar messages
11:31:06:Lustre: lustre-OST0005-osc-ffff880037b57c00: Connection restored to lustre-OST0005 (at 10.2.4.145@tcp)

how about recreating those missing objects in "ofd_setattr_hdl" just like "ofd_preprw_write"?
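
For illustration, a minimal standalone sketch of that idea (the struct and helpers below are made-up stand-ins, not the real OFD code): on replay, a setattr that finds its object missing would recreate it first, the way the write path does, instead of failing with -ENOENT.

#include <errno.h>
#include <stdio.h>

/* Illustrative stand-ins only, not the real OFD structures/helpers. */
struct obj { int exists; };

static int obj_create(struct obj *o)
{
        o->exists = 1;
        return 0;
}

/*
 * Proposed behaviour: on replay, recreate a missing object before
 * applying the attribute update; on the normal path keep returning
 * -ENOENT as before.
 */
static int setattr_replay(struct obj *o, int is_replay)
{
        int rc;

        if (!o->exists) {
                if (!is_replay)
                        return -ENOENT;
                rc = obj_create(o);
                if (rc != 0)
                        return rc;
        }
        /* ... apply the attribute update here ... */
        return 0;
}

int main(void)
{
        struct obj missing = { .exists = 0 };

        printf("replay setattr rc = %d\n", setattr_replay(&missing, 1));
        return 0;
}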

Comment by Gerrit Updater [ 06/Feb/15 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/13668
Subject: LU-6200 ofd: recreate objects for setattr
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c22faf7b7a1721ca3da24f31f2164b18d0cfb666

Comment by Hongchao Zhang [ 27/Feb/15 ]

the patch for this ticket has been merged with the patch for LU-5526. The "No such file or directory" failure seems to have been fixed by that patch, but the LU-5526 issue ("No space left on device") still occurs; its cause is known, but the best way to fix it is still being worked out.

https://testing.hpdd.intel.com/test_sets/09d2e80a-b798-11e4-9d63-5254006e85c2
https://testing.hpdd.intel.com/test_sets/78c3db02-b5d1-11e4-a70c-5254006e85c2
https://testing.hpdd.intel.com/test_sets/815c59b4-b3a7-11e4-add6-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 10/Dec/15 ]

master, build# 3264, 2.7.64 tag
Hard Failover: EL6.7 Server/Client
https://testing.hpdd.intel.com/test_sets/7b412132-9edd-11e5-87a9-5254006e85c2

Comment by Sarah Liu [ 15/Dec/15 ]

Hit this issue on every hard failover config (6 of them). If the patch for LU-5526 cannot be landed soon, could we have a separate patch for this particular problem?

Per the discussion in the triage call: with a separate patch the test would still hit LU-5526 every time, so a separate patch would bring no improvement in this case.

Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ]

Another instance found for hard failover: EL6.7 Server/Client
https://testing.hpdd.intel.com/test_sets/3e92c154-bc93-11e5-8f65-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ]

Another instance found for hard failover: EL6.7 Server/SLES11 SP3 Clients
https://testing.hpdd.intel.com/test_sets/762762d0-ba4c-11e5-9a07-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 09/Feb/16 ]

Another instance found for hard failover: EL6.7 Server/Client, tag 2.7.66, master build 3314
https://testing.hpdd.intel.com/test_sessions/7c5e8006-cb2d-11e5-b3e8-5254006e85c2

Another instance found for hard failover: EL6.7 Server/Client - ZFS, tag 2.7.66, master build 3314
https://testing.hpdd.intel.com/test_sessions/766ea3ec-cb55-11e5-b49e-5254006e85c2

Another instance found for hard failover: EL7 Server/Client, tag 2.7.66, master build 3314
https://testing.hpdd.intel.com/test_sessions/8d13249a-ca8f-11e5-9609-5254006e85c2

Another instance found for hard failover: EL7 Server/SLES11 SP3 Client, tag 2.7.66, master build 3316
https://testing.hpdd.intel.com/test_sessions/2fbf67e4-cd4c-11e5-b1fa-5254006e85c2

Another instance found for hard failover: EL7 Server/Client - ZFS, tag 2.7.66, master build 3314
https://testing.hpdd.intel.com/test_sessions/f0dd9616-ca6e-11e5-9609-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 24/Feb/16 ]

Another instance found on b2_8 for failover testing, build# 6.
https://testing.hpdd.intel.com/test_sessions/0aed3028-da39-11e5-a8a6-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/eaf85780-d65e-11e5-afe8-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/54ec62da-d99d-11e5-9ebe-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/eb9f29ec-d8da-11e5-83e2-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/2f0aa9f6-d5a5-11e5-9cc2-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/c5a8e44c-d9c7-11e5-85dd-5254006e85c2

Comment by Hongchao Zhang [ 21/Feb/17 ]

the patch https://review.whamcloud.com/#/c/13668/ has been updated.

Comment by Alexey Lyashkov [ 22/May/17 ]

Hongchao,

I don't have access to Gerrit right now, but your patch
https://git.hpdd.intel.com/?p=fs/lustre-release.git;a=commitdiff;h=daa98c46817c98d6fbf70dafa9fbdde678f8b9ba;hp=32d1a1c5d610d054ad4609c1cf332172e8310805
is bad.

It looks like you can't use this:

+        /* Do sync create if the seq is about to used up */
+        if (fid_seq_is_idif(seq) || fid_seq_is_mdt0(seq)) {
+                if (unlikely(oid >= IDIF_MAX_OID - 1))
+                        sync = 1;

because the OST object ID in the IDIF case also needs to account for the lower 16 bits of the sequence; please look at the OST ID macros.
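
For reference, a standalone illustration of that point (not Lustre code; it assumes the usual IDIF layout where the full 48-bit OST object ID combines the low 16 bits of the sequence with the 32-bit oid, which is what fid_idif_id() computes). The example shows how a check on the oid alone can miss a limit that the combined ID does reach.

#include <stdint.h>
#include <stdio.h>

#define IDIF_MAX_OID    (1ULL << 48)    /* 48-bit object IDs under IDIF */

/* Full OST object ID for an IDIF FID: the low 16 bits of the sequence
 * hold the top bits, the 32-bit oid holds the rest (mirrors fid_idif_id()). */
static uint64_t idif_full_id(uint64_t seq, uint32_t oid)
{
        return ((seq & 0xffff) << 32) | oid;
}

int main(void)
{
        uint64_t seq = 0x1ffffffffULL;  /* IDIF sequence with low 16 bits set */
        uint32_t oid = 0xffffffffU;     /* largest 32-bit oid */

        /* The oid alone never reaches IDIF_MAX_OID; the combined ID can. */
        printf("oid-only check: %d\n", (uint64_t)oid >= IDIF_MAX_OID - 1);
        printf("full-ID check : %d\n", idif_full_id(seq, oid) >= IDIF_MAX_OID - 1);
        return 0;
}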

Comment by Hongchao Zhang [ 25/May/17 ]

the patch has been updated as per the review feedback.

Comment by Hongchao Zhang [ 23/Jun/17 ]

the patch https://review.whamcloud.com/#/c/13668/ has been updated.

Comment by James Casper [ 26/Sep/17 ]

2.11.0:
https://testing.hpdd.intel.com/test_sessions/e6578085-2eed-486d-8601-e5214bac4bb0

Comment by Hongchao Zhang [ 11/Oct/17 ]

the patch https://review.whamcloud.com/#/c/13668/ has been updated

Comment by Sergey Cheremencev [ 18/Mar/19 ]

Hi,

The description of the problem looks similar to what I've already fixed in https://review.whamcloud.com/#/c/33836/.
Please look carefully, and if I am right, this could be resolved as a duplicate of LU-11765.

Comment by Hongchao Zhang [ 18/Mar/19 ]

Hi Sergey,

Thanks!
This should be the same issue as LU-11765; only the fixes differ: the patch in this ticket recreates the object if it doesn't exist, while the patch in LU-11765 returns -EAGAIN to notify the caller to retry.
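
For reference, a rough side-by-side of the two approaches (illustrative stand-ins only, not the actual patches):

#include <errno.h>
#include <stdio.h>

struct obj { int exists; };

static int obj_create(struct obj *o)
{
        o->exists = 1;
        return 0;
}

/* This ticket's approach: recreate the missing object and carry on. */
static int missing_obj_recreate(struct obj *o)
{
        return o->exists ? 0 : obj_create(o);
}

/* LU-11765's approach: hand back -EAGAIN so the caller retries later. */
static int missing_obj_retry(struct obj *o)
{
        return o->exists ? 0 : -EAGAIN;
}

int main(void)
{
        struct obj a = { 0 }, b = { 0 };

        printf("recreate rc = %d\n", missing_obj_recreate(&a));
        printf("retry    rc = %d\n", missing_obj_retry(&b));
        return 0;
}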

Comment by Hongchao Zhang [ 26/Mar/19 ]

Resolved as duplicate of LU-11765
