[LU-6200] Failover recovery-mds-scale test_failover_ost: test_failover_ost returned 1 Created: 03/Feb/15 Updated: 26/Mar/19 Resolved: 26/Mar/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0, Lustre 2.8.0, Lustre 2.10.0, Lustre 2.11.0, Lustre 2.10.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Hongchao Zhang |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | p4hc | ||
| Environment: |
client and server: lustre-master build # 2835 RHEL6 |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 17329 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/be3ebe76-a817-11e4-93dd-5254006e85c2.

The sub-test test_failover_ost failed with the following error:

test_failover_ost returned 1

Client 3 shows:

tar: etc/sysconfig/quota_nld: Cannot write: No such file or directory
tar: etc/sysconfig/quota_nld: Cannot utime: No such file or directory
tar: etc/sysconfig/sandbox: Cannot write: No such file or directory
tar: etc/sysconfig/nfs: Cannot write: No such file or directory
tar: Exiting with failure status due to previous errors |
| Comments |
| Comment by Jian Yu [ 03/Feb/15 ] |
|
Hi Hongchao, Is this similar to |
| Comment by Andreas Dilger [ 03/Feb/15 ] |
|
Hongchao, what is the severity of this bug? Is it something that will break normal failover for users or is it only affecting testing? |
| Comment by Andreas Dilger [ 03/Feb/15 ] |
|
Is this a new regression from a recently landed patch? |
| Comment by Hongchao Zhang [ 04/Feb/15 ] |
|
Hi Andreas,

11:14:39:LustreError: 2377:0:(client.c:2809:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff8800448fd980 x1491531070413056/t30064787885(30064787885) o2->lustre-OST0000-osc-ffff880037b57c00@10.2.4.145@tcp:28/4 lens 440/400 e 0 to 0 dl 1422443475 ref 2 fl Interpret:R/4/0 rc -2/-2
11:14:39:LustreError: 2377:0:(client.c:2809:ptlrpc_replay_interpret()) Skipped 12 previous similar messages
11:14:39:Lustre: lustre-OST0000-osc-ffff880037b57c00: Connection restored to lustre-OST0000 (at 10.2.4.145@tcp)
11:14:39:Lustre: Skipped 2 previous similar messages
11:14:39:Lustre: 2377:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1422443458/real 1422443458] req@ffff88004303c680 x1491531070428380/t0(0) o8->lustre-OST0001-osc-ffff880037b57c00@10.2.4.141@tcp:28/4 lens 400/544 e 0 to 1 dl 1422443483 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
11:14:39:Lustre: 2377:0:(client.c:1942:ptlrpc_expire_one_request()) Skipped 26 previous similar messages
11:14:40:LustreError: 2377:0:(client.c:2809:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff88004293c680 x1491531070413096/t30064787867(30064787867) o2->lustre-OST0001-osc-ffff880037b57c00@10.2.4.145@tcp:28/4 lens 440/400 e 0 to 0 dl 1422443544 ref 2 fl Interpret:R/4/0 rc -2/-2
11:14:40:LustreError: 2377:0:(client.c:2809:ptlrpc_replay_interpret()) Skipped 7 previous similar messages
11:14:40:Lustre: lustre-OST0001-osc-ffff880037b57c00: Connection restored to lustre-OST0001 (at 10.2.4.145@tcp)
11:14:40:Lustre: 2377:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1422443593/real 1422443593] req@ffff880043240080 x1491531070431228/t0(0) o8->lustre-OST0005-osc-ffff880037b57c00@10.2.4.141@tcp:28/4 lens 400/544 e 0 to 1 dl 1422443619 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
11:14:40:Lustre: 2377:0:(client.c:1942:ptlrpc_expire_one_request()) Skipped 46 previous similar messages
11:14:40:LustreError: 2377:0:(client.c:2809:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff880042d84c80 x1491531070413176/t30064787863(30064787863) o2->lustre-OST0003-osc-ffff880037b57c00@10.2.4.145@tcp:28/4 lens 440/400 e 0 to 0 dl 1422443660 ref 2 fl Interpret:R/4/0 rc -2/-2
11:14:40:LustreError: 2377:0:(client.c:2809:ptlrpc_replay_interpret()) Skipped 17 previous similar messages
11:14:40:Lustre: lustre-OST0003-osc-ffff880037b57c00: Connection restored to lustre-OST0003 (at 10.2.4.145@tcp)
11:14:40:Lustre: Skipped 1 previous similar message
11:14:40:LustreError: 2377:0:(client.c:2809:ptlrpc_replay_interpret()) @@@ status -2, old was 0 req@ffff880042dbec80 x1491531070412976/t30064787924(30064787924) o2->lustre-OST0005-osc-ffff880037b57c00@10.2.4.145@tcp:28/4 lens 440/400 e 0 to 0 dl 1422443725 ref 2 fl Interpret:R/4/0 rc -2/-2
11:31:06:LustreError: 2377:0:(client.c:2809:ptlrpc_replay_interpret()) Skipped 16 previous similar messages
11:31:06:Lustre: lustre-OST0005-osc-ffff880037b57c00: Connection restored to lustre-OST0005 (at 10.2.4.145@tcp)

How about recreating those missing objects in "ofd_setattr_hdl", just like "ofd_preprw_write" does? |
| Comment by Gerrit Updater [ 06/Feb/15 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/13668 |
| Comment by Hongchao Zhang [ 27/Feb/15 ] |
|
the patch for this ticket has been merged with the patch for https://testing.hpdd.intel.com/test_sets/09d2e80a-b798-11e4-9d63-5254006e85c2 |
| Comment by Saurabh Tandan (Inactive) [ 10/Dec/15 ] |
|
Hit again on master, build #3264, tag 2.7.64. |
| Comment by Sarah Liu [ 15/Dec/15 ] |
|
Hit this issue on every hard failover config (all 6). Per the discussion in the triage call, with the separated patch the test would still hit this issue. |
| Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ] |
|
Another instance found for hard failover: EL6.7 Server/Client |
| Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ] |
|
Another instance found for hard failover: EL6.7 Server/SLES11 SP3 Clients |
| Comment by Saurabh Tandan (Inactive) [ 09/Feb/16 ] |
|
Another instance found for hard failover: EL6.7 Server/Client, tag 2.7.66, master build 3314
Another instance found for hard failover: EL6.7 Server/Client - ZFS, tag 2.7.66, master build 3314
Another instance found for hard failover: EL7 Server/Client, tag 2.7.66, master build 3314
Another instance found for hard failover: EL7 Server/SLES11 SP3 Client, tag 2.7.66, master build 3316
Another instance found for hard failover: EL7 Server/Client - ZFS, tag 2.7.66, master build 3314 |
| Comment by Saurabh Tandan (Inactive) [ 24/Feb/16 ] |
|
Another instance found on b2_8 for failover testing, build #6. |
| Comment by Hongchao Zhang [ 21/Feb/17 ] |
|
the patch https://review.whamcloud.com/#/c/13668/ has been updated. |
| Comment by Alexey Lyashkov [ 22/May/17 ] |
|
Hongchao, I don't have access to Gerrit right now, but your patch looks problematic. You can't simply use
+ /* Do sync create if the seq is about to be used up */
because the OST object id in this case needs to account for the lower 16 bits of the seq; please look at the OST id macros. |
| Comment by Hongchao Zhang [ 25/May/17 ] |
|
the patch has been updated as per the review feedback. |
| Comment by Hongchao Zhang [ 23/Jun/17 ] |
|
the patch https://review.whamcloud.com/#/c/13668/ has been updated. |
| Comment by James Casper [ 26/Sep/17 ] |
|
2.11.0: |
| Comment by Hongchao Zhang [ 11/Oct/17 ] |
|
the patch https://review.whamcloud.com/#/c/13668/ has been updated. |
| Comment by Sergey Cheremencev [ 18/Mar/19 ] |
|
Hi, the description of the problem looks similar to one I've already fixed in https://review.whamcloud.com/#/c/33836/. |
| Comment by Hongchao Zhang [ 18/Mar/19 ] |
|
Hi Sergey, Thanks! |
| Comment by Hongchao Zhang [ 26/Mar/19 ] |
|
Resolved as duplicate of |