[LU-5785] recovery-mds-scale test_failover_ost: test_failover_ost returned 1
Created: 22/Oct/14  Updated: 27/Apr/15  Resolved: 27/Apr/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Mikhail Pershin
Resolution: Duplicate
Votes: 0
Labels: None
Environment:

client and server: lustre-master build # 2690
client is SLES11 SP3


Issue Links:
Duplicate
duplicates LU-4621 recovery-mds-scale: test_failover_ost Resolved
is duplicated by LU-5485 first mount always fail with avoid_as... Resolved
Related
is related to LU-5782 recovery-mds-scale test_failover_ost: Resolved
Severity: 3
Rank (Obsolete): 16237

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/548b249c-565a-11e4-97f1-5254006e85c2.

The sub-test test_failover_ost failed with the following error:

test_failover_ost returned 1
Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost5 has failed over 1 times, and counting...
Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting...
Lustre: lustre-OST0006: Will be in recovery for at least 1:00, or until 3 clients reconnect
LustreError: 360:0:(ldlm_lib.c:1730:check_for_next_transno()) lustre-OST0006: waking for gap in transno, VBR is OFF (skip: 4295345303, ql: 2, comp: 1, conn: 3, next: 4295345312, last_committed: 4295345240)
Lustre: lustre-OST0006: Recovery over after 0:22, of 3 clients 3 recovered and 0 were evicted.
LustreError: 30883:0:(ldlm_lockd.c:1288:ldlm_handle_enqueue0()) ### delayed lvb init failed (rc -2) ns: filter-lustre-OST0006_UUID lock: ffff88006744d080/0x8abfa2edda4a4ccc lrc: 2/0,0 mode: --/PR res: [0xde56:0x0:0x0].0 rrc: 1 type: EXT [0->0] (req 0->0) flags: 0x40000000000000 nid: local remote: 0x7a6f60280c307e9a expref: -99 pid: 30883 timeout: 0 lvb_type: 0
LustreError: 30888:0:(ofd_io.c:600:ofd_preprw_write()) lustre-OST0006: BRW to missing obj 0x0:56912
LustreError: 30888:0:(ofd_io.c:600:ofd_preprw_write()) Skipped 1 previous similar message
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Duration:               86400


 Comments   
Comment by Jodi Levi (Inactive) [ 24/Oct/14 ]

Mike,
Could you please have a look at this one and comment?
Thank you!

Comment by Andreas Dilger [ 24/Oct/14 ]

The console logs report:

23:57:55:Lustre: lustre-OST0006: deleting orphan objects from 0x0:56878 to 0x0:56897
00:19:16:Lustre: lustre-OST0006: deleting orphan objects from 0x0:57116 to 0x0:57142

so it doesn't look like orphan cleanup caused the object to be removed. Maybe there is something in the debug logs?
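
The range check behind this observation can be sketched as below. This is only an illustration, not part of the test framework: the helper names (`orphan_ranges`, `covered_by_cleanup`) are hypothetical, and it assumes the object ids in the console messages are decimal and that the reported ranges are inclusive. The missing object id is taken from the "BRW to missing obj 0x0:56912" error in the description.

```python
import re

# Console lines quoted in the comment above (timestamps kept as-is).
CONSOLE_LOG = """\
23:57:55:Lustre: lustre-OST0006: deleting orphan objects from 0x0:56878 to 0x0:56897
00:19:16:Lustre: lustre-OST0006: deleting orphan objects from 0x0:57116 to 0x0:57142
"""

# Object id from the "BRW to missing obj 0x0:56912" error in the description.
MISSING_OBJECT_ID = 56912

# Assumption: the part after the colon is a decimal object id.
ORPHAN_RE = re.compile(
    r"deleting orphan objects from 0x[0-9a-f]+:(\d+) to 0x[0-9a-f]+:(\d+)"
)


def orphan_ranges(log_text):
    """Return (first, last) object-id pairs found in orphan cleanup messages."""
    return [(int(m.group(1)), int(m.group(2))) for m in ORPHAN_RE.finditer(log_text)]


def covered_by_cleanup(object_id, ranges):
    """True if object_id falls inside any range (bounds assumed inclusive)."""
    return any(first <= object_id <= last for first, last in ranges)


ranges = orphan_ranges(CONSOLE_LOG)
print(ranges)                                         # [(56878, 56897), (57116, 57142)]
print(covered_by_cleanup(MISSING_OBJECT_ID, ranges))  # False
```

Object 56912 lies between the two reported ranges, which is consistent with the conclusion that these orphan cleanup passes did not remove it.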

I wonder if it makes sense to dump the debug log to shared storage just before the reboot to try and capture more useful information for debugging?

It looks like this test is run several times a day, but almost all of them are failing for one reason or another. It makes sense to investigate and fix some of these failures.

Comment by Jian Yu [ 22/Nov/14 ]

On Client 3 (onyx-45vm6):

    tar: etc/sysconfig/network/ifroute-lo: Cannot write: No such file or directory
    tar: etc/sysconfig/network/routes: Cannot write: No such file or directory
    tar: etc/sysconfig/network/if-down.d/ndp-proxy: Cannot stat: No such file or directory
    tar: Exiting with failure status due to previous errors

It looks like this is a duplicate of LU-4621.

Hi Hongchao,
Could you please confirm this?
