[LU-5785] recovery-mds-scale test_failover_ost: test_failover_ost returned 1 Created: 22/Oct/14 Updated: 27/Apr/15 Resolved: 27/Apr/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Mikhail Pershin |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
client and server: lustre-master build # 2690 |
||
| Severity: | 3 |
| Rank (Obsolete): | 16237 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/548b249c-565a-11e4-97f1-5254006e85c2.

The sub-test test_failover_ost failed with the following error:

test_failover_ost returned 1

Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost5 has failed over 1 times, and counting...
Lustre: DEBUG MARKER: ost5 has failed over 1 times, and counting...
Lustre: lustre-OST0006: Will be in recovery for at least 1:00, or until 3 clients reconnect
LustreError: 360:0:(ldlm_lib.c:1730:check_for_next_transno()) lustre-OST0006: waking for gap in transno, VBR is OFF (skip: 4295345303, ql: 2, comp: 1, conn: 3, next: 4295345312, last_committed: 4295345240)
Lustre: lustre-OST0006: Recovery over after 0:22, of 3 clients 3 recovered and 0 were evicted.
LustreError: 30883:0:(ldlm_lockd.c:1288:ldlm_handle_enqueue0()) ### delayed lvb init failed (rc -2) ns: filter-lustre-OST0006_UUID lock: ffff88006744d080/0x8abfa2edda4a4ccc lrc: 2/0,0 mode: --/PR res: [0xde56:0x0:0x0].0 rrc: 1 type: EXT [0->0] (req 0->0) flags: 0x40000000000000 nid: local remote: 0x7a6f60280c307e9a expref: -99 pid: 30883 timeout: 0 lvb_type: 0
LustreError: 30888:0:(ofd_io.c:600:ofd_preprw_write()) lustre-OST0006: BRW to missing obj 0x0:56912
LustreError: 30888:0:(ofd_io.c:600:ofd_preprw_write()) Skipped 1 previous similar message
Lustre: DEBUG MARKER: /usr/sbin/lctl mark Duration: 86400 |
| Comments |
| Comment by Jodi Levi (Inactive) [ 24/Oct/14 ] |
|
Mike, |
| Comment by Andreas Dilger [ 24/Oct/14 ] |
|
The console logs report:

23:57:55:Lustre: lustre-OST0006: deleting orphan objects from 0x0:56878 to 0x0:56897
00:19:16:Lustre: lustre-OST0006: deleting orphan objects from 0x0:57116 to 0x0:57142

so it doesn't look like orphan cleanup caused the object to be removed. Maybe there is something in the debug logs? I wonder if it makes sense to dump the debug log to shared storage just before the reboot to try to capture more useful information for debugging?

It looks like this test is run several times a day, but almost all of the runs are failing for one reason or another. It makes sense to investigate and fix some of these failures. |
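The range check behind the comment above can be sketched in a few lines of Python. This is only an illustration, not part of the Lustre test framework: the function names `orphan_ranges` and `removed_by_orphan_cleanup` are hypothetical, the two console lines are copied from the comment, and the sketch assumes the object IDs in these messages are decimal and that the reported ranges are inclusive at both ends.

```python
import re

# Console lines quoted in the comment above (copied verbatim).
CONSOLE_LOG = """\
23:57:55:Lustre: lustre-OST0006: deleting orphan objects from 0x0:56878 to 0x0:56897
00:19:16:Lustre: lustre-OST0006: deleting orphan objects from 0x0:57116 to 0x0:57142
"""

# Matches "<seq>:<oid>" pairs; seq is hex, oid is assumed to be decimal.
ORPHAN_RE = re.compile(
    r"deleting orphan objects from (0x[0-9a-fA-F]+):(\d+) to (0x[0-9a-fA-F]+):(\d+)"
)


def orphan_ranges(log_text):
    """Return a list of (seq, first_oid, last_oid) tuples found in the log."""
    ranges = []
    for match in ORPHAN_RE.finditer(log_text):
        seq_from, first, seq_to, last = match.groups()
        # Only keep ranges whose start and end are in the same sequence.
        if seq_from == seq_to:
            ranges.append((int(seq_from, 16), int(first), int(last)))
    return ranges


def removed_by_orphan_cleanup(seq, oid, ranges):
    """True if (seq, oid) falls inside any range (assumed inclusive)."""
    return any(s == seq and first <= oid <= last for s, first, last in ranges)


ranges = orphan_ranges(CONSOLE_LOG)
# Object named in "BRW to missing obj 0x0:56912" from the description.
print(removed_by_orphan_cleanup(0, 56912, ranges))
```

Under those assumptions, object 0x0:56912 sits between the two reported ranges, which matches the observation that orphan cleanup does not appear to have removed it.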
| Comment by Jian Yu [ 22/Nov/14 ] |
|
On Client 3 (onyx-45vm6):

tar: etc/sysconfig/network/ifroute-lo: Cannot write: No such file or directory
tar: etc/sysconfig/network/routes: Cannot write: No such file or directory
tar: etc/sysconfig/network/if-down.d/ndp-proxy: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors

It looks like this is a duplicate of

Hi Hongchao, |