[LU-5115] replay-single test_73b: @@@@@@ FAIL: post-failover df: 1 Created: 28/May/14 Updated: 07/Sep/16 Resolved: 07/Sep/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Niu Yawei (Inactive) | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 14102 |
| Description |
|
https://maloo.whamcloud.com/test_sets/82faf914-e65d-11e3-9a4f-52540035b04c |
| Comments |
| Comment by Andreas Dilger [ 28/May/14 ] |
|
In the future, please use "Raise bug" from the Maloo page so that this failure gets triaged automatically. |
| Comment by Niu Yawei (Inactive) [ 29/May/14 ] |
Ok, thanks. This looks like a test script problem:
shadow-40vm5: CMD: shadow-40vm5.shadow.whamcloud.com lctl get_param -n at_max
shadow-40vm6: CMD: shadow-40vm6.shadow.whamcloud.com lctl get_param -n at_max
shadow-40vm5: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
shadow-40vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 76 sec
shadow-40vm5: stat: cannot read file system information for `/mnt/lustre': Input/output error
replay-single test_73b: @@@@@@ FAIL: post-failover df: 1
wait_clients_import_state() reported that the shadow-40vm5 client recovered in 0 seconds, which looks suspicious to me. Looking at the log for this client:
00000001:02000400:1.0:1401256333.271532:0:23799:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
00000001:02000400:0.0:1401256333.271547:0:23800:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
00000001:02000400:1.0:1401256333.422776:0:23883:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
00000001:02000400:0.0:1401256333.422803:0:23882:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
The test checked the vm5 MDC state and found it FULL. Then:
00000001:02000400:1.0:1401256409.934172:0:23908:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 76 sec
00000001:02000400:1.0:1401256410.078821:0:23932:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 76 sec
00000080:00200000:0.0:1401256411.194194:0:23955:0:(llite_lib.c:1657:ll_statfs()) VFS Op: at 4299377545 jiffies
00000080:00000004:0.0:1401256411.194198:0:23955:0:(obd_class.h:1294:obd_statfs()) osfs 4299275701, max_age 4299376545
00800000:00000004:0.0:1401256411.194203:0:23955:0:(obd_class.h:1294:obd_statfs()) osfs 4299275699, max_age 4299376545
00000100:00080000:0.0:1401256411.194224:0:23955:0:(client.c:1403:ptlrpc_send_new_req()) @@@ req from PID 23955 waiting for recovery: (FULL != REPLAY) req@ffff88006e57f400 x1469319256038592/t0(0) o41->lustre-MDT0000-mdc-ffff88007d721000@10.1.5.230@tcp:12/10 lens 224/368 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
vm6's MDC state became FULL and statfs was issued; however, vm5 was actually in REPLAY state, which is why the statfs on vm5 failed.
00000100:00080000:1.0:1401256419.512383:0:23956:0:(import.c:1283:ptlrpc_invalidate_import_thread()) ffff88007dac8000 lustre-MDT0000_UUID: changing import state from EVICTED to RECOVER
00000100:00080000:1.0:1401256419.512384:0:23956:0:(import.c:1394:ptlrpc_import_recovery_state_machine()) reconnected to lustre-MDT0000_UUID@10.1.5.230@tcp
00000100:00080000:1.0:1401256419.512386:0:23956:0:(import.c:1399:ptlrpc_import_recovery_state_machine()) ffff88007dac8000 lustre-MDT0000_UUID: changing import state from RECOVER to FULL
00000100:02000000:1.0:1401256419.512389:0:23956:0:(import.c:1407:ptlrpc_import_recovery_state_machine()) lustre-MDT0000-mdc-ffff88007d721000: Connection restored to lustre-MDT0000 (at 10.1.5.230@tcp)
00000080:00020000:0.0:1401256419.513378:0:23955:0:(llite_lib.c:1611:ll_statfs_internal()) md_statfs fails: rc = -5
Finally, vm5 recovered, but the statfs had already failed. Look at the following script:
fail() {
local facets=$1
local clients=${CLIENTS:-$HOSTNAME}
facet_failover $* || error "failover: $?"
wait_clients_import_state "$clients" "$facets" FULL
clients_up || error "post-failover df: $?"
}
It's quite possible that when wait_clients_import_state() is called, some client's MDC import is still in the FULL state from before the failover. I'm not sure what the better way to improve the script is; maybe wait a longer time before calling wait_clients_import_state()? |
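To make the race concrete, here is a minimal bash sketch of a per-client state poll like the one wait_clients_import_state() performs. It is not the test-framework code: the helper name wait_mdc_state and the use of the "state:" field of mdc.*.import (instead of mds_server_uuid) are assumptions made to keep the sketch self-contained.
# Hypothetical helper, for illustration only: poll a client's MDC import
# state until it reports the expected value or the timeout expires.
wait_mdc_state() {
    local client=$1 expected=$2 maxtime=${3:-300}
    local elapsed=0 state
    while (( elapsed < maxtime )); do
        state=$(ssh "$client" "lctl get_param -n mdc.*.import" 2>/dev/null |
                awk '/state:/ { print $2; exit }')
        # Race window: right after facet_failover(), a client that has not yet
        # noticed the server restart still reports its pre-failover state,
        # i.e. FULL, so the loop exits after 0 seconds even though the client
        # is about to drop into recovery (DISCONN/REPLAY).
        [ "$state" = "$expected" ] &&
            { echo "$client: mdc state $state after ${elapsed} sec"; return 0; }
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
That window is exactly what produced the suspicious "FULL state after 0 sec" report for vm5: the FULL it saw was the stale pre-failover state, and vm5 only entered recovery afterwards.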
| Comment by Jian Yu [ 17/Aug/16 ] |
|
More failure instances on the master branch: |
| Comment by Saurabh Tandan (Inactive) [ 06/Sep/16 ] |
|
Is it a duplicate of LU-6670? |
| Comment by Niu Yawei (Inactive) [ 07/Sep/16 ] |
|
Yes, it's quite possible they are related. |
| Comment by Hongchao Zhang [ 07/Sep/16 ] |
|
in "fail()", the import state of all clients will be checked and waited to become the specified state, and if some client MDC state is in FULL, it should not affect |
| Comment by Niu Yawei (Inactive) [ 07/Sep/16 ] |
|
Indeed, I think you are right. |
| Comment by Niu Yawei (Inactive) [ 07/Sep/16 ] |
|
Dup of LU-6670 |