In the future, please use "Raise bug" from the Maloo page so that this failure will be triaged automatically.
Ok, thanks.
This looks like a test script problem:
shadow-40vm5: CMD: shadow-40vm5.shadow.whamcloud.com lctl get_param -n at_max
shadow-40vm6: CMD: shadow-40vm6.shadow.whamcloud.com lctl get_param -n at_max
shadow-40vm5: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
shadow-40vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 76 sec
shadow-40vm5: stat: cannot read file system information for `/mnt/lustre': Input/output error
replay-single test_73b: @@@@@@ FAIL: post-failover df: 1
wait_clients_import_state() reported that the shadow-40vm5 client recovered in 0 seconds? That looks suspicious to me. Look at the log for this client:
00000001:02000400:1.0:1401256333.271532:0:23799:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
00000001:02000400:0.0:1401256333.271547:0:23800:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
00000001:02000400:1.0:1401256333.422776:0:23883:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
00000001:02000400:0.0:1401256333.422803:0:23882:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
So the check found the vm5 MDC state already FULL.
00000001:02000400:1.0:1401256409.934172:0:23908:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 76 sec
00000001:02000400:1.0:1401256410.078821:0:23932:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 76 sec
00000080:00200000:0.0:1401256411.194194:0:23955:0:(llite_lib.c:1657:ll_statfs()) VFS Op: at 4299377545 jiffies
00000080:00000004:0.0:1401256411.194198:0:23955:0:(obd_class.h:1294:obd_statfs()) osfs 4299275701, max_age 4299376545
00800000:00000004:0.0:1401256411.194203:0:23955:0:(obd_class.h:1294:obd_statfs()) osfs 4299275699, max_age 4299376545
00000100:00080000:0.0:1401256411.194224:0:23955:0:(client.c:1403:ptlrpc_send_new_req()) @@@ req from PID 23955 waiting for recovery: (FULL != REPLAY) req@ffff88006e57f400 x1469319256038592/t0(0) o41->lustre-MDT0000-mdc-ffff88007d721000@10.1.5.230@tcp:12/10 lens 224/368 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
The vm6 MDC state became FULL and the statfs was issued; however, vm5 was actually still in REPLAY state, which is why the statfs on vm5 failed.
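For what it's worth, the real import state can be checked directly on the client at that point (the first parameter is the same one the test uses; the second is from memory and its output format may vary between Lustre versions):

# On the affected client, query the MDC import state for MDT0000.
# mds_server_uuid prints the server UUID followed by the import state,
# e.g. "lustre-MDT0000_UUID FULL"; the state parameter (if present in
# this version) shows the current state plus the recent state history.
lctl get_param -n mdc.lustre-MDT0000-mdc-*.mds_server_uuid
lctl get_param -n mdc.lustre-MDT0000-mdc-*.state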
00000100:00080000:1.0:1401256419.512383:0:23956:0:(import.c:1283:ptlrpc_invalidate_import_thread()) ffff88007dac8000 lustre-MDT0000_UUID: changing import state from EVICTED to RECOVER
00000100:00080000:1.0:1401256419.512384:0:23956:0:(import.c:1394:ptlrpc_import_recovery_state_machine()) reconnected to lustre-MDT0000_UUID@10.1.5.230@tcp
00000100:00080000:1.0:1401256419.512386:0:23956:0:(import.c:1399:ptlrpc_import_recovery_state_machine()) ffff88007dac8000 lustre-MDT0000_UUID: changing import state from RECOVER to FULL
00000100:02000000:1.0:1401256419.512389:0:23956:0:(import.c:1407:ptlrpc_import_recovery_state_machine()) lustre-MDT0000-mdc-ffff88007d721000: Connection restored to lustre-MDT0000 (at 10.1.5.230@tcp)
00000080:00020000:0.0:1401256419.513378:0:23955:0:(llite_lib.c:1611:ll_statfs_internal()) md_statfs fails: rc = -5
Finally, vm5 recovered, but by then the statfs had already failed.
Look at the following script:
fail() {
    local facets=$1
    local clients=${CLIENTS:-$HOSTNAME}

    facet_failover $* || error "failover: $?"
    wait_clients_import_state "$clients" "$facets" FULL
    clients_up || error "post-failover df: $?"
}
It's quite possible that when wait_clients_import_state() is called, some clients' MDC import state is still FULL from before the failover, so the wait returns immediately even though recovery has not actually happened yet. I'm not sure what the better way to improve the script is; maybe wait longer before calling wait_clients_import_state()?
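One possible direction, purely as a sketch (wait_clients_left_full is a hypothetical helper, not an existing test-framework function; the 60-second cap is an arbitrary assumption): first wait for the clients to leave FULL, i.e. to actually notice the failover, and only then wait for them to come back to FULL.

# Hypothetical helper: wait until no client still reports FULL for its
# MDC import, i.e. every client has noticed the failover.
wait_clients_left_full() {
    local clients=$1
    local deadline=$((SECONDS + 60))    # assumed 60s cap, tune as needed

    while [ $SECONDS -lt $deadline ]; do
        # mds_server_uuid reports the import state, e.g. "..._UUID FULL"
        do_nodes $clients "lctl get_param -n mdc.*.mds_server_uuid" |
            grep -q FULL || return 0    # no client in FULL any more
        sleep 1
    done
    return 1
}

# fail() would then call it between the failover and the FULL wait:
#     facet_failover $* || error "failover: $?"
#     wait_clients_left_full "$clients"
#     wait_clients_import_state "$clients" "$facets" FULL

Whether waiting for the state to drop is really better than just sleeping longer, I'm not sure; a fixed sleep would only hide the race.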
Dup of LU-6670