[LU-5115] replay-single test_73b: @@@@@@ FAIL: post-failover df: 1 Created: 28/May/14  Updated: 07/Sep/16  Resolved: 07/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Niu Yawei (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-6670 Hard Failover recovery-small test_28:... Open
Severity: 3
Rank (Obsolete): 14102

 Description   

https://maloo.whamcloud.com/test_sets/82faf914-e65d-11e3-9a4f-52540035b04c



 Comments   
Comment by Andreas Dilger [ 28/May/14 ]

In the future, please use "Raise bug" from the Maloo page so that this failure will be triaged automatically.

Comment by Niu Yawei (Inactive) [ 29/May/14 ]

In the future, please use "Raise bug" from the Maloo page so that this failure will be triaged automatically.

Ok, thanks.

This looks like a test script problem:

shadow-40vm5: CMD: shadow-40vm5.shadow.whamcloud.com lctl get_param -n at_max
shadow-40vm6: CMD: shadow-40vm6.shadow.whamcloud.com lctl get_param -n at_max
shadow-40vm5: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
shadow-40vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 76 sec
shadow-40vm5: stat: cannot read file system information for `/mnt/lustre': Input/output error
 replay-single test_73b: @@@@@@ FAIL: post-failover df: 1

wait_clients_import_state() reported that the shadow-40vm5 client recovered in 0 seconds? That looks suspicious to me; look at the log for this client:

00000001:02000400:1.0:1401256333.271532:0:23799:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
00000001:02000400:0.0:1401256333.271547:0:23800:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
00000001:02000400:1.0:1401256333.422776:0:23883:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
00000001:02000400:0.0:1401256333.422803:0:23882:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec

Checking the vm5 MDC state found it was already FULL.
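
That 0-sec result makes sense once you look at how the wait works. Below is a simplified sketch of the polling loop (hypothetical helper name; it mirrors what I understand the wait helper behind wait_clients_import_state() to do, and assumes the *_server_uuid param prints "<uuid> <state>" — not a verbatim copy of the test-framework code):

wait_import_full() {
        local param=$1          # e.g. mdc.lustre-MDT0000-mdc-*.mds_server_uuid
        local max=${2:-90}      # seconds to wait before giving up
        local i=0
        local state

        # Assumed output format: "<target uuid> <import state>"
        state=$(lctl get_param -n $param 2>/dev/null | awk '{print $2}')
        # The loop exits as soon as the state matches, so a client whose
        # import never left FULL "recovers" after 0 sec.
        while [ "$state" != "FULL" ]; do
                [ $i -ge $max ] && return 1
                sleep 1
                i=$((i + 1))
                state=$(lctl get_param -n $param 2>/dev/null | awk '{print $2}')
        done
        echo "$param in FULL state after $i sec"
}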

00000001:02000400:1.0:1401256409.934172:0:23908:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 76 sec
00000001:02000400:1.0:1401256410.078821:0:23932:0:(debug.c:445:libcfs_debug_mark_buffer()) DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 76 sec
00000080:00200000:0.0:1401256411.194194:0:23955:0:(llite_lib.c:1657:ll_statfs()) VFS Op: at 4299377545 jiffies
00000080:00000004:0.0:1401256411.194198:0:23955:0:(obd_class.h:1294:obd_statfs()) osfs 4299275701, max_age 4299376545
00800000:00000004:0.0:1401256411.194203:0:23955:0:(obd_class.h:1294:obd_statfs()) osfs 4299275699, max_age 4299376545
00000100:00080000:0.0:1401256411.194224:0:23955:0:(client.c:1403:ptlrpc_send_new_req()) @@@ req from PID 23955 waiting for recovery: (FULL != REPLAY)  req@ffff88006e57f400 x1469319256038592/t0(0) o41->lustre-MDT0000-mdc-ffff88007d721000@10.1.5.230@tcp:12/10 lens 224/368 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1

The vm6 MDC state became FULL and statfs was issued; however, vm5 was actually in REPLAY state by then, which is why the statfs on vm5 failed.

00000100:00080000:1.0:1401256419.512383:0:23956:0:(import.c:1283:ptlrpc_invalidate_import_thread()) ffff88007dac8000 lustre-MDT0000_UUID: changing import state from EVICTED to RECOVER
00000100:00080000:1.0:1401256419.512384:0:23956:0:(import.c:1394:ptlrpc_import_recovery_state_machine()) reconnected to lustre-MDT0000_UUID@10.1.5.230@tcp
00000100:00080000:1.0:1401256419.512386:0:23956:0:(import.c:1399:ptlrpc_import_recovery_state_machine()) ffff88007dac8000 lustre-MDT0000_UUID: changing import state from RECOVER to FULL
00000100:02000000:1.0:1401256419.512389:0:23956:0:(import.c:1407:ptlrpc_import_recovery_state_machine()) lustre-MDT0000-mdc-ffff88007d721000: Connection restored to lustre-MDT0000 (at 10.1.5.230@tcp)
00000080:00020000:0.0:1401256419.513378:0:23955:0:(llite_lib.c:1611:ll_statfs_internal()) md_statfs fails: rc = -5

Finally, vm5 recovered; however, the statfs had already failed.
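
For reference, the import state can also be inspected by hand on the client while the failover is in flight (hypothetical invocation; the param path follows the markers above, and the import file is assumed to contain a "state:" line):

lctl get_param mdc.lustre-MDT0000-mdc-*.import | grep state:
#     state: REPLAY    <- a statfs issued here fails with -EIO (-5)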

Look at the following script:

fail() {
        local facets=$1
        local clients=${CLIENTS:-$HOSTNAME}

        # Fail over the facet(s), wait for every client's MDC import to
        # report FULL, then verify the mounts with df.
        facet_failover $* || error "failover: $?"
        wait_clients_import_state "$clients" "$facets" FULL
        clients_up || error "post-failover df: $?"
}

It's quite possible that when wait_clients_import_state() is called, some clients' MDC state is still FULL from before the failover, so the check passes immediately. I'm not sure what the better way to improve the script is; maybe wait longer before calling wait_clients_import_state()?
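
One possible shape for that is to first wait for the imports to drop out of FULL before waiting for FULL again, so the check cannot match the stale pre-failover state. A minimal sketch (untested; it assumes wait_clients_import_state() accepts any import state name, and it can still miss a client that races past DISCONN between polls):

fail() {
        local facets=$1
        local clients=${CLIENTS:-$HOSTNAME}

        facet_failover $* || error "failover: $?"
        # Don't trust a FULL state that may predate the failover: wait
        # for the imports to leave FULL first, then for recovery to end.
        wait_clients_import_state "$clients" "$facets" DISCONN
        wait_clients_import_state "$clients" "$facets" FULL
        clients_up || error "post-failover df: $?"
}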

Comment by Jian Yu [ 17/Aug/16 ]

More failure instances on master branch:
https://testing.hpdd.intel.com/test_sets/f29243fe-627f-11e6-b5b1-5254006e85c2
https://testing.hpdd.intel.com/test_sets/38583f22-63ea-11e6-b5b1-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 06/Sep/16 ]

Is it a duplicate of LU-6670?

Comment by Niu Yawei (Inactive) [ 07/Sep/16 ]

Yes, it's quite possible they are related.

Comment by Hongchao Zhang [ 07/Sep/16 ]

in "fail()", the import state of all clients will be checked and waited to become the specified state, and if some client MDC state is in FULL, it should not affect
the check of other client's MDC import, Will "pdsh" or other similar tools wait all the nodes to finish the execution of the command?

Comment by Niu Yawei (Inactive) [ 07/Sep/16 ]

Indeed, I think you are right.
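
As far as I know, pdsh does wait: by default it only exits after the command has completed on every target node (the fanout setting limits how many run concurrently, not whether it waits). A hypothetical check against the two clients above:

# Returns only after both nodes have printed their import state:
pdsh -w shadow-40vm[5-6] 'lctl get_param -n mdc.*.mds_server_uuid'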

Comment by Niu Yawei (Inactive) [ 07/Sep/16 ]

Dup of LU-6670
